Word occurrence counting (DNA) [Matlab]


From: us on 2 Apr 2010 11:54

"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp53f7$97n$1(a)fred.mathworks.com>...
> no not yet

....and the example(?)...

us

From: ambrosia nightwish on 2 Apr 2010 12:00

There is an example in the FCGR toolbox of Jesús Mena-Chalco.

From: ambrosia nightwish on 2 Apr 2010 12:10

If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so the frequency of appearance is freq=1/12; at the second position we read AC and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).

From: Roger Stafford on 2 Apr 2010 14:06

"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp54tn$2ii$1(a)fred.mathworks.com>...
> If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so the frequency of appearance is freq=1/12; at the second position we read AC and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).
--------------
Here's an outline of how you might go about it. Let v be a vector of length N holding the nucleotide sequence (I am assuming in this discussion that the four nucleotides are represented by four numbers) and let k be the desired word length.

1) Create an (N-k+1)-by-k matrix, M, whose rows are the successive length-k words. You can use the 'hankel' function for this purpose.

2) Apply [B,m,n] = unique(M,'rows') to M. B will be a table of all the distinct words appearing in the sequence, in sorted order, and n will give, for each row of M, the index of the matching row of B.

3) Apply 'histc' to the vector n, with bin edges 1:size(B,1), to obtain the counts of the B words in the sequence.

4) From the counts you can obtain the frequencies.

Can you take it from there?
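
The four steps above can be sketched as follows. This is only one possible reading of the outline, not a definitive implementation: the coding A=1, C=2, G=3, T=4, the variable names (alphabet, idx, counts, freq, words) and the choice to divide by N (as in the question's 1/12 example, rather than by the number of words N-k+1) are all assumptions.

```matlab
% Sketch of the hankel/unique/histc outline (names are illustrative).
s        = 'AACCGTTAACGT';       % example sequence from the question
alphabet = 'ACGT';               % assumed coding: A=1, C=2, G=3, T=4
k        = 2;                    % word length

[~, v] = ismember(s, alphabet);  % numeric vector, one code per nucleotide
N      = numel(v);

% 1) (N-k+1)-by-k matrix of successive words, via a Hankel index matrix
idx = hankel(1:N-k+1, N-k+1:N);  % idx(i,j) = i+j-1
M   = reshape(v(idx), size(idx));

% 2) distinct words in sorted order; n maps each row of M to a row of B
[B, ~, n] = unique(M, 'rows');

% 3) number of occurrences of each distinct word
counts = histc(n(:), 1:size(B,1));

% 4) frequencies; dividing by N is an assumption taken from the question
freq = counts / N;

% translate the numeric words back to letters for display
words = cellstr(reshape(alphabet(B), size(B)));
disp([words, num2cell(counts), num2cell(freq)]);
```

The reshape calls only guard against MATLAB's vector-indexing orientation rules when k is 1 or when a single distinct word is found; for the usual case they change nothing.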

Roger Stafford

From: us on 2 Apr 2010 14:25

"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp54tn$2ii$1(a)fred.mathworks.com>...
> If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so the frequency of appearance is freq=1/12; at the second position we read AC and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).

one of the many solutions

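% the data
s='AACCGTTAACGT';
wl=2; % word length
% the engine
rpat=sprintf('\\S{%d,%d}',wl,wl); % exactly wl non-whitespace characters
t=cell(wl,1);
for i=1:wl
% regexp matches do not overlap, so scan once from each offset 1..wl
t{i,1}=regexp(s(i:end),rpat,'match').';
end
t=cat(1,t{:}); % all words in one column cell array
[tu,~,ix]=unique(t); % tu: distinct words (sorted), ix: index into tu for each word
n=histc(ix(:),1:numel(tu)); % number of occurrences of each distinct word
r=[tu,num2cell(n)];
% the result
disp(s);
disp(r);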
%{
% S =
AACCGTTAACGT
% R =
'AA' [2]
'AC' [2]
'CC' [1]
'CG' [2]
'GT' [2]
'TA' [1]
'TT' [1]
%}
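
The table above holds total counts. The question instead describes a running frequency, updated at each position, and that could be sketched roughly as below. This is an assumed reading of the question, not part of the posted solution: the names idx, w, seen, running and freq are made up for illustration, and the division by N follows the 1/12 and 2/12 values given in the question.

```matlab
% Sketch: running frequency of the word read at each position.
s  = 'AACCGTTAACGT';
wl = 2;                                 % word length
N  = numel(s);
nw = N - wl + 1;                        % last position where a full word fits

idx = bsxfun(@plus, (1:nw).', 0:wl-1);  % nw-by-wl matrix of character indices
w   = cellstr(reshape(s(idx), nw, wl)); % word read at each position

[tu, ~, ix] = unique(w);                % ix(p): which distinct word is at position p

seen    = zeros(numel(tu), 1);          % occurrences of each word so far
running = zeros(nw, 1);                 % occurrence number of the word at position p
for p = 1:nw
    seen(ix(p)) = seen(ix(p)) + 1;
    running(p)  = seen(ix(p));
end

freq = running / N;                     % assumed convention: divide by N
disp([w, num2cell(freq)]);
```

With the example sequence, freq(1) is 1/12 for the first AA and freq(8) is 2/12 for its second appearance, which agrees with the values quoted in the question.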

us
