From: us on
"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp53f7$97n$1(a)fred.mathworks.com>...
> no not yet

....and the example(?)...

us
From: ambrosia nightwish on
In the FCGR toolbox by Jesús Mena-Chalco there is an example.
From: ambrosia nightwish on
If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so its frequency of appearance is freq=1/12; at the second position we read AC, and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).
From: Roger Stafford on
"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp54tn$2ii$1(a)fred.mathworks.com>...
> If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so its frequency of appearance is freq=1/12; at the second position we read AC, and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).
--------------
Here's an outline of how you might go about it. Let v be a vector of length N holding the nucleotide sequence (I am assuming the four nucleotides are represented by four numbers in this discussion) and let k be the desired word length.

1) Create an (N-k+1)-by-k matrix, M, whose rows are the successive length-k words. You can use the 'hankel' function for this purpose.

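2) Apply [B,m,n] = unique(M,'rows') to M. B will be a table of all the words appearing in the sequence, in sorted order, and n will give, for each row of M, the index of the matching row of B.

3) Apply 'histc' to the vector n, with bin edges 1:size(B,1), to obtain the counts of the B words in the sequence.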

4) From the counts you can obtain the frequencies.
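
As a rough sketch of steps 1 to 4, under a few assumptions that are not part of the outline itself: the letters A, C, G, T are mapped to 1 to 4 with 'ismember', the variable names counts and freq are only illustrative, and the counts are divided by the number of words, N-k+1 (the question's own example divides by N instead).

```matlab
% Sketch only: the letter-to-number mapping and the names counts/freq are assumptions.
s = 'AACCGTTAACGT';              % example sequence from the question
[~, v] = ismember(s, 'ACGT');    % assumed coding: A=1, C=2, G=3, T=4
N = numel(v);
k = 2;                           % word length

% Step 1: row i of M is the word v(i:i+k-1)
M = hankel(v(1:N-k+1), v(N-k+1:N));

% Step 2: B lists the distinct words in sorted order,
%         n(i) is the row of B that matches row i of M
[B, ~, n] = unique(M, 'rows');

% Step 3: number of occurrences of each row of B
counts = histc(n(:), 1:size(B,1));

% Step 4: frequencies, here relative to the number of words (assumption);
%         replace size(M,1) by N to follow the question's 1/12 convention
freq = counts / size(M,1);
```

Each row of B can be turned back into letters with something like alphabet = 'ACGT'; alphabet(B).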

Can you take it from there?

Roger Stafford
From: us on
"ambrosia nightwish" <mess_imen(a)yahoo.fr> wrote in message <hp54tn$2ii$1(a)fred.mathworks.com>...
> If I have the sequence AACCGTTAACGT and I want to find the frequencies of all the dinucleotides (words of 2 letters, for example): at position 1 we read AA, so its frequency of appearance is freq=1/12; at the second position we read AC, and freq=1/12; at the eighth position AA appears for the second time, so freq=2/12. The calculation stops at position N-len+1 (N: length of the sequence, len: length of the word).

one of the many solutions

% the data
s='AACCGTTAACGT';
wl=2;
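% the engine
% pattern: exactly wl non-whitespace characters
rpat=sprintf('\\S{%d,%d}',wl,wl);
t=cell(wl,1);
% matching from each offset 1..wl collects every overlapping word once
for i=1:wl
t{i,1}=regexp(s(i:end),rpat,'match').';
end
t=cat(1,t{:});
% tu: sorted unique words, ix: index of each word into tu
[tu,~,ix]=unique(t);
% n: number of occurrences of each unique word
n=histc(ix,1:max(ix));
r=[tu,num2cell(n)];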
% the result
disp(s);
disp(r);
%{
% S =
AACCGTTAACGT
% R =
'AA' [2]
'AC' [2]
'CC' [1]
'CG' [2]
'GT' [2]
'TA' [1]
'TT' [1]
%}
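
The result above lists counts. A minimal sketch of turning them into the frequencies asked about, assuming the denominator is the sequence length N as in the question's 1/12 and 2/12 (the names idx, words, cnt and freq are hypothetical):

```matlab
% Sketch only: idx, words, cnt and freq are illustrative names.
s = 'AACCGTTAACGT';
wl = 2;
N = numel(s);

% index matrix: row p holds the positions p, p+1, ..., p+wl-1
idx = bsxfun(@plus, (1:N-wl+1).', 0:wl-1);
words = cellstr(s(idx));          % one overlapping word per cell

[tu, ~, ix] = unique(words);      % tu: sorted unique words
cnt = accumarray(ix(:), 1);       % occurrences of each unique word

% the denominator N follows the question; N-wl+1 would normalize
% by the number of words read instead
freq = cnt / N;
r = [tu, num2cell(freq)];
```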

us