word occurrence counting (DNA) [Matlab]


From: Ashish Uthama on 2 Apr 2010 14:50

On Fri, 02 Apr 2010 13:10:31 -0300, ambrosia nightwish
<mess_imen(a)yahoo.fr> wrote:

> If I have the sequence AACCGTTAACGT and I want to find the frequencies
> of all the dinucleotides (words of 2 letters, for example): at
> position 1 we read AA, so the frequency of appearance is freq=1/12; at
> the second position we read AC and freq=1/12; at the eighth position AA
> appears for the second time, so freq=2/12. The calculation stops at
> position N-len+1 (N: length of the sequence, len: length of the word).

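s = 'AACCGTTAACGT';
wLen = 2;

% Associative array (hash, lookup table); please see "help containers.Map"
countMap = containers.Map();

% Slide a window of length wLen along the sequence, one position at a time
for indx = 1:length(s)-wLen+1
    curWord = s(indx:indx+wLen-1);
    if isKey(countMap, curWord)
        % We have seen this word before, increment its count
        countMap(curWord) = countMap(curWord) + 1;
    else
        % First occurrence of this word
        countMap(curWord) = 1;
    end
end

words = countMap.keys;
frequency = countMap.values;
% Convert the cell array of counts to a numeric array
frequency = [frequency{:}];

% Relative frequency of each word (no semicolon, so the result is displayed)
prob = frequency./sum(frequency)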
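
The loop above reports the final count of each word. The question also asks for the count at each position as the sequence is read, so here is a minimal sketch of that variant under the same containers.Map approach. The names runCount and runFreq are illustrative and not from the original post, and dividing by length(s) simply follows the 1/12 and 2/12 figures in the question.

```matlab
s = 'AACCGTTAACGT';
wLen = 2;
nWords = length(s) - wLen + 1;

countMap = containers.Map('KeyType', 'char', 'ValueType', 'double');
runCount = zeros(nWords, 1);   % illustrative name: count of the current word so far

for indx = 1:nWords
    curWord = s(indx:indx+wLen-1);
    if isKey(countMap, curWord)
        countMap(curWord) = countMap(curWord) + 1;
    else
        countMap(curWord) = 1;
    end
    % Record how many times this word has been seen up to this position
    runCount(indx) = countMap(curWord);
end

% Assumption: normalise by the sequence length, as in the question (1/12, 2/12)
runFreq = runCount / length(s);
```

With this sequence, the second AA is read at position 8, so runCount(8) is 2 and runFreq(8) is 2/12.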

From: ambrosia nightwish on 3 Apr 2010 17:36

The problem still exists: the first solution shows the number of counted words and gives only a final result. What I want is the number of appearances of each word at every step I walk (advancing by 1 and reading a word of length wl). Let us take the same example, s='AACCGTTAACGT',
for words of length 3:
AAC: n=1
ACC: n=1
CCG: n=1
CGT: n=1
GTT: n=1
TTA: n=1
TAA: n=1
AAC: n=2
ACG: n=1
CGT: n=2
As for the second solution, the containers.Map function does not exist in the MATLAB version that I have.

From: Bruno Luong on 4 Apr 2010 06:33

Something like this?

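s = 'AACCGTTAACGT';
k = 3;

d = double(s);
% Each row of A is one window of k consecutive characters
A = hankel(d(1:end-k+1),d(end-k+1:end));
% j maps each row of A to its row in the list of unique words u
[u i j] = unique(A,'rows');
b = zeros(length(i),1);   % total seen so far, per unique word
c = zeros(size(j));       % running count at each position
for n=1:length(j)
    jn = j(n);
    b(jn) = b(jn)+1;
    c(n) = b(jn);
end

% Display the words and their running counts
S = char(A)
c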

% Bruno
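
For anyone unsure about the hankel call above: hankel(col, row) builds a matrix whose first column is col and whose last row is row, with constant anti-diagonals, so row n ends up holding characters n to n+k-1 of the sequence. A small sketch to check this on the example (the flag ok is just an illustrative name):

```matlab
s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1), d(end-k+1:end));
S = char(A);   % one word of length k per row

% Compare every row of S with the substring read directly from s
ok = true;
for n = 1:size(S, 1)
    ok = ok && strcmp(S(n, :), s(n:n+k-1));
end
```

For this sequence S has length(s)-k+1 = 10 rows, starting with AAC and ending with CGT.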

From: ambrosia nightwish on 4 Apr 2010 07:11

That's working Bruno, thank you all

From: Bruno Luong on 4 Apr 2010 09:38

% Here is a vectorized version (not necessarily faster)
% http://www.mathworks.com/matlabcentral/fileexchange/24255

s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1),d(end-k+1:end));
[u i j] = unique(A,'rows');
% Sort the word labels so that identical words are adjacent
[js is] = sort(j);
clear c
% Number the members of each group 1,2,... and put them back in original order
c(is) = cell2mat(SplitVec(js,[],@(x) (1:length(x))')) % SplitVec on FEX

% Bruno
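
If SplitVec from the File Exchange is not available, the same numbering within each group can probably be done with built-in functions only. This is a sketch, not the method posted above: it relies on sort keeping equal elements in their original order, and the names isStart, startIdx and groupId are made up for illustration.

```matlab
s = 'AACCGTTAACGT';
k = 3;

d = double(s);
A = hankel(d(1:end-k+1), d(end-k+1:end));
[u, iu, j] = unique(A, 'rows');
j = j(:);                          % make sure j is a column vector

[js, is] = sort(j);                % identical words become adjacent
n = numel(js);

isStart  = [true; diff(js) ~= 0];  % true at the first member of each group
startIdx = find(isStart);          % position where each group begins
groupId  = cumsum(isStart);        % group number of every sorted element

% Rank inside the group = position minus the start of its group, plus 1
c = zeros(n, 1);
c(is) = (1:n)' - startIdx(groupId) + 1;
```

For the example sequence this gives c = [1 1 1 1 1 1 1 2 1 2]', matching the output of the loop version.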
