From: Sudheer Tumu on
Hi All,

I have two dimensional matrix with more than 10 million rows.
I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.

Eg: A=[1 2
3 4
5 6
1 2
1 2
3 4
1 2
5 6
7 8
1 2
3 4]
I want result like this by passing n=3 as parameter
Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
I am using loops which is taking nearly 30 mins to get the solution.

This is what I am doing:
for i=min(A(:,1)):max(A(:,1))
a1=find(A(:,1)==i);
for j=min(A(:,2)):max(A(:,2))
a2=find(A(a1(:),2)== j);

if size(a2,1) >= n
end
end
end


Thank you,
Sudheer Tumu.
From: us on
"Sudheer Tumu" <tumusudheer(a)gmail.com> wrote in message <hhudae$4p$1(a)fred.mathworks.com>...
> Hi All,
>
> I have two dimensional matrix with more than 10 million rows.
> I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.
>
> Eg: A=[1 2
> 3 4
> 5 6
> 1 2
> 1 2
> 3 4
> 1 2
> 5 6
> 7 8
> 1 2
> 3 4]
> I want result like this by passing n=3 as parameter
> Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
> I am using loops which is taking nearly 30 mins to get the solution.
>
> This is what I am doing:
> for i=min(A(:,1)):max(A(:,1))
> a1=find(A(:,1)==i);
> for j=min(A(:,2)):max(A(:,2))
> a2=find(A(a1(:),2)== j);
>
> if size(a2,1) >= n
> end
> end
> end
>
>
> Thank you,
> Sudheer Tumu.

one of the solutions

n=3;
v=[
1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
].';
[vu,vx,vx]=unique(v,'rows');
vn=histc(vx,1:max(vx));
ix=find(ismember(vx,find(vn>=n))).'
% ix = 1 2 4 5 6 7 10 11

us
From: Sudheer Tumu on
"us " <us(a)neurol.unizh.ch> wrote in message <hhuths$led$1(a)fred.mathworks.com>...
> "Sudheer Tumu" <tumusudheer(a)gmail.com> wrote in message <hhudae$4p$1(a)fred.mathworks.com>...
> > Hi All,
> >
> > I have two dimensional matrix with more than 10 million rows.
> > I want to get the indexes of rows which are repeted more than n times. n should be passed as argument.
> >
> > Eg: A=[1 2
> > 3 4
> > 5 6
> > 1 2
> > 1 2
> > 3 4
> > 1 2
> > 5 6
> > 7 8
> > 1 2
> > 3 4]
> > I want result like this by passing n=3 as parameter
> > Result=[1 2 4 5 6 7 10 11] Since [ 1 2] and [3 4] are the only rows which are repeated more than 3 times. Is there any way to do this without using loops.
> > I am using loops which is taking nearly 30 mins to get the solution.
> >
> > This is what I am doing:
> > for i=min(A(:,1)):max(A(:,1))
> > a1=find(A(:,1)==i);
> > for j=min(A(:,2)):max(A(:,2))
> > a2=find(A(a1(:),2)== j);
> >
> > if size(a2,1) >= n
> > end
> > end
> > end
> >
> >
> > Thank you,
> > Sudheer Tumu.
>
> one of the solutions
>
> n=3;
> v=[
> 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> ].';
> [vu,vx,vx]=unique(v,'rows');
> vn=histc(vx,1:max(vx));
> ix=find(ismember(vx,find(vn>=n))).'
> % ix = 1 2 4 5 6 7 10 11
>
> us

Hi us,

Thank you very very much, Your code is working great.

Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.
One more doubt, how did you know all these fast way doing things on Mtalab. Did you know all these by practice or is there any other way to learn things on own, like any material like etc ..
From: Jan Simon on
Dear Sudheer!

> > n=3;
> > v=[
> > 1, 3, 5, 1, 1, 3, 1, 5, 7, 1, 3
> > 2, 4, 6, 2, 2, 4, 2, 6, 8, 2, 4
> > ].';
> > [vu,vx,vx]=unique(v,'rows');
> > vn=histc(vx,1:max(vx));
> > ix=find(ismember(vx,find(vn>=n))).'
> > % ix = 1 2 4 5 6 7 10 11
> >
> > us

> Now It is taking 30 sec . It is great when It is compared to 1300 sec. Is there any fast way to do than taking 30 sec.

If the values are integers e.g. between 0 and 16, it might be an advantage to combine the 2 columns before running the method posted by us:
v2 = v(:,1) + 17 * v(:, 2);
Then UNIQUE does not need the 'rows' flag and sorting might become faster.
There is some potential in us' method: UNIQUE and ISMEMBER sort their input (and perhaps HISTC does also?). So it would be faster to sort the large array once and operate with the sorted index.

Kind regards, Jan
From: Bruno Luong on
"Jan Simon" <matlab.THIS_YEAR(a)nMINUSsimon.de> wrote in message <hi0kgl$c56$1(a)fred.mathworks.com>...
unning the method posted by us:
> v2 = v(:,1) + 17 * v(:, 2);
> Then UNIQUE does not need the 'rows' flag and sorting might become faster.
> There is some potential in us' method: UNIQUE and ISMEMBER sort their input (and perhaps HISTC does also?). So it would be faster to sort the large array once and operate with the sorted index.

Jan,

HISCT is a dichotomy insertion method, which would be O(N*log(M)) in complexity where N and M are respectively the number of data and edges. So HISTC is faster than SORT when M<N in theory.

A better way for large array is ACCUMARRAY, see a thread here: http://www.mathworks.co.uk/matlabcentral/newsreader/view_thread/252639

Bruno