From: Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF on
Hello,

Suppose you have the following.

Data = RandomReal[1,{N,2}];

sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

Timing[DeleteDuplicates[data,sameQ]][[1]];

If N is a large number, this takes an ungodly amount of time.

Is there a more efficient way to delete the duplicate entries of Data?

i.e.,

Data = {{1.,2.},{1.,3.},{2.,3.}};

would become:
{{1.,2.},{2.,3.}};


Thanks,


Clint Zeringue

From: Szabolcs Horvát on
On 2010.02.04. 12:25, Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF wrote:
> Hello,
>
> Suppose you have the following.
>
> Data = RandomReal[1,{N,2}];
>
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> Timing[DeleteDuplicates[data,sameQ]][[1]];
>
> If N is a large number this takes an ungodly amount of time?
>
> Is there a more efficient way to delete the duplicate entries of Data ?
>
> ie.
>
> Data = {{1.,2.},{1.,3.},{2.,3.}};
>
> Would become:
> {{1.,2.},{ 2.,3.}};
>

Take care not to use N as a variable as it already has a built-in meaning.

If it is not necessary to keep the elements of the list in the same
order, then a different, lower complexity algorithm can be used:

SplitBy[SortBy[data, First], First][[All, 1]]

This will be much faster, but will not remove exactly the same elements
as DeleteDuplicates because the second element of the pairs is always
ignored. DeleteDuplicates will always keep the very first occurrence of
equivalent elements. Is this important for your calculation?
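
For example, with the sameQ definitions from the original post (the four test pairs below are made up purely for illustration, not taken from the thread):

data = {{2., 5.}, {1., 2.}, {1., 3.}, {2., 3.}};

DeleteDuplicates[data, sameQ]
(* {{2., 5.}, {1., 2.}} -- the first occurrence of each leading value, in input order *)

SplitBy[SortBy[data, First], First][[All, 1]]
(* sorted by the leading value, so both the ordering and, when leading values
   are tied, which representative survives can differ from DeleteDuplicates *)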

From: Clint Zeringue on
Thanks for all the great responses.

I have summarized them below.

Use Tally or, even better, GatherBy, to obtain very substantial reductions in time:

In[1]:= data=RandomInteger[{1,99},{100000,2}];

In[2]:=
sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
Out[4]= {7.987,Null}

In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
Out[5]= 0.063

In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
Out[6]= 0.016

In[7]:= t0===t1===t2
Out[7]= True

Tomas
Tomas Garza [tgarza10(a)msn.com]

n = 100000;

data = RandomInteger[{0, 9}, {n, 2}] // N;

Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing

{1.39164,10}

Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing

{0.289784,10}


Bob Hanlon
Bob Hanlon [hanlonr(a)cox.net]



Take care not to use N as a variable as it already has a built-in meaning.

If it is not necessary to keep the elements of the list in the same order, then a different, lower complexity algorithm can be used:

SplitBy[SortBy[data, First], First][[All, 1]]

This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored. DeleteDuplicates will always keep the very first occurrence of equivalent elements. Is this important for your calculation?

Szabolcs Horvát [szhorvat(a)gmail.com]

The fastest solution was Tomas Garza's:

Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
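
One way to package the Tally-based test as a reusable function (the name dedupeByFirst is invented here, not taken from any of the replies):

dedupeByFirst[list_] := First /@ Tally[list, First[#1] == First[#2] &]

dedupeByFirst[{{1., 2.}, {1., 3.}, {2., 3.}}]
(* {{1., 2.}, {2., 3.}}, matching the output requested in the original question *)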

From: Daniel Lichtblau on

DeleteDuplicates is not able to recognize that there might be a fast way
to use your sameQ, hence it is an O(n^2) proposition when list length is
n. The variant below will behave better.

deldupes[ll_] := Module[{h, res},
  res = Reap[
      Do[
       (* h serves as a hash table keyed on the first element of each pair;
          a pair is sown only the first time its key is encountered *)
       If[! TrueQ[h[ll[[j, 1]]]],
        h[ll[[j, 1]]] = True;
        Sow[ll[[j]]]],
       {j, Length[ll]}]
      ][[2, 1]];
  Clear[h];
  res]

Example:

data = RandomInteger[10^3, {10^4, 2}];

In[27]:= Timing[dd1 = deldupes[data];]
Out[27]= {0.029996, Null}

In[28]:= Timing[dd2 = DeleteDuplicates[data, sameQ];]
Out[28]= {6.28405, Null}

In[31]:= dd2 // Length
Out[31]= 1001

In[32]:= dd1 === dd2
Out[32]= True
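
As a quick sanity check (a run of my own, not part of Daniel's message), deldupes also produces the output requested in the original question:

deldupes[{{1., 2.}, {1., 3.}, {2., 3.}}]
(* {{1., 2.}, {2., 3.}} *)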

Daniel Lichtblau
Wolfram Research

From: Bob Hanlon on

n = 100000;

data = RandomInteger[{0, 9}, {n, 2}] // N;

Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing

{1.39164,10}

Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing

{0.289784,10}


Bob Hanlon

---- "Zeringue wrote:

=============
Hello,

Suppose you have the following.

Data = RandomReal[1,{N,2}];

sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

Timing[DeleteDuplicates[data,sameQ]][[1]];

If N is a large number this takes an ungodly amount of time?

Is there a more efficient way to delete the duplicate entries of Data ?

ie.

Data = {{1.,2.},{1.,3.},{2.,3.}};

Would become:
{{1.,2.},{ 2.,3.}};


Thanks,


Clint Zeringue