From: Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF on 4 Feb 2010 06:25 Hello, Suppose you have the following. Data = RandomReal[1,{N,2}]; sameQ[_,_]=False; sameQ[{x_,y_},{x_,z_}]=True; Timing[DeleteDuplicates[data,sameQ]][[1]]; If N is a large number this takes an ungodly amount of time? Is there a more efficient way to delete the duplicate entries of Data ? ie. Data = {{1.,2.},{1.,3.},{2.,3.}}; Would become: {{1.,2.},{ 2.,3.}}; Thanks, Clint Zeringue
From: Szabolcs Horvát on 5 Feb 2010 03:27 On 2010.02.04. 12:25, Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF wrote: > Hello, > > Suppose you have the following. > > Data = RandomReal[1,{N,2}]; > > sameQ[_,_]=False; > sameQ[{x_,y_},{x_,z_}]=True; > > Timing[DeleteDuplicates[data,sameQ]][[1]]; > > If N is a large number this takes an ungodly amount of time? > > Is there a more efficient way to delete the duplicate entries of Data ? > > ie. > > Data = {{1.,2.},{1.,3.},{2.,3.}}; > > Would become: > {{1.,2.},{ 2.,3.}}; > Take care not to use N as a variable as it already has a built-in meaning. If it is not necessary to keep the elements of the list in the same order, then a different, lower complexity algorithm can be used: SplitBy[SortBy[data, First], First][[All, 1]] This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored. DeleteDuplicates will always keep the very first occurrence of equivalent elements. Is this important for your calculation?
From: Clint Zeringue on 5 Feb 2010 03:29 Thanks for all the great responses. I have summarized them below. Use Tally or, even better, GatherBy, to obtain very substantial reductions in time: In[1]:= data=RandomInteger[{1,99},{100000,2}]; In[2]:= sameQ[_,_]=False; sameQ[{x_,y_},{x_,z_}]=True; In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];] Out[4]= {7.987,Null} In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]] Out[5]= 0.063 In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]] Out[6]= 0.016 In[7]:= t0===t1===t2 Out[7]= True Tomas Tomas Garza [tgarza10(a)msn.com] n = 100000; data = RandomInteger[{0, 9}, {n, 2}] // N; Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing {1.39164,10} Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing {0.289784,10} Bob Hanlon Bob Hanlon [hanlonr(a)cox.net] Take care not to use N as a variable as it already has a built-in meaning. If it is not necessary to keep the elements of the list in the same order, then a different, lower complexity algorithm can be used: SplitBy[SortBy[data, First], First][[All, 1]] This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored. DeleteDuplicates will always keep the very first occurrence of equivalent elements. Is this important for your calculation? Szabolcs Horvát [szhorvat(a)gmail.com] The fastest solution was Tomas Garza's : Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[
From: Daniel Lichtblau on 5 Feb 2010 03:30 Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF wrote: > Hello, > > Suppose you have the following. > > Data = RandomReal[1,{N,2}]; > > sameQ[_,_]=False; > sameQ[{x_,y_},{x_,z_}]=True; > > Timing[DeleteDuplicates[data,sameQ]][[1]]; > > If N is a large number this takes an ungodly amount of time? > > Is there a more efficient way to delete the duplicate entries of Data ? > > ie. > > Data = {{1.,2.},{1.,3.},{2.,3.}}; > > Would become: > {{1.,2.},{ 2.,3.}}; > > > Thanks, > > Clint Zeringue DeleteDuplicates is not able to recognize that there might be a fast way to use your sameQ, hence it is an O(n^2) proposition when list length is n. The variant below will behave better. deldupes[ll_] := Module[{h, res}, res = Reap[Do[ If[! TrueQ[h[ll[[j, 1]]]], h[ll[[j, 1]]] = True; Sow[ll[[j]]]]; , {j, Length[ll]}] ][[2, 1]]; Clear[h]; res] Example: data = RandomInteger[10^3, {10^4, 2}]; In[27]:= Timing[dd1 = deldupes[data];] Out[27]= {0.029996, Null} In[28]:= Timing[dd2 = DeleteDuplicates[data, sameQ];] Out[28]= {6.28405, Null} In[31]:= dd2 // Length Out[31]= 1001 In[32]:= dd1 === dd2 Out[32]= True Daniel Lichtblau Wolfram Research
From: Bob Hanlon on 5 Feb 2010 03:32
n = 100000; data = RandomInteger[{0, 9}, {n, 2}] // N; Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing {1.39164,10} Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing {0.289784,10} Bob Hanlon ---- "Zeringue wrote: ============= Hello, Suppose you have the following. Data = RandomReal[1,{N,2}]; sameQ[_,_]=False; sameQ[{x_,y_},{x_,z_}]=True; Timing[DeleteDuplicates[data,sameQ]][[1]]; If N is a large number this takes an ungodly amount of time? Is there a more efficient way to delete the duplicate entries of Data ? ie. Data = {{1.,2.},{1.,3.},{2.,3.}}; Would become: {{1.,2.},{ 2.,3.}}; Thanks, Clint Zeringue |