From: David Park on 21 Nov 2009 03:38 One possibility is: datalist = Range[11, 3411, 10]; If[40 <= # <= 1500 \[Or] 1600 <= # <= 3300, Unevaluated[Sequence[]], #] & /@ datalist Sequence[] is dropped from a sequence of items. It has to be Unevaluated in the If statement so it's not dropped there! David Park djmpark(a)comcast.net http://home.comcast.net/~djmpark/ From: blamm64 [mailto:blamm64(a)charter.net] I have a couple of functions designed to poke a single hole, and to poke multiple holes, in a one-level list: We define a function which, given the imported pressure data, finds the subset of that pressure data excluding the pressure data points between "targetL " and "targetU". In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_? NumericQ] := Select[data,(#<=targetL || #>=targetU &)] This function will pluck out multiple holes in the data list. In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1}, tmp=data; Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1, {i,Dimensions[tarList][[1]]}]; tmp ] The following works fine (big holes chosen not to give large result): In[7]:= datalist=Range[11,3411,10]; In[12]:= targetlist={{40, 1500},{1600,3300}}; In[13]:= resultdata=subsets[datalist,targetlist] Out[13]= {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,3321,3 331,3341,3351,3361,3371,3381,3391,3401,3411} But if "datalist" happens to be very large, surely there is a (much) more efficient method? I tried unsuccessfully to use pure functions with Select, but have a somewhat nebulous feeling there's a pure function way of doing this effectively much more efficiently. I know, I know: the above have no consistency checking. I also know "subsets" could be used in place of "findsubset" just by replacing the call of "findsubset" with the code of "findsubset" in "subsets". >From what I've seen on this forum there are some really experienced people who might provide an efficient way of implementing the above. -Brian L.
From: Leonid Shifrin on 21 Nov 2009 03:42 Hi Brian, Your solution will be reasonably efficient if you need just a couple of holes, but will become less and less efficient when you increase the number of holes. Perhaps you will get better /more concise solutions which will be using some built-in that I am not aware of, but here is a (rather verbose) solution that should be fast enough even for a large number of holes. Clear[bsearchMinCompMassive ]; bsearchMinCompMassive = Compile[{{list, _Real, 1}, {elems, _Real, 1}}, Module[{n0 = 1, n2 = 1, n1 = Length[list], res = Array[0 &, {Length[elems]}], m = 1}, For[n2 = 1, n2 <= Length[elems], n2++, n0 = 1; n1 = Length[list]; While[n0 <= n1, m = Floor[(n0 + n1)/2]; If[list[[m]] < elems[[n2]], n0 = m + 1, n1 = m - 1]]; res[[n2]] = If[list[[m]] < elems[[n2]], m, m - 1] ]; res]]; Clear[bsearchMaxCompMassive]; bsearchMaxCompMassive = Compile[{{list, _Real, 1}, {elems, _Real, 1}}, Module[{n0 = 1, n2 = 1, n1 = Length[list], res = Array[0 &, {Length[elems]}], m = 1}, For[n2 = 1, n2 <= Length[elems], n2++, n0 = 1; n1 = Length[list]; While[n0 <= n1, m = Floor[(n0 + n1)/2]; If[list[[m]] > elems[[n2]], n1 = m - 1, n0 = m + 1]]; res[[n2]] = If[list[[m]] > elems[[n2]], m, m + 1] ]; res]]; Clear[deleteRegions]; deleteRegions[x : {__?NumberQ}, regs : {{_, _} ..}] := Module[{sorted, ord, result, positions, sortedRegs = Sort(a)regs}, sorted = x[[ord = Ordering[x]]]; positions = Sort(a)ord[[Complement[Range[Length[sorted]], Apply[Sequence, Range @@@ Transpose[{bsearchMaxCompMassive[sorted, #1], bsearchMinCompMassive[sorted, #2]} & @@ Transpose[sortedRegs]]]]]]; result = x[[positions]]]; Here are some benchmarks: In[1]:= datalist=RandomSample[#,Length[#]]&@Range[11,100000,10]; In[2]:= targetlist={{40,1500},{1600,3300}}; In[3]:= targetList2 = Table[{10*i,10*i+50},{i,2,150}]; In[4]:= (resultdata10=subsets[datalist,targetlist])//Short//Timing Out[4]= {0.032,{88671,91041,81971,6041,86891,<<9673>>,55451,73361,28271,51601,72831}} In[5]:= (resultdata20 = deleteRegions[datalist,targetlist])//Short//Timing Out[5]= {0.,{88671,91041,81971,6041,86891,<<9673>>,55451,73361,28271,51601,72831}} In[6]:= resultdata10==resultdata20 Out[6]= True In[7]:= (resultdata1=subsets[datalist,targetList2])//Short//Timing Out[7]= {2.313,{88671,91041,81971,6041,86891,<<9836>>,55451,73361,28271,51601,72831}} In[8]:= (resultdata2 = deleteRegions[datalist,targetList2])//Short//Timing Out[8]= {0.016,{88671,91041,81971,6041,86891,<<9836>>,55451,73361,28271,51601,72831}} In[9]:= resultdata1==resultdata2 Out[9]= True Regards, Leonid On Fri, Nov 20, 2009 at 2:38 PM, blamm64 <blamm64(a)charter.net> wrote: > I have a couple of functions designed to poke a single hole, and to > poke multiple holes, in a one-level list: > > We define a function which, given the imported pressure data, finds > the subset of that pressure data excluding the pressure data points > between "targetL " and "targetU". > > In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_? > NumericQ] := Select[data,(#<=targetL || #>=targetU &)] > > This function will pluck out multiple holes in the data list. > > In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1}, > tmp=data; > Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1, > {i,Dimensions[tarList][[1]]}]; > tmp > ] > > The following works fine (big holes chosen not to give large result): > > In[7]:= datalist=Range[11,3411,10]; > > In[12]:= targetlist={{40, 1500},{1600,3300}}; > > In[13]:= resultdata=subsets[datalist,targetlist] > > Out[13]= > > {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411} > > But if "datalist" happens to be very large, surely there is a (much) > more efficient method? > > I tried unsuccessfully to use pure functions with Select, but have a > somewhat nebulous feeling there's a pure function way of doing this > effectively much more efficiently. > > I know, I know: the above have no consistency checking. I also know > "subsets" could be used in place of "findsubset" just by replacing the > call of "findsubset" with the code of "findsubset" in "subsets". > > >From what I've seen on this forum there are some really experienced > people who might provide an efficient way of implementing the above. > > -Brian L. > >
From: Leonid Shifrin on 22 Nov 2009 06:16 Hi Brian, This is a follow-up to my previous post. Since you mentioned performance on large datasets as a key objective, I further optimized the code of my previous post. It now gives 3-4 times the speed of the solutions based on Select/IntervalMemberQ, at least in the region of the problem's parameter space that I was able to test. Here is the code: ------------------------------------------------------------------------------- Clear[deleteRegionsNew]; deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] := With[{snum = If[splitNumber === Automatic, 50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]}, Join @@ Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]]; Clear[partitionWithTail]; partitionWithTail[l_List, size_Integer] := If[IntegerQ[Length[l]/size], #, Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@ Partition[l, size]; Clear[deleteRegionsAux]; deleteRegionsAux[x_, regs_] := Module[{sorted, ord, sortedRegs, ones , xl, minn = Min[x] - 2, maxx = Max[x] + 2}, xl = Join[x, {minn, maxx}]; sortedRegs = Sort(a)DeleteCases[ Clip[regs, {minn + 1, maxx - 1}], {minn + 1, minn + 1} | {maxx - 1, maxx - 1}]; ones = ConstantArray[1, {Length[sortedRegs]}]; sorted = xl[[ord = Ordering[xl]]]; Drop[ xl[[Sort[ord[[#]]]]], -2] &@ Complement[Range[Length[sorted]], Apply[Join, Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones], bsearchMinMassive[sorted, #2, ones]} & @@ Transpose[sortedRegs]]]]]; Clear[bsearchMinMassive]; bsearchMinMassive = Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}}, Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0, m = ones, diff = ones, un1 = ones, un2 = ones}, n1 = Length[list]*n0; While[Sign[n0 - n1] != ones, m = Floor[(n0 + n1)/2]; un1 = Floor(a)UnitStep[list[[m]] - elems]; un2 = ones - un1; n1 = n1*un2 + (m - 1)*un1; n0 = n0*un1 + (m + 1)*un2; ]; Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]]; Clear[bsearchMaxMassive]; bsearchMaxMassive = Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}}, Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0, m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones}, n1 = Length[list]*n0; While[Sign[n0 - n1] != ones, m = Floor[(n0 + n1)/2]; un1 = Floor(a)UnitStep[elems - list[[m]]]; un2 = ones - un1; n1 = n1*un1 + (m - 1)*un2; n0 = n0*un2 + (m + 1)*un1; ]; Floor[m + UnitStep[elems - list[[m]]]]]]; (* Daniel's solution for comparison*) Clear[subsets2]; subsets2[data_?VectorQ, tarList_?ListQ] := Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]}, Select[data, ! IntervalMemberQ[intv, #] &]] ---------------------------------------------------------------------------------- Benchmarking on a large dataset: In[1]:= datalist=RandomInteger[1100000,6500000]; targetlist=Table[{n,n+20},{n,100,6500000,50}]; In[2]:= Length[targetlist] Out[2]= 129999 In[3]:= Timing[resultdata=subsets2[datalist,targetlist];] Out[3]= {88.797,Null} In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];] Out[4]= {26.629,Null} In[5]:= resultdata===resultdata2 Out[5]= True In[6]:= datalist = RandomInteger[1100000, 6500000]; targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}]; In[7]:= Length[targetlist] Out[7]= 1300 In[8]:= Timing[resultdata = subsets2[datalist, targetlist];] Out[8]= {46.927, Null} In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];] Out[9]= {10.625, Null} In[10]:= resultdata === resultdata2 Out[10]= True My code uses one tuning parameter - the third optional parameter which determines the size of the data chunk to split the original data list into. I use some heuristics to set it up when it is not explicitly set. The main idea of the implementation is to sort the initial dataset, use binary search to determine the starting and ending positions of elements in sorted set which get into the holes, and then extract only those elements which do not. I can expand this description upon request. Regards, Leonid On Fri, Nov 20, 2009 at 3:38 AM, blamm64 <blamm64(a)charter.net> wrote: > I have a couple of functions designed to poke a single hole, and to > poke multiple holes, in a one-level list: > > We define a function which, given the imported pressure data, finds > the subset of that pressure data excluding the pressure data points > between "targetL " and "targetU". > > In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_? > NumericQ] := Select[data,(#<=targetL || #>=targetU &)] > > This function will pluck out multiple holes in the data list. > > In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1}, > tmp=data; > Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1, > {i,Dimensions[tarList][[1]]}]; > tmp > ] > > The following works fine (big holes chosen not to give large result): > > In[7]:= datalist=Range[11,3411,10]; > > In[12]:= targetlist={{40, 1500},{1600,3300}}; > > In[13]:= resultdata=subsets[datalist,targetlist] > > Out[13]= > > {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411} > > But if "datalist" happens to be very large, surely there is a (much) > more efficient method? > > I tried unsuccessfully to use pure functions with Select, but have a > somewhat nebulous feeling there's a pure function way of doing this > effectively much more efficiently. > > I know, I know: the above have no consistency checking. I also know > "subsets" could be used in place of "findsubset" just by replacing the > call of "findsubset" with the code of "findsubset" in "subsets". > > >From what I've seen on this forum there are some really experienced > people who might provide an efficient way of implementing the above. > > -Brian L. > >
From: Ray Koopman on 23 Nov 2009 06:59 If you want something less complex than Leonid's code, here are two simple speedups for the IntervalMemberQ approach. First, it seems to run a little faster with integer limits than with real limits. datalist = Table[Random[Integer,11*^5],{m = 1*^6}]; targetlist = Table[{n,n+20},{n,100,m,50}]; Block[{realreject = Interval@@({.5,-.5}+#&)/@targetlist}, Select[datalist,!IntervalMemberQ[realreject,#]&]] //Length //Timing {9.75 Second, 654464} Block[{integereject = Interval@@({1,-1}+#&)/@targetlist}, Select[datalist,!IntervalMemberQ[integereject,#]&]] //Length //Timing {8.69 Second, 654464} Second, if you expect to keep more items than you drop then it's probably faster to look for the donuts than the holes. Block[{keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1}, Flatten(a)targetlist,{Max[datalist,targetlist]+1}], 2]}, Select[datalist,IntervalMemberQ[keep,#]&]] //Length //Timing {7.47 Second, 654464} On Nov 22, 3:16 am, Leonid Shifrin <lsh...(a)gmail.com> wrote: > Hi Brian, > > This is a follow-up to my previous post. Since you mentioned performance > on large datasets as a key objective, I further optimized the code of my > previous post. It now gives 3-4 times the speed of the solutions based > on Select/IntervalMemberQ, > at least in the region of the problem's parameter space that I was able > to test. Here is the code: > > ---------------------------------------------------------------------- > > Clear[deleteRegionsNew]; > deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] := > With[{snum = > If[splitNumber === Automatic, > 50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]}, > Join @@ > Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]]; > > Clear[partitionWithTail]; > partitionWithTail[l_List, size_Integer] := > If[IntegerQ[Length[l]/size], > #, > Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@ > Partition[l, size]; > > Clear[deleteRegionsAux]; > deleteRegionsAux[x_, regs_] := > Module[{sorted, ord, sortedRegs, ones , xl, minn = Min[x] - 2, > maxx = Max[x] + 2}, > xl = Join[x, {minn, maxx}]; > sortedRegs = > Sort(a)DeleteCases[ > Clip[regs, {minn + 1, maxx - 1}], {minn + 1, > minn + 1} | {maxx - 1, maxx - 1}]; > ones = ConstantArray[1, {Length[sortedRegs]}]; > sorted = xl[[ord = Ordering[xl]]]; > Drop[ xl[[Sort[ord[[#]]]]], -2] &@ > Complement[Range[Length[sorted]], > Apply[Join, > Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones], > bsearchMinMassive[sorted, #2, ones]} & @@ > Transpose[sortedRegs]]]]]; > > Clear[bsearchMinMassive]; > bsearchMinMassive = > Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}}, > Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0, > m = ones, diff = ones, un1 = ones, un2 = ones}, > n1 = Length[list]*n0; > While[Sign[n0 - n1] != ones, > m = Floor[(n0 + n1)/2]; > un1 = Floor(a)UnitStep[list[[m]] - elems]; > un2 = ones - un1; > n1 = n1*un2 + (m - 1)*un1; > n0 = n0*un1 + (m + 1)*un2; > ]; > Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]]; > > Clear[bsearchMaxMassive]; > bsearchMaxMassive = > Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}}, > Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0, > m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones}, > n1 = Length[list]*n0; > While[Sign[n0 - n1] != ones, > m = Floor[(n0 + n1)/2]; > un1 = Floor(a)UnitStep[elems - list[[m]]]; > un2 = ones - un1; > n1 = n1*un1 + (m - 1)*un2; > n0 = n0*un2 + (m + 1)*un1; > ]; > Floor[m + UnitStep[elems - list[[m]]]]]]; > > > (* Daniel's solution for comparison*) > > Clear[subsets2]; > subsets2[data_?VectorQ, tarList_?ListQ] := > Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]}, > Select[data, ! IntervalMemberQ[intv, #] &]] > > ---------------------------------------------------------------------- > > Benchmarking on a large dataset: > > In[1]:= > datalist=RandomInteger[1100000,6500000]; > targetlist=Table[{n,n+20},{n,100,6500000,50}]; > > In[2]:= Length[targetlist] > > Out[2]= 129999 > > In[3]:= > > Timing[resultdata=subsets2[datalist,targetlist];] > > Out[3]= {88.797,Null} > > In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];] > > Out[4]= {26.629,Null} > > In[5]:= resultdata===resultdata2 > > Out[5]= True > > In[6]:= > datalist = RandomInteger[1100000, 6500000]; > targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}]; > > In[7]:= Length[targetlist] > > Out[7]= 1300 > > In[8]:= > Timing[resultdata = subsets2[datalist, targetlist];] > > Out[8]= {46.927, Null} > > In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];] > > Out[9]= {10.625, Null} > > In[10]:= resultdata === resultdata2 > > Out[10]= True > > My code uses one tuning parameter - the third optional parameter which > determines the size of the data chunk to split the original data list > into. I use some heuristics to set it up when it is not explicitly set. > > The main idea of the implementation is to sort the initial dataset, > use binary search to determine the starting and ending positions of > elements in sorted set which get into the holes, and then extract only > those elements which do not. I can expand this description upon request. > > Regards, > Leonid > > On Fri, Nov 20, 2009 at 3:38 AM, blamm64 <blam...(a)charter.net> wrote: > >> I have a couple of functions designed to poke a single hole, and to >> poke multiple holes, in a one-level list: >> >> We define a function which, given the imported pressure data, finds >> the subset of that pressure data excluding the pressure data points >> between "targetL " and "targetU". >> >> In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_? >> NumericQ] := Select[data,(#<=targetL || #>=targetU &)] >> >> This function will pluck out multiple holes in the data list. >> >> In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1}, >> tmp=data; >> Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1, >> {i,Dimensions[tarList][[1]]}]; >> tmp >> ] >> >> The following works fine (big holes chosen not to give large result): >> >> In[7]:= datalist=Range[11,3411,10]; >> >> In[12]:= targetlist={{40, 1500},{1600,3300}}; >> >> In[13]:= resultdata=subsets[datalist,targetlist] >> >> Out[13]= >> >> {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591, >> 3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411} >> >> But if "datalist" happens to be very large, surely there is a (much) >> more efficient method? >> >> I tried unsuccessfully to use pure functions with Select, but have a >> somewhat nebulous feeling there's a pure function way of doing this >> effectively much more efficiently. >> >> I know, I know: the above have no consistency checking. I also know >> "subsets" could be used in place of "findsubset" just by replacing the >> call of "findsubset" with the code of "findsubset" in "subsets". >> >> From what I've seen on this forum there are some really experienced >> people who might provide an efficient way of implementing the above. >> >> -Brian L.
From: Ray Koopman on 24 Nov 2009 05:51
On Nov 20, 3:38 am, blamm64 <blam...(a)charter.net> wrote: On Nov 20, 3:38 am, blamm64 <blam...(a)charter.net> wrote: > I have a couple of functions designed to poke a single hole, and to > poke multiple holes, in a one-level list: > > We define a function which, given the imported pressure data, finds > the subset of that pressure data excluding the pressure data points > between "targetL " and "targetU". > > In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_? > NumericQ] := Select[data,(#<=targetL || #>=targetU &)] > > This function will pluck out multiple holes in the data list. > > In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1}, > tmp=data; > Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1, > {i,Dimensions[tarList][[1]]}]; > tmp > ] > > The following works fine (big holes chosen not to give large result): > > In[7]:= datalist=Range[11,3411,10]; > > In[12]:= targetlist={{40, 1500},{1600,3300}}; > > In[13]:= resultdata=subsets[datalist,targetlist] > > Out[13]= > {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591, > 3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411} > > But if "datalist" happens to be very large, surely there is a (much) > more efficient method? > > I tried unsuccessfully to use pure functions with Select, but have a > somewhat nebulous feeling there's a pure function way of doing this > effectively much more efficiently. > > I know, I know: the above have no consistency checking. I also know > "subsets" could be used in place of "findsubset" just by replacing > the call of "findsubset" with the code of "findsubset" in "subsets". > > From what I've seen on this forum there are some really experienced > people who might provide an efficient way of implementing the above. > > -Brian L. In your example, 'datalist' contains sorted integers, and 'targetlist' contains sorted nonoverlapping "holes" with integer bounds. If this will always be the case then the following code, that "walks" the two lists together, will be much faster than using IntervalMemberQ. funk = Compile[{{datalist,_Integer,1},{targetlist,_Integer,2}}, Module[{b = Append[Flatten[{0,-1}+#&/@targetlist],Last(a)datalist+1], j = 1}, Select[datalist,(While[b[[j]]<#,j++]; OddQ@j)&]]] |