From: Leonid Shifrin on
Hi,

This is a follow-up to my previous post.

I did not quite get your original goal before and my previous post was
somewhat messy. Here is the code which will probably do what you need:

The test list
myList = RandomInteger[{1, 5555}, #] & /@ {{19808, 5}, {7952,
5}, {7952, 5}};

The code:

Clear[mergeSameDates];
mergeSameDates[lst_List] :=
With[{pos = PositionsOfSame @@ lst[[All, All, 5]]},
With[{trpos = Transpose[pos]},
Flatten[
Transpose@
Table[lst[[i, #]] & /@ trpos[[i]], {i, 1,
Length[lst]}], {{1}, {2, 3}}]]];

It uses the function PositionsOfSame, described in my previous post. You use
it as follows:

In[4]:= (res = mergeSameDates[myList]); // Timing

Out[4]= {0.831, Null}

Regards,
Leonid



On Thu, Jul 1, 2010 at 5:28 AM, Garapata <warsaw95826(a)mypacks.net> wrote:

> I have a large nested list, "myList"
>
> It has 3 sublists with the following dimensions:
>
> Dimensions/@ myList
>
> {{19808, 5}, {7952, 5}, {7952, 5}}
>
> The 5th position (i.e., column) in each of the sublists has
> SQLDateTime[]s
> (This may or may not affect what I need, but I thought everyone should
> know).
>
> myIntersection = Intersection @@ (myList[[All, All, 5]]);
>
> gives me the SQLDateTimes[]s common to all sublists. I get 3954
> common elements.
>
> Length[myIntersection]
>
> 3954
>
> All of the above works great and runs very fast.
>
> I then find the positions in myList where all the common
> SQLDateTimes[]s occur and then use Extract pull them out into a new
> list
>
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1];
>
> myOutput = Extract[myList, #] & /@ myPositions;
>
> I end up with just what I need, which in this case gives me 3954 rows
> of {9, 5} sublists. This occurs because myList[[1]] has 5 occurrences
> of each common date element and sublists myList[[2]] and myList[[3]]
> each have 2 occurrences of each common date element.
>
> The Extract[] runs very fast.
>
> My problem =85. the Position[] runs very very slow (over 90 seconds on a
> dual core iMac).
>
> All the code together:
>
> myIntersection = Intersection @@ (myList[[All, All, 5]]);
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1];
> myOutput = Extract[myList, #] & /@ myPositions;
>
> So, does anyone know a way to speed up:
>
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1]; ?
>
> Or can anyone suggest another approach to doing this that could run
> faster.
>
> Patterns?
> ParallelMap?
> Parallelize?
> Sorting?
> Changing SQLDateTimes to DateList[]s before calculating myPositions?
>
> Not clear what to try.
> Please advise.
>
> Thanks.
>
>
From: Leonid Shifrin on
Hi,

I used some problem-specific information (namely that your data - at least
dates - are integers) to get another 2-3 times speed-up on your problem,
with respect to my previous suggestion:

Clear[extractPositionFromSparseArray];
extractPositionFromSparseArray[
HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]


Clear[listSplit];
listSplit[x_List, lengths_List] :=
MapThread[Take[x, {##}] &, {Most[#], Rest[#] - 1}] &@
Accumulate[Prepend[lengths, 1]];

Clear[positionsOfSameInteger];
positionsOfSameInteger[x__List] :=
Module[{inter = Intersection[x], alloc = Table[0, {Max[Join[x]]}],
pFun, plists, extracted, orders, frequencies, positions, allocAll},
allocAll = Array[0 &, {Length[{x}], Max[inter]}];
alloc[[inter]] = 1;
plists =
Flatten[extractPositionFromSparseArray[
SparseArray[alloc[[#]]]]] & /@ {x} ;
extracted = MapThread[Part, {{x}, plists}];
orders = Ordering /@ extracted;
frequencies = Tally /@ MapThread[Part, {extracted, orders}];
positions = MapThread[listSplit[#1[[#2]], #3[[All, 2]]] &,
{plists, orders, frequencies}];
MapThread[(allocAll[[#1, #2[[All, 1]]]] = #3) &,
{Range[Length[{x}]], frequencies, positions}];
Transpose[#[[inter]] & /@ allocAll]];


Clear[mergeSameDates];
mergeSameDates[lst_List] :=
With[{pos = positionsOfSameInteger @@ lst[[All, All, 5]]},
With[{trpos = Transpose[pos]},
Flatten[Transpose@
Table[lst[[i, #]] & /@ trpos[[i]], {i, 1,
Length[lst]}], {{1}, {2, 3}}]]];

My benchmarks now show that it performs about twice faster than the fastest
previous solution (provided by Peter Pein), at least on the test example
below:

In[39]=

myList = RandomInteger[{1, 5555}, #] & /@ {{19808, 5}, {7952,
5}, {7952, 5}};


In[40]:= (res = mergeSameDates[myList]); // Timing

Out[40]= {0.381, Null}

In[42]:=
Timing[
myIntersection = Intersection @@ (myList[[All, All, 5]]);
Dimensions[
myOutput3 =
Flatten[Reap[Map[Sow[#, Last[#]] &, myList, {2}],
myIntersection][[2]], 1]]]

Out[42]= {0.821, {3159}}

In[43]:= res == myOutput3

Out[43]= True

Peter's solution is however certainly more elegant and much more concise.
OTOH, it does not give you the detailed information about positions while
mine does, so they are complementary. A function positionsOfSameInteger is a
byproduct of this problem, and can be for integer lists up to an order of
magnitude faster than the more generic implementation I used in my package -
it may be useful in many contexts.

Anyway, hope this helps.

Regards,
Leonid


On Thu, Jul 1, 2010 at 5:28 AM, Garapata <warsaw95826(a)mypacks.net> wrote:

> I have a large nested list, "myList"
>
> It has 3 sublists with the following dimensions:
>
> Dimensions/@ myList
>
> {{19808, 5}, {7952, 5}, {7952, 5}}
>
> The 5th position (i.e., column) in each of the sublists has
> SQLDateTime[]s
> (This may or may not affect what I need, but I thought everyone should
> know).
>
> myIntersection = Intersection @@ (myList[[All, All, 5]]);
>
> gives me the SQLDateTimes[]s common to all sublists. I get 3954
> common elements.
>
> Length[myIntersection]
>
> 3954
>
> All of the above works great and runs very fast.
>
> I then find the positions in myList where all the common
> SQLDateTimes[]s occur and then use Extract pull them out into a new
> list
>
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1];
>
> myOutput = Extract[myList, #] & /@ myPositions;
>
> I end up with just what I need, which in this case gives me 3954 rows
> of {9, 5} sublists. This occurs because myList[[1]] has 5 occurrences
> of each common date element and sublists myList[[2]] and myList[[3]]
> each have 2 occurrences of each common date element.
>
> The Extract[] runs very fast.
>
> My problem =85. the Position[] runs very very slow (over 90 seconds on a
> dual core iMac).
>
> All the code together:
>
> myIntersection = Intersection @@ (myList[[All, All, 5]]);
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1];
> myOutput = Extract[myList, #] & /@ myPositions;
>
> So, does anyone know a way to speed up:
>
> myPositions = Drop[(Position[data, #] & /@ myIntersection), None,
> None, -1]; ?
>
> Or can anyone suggest another approach to doing this that could run
> faster.
>
> Patterns?
> ParallelMap?
> Parallelize?
> Sorting?
> Changing SQLDateTimes to DateList[]s before calculating myPositions?
>
> Not clear what to try.
> Please advise.
>
> Thanks.
>
>