Prev: urllib2 times out accessing SSL URL on OS X
Next: Hi, friends. I wanna ask if there is a function which is ableto take a list as argument and then return its top-k maximums?
From: Chris Rebert on 22 Apr 2010 10:04 2010/4/22 Jo Chan <csjcg2(a)gmail.com>: > Hi,friends. > Â I wanna ask if there is a function which is able to take a list as argument > and then return its top-k maximums? > I only know about max which is poorly a top-1 maximum function, now I want > more yet I am lazy enough that don't want to write one by myself. http://docs.python.org/library/heapq.html#heapq.nlargest Cheers, Chris -- http://blog.rebertia.com
From: Bryan on 23 Apr 2010 04:24 Steven D'Aprano wrote: > John Nagle wrote: > > Is "nlargest" smart enough to decide when it's cheaper to track the > > N largest entries on a linear pass through the list than to sort? It *always* does a linear pass through the list (linear, that is in the length of the entire list). It tracks the n largest elements seen so far in a heap. > Doesn't appear to do so. From Python 3.1: I think you're mis-reading it. > def nlargest(n, iterable): > """Find the n largest elements in a dataset. > > Equivalent to: sorted(iterable, reverse=True)[:n] > """ > it = iter(iterable) > result = list(islice(it, n)) > if not result: > return result > heapify(result) > _heappushpop = heappushpop > for elem in it: > _heappushpop(result, elem) > result.sort(reverse=True) > return result That doesn't sort, or even heapify, the whole list. It keeps the largest n elements seen so far in the 'result' heap, smallest on top. Note the doc for heappushpop: "Push item on the heap, then pop and return the smallest item from the heap." Thus the 'result' heap stays of size n. The final result.sort() is just so the returned list is sorted, and assuming that's a requirement, I think the algorithm is asymptotically the best a comparison-based method can do, regardless of whether the length of the entire sequence dominates n. I figure the worst-case run time is Theta(s lg(n)) where s in the length of the sequence. > Interestingly, nsmallest does use two different algorithms, > depending on how many items you ask for. See the source code. That is interesting. The above algorithm for nlargest is better, but to use it for nsmallest requires a largest-on-top heap, which the module does not bother to implement. -- --Bryan
From: Raymond Hettinger on 23 Apr 2010 20:34 On Apr 22, 10:49 am, John Nagle <na...(a)animats.com> wrote: > Chris Rebert wrote: > > 2010/4/22 Jo Chan <csj...(a)gmail.com>: > >> Hi,friends. > >> I wanna ask if there is a function which is able to take a list as argument > >> and then return its top-k maximums? > >> I only know about max which is poorly a top-1 maximum function, now I want > >> more yet I am lazy enough that don't want to write one by myself. > > >http://docs.python.org/library/heapq.html#heapq.nlargest > > > Cheers, > > Chris > > -- > >http://blog.rebertia.com > > Is "nlargest" smart enough to decide when it's cheaper to track the > N largest entries on a linear pass through the list than to sort? nlargest() has gotten smarter over time. So, the answer depends on which version you're using. Here's a snippet from http://svn.python.org/view/python/branches/py3k/Lib/heapq.py?revision=70695&view=markup def nlargest(n, iterable, key=None): """Find the n largest elements in a dataset. Equivalent to: sorted(iterable, key=key, reverse=True)[:n] """ # Short-cut for n==1 is to use max() when len(iterable)>0 if n == 1: it = iter(iterable) head = list(islice(it, 1)) if not head: return [] if key is None: return [max(chain(head, it))] return [max(chain(head, it), key=key)] # When n>=size, it's faster to use sort() try: size = len(iterable) except (TypeError, AttributeError): pass else: if n >= size: return sorted(iterable, key=key, reverse=True)[:n] # When key is none, use simpler decoration if key is None: it = zip(iterable, count(0,-1)) # decorate result = _nlargest(n, it) return [r[0] for r in result] # undecorate # General case, slowest method in1, in2 = tee(iterable) it = zip(map(key, in1), count(0,-1), in2) # decorate result = _nlargest(n, it) return [r[2] for r in result] # undecorate
From: Raymond Hettinger on 23 Apr 2010 20:40
On Apr 23, 1:24 am, Bryan <bryanjugglercryptograp...(a)yahoo.com> wrote: > That is interesting. The above algorithm for nlargest is better, but > to use it for nsmallest requires a largest-on-top heap, which the > module does not bother to implement. FWIW, the pure python versions differ because they are both implemented in terms of the basic minheap functions, but the underlying C code uses the same algorithm for both (using an underlying maxheap when necessary): http://svn.python.org/view/python/branches/py3k/Modules/_heapqmodule.c?revision=64351&view=markup Raymond |