From: dmtr on 7 Aug 2010 02:27

On Aug 6, 10:56 pm, Michael Torrie <torr...(a)gmail.com> wrote:
> On 08/06/2010 07:56 PM, dmtr wrote:
>
> > Ultimately a dict that can store ~20,000,000 entries: (u'short
> > string' : (int, int, int, int, int, int, int)).
>
> I think you really need a real database engine. With the proper
> indexes, MySQL could be very fast storing and retrieving this
> information for you. And it will use your RAM to cache as it sees fit.
> Don't try to reinvent the wheel here.

No, I've tried. DB solutions are not even close in terms of speed;
processing would take weeks :( Memcached or Redis sort of work, but
they are still a bit too slow to be a pleasure to work with. The
standard dict() container is *a lot* faster. It is also hassle-free
(accepting unicode keys, etc.). I just wish there were a more compact
dict container, optimized for large datasets and for memory rather
than speed.

With the default dict() I'm also running into some kind of nonlinear
performance degradation, apparently after 10,000,000-13,000,000 keys,
but I can't reproduce it with a solid test case
(see http://bugs.python.org/issue9520) :(

-- Dmitry
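As a rough illustration of the numbers under discussion (my own sketch, not from the thread): on Linux, the per-entry cost of exactly this kind of dict can be estimated by reading VmSize from /proc before and after building it; the 1,000,000-entry count below is just an illustrative choice.

# Python 2 sketch: rough per-entry memory cost of a
# {unicode key: tuple of 7 ints} dict (Linux only, uses /proc).
import os
import re

def vm_size_kb():
    # The VmSize line of /proc/<pid>/status reports the value in kB.
    status = open('/proc/%d/status' % os.getpid()).read()
    return int(re.search(r'VmSize:\s+(\d+) kB', status).group(1))

before = vm_size_kb()
d = {}
for i in xrange(1000000):
    d[unicode(i)] = (i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6)
after = vm_size_kb()

print "%.1f bytes per entry" % ((after - before) * 1024.0 / len(d))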
From: Peter Otten on 7 Aug 2010 02:50

dmtr wrote:

>> > Well... 63 bytes per item for very short unicode strings... Is there
>> > any way to do better than that? Perhaps some compact unicode objects?
>>
>> There is a certain price you pay for having full-feature Python objects.
>
> Are there any *compact* Python objects? Optimized for compactness?
>
>> What are you trying to accomplish anyway? Maybe the array module can be
>> of some help. Or numpy?
>
> Ultimately a dict that can store ~20,000,000 entries: (u'short
> string' : (int, int, int, int, int, int, int)).

I don't know to what extent it still applies, but switching off cyclic
garbage collection with

import gc
gc.disable()

while building large data structures used to speed things up
significantly. That's what I would try first with your real data.

Encoding your unicode strings as UTF-8 could save some memory.

When your integers fit into two bytes, say, you can use an
array.array() instead of the tuple.

Peter
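A short sketch (mine, not from the posts) that combines the three suggestions: cyclic GC switched off while building, UTF-8-encoded byte-string keys, and array.array values instead of tuples.

# Python 2 sketch of the combined suggestions.
import array
import gc

gc.disable()   # no reference cycles are created below, so collection is pure overhead
d = {}
for i in xrange(1000000):
    key = unicode(i).encode('utf-8')   # plain byte string instead of a unicode object
    # 'i' (4-byte signed) holds these demo values; 'h' (2-byte signed) would
    # do only if the real values fit in -32768..32767.
    d[key] = array.array('i', (i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6))
gc.enable()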
From: dmtr on 7 Aug 2010 03:30

On Aug 6, 11:50 pm, Peter Otten <__pete...(a)web.de> wrote:
> I don't know to what extent it still applies, but switching off cyclic
> garbage collection with
>
> import gc
> gc.disable()

Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.

> while building large data structures used to speed things up
> significantly. That's what I would try first with your real data.
>
> Encoding your unicode strings as UTF-8 could save some memory.

Yes... In fact that's what I'm trying now... .encode('utf-8')
definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too...

> When your integers fit into two bytes, say, you can use an array.array()
> instead of the tuple.

Excellent idea. Thanks! And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:

>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'], 4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'], 4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'], 3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.

-- Dmitry
From: dmtr on 7 Aug 2010 03:33

Correction. I've copy-pasted it wrong! array.array('i', (i, i+1, i+2,
i+3, i+4, i+5, i+6)) was the best.

>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'], 4.079240 seconds, 245143.698209 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'], 4.985136 seconds, 200596.331486 keys per second

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'], 3.572301 seconds, 279931.625282 keys per second

-- Dmitry
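For reference, here is a self-contained reconstruction of the kind of loop behind these numbers; the reporting helper and variable names are my own, inferred from the output format above rather than taken from the original posts.

# Python 2 reconstruction of the timing/memory report (Linux only).
import array
import os
import re
import time

def report(d, n, dt):
    # Same style of output as the benchmark lines above.
    vm = re.findall(r"(VmPeak.*|VmSize.*)",
                    open('/proc/%d/status' % os.getpid()).read())
    print "%d keys, %s, %f seconds, %f keys per second" % (n, vm, dt, n / dt)

n = 1000000
d = {}
start = time.time()
for i in xrange(n):
    d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
report(d, n, time.time() - start)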
From: Peter Otten on 7 Aug 2010 04:16
dmtr wrote:

> On Aug 6, 11:50 pm, Peter Otten <__pete...(a)web.de> wrote:
>> I don't know to what extent it still applies, but switching off cyclic
>> garbage collection with
>>
>> import gc
>> gc.disable()
>
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
>
>> while building large data structures used to speed things up
>> significantly. That's what I would try first with your real data.
>>
>> Encoding your unicode strings as UTF-8 could save some memory.
>
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too...
>
>> When your integers fit into two bytes, say, you can use an array.array()
>> instead of the tuple.
>
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
>
> Unicode key / tuple:
>
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'], 4.079240 seconds, 245143.698209 keys per second
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'], 4.985136 seconds, 200596.331486 keys per second
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'], 3.572301 seconds, 279931.625282 keys per second
>
> Almost halved the memory usage. And faster too. Nice.
>
> def benchmark_dict(d, N):
>     start = time.time()
>
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([ letters[random.randint(0, 255)] for i in xrange(length) ])
>         d[word] += 1
>
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (N, len(d), vm, dt, N / dt)

Looking at your benchmark, random.choice(letters) has probably less
overhead than letters[random.randint(...)]. You might even try to
inline it as

letters[int(random.random() * 256)]

Peter
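A quick way to check that claim (my own timeit sketch, with 'letters' standing in for the 256-element list used in dmtr's benchmark):

# Python 2 sketch comparing three ways of picking a random element
# from a 256-item sequence.
import timeit

setup = (
    "import random, string\n"
    "letters = [random.choice(string.ascii_lowercase) for _ in range(256)]"
)

for stmt in ("letters[random.randint(0, 255)]",
             "random.choice(letters)",
             "letters[int(random.random() * 256)]"):
    t = timeit.timeit(stmt, setup=setup, number=1000000)
    print "%-40s %.3f s per million calls" % (stmt, t)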