From: garabik-news-2005-05 on 7 Aug 2010 02:36

dmtr <dchichkov(a)gmail.com> wrote:
>
> What I'm really looking for is a dict() that maps short unicode
> strings into tuples with integers. But just having a *compact* list
> container for unicode strings would help a lot (because I could add a
> __dict__ and go from it).
>

At this point, I'd suggest using one of the dbm modules and packing the integers with struct.pack into short strings. Depending on your usage pattern, there are marked performance differences between the dbhash, gdbm, and dbm implementations, so it would probably pay off to invest some time in benchmarking. If your data are write-once, then cdb has excellent performance (but a different API).

The file will usually be cached in RAM, so there is no need to worry about I/O bottlenecks... and if it is small enough, you can always put it on a ramdisk.

If your strings are long enough, you can reduce memory usage with zlib.compress (a dumb and suboptimal way of using compression, but easy and present in the standard library) - but always verify that the compressed strings are actually _shorter_ than the originals.

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
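
A rough sketch of the dbm plus struct.pack suggestion, written for Python 3, where the generic `dbm` module picks whichever backend is available (dbhash and gdbm as separate top-level modules are Python 2 names). The record layout of seven 32-bit integers, the helper names, and the one-byte raw/compressed marker are assumptions made for illustration only; they are not from the thread. The compression helper applies the "verify it is actually shorter" check before using the zlib output.

```python
import dbm
import os
import struct
import tempfile
import zlib

# Assumed record layout: seven little-endian 32-bit signed ints (28 bytes).
RECORD = struct.Struct("<7i")

# Hypothetical one-byte markers so a reader can tell raw from compressed data.
RAW, ZLIB = b"\x00", b"\x01"


def store(db, key, values):
    """Store a tuple of seven ints under a unicode key."""
    db[key.encode("utf-8")] = RECORD.pack(*values)


def load(db, key):
    """Read the tuple of seven ints back."""
    return RECORD.unpack(db[key.encode("utf-8")])


def maybe_compress(data):
    """Compress only when the result is really shorter than the input."""
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return ZLIB + packed
    return RAW + data


def restore(blob):
    """Undo maybe_compress()."""
    body = blob[1:]
    return zlib.decompress(body) if blob[:1] == ZLIB else body


def demo():
    """Write 1000 records to a temporary dbm file and read one back."""
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.open(os.path.join(tmp, "mapping"), "c")
        try:
            for i in range(1000):
                store(db, str(i), range(i, i + 7))
            return load(db, "42")
        finally:
            db.close()
```

Short strings usually come out longer after zlib.compress because of the fixed header and checksum, which is why `maybe_compress` falls back to the raw bytes in that case.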
From: garabik-news-2005-05 on 7 Aug 2010 10:04

dmtr <dchichkov(a)gmail.com> wrote:
> I guess with the actual dataset I'll be able to improve the memory
> usage a bit, with BioPython::trie. That would probably be enough
> optimization to continue working with some comfort. On this test code
> BioPython::trie gives a bit of improvement in terms of memory. Not
> much though...
>
> >>> d = dict()
> >>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
>

Using struct.pack('7i', i, i+1, i+2, i+3, i+4, i+5, i+6) instead of array.array gives a 20% improvement in time with (not surprisingly) the same memory usage.
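
A small sketch for repeating that comparison yourself, written for Python 3 (so `range` and `str` stand in for `xrange` and `unicode`). The function names and the reduced default size are assumptions for illustration; the timings depend on the interpreter and machine, so no figures are claimed here.

```python
import array
import struct
import timeit


def build_with_array(n):
    """Dict of byte-string keys to array.array('i') values, as in the quoted test."""
    d = {}
    for i in range(n):
        d[str(i).encode("utf-8")] = array.array(
            "i", (i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6))
    return d


def build_with_struct(n):
    """Same keys, but the seven ints are packed into one 28-byte string."""
    pack = struct.Struct("7i").pack  # native byte order, like array('i')
    d = {}
    for i in range(n):
        d[str(i).encode("utf-8")] = pack(i, i + 1, i + 2, i + 3, i + 4, i + 5, i + 6)
    return d


def compare(n=100000, repeat=3):
    """Return the best build time in seconds for (array, struct)."""
    t_array = min(timeit.repeat(lambda: build_with_array(n), number=1, repeat=repeat))
    t_struct = min(timeit.repeat(lambda: build_with_struct(n), number=1, repeat=repeat))
    return t_array, t_struct


if __name__ == "__main__":
    t_array, t_struct = compare()
    print("array.array: %.3fs  struct.pack: %.3fs" % (t_array, t_struct))
```

Both variants carry the same integer payload per entry; the packed string is just the raw bytes that the array would hold.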