From: Rod Pemberton on 8 Nov 2009 06:16

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:hd4n2l$3rv$1(a)news.albasani.net...
>
> my usual algo for string hashing is:
> hi=0;
> while(*s)hi=(hi*4093)+(*s++);
>

BGB, you've mentioned hashing quite a bit lately.  And, you've posted quite
a few algos...  You either have a lot of them or you keep changing them in
an attempt to improve them.

My brute-force tests of nearly 50 hashing algorithms and variations from
various sources on the Internet showed that there are two good algorithms
publicly available:

Austin Appleby's MurmurHash2
Bob Jenkins' hashlittle()

IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
performs very similarly to it in terms of collisions.

The Jenkins hash generates 10% or so more collisions than Appleby's.
However, both are very low compared to everything else I tested.  For the
most part, I didn't check to see which hashes failed the various Bob
Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
IIRC, the speed of one of them (Jenkins?) was very sensitive to different
optimization levels with the GCC compiler.  E.g., why is my code with this
hash fast with -O but slow with -O2?

Another thing to note is that Paul Hsieh's highly respected hashes didn't
perform well.  You should also note that quite a few respected
mathematicians in the list have come up with absolutely horrid hashes.
But one should remember that many of these were created decades ago and
weren't intended for the data we're throwing at them today.

By programmatic brute force, I've tested every 4-byte ASCII sequence and a
large percentage (I'd guess 60%...) or so of the 5-byte ASCII sequences
(up to a 2GB file - DOS limit).  I was planning to rewrite the test
programs so I could test the entire 5-byte ASCII sequence range, but I
didn't get around to it.  And, I concluded that what I had was probably
good enough.  Sorry, I didn't have need of hashes of binary sequences, the
preferred test.  However, I believe the results on the ASCII sequences,
since they are consecutive binary values over the text range, are
representative of the hashes' performance with regard to collisions.

All the other hashes fell into two categories: too slow (roughly 33%) to
be useful, or too many collisions (the remainder).  I've speed/collision
tested the following.  Sorry, these may have my naming for them instead of
their actual names...  Remember, the list includes the good hashes above,
but the rest are "bad".  The list has been sorted alphabetically.

additive hash
Adler checksum
Arash Partow hash
Arash Partow hash
Austin Appleby's MurmurHash2
Austin Appleby's MurmurHash2 aligned
Bob Jenkins' hash
Bob Jenkins' hashlittle
Bob Jenkins' oneAtATimeHash
Bob Jenkins' oneAtATimeHashPH
BP hash (unknown)
Bruno Preiss hash
CRC16
CRC32
CRC32PH (parallel hash)
DJ Bernstein
DJ Bernstein 5381 sum
DJ Bernstein 5381 xor
DJ Bernstein >>16 variation
DJ Bernstein modified
DJ Bernstein original
Donald Knuth hash
elf hash
Fletcher checksum
FNV (Fowler,Noll,Vo) Hash
GAWK hash
hash 65599
Julienne Walker hash
Justin Sobel hash
K&R >>16 variation
K&R hash
Modified FNV (Fowler,Noll,Vo)
Ozan Yigit's SDBM hash
Paul Hsieh SFHAsm (Super Fast Hash in asm)
Paul Hsieh SuperFastHash
Peter Kankowski x17
Peter Kankowski x17 unrolled
Peter Weinberger RD hash
Peter Weinberger hash
Ramakrishna hash
Robert Sedgwicks hash
rot hash
shift-add-xor
TCL HashString
xor hash

> although, granted, you can't freely copy code back and forth either, since
> the LGPL would remain in effect on the copied code.
>

Unh...

AIUI, under US law, a copyright applies to the *entire* work as a whole,
not little sub-pieces.  I.e., if you take a small piece from here and
insert it over there in a larger work, the copyright of the larger work
applies...  This can lead to obvious problems, since a larger sub-piece,
if unique enough, can also constitute an entire work under the law.
Interpretation is almost entirely up to the judge...

One of the irritants of the GNU FAQs on the GPL is that they say Public
Domain code (i.e., no copyright) can be inserted into copyrighted code.
But, under US and international laws, such a work was copyrighted *once*
automatically and therefore it can't be copyrighted again.  Inserting it
into GPL'd code - where the copyright applies to the *entire* work -
attempts to copyright the Public Domain code...  Of course, this only
applies if that piece of PD code is unique and large enough to be
considered an entire work.  But, I've seen quite a few situations where
they copied the entire PD work into GPL'd code.  Although I've conversed
with an IP attorney on this issue, I'm not an IP attorney, so...  My
recommendation is to treat every license (w/ copyrighted code) and Public
Domain code as unmixable entities.  I.e., don't mix PD and GPL.  Don't mix
BSD and LGPL.  Etc.  If you want or intend to do so, I'd check with the
OSI (Open Source Initiative) first, for their legal recommendations, not
the FSF's.  Some of the FSF's "interpretations" seem somewhat dubious.


Rod Pemberton

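[For readers who want to try this kind of brute-force collision comparison
themselves, here is a minimal sketch of a collision-counting harness.  It
is an illustration only, not Rod's actual test code: the printable-ASCII
key range (0x20-0x7E), the 2^20-slot table, and the bucket-collision metric
are assumptions.  The hash shown is the multiply-by-4093 string hash quoted
at the top of the thread.]

/* Sketch: count bucket collisions for every 4-byte printable-ASCII key.
 * Assumptions (not from the thread): table size 2^20, key range 0x20-0x7E. */
#include <stdio.h>
#include <stdlib.h>

#define TBITS 20
#define TSIZE (1u << TBITS)

static unsigned hash_str(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi * 4093) + (unsigned char)(*s++);
    return hi;
}

int main(void)
{
    unsigned char *seen = calloc(TSIZE, 1);   /* one flag per bucket */
    unsigned long long keys = 0, collisions = 0;
    char key[5] = {0};

    if (!seen) return 1;

    for (int a = 0x20; a <= 0x7E; a++)
     for (int b = 0x20; b <= 0x7E; b++)
      for (int c = 0x20; c <= 0x7E; c++)
       for (int d = 0x20; d <= 0x7E; d++) {
           key[0] = a; key[1] = b; key[2] = c; key[3] = d;
           unsigned idx = hash_str(key) & (TSIZE - 1);
           if (seen[idx]) collisions++; else seen[idx] = 1;
           keys++;
       }

    printf("%llu keys, %llu bucket collisions (table size %u)\n",
           keys, collisions, TSIZE);
    free(seen);
    return 0;
}
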
From: BGB / cr88192 on 8 Nov 2009 09:25

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hd69d5$umo$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hd4n2l$3rv$1(a)news.albasani.net...
>>
>> my usual algo for string hashing is:
>> hi=0;
>> while(*s)hi=(hi*4093)+(*s++);
>>
>
> BGB, you've mentioned hashing quite a bit lately.  And, you've posted
> quite a few algos...  You either have a lot of them or you keep changing
> them in an attempt to improve them.
>

I have a collection of them I use for different tasks, where each one does
slightly different things.  I have been using them for around 8 years now,
so I have collected a few variants.

the above sequence is mostly for calculating a hash value for strings, and
is best suited for hash tables in the 4096-1M range.

another variation is to hash integer data.  and, sometimes, I combine
pre-existing hashes (this is used in my object system).

similarly, the tables themselves can be used in different ways:

as a single-stop cache (I call these "caching hashes", since they only
cache things, rather than working as a means of "permanent" storage, and
in general one can NULL all the entries in a caching hash with little or
no ill effect);

as a collection of linked lists, which is usually a variant of a "storage
hash";

with chains inside the hash, which is where I often use pseudo-random
stepping.

....

> My brute-force tests of nearly 50 hashing algorithms and variations from
> various sources on the Internet showed that there are two good
> algorithms publicly available:
>
> Austin Appleby's MurmurHash2
> Bob Jenkins' hashlittle()
>
> IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
> performs very similarly to it in terms of collisions.
>

yes, ok.

I have not looked much at others' algos, as I have my own variants and
experience as to which works best in which sort of task.

> The Jenkins hash generates 10% or so more collisions than Appleby's.
> However, both are very low compared to everything else I tested.  For
> the most part, I didn't check to see which hashes failed the various Bob
> Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
> IIRC, the speed of one of them (Jenkins?) was very sensitive to
> different optimization levels with the GCC compiler.  E.g., why is my
> code with this hash fast with -O but slow with -O2?
>
> Another thing to note is that Paul Hsieh's highly respected hashes
> didn't perform well.  You should also note that quite a few respected
> mathematicians in the list have come up with absolutely horrid hashes.
> But one should remember that many of these were created decades ago and
> weren't intended for the data we're throwing at them today.
>

yep.

I think most of my hash functions are derived from one of Donald Knuth's
PRNGs (the original idea was derived from the one found in Newlib).

experimentally, I had found it went both faster and had better collision
behavior than the more common variants I had used prior, which had usually
involved shifts and xors.  for example:

hi=0;
while(*s)hi=(hi<<3)^(hi>>7)^(*s++);

and similar...  which got 'pwned' by my discovery of the approach I have
been using since.

> By programmatic brute force, I've tested every 4-byte ASCII sequence and
> a large percentage (I'd guess 60%...) or so of the 5-byte ASCII
> sequences (up to a 2GB file - DOS limit).  I was planning to rewrite the
> test programs so I could test the entire 5-byte ASCII sequence range,
> but I didn't get around to it.
> And, I concluded that what I had was probably good enough.  Sorry, I
> didn't have need of hashes of binary sequences, the preferred test.
> However, I believe the results on the ASCII sequences, since they are
> consecutive binary values over the text range, are representative of the
> hashes' performance with regard to collisions.
>

yeah...

mine I had refined before through the use of random collections of longer
strings, and sometimes random data.  others had used random chunks from
strings which were "sliced and diced" for testing.

....

> All the other hashes fell into two categories: too slow (roughly 33%) to
> be useful, or too many collisions (the remainder).  I've speed/collision
> tested the following.  Sorry, these may have my naming for them instead
> of their actual names...  Remember, the list includes the good hashes
> above, but the rest are "bad".  The list has been sorted alphabetically.
>
> additive hash
> Adler checksum
> Arash Partow hash
> Arash Partow hash
> Austin Appleby's MurmurHash2
> Austin Appleby's MurmurHash2 aligned
> Bob Jenkins' hash
> Bob Jenkins' hashlittle
> Bob Jenkins' oneAtATimeHash
> Bob Jenkins' oneAtATimeHashPH
> BP hash (unknown)
> Bruno Preiss hash
> CRC16
> CRC32
> CRC32PH (parallel hash)
> DJ Bernstein
> DJ Bernstein 5381 sum
> DJ Bernstein 5381 xor
> DJ Bernstein >>16 variation
> DJ Bernstein modified
> DJ Bernstein original
> Donald Knuth hash
> elf hash
> Fletcher checksum
> FNV (Fowler,Noll,Vo) Hash
> GAWK hash
> hash 65599
> Julienne Walker hash
> Justin Sobel hash
> K&R >>16 variation
> K&R hash
> Modified FNV (Fowler,Noll,Vo)
> Ozan Yigit's SDBM hash
> Paul Hsieh SFHAsm (Super Fast Hash in asm)
> Paul Hsieh SuperFastHash
> Peter Kankowski x17
> Peter Kankowski x17 unrolled
> Peter Weinberger RD hash
> Peter Weinberger hash
> Ramakrishna hash
> Robert Sedgwicks hash
> rot hash
> shift-add-xor
> TCL HashString
> xor hash
>

ok.

>> although, granted, you can't freely copy code back and forth either,
>> since the LGPL would remain in effect on the copied code.
>>
>
> Unh...
>
> AIUI, under US law, a copyright applies to the *entire* work as a whole,
> not little sub-pieces.  I.e., if you take a small piece from here and
> insert it over there in a larger work, the copyright of the larger work
> applies...  This can lead to obvious problems, since a larger sub-piece,
> if unique enough, can also constitute an entire work under the law.
> Interpretation is almost entirely up to the judge...
>
> One of the irritants of the GNU FAQs on the GPL is that they say Public
> Domain code (i.e., no copyright) can be inserted into copyrighted code.
> But, under US and international laws, such a work was copyrighted *once*
> automatically and therefore it can't be copyrighted again.  Inserting it
> into GPL'd code - where the copyright applies to the *entire* work -
> attempts to copyright the Public Domain code...  Of course, this only
> applies if that piece of PD code is unique and large enough to be
> considered an entire work.  But, I've seen quite a few situations where
> they copied the entire PD work into GPL'd code.  Although I've conversed
> with an IP attorney on this issue, I'm not an IP attorney, so...  My
> recommendation is to treat every license (w/ copyrighted code) and
> Public Domain code as unmixable entities.  I.e., don't mix PD and GPL.
> Don't mix BSD and LGPL.  Etc.  If you want or intend to do so, I'd check
> with the OSI (Open Source Initiative) first, for their legal
> recommendations, not the FSF's.
> Some of the FSF's "interpretations" seem somewhat dubious.
>

PD is "free for whatever use", FWIW, and using PD code in a non-PD project
does not effectively change its status for the pre-existing code.

as for PD code embedded in a derived work, this is less certain (say, if
one wants to copy it back out as PD).  they may have to instead resort to
the originals.

there are ways to do it, but the key is not to "mix" the code, as in,
place code with one license inside source files with another license.

so, in general, it is my practice to divide everything up into independent
modules, and then let each module have an overall license...

then the issue is, what of the final linked result?...

well, with GPL, the whole thing goes under GPL (this is the "sticky"
property so many dislike about the GPL).  with LGPL, it does not.  with
BSD-style licenses, or PD, the combined result can be distributed under
"whatever" license, but this does not generally change the code itself.

or such...

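[To make BGB's two snippets concrete, here is a minimal, self-contained
sketch of both string hashes as complete functions, plus the kind of
"caching hash" he describes (one entry per slot, safe to wipe at any
time).  The function names, table size, and struct layout are illustrative
assumptions, not BGB's actual code.]

#include <stddef.h>
#include <string.h>

/* the multiply-by-4093 string hash BGB quotes as his usual algo */
static unsigned hash_mul4093(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi * 4093) + (unsigned char)(*s++);
    return hi;
}

/* the older shift/xor variant he says it replaced */
static unsigned hash_shiftxor(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi << 3) ^ (hi >> 7) ^ (unsigned char)(*s++);
    return hi;
}

/* a "caching hash": one entry per slot, any entry may be overwritten or
 * NULLed with no ill effect (illustrative layout, not BGB's) */
#define CACHE_BITS 12                  /* 4096 slots, in his 4096-1M range */
#define CACHE_SIZE (1u << CACHE_BITS)

struct cache_ent { const char *key; void *val; };
static struct cache_ent cache[CACHE_SIZE];

void *cache_get(const char *key)
{
    struct cache_ent *e = &cache[hash_mul4093(key) & (CACHE_SIZE - 1)];
    return (e->key && !strcmp(e->key, key)) ? e->val : NULL;
}

void cache_put(const char *key, void *val)
{
    struct cache_ent *e = &cache[hash_mul4093(key) & (CACHE_SIZE - 1)];
    e->key = key;                      /* overwrite whatever was there */
    e->val = val;
}
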
From: Mike Austin on 8 Nov 2009 15:19

Rod Pemberton wrote:
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hd4n2l$3rv$1(a)news.albasani.net...
>> my usual algo for string hashing is:
>> hi=0;
>> while(*s)hi=(hi*4093)+(*s++);
>>
>
> BGB, you've mentioned hashing quite a bit lately.  And, you've posted
> quite a few algos...  You either have a lot of them or you keep changing
> them in an attempt to improve them.
>
> My brute-force tests of nearly 50 hashing algorithms and variations from
> various sources on the Internet showed that there are two good
> algorithms publicly available:
>
> Austin Appleby's MurmurHash2
> Bob Jenkins' hashlittle()

Wow, MurmurHash2 is amazingly simple.  I'm not an expert in hashing
algorithms - do you have any recommendations for using it to map pointers
to array indexes?

std::map<Symbol*, Value> _slots;

where Symbol* is 1,000 to 5,000 pointers mapped to an average of 30 Values
per object.

> IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
> performs very similarly to it in terms of collisions.
>
> The Jenkins hash generates 10% or so more collisions than Appleby's.
> However, both are very low compared to everything else I tested.  For
> the most part, I didn't check to see which hashes failed the various Bob
> Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
> IIRC, the speed of one of them (Jenkins?) was very sensitive to
> different optimization levels with the GCC compiler.  E.g., why is my
> code with this hash fast with -O but slow with -O2?
>
> Another thing to note is that Paul Hsieh's highly respected hashes
> didn't perform well.  You should also note that quite a few respected
> mathematicians in the list have come up with absolutely horrid hashes.
> But one should remember that many of these were created decades ago and
> weren't intended for the data we're throwing at them today.
>
> By programmatic brute force, I've tested every 4-byte ASCII sequence and
> a large percentage (I'd guess 60%...) or so of the 5-byte ASCII
> sequences (up to a 2GB file - DOS limit).  I was planning to rewrite the
> test programs so I could test the entire 5-byte ASCII sequence range,
> but I didn't get around to it.  And, I concluded that what I had was
> probably good enough.  Sorry, I didn't have need of hashes of binary
> sequences, the preferred test.  However, I believe the results on the
> ASCII sequences, since they are consecutive binary values over the text
> range, are representative of the hashes' performance with regard to
> collisions.
>
> All the other hashes fell into two categories: too slow (roughly 33%) to
> be useful, or too many collisions (the remainder).  I've speed/collision
> tested the following.  Sorry, these may have my naming for them instead
> of their actual names...  Remember, the list includes the good hashes
> above, but the rest are "bad".  The list has been sorted alphabetically.
>
> additive hash
> Adler checksum
> Arash Partow hash
> Arash Partow hash
> Austin Appleby's MurmurHash2
> Austin Appleby's MurmurHash2 aligned
> Bob Jenkins' hash
> Bob Jenkins' hashlittle
> Bob Jenkins' oneAtATimeHash
> Bob Jenkins' oneAtATimeHashPH
> BP hash (unknown)
> Bruno Preiss hash
> CRC16
> CRC32
> CRC32PH (parallel hash)
> DJ Bernstein
> DJ Bernstein 5381 sum
> DJ Bernstein 5381 xor
> DJ Bernstein >>16 variation
> DJ Bernstein modified
> DJ Bernstein original
> Donald Knuth hash
> elf hash
> Fletcher checksum
> FNV (Fowler,Noll,Vo) Hash
> GAWK hash
> hash 65599
> Julienne Walker hash
> Justin Sobel hash
> K&R >>16 variation
> K&R hash
> Modified FNV (Fowler,Noll,Vo)
> Ozan Yigit's SDBM hash
> Paul Hsieh SFHAsm (Super Fast Hash in asm)
> Paul Hsieh SuperFastHash
> Peter Kankowski x17
> Peter Kankowski x17 unrolled
> Peter Weinberger RD hash
> Peter Weinberger hash
> Ramakrishna hash
> Robert Sedgwicks hash
> rot hash
> shift-add-xor
> TCL HashString
> xor hash

Wow, TCL hashing is at the bottom?  Isn't the entire language reliant on
string hashing?  :)

>> although, granted, you can't freely copy code back and forth either,
>> since the LGPL would remain in effect on the copied code.
>>
>
> Unh...
>
> AIUI, under US law, a copyright applies to the *entire* work as a whole,
> not little sub-pieces.  I.e., if you take a small piece from here and
> insert it over there in a larger work, the copyright of the larger work
> applies...  This can lead to obvious problems, since a larger sub-piece,
> if unique enough, can also constitute an entire work under the law.
> Interpretation is almost entirely up to the judge...
>
> One of the irritants of the GNU FAQs on the GPL is that they say Public
> Domain code (i.e., no copyright) can be inserted into copyrighted code.
> But, under US and international laws, such a work was copyrighted *once*
> automatically and therefore it can't be copyrighted again.  Inserting it
> into GPL'd code - where the copyright applies to the *entire* work -
> attempts to copyright the Public Domain code...  Of course, this only
> applies if that piece of PD code is unique and large enough to be
> considered an entire work.  But, I've seen quite a few situations where
> they copied the entire PD work into GPL'd code.  Although I've conversed
> with an IP attorney on this issue, I'm not an IP attorney, so...  My
> recommendation is to treat every license (w/ copyrighted code) and
> Public Domain code as unmixable entities.  I.e., don't mix PD and GPL.
> Don't mix BSD and LGPL.  Etc.  If you want or intend to do so, I'd check
> with the OSI (Open Source Initiative) first, for their legal
> recommendations, not the FSF's.  Some of the FSF's "interpretations"
> seem somewhat dubious.
>
>
> Rod Pemberton
From: Rod Pemberton on 9 Nov 2009 01:38

"Mike Austin" <mike(a)mike-nospam-austin.com> wrote in message
news:Jb6dnaOW6s5DumrXnZ2dnUVZ_gWdnZ2d(a)giganews.com...
> Rod Pemberton wrote:
> >
> > My brute-force tests of nearly 50 hashing algorithms and variations
> > from various sources on the Internet showed that there are two good
> > algorithms publicly available:
> >
> > Austin Appleby's MurmurHash2
> > Bob Jenkins' hashlittle()
>
> Wow, MurmurHash2 is amazingly simple.  I'm not an expert in hashing
> algorithms - do you have any recommendations for using it to map
> pointers to array indexes?
>

Sorry, the extent of my use was my personal testing.

> [...]
>
> Wow, TCL hashing is at the bottom?

No.  As stated, the list was sorted alphabetically for the post.  I.e.,
the ordering contains no useful information other than the names of the
authors.  This may be useful in locating some of them.

I can't tell you the TCL hash's overall rank in terms of collisions -
since I don't know.  I performed a speed test followed by a series of
smaller collision tests to eliminate poorer algorithms.  I didn't save
results for these non-performers since my goal was the results of the
larger brute-force test on good algorithms.

I can tell you TCL hashing's rank in terms of time: 13th out of 45.
MurmurHash2 ranked 7th.  hashlittle() ranked 2nd.  (IMO, the 1st place
result should be discarded...)

So, if you need a little more speed: Jenkins.  If you need fewer
collisions: Appleby.


Rod Pemberton

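[On Mike's question of mapping Symbol* pointers to array indexes, a common
approach - not something either poster prescribes - is to run the pointer
bits through an integer mixing function and mask down to a power-of-two
table size.  The sketch below uses the 64-bit finalizer from Austin
Appleby's later MurmurHash3 as the mixer; the names and the 256-slot table
size are illustrative assumptions.]

#include <stdint.h>

#define SLOT_BITS 8                    /* 256 slots; pick for the live key count */
#define SLOT_MASK ((1u << SLOT_BITS) - 1)

/* MurmurHash3's fmix64: spreads entropy across all 64 bits */
static uint64_t mix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* map a pointer key (e.g. a Symbol*) to a slot index */
unsigned symbol_slot(const void *sym)
{
    /* pointers are usually 8- or 16-byte aligned, so the low bits carry
     * little information; mixing pulls the high bits down before masking */
    return (unsigned)(mix64((uint64_t)(uintptr_t)sym) & SLOT_MASK);
}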