From: Rod Pemberton on 8 Nov 2009 06:16

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:hd4n2l$3rv$1(a)news.albasani.net...
>
> my usual algo for string hashing is:
> hi=0;
> while(*s)hi=(hi*4093)+(*s++);
>

BGB, you've mentioned hashing quite a bit lately.  And, you've posted quite
a few algos...  You either have a lot of them or you keep changing them in
an attempt to improve them.

My brute-force tests of nearly 50 hashing algorithms and variations from
various sources on the Internet showed that there are two good algorithms
publicly available:

Austin Appleby's MurmurHash2
Bob Jenkins' hashlittle()

IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
performs very similarly to it in terms of collisions.

The Jenkins hash generates 10% or so more collisions than Appleby's.
However, both are very low compared to everything else I tested.  For the
most part, I didn't check to see which hashes failed the various Bob
Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
IIRC, the speed of one of them (Jenkins?) was very sensitive to different
optimization levels with the GCC compiler.  E.g., why is my code with this
hash fast with -O but slow with -O2?

Another thing to note is that Paul Hsieh's highly respected hashes didn't
perform well.  You should also note that quite a few respected
mathematicians in the list have come up with absolutely horrid hashes.
But one should remember that many of these were created decades ago and
weren't intended for the data we're throwing at them today.

By programmatic brute force, I've tested every 4-byte ASCII sequence and a
large percentage (I'd guess 60%...) or so of the 5-byte ASCII sequences
(up to a 2GB file - DOS limit).  I was planning to rewrite the test
programs so I could test the entire 5-byte ASCII sequence range, but I
didn't get around to it.  And, I concluded that what I had was probably
good enough.  Sorry, I didn't have need of hashes of binary sequences, the
preferred test.  However, I believe the results on the ASCII sequences,
since they are consecutive binary values over the text range, are
representative of the hashes' performance with regard to collisions.

All the other hashes fell into two categories: too slow (roughly 33%) to
be useful, or too many collisions (the remainder).  I've speed/collision
tested the following.  Sorry, these may have my naming for them instead of
their actual names...  Remember, the list includes the good hashes above,
but the rest are "bad".  The list has been sorted alphabetically.

additive hash
Adler checksum
Arash Partow hash
Arash Partow hash
Austin Appleby's MurmurHash2
Austin Appleby's MurmurHash2 aligned
Bob Jenkins' hash
Bob Jenkins' hashlittle
Bob Jenkins' oneAtATimeHash
Bob Jenkins' oneAtATimeHashPH
BP hash (unknown)
Bruno Preiss hash
CRC16
CRC32
CRC32PH (parallel hash)
DJ Bernstein
DJ Bernstein 5381 sum
DJ Bernstein 5381 xor
DJ Bernstein >>16 variation
DJ Bernstein modified
DJ Bernstein original
Donald Knuth hash
elf hash
Fletcher checksum
FNV (Fowler,Noll,Vo) Hash
GAWK hash
hash 65599
Julienne Walker hash
Justin Sobel hash
K&R >>16 variation
K&R hash
Modified FNV (Fowler,Noll,Vo)
Ozan Yigit's SDBM hash
Paul Hsieh SFHAsm (Super Fast Hash in asm)
Paul Hsieh SuperFastHash
Peter Kankowski x17
Peter Kankowski x17 unrolled
Peter Weinberger RD hash
Peter Weinberger hash
Ramakrishna hash
Robert Sedgwicks hash
rot hash
shift-add-xor
TCL HashString
xor hash

> although, granted, you can't freely copy code back and forth either, since
> the LGPL would remain in effect on the copied code.
>

Unh...

AIUI, under US law, a copyright applies to the *entire* work as a whole,
not little sub-pieces.  I.e., if you take a small piece from here and
insert it over there in a larger work, the copyright of the larger work
applies...  This can lead to obvious problems, since a larger sub-piece,
if unique enough, can also constitute an entire work under the law.
Interpretation is almost entirely up to the judge...

One of the irritants of the GNU FAQs on the GPL is that they say Public
Domain code (i.e., no copyright) can be inserted into copyrighted code.
But, under US and international laws, such a work was copyrighted *once*
automatically and therefore it can't be copyrighted again.  Inserting it
into GPL'd code - where the copyright applies to the *entire* work -
attempts to copyright the Public Domain code...  Of course, this only
applies if that piece of PD code is unique and large enough to be
considered an entire work.  But, I've seen quite a few situations where
they copied the entire PD work into GPL'd code.  Although I've conversed
with an IP attorney on this issue, I'm not an IP attorney, so...  My
recommendation is to treat every license (w/ copyrighted code) and Public
Domain code as unmixable entities.  I.e., don't mix PD and GPL.  Don't mix
BSD and LGPL.  Etc.  If you want or intend to do so, I'd check with the
OSI (Open Source Initiative) first, for their legal recommendations, not
the FSF's.  Some of the FSF's "interpretations" seem somewhat dubious.


Rod Pemberton

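[For readers who want to try this kind of brute-force collision comparison
themselves, here is a minimal sketch of a collision-counting harness.  It
is an illustration only, not Rod's actual test code: the printable-ASCII
key range (0x20-0x7E), the 2^20-slot table, and the bucket-collision metric
are assumptions.  The hash shown is the multiply-by-4093 string hash quoted
at the top of the thread.]

/* Sketch: count bucket collisions for every 4-byte printable-ASCII key.
 * Assumptions (not from the thread): table size 2^20, key range 0x20-0x7E. */
#include <stdio.h>
#include <stdlib.h>

#define TBITS 20
#define TSIZE (1u << TBITS)

static unsigned hash_str(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi * 4093) + (unsigned char)(*s++);
    return hi;
}

int main(void)
{
    unsigned char *seen = calloc(TSIZE, 1);   /* one flag per bucket */
    unsigned long long keys = 0, collisions = 0;
    char key[5] = {0};

    if (!seen) return 1;

    for (int a = 0x20; a <= 0x7E; a++)
     for (int b = 0x20; b <= 0x7E; b++)
      for (int c = 0x20; c <= 0x7E; c++)
       for (int d = 0x20; d <= 0x7E; d++) {
           key[0] = a; key[1] = b; key[2] = c; key[3] = d;
           unsigned idx = hash_str(key) & (TSIZE - 1);
           if (seen[idx]) collisions++; else seen[idx] = 1;
           keys++;
       }

    printf("%llu keys, %llu bucket collisions (table size %u)\n",
           keys, collisions, TSIZE);
    free(seen);
    return 0;
}
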
From: BGB / cr88192 on 8 Nov 2009 09:25

"Rod Pemberton" <do_not_have(a)nohavenot.cmm> wrote in message
news:hd69d5$umo$1(a)aioe.org...
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hd4n2l$3rv$1(a)news.albasani.net...
>>
>> my usual algo for string hashing is:
>> hi=0;
>> while(*s)hi=(hi*4093)+(*s++);
>>
>
> BGB, you've mentioned hashing quite a bit lately.  And, you've posted
> quite a few algos...  You either have a lot of them or you keep changing
> them in an attempt to improve them.
>

I have a collection of them I use for different tasks, where each one does
slightly different things.  I have been using them for around 8 years now,
so I have collected a few variants.

the above sequence is mostly for calculating a hash value for strings, and
is best suited for hash tables in the 4096-1M range.

another variation is to hash integer data.  and, sometimes, I combine
pre-existing hashes (this is used in my object system).

similarly, the tables themselves can be used in different ways:

as a single-stop cache (I call these "caching hashes", since they only
cache things, rather than working as a means of "permanent" storage, and
in general one can NULL all the entries in a caching hash with little or
no ill effect);

as a collection of linked lists, which is usually a variant of a "storage
hash";

with chains inside the hash, which is where I often use pseudo-random
stepping.

....

> My brute-force tests of nearly 50 hashing algorithms and variations from
> various sources on the Internet showed that there are two good
> algorithms publicly available:
>
> Austin Appleby's MurmurHash2
> Bob Jenkins' hashlittle()
>
> IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
> performs very similarly to it in terms of collisions.
>

yes, ok.

I have not looked much at others' algos, as I have my own variants and
experience as to which works best in which sort of task.

> The Jenkins hash generates 10% or so more collisions than Appleby's.
> However, both are very low compared to everything else I tested.  For
> the most part, I didn't check to see which hashes failed the various Bob
> Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
> IIRC, the speed of one of them (Jenkins?) was very sensitive to
> different optimization levels with the GCC compiler.  E.g., why is my
> code with this hash fast with -O but slow with -O2?
>
> Another thing to note is that Paul Hsieh's highly respected hashes
> didn't perform well.  You should also note that quite a few respected
> mathematicians in the list have come up with absolutely horrid hashes.
> But one should remember that many of these were created decades ago and
> weren't intended for the data we're throwing at them today.
>

yep.

I think most of my hash functions are derived from one of Donald Knuth's
PRNGs (the original idea was derived from the one found in Newlib).

experimentally, I had found it went both faster and had better collision
behavior than the more common variants I had used prior, which had usually
involved shifts and xors.  for example:

hi=0;
while(*s)hi=(hi<<3)^(hi>>7)^(*s++);

and similar...  which got 'pwned' by my discovery of the approach I have
been using since.

> By programmatic brute force, I've tested every 4-byte ASCII sequence and
> a large percentage (I'd guess 60%...) or so of the 5-byte ASCII
> sequences (up to a 2GB file - DOS limit).  I was planning to rewrite the
> test programs so I could test the entire 5-byte ASCII sequence range,
> but I didn't get around to it.
> And, I concluded that what I had was probably good enough.  Sorry, I
> didn't have need of hashes of binary sequences, the preferred test.
> However, I believe the results on the ASCII sequences, since they are
> consecutive binary values over the text range, are representative of the
> hashes' performance with regard to collisions.
>

yeah...

mine I had refined before through the use of random collections of longer
strings, and sometimes random data.  others had used random chunks from
strings which were "sliced and diced" for testing.

....

> All the other hashes fell into two categories: too slow (roughly 33%) to
> be useful, or too many collisions (the remainder).  I've speed/collision
> tested the following.  Sorry, these may have my naming for them instead
> of their actual names...  Remember, the list includes the good hashes
> above, but the rest are "bad".  The list has been sorted alphabetically.
>
> additive hash
> Adler checksum
> Arash Partow hash
> Arash Partow hash
> Austin Appleby's MurmurHash2
> Austin Appleby's MurmurHash2 aligned
> Bob Jenkins' hash
> Bob Jenkins' hashlittle
> Bob Jenkins' oneAtATimeHash
> Bob Jenkins' oneAtATimeHashPH
> BP hash (unknown)
> Bruno Preiss hash
> CRC16
> CRC32
> CRC32PH (parallel hash)
> DJ Bernstein
> DJ Bernstein 5381 sum
> DJ Bernstein 5381 xor
> DJ Bernstein >>16 variation
> DJ Bernstein modified
> DJ Bernstein original
> Donald Knuth hash
> elf hash
> Fletcher checksum
> FNV (Fowler,Noll,Vo) Hash
> GAWK hash
> hash 65599
> Julienne Walker hash
> Justin Sobel hash
> K&R >>16 variation
> K&R hash
> Modified FNV (Fowler,Noll,Vo)
> Ozan Yigit's SDBM hash
> Paul Hsieh SFHAsm (Super Fast Hash in asm)
> Paul Hsieh SuperFastHash
> Peter Kankowski x17
> Peter Kankowski x17 unrolled
> Peter Weinberger RD hash
> Peter Weinberger hash
> Ramakrishna hash
> Robert Sedgwicks hash
> rot hash
> shift-add-xor
> TCL HashString
> xor hash
>

ok.

>> although, granted, you can't freely copy code back and forth either,
>> since the LGPL would remain in effect on the copied code.
>>
>
> Unh...
>
> AIUI, under US law, a copyright applies to the *entire* work as a whole,
> not little sub-pieces.  I.e., if you take a small piece from here and
> insert it over there in a larger work, the copyright of the larger work
> applies...  This can lead to obvious problems, since a larger sub-piece,
> if unique enough, can also constitute an entire work under the law.
> Interpretation is almost entirely up to the judge...
>
> One of the irritants of the GNU FAQs on the GPL is that they say Public
> Domain code (i.e., no copyright) can be inserted into copyrighted code.
> But, under US and international laws, such a work was copyrighted *once*
> automatically and therefore it can't be copyrighted again.  Inserting it
> into GPL'd code - where the copyright applies to the *entire* work -
> attempts to copyright the Public Domain code...  Of course, this only
> applies if that piece of PD code is unique and large enough to be
> considered an entire work.  But, I've seen quite a few situations where
> they copied the entire PD work into GPL'd code.  Although I've conversed
> with an IP attorney on this issue, I'm not an IP attorney, so...  My
> recommendation is to treat every license (w/ copyrighted code) and
> Public Domain code as unmixable entities.  I.e., don't mix PD and GPL.
> Don't mix BSD and LGPL.  Etc.  If you want or intend to do so, I'd check
> with the OSI (Open Source Initiative) first, for their legal
> recommendations, not the FSF's.
> Some of the FSF's "interpretations" seem somewhat dubious.
>

PD is "free for whatever use", FWIW, and using PD code in a non-PD project
does not effectively change its status for the pre-existing code.

as for PD code embedded in a derived work, this is less certain (say, if
one wants to copy it back out as PD).  they may have to instead resort to
the originals.

there are ways to do it, but the key is not to "mix" the code, as in,
place code with one license inside source files with another license.

so, in general, it is my practice to divide everything up into independent
modules, and then let each module have an overall license...

then the issue is, what of the final linked result?...

well, with GPL, the whole thing goes under GPL (this is the "sticky"
property so many dislike about the GPL).  with LGPL, it does not.  with
BSD-style licenses, or PD, the combined result can be distributed under
"whatever" license, but this does not generally change the code itself.

or such...

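[To make BGB's two snippets concrete, here is a minimal, self-contained
sketch of both string hashes as complete functions, plus the kind of
"caching hash" he describes (one entry per slot, safe to wipe at any
time).  The function names, table size, and struct layout are illustrative
assumptions, not BGB's actual code.]

#include <stddef.h>
#include <string.h>

/* the multiply-by-4093 string hash BGB quotes as his usual algo */
static unsigned hash_mul4093(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi * 4093) + (unsigned char)(*s++);
    return hi;
}

/* the older shift/xor variant he says it replaced */
static unsigned hash_shiftxor(const char *s)
{
    unsigned hi = 0;
    while (*s)
        hi = (hi << 3) ^ (hi >> 7) ^ (unsigned char)(*s++);
    return hi;
}

/* a "caching hash": one entry per slot, any entry may be overwritten or
 * NULLed with no ill effect (illustrative layout, not BGB's) */
#define CACHE_BITS 12                  /* 4096 slots, in his 4096-1M range */
#define CACHE_SIZE (1u << CACHE_BITS)

struct cache_ent { const char *key; void *val; };
static struct cache_ent cache[CACHE_SIZE];

void *cache_get(const char *key)
{
    struct cache_ent *e = &cache[hash_mul4093(key) & (CACHE_SIZE - 1)];
    return (e->key && !strcmp(e->key, key)) ? e->val : NULL;
}

void cache_put(const char *key, void *val)
{
    struct cache_ent *e = &cache[hash_mul4093(key) & (CACHE_SIZE - 1)];
    e->key = key;                      /* overwrite whatever was there */
    e->val = val;
}
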
From: Mike Austin on 8 Nov 2009 15:19

Rod Pemberton wrote:
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:hd4n2l$3rv$1(a)news.albasani.net...
>> my usual algo for string hashing is:
>> hi=0;
>> while(*s)hi=(hi*4093)+(*s++);
>>
>
> BGB, you've mentioned hashing quite a bit lately.  And, you've posted
> quite a few algos...  You either have a lot of them or you keep changing
> them in an attempt to improve them.
>
> My brute-force tests of nearly 50 hashing algorithms and variations from
> various sources on the Internet showed that there are two good
> algorithms publicly available:
>
> Austin Appleby's MurmurHash2
> Bob Jenkins' hashlittle()

Wow, MurmurHash2 is amazingly simple.  I'm not an expert in hashing
algorithms - do you have any recommendations for using it to map pointers
to array indexes?

std::map<Symbol*, Value> _slots;

where Symbol* is 1,000 to 5,000 pointers mapped to an average of 30 Values
per object.

> IIRC, Jenkins' hashlittle() is from lookup3.c.  His hash from lookup2
> performs very similarly to it in terms of collisions.
>
> The Jenkins hash generates 10% or so more collisions than Appleby's.
> However, both are very low compared to everything else I tested.  For
> the most part, I didn't check to see which hashes failed the various Bob
> Jenkins hash tests.  (Most fail.)  Appleby's MurmurHash2 passes them.
> IIRC, the speed of one of them (Jenkins?) was very sensitive to
> different optimization levels with the GCC compiler.  E.g., why is my
> code with this hash fast with -O but slow with -O2?
>
> Another thing to note is that Paul Hsieh's highly respected hashes
> didn't perform well.  You should also note that quite a few respected
> mathematicians in the list have come up with absolutely horrid hashes.
> But one should remember that many of these were created decades ago and
> weren't intended for the data we're throwing at them today.
>
> By programmatic brute force, I've tested every 4-byte ASCII sequence and
> a large percentage (I'd guess 60%...) or so of the 5-byte ASCII
> sequences (up to a 2GB file - DOS limit).  I was planning to rewrite the
> test programs so I could test the entire 5-byte ASCII sequence range,
> but I didn't get around to it.  And, I concluded that what I had was
> probably good enough.  Sorry, I didn't have need of hashes of binary
> sequences, the preferred test.  However, I believe the results on the
> ASCII sequences, since they are consecutive binary values over the text
> range, are representative of the hashes' performance with regard to
> collisions.
>
> All the other hashes fell into two categories: too slow (roughly 33%) to
> be useful, or too many collisions (the remainder).  I've speed/collision
> tested the following.  Sorry, these may have my naming for them instead
> of their actual names...  Remember, the list includes the good hashes
> above, but the rest are "bad".  The list has been sorted alphabetically.
>
> additive hash
> Adler checksum
> Arash Partow hash
> Arash Partow hash
> Austin Appleby's MurmurHash2
> Austin Appleby's MurmurHash2 aligned
> Bob Jenkins' hash
> Bob Jenkins' hashlittle
> Bob Jenkins' oneAtATimeHash
> Bob Jenkins' oneAtATimeHashPH
> BP hash (unknown)
> Bruno Preiss hash
> CRC16
> CRC32
> CRC32PH (parallel hash)
> DJ Bernstein
> DJ Bernstein 5381 sum
> DJ Bernstein 5381 xor
> DJ Bernstein >>16 variation
> DJ Bernstein modified
> DJ Bernstein original
> Donald Knuth hash
> elf hash
> Fletcher checksum
> FNV (Fowler,Noll,Vo) Hash
> GAWK hash
> hash 65599
> Julienne Walker hash
> Justin Sobel hash
> K&R >>16 variation
> K&R hash
> Modified FNV (Fowler,Noll,Vo)
> Ozan Yigit's SDBM hash
> Paul Hsieh SFHAsm (Super Fast Hash in asm)
> Paul Hsieh SuperFastHash
> Peter Kankowski x17
> Peter Kankowski x17 unrolled
> Peter Weinberger RD hash
> Peter Weinberger hash
> Ramakrishna hash
> Robert Sedgwicks hash
> rot hash
> shift-add-xor
> TCL HashString
> xor hash

Wow, TCL hashing is at the bottom?  Isn't the entire language reliant on
string hashing?  :)

>> although, granted, you can't freely copy code back and forth either,
>> since the LGPL would remain in effect on the copied code.
>>
>
> Unh...
>
> AIUI, under US law, a copyright applies to the *entire* work as a whole,
> not little sub-pieces.  I.e., if you take a small piece from here and
> insert it over there in a larger work, the copyright of the larger work
> applies...  This can lead to obvious problems, since a larger sub-piece,
> if unique enough, can also constitute an entire work under the law.
> Interpretation is almost entirely up to the judge...
>
> One of the irritants of the GNU FAQs on the GPL is that they say Public
> Domain code (i.e., no copyright) can be inserted into copyrighted code.
> But, under US and international laws, such a work was copyrighted *once*
> automatically and therefore it can't be copyrighted again.  Inserting it
> into GPL'd code - where the copyright applies to the *entire* work -
> attempts to copyright the Public Domain code...  Of course, this only
> applies if that piece of PD code is unique and large enough to be
> considered an entire work.  But, I've seen quite a few situations where
> they copied the entire PD work into GPL'd code.  Although I've conversed
> with an IP attorney on this issue, I'm not an IP attorney, so...  My
> recommendation is to treat every license (w/ copyrighted code) and
> Public Domain code as unmixable entities.  I.e., don't mix PD and GPL.
> Don't mix BSD and LGPL.  Etc.  If you want or intend to do so, I'd check
> with the OSI (Open Source Initiative) first, for their legal
> recommendations, not the FSF's.  Some of the FSF's "interpretations"
> seem somewhat dubious.
>
>
> Rod Pemberton
From: Rod Pemberton on 9 Nov 2009 01:38

"Mike Austin" <mike(a)mike-nospam-austin.com> wrote in message
news:Jb6dnaOW6s5DumrXnZ2dnUVZ_gWdnZ2d(a)giganews.com...
> Rod Pemberton wrote:
> >
> > My brute-force tests of nearly 50 hashing algorithms and variations
> > from various sources on the Internet showed that there are two good
> > algorithms publicly available:
> >
> > Austin Appleby's MurmurHash2
> > Bob Jenkins' hashlittle()
>
> Wow, MurmurHash2 is amazingly simple.  I'm not an expert in hashing
> algorithms - do you have any recommendations for using it to map
> pointers to array indexes?
>

Sorry, the extent of my use was my personal testing.

> [...]
>
> Wow, TCL hashing is at the bottom?

No.  As stated, the list was sorted alphabetically for the post.  I.e.,
the ordering contains no useful information other than the names of the
authors.  This may be useful in locating some of them.

I can't tell you the TCL hash's overall rank in terms of collisions -
since I don't know.  I performed a speed test followed by a series of
smaller collision tests to eliminate poorer algorithms.  I didn't save
results for these non-performers since my goal was the results of the
larger brute-force test on good algorithms.

I can tell you TCL hashing's rank in terms of time: 13th out of 45.
MurmurHash2 ranked 7th.  hashlittle() ranked 2nd.  (IMO, the 1st place
result should be discarded...)

So, if you need a little more speed: Jenkins.  If you need fewer
collisions: Appleby.


Rod Pemberton

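[On Mike's question of mapping Symbol* pointers to array indexes, a common
approach - not something either poster prescribes - is to run the pointer
bits through an integer mixing function and mask down to a power-of-two
table size.  The sketch below uses the 64-bit finalizer from Austin
Appleby's later MurmurHash3 as the mixer; the names and the 256-slot table
size are illustrative assumptions.]

#include <stdint.h>

#define SLOT_BITS 8                    /* 256 slots; pick for the live key count */
#define SLOT_MASK ((1u << SLOT_BITS) - 1)

/* MurmurHash3's fmix64: spreads entropy across all 64 bits */
static uint64_t mix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* map a pointer key (e.g. a Symbol*) to a slot index */
unsigned symbol_slot(const void *sym)
{
    /* pointers are usually 8- or 16-byte aligned, so the low bits carry
     * little information; mixing pulls the high bits down before masking */
    return (unsigned)(mix64((uint64_t)(uintptr_t)sym) & SLOT_MASK);
}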