From: markspace on 24 Nov 2009 20:04

Patricia Shanahan wrote:
> Given current division speeds, does it really make sense to use a
> power-of-two bucket count?

I wasn't using bit masking and powers of two, I was using double the previous table size then add 1, and integer modulo to limit the hash index to the table size. I even started with a hash table size of 101. I was using closed hashing however, which did not help.

I was *still* getting a huge clump of collisions right in the middle of the table.

I'm still not sure why, but I think the repeating patterns in the double represent very few different factors, so if I hit one, then I was stuck. There are very few other factors in those types of numbers for me to "modulo" and get a different hash index.

It was so bad that as near as I could tell, for values much less than the OP's example (one million keys instead of ten million) the test code was basically not going to ever complete.
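For concreteness, here is a minimal sketch of the kind of table being described above: 101 initial buckets, growth to 2*n + 1, integer modulo for the index, and closed hashing (linear probing) for collisions. The class and method names are invented for illustration; this is not the poster's actual test code.

<code>
// Minimal sketch of the scheme described above (illustrative only): 101
// initial buckets, grow to 2*n + 1, index by integer modulo, resolve
// collisions by linear probing. Duplicates are not handled.
public class ClosedHashTable {

    private double[] keys = new double[101];
    private boolean[] used = new boolean[101];
    private int size;

    public void add(double key) {
        if (size * 2 >= keys.length) {          // keep load factor under 0.5
            rehash(keys.length * 2 + 1);        // "double the table size then add 1"
        }
        insert(keys, used, key);
        size++;
    }

    private static void insert(double[] k, boolean[] u, double key) {
        long bits = Double.doubleToLongBits(key);
        int h = (int) (bits ^ (bits >>> 32));   // same folding Double.hashCode() uses
        int i = (h & 0x7FFFFFFF) % k.length;    // integer modulo limits the index
        while (u[i]) {                          // linear probe past occupied slots
            i = (i + 1) % k.length;
        }
        k[i] = key;
        u[i] = true;
    }

    private void rehash(int newLength) {
        double[] nk = new double[newLength];
        boolean[] nu = new boolean[newLength];
        for (int j = 0; j < keys.length; j++) {
            if (used[j]) {
                insert(nk, nu, keys[j]);
            }
        }
        keys = nk;
        used = nu;
    }
}
</code>

If the keys' folded hash values land in only a few residue classes modulo the table length, linear probing piles them into one contiguous run, which would be consistent with the clumping described above.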
From: John B. Matthews on 24 Nov 2009 21:30

In article <hehvr3$qs5$1(a)news.eternal-september.org>, markspace <nospam(a)nowhere.com> wrote:
> Patricia Shanahan wrote:
>
> > Given current division speeds, does it really make sense to use a
> > power-of-two bucket count?
>
> I wasn't using bit masking and powers of two, I was using double the
> previous table size then add 1, and integer modulo to limit the hash
> index to the table size. I even started with a hash table size of
> 101. I was using closed hashing however, which did not help.
>
> I was *still* getting a huge clump of collisions right in the middle
> of the table.
>
> I'm still not sure why, but I think the repeating patterns in the
> double represent very few different factors, so if I hit one, then I
> was stuck. There are very few other factors in those types of
> numbers for me to "modulo" and get a different hash index.
>
> It was so bad that as near as I could tell, for values much less than
> the OP's example (one million keys instead of ten million) the test
> code was basically not going to ever complete.

Is this just the contrived nature of the example? If the keys were really positive integers less than 10 million, then HashMap<Integer, Double> would solve the problem. If the domain were some other subset of the reals for which Double's hashCode() proved inadequate, one might wrap Double and provide a tailored hashCode() and equals().

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
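A sketch of the wrapper idea mentioned above, under the assumption that all 64 bits of the double should be mixed into the hash. The class name and the particular mixing steps (a MurmurHash3-style finalizer) are illustrative choices, not anything specified in the thread.

<code>
// Illustrative wrapper key for HashMap<DoubleKey, V>. The name and the
// specific bit-mixing steps are assumptions, not part of the original post.
public final class DoubleKey {

    private final double value;

    public DoubleKey(double value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof DoubleKey
            && Double.doubleToLongBits(value)
               == Double.doubleToLongBits(((DoubleKey) o).value);
    }

    @Override
    public int hashCode() {
        long bits = Double.doubleToLongBits(value);
        // Mix all 64 bits instead of just XORing the halves, so structured
        // significand patterns do not collapse into a few hash values.
        bits ^= bits >>> 33;
        bits *= 0xff51afd7ed558ccdL;        // MurmurHash3 finalizer constant
        bits ^= bits >>> 33;
        return (int) (bits ^ (bits >>> 32));
    }
}
</code>

With such a wrapper, a HashMap<DoubleKey, Double> should spread structured keys across buckets more evenly than the raw Double keys would, at the cost of an extra object per key.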
From: markspace on 24 Nov 2009 21:45

John B. Matthews wrote:
>
> Is this just the contrived nature of the example?

I dunno. I have a feeling that while the example is contrived, it isn't unreasonable. Doubles and their bit patterns don't seem to be as random as people expect.

Look at Tom's attempt to introduce some randomness by dividing by ten. I don't think the bit patterns he produced are random at all. I see a lot of repeating bit patterns, and I'm going to guess that repeating bit patterns are a lot more common in doubles and floats than many of us suppose.

I think a good bit shuffling algorithm is required to prevent "pathological" hash codes from messing up the hash algorithm. HashMap provides one. If you roll your own hash table, you should provide your own bit shuffling too.
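One quick way to see the kind of repetition being described is to dump the raw IEEE-754 bits of a few doubles. The throwaway program below is added here for illustration and is not from the thread; it uses the same divide-by-ten values Tom did, where every value except 0.5 has a period-four repeating significand because 1/10 repeats in binary.

<code>
// Throwaway demo (not from the thread): print the raw IEEE-754 bits of i/10.0
// for i = 1..9. Apart from 0.5, each significand repeats with period four,
// since the binary expansion of 1/10 is 0.0001100110011...
public class DoubleBits {
    public static void main(String[] args) {
        for (int i = 1; i <= 9; i++) {
            double d = i / 10.0;
            String bits = Long.toBinaryString(Double.doubleToLongBits(d));
            System.out.printf("%3.1f  %s%n", d, bits);
        }
    }
}
</code>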
From: Daniel Pitts on 24 Nov 2009 23:25

Thomas Pornin wrote:
> According to Patricia Shanahan <pats(a)acm.org>:
>> Given current division speeds, does it really make sense to use a
>> power-of-two bucket count?
>
> Integer division tends to be slow on modern hardware. The Intel
> optimization manual which happens to conveniently reside on my harddisk
> states that latency for a simple "idiv" opcode is between 56 and 70
> clock cycles, while masking uses only a fraction of a clock cycle.
> Division actually becomes slower over time; on a 80386, signed division
> took 43 clock cycles while a bit masking operation used 2 clock cycles.
> I do not have Intel manuals for newer processors (these should be
> downloadable from Intel's web site), but I expect integer division to
> remain slow on newer designs.
>
> On some RISC architectures (e.g. Alpha), there is no integer division
> opcode at all; division is performed in software and is even slower.
>
> In Sun's JRE, Hashtable and HashMap differ in their strategy.
> Hashtable uses a bucket count initially set to the provided value,
> or 11 by default. If the count is n and must be increased, then the
> new count is 2*n+1. Conversely, HashMap uses only powers of two for
> its bucket count (if given an initial capacity, it rounds it up to
> the next power of two).
>
> --Thomas Pornin

I did a quick microbench to test division versus shifting. I was surprised by the result; I expected them to be "close enough", but it looks like, on my machine at least, there is an order of magnitude difference:

<results>
Nothing: count = 874270000
Shift: count = 869828000
Divide: count = 91958000
Nothing: count = 832932000
Shift: count = 760583000
Divide: count = 82123000
Nothing: count = 890739000
Shift: count = 881628000
Divide: count = 92349000
</results>

<code>
public class TestDivs {

    private static final int INNER_LOOP = 1000;
    private static final long RUNFOR = 1000 * 1000 * 1000;

    public static void main(String[] args) {
        doNothing(); doShift(); doDivide();
        doNothing(); doShift(); doDivide();
        doNothing(); doShift(); doDivide();
    }

    private static void doNothing() {
        final long start = System.nanoTime();
        long ns;
        while (start == (ns = System.nanoTime()))
            ;
        ns += RUNFOR;
        int count = 0;
        int s = 100;
        while (System.nanoTime() < ns) {
            for (int i = 0; i < INNER_LOOP; ++i) {
                ++count;
                s += 9999;
            }
        }
        System.out.println("Nothing: count = " + count);
    }

    private static void doDivide() {
        final long start = System.nanoTime();
        long ns;
        while (start == (ns = System.nanoTime()))
            ;
        ns += RUNFOR;
        int count = 0;
        int s = 100;
        while (System.nanoTime() < ns) {
            for (int i = 0; i < INNER_LOOP; ++i) {
                ++count;
                s %= 37;
                s += 9999;
            }
        }
        System.out.println("Divide: count = " + count);
    }

    private static void doShift() {
        final long start = System.nanoTime();
        long ns;
        while (start == (ns = System.nanoTime()))
            ;
        ns += RUNFOR;
        int count = 0;
        int s = 100;
        while (System.nanoTime() < ns) {
            for (int i = 0; i < INNER_LOOP; ++i) {
                ++count;
                s >>= 5;
                s += 9999;
            }
        }
        System.out.println("Shift: count = " + count);
    }
}
</code>
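To tie the benchmark back to the bucket-count question: a power-of-two table lets the index be computed with a mask instead of a division or modulo. The method names below are invented for illustration; they are not the JDK's internal methods, though the default capacities (11 and 16) match what Thomas describes.

<code>
// Sketch of the two indexing strategies under discussion (names invented).
public class BucketIndex {

    // Hashtable-style: any table length (11 by default), index by modulo.
    static int moduloIndex(int hash, int tableLength) {
        return (hash & 0x7FFFFFFF) % tableLength;   // strip the sign, then divide
    }

    // HashMap-style: table length is a power of two, index by masking.
    static int maskIndex(int hash, int tableLength) {
        return hash & (tableLength - 1);            // only valid for 2^k lengths
    }

    public static void main(String[] args) {
        int h = Double.valueOf(123456.789).hashCode();
        System.out.println(moduloIndex(h, 11));     // Hashtable's default capacity
        System.out.println(maskIndex(h, 16));       // HashMap's default capacity
    }
}
</code>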
From: John B. Matthews on 24 Nov 2009 23:35
In article <hei5nr$5n0$1(a)news.eternal-september.org>, markspace <nospam(a)nowhere.com> wrote:
> John B. Matthews wrote:
> >
> > Is this just the contrived nature of the example?
>
> I dunno. I have a feeling that while the example is contrived, it isn't
> unreasonable.

Ah, perhaps a cache of expensive function results.

> Doubles and their bit patterns don't seem to be as random as people
> expect.
>
> Look at Tom's attempt to introduce some randomness by dividing by
> ten. I don't think the bit patterns he produced are random at all.
> I see a lot of repeating bit patterns, and I'm going to guess that
> repeating bit patterns are a lot more common in doubles and floats
> than many of us suppose.

I see what you mean: floats and doubles are just rational approximations of reals, so they always end in zero or repeat.

> I think a good bit shuffling algorithm is required to prevent
> "pathological" hash codes from messing up the hash algorithm.
> HashMap provides one. If you roll your own hash table, you should
> provide your own bit shuffling too.

The method hash() is adapted to the power-of-two array size:
<http://www.docjar.com/html/api/java/util/HashMap.java.html>

Recalling Patricia's observation above, using a prime array size might be an alternative.

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
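For reference, the supplemental hash and index computation in that file look roughly like this (reproduced from memory of the JDK 6 source, so the link above is authoritative if the details differ):

<code>
// Roughly the JDK 6 HashMap supplemental hash and index computation,
// quoted from memory; see the docjar link above for the exact source.
class HashMapStyleHash {

    static int hash(int h) {
        // Spread the high-order bits downward so they influence the low bits
        // that the power-of-two mask actually keeps.
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);    // length is always a power of two in HashMap
    }
}
</code>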