From: Lew on
Lew wrote:
>> I'm still not clear on what makes 1 or 42 or 17 or any other value
>> significantly "better" than 0. One might as well use 0, since likely
>> the only String with a hash to be recalculated each time in practice
>> will be "", and as others have pointed out, that one is very quick to
>> recalculate.

Mike Schilling wrote:
> If 0 means "I haven't calculated it yet", and a calculated value of 0 is
> stored as 1, you never need to recalculate. I don't mind if you change
> it to 17182891 to mean "I haven't calculated it yet", and store a
> calculated value of 17182891 as 23023492348. It's the same thing -- you
> never need to recalculate,

How nice. But you utterly ignored the "significantly" part of my comment.

You might skew hash lookups enough by that trick to undo any putative benefit
of not recalculating. So you save .003 microseconds in .0000000025% of cases,
but lose .0031 microseconds to increased hash collision in .0000000026% of cases.

I don't see the validity of the "never need to recalculate" argument. It's
just robbing Peter to pay Paul. At least with 0 as the sentinel value with no
messing around with plugging in another value for it, it's easy to understand
what is going on. I'd hate to lose that for unsubstantiable claims of
micro-optimization.

--
Lew
From: Mike Schilling on


"Lew" <noone(a)lewscanon.com> wrote in message
news:i20jht$jvk$1(a)news.albasani.net...
> Lew wrote:
>>> I'm still not clear on what makes 1 or 42 or 17 or any other value
>>> significantly "better" than 0. One might as well use 0, since likely
>>> the only String with a hash to be recalculated each time in practice
>>> will be "", and as others have pointed out, that one is very quick to
>>> recalculate.
>
> Mike Schilling wrote:
>> If 0 means "I haven't calculated it yet", and a calculated value of 0 is
>> stored as 1, you never need to recalculate. I don't mind if you change
>> it to 17182891 to mean "I haven't calculated it yet", and store a
>> calculated value of 17182891 as 23023492348. It's the same thing -- you
>> never need to recalculate,
>
> How nice. But you utterly ignored the "significantly" part of my comment.
>
> You might skew hash lookups enough by that trick to undo any putative
> benefit of not recalculating. So you save .003 microseconds in
> .0000000025% of cases, but lose .0031 microseconds to increased hash
> collision in .0000000026% of cases.

On the one hand, you might have to do an arbitrarily large amount of hash
recalculation, since the string whose hash code is 0 can be arbitrarily
long. One the other, there's no reason to think that moving one in 4
billion strings from one hash bucket to another will affect hashing
performance at all.

>
> I don't see the validity of the "never need to recalculate" argument.
> It's just robbing Peter to pay Paul. At least with 0 as the sentinel
> value with no messing around with plugging in another value for it, it's
> easy to understand what is going on.

It's easy either way. In the definition of hashCode for strings, you just
append "and if the result is zero, add one."

From: Tom Anderson on
On Mon, 19 Jul 2010, Lew wrote:

> Lew wrote:
>>> I'm still not clear on what makes 1 or 42 or 17 or any other value
>>> significantly "better" than 0. One might as well use 0, since likely
>>> the only String with a hash to be recalculated each time in practice
>>> will be "", and as others have pointed out, that one is very quick to
>>> recalculate.
>
> Mike Schilling wrote:
>> If 0 means "I haven't calculated it yet", and a calculated value of 0 is
>> stored as 1, you never need to recalculate. I don't mind if you change
>> it to 17182891 to mean "I haven't calculated it yet", and store a
>> calculated value of 17182891 as 23023492348. It's the same thing -- you
>> never need to recalculate,
>
> How nice. But you utterly ignored the "significantly" part of my comment.
>
> You might skew hash lookups enough by that trick to undo any putative benefit
> of not recalculating. So you save .003 microseconds in .0000000025% of
> cases, but lose .0031 microseconds to increased hash collision in
> .0000000026% of cases.

Hmm. The only cases where the hash collision rate would change would be
ones where one of the keys currently hashes to 0 - because those are the
only keys whose hash would change. Thus, in every case where you get worse
collisions, you also save time on re-hashing.

It's possible to make an educated guess at what the cost of the former and
benefit of the latter are, although the sugar level in my caffeine stream
is currently too low to be able to do so properly myself.

The worst-case for a worsened collision is pretty bad - the key in
question could move from a bucket of its own to the end of a collision
list containing all the other keys in the table. But that's unlikely; one
could work out the average case based on the distribution of lengths of
collision lists, which is presumably a Poisson distribution or some such.
But i think on average, the change in size of the collision list it's on
will be a very, very small increase.

Whereas the saving from not having to rehash is always going to be a load,
a few arithmetic instructions and a store, and could be as much as lots of
loads, a lot of arithmetic instructions, and a store. The store could be
particularly problematic in an otherwise read-only passage of code,
because of its effect on caches. Maybe. You'd have to ask Patricia.

> I don't see the validity of the "never need to recalculate" argument.
> It's just robbing Peter to pay Paul.

Which is sometimes a sensible thing to do.

> At least with 0 as the sentinel value with no messing around with
> plugging in another value for it, it's easy to understand what is going
> on. I'd hate to lose that for unsubstantiable claims of
> micro-optimization.

True. But i don't think this is an unsubstantiable claim. As yet
unsubstantiated, but certainly substantiable.

tom

--
Give the future a chance!
From: Jim Janney on
Andreas Leitgeb <avl(a)gamma.logic.tuwien.ac.at> writes:

> Tom Anderson <twic(a)urchin.earth.li> wrote:
>> To make this method safe, you either have to synchronize the whole thing,
>> or do the update of calculated and code atomically at the end.
>
> I'd have expected that if "calculated" was assigned *after* "code", that
> would suffice without further synchronisation or volatile-ity of the fields.
>
> Am I still too naive?
>
>> The beauty of the String approach, of using a special value of code to
>> indicate that it had not been calculated, is that you don't need any
>> synchronisation for safety, just the JLS's guarantee of no word tearing in
>> writes and reads of int variables. Because it combines the flag and value
>> fields in a single int, they are read and written atomically as a pair. Of
>> course, were you to do this, you might want to avoid String's ability to
>> generate a code which looks like a flag indicating the lack of a code (ie
>> 0). But then, you might think the one in four billion chance of it
>> happening was insignificant.
>
> I guess, I'd have spent one "if (h==0) { h=42; }" just before "hash = h;"
> After the calculation loop, that extra "if" really wouldn't have hurt.
>
> Has anyone found e.g. an english dictionary-word with hashCode 0, yet?
> Or perhaps the name of some commonly used class in Java standard library
> or some other String likely occurring in innocent code?

All strings of the form "\0", "\0\0", "\0\0\0", etc. Not something
you'll find in a dictionary, but it could conceivably occur in some
applications. For ordinary words, I ran the following against the
dictionary file on a Linux system and got no hits.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ZeroHash {
public static void findZeroHashes(BufferedReader in) throws IOException {
String s;
while ((s = in.readLine()) != null) {
if (s.hashCode() == 0) {
System.out.println(s);
}
}
}

public static void main(String[] args) throws IOException {
findZeroHashes(new BufferedReader(new InputStreamReader(System.in)));
}
}

--
Jim Janney
From: Alan Gutierrez on
markspace wrote:
> Jim Janney wrote:
>
>> The canonical immutable class, java.lang.String, isn't annotated as
>> immutable.

> Hmmm:
>
> "Basic, canonic, canonical: reduced to the simplest and most significant
> form possible without loss of generality, e.g., "a basic story line"; "a
> canonical syllable pattern.""
>
> Not sure if String is "canonical" here. Maybe "ubiquitous" is more what
> you're after. Or possibly I just misunderstand what you mean.

You seem to have plucked that definition from Wikipedia, which goes onto
to say...

Some circles in the field of computer science have borrowed this usage
from mathematicians. It has come to mean "the usual or standard state or
manner of something"; for example, "the canonical way to organize a file
system is as a hierarchy, with extensions to make it a directed graph".

This is how I use canonical and Jim's usage jibes with my understanding.

--
Alan Gutierrez - alan(a)blogometer.com - http://twitter.com/bigeasy