From: Daniel Pitts on
On 7/9/2010 12:50 PM, Patricia Shanahan wrote:
> On 7/9/2010 12:45 PM, Tom Anderson wrote:
>> On Thu, 8 Jul 2010, Eric Sosman wrote:
>>
>>> Or, you could have BigList implement List but "lie" in its .size()
>>> method, in somewhat the same way TreeSet "lies" about the Set contract.
>>
>> How does TreeSet lie about the Set contract?
>
> The case I'm aware of involves a TreeSet with a Comparator that is not
> consistent with the .equals methods of the TreeSet elements. The TreeSet
> always goes by the Comparator results. That means the TreeSet could
> contain elements a and b such that a.equals(b).
>
> Patricia
It is my opinion that (given perfect hindsight) the Collections API
should have included several interfaces for comparing items, not just one.

interface Ordering<T> {
    enum Order {
        LESSER, EQUAL, GREATER
    }

    Order compare(T left, T right);
}

interface Hasher<T> {
    long hash(T value);
}

interface Equivalence<T> {
    boolean equal(T left, T right);
}

Then, all the appropriate Collection code could use those interfaces.
There should also be the obvious default implementations.
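Something like this, say (just a sketch, with invented names; the defaults
simply delegate to compareTo(), hashCode(), and equals(), and I've repeated
the interfaces so the snippet stands alone):

```java
// Sketch only: "obvious" default implementations that delegate to the
// existing Object/Comparable methods. Interfaces repeated from above.
interface Ordering<T> {
    enum Order { LESSER, EQUAL, GREATER }
    Order compare(T left, T right);
}

interface Hasher<T> {
    long hash(T value);
}

interface Equivalence<T> {
    boolean equal(T left, T right);
}

final class NaturalOrdering<T extends Comparable<T>> implements Ordering<T> {
    public Ordering.Order compare(T left, T right) {
        int c = left.compareTo(right);
        if (c < 0) return Ordering.Order.LESSER;
        if (c > 0) return Ordering.Order.GREATER;
        return Ordering.Order.EQUAL;
    }
}

final class DefaultHasher<T> implements Hasher<T> {
    public long hash(T value) {
        return value.hashCode();  // int widens to long
    }
}

final class DefaultEquivalence<T> implements Equivalence<T> {
    public boolean equal(T left, T right) {
        return left.equals(right);
    }
}
```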

Not to mention that iterators should have separate methods for advancing
and retrieving, and options for non-fail-fast for certain kinds of
collections (linked lists shouldn't invalidate any iterator unless the
item itself is removed).

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
From: Eric Sosman on
On 7/9/2010 4:06 PM, Daniel Pitts wrote:
>
> It is my opinion that (given perfect hindsight), the Collections API
> should have included several interfaces for comparing items, not just one.
>
> interface Ordering<T> {
>     enum Order {
>         LESSER, EQUAL, GREATER
>     }
>
>     Order compare(T left, T right);
> }

The only difference I see between this and Comparator<T> is
the use of the enum instead of an int. The advantages are not
immediately obvious. Meanwhile, instead of `comp(o1,o2) <= 0' one
would write `comp(o1,o2) != Order.GREATER', which reads even less
smoothly than `!(comp(o1,o2) > 0)'.

> interface Hasher<T> {
>     long hash(T value);
> }

A 64-bit hashCode() would be of little use until you got to
more than 2^32 hash buckets. Just saying.

> interface Equivalence<T> {
>     boolean equal(T left, T right);
> }

I don't get it: Why not just use equals()? I guess a class
could choose not to implement Equivalence at all (and thus make
itself unusable in whatever framework relies on Equivalence), but
is that an advantage? Also, you could get a compile-time error
instead of a run-time `false' for trying to call equal() on
references of dissimilar classes; again, where's the benefit?

> Then, all the appropriate Collection code could use those interfaces.
> There should also be the obvious default implementations.

It might be helpful to give some examples of the "appropriate"
uses, and of the "obvious" defaults. For example, how does a HashMap
make use of a key that implements Hasher? Does it reflect on each
key it's given and make a run-time choice between using hash() and
hashCode()? I don't get it ...

> Not to mention that iterators should have separate methods for advancing
> and retrieving,

I guess this would allow `iter.advance(666666)', with a possible
savings if the Iterator were backed by a disk file or something, where
you could just skip over two thirds of a million items without looking
at them. But, but, but!!! this only makes sense for a sequential data
structure like a List -- and for List you've already *got* access by
index without using an Iterator at all! I'm beginning to sound like a
broken record[*], I know, but what's the benefit?

> and options for non-fail-fast for certain kinds of
> collections (linked lists shouldn't invalidate any iterator unless the
> item itself is removed).

Not so sure about that. The notion of "what's the next item?"
can certainly change if things are added or removed "near" the point
where an Iterator is active. Or, in the `iter.advance(666666)' case,
you may find yourself at an unexpected place if somebody else has
inserted 555555 items between where you are and where you're going ...

See also ListIterator.

[*] How many programmers nowadays have ever heard a broken record?
Or dialed a telephone, or seen a core? Gad, I'm feeling as old as
Methuselah.[**]

[**] Before your time, laddie, before your time.

--
Eric Sosman
esosman(a)ieee-dot-org.invalid
From: ClassCastException on
On Fri, 09 Jul 2010 16:54:53 -0400, Eric Sosman wrote:

> On 7/9/2010 4:06 PM, Daniel Pitts wrote:
>> interface Hasher<T> {
>>     long hash(T value);
>> }
>
> A 64-bit hashCode() would be of little use until you got to
> more than 2^32 hash buckets. Just saying.

Gets us back to the original topic. :-)

>> interface Equivalence<T> {
>>     boolean equal(T left, T right);
>> }
>
> I don't get it: Why not just use equals()? I guess a class
> could choose not to implement Equivalence at all (and thus make itself
> unusable in whatever framework relies on Equivalence), but is that an
> advantage? Also, you could get a compile-time error instead of a
> run-time `false' for trying to call equal() on references of dissimilar
> classes; again, where's the benefit?
>
>> Then, all the appropriate Collection code could use those interfaces.
>> There should also be the obvious default implementations.
>
> It might be helpful to give some examples of the "appropriate"
> uses, and of the "obvious" defaults. For example, how does a HashMap
> make use of a key that implements Hasher? Does it reflect on each key
> it's given and make a run-time choice between using hash() and
> hashCode()? I don't get it ...

Note that those interfaces specify methods with an "extra" parameter
each. They're like Comparator versus compareTo/Comparable.

The purpose is clear: so a HashMap could be given, optionally, a
Hasher<K> to use in place of the keys' own hashCode methods and an
Equivalence<K> to use in place of the keys' own equals methods.

One obvious benefit is that you get rid of IdentityHashMap by folding
that functionality into plain HashMap. Instead of a separate class, you'd
get an identity hash map with

new HashMap<K,V>(new Hasher<K>() {
    public long hash(K x) {
        return System.identityHashCode(x);
    }
}, new Equivalence<K>() {
    public boolean equal(K x, K y) {
        return x == y;
    }
});

or with canned instances of IdentityHasher and IdentityEquivalence
provided by the library.

With this, you would also be able to get identity WeakHashMaps and the
like; by separating the "how strong is the reference" aspect into one
class and the "how is identity decided" aspect into another, you avoid a
combinatorial explosion and possible lacunae of capability (right now we
have no WeakIdentityHashMap, in particular).

You'd also be able to reduce some of the clumsier uses of HashMap to
HashSet. Picture a

class Record {
    public final int id;
    public final String name;
    public final String address;
}

with the obvious equality semantics (all fields equal) and constructor
added.

Now throw in an Equivalence and a Hasher that use only the record's id
field.
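Spelled out, that looks something like this (a sketch; the Hasher and
Equivalence declarations from upthread are repeated so it stands alone):

```java
// Sketch: Record with all-fields equality, plus id-only strategies.
// Hasher and Equivalence are the hypothetical interfaces from upthread.
interface Hasher<T> { long hash(T value); }
interface Equivalence<T> { boolean equal(T left, T right); }

class Record {
    public final int id;
    public final String name;
    public final String address;

    public Record(int id, String name, String address) {
        this.id = id;
        this.name = name;
        this.address = address;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Record)) return false;
        Record r = (Record) o;
        return id == r.id && name.equals(r.name) && address.equals(r.address);
    }

    @Override public int hashCode() {
        return 31 * (31 * id + name.hashCode()) + address.hashCode();
    }
}

// Id-only strategies: two Records count as "the same" iff the ids match.
class IdHasher implements Hasher<Record> {
    public long hash(Record r) { return r.id; }
}

class IdEquivalence implements Equivalence<Record> {
    public boolean equal(Record a, Record b) { return a.id == b.id; }
}
```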

So maybe you keep a change log for an individual person as a
List<Record>, chronological:

id 0001
name Jane Herman
address 1600 Pennsylvania Avenue

id 0001
name Jane Herman
address 18 Wisteria Lane

id 0001
name Jane Thurston
address 18 Wisteria Lane

OK, so she got voted out of office, then got married, or something like
that.

Of course you might want to throw a jumble of records in a Set and have
different ones of the above count as different.

But you might also want a record of the current state of affairs. Given a
HashSet implementation that can use a supplied Hasher and Equivalence the
way TreeSet can use an externally-supplied Comparator, and that also has
the semantics that adding an element that equals an already-present
element replaces that element with the new one, you can update the 0001
record simply by putting a more recent one into this set -- if it already
has a 0001 record, the provided Hasher and Equivalence will lead to the
new one replacing that one.

So in some contexts you can treat records identically only if they're
actually identical; in others if they have the same id; all without
monkeying with an explicit id-to-record HashMap or suchlike.
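For comparison, that explicit id-to-record monkeying is what you're stuck
with today; the proposed set would hide this bookkeeping behind add():

```java
import java.util.HashMap;
import java.util.Map;

// Emulating replace-on-add with an explicit id-keyed map: put() on an
// existing key replaces the old record, which is the behavior the
// hypothetical Hasher/Equivalence-aware HashSet would give add().
public class CurrentState {
    public static void main(String[] args) {
        Map<Integer, String> current = new HashMap<Integer, String>();
        current.put(1, "Jane Herman / 1600 Pennsylvania Avenue");
        current.put(1, "Jane Herman / 18 Wisteria Lane");
        current.put(1, "Jane Thurston / 18 Wisteria Lane");
        // Only the newest 0001 record remains in the map.
        System.out.println(current.size() + ": " + current.get(1));
    }
}
```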

Another way to achieve this last, though, is to have a KeyExtractor<T>
interface that you implement in this case to return the id field of a
Record and a HashSet implementation that uses the object itself as the
key in its internal HashMap if no KeyExtractor is specified during
construction, and uses the supplied KeyExtractor otherwise. This is
actually closer to the conceptual truth of what you're doing in a case
like this: keying on the id field in a particular HashSet. The
implementation would be something like

public class HashSet<T> {
    private HashMap<Object,T> data = new HashMap<Object,T>();
    private KeyExtractor<T> ke = new KeyExtractor<T>() {
        public Object getKey(T val) {
            return val;
        }
    };

    ...

    public T put(T newElem) {
        Object key = ke.getKey(newElem);
        T oldElem = data.get(key);
        data.put(key, newElem);
        return oldElem;
    }
}

whereas the Hasher/Equivalence version would just pass the Hasher and
Equivalence to the HashMap constructor when initializing data, and put
would use newElem directly instead of a separate key local.

The really interesting thing is that we don't need to wait for any
hypothetical future Sun (Oracle?) update to do some of this: KeyExtractor
and the above variation of HashSet can be implemented now. The latter might
better be called RecordMap, since it acts as a map from key fields of
records of some sort to whole records, and in the typical case you probably
also want to look up whole records by just their keys. You might sometimes
want to hold the records via weak or soft references, e.g. to make it a
cache. In that case you want to allow specifying two more things: a type of
reference to use (enum ReferenceType { STRONG, SOFT, WEAK } with default
STRONG) and an optional ExpensiveGetter whose expensiveGet() defaults to
returning null but can be replaced with one that does something like, say,
retrieve disk records. Then get() calls expensiveGet() on not-found and, if
expensiveGet() doesn't throw or return null, does a put() before returning
the result. You can throw in another type parameter, too:

// (needs: import java.util.HashMap; import java.lang.ref.*;)
interface ExpensiveGetter<K, V, E extends Exception> {
    V expensiveGet(K key) throws E;
}

public class RecordMap<K, V, E extends Exception> {
    private ExpensiveGetter<K,V,E> eg = new ExpensiveGetter<K,V,E>() {
        public V expensiveGet(K key) throws E {
            return null;
        }
    };

    private HashMap<K,Object> data = new HashMap<K,Object>();

    public enum ReferenceType {
        STRONG {
            public Object wrap(Object o) {
                return o;
            }
        },
        SOFT {
            public Object wrap(Object o) {
                return new SoftReference<Object>(o);
            }
        },
        WEAK {
            public Object wrap(Object o) {
                return new WeakReference<Object>(o);
            }
        };

        public abstract Object wrap(Object o);
    }

    private ReferenceType referenceType = ReferenceType.STRONG;

    ...

    @SuppressWarnings("unchecked")
    public V get(K key) throws E {
        Object result = data.get(key);
        if (result instanceof Reference) result = ((Reference<?>) result).get();
        if (result != null) return (V) result;
        V loaded = eg.expensiveGet(key);
        if (loaded == null) return null;
        put(key, loaded);
        return loaded;
    }

    public void put(K key, V val) {
        data.put(key, referenceType.wrap(val));
    }
}

This is a bit messy but it's just a quick draft. It doesn't actually
implement Map because it doesn't quite fit the Map contract in a few
places (and making it do so would be difficult, particularly since get
seems to have to be able to throw exceptions). You might want to change
ExpensiveGetter to a more general BackingSource that provides both get and
put methods; puts write through to the real backing store whenever
performed as well as writing to the RecordMap in memory, making a
RecordMap with a non-default BackingSource a cache backed by something in
a two-way fashion.
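That BackingSource might look like this (a sketch; all the names are
invented, and the map-backed implementation is just a stand-in for
something like disk records):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the more general BackingSource: the cache reads through it
// on a miss and writes through it on every put.
interface BackingSource<K, V, E extends Exception> {
    V load(K key) throws E;               // consulted on cache miss
    void store(K key, V value) throws E;  // write-through on put
}

// Trivial in-memory example standing in for a real backing store.
class MapBackedSource<K, V> implements BackingSource<K, V, RuntimeException> {
    private final Map<K, V> backing = new HashMap<K, V>();
    public V load(K key) { return backing.get(key); }
    public void store(K key, V value) { backing.put(key, value); }
}
```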

I may be a bit rusty on the syntax of giving enum constants behavior,
too. Clearly in this case that's the right thing to do, from an OO
perspective, rather than having a switch clause in the put method that
could get out of synch if someone decided to add PHANTOM to the thing for
whatever reason or a future JDK added more Reference types that
influenced GC policy in as-yet-unforeseen ways.