From: ClassCastException on
On Fri, 09 Jul 2010 20:38:16 -0400, Arne Vajhøj wrote:

> On 09-07-2010 12:21, Wayne wrote:
>> On 7/9/2010 12:31 AM, Patricia Shanahan wrote:
>>> Wayne wrote:
>>>> On 7/8/2010 5:35 PM, Boris Punk wrote:
>>>>> Integer.MAX_VALUE = 2147483647
>>>>>
>>>>> I might need more items than that. I probably won't, but it's nice
>>>>> to have extensibility.
>>>>
>>>> To me, it is unlikely your system will run well if this one data
>>>> structure consumes 2G of memory. (You didn't really state the
>>>> application or system; certainly there are exceptions to the rule.)
>>>> I would suggest you use a more flexible system, where you keep the
>>>> data on storage (disk) and use memory as a cache. Perhaps an
>>>> ArrayList of soft references would work well. It might even be
>>>> possible in your particular case to run a daemon thread that
>>>> pre-fetches items into the cache.
>>>
>>> What's the difference between one data structure occupying over 2 GB
>>> and a set of data structures that use that much space?
>>>
>>> Certainly, given enough memory, Java can support total data structure
>>> sizes well over 2 GB without excessive paging.
>>
>> A reduction in the number of page faults. There was an interesting
>> article about this topic in this month's Communications of the ACM, by
>> Poul-Jenning Kamp,
>
> Poul-Henning Kamp often abbreviated PHK.
>
>>
>> who was one of the lead developers of the FreeBSD kernel.
>
> He still contributes to FreeBSD.
>
>> He applied his
>> insight
>> to a web proxy replacement for Squid called Varnish, and was able to
>> replace 12 Squid machines with 3 Varnish ones. It used a modified
>> binary heap he called a B-heap, which respected the page size of
memory. The article was titled "You're Doing It Wrong". The message I
>> came away with was, don't ignore the fact that computers use paging
>> when designing large data structures. I was thinking that lesson might
>> apply to the OP's situation.
>
> I assume you are talking about this article:
>
> http://queue.acm.org/detail.cfm?id=1814327
>
> He is not suggesting any custom swap to disk or anything, but just
> noting that it is beneficial to keep stuff together to minimize paging.

Does the new G1 collector do that? That is, are the collection spaces:

* 2^n pages in size, for some n and
* aligned on a 2^n-page boundary?

For example, collection spaces that take up 8 pages, and span contiguous
blocks of 8 pages starting at multiples of 8, so pages 0-7 could be one
space, 8-15 another, and so on.

If so, G1 will play nicer with paging than otherwise.

More interesting from a GC developer standpoint is the possibility of
grouping data that's used together. That could mean one of two things:

1. Trying to minimize the sum of the squares of the "lengths" of
references, where the length of a reference is the absolute difference
between the collection-space number of the space containing the
reference and that of the space containing the referent, collection
spaces being numbered consecutively in virtual-memory order of their
locations. (A toy illustration of this metric follows after this list.)
2. Actually observing patterns of access, such as object X and object Y
tend to go long periods unused and then both tend to get referenced
within a short time of each other. Objects that are accessed together
get preferentially put close together, preferably in the same
collection space. Thus they're likely to be paged in and out together.
Plus they are likely to have similar lifespans, so both will probably
become garbage together. This makes both the paging system and the
GC's jobs easier.
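
For concreteness, here is a toy illustration of the metric in
alternative 1 (purely hypothetical; the "collection space numbers" are
just made-up integers):

// Toy illustration only: compute the sum of squared reference "lengths"
// for a hypothetical heap snapshot. Each reference is modeled as a pair
// (space holding the reference, space holding the referent), with
// spaces numbered consecutively in virtual-memory order.
public class ReferenceLengthMetric {
    public static long sumOfSquaredLengths(int[][] references) {
        long total = 0;
        for (int[] ref : references) {
            long length = Math.abs(ref[0] - ref[1]);
            total += length * length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three made-up references: space 0 -> 7, 3 -> 4, and 5 -> 5.
        int[][] refs = { {0, 7}, {3, 4}, {5, 5} };
        System.out.println(sumOfSquaredLengths(refs)); // 49 + 1 + 0 = 50
    }
}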

Now alternative 2 we *almost* get for free with G1, since objects that
will tend to be used together also tend to be *born* together and such
objects will *usually* end up sharing one collection space (which
presumably starts out as a TLAB, then fills up, then is allowed to decay
into garbage over time). The trick is avoiding splitting them up later
when rearranging a few live objects to make a space be pure-garbage so it
can be recycled as a TLAB or whatever. Just trying to put the few live
objects from such a space into the *same* other space rather than
scattering them about the heap might suffice. If G1 keeps a partly-filled
space as a "old object tenuring space" and puts all such objects in it
until it's full, then moves on to another, that'll do it, and it will
cause a kind of self-organizing generational-spaces pattern to emerge
within the heap that will work fairly nicely with the OS's paging; long-
lived application global objects will accumulate to a few such spaces,
for instance. I think G1 *is* designed to behave somewhat like this, but
keeping the paging system in mind suggests how it might be possible to
fine-tune it. For example, one-page collection spaces for allocating all
objects no larger than, say, 64 bytes. (Most objects are this small.
What's left is mostly large arrays, and most of these will typically be
the innards of large HashMaps, HashSets, and ArrayLists. Large
collections tend to be long-lived or even application-global. You might
want a special-case optimization here to keep any Collection object that
directly references a large array together with that array in a single
big collection space instead of separating them. Collections into which
large arrays are placed as elements pose a bit of a problem for this, but
ought to be rare.)

What can programmers do by themselves, right now?

1. Investigate GC and paging performance as a function of your choice of
data structures and their sizes. Test if a roll-your-own disk-backed
structure performs better than keeping it in memory and letting the OS
page it out, particularly if it's infrequently accessed but then
accessed in bursts.

2. Try also breaking up large data structures into ones that ought to
span only a few pages each and that keep things that are accessed
together in one substructure together.

3. Breaking the problem itself up, parallelizing it, might make the data
structures smaller. The smaller problems can be run serially,
minimizing heap size, or multithreaded.

4. Prefer immutable objects whenever possible. All modern garbage
collectors are far more efficient with new objects referencing old
ones than with old objects referencing new ones. An old object can
only come to reference a younger one if one of its fields is written
after construction, so old-to-young references imply mutable fields.
(A small sketch of this idea follows after this list.)
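
To make point 4 concrete, here is a minimal sketch (the class names are
my own, purely illustrative): an immutable object can only ever point at
objects that already existed when it was constructed, while a mutable
one can later be pointed at a younger object, creating exactly the
old-to-young reference that generational collectors must track.

// Illustrative only: contrast a mutable node with an immutable one.

// Mutable: a long-lived instance can later be made to reference a
// freshly allocated (young) object, creating an old-to-young reference
// the GC must track with a write barrier.
class MutableNode {
    Object payload;
    void setPayload(Object p) { payload = p; } // may point old -> young
}

// Immutable: all fields are fixed at construction time, so this node
// can only reference objects at least as old as itself.
final class ImmutableNode {
    private final Object payload;
    ImmutableNode(Object payload) { this.payload = payload; }
    Object payload() { return payload; }
    // "Mutation" yields a new (young) node referencing old data instead.
    ImmutableNode withPayload(Object p) { return new ImmutableNode(p); }
}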

In fact, all of this points to a clear division between two fundamentally
different classes of object as far as GC and paging treatment is
concerned:

On the one hand, we have the ordinary object. It's typically small, with
only a few fields, and often these are likewise small ordinary objects
down to a small depth before you hit primitives. It's often immutable.
It's usually young and usually not long for this world. Its reference-
type fields, if it even has any, tend to refer to older objects. It plays
very nicely with garbage collection, and if the GC touches only live
objects rather than garbage, it plays very nicely with paging too.

On the other, we have the large long-lived collection. It's typically a
HashFoo or ArrayList, less commonly a TreeFoo or LinkedList. It usually
contains a large array that's a big, contiguous block of memory full of
references. These tend to include lots of references to much-younger
objects. It tends to slow the GC down a lot because of these references,
and it plays poorly with paging because the Collection object itself, the
array, and the elements get scattered all over the heap. The occasional
replacement of the array with a bigger array certainly doesn't help.

Interestingly, this suggests an alternative answer to our array issues
that may seem very counterintuitive at first. Instead of adding longer
arrays across the board, notice that arrays really serve two purposes.

On the one hand, arrays of primitives serve various numeric and data-
storage purposes (including backing Strings and being IO buffers). On the
other, arrays of reference types mostly seem to serve as the backing
structures of Collections, or else as poor-man's Collections themselves.
These are two quite different uses with different requirements.

Arrays of primitives divide into three subgroups: char arrays backing
Strings, byte array IO buffers, and numeric arrays used for numerics. The
third has the strongest case for needing to be able to grow huge and
retain high performance, so let's let arrays of int, double, long, and
float get huge.

Strings are really a peculiar form of immutable List that doesn't
actually implement the List interface; they should be treated as oddball
Collections.

IO buffers should never be very large, as a rule. The main exception is
media playback, where buffering smooths over hiccups in disk (or
network!) traffic and meets playback's hard realtime requirements,
avoiding visible stuttering, pausing, glitching, skipping, framerate
fluctuation, and other artifacts that degrade the user's enjoyment of the
media. Byte arrays with the existing size limit seem to serve these
purposes well enough.

And now, Collections. Collection access is usually not terribly speed
critical. A few interface call overheads and the like are generally
accepted with their use, after all.

Do Collections really need to be backed by large single arrays? Or would
performance remain adequate if they were backed by smaller arrays that
sometimes formed treelike structures?

Performance with an array-tree obviously gets a factor of K log N
multiplied into the space and time overhead, but the constant K is very
small if the maximum single array size is even 1000 cells: that's a few
KB of references, so each array roughly fits within a single 4KB page on
a typical system. In fact, with 1000 cells before splitting, K ends up
being 1/log 1000, so fairly small. Put another way, the tree's depth is
the overhead factor, and for collections of 1 billion elements, almost
half the present de facto size limit, this overhead factor is just 3.
That is acceptable in code where we are working at a fairly high level
of abstraction and tend not to need blazing speed anyway.
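
As a rough sketch of what such an array-tree might look like (a
hypothetical, deliberately unoptimized two-level version with a fixed
chunk size; not a proposal for the real collections library):

import java.util.ArrayList;

// Hypothetical sketch: a list backed by many small arrays ("chunks")
// instead of one huge contiguous array. Two levels only; a real
// implementation would nest further (a tree) once the spine itself
// grew large.
public class ChunkedList<E> {
    private static final int CHUNK_SIZE = 1000; // roughly a page of refs

    private final ArrayList<Object[]> chunks = new ArrayList<Object[]>();
    private int size = 0;

    public void add(E element) {
        int offset = size % CHUNK_SIZE;
        if (offset == 0) {
            chunks.add(new Object[CHUNK_SIZE]); // grow by one small chunk
        }
        chunks.get(size / CHUNK_SIZE)[offset] = element;
        size++;
    }

    @SuppressWarnings("unchecked")
    public E get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException(String.valueOf(index));
        }
        return (E) chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public int size() {
        return size;
    }
}

The point is that the elements never require one huge contiguous
allocation; only 1000-cell arrays plus a small spine, and a real
implementation would make the spine itself a tree of such arrays.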

So it might actually be viable to stop using arrays of references (and
arrays of char) larger than 1000, and the resulting data structures might
be only a little slower to use (and only when bigger than 1000 elements).
But they can also become a lot friendlier to the GC and paging system.
For one thing, no single contiguous huge array needs to have room found
for it by the GC. If only large arrays of non-char primitives are still
needed, these can be special-cased by the GC (for instance, they cannot
contain references to other objects).

Everything else in the heap is then small enough to fit in small
collection spaces and in single OS paging system pages, including the
bits of Collections and larger Strings. (The substring aliasing problem
with GC also is ameliorated. Small substrings (<= 1000 characters) of
large Strings would usually be able to get away with only holding a
reference to a single at-most-1000-element array of char, and the rest of
the time to two such arrays plus an at-most-1000-element array of
references.)

There's still the matter of organizing the bits of a Collection (or
String) to optimize GC and paging. Collection-backing arrays will still
be full of references to younger objects, for example. But if neighboring
objects in a collection tend to have similar lifetimes and to be used
together, the array segment referencing them and the objects themselves
could perhaps be kept together.

Ultimately it might be useful to make the GC distinguish among three
broad categories of objects: large primitive arrays and Strings (big, no
outbound references); large Collections (big; many references to younger
objects); and everything else (mostly small; few references to younger
objects). It can then optimize for these cases in some way.

I think G1 will reduce the problems that come from Java GC interacting
with paging (such as GC hauling the entire heap image into memory at
once, causing a long pause as the OS pages gigs of data from disk).
Further improvements to Java's GC and paging performance may require
research into how large Collections, in particular, interact with GC and
paging. Large Collections have a host of features that combine to make
them impact these processes severely:

* They tend to be long-lived but refer to younger objects, which
hurts the GC.
* They tend to contain a large, contiguous array that cannot be
fragmented, limiting the GC's options. (Using arrays of size only up
to 1000 cells in collection implementations would alleviate this.)
* If traversed or random-accessed, this array will have to be paged
entirely into memory. Objects accessed at similar times won't tend to
have similar hashCode values, so large HashMaps may perform poorly when
partially paged out. GC traversal during collection will force a full
page-in. Rehashing algorithms that jump all over the array looking for
a free cell will force a full page-in, though the load factor gimmick
somewhat reins in rehashing.
* The array is occasionally copied into a bigger array, meaning more RAM
juggling by the allocator and large contiguous chunks of garbage for the
GC to eat. This also makes the Collection object itself older than the
array AND the array older than many of its cell referents at a "typical"
time. The array growth events are rare, especially at larger sizes, but
the reference-age issue is continual.
* The Collection object, its array, its elements, and possibly other
objects referenced by the Collection object, such as a Comparator, are
scattered about the heap. The array becomes garbage if the Collection
does, but they're not going to be in the same collection space. Likely
the contents also become garbage at this time, most of them anyway.
They also typically need to be paged in together, but are scattered.

There are two higher-level mitigation strategies possible.

* Programmers can consider replacing HashMaps and similar data structures
that become very large and become performance bottlenecks with more
hierarchical data structures that keep related, accessed-together and
similar-lifespan objects in one substructure. The entire substructure
and many of its parts may tend to be swapped in and out together and,
eventually, to become garbage together.

Taken to an extreme, the substructures might be made immutable using
Collections.unmodifiableFoo, at least those whose membership is going
to be constant or near-constant over their lifetimes (if only near-
constant, a substructure gets copied-with-mutation). This drastically
cuts down on old-to-young references.

* New Collection classes can be made that have a tree-structure, and
possibly use not only small but *immutable* arrays and substructures
under the hood, again solving the old-to-young reference issue;
these could also help the programmer keep related (used-together and
will-die-together) objects in substructures. If the objects and the
substructure are all allocated at close to the same time, they may tend
to be close in RAM and in one collection space, making the whole lot GC
and page as one without pain. (A rough sketch of this approach follows
below.)
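
A rough sketch of these ideas, assuming the records that live and die
together can be grouped by some key (all names here are hypothetical):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: instead of one enormous flat collection, group
// records that are created, used, and discarded together into small,
// unmodifiable sub-lists keyed by their group. "Mutation" replaces just
// the affected sub-list with a fresh copy, so an old sub-list and its
// members can become garbage as a unit, and no old sub-list is ever
// mutated to point at a younger object.
public class GroupedStore<K, V> {
    private final Map<K, List<V>> groups = new HashMap<K, List<V>>();

    public List<V> get(K groupKey) {
        List<V> group = groups.get(groupKey);
        return group != null ? group : Collections.<V>emptyList();
    }

    // Copy-with-mutation: build a new small list, freeze it, swap it in.
    public void add(K groupKey, V value) {
        List<V> copy = new ArrayList<V>(get(groupKey));
        copy.add(value);
        groups.put(groupKey, Collections.unmodifiableList(copy));
    }

    public void removeGroup(K groupKey) {
        groups.remove(groupKey); // whole substructure dies together
    }
}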

I will now point out that Clojure has collections that internally use
tree structures and immutable arrays, so that all the "modifying"
operations return copies rather than mutate the original. The copies
share most of the original's internal structure, so the memory overhead
and copying costs of this aren't nearly as bad as it first sounds. So
using Clojure, using the clojure.lang collection classes from Java, or
making a new Java library of similar data structures and using that could
reduce paging and GC times in your applications.
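
For instance, something along these lines works from plain Java once
the Clojure jar is on the classpath (a sketch from memory of the
clojure.lang API; check it against the version you're using):

import clojure.lang.PersistentVector;

// Sketch: using Clojure's persistent (immutable, structurally shared)
// vector directly from Java. Every "modification" returns a new vector
// that shares most of its internal tree of small arrays with the old
// one.
public class PersistentVectorDemo {
    public static void main(String[] args) {
        PersistentVector v1 = PersistentVector.EMPTY;
        PersistentVector v2 = v1.cons("first");  // returns a new vector
        PersistentVector v3 = v2.cons("second"); // v1 and v2 untouched

        System.out.println(v1.count()); // 0
        System.out.println(v3.count()); // 2
        System.out.println(v3.nth(1));  // second
    }
}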
From: Tom Anderson on
On Fri, 9 Jul 2010, Patricia Shanahan wrote:

> On 7/9/2010 12:45 PM, Tom Anderson wrote:
>> On Thu, 8 Jul 2010, Eric Sosman wrote:
>>
>>> Or, you could have BigList implement List but "lie" in its .size()
>>> method, in somewhat the same way TreeSet "lies" about the Set contract.
>>
>> How does TreeSet lie about the Set contract?
>
> The case I'm aware of involves a TreeSet with a Comparator, that is not
> consistent with the .equals methods of the TreeSet elements. The TreeSet
> always goes by the Comparator results. That means the TreeSet could
> contain elements a and b such that a.equals(b).

True.

Though that feels more like "TreeSet requires its Comparator to be
consistent with equals" than "TreeSet lies about the Set contract".

If i write this:

import java.util.HashSet;
import java.util.Set;

class MakeAHashOfThings {
    public int h;
    public int hashCode() {
        return h;
    }
}

// ...then, somewhere in a method:
Set s = new HashSet();
MakeAHashOfThings o = new MakeAHashOfThings();
o.h = 1;
s.add(o);   // o goes into the bucket for hash 1
o.h = 2;
s.add(o);   // o now hashes to 2, so it lands in a second bucket

Is HashSet now breaking the Set contract?

A contract places obligations on both parties. The Set contract requires
the implementation not to contain multiple equal items. But the TreeSet
and HashSet contracts (and classes do constitute their own contracts,
which one must agree to in order to construct them) require the calling
code to use valid Comparators and hashCodes. If the calling code violates
the terms of the contract, then the whole thing is null and void anyway.

tom

--
All bloggers must die.
From: Patricia Shanahan on
Tom Anderson wrote:
> On Fri, 9 Jul 2010, Patricia Shanahan wrote:
>
>> On 7/9/2010 12:45 PM, Tom Anderson wrote:
>>> On Thu, 8 Jul 2010, Eric Sosman wrote:
>>>
>>>> Or, you could have BigList implement List but "lie" in its .size()
>>>> method, in somewhat the same way TreeSet "lies" about the Set contract.
>>>
>>> How does TreeSet lie about the Set contract?
>>
>> The case I'm aware of involves a TreeSet with a Comparator, that is
>> not consistent with the .equals methods of the TreeSet elements. The
>> TreeSet always goes by the Comparator results. That means the TreeSet
>> could contain elements a and b such that a.equals(b).
>
> True.
>
> Though that feels more like "TreeSet requires its Comparator to be
> consistent with equals" than "TreeSet lies about the Set contract".
>
> If i write this:
>
> class MakeAHashOfThings {
>     public int h;
>     public int hashCode() {
>         return h;
>     }
> }
>
> Set s = new HashSet();
> MakeAHashOfThings o = new MakeAHashOfThings();
> o.h = 1;
> s.add(o);
> o.h = 2;
> s.add(o);
>
> Is HashSet now breaking the Set contract?
>
> A contract places obligations on both parties. The Set contract requires
> the implementation not to contain multiple equal items. But the TreeSet
> and HashSet contracts (and classes do constitute their own contracts,
> which one must agree to in order to construct them) require the calling
> code to use valid Comparators and hashCodes. If the calling code
> violates the terms of the contract, then the whole thing is null and
> void anyway.

But TreeSet does not do completely wild, uncontracted things given an
inconsistency, the way HashSet does given a hash that is inconsistent
with equals. TreeSet is perfectly usable given a Comparator that is
internally consistent, even if it is not consistent with equals.

Another way of looking at this is "TreeSet acts as though its Comparator
is consistent with equals".
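
For instance (an illustrative sketch; the identity-hash comparator below
is internally consistent, though a hash collision could in principle
merge the two elements):

import java.util.Comparator;
import java.util.TreeSet;

// Sketch: a Comparator that is internally consistent but not consistent
// with equals. The TreeSet happily holds two elements that are equal
// according to String.equals.
public class InconsistentComparatorDemo {
    public static void main(String[] args) {
        Comparator<String> byIdentity = new Comparator<String>() {
            public int compare(String a, String b) {
                int ha = System.identityHashCode(a);
                int hb = System.identityHashCode(b);
                return ha < hb ? -1 : (ha == hb ? 0 : 1);
            }
        };

        TreeSet<String> set = new TreeSet<String>(byIdentity);
        String a = new String("hello");
        String b = new String("hello"); // a.equals(b) is true

        set.add(a);
        set.add(b);

        // Almost certainly prints 2: a "Set" holding two equal elements.
        System.out.println(set.size());
    }
}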

Patricia
From: Arne Vajhøj on
On 10-07-2010 02:27, ClassCastException wrote:
> On Fri, 09 Jul 2010 21:57:23 -0400, Arne Vajhøj wrote:
>
>> On 09-07-2010 08:15, Eric Sosman wrote:
>>> On 7/8/2010 9:11 PM, Patricia Shanahan wrote:
>>>> Arne Vajhøj wrote:
>>>>> On 08-07-2010 17:35, Boris Punk wrote:
>>>>>> Integer.MAX_VALUE = 2147483647
>>>>>>
>>>>>> I might need more items than that. I probably won't, but it's nice
>>>>>> to have
>>>>>> extensibility.
>>>>>
>>>>> It is a lot of data.
>>>>>
>>>>> I think you should assume YAGNI.
>>>>
>>>>
>>>> Historically, each memory size has gone through a sequence of stages:
>>>>
>>>> 1. Nobody will ever need more than X bytes.
>>>>
>>>> 2. Some people do need to run multiple jobs that need a total of more
>>>> than X bytes, but no one job could possibly need that much.
>>>>
>>>> 3. Some jobs do need more than X bytes, but no one data structure
>>>> could possibly need that much.
>>>>
>>>> 4. Some data structures do need more than X bytes.
>>>>
>>>> Any particular reason to believe 32 bit addressing will stick at stage
>>>> 3, and not follow the normal progression to stage 4?
>>>
>>> None. But Java's int isn't going to grow wider, nor will the type of an
>>> array's .length suddenly become non-int; too much code would break.
>>> When Java reaches the 31-bit wall, I doubt it will find any convenient
>>> door; Java's descendants may pass through, but I think Java will remain
>>> stuck on this side.
>>>
>>> In ten years, we'll all have jobs converting "legacy Java code" to
>>> Sumatra.
>>
>> If Java gets 20 years as "it" and 20 years as "legacy", then that would
>> actually be more than OK.
>>
>> Things evolve and sometimes it is better to start with a blank sheet of
>> paper.
>>
>> 64 bit array indexes, functions as first class type, bigint and
>> bigdecimal as language types etc..
>
> Clojure has all of this already except 64 bit array indexes and runs on
> the JVM.
>
> Clojure doesn't even have arrays, though,

I don't think Clojure has what it takes to become a
mainstream language.

Arne
From: ClassCastException on
On Mon, 12 Jul 2010 21:56:48 -0400, Arne Vajhøj wrote:

> On 10-07-2010 02:27, ClassCastException wrote:
>> On Fri, 09 Jul 2010 21:57:23 -0400, Arne Vajhøj wrote:
>>
>>> If Java gets 20 years as "it" and 20 years as "legacy", then that would
>>> actually be more than OK.
>>>
>>> Things evolve and sometimes it is better to start with a blank sheet
>>> of paper.
>>>
>>> 64 bit array indexes, functions as first class type, bigint and
>>> bigdecimal as language types etc..
>>
>> Clojure has all of this already except 64 bit array indexes and runs on
>> the JVM.
>>
>> Clojure doesn't even have arrays, though,
>
> I don't think Clojure has what it takes to become a mainstream language.

Why?