From: neuneudr on
Hi all,

On O'Reilly's website there's a column about implementing
multi-threaded algorithms, with an example of quicksort in Java.

I'm scratching my head on something...

May Column: Multi-threaded Algorithm Implementations

http://broadcast.oreilly.com/2009/06/may-column-multithreaded-algor.html

I read the column and thought "that ain't working", then downloaded
the code.

Here's the array to be sorted:

/** Elements to be sorted. */
@SuppressWarnings("unchecked")
final Comparable[] ar;

So it's a regular array.

I fail to see what guarantees are made in the sorting code
that swaps made by spawned threads are visible to someone
who would be observing the [].

Say we have 8,7,6,5,4,3,2,1 and the code makes swaps in the
array so that it's now 4,3,2,1,8,7,6,5, then asks two threads
to take care of the 4,3,2,1 part and the 8,7,6,5 part respectively.

I agree that the spawned threads are going to see correctly what
they should. But what guarantees are there that an observer will
indeed see the array sorted at the end?
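
The pattern I'm worried about is basically this (my own minimal
sketch, not the column's code):

// My own sketch of the pattern in question: a plain array, plain writes
// from spawned threads, and no synchronization anywhere.
class PlainArraySketch {
    public static void main(String[] args) {
        final Comparable[] ar = { 4, 3, 2, 1, 8, 7, 6, 5 };

        new Thread(() -> java.util.Arrays.sort(ar, 0, 4)).start();  // sorts the 4,3,2,1 part
        new Thread(() -> java.util.Arrays.sort(ar, 4, 8)).start();  // sorts the 8,7,6,5 part

        // What, if anything, guarantees that a read of ar[] here (or later,
        // from yet another thread) sees the writes those two threads made?
    }
}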

To me, for this to work the code needs to be changed; the following
doesn't seem correct:

final Comparable[] ar;

I'd replace it with:

final AtomicReferenceArray<Comparable> ar;

To me you can't simply spawn threads, pass them
a reference to an [] plus some indexes, and tell them "go ahead,
swap and re-arrange the [] as you like" and hope that the
changes are going to be visible. Java makes no such
guarantees, right!?

An AtomicReferenceArray is guaranteed to work, but it incurs
a performance hit compared to the [] version.
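
Something like this is what I have in mind (just a sketch of my own,
not the column's code):

import java.util.concurrent.atomic.AtomicReferenceArray;

// My own sketch: every element access goes through the AtomicReferenceArray,
// which gives volatile read/write semantics per element. (The swap as a whole
// still isn't atomic, but in quicksort each thread owns its own partition.)
class AtomicSwapSketch {
    static void swap(AtomicReferenceArray<Comparable> ar, int i, int j) {
        Comparable tmp = ar.get(i);
        ar.set(i, ar.get(j));
        ar.set(j, tmp);
    }

    public static void main(String[] args) {
        AtomicReferenceArray<Comparable> ar =
                new AtomicReferenceArray<Comparable>(new Comparable[] { 8, 7, 6, 5 });
        swap(ar, 0, 3);
        System.out.println(ar);   // prints [5, 7, 6, 8]
    }
}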

But then I guess I don't understand anything about the Java
memory model or about the code posted (there's a tiny code.zip
file; once you keep only the .java files related to quicksort
it's tiny).

Thoughts?

From: neuneudr on
re all,

adding to my own question... At the end of each spawned thread,
we can see this:

new Thread() {
    public void run() {
        // invoke single-thread qsort
        qsortN(left, pivotIndex - 1);
        synchronized (helpRequestedMutex) {
            helpersWorking--;
        }
    }
}.start();

Is the synchronization on the mutex at the end of each
sort enough to guarantee that changes made to the array
are indeed visible (and hence that the code is guaranteed
to work, given that due to quicksort's nature each thread
is working on a different part of the array)?

From: neuneudr on

Here's the code...

http://pastebin.com/m41b42a91

And even if synchronizing on 'helpRequestedMutex' were
enough (would it be?), I fail to see how the caller
can synchronize on it too.
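
If I understand the JMM right, for the mutex to buy the caller
anything, the caller would have to take the same lock when it
reads, something like this (my own sketch, not the column's code):

// My own sketch, not the column's code: the worker publishes under the lock,
// and the caller reads under the *same* lock, which is what creates the
// happens-before edge that makes the array writes visible to the caller.
class MutexPairingSketch {
    final Object helpRequestedMutex = new Object();
    int helpersWorking = 1;                  // plain int, guarded by the mutex

    void workerDone() {
        synchronized (helpRequestedMutex) {
            helpersWorking--;                // the unlock publishes the array writes too
            helpRequestedMutex.notifyAll();  // wake the waiting caller
        }
    }

    void awaitHelpers() throws InterruptedException {
        synchronized (helpRequestedMutex) {  // same lock -> happens-before with workerDone()
            while (helpersWorking > 0) {
                helpRequestedMutex.wait();
            }
        }
    }
}
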
From: markspace on
neuneudr(a)yahoo.fr wrote:

> I'm scratching my head on someting...
>
> May Column: Multi-threaded Algorithm Implementations
>
> http://broadcast.oreilly.com/2009/06/may-column-multithreaded-algor.html


I'm scratching my head also. Did you download the source code? I don't
see at least one whole package of his. His package:

algs.model.array

just seems to be missing. It's not in the bin directory either. Broken
code, not tested. Hmm, maybe his book isn't that great either.
From: Peter Duniho on
neuneudr(a)yahoo.fr wrote:
> Here's the code...
>
> http://pastebin.com/m41b42a91
>
> And even if synchronizing on 'helpRequestedMutex' would
> be enough (would it be?) I fail to see how the caller
> can synchronize on it too.

The top-level code waits on the volatile variable "helpersWorking"; that
variable is modified only after each worker thread completes its effort,
and the wait ends only once the workers have decremented it to 0. The use
of "volatile" preserves ordering of the operations, ensuring the main
thread does not return until all of the sorting has in fact been completed.

Note that "volatile" affects not just the variable used, but relative
ordering within the thread altogether. _All_ writes that occur before
the write to the volatile variable (i.e. including those to the array
being sorted) will be seen by any thread that sees the write to the
volatile variable.

(It also works the other way: if you read a volatile variable, you are
assured that all reads that occur later in code order will actually
happen after the read of the volatile variable).
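
To make that concrete, here's a minimal stand-alone sketch (my own code
and names, not the column's) of how a worker's plain array writes get
published by a volatile write:

// Minimal sketch (my own code): the worker's plain writes to the array are
// published by its volatile write, and become visible to any thread that
// subsequently reads the volatile and sees that write.
class VolatileVisibilitySketch {
    static final int[] data = { 2, 1 };
    static volatile int helpersWorking = 1;

    public static void main(String[] args) {
        new Thread(() -> {
            int tmp = data[0];               // plain (non-volatile) writes...
            data[0] = data[1];
            data[1] = tmp;
            helpersWorking = 0;              // ...published by this volatile write
        }).start();

        while (helpersWorking > 0) {         // volatile read; spins, as the column's code does
            // busy-wait
        }
        // Having seen helpersWorking == 0, the swap above is guaranteed visible:
        System.out.println(data[0] + "," + data[1]);   // prints 1,2
    }
}

In the column's code the decrement also happens inside the synchronized
block, as quoted earlier in the thread, but it's the volatile read by the
spinning main thread that provides the visibility guarantee described above.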

That said, the code isn't defect-free. There's a relatively benign race
condition that hasn't been handled properly, which can result in the
number of worker threads exceeding the desired maximum limit. Using an
AtomicInteger instead of the clumsy "object as mutex" would allow that
to be easily fixed.
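
For instance, something along these lines would close that race (a sketch
only; the class, field and method names here are mine, not the column's):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: track the helper count with an AtomicInteger so that the
// "am I under the limit?" check and the increment cannot be interleaved.
class HelperBudget {
    private final AtomicInteger helpersWorking = new AtomicInteger(0);
    private final int maxHelpers;

    HelperBudget(int maxHelpers) {
        this.maxHelpers = maxHelpers;
    }

    /** Atomically reserve a helper slot; returns false if the limit is reached. */
    boolean tryAcquire() {
        while (true) {
            int current = helpersWorking.get();
            if (current >= maxHelpers) {
                return false;              // at the limit: sort in the current thread
            }
            if (helpersWorking.compareAndSet(current, current + 1)) {
                return true;               // slot reserved: safe to spawn a thread
            }
            // another thread raced us; re-read the count and retry
        }
    }

    /** Called by a helper thread when its portion of the array is sorted. */
    void release() {
        helpersWorking.decrementAndGet();
    }

    /** True once every helper that was started has finished. */
    boolean allDone() {
        return helpersWorking.get() == 0;
    }
}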

A more serious problem is that his top-level thread spins waiting for
the "helpersWorking" variable to reach zero. So one thread is
completely consumed just _monitoring_ the progress. This is a terrible
use of CPU time, just when one is trying to take full advantage of a
multi-core system. The fewer worker threads there are, the more
noticeable this problem will be (to the point of possibly completely
occupying all of the CPU time for a whole CPU core).

Ironically, on that last point, another defect in the code makes the
spin-wait of the main thread less of a problem relative to the total
throughput of _this_ code. Ideally the number of active worker threads
would equal the number of CPU cores, to minimize context-switching
overhead, but his code sets the number of threads to some arbitrarily
high number based on the size of the input and a user-selected ratio.
With potentially lots of threads all contending for the CPU, most of
which actually do some work when they eventually get it, the one thread
that is also contending for the CPU but not doing useful work winds up
not mattering as much, as long as there's a large number of threads
(which for his code, there usually is).

He seems satisfied by what looks to me to be a relatively marginal
performance improvement. That is, sure...the code runs faster, but not
nearly as much faster as it probably could with a better implementation.

A better approach to the termination condition would be to use a
CountDownLatch, so that the main thread can simply block and not use CPU
time until the worker threads are actually done. That change, in
conjunction with limiting the number of threads to the number of CPU
cores, would probably produce a significant performance improvement over
what his code already shows.

Pete