XPath, DOM and multithreading [Java Programming]


From: Tom Anderson on 18 May 2010 17:17

On Tue, 18 May 2010, Daniel Pitts wrote:

> On 5/16/2010 6:36 AM, FrenKy wrote:
>
>> can someone please suggest a thread-safe DOM implementation with support
>> for XPath for reading XML files?
>>
>> Or if someone has a good source of hints on how to make some DOM
>> implementation thread safe...
>>
>> Thanks in advance!
>
> XPath itself isn't multithread safe.

XPath is a language - threadsafety is not a property it can have or lack.
I presume what you mean is that the XPath implementation is not threadsafe
- but since we don't know what the implementation in use here is, that's
an interesting statement. I imagine you mean one or more of (a) the XPath
implementation is not required to be threadsafe, so one shouldn't build
software that requires it to be, (b) you (Daniel) know or strongly suspect
which implementation is in use, and know it isn't thread safe, (c) there
are no XPath implementations which are threadsafe, or (d) it is impossible
for there to be an XPath implementation which is threadsafe. Could you
elaborate?

If you don't have any concrete information about the threadsafety of your
XPath implementation, it might be worth doing some basic stuff to ward off
threading bugs. Make sure that there are memory barriers between the last
write to the DOM tree by any thread and the reads that all the worker
threads are doing. One way to do this would be for the workers to queue up
by calling await() on a CountDownLatch set up with a count of 1, which the
parser thread then releases by calling countDown() on the latch. If you do
that and still get problems, then you know that the XPath implementation
is mutating the heap even when doing read-only operations, at which point
it's probably safe to conclude that this XPath implementation isn't going
to cut it for you.
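
A minimal sketch of that latch handshake, assuming the JDK's bundled JAXP parser. The class name, the sample XML and the countItems/countElements helpers are made up for illustration. The call that switches off deferred node expansion is a Xerces-specific feature and an assumption about the parser in use (Xerces-derived parsers expand DOM nodes lazily by default, so even plain reads can write to the tree); it is skipped if the parser rejects it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LatchedDomReaders {
    // Hypothetical sample input; a real program would parse a file instead.
    static final String XML = "<items><item>a</item><item>b</item><item>c</item></items>";

    /** Starts the workers first, parses on the calling thread, then opens the latch. */
    static List<Integer> countItems(int workers) {
        CountDownLatch domReady = new CountDownLatch(1);
        Document[] holder = new Document[1]; // plain array slot; the latch provides visibility
        int[] counts = new int[workers];
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                try {
                    domReady.await(); // blocks until the parser thread calls countDown()
                    counts[slot] = countElements(holder[0], "item");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            holder[0] = parse(XML); // last write to the tree happens here
            domReady.countDown();   // everything written above is visible after await() returns
            for (Thread t : threads) {
                t.join();           // join() makes each worker's result visible to this thread
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<Integer> result = new ArrayList<>();
        for (int c : counts) {
            result.add(c);
        }
        return result;
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            // Assumption: Xerces-derived parser. Build the whole tree eagerly.
            factory.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", false);
        } catch (ParserConfigurationException unsupported) {
            // Some other parser: carry on with its defaults.
        }
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Read-only walk using only firstChild/nextSibling links. */
    static int countElements(Node node, String name) {
        int n = (node.getNodeType() == Node.ELEMENT_NODE && name.equals(node.getNodeName())) ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            n += countElements(child, name);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countItems(4));
    }
}
```

The latch only settles the visibility question. It does nothing for a DOM or XPath implementation that writes to shared state while reading, which is exactly the case the test is meant to expose.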

> Are you sure you need multithreading for your use-case? If you have
> something that is that performance intensive, perhaps a different
> approach is called for

Presumably, if he's throwing >100 CPUs at it, it's because doing it
singlethreaded would take too long.

But ...

> (StAX/SAX based parsing of the XML file, Building
> a domain object graph instead of a DOM, etc...)

This sounds like a good idea to me. A problem big enough to need >100 CPUs
working on it is big enough to be worth expressing in an efficient form -
I believe DOM implementations are generally deeply inefficient internally.
Lots of linked lists and other pessimicity. Your own model could be more
efficient, and also threadsafe (which after all is not hard to achieve for
read-only data).
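
For what it's worth, a rough sketch of that approach: a StAX pass that builds immutable objects instead of a DOM. The Item type, the element and attribute names and the sample XML are all hypothetical; the real model depends on the data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxDomainModel {
    /** Hypothetical domain type. All fields are final, so instances never change. */
    static final class Item {
        final String id;
        final String text;

        Item(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Streams through the XML once and returns an unmodifiable list of items. */
    static List<Item> parse(String xml) {
        List<Item> items = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "item".equals(reader.getLocalName())) {
                    String id = reader.getAttributeValue(null, "id");
                    String text = reader.getElementText(); // leaves the reader on the end tag
                    items.add(new Item(id, text));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return Collections.unmodifiableList(items);
    }

    public static void main(String[] args) {
        List<Item> items = parse("<items><item id=\"1\">a</item><item id=\"2\">b</item></items>");
        for (Item item : items) {
            System.out.println(item.id + " = " + item.text);
        }
    }
}
```

Nothing mutates the list or its items after parse() returns, so worker threads started afterwards can read them without locks.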

tom

--
I KNOW WAHT IM TALKING ABOUT SO LISTAN UP AND LISTEN GOOD BECUASE ITS
TIEM TO DROP SOME SCIENTISTS ON YUO!!! -- Jeff K

From: FrenKy on 18 May 2010 18:53

On 18.5.2010 23:17, Tom Anderson wrote:
> On Tue, 18 May 2010, Daniel Pitts wrote:
>
>> On 5/16/2010 6:36 AM, FrenKy wrote:
>>
>>> can someone please suggest a thread-safe DOM implementation with support
>>> for XPath for reading XML files?
>>>
>>> Or if someone has a good source of hints on how to make some DOM
>>> implementation thread safe...
>>>
>>> Thanks in advance!
>>
>> XPath itself isn't multithread safe.
>
> XPath is a language - threadsafety is not a property it can have or
> lack. I presume what you mean is that the XPath implementation is not
> threadsafe - but since we don't know what the implementation in use here
> is, that's an interesting statement. I imagine you mean one or more of
> (a) the XPath implementation is not required to be threadsafe, so one
> shouldn't build software that requires it to be, (b) you (Daniel) know
> or strongly suspect which implementation is in use, and know it isn't
> thread safe, (c) there are no XPath implementations which are
> threadsafe, or (d) it is impossible for there to be an XPath
> implementation which is threadsafe. Could you elaborate?
>
> If you don't have any concrete information about the threadsafety of
> your XPath implementation, it might be worth doing some basic stuff to
> ward off threading bugs. Make sure that there are memory barriers
> between the last write to the DOM tree by any thread and the reads that
> all the worker threads are doing. One way to do this would be for the
> workers to queue up by calling await() on a CountDownLatch set up with a
> count of 1, which the parser thread then releases by calling countDown()
> on the latch. If you do that and still get problems, then you know that
> the XPath implementation is mutating the heap even when doing read-only
> operations, at which point it's probably safe to conclude that this
> XPath implementation isn't going to cut it for you.
>
>> Are you sure you need multithreading for your use-case? If you have
>> something that is that performance intensive, perhaps a different
>> approach is called for
>
> Presumably, if he's throwing >100 CPUs at it, it's because doing it
> singlethreaded would take too long.
>
> But ...
>
>> (StAX/SAX based parsing of the XML file, Building a domain object
>> graph instead of a DOM, etc...)
>
> This sounds like a good idea to me. A problem big enough to need >100
> CPUs working on it is big enough to be worth expressing in an efficient
> form - I believe DOM implementations are generally deeply inefficient
> internally. Lots of linked lists and other pessimicity. Your own model
> could be more efficient, and also threadsafe (which after all is not
> hard to achieve for read-only data).
>
> tom
>

Thanks to all, guys. I have some thinking to do.

If I have some more questions, I'll try to post them in SSCCE form ;)

From: Mike Schilling on 18 May 2010 20:34

FrenKy wrote:
> On 17.5.2010 15:48, Lew wrote:
>> What part do you want to be thread safe, parsing the XML document or
>> accessing the DOM that results?
>>
>> The former is almost certainly not practicable. The second boils
>> down to what you do to make any object model thread safe.
>>
>> --
>
> I'm building the DOM in a single thread and then I'm reading it in
> several threads (usually no more than 20, depending on the number of
> CPUs, e.g. I'm running it sometimes on 100+ CPU machines).
> But sometimes (seldom) I get a NullPointerException in the most
> unexpected places during read operations... But _always_ when I've
> already built the XML. Threads are started after the XML file is built.
> So I figured I'm doing something wrong with multithreading and sync.
> The same application run in single-threaded mode does not throw
> NullPointerException.

Are you using Xerces for XPath? It builds another representation of the DOM
(called a DTM) on which to run the XPath expressions, and it builds it
incrementally. Thus, even if the DOM is fully built, and thus safe to
traverse, running XPath on it in two threads can result in exceptions. You
should be OK if you

1. Don't access a DOM until it's fully built.
2. Synchronize all use of XPath.
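
A sketch of those two rules using the standard javax.xml.xpath API. The wrapper class, its number() method and the sample XML are invented for illustration, and nothing here is specific to any one XPath engine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SynchronizedXPath {
    private final Document doc;   // rule 1: handed over only once fully built
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final Object lock = new Object();

    SynchronizedXPath(Document doc) {
        this.doc = doc;
    }

    /** Rule 2: every evaluation goes through the same lock. */
    double number(String expression) {
        synchronized (lock) {
            try {
                return (Double) xpath.evaluate(expression, doc, XPathConstants.NUMBER);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses first, then starts the workers, so they never see a half-built tree. */
    static double[] demo(int workers) {
        SynchronizedXPath shared =
                new SynchronizedXPath(parse("<items><item>a</item><item>b</item><item>c</item></items>"));
        double[] results = new double[workers];
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> results[slot] = shared.number("count(//item)"));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (double d : demo(4)) {
            System.out.println(d);
        }
    }
}
```

The obvious cost is that XPath evaluation is serialized, so this only pays off if the threads spend most of their time on something else.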

From: Joshua Cranmer on 18 May 2010 22:35

On 05/18/2010 08:34 PM, Mike Schilling wrote:
> Are you using Xerces for XPath? It builds another representation of the DOM
> (called a DTM) on which to run the XPath expressions, and it builds it
> incrementally. Thus, even if the DOM is fully built, and thus safe to
> traverse, running XPath on it in two threads can result in exceptions. You
> should be OK if you
>
> 1. Don't access a DOM until it's fully built.
> 2. Synchronize all use of XPath.

I don't know how the Xerces API works, but the DTM representation should
be local to the XPath object; if so, then creating a new object per
thread should do the trick.
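
Something along these lines, using the standard javax.xml.xpath API and a ThreadLocal. The class and method names are made up, and whether this is enough depends on the assumption above about where the engine keeps its state.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PerThreadXPath {
    // Each thread lazily creates its own factory and XPath object on first use.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());

    /** Evaluates with the calling thread's private XPath object. */
    static double countItems(Document doc) {
        try {
            return (Double) XPATH.get().evaluate("count(//item)", doc, XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Shows that N threads end up with N separate XPath objects. */
    static int distinctInstances(int threads) {
        Set<XPath> seen = Collections.synchronizedSet(
                Collections.newSetFromMap(new IdentityHashMap<XPath, Boolean>()));
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> seen.add(XPATH.get()));
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countItems(parse("<items><item/><item/></items>")));
        System.out.println(distinctInstances(3));
    }
}
```

Each worker would call countItems(doc) on the shared document. This removes sharing of the XPath object only; the threads still read the same DOM, so it does not help if the DOM implementation itself is unsafe for concurrent reads.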

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

From: Mike Schilling on 19 May 2010 03:13

Joshua Cranmer wrote:
> On 05/18/2010 08:34 PM, Mike Schilling wrote:
>> Are you using Xerces for XPath? It builds another representation of
>> the DOM (called a DTM) on which to run the XPath expressions, and it
>> builds it incrementally. Thus, even if the DOM is fully built, and
>> thus safe to traverse, running XPath on it in two threads can result
>> in exceptions. You should be OK if you
>>
>> 1. Don't access a DOM until it's fully built.
>> 2. Synchronize all use of XPath.
>
> I don't know how the Xerces API works, but the DTM representation
> should be local to the XPath object; if so, then creating a new
> object per thread should do the trick.

It's actually local to the XPathContext object. That allows more choices,
like having multiple contexts, which would create multiple DTMs per DOM.
The best choice depends on your usage pattern: how many DOMs, how long
they're active for, how many threads each DOM is used in, etc.
