large XML files [Java Programming]

Prev: object relational database versus "inteligent" serialization
Next: Portable Key Derivation from a password

From: Arne Vajhøj on 7 Feb 2010 17:00

On 07-02-2010 16:37, Mike Schilling wrote:
> Arne Vajhøj wrote:
>> On 07-02-2010 12:59, Roedy Green wrote:
>>> It seems to me the usual XML tools in Java load the entire XML file
>>> into RAM.
>>
>> ????
>>
>> W3CDOM and JAXB do load all data in memory.
>>
>> SAX and StAX do not load all data in memory.
>
> If you use XSLT to process an XML file, it has to keep a complete
> representation of the resulting XML document in memory, since an XSLT
> transformation can include XPath expressions, and XPath can in principle
> access anything in the document. This is true even if the input to XSLT is
> a SAXSource.

True.

But that problem is very hard to solve.

Arne

From: Tom Anderson on 7 Feb 2010 17:25

On Sun, 7 Feb 2010, Mike Schilling wrote:

> Arne Vajhøj wrote:
>> On 07-02-2010 12:59, Roedy Green wrote:
>>> It seems to me the usual XML tools in Java load the entire XML file
>>> into RAM.
>>
>> ????
>>
>> W3CDOM and JAXB do load all data in memory.
>>
>> SAX and StAX do not load all data in memory.
>
> If you use XSLT to process an XML file, it has to keep a complete
> representation of the resulting XML document in memory, since an XSLT
> transformation can include XPath expressions, and XPath can in principle
> access anything in the document. This is true even if the input to
> XSLT is a SAXSource.

Weeeellll, kinda. Some XSLTs will require the whole document to be held in
memory. But it is possible to process some XSLTs in a streaming or
streaming-ish manner (where elements are held in memory, but only a subset
at a time). There's nothing stopping an XSLT processor compiling such
XSLTs into a form which does just that. Whether any actually do, i don't
know.

A while ago, i read about a streaming XPath processor. It couldn't handle
all XPaths in a streaming manner, so it had to fall back to searching an
in-memory tree where that was the case, but many common XPaths can be
handled streamingly. For instance, something like:

//order[@id='99']/order-item

Could be. You run the parse, and maintain the current stack of elements in
memory - all the elements enclosing the current parse point, IYSWIM. Then
you just look at the top of the stack at every point to see if it's an
order-item, then if it is, look back to see if the enclosing order has an
id of 99. You could probably do it more efficiently than that, but that's
one way you could do it. Something like this:

//order[customer[@id='99']]/order-item

Is more challenging, and requires a more sophisticated evaluation strategy
- you might need to read in a whole order, search it for matching
order-items, then throw it away and move on to the next one. Or, if you
knew from the DTD that the customer element had to come before any
order-items in an order, you could build a state machine that could decide
that it was inside a matching order, and then report all order-items.
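
The stack-based evaluation described above for
//order[@id='99']/order-item can be sketched with StAX. This is only an
illustration, not how any real streaming XPath processor is known to
work: the class name StreamingOrderItems and the method find are made up
for the example, and it assumes order-item elements contain plain text
only.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams the document and keeps only the chain of
// currently open elements in memory.
public class StreamingOrderItems {

    /** One entry per open element: its name and its id attribute (may be null). */
    private static final class Frame {
        final String name;
        final String id;

        Frame(String name, String id) {
            this.name = name;
            this.id = id;
        }
    }

    /** Returns the text of every order-item whose parent is an order with the given id. */
    public static List<String> find(String xml, String wantedId) {
        List<String> matches = new ArrayList<>();
        Deque<Frame> stack = new ArrayDeque<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    Frame parent = stack.peek(); // the enclosing element, if any
                    boolean hit = name.equals("order-item")
                            && parent != null
                            && parent.name.equals("order")
                            && wantedId.equals(parent.id);
                    if (hit) {
                        // Assumption: order-item holds text only. getElementText()
                        // reads through the matching end tag, so nothing is pushed.
                        matches.add(reader.getElementText());
                    } else {
                        stack.push(new Frame(name, reader.getAttributeValue(null, "id")));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    stack.pop();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return matches;
    }
}
```

Memory use here is proportional to the nesting depth of the document
plus the matches collected, not to the size of the file. A real caller
would read from a file or stream rather than a String.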

Anyway, all speculation, but it's interesting stuff!
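
For the second expression, //order[customer[@id='99']]/order-item, the
state-machine idea can be sketched in the same way. It leans on the
assumption stated above, that the customer element comes before any
order-items in an order, and additionally assumes that orders are not
nested and that order-items contain plain text. The names
CustomerOrderItems and find are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: a three-state machine driven by StAX events.
public class CustomerOrderItems {

    private enum State { OUTSIDE_ORDER, IN_ORDER, IN_MATCHING_ORDER }

    /** Returns the text of order-items in orders whose customer child has the given id. */
    public static List<String> find(String xml, String customerId) {
        List<String> matches = new ArrayList<>();
        State state = State.OUTSIDE_ORDER;
        int depth = 0;       // number of currently open elements
        int orderDepth = -1; // depth of the open order element, or -1 if none
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    String name = reader.getLocalName();
                    if (name.equals("order")) {
                        orderDepth = depth;
                        state = State.IN_ORDER;
                    } else if (orderDepth != -1 && depth == orderDepth + 1) {
                        // Only direct children of the open order are considered.
                        if (name.equals("customer")
                                && customerId.equals(reader.getAttributeValue(null, "id"))) {
                            state = State.IN_MATCHING_ORDER;
                        } else if (name.equals("order-item")
                                && state == State.IN_MATCHING_ORDER) {
                            matches.add(reader.getElementText());
                            depth--; // getElementText() consumed the end tag
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == orderDepth) {
                        // Leaving the order: forget everything about it.
                        orderDepth = -1;
                        state = State.OUTSIDE_ORDER;
                    }
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return matches;
    }
}
```

If the customer element could appear after the order-items, this sketch
would silently miss them; that case needs the other strategy, buffering
one whole order at a time before deciding.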

tom

--
Dreams are not covered by any laws. They can be about anything. --
Cmdr Zorg

From: Tom Anderson on 7 Feb 2010 17:26

On Sun, 7 Feb 2010, Roedy Green wrote:

> It seems to me the usual XML tools in Java load the entire XML file into
> RAM. Are there any tools that process sequentially, bringing in only a
> chunk at a time so you could handle really fat files.

What do you mean by 'tools'?

tom

--
Dreams are not covered by any laws. They can be about anything. --
Cmdr Zorg

From: Mike Schilling on 7 Feb 2010 22:12

Tom Anderson wrote:
> On Sun, 7 Feb 2010, Mike Schilling wrote:
>
>> Arne Vajhøj wrote:
>>> On 07-02-2010 12:59, Roedy Green wrote:
>>>> It seems to me the usual XML tools in Java load the entire XML file
>>>> into RAM.
>>>
>>> ????
>>>
>>> W3CDOM and JAXB do load all data in memory.
>>>
>>> SAX and StAX do not load all data in memory.
>>
>> If you use XSLT to process an XML file, it has to keep a complete
>> representation of the resulting XML document in memory, since an
>> XSLT transformation can include XPath expressions, and XPath can in
>> principle access anything in the document. This is true even if
>> the input to XSLT is a SAXSource.
>
> Weeeellll, kinda. Some XSLTs will require the whole document to be
> held in memory. But it is possible to process some XSLTs in a
> streaming or streaming-ish manner (where elements are held in memory,
> but only a subset at a time). There's nothing stopping an XSLT
> processor compiling such XSLTs into a form which does just that.
> Whether any actually do, i don't know.

Xalan (the XSLT processor in the JDK) doesn't.

From: Arne Vajhøj on 7 Feb 2010 22:13

On 07-02-2010 17:25, Tom Anderson wrote:
> On Sun, 7 Feb 2010, Mike Schilling wrote:
>> Arne Vajhøj wrote:
>>> On 07-02-2010 12:59, Roedy Green wrote:
>>>> It seems to me the usual XML tools in Java load the entire XML file
>>>> into RAM.
>>>
>>> ????
>>>
>>> W3CDOM and JAXB do load all data in memory.
>>>
>>> SAX and StAX do not load all data in memory.
>>
>> If you use XSLT to process an XML file, it has to keep a complete
>> representation of the resulting XML document in memory, since an
>> XSLT transformation can include XPath expressions, and XPath can in
>> principle access anything in the document. This is true even if the
>> input to XSLT is a SAXSource.
>
> Weeeellll, kinda. Some XSLTs will require the whole document to be held
> in memory. But it is possible to process some XSLTs in a streaming or
> streaming-ish manner (where elements are held in memory, but only a
> subset at a time). There's nothing stopping an XSLT processor compiling
> such XSLTs into a form which does just that. Whether any actually do, i
> don't know.
>
> A while ago, i read about a streaming XPath processor. It couldn't
> handle all XPaths in a streaming manner, so it had to fall back to
> searching an in-memory tree where that was the case, but many common
> XPaths can be handled streamingly. For instance, something like:
>
> //order[@id='99']/order-item
>
> Could be. You run the parse, and maintain the current stack of elements
> in memory - all the elements enclosing the current parse point, IYSWIM.
> Then you just look at the top of the stack at every point to see if it's
> an order-item, then if it is, look back to see if the enclosing order
> has an id of 99. You could probably do it more efficiently than that,
> but that's one way you could do it. Something like this:
>
> //order[customer[@id='99']]/order-item
>
> Is more challenging, and requires a more sophisticated evaluation
> strategy - you might need to read in a whole order, search it for
> matching order-items, then throw it away and move on to the next one.
> Or, if you knew from the DTD that the customer element had to come
> before any order-items in an order, you could build a state machine that
> could decide that it was inside a matching order, and then report all
> order-items.
>
> Anyway, all speculation, but it's interesting stuff!

Interesting.

But for code written today against the standard XML libraries,
it is safe to assume that XSLT will read the whole document
into memory.

Arne
