From: Lew on
Tom Anderson wrote:
>> Weeeellll, kinda. Some XSLTs will require the whole document to be held
>> in memory. But it is possible to process some XSLTs in a streaming or
>> streaming-ish manner (where elements are held in memory, but only a
>> subset at a time). There's nothing stopping an XSLT processor compiling
>> such XSLTs into a form which does just that. Whether any actually do, i
>> don't know.

None in common use. The usual XSLT and XPath processors assume a DOM.

I know from a recent project that it's next to useless to match XPath
expressions with a SAX parser.

>> A while ago, i [sic] read about a streaming XPath processor. It couldn't
>> handle all XPaths in a streaming manner, so it had to fall back to
>> searching an in-memory tree where that was the case, but many common
>> XPaths can be handled streamingly. For instance, something like:
>>
>> //order[@id='99']/order-item

Links?

Arne Vajhøj wrote:
> But for writing code today that uses the standard XML libraries,
> then assuming that XSLT would read it all into memory would be
> a safe assumption.

--
Lew
From: Roedy Green on
On Sun, 07 Feb 2010 13:14:26 -0500, "John B. Matthews"
<nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
said :

>
>I thought that was a principal advantage of the Simple API For XML (SAX)
>model, at least in principle. :-)

I read a sentence about SAX that led me to believe it too read the
whole file into RAM; it just did not create a DOM tree. I am glad that
is not true.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Every compilable program in a sense works. The problem is with your unrealistic expectations on what it will do.
From: Lew on
John B. Matthews wrote:
>> I thought that was a principal advantage of the Simple API For XML (SAX)
>> model, at least in principle. :-)

Roedy Green wrote:
> I read a sentence about SAX that led me to believe it too read the
> whole file into RAM; it just did not create a DOM tree. I am glad that
> is not true.

It does read the whole file into RAM, just not all at once.

SAX and StAX let you deal with the information as it streams in (the "St" in
"StAX" stands for "streaming"; SAX itself is the "Simple API for XML"),
letting you process and perhaps discard stuff as it flows by. A typical use is
to create an object model, perhaps including everything from the document,
that is not a DOM. A DOM parser does the same thing, but allows only the DOM,
not a custom model, and doesn't let you discard anything: it presents the
whole DOM at the conclusion of parsing. If you then need a different object
model, you need room for both that model and the DOM.
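To sketch what I mean (element names invented for illustration; this uses
only the standard javax.xml.stream API):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Count <order-item> elements while the document streams past.
// Nothing is retained but a counter; no tree is ever built.
public class StaxCount {
    public static int countElements(String xml, String localName)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(localName)) {
                count++;  // handle the element, then let it flow by
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml =
            "<orders><order id='99'><order-item/><order-item/></order></orders>";
        System.out.println(countElements(xml, "order-item"));  // prints 2
    }
}
```

The whole file still passes through RAM, but only an event at a time.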

--
Lew
From: Tom Anderson on
On Sun, 7 Feb 2010, Lew wrote:

> Tom Anderson wrote:
>> On Sun, 7 Feb 2010, Mike Schilling wrote:
>>
>>> If you use XSLT to process an XML file, it has to keep a complete
>>> representation of the resulting XML document into memory, since an
>>> XSLT transformation can include XPath expressions, and XPath can in
>>> principle access anything in the dociument. This is true even if the
>>> input to XSLT is a SAXSource.
>>
>> Weeeellll, kinda. Some XSLTs will require the whole document to be held
>> in memory. But it is possible to process some XSLTs in a streaming or
>> streaming-ish manner (where elements are held in memory, but only a
>> subset at a time). There's nothing stopping an XSLT processor compiling
>> such XSLTs into a form which does just that. Whether any actually do, i
>> don't know.
>
> None in common use. The usual XSLT and XPath processors assume a DOM.

Curses. I had an idea that xmlstarlet did streaming XSLT, but on reading
its documentation, i see no mention of it.

My point, though, was in response to "XSLT [...] *has* to" (my emphasis):
that is not always so, though of course a theoretical possibility which is
not implemented anywhere is of no use to anyone.

> I know from a recent project that it's next to useless to match XPath
> expressions with a SAX parser.

In what sense? That it just builds a DOM tree behind the scenes?

>>> A while ago, i [sic] read about a streaming XPath processor. It couldn't
>>> handle all XPaths in a streaming manner, so it had to fall back to
>>> searching an in-memory tree where that was the case, but many common
>>> XPaths can be handled streamingly. For instance, something like:
>>>
>>> //order[@id='99']/order-item
>
> Links?

Yes, some of those would be really good, actually.

tom

--
secular utopianism is based on a belief in an unstoppable human ability
to make a better world -- Rt Rev Tom Wright
From: Lew on
Lew wrote:
>> I know from a recent project that it's next to useless to match XPath
>> expressions with a SAX parser.

Tom Anderson wrote:
> In what sense? That it justs builds a DOM tree behind the scenes?

In the sense that for XPath to work, there has to already be a DOM for it to
search, or else you have to forgo built-in XPath processing. In that recent
project they attempted to cache results for XPath expressions, building those
results by manually matching each expression against data from the streamed
input. When that missed, they had to either re-read the whole input or go
ahead and build a DOM regardless. The complexity and time cost of manual
XPath handling, and the frequency of misses, presented a rather intractable
barrier to the approach.
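For the simple forward-only paths it does work, though. A rough sketch of
manually matching Tom's example, //order[@id='99']/order-item, against a
stream (element names taken from the quoted path; everything else invented):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Match //order[@id='99']/order-item by remembering only one flag:
// are we currently inside an <order> whose id attribute matches?
// No tree is built, but this only handles such simple, forward-only paths
// (and not, e.g., nested <order> elements or reverse axes).
public class StreamingMatch {
    public static List<String> matchOrderItems(String xml, String wantedId)
            throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> items = new ArrayList<>();
        boolean insideWantedOrder = false;
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (r.getLocalName().equals("order")) {
                        insideWantedOrder =
                            wantedId.equals(r.getAttributeValue(null, "id"));
                    } else if (insideWantedOrder
                            && r.getLocalName().equals("order-item")) {
                        items.add(r.getElementText()); // reads to the end tag
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (r.getLocalName().equals("order")) {
                        insideWantedOrder = false;
                    }
                    break;
            }
        }
        r.close();
        return items;
    }
}
```

The trouble starts when the expressions aren't that simple.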

That's only a single data point, of course. I don't rule out the possibility
that another approach to blending SAX and XPath could work. Had it been up to
me, I would have abandoned XPath for that application and just used SAX or
StAX to build a domain-specific object model, not a DOM, and directly
referenced items from that model.
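Something along these lines (a sketch only; the Order class and element
names are invented for the example):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Build a small domain-specific model (Order objects) straight off the
// stream - no DOM, no XPath, just the objects the application needs.
public class OrderModelBuilder {
    public static final class Order {
        public final String id;
        public final List<String> items = new ArrayList<>();
        Order(String id) { this.id = id; }
    }

    public static List<Order> parse(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Order> orders = new ArrayList<>();
        Order current = null;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("order")) {
                    current = new Order(r.getAttributeValue(null, "id"));
                    orders.add(current);
                } else if (current != null
                        && r.getLocalName().equals("order-item")) {
                    current.items.add(r.getElementText());
                }
            }
        }
        r.close();
        return orders;
    }
}
```

You then query the Orders directly instead of the document.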

--
Lew