Python and Regular Expressions [Python]

Prev: personality development
Next: PyCon Australia Call For Proposals

From: Paul Rubin on 10 Apr 2010 21:38

Steven D'Aprano <steve(a)REMOVE-THIS-cybersource.com.au> writes:
> As entertaining as this is, the analogy is rubbish. Skis are far too
> simple to use as an analogy for a parser (he says, having never seen skis
> up close in his life *wink*). Have you looked at PyParsing's source code?
> Regexes are only a small part of the parser, and not analogous to the
> wood of skis.

The impression that I have (from a distance) is that Pyparsing is a good
interface abstraction with a kludgy and slow implementation. That the
implementation uses regexps just goes to show how kludgy it is. One
hopes that someday there will be a more serious implementation, perhaps
using llvm-py (I wonder whatever happened to that project, by the way)
so that your parser script will compile to executable machine code on
the fly.

From: Paul McGuire on 11 Apr 2010 00:32

On Apr 10, 8:38 pm, Paul Rubin <no.em...(a)nospam.invalid> wrote:
> The impression that I have (from a distance) is that Pyparsing is a good
> interface abstraction with a kludgy and slow implementation. That the
> implementation uses regexps just goes to show how kludgy it is. One
> hopes that someday there will be a more serious implementation, perhaps
> using llvm-py (I wonder whatever happened to that project, by the way)
> so that your parser script will compile to executable machine code on
> the fly.

I am definitely flattered that pyparsing stirs up so much interest,
and among such a distinguished group. But I have to take some umbrage
at Paul Rubin's left-handed compliment, "Pyparsing is a good
interface abstraction with a kludgy and slow implementation,"
especially since he forms his opinions "from a distance".

I actually *did* put some thought into what I wanted in pyparsing
before designing it, and that thinking became a chapter of "Getting Started
with Pyparsing" (available here as a free online excerpt:
http://my.safaribooksonline.com/9780596514235/what_makes_pyparsing_so_special#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODA1OTY1MTQyMzUvMTYmaW1hZ2VwYWdlPTE2),
the "Zen of Pyparsing" as it were. My goals were:

- build parsers using explicit constructs (such as words, groups,
repetition, alternatives), vs. expression encoding using specialized
character sequences, as found in regexen

- easy parser construction from primitive elements to complex groups
and alternatives, using Python's operator overloading for ease of
direct implementation of parsers using ordinary Python syntax; include
mechanisms for defining recursive parser expressions

- implicit skipping of whitespace between parser elements

- results returned not just as a list of strings, but as a rich data
object, with access to parsed fields by name or by list index, taking
interfaces from both dicts and lists for natural adoption into common
Python idioms

- no separate code-generation steps, a la lex/yacc

- support for parse-time callbacks, for specialized token handling,
conversion, and/or construction of data structures

- 100% pure Python, to be runnable on any platform that supports
Python

- liberal licensing, to permit easy adoption into any user's projects
anywhere

So raw performance really didn't even make my short-list, beyond the
obvious "should be tolerably fast enough."

I have found myself reading posts on c.l.py with wording like "I'm
trying to parse <blah-blah> and I've been trying for hours/days to get
this regex working." For kicks, I'd spend 5-15 minutes working up a
working pyparsing solution, which *does* run comparatively slowly,
perhaps taking a few minutes to process the poster's data file. But
the net solution is developed and running in under 1/2 an hour, which
to me seems like an overall gain compared to hours of fruitless
struggling with backslashes and regex character sequences. On top of
which, the pyparsing solutions are still readable when I come back to
them weeks or months later, instead of staring at some line-noise
regex and just scratching my head wondering what it was for. And
sometimes "comparatively slowly" means that it runs 50x slower than a
compiled method that runs in 0.02 seconds - that's still getting the
job done in just 1 second.

And is the internal use of regexes with pyparsing really a "kludge"?
Why? They are almost completely hidden from the parser developer. And
yet by using compiled regexes, I retain the portability of 100% Python
while leveraging the compiled speed of the re engine.
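
As a rough illustration of that point (explicit constructs on the
outside, compiled regexes hidden on the inside), here is a minimal
stdlib-only sketch. It is NOT pyparsing's source or API: the names
Element, Sequence, word, integer and literal are hypothetical, made up
for this example, and it covers only sequencing with "+", implicit
whitespace skipping, named results and a parse-time conversion callback.

```python
import re


class Element:
    """Hypothetical parser element wrapping one compiled regex."""

    def __init__(self, pattern, name=None, convert=None):
        # The leading \s* skips whitespace before the element implicitly.
        self._regex = re.compile(r"\s*(" + pattern + ")")
        self.name = name
        self.convert = convert  # optional parse-time callback

    def _parts(self):
        return [self]

    def __add__(self, other):
        # Operator overloading: "a + b" builds a sequence of elements.
        return Sequence(self._parts() + other._parts())

    def match(self, text, pos):
        m = self._regex.match(text, pos)
        if m is None:
            label = self.name or self._regex.pattern
            raise ValueError(f"expected {label!r} at position {pos}")
        value = m.group(1)
        if self.convert is not None:
            value = self.convert(value)
        return value, m.end()


class Sequence:
    """Hypothetical sequence of elements, matched left to right."""

    def __init__(self, parts):
        self.parts = parts

    def _parts(self):
        return list(self.parts)

    def __add__(self, other):
        return Sequence(self._parts() + other._parts())

    def parse(self, text):
        pos, tokens, named = 0, [], {}
        for part in self.parts:
            value, pos = part.match(text, pos)
            tokens.append(value)
            if part.name:
                named[part.name] = value
        return tokens, named


def word(name=None):
    return Element(r"[A-Za-z]+", name=name)


def integer(name=None):
    return Element(r"\d+", name=name, convert=int)


def literal(text):
    return Element(re.escape(text))


# The grammar reads as ordinary Python; no regex syntax is visible here.
assignment = word("key") + literal("=") + integer("value")
tokens, named = assignment.parse("width =   80")
print(tokens, named)
```

The person writing the grammar only touches word(), literal() and "+";
the regexes stay an implementation detail, while matching still runs
inside the compiled re engine.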

It does seem that there have been many posts of late (either on c.l.py
or the related posts on Stackoverflow) where the OP is trying to
either scrape content from HTML, or parse some type of recursive
expression. HTML scrapers implemented using re's are terribly
fragile, since HTML in the wild often contains little surprises
(unexpected whitespace; upper/lower case inconsistencies; tag
attributes in unpredictable order; attribute values with double,
single, or no quotation marks) which completely frustrate any re-based
approach. Granted, there are times when an re-parsing-of-HTML
endeavor *isn't* futile or doomed from the start - the OP may be
working with a very restricted set of HTML, generated from some other
script so that the output is very consistent. Unfortunately, this
poster usually gets thrown under the same "you'll never be able to
parse HTML with re's" bus. I can't explain the surge in these posts,
other than to wonder if we aren't just seeing a skewed sample - that
is, the many cases where people *are* successfully using re's to solve
their text extraction problems aren't getting posted to c.l.py, since
no one posts questions they already have the answers to.
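
To make the fragility concrete, here is a small sketch comparing a
naive regex with a real HTML parser on the kinds of surprises listed
above. The sample tags and the names NAIVE_LINK, LinkCollector and
extract_links are invented for illustration, and the standard
library's html.parser is used only as a stand-in for "some real HTML
parser"; the thread itself does not recommend a specific one.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes a lowercase tag, href as the first attribute,
# double quotes, and exactly one space.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLES = [
    '<a href="one.html">',                  # the form the regex expects
    "<A HREF='two.html'>",                  # upper case, single quotes
    '<a class="nav"   href=three.html >',   # reordered, unquoted, extra spaces
]
html = "\n".join(SAMPLES)

naive = NAIVE_LINK.findall(html)
robust = extract_links(html)
print("regex found: ", naive)
print("parser found:", robust)
```

On very regular, machine-generated HTML the naive pattern may be all
that is needed, which is the restricted case described above; the
variations in SAMPLES are where it silently misses links.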

So don't be too dismissive of pyparsing, Mr. Rubin. I've gotten many
e-mails, wiki, and forum posts from Python users at all levels of the
expertise scale, saying that pyparsing has helped them to be very
productive in one or another aspect of creating a command parser, or
adding safe expression evaluation to an app, or just extracting some
specific data from a log file. I am encouraged that most report that
they can get their parsers working in reasonably short order, often by
reworking one of the examples that comes with pyparsing. If you're
offering to write that extension to pyparsing that generates the
parser runtime in fast machine code, it sounds totally bitchin' and
I'd be happy to include it when it's ready.

-- Paul

From: Patrick Maupin on 11 Apr 2010 02:29

On Apr 10, 1:05 pm, Stefan Behnel <stefan...(a)behnel.de> wrote:

> Running a Python program in CPython eventually boils down to a sequence of
> commands being executed by the CPU. That doesn't mean you should write
> those commands manually, even if you can. It's perfectly ok to write the
> program in Python instead.

Absolutely. But (as I seem to have posted many times recently) if
somebody asks how to do "x" it may be useful to point out that it
sounds like he really wants "y" and there are already several canned
solutions that do "y", but if he really wants "x", here is how he
should do it, or here is why he will have problems if he attempts to
do it (hint: whether Jamie Zawinski decides to kill a puppy or not is
not really a problem for somebody just asking a programming question
-- that's really up to Jamie).

Regards,
Pat

From: Neil Cerutti on 12 Apr 2010 08:09

On 2010-04-11, Steven D'Aprano
<steve(a)REMOVE-THIS-cybersource.com.au> wrote:
> On Sat, 10 Apr 2010 10:11:07 -0700, Patrick Maupin wrote:
>> On Apr 10, 11:35 am, Neil Cerutti <ne...(a)norwich.edu> wrote:
>>> On 2010-04-10, Patrick Maupin <pmau...(a)gmail.com> wrote:
>>> > as Pyparsing". Which is all well and good, except then the OP will
>>> > download pyparsing, take a look, realize that it uses regexps under
>>> > the hood, and possibly be very confused.
>>>
>>> I don't agree with that. If a person is trying to ski using pieces of
>>> wood that they carved themselves, I don't expect them to be surprised
>>> that the skis they buy are made out of similar materials.
>>
>> But, in this case, the guy ASKED how to make the skis in his woodworking
>> shop, and was told not to be silly -- you don't use wood to make skis --
>> and then directed to go buy some skis that are, in fact, made out of
>> wood.
>
> As entertaining as this is, the analogy is rubbish.

You should have seen the car engine analogy I thought up at
first. ;)

> Skis are far too simple to use as an analogy for a parser (he
> says, having never seen skis up close in his life *wink*).
> Have you looked at PyParsing's source code? Regexes are only a
> small part of the parser, and not analogous to the wood of
> skis.

I was mainly trying to get across my incredulity that somebody
should be surprised a parsing package uses regexes under the
hood. But for the record, a set of downhill skis comes with a
really fancy interface layer:

URL:http://images03.olx.com/ui/1/85/66/13147966_1.jpg

--
Neil Cerutti
