From: George Neuner on
On Tue, 27 Jul 2010 12:27:14 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

><frown> Hard to consider anything "lighter" than lex/yacc
>(without resorting to ad-hoc approaches)

If all you need is regex there is a simple preprocessing tool called
RE2C that lets you embed the expression right into your C code. I
used it a number of years ago and it worked quite well. The project
has a new owner and supposedly there have been improvements since
then. http://re2c.org/
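
Roughly, a scanner fragment looks like the following (written from
memory, so treat it as a sketch: the TOK_* codes and the scan()
wrapper are invented for illustration -- re2c itself only expands the
special comment into an inline switch/goto state machine):

/* Hypothetical token codes -- not part of re2c itself. */
enum { TOK_END = 0, TOK_INTEGER = 258, TOK_WORD = 259, TOK_ERROR = -1 };

/* Input is assumed NUL-terminated, so buffer refill (YYFILL) is disabled. */
static int scan(const char **pp)
{
    const char *YYCURSOR = *pp;
    const char *YYMARKER;    /* only used if a rule needs backtracking */

    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        "\x00"      { return TOK_END; }
        [0-9]+      { *pp = YYCURSOR; return TOK_INTEGER; }
        [a-zA-Z]+   { *pp = YYCURSOR; return TOK_WORD; }
        [^]         { *pp = YYCURSOR; return TOK_ERROR; }
    */
}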

If you can use a C++ compiler (even for writing C) Boost Spirit is a
lightweight library for regex and LL parsing. A number of Boost
libraries are function based rather than object based and intermix
easily with C code ... but I haven't used Spirit so I don't know which
category it is in. http://www.boost.org/

George
From: D Yuniskis on
Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 23:20:50 -0700, D Yuniskis
>> George Neuner wrote:
>>
>>> For one thing, using (f)lex,
>> [I can't speak for flex...]
>
> Read (f)lex as shorthand for "flex or lex"

Yes, I understood your notation. Rather, I was commenting
that I don't know if *flex* makes the same assumptions
re: token "values" (starting at 256)

>> lex begins assigning token values at 256 -- conveniently so that
>> you can pass any ASCII value (any 8 bit value, to be pedantic)
>> directly through.
>
> So does flex.

OK

>> E.g., "(int) yylex[0]" just takes the
>> "current character" (matching "dot") and returns it AS IF
>> it was a "token". (technically, the cast should be a bit more
>> explicit to handle signed char's properly)
>
> Yes, bison predefines 1..255 as tokens (I don't know about yacc).
> However zero is an illegal token - (f)lex returns zero to mean EOF. So
> you can't parse binary that includes zero by doing this.

I'm not worried about binary values -- just ASCII
(I don't consider 'NUL' :> )

>> I.e., the traditional approach would be to let lex do the
>> grunt work of gathering up all the "digits", converting them
>> to an integer (or whatever the grammar really wants), stuffing
>> the value of that "integer" into u.int and reducing the
>> character sequence to a single token: INTEGER (etc.).
>>
>> By contrast, I am comparing the approach where the individual
>> digits are passed directly through lex as tokens in their
>> own right. I.e., '0' is the token having the value 0x30
>> (call it ZERO if you like); '1' is the token having the
>> value 0x31 ("ONE"), etc.
>>
>> Then, letting yacc directly operate on these "characters"
>> (tokens) and fabricate its own concept of "integer:"
>> (vs. INTEGER).
>>
>> So, my question is, how expensive does this get to be by
>> doing that grunt work in yacc (instead of lex)? E.g.,
>> yacc has to track N different states (for the N digit
>> "integer:") instead of responding to a single INTEGER
>> token in *a* particular state.
>
> I don't think anyone has measured this.

<frown> I was wondering if anyone had "looked under the
hood" enough to be able to posit a guess.

> I would expect the __raw__ regex ((f)lex) and PDA (yacc/bison) state
> machines to be roughly comparable in performance. The main difference
> is in where and when any additional user code is executed.

lex seems to like to "write spaghetti code" (an overstatement
but, contrasted to yacc's more rigid "tables", I think you
know what I mean).

For the sorts of "traditional" tokens that it is meant to
parse, I think lex is probably a win. E.g.,

[0-9]+ return INTEGER;
[a-zA-Z]+ return WORD;

etc. yacc would probably not take a big hit, either, as
it's just a state looping back to itself repeatedly.
Even ad hoc approaches can be efficient with these.
(i.e., lots of actions result in "same state" transitions
so the total number of states is "fewer" than if you had
to *count* "digits" or "letters")

OTOH, if your grammar has more rigidly defined tokens like:

[0-9]{4,4} return INTEGER;

or (gasp):

[1-9][0-9][0-9][0-9]
| ' '[1-9][0-9][0-9]
| ' ' ' '[1-9][0-9]
| ' ' ' ' ' '[0-9]

then there are a lot *more* state transitions involved.
Here, yacc's approach could get pricey (since you can't
"cheat" and wrap lots of character sequences into a
*few* consecutive states like:

[ ]*[1-9]?[0-9]+

(which isn't a 1:1 replacement for the above)

I.e., for rigid message formats, you can end up
with lots of sequential states, each of which
defines the *few* expected "inputs" that will
advance to the "next state" while many/most
other "inputs" will signal an error (syntax).

> For regex, the user code is always executed after the FSM has finished
> so there is no impact on recognition speed. If you try to recognize
> simple things in the PDA however, the intermediate rules may have to
> execute code: consider the difference between recognizing the word
> "integer" and recognizing an actual integer value:
>
> %%
>
> input : word   { printf( "recognized word \"integer\"\n" ); }
>       | value  { printf( "recognized value : %d\n", $1 ); }
>       ;
>
> word : 'i' 'n' 't' 'e' 'g' 'e' 'r';
>
> value : value digit  { $$ = ($1 * 10) + $2; }
>       | digit
>       ;
>
> digit : '0' { $$ = 0; }
> | '1' { $$ = 1; }
> | '2' { $$ = 2; }
> | '3' { $$ = 3; }
> | '4' { $$ = 4; }
> | '5' { $$ = 5; }
> | '6' { $$ = 6; }
> | '7' { $$ = 7; }
> | '8' { $$ = 8; }
> | '9' { $$ = 9; }
> ;
>
> %%
>
> Accumulating any kind of value requires execution of user code. To
> accumulate an integer, you *must* construct a value for EACH digit
> individually and pass it up the chain.
>
> So for example, the following won't work:
>
> value : value digit  { $$ = ($1 * 10) + ($2 - '0'); }
>       | digit        { $$ = $1 - '0'; }
>       ;
>
> because by default the current state identifier is passed up the chain
> .... that is, the subrule digit will receive some bogus internal yacc
> value instead of the input character.

Yes, but you could do:

integer: digit digit digit digit
           {
             $$ = ((($1 - '0') * 10 + ($2 - '0')) * 10
                                    + ($3 - '0')) * 10
                                    + ($4 - '0');
           }

(assuming I've counted my parens properly :-/ )
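
Of course, for $1..$4 to actually hold the character codes, the lexer
has to hand each character to yacc as its semantic value as well as
its token number -- yacc's default action just copies up whatever
yylval held when the token was shifted. A minimal sketch of the lex
side, assuming the default int-typed yylval:

%%
[0-9]     { yylval = yytext[0]; return yytext[0]; }
.|\n      { yylval = yytext[0]; return yytext[0]; }
%%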

>> If, however, yacc "can't count", then it effectively
>> counts a 5 character sequence of digit by:
>
> As a matter of fact, yacc CAN count - indirectly via the PDA's stack -
> which is why yacc can handle recursive grammars while lex cannot.
>
> But I know what you mean.
>
>> The grammars are all relatively simple. Rather, I was interested in
>> any differences in the automata they construct (for the previously
>> illustrated examples).
>
> (f)lex constructs a pure state machine. yacc/bison constructs a state
> machine with a stack. Aside from that you are dealing with how they
> pass user values around.

I'll have to build some toy grammars and move the promotions
around between lex and yacc and see what the resulting code
looks like. I'll try to get around to that in the next few
days and post some examples.

[currently preoccupied retiring a server and replacing another
so I live in fear of "forgetting something" if distracted while
in the process :-/ Unfortunately, I should have installed
the Gb switch *before* doing this as moving that much data at
100M is *painfully* slow! :< Sheesh! To think back to the
SneakerNet days... :-/ ]
From: D Yuniskis on
Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 17:57:56 -0700, D Yuniskis
> <not.going.to.be(a)seen.com> wrote:
>
>> But, you can push some stuff into the lexer instead of putting
>> it *all* in the yacc grammar. Roughly the same in terms of
>> readability:
>>
>> [1-9] return NONZERO;
>> 0 return ZERO;
>>
>> with:
>>
>> digit: NONZERO | ZERO
>> value: NONZERO digit digit digit digit '.' digit digit
>> | SP NONZERO digit digit digit '.' digit digit
>>
>> (ad nauseum)
>>
>> [this is admittedly a bad example]
>>
>> My question regards the differences in the generated code
>> and memory requirements (e.g., transition tables in the FSM)
>> that are consequential to one approach over another.
>
> The lexer has to examine each and every character of input and there
> is overhead for each call to it. The more times the parsing code
> calls the lexing code, the slower the parse becomes.

Agreed. This suggests letting lex gobble up as much
input as it can (reasonably).

OTOH, the parser has more knowledge of what's going on
than the lexer. Once invoked, the lexer will gladly
process lots of input... THAT THE PARSER MIGHT CONSIDER
AS "INAPPROPRIATE" (given context)!

It's this that I think will force me to move everything
into yacc, regardless of "efficiency". Thinking aloud:
if context suggests to yacc that <blah> is *required*,
lex can merrily encounter <string> in the input and
patiently gobble up as many [A-Za-z]+ as happen to be
present -- EVEN THOUGH "return STRING" will cause yacc
to signal an error!

If, OTOH, lex simply passed each character (in essence)
to yacc as discrete tokens and let yacc promote each
accordingly based on the rules of the grammar, then
yacc can detect invalid "messages" earlier!

Recall, the goal here is to parse messages, not "text
files" (e.g., "source code"). And, that processing is not
"off-line"/batch oriented. I.e., lex's "input()" will
be pulling characters off of a live input stream.

That input stream is not "received" in nice clean units
(i.e., it's not like a text file where full lines can be
extracted at a time). E.g., it could *pause* "mid-message"
for an indeterminate amount of time (subject to the
communication protocol in place).
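
With flex, that hookup is usually done by redefining YY_INPUT (with
classic lex you redefine input()/unput() instead). A rough sketch,
where read_byte_blocking() is just a stand-in for whatever the real
protocol layer provides:

/* In the definitions section of the .l file (%{ ... %}): */
extern int read_byte_blocking(void);   /* hypothetical protocol hook */

#define YY_INPUT(buf, result, max_size)                        \
    do {                                                       \
        int c = read_byte_blocking();   /* may stall here */   \
        (result) = (c == EOF) ? YY_NULL                        \
                              : ((buf)[0] = (char)c, 1);       \
    } while (0)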

As a (trivial) example, consider a grammar that supports
two message types:

"Hello, my name is <string>"
| "<expr> <op> <expr>"

(where <expr> and <op> are what you suspect them to be)

[granted, a silly example but should suffice to illustrate my
point]

If the parser has processed "Hello, my name is ", then it
is obviously awaiting <string>. If, OTOH, the input contains
"1234567.8" and has *stalled* at that point, then having
lex process <number> as "[0-9]+\.[0-9]*" means lex will
dutifully hang waiting the next character (after the '8')
to see if it can fold that into its <number> rule. Only
when *lex* has decided where that token's boundary lies
will it pass NUMBER to yacc.

OTOH, if the "1" had been passed to yacc, it would have
already decided that "Hello, my name is 1..." is not a
valid sentence in the grammar and signalled an error.
(by contrast, if it has to wait for lex -- and lex has to
wait for a *paused* input stream -- then the error is
not detected until well after the fact!)

So, it looks like lex can only do context independent
promotions *if* I want to keep the error path expedited.

> Regardless of implementation, tokenization and lexical analysis of the
> input stream are the limiting factors for parser performance. In a
> typical optimizing compiler, lexing accounts for 70-80% of the running
> time while parsing accounts for less than 1%.
>
>> Keep in mind, lex and yacc were intended to generate code that
>> runs in resource rich environments...
>
> No they aren't. lex and yacc are products of the early 1970s, when
> the average program executing on a mainframe had 32KB of RAM. They
> were designed around a particular space/speed trade-off that was
> acceptable to programmers and that let the tools themselves work
> within the limited space available.

But, they had secondary storage available, could make multiple
passes over the input, weren't time constrained, etc. I.e.,
they could concentrate on *just* that one aspect.

> Newer tools like flex and bison have other options. bison has even
> tighter table packing modes than yacc and can produce smaller parsers
> using the same grammars. But bison can also produce much faster
> parsers, or reentrant parsers, or handle more complex grammars, etc.
> Similarly, flex can make smaller lexers than lex or it can make lexers
> that run somewhat faster.

The details of their respective (and relative) automata are
needed to sort out just how efficient each *might* be.
From: George Neuner on
On Wed, 28 Jul 2010 15:48:11 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

>Hi George,
>
>George Neuner wrote:
>>
>> The lexer has to examine each and every character of input and there
>> is overhead for each call to it. The more times the parsing code
>> calls the lexing code, the slower the parse becomes.
>
>Agreed. This suggests letting lex gobble up as much
>input as it can (reasonably).
>
>OTOH, the parser has more knowledge of what's going on
>than the lexer. Once invoked, the lexer will gladly
>process lots of input... THAT THE PARSER MIGHT CONSIDER
>AS "INAPPROPRIATE" (given context)!

That's not a convincing argument.

First of all, every successful non-human language and protocol is
context free (in the Chomsky sense). "Context free" does not mean the
parser works without knowledge of its surroundings ... it means that
whether a grammar rule applies depends only on the nonterminal being
expanded, not on what surrounds it.

Secondly, to be efficient the parser needs to work at a higher level
than input characters. Lexers were created specifically so that
parsers can work with abstract "tokens" rather than raw character strings.


>It's this that I think will force me to move everything
>into yacc, regardless of "efficiency". Thinking aloud:
>if context suggests to yacc that <blah> is *required*,
>lex can merrily encounter <string> in the input and
>patiently gobble up as many [A-Za-z]+ as happen to be
>present -- EVEN THOUGH "return STRING" will cause yacc
>to signal an error!

That's right. But if the surroundings (let's avoid the word "context"
to prevent Chomsky based misunderstandings) dictate that <blah> is
required, then it would be a syntax error if the lexer found a string.


>If, OTOH, lex simply passed each character (in essence)
>to yacc as discrete tokens and let yacc promote each
>accordingly based on the rules of the grammar, then
>yacc can detect invalid "messages" earlier!

But at the expense of a huge increase in grammar complexity and
correspondingly larger parsing code, increased processing per input
character and possibly a dead programmer who, in deep despair, blew
his own brains out before finishing the job because he didn't realize
what he was getting himself into.


>Recall, the goal here is to parse messages, not "text
>files" (e.g., "source code"). And, that processing is not
>"off-line"/batch oriented. I.e., lex's "input()" will
>be pulling characters off of a live input stream.

Live input doesn't matter in any theoretical sense. As a practical
matter, however, if you must provide real time feedback you simply
create an input driven "push" parser rather than an expectation driven
"pull" parser. In a push parser, the lexer drives parsing, throwing
tokens at the parser which accumulates them until they make sense.
You'll get your error at the first "word" that doesn't fit into any of
the possible "sentences" in the match set.

yacc/lex can't create push parsers, but bison/flex and some other
modern tools like ANTLR and Yacc++ can.
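
For reference, newer bisons expose this with "%define api.push-pull
push": the generated parser is handed one token per call and returns
YYPUSH_MORE until it has seen a complete sentence (or an error). A
rough sketch of the calling side -- next_token() is an invented lexer
wrapper, and the exact yypush_parse() signature depends on the bison
version and other %define settings:

#include "grammar.tab.h"      /* grammar.y: %define api.push-pull push */

extern int next_token(YYSTYPE *val);   /* hypothetical; 0 at end of input */

void feed_stream(void)
{
    yypstate *ps = yypstate_new();     /* persistent parser state */
    int status, tok;
    YYSTYPE val;

    do {
        tok    = next_token(&val);
        status = yypush_parse(ps, tok, &val);   /* one token per call */
    } while (tok != 0 && status == YYPUSH_MORE);

    yypstate_delete(ps);
}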


>That input stream is not "received" in nice clean units
>(i.e., it's not like a text file where full lines can be
>extracted at a time). E.g., it could *pause* "mid-message"
>for an indeterminate amount of time (subject to the
>communication protocol in place).
>
>As a (trivial) example, consider a grammar that supports
>two message types:
>
>"Hello, my name is <string>"
>| "<expr> <op> <expr>"
>
>(where <expr> and <op> are what you suspect them to be)
>
>[granted, a silly example but should suffice to illustrate my
>point]
>
>If the parser has processed "Hello, my name is ", then it
>is obviously awaiting <string>. If, OTOH, the input contains
>"1234567.8" and has *stalled* at that point, then having
>lex process <number> as "[0-9]+\.[0-9]*" means lex will
>dutifully hang waiting the next character (after the '8')
>to see if it can fold that into its <number> rule. Only
>when *lex* has decided where that token's boundary lies
>will it pass NUMBER to yacc.

That's as it should be. The "words" of a "sentence" have to be
delineated. Parsing 1492 as "I've got '149' so far and I'm stuck
waiting for more input" is semantically wrong. Until you see
something that doesn't fit into the assumptions of the current match
set, you *cannot* know that you've matched anything.

Honestly, it doesn't matter whether you are matching characters or
abstract tokens. Whenever matching is applied over sequences, the
matching *process* itself is the limitation on what's possible.


>OTOH, if the "1" had been passed to yacc, it would have
>already decided that "Hello, my name is 1..." is not a
>valid sentence in the grammar and signalled an error.
>(by contrast, if it has to wait for lex -- and lex has to
>wait for a *paused* input stream -- then the error is
>not detected until well after the fact!)

You're going to have to take my word for it that you likely will be
unable to write a non-trivial grammar that can do that. Go ahead
and try, but if you blow your brains out don't say I didn't warn you.

My advice is to investigate push parsers instead.


>So, it looks like lex can only do context independent
>promotions *if* I want to keep the error path expedited.

(f)lex has a feature that can help: start states (flex calls them
start "conditions"). Normally all the individual regex patterns are
coalesced into one giant DFA which matches input against all of them
simultaneously and stops when it has a match against any of them.
Start states group related patterns into sub-DFAs which are only
matched against when the lexer is in their related state.

Flex is more flexible 8) than lex in that it allows the same pattern
to belong to more than one state group (in lex you have to repeat the
pattern with a different state prefix). State changes are entirely
handled by user code, so, for example, the state could be a parameter
passed in by the parser.
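
A sketch of what that looks like (the state names, token codes and
the lexer_expect_*() hooks are all invented for illustration; %x
declares "exclusive" start conditions and BEGIN() switches between
them):

%x S_NAME S_EXPR

%%
<S_NAME>[A-Za-z]+             { return NAME;   }
<S_EXPR>[0-9]+("."[0-9]*)?    { return NUMBER; }
<S_EXPR>[-+*/]                { return yytext[0]; }
<*>[ \t]+                     { /* skip blanks in every state */ }
%%

/* Called from the parser (or its actions) to steer the lexer: */
void lexer_expect_name(void) { BEGIN(S_NAME); }
void lexer_expect_expr(void) { BEGIN(S_EXPR); }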

Overall flex is a much better tool than lex. It is *much* more
flexible in the way it interfaces to the parser and in how it handles
input. Unlike lex, there are well documented hooks into most of the
core functions. Flex lexers are intended to be customizable. Flex
also can create measurably faster lexers.

George