From: George Neuner on
On Fri, 23 Jul 2010 17:57:56 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

>But, you can push some stuff into the lexer instead of putting
>it *all* in the yacc grammar. Roughly the same in terms of
>readability:
>
>[1-9] return NONZERO;
>0 return ZERO;
>
>with:
>
>digit: NONZERO | ZERO
>value: NONZERO digit digit digit digit '.' digit digit
> | SP NONZERO digit digit digit '.' digit digit
>
>(ad nauseum)
>
>[this is admittedly a bad example]
>
>My question regards the differences in the generated code
>and memory requirements (e.g., transition tables in the FSM)
>that are consequential to one approach over another.

The lexer has to examine each and every character of input and there
is overhead for each call to it. The more times the parsing code
calls the lexing code, the slower the parse becomes.

Regardless of implementation, tokenization and lexical analysis of the
input stream are the limiting factors for parser performance. In a
typical optimizing compiler, lexing accounts for 70-80% of the running
time while parsing accounts for less than 1%.


>Keep in mind, lex and yacc were intended to generate code that
>runs in resource rich environments...

No they weren't. lex and yacc are products of the early 1970s, when
the average program executing on a mainframe had 32KB of RAM. They
were designed around a particular space/speed trade-off that was
acceptable to programmers and that the tools themselves could manage
within the limited space available.

Newer tools like flex and bison have other options. bison has even
tighter table packing modes than yacc and can produce smaller parsers
using the same grammars. But bison can also produce much faster
parsers, or reentrant parsers, or handle more complex grammars, etc.
Similarly, flex can make smaller lexers than lex or it can make lexers
that run somewhat faster.
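
Just as a sketch of the sort of knobs available (option names from
memory; check your versions' manuals for exact spellings):

   /* flex: uncompressed (bigger, faster) tables; no scanner globals */
   %option full
   %option reentrant

   /* bison: reentrant parser; GLR for grammars LALR(1) can't handle */
   %pure-parser
   %glr-parser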

George
From: D Yuniskis on
Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 01:47:11 -0700, D Yuniskis
> <not.going.to.be(a)seen.com> wrote:
>
>> Does anyone have any rule-of-thumb guidelines for
>> the relative efficiencies of doing promotions in
>> the lexer vs the formal grammar? Both in terms of
>> time and space?
>
> Point of clarification: a lexer specification *is* a formal grammar
> .... only for a simpler language (in the Chomsky sense).

Yes, but the grammars that lex can parse are quite simple.
It's mainly used to do the "grunt work" (e.g., gobbling
up groups of characters that the yacc grammar would
eventually consider as "nonterminals" -- NUMBER, STRING,
PATHNAME, etc.) that would consume lots of state transitions
in the yacc FSM.
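
I.e., the lexer typically carries rules along the lines of (value
handling omitted here; the token names are whatever the grammar
declares):

   [0-9]+                return NUMBER;
   \"[^\"]*\"            return STRING;
   [A-Za-z0-9_./]+       return PATHNAME;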

>> E.g., imagine:
>>
>> [1-9][0-9]{2,4} return VALUE;
>>
>> in the lexer vs. the same explicit definition in
>> the *grammar*? (i.e., using ". return (int) yytext[0]"
>> in the lexer)
>
> For one thing, using (f)lex, it won't work if you are using a yacc or
> bison parser above it. The parser expects the return value to be a
> numeric token identifier ... returning an arbitrary integer extracted
> from the input will cause a fatal error.

[I can't speak for flex...]

lex begins assigning token values at 256 -- conveniently so that
you can pass any ASCII value (any 8 bit value, to be pedantic)
directly through. E.g., "(int) yytext[0]" just takes the
"current character" (matching "dot") and returns it AS IF
it were a "token". (Technically, the cast should be a bit more
explicit to handle signed chars properly.)

So, the rule I cited essentially turns lex into a raw pipe...
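
I.e., something like (two rules because "." won't match a newline;
the unsigned char cast is the "more explicit" version that avoids
sign extension where char is signed):

   .     { return (int)(unsigned char)yytext[0]; }   /* NUL (0) would read as EOF */
   \n    { return (int)(unsigned char)yytext[0]; }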

> What you want to do is something like the following:
>
> [1-9][0-9]{2,4} { yylval = atoi(yytext); return VALUE; }
>
> The token's value is passed separately through a variable called
> "yylval", which defaults to an integer but can be redefined to be a
> union of differing types.

Yes, I understand. I was pointing out the difference between the
lex+yacc approach and the "dumb lex"+"do-it-all-in-yacc" approach.

I.e., the traditional approach would be to let lex do the
grunt work of gathering up all the "digits", converting them
to an integer (or whatever the grammar really wants), stuffing
the value of that "integer" into u.int and reducing the
character sequence to a single non-terminal: INTEGER (etc.).
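
Roughly (the union member name here is arbitrary):

   /* yacc declarations */
   %union { int ival; }
   %token <ival> INTEGER

   /* lex rule */
   [0-9]+    { yylval.ival = atoi(yytext); return INTEGER; }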

By contrast, I am comparing the approach where the individual
digits are passed directly through lex as tokens in their
own right. I.e., '0' is the token having the value 0x30
(call it ZERO if you like); '1' is the token having the
value 0x31 ("ONE"), etc.

Then, letting yacc directly operate on these "characters"
(tokens) and fabricate its own concept of "integer:"
(vs. INTEGER).

So, my question is, how expensive does this get to be by
doing that grunt work in yacc (instead of lex)? E.g.,
yacc has to track N different states (for the N digit
"integer:") instead of responding to a single INTEGER
token in *a* particular state.

[grrr... this is clear in my head but I seem to be
having a tough time expressing it :< ]

Consider:

You can "scan" a 5 character sequence of digits with
a counter and a single state (if char is digit, stay
in this state and increment count; if char is not digit,
move to some other state as defined by the grammar).

If, however, yacc "can't count", then it effectively
counts a 5 character sequence of digits by:

state1: if digit, move to state 2, else state X
state2: if digit, move to state 3, else state ...
state3: if digit, move to state 4, else state ...
state4: if digit, move to state 5, else state ...
state5: if digit, move to state 6, else state ...

i.e., in state2, one digit has been accumulated; in
state3, two digits have been accumulated; ... until,
finally, in state6, five (consecutive) digits have
been accumulated.
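
(In lex terms, that counter is just a bounded repeat in a single
rule; in the grammar, each position has to be spelled out. The
names below are made up:)

   /* lex: "one state plus a counter", in effect */
   [0-9]{5}       return FIVEDIGITS;

   /* yacc: one position (one state) per digit */
   fivedigits : digit digit digit digit digit ;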

If each state requires a transition table with 256+ entries
(for each of the possible stimuli -- tokens -- that can be
received *in* that state), then having "state strings" like
this gets expensive.

E.g., imagine a grammar that parses:

sentence: "The quick brown fox stepped on the lazy dog's nose"

entirely within yacc (I am guessing that implication analysis
won't let you reduce that to anything less than strlen(...)
discrete states!)

> lex/yacc is before my time, but using flex/bison you can redefine the
> default parser and lexer functions to include arbitrary parameters.
> I've used this ability to, e.g., parse out of memory buffers rather
> than from file streams.

Yes, this is an inherited capability.

>> Are there any easily identifiable criteria that affect
>> the tradeoff between the two approaches?
>
> In general there is no performance difference between extracting
> non-character data in the lexer vs extracting it at a higher level ...
> you'll make the same number of passes over it.

I'm concerned with the efficiency of the FSM generation
(consider the "sentence" above), and with how deep the pushdown
stack gets to cover backing out of "wrong promotions"
that are attempted as yacc encounters new tokens (which,
in the case I am suggesting, are actually *characters*!)

> However, the lexer does not have to preserve input characters across
> calls to yylex(), so extraction at a higher level may mean making a
> copy of the input characters.
>
>> Is there any (significant) difference in lex/yacc
>> implementations wrt this issue (e.g., vs flex/bison)?
>
> No ... your issue is outside the scope of either lexer or parser
> generation.
>
> flex and bison are MAJOR improvements on lex and yacc, but the
> differences are optimizations of their internal DFA and PDA processes,
> support for programmer extensions, and for bison, support for handling
> more complex grammars: yacc is LALR, bison supports LALR, LR, SLR and
> GLR (depending on version).

The grammars are all relatively simple. Rather, I was interested in
any differences in the automata they construct (for the previously
illustrated examples).

I should just build some bogus grammars and tweak them to see
how the tables change in size. Perhaps I can deduce the basic
behavior of the FSM generator empirically (I suspect these
things are too clever for that sort of simplistic experimentation)

[if I've still not managed to highlight the issue that is
of interest to me, let me know and I'll try to come up with
some real examples to *quantitatively* illustrate]
From: George Neuner on
On Fri, 23 Jul 2010 23:20:50 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

>George Neuner wrote:
>
>> Point of clarification: a lexer specification *is* a formal grammar
>> .... only for a simpler language (in the Chomsky sense).
>
>Yes, but the grammars that lex can parse are quite simple.
>It's mainly used to do the "grunt work" ...

The patterns lex can handle are a superset of regular expressions.
http://en.wikipedia.org/wiki/Chomsky_hierarchy

lex adds to regex limited context sensitivity and limited ability to
do counting. Both of these abilities are simple hacks, however, and
neither is sufficient to promote the recognized language to the next
Chomsky level.
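
The hacks in question, roughly (rule names invented; the last line
assumes a start condition declared with %x COMMENT):

   foo/bar       return FOO;   /* trailing context: "foo" only before "bar" */
   [0-9]{2,4}    return NUM;   /* bounded repetition: the "counting" */
   <COMMENT>.    ;             /* start conditions: crude context sensitivity */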


>> For one thing, using (f)lex,
>
>[I can't speak for flex...]

Read (f)lex as shorthand for "flex or lex"

>lex begins assigning token values at 256 -- conveniently so that
>you can pass any ASCII value (any 8 bit value, to be pedantic)
>directly through.

So does flex.

>E.g., "(int) yytext[0]" just takes the
>"current character" (matching "dot") and returns it AS IF
>it were a "token". (Technically, the cast should be a bit more
>explicit to handle signed chars properly.)

Yes, bison predefines 1..255 as tokens (I don't know about yacc).
However zero is an illegal token - (f)lex returns zero to mean EOF. So
you can't parse binary that includes zero by doing this.


>I.e., the traditional approach would be to let lex do the
>grunt work of gathering up all the "digits", converting them
>to an integer (or whatever the grammar really wants), stuffing
>the value of that "integer" into u.int and reducing the
>character sequence to a single non-terminal: INTEGER (etc.).
>
>By contrast, I am comparing the approach where the individual
>digits are passed directly through lex as tokens in their
>own right. I.e., '0' is the token having the value 0x30
>(call it ZERO if you like); '1' is the token having the
>value 0x31 ("ONE"), etc.
>
>Then, letting yacc directly operate on these "characters"
>(tokens) and fabricate its own concept of "integer:"
>(vs. INTEGER).
>
>So, my question is, how expensive does this get to be by
>doing that grunt work in yacc (instead of lex)? E.g.,
>yacc has to track N different states (for the N digit
>"integer:") instead of responding to a single INTEGER
>token in *a* particular state.

I don't think anyone has measured this.

I would expect the __raw__ regex ((f)lex) and PDA (yacc/bison) state
machines to be roughly comparable in performance. The main difference
is in where and when any additional user code is executed.

For regex, the user code is always executed after the FSM has finished
so there is no impact on recognition speed. If you try to recognize
simple things in the PDA however, the intermediate rules may have to
execute code: consider the difference between recognizing the word
"integer" and recognizing an actual integer value:

%%

input : word
            { printf( "recognized word \"integer\"\n"); }
      | value
            { printf( "recognized value : %d\n", $1 ); }
      ;

word  : 'i' 'n' 't' 'e' 'g' 'e' 'r';

value : value digit
            { $$ = ($1 * 10) + $2; }
      | digit
      ;

digit : '0' { $$ = 0; }
      | '1' { $$ = 1; }
      | '2' { $$ = 2; }
      | '3' { $$ = 3; }
      | '4' { $$ = 4; }
      | '5' { $$ = 5; }
      | '6' { $$ = 6; }
      | '7' { $$ = 7; }
      | '8' { $$ = 8; }
      | '9' { $$ = 9; }
      ;

%%

Accumulating any kind of value requires execution of user code. To
accumulate an integer, you *must* construct a value for EACH digit
individually and pass it up the chain.

So for example, the following won't work:

value : value digit
            { $$ = ($1 * 10) + ($2 - '0'); }
      | digit
            { $$ = $1 - '0'; }
      ;

because by default the current state identifier is passed up the chain
.... that is, the subrule digit will receive some bogus internal yacc
value instead of the input character.
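
One way around that (an untested sketch) is to have the lexer set
yylval as it passes each digit character through, so the terminal
carries the character itself:

   [0-9]    { yylval = yytext[0]; return yytext[0]; }

Then the "- '0'" arithmetic above has a real character to work on.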


>If, however, yacc "can't count", then it effectively
>counts a 5 character sequence of digits by:

As a matter of fact, yacc CAN count - indirectly via the PDA's stack -
which is why yacc can handle recursive grammars while lex cannot.

But I know what you mean.



>The grammars are all relatively simple. Rather, I was interested in
>any differences in the automata they construct (for the previously
>illustrated examples).

(f)lex constructs a pure state machine. yacc/bison constructs a state
machine with a stack. Aside from that you are dealing with how they
pass user values around.

George
From: Cesar Rabak on
On 23/7/2010 05:47, D Yuniskis wrote:
> Hi,
>
[snipped]

Don,

Having arrived late in this thread, and having understood that you're
trying to use lex/yacc for a [comm] protocol, I wonder whether in your
searches you considered using tools closer to your problem at hand.

Something lighter than Estelle, for example?

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/
From: D Yuniskis on
Hi Cesar,

Cesar Rabak wrote:
> Having arrived late in this thread, and having understood that you're
> trying to use lex/yacc for a [comm] protocol, I wonder whether in your
> searches you considered using tools closer to your problem at hand.

<frown> Hard to consider anything "lighter" than lex/yacc
(without resorting to ad-hoc approaches)

> Something lighter than Estelle, for example?

I've never used Estelle. But, it seems a lot more heavy-handed
than simply specifying a grammar in lex/yacc. I.e., it seems like
it wants to extend *beyond* just the "interface specification"
and deeper into the interactions between competing/cooperating
processes that happen to be "communicating" (gee, how's that
for a mouthful? :> )

I just want to create a simple template/framework that allows
others to create similar interfaces just by tweaking an
existing one (since most folks seem to go "deer-in-headlights"
when faced with creating something "from scratch").

lex/yacc can do that. But, the question is, where various bits
of the grammar should reside to most effectively use these
capabilities (as with most things, there are lots of different
ways to approach it WITHIN the lex/yacc framework; but, there
are consequences to each approach!). Since I am anticipating
others will copy/modify what I have done, there is extra incentive
to "get it right" (lest *everyone* ends up repeating my mistakes!)

:-/