From: Walter Banks on


D Yuniskis wrote:
>
> Hi Walter,
>
> Walter Banks wrote:
> >
> > D Yuniskis wrote:
> >
> >> Does anyone have any rule-of-thumb guidelines for
> >> the relative efficiencies of doing promotions in
> >> the lexer vs the formal grammar? Both in terms of
> >> time and space?
> >>
> >> E.g., imagine:
> >>
> >> [1-9][0-9]{2,4} return VALUE;
> >>
> >> in the lexer vs. the same explicit definition in
> >> the *grammar*? (i.e., using ". return (int) yylex[0]"
> >> in the lexer)
> >>
> >> Are there any easily identifiable criteria that affect
> >> the tradeoff between the two approaches?
> >>
> >> Is there any (significant) difference in lex/yacc
> >> implementations wrt this issue (e.g., vs flex/bison)?
> >>
> >> I can run some empirical tests but I am not expert enough
> >> on either to know how to push the limits with test cases...
> >
> > Parsing is such a small part of a compiler that
>
> Not part of a compiler -- please see my reply to Peter Keller's
> post.
>
> > I would trade time differences for clarity especially
>
> The point of using a formal grammar instead of an ad-hoc
> implementation is exactly that -- to make the formats
> of the messages parsed very obvious. And, to make it easy for
> others to write similar parsers just by cutting and pasting
> from an existing one.

If I were doing it I would still make a similar comment
about doing it for clarity. My personal approach would be
regular expressions rather than the lexer, as these are
more likely to be understood without error and can use
common library code.



Regards,


Walter..
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com



From: Joe Chisolm on
On Fri, 23 Jul 2010 10:16:31 -0700, D Yuniskis wrote:

> Hi Peter,
>
> Peter Keller wrote:
>> D Yuniskis <not.going.to.be(a)seen.com> wrote:
>>> Does anyone have any rule-of-thumb guidelines for the relative
>>> efficiencies of doing promotions in the lexer vs the formal grammar?
>>> Both in terms of time and space?
>>>
>>> E.g., imagine:
>>>
>>> [1-9][0-9]{2,4} return VALUE;
>>>
>>> in the lexer vs. the same explicit definition in the *grammar*?
>>> (i.e., using ". return (int) yylex[0]" in the lexer)
>>>
>>> Are there any easily identifiable criteria that affect the tradeoff
>>> between the two approaches?
>>>
>>> Is there any (significant) difference in lex/yacc implementations wrt
>>> this issue (e.g., vs flex/bison)?
>>>
>>> I can run some empirical tests but I am not expert enough on either to
>>> know how to push the limits with test cases...
>>
>> Unless your compiler pass is compiling 24/7 and has to be real-time
>> like, you absolutely shouldn't care and should write your compiler to
>> be the most maintainable thing ever. Speed doesn't matter for 99% of
>> compilation passes.
>
> (sigh) Sorry, I should have clarified. I'm not using them to write a
> compiler but, rather, a parser for "messages". While speed isn't a real
> issue (though I wouldn't want it to be a *pig*!), I *am* concerned with
> size and stack penetration, etc. "footprint"
>
> As such, I am concerned with how often the machine "backtracks", how
> much state it carries, etc.

This depends on your REs in the lexer and your grammar. From your other
posts it seems you are at the stage where you can control the error
message syntax and thus your grammar. This is a good thing. If you
can put a lot of terminal symbols in your grammar you can control how
much backtracking the grammar has to do. Backtracking in your lexer
depends largely on the longest-match-first rules and your REs.

For simple grammars it can sometimes be better to let the lexer be
just a tokenizer and push some of the rules back into the grammar --
your digit RE above, for example. But the bottom line is the number
of messages you will have to process per unit of time. Unless you
have a complex grammar I don't think your memory footprint is going
to be much of an issue vs. a hand-coded parser.

>
>> Type promotion is done either in the type checking phase of the
>> compiler, after parsing, or in the code generator. It is better to make
>> type promotions explicit in a pass of your compiler, since then you can
>> warn when type promotions lose data (among other things). Of course,
>> all of the phases can be adhoc'ed together, but then you end up with
>> unmaintainable and fragile junk.
>>
>> Well, that's my opinion anyway... :)





--
Joe Chisolm
Marble Falls, Tx.
From: D Yuniskis on
Hi Joe,

Joe Chisolm wrote:
> On Fri, 23 Jul 2010 10:16:31 -0700, D Yuniskis wrote:
>
>> Hi Peter,
>>
>> Peter Keller wrote:
>>> D Yuniskis <not.going.to.be(a)seen.com> wrote:
>>>> Does anyone have any rule-of-thumb guidelines for the relative
>>>> efficiencies of doing promotions in the lexer vs the formal grammar?
>>>> Both in terms of time and space?
>>>>
>>>> E.g., imagine:
>>>>
>>>> [1-9][0-9]{2,4} return VALUE;
>>>>
>>>> in the lexer vs. the same explicit definition in the *grammar*?
>>>> (i.e., using ". return (int) yylex[0]" in the lexer)
>>>>
>>>> Are there any easily identifiable criteria that affect the tradeoff
>>>> between the two approaches?
>>>>
>>>> Is there any (significant) difference in lex/yacc implementations wrt
>>>> this issue (e.g., vs flex/bison)?
>>>>
>>>> I can run some empirical tests but I am not expert enough on either to
>>>> know how to push the limits with test cases...
>>> Unless your compiler pass is compiling 24/7 and has to be real-time
>>> like, you absolutely shouldn't care and should write your compiler to
>>> be the most maintainable thing ever. Speed doesn't matter for 99% of
>>> compilation passes.
>> (sigh) Sorry, I should have clarified. I'm not using them to write a
>> compiler but, rather, a parser for "messages". While speed isn't a real
>> issue (though I wouldn't want it to be a *pig*!), I *am* concerned with
>> size and stack penetration, etc. "footprint"
>>
>> As such, I am concerned with how often the machine "backtracks", how
>> much state it carries, etc.
>
> This depends on your REs in the lexer and your grammar. From your other
> posts it seems you are at the stage where you can control the error
> message syntax and thus your grammar. This is a good thing. If you

Well, in some cases, I have complete control over the grammar.
In other cases, I am interfacing to existing protocols so I'm
stuck with whatever was thrown together (usually with little
concern for this sort of thing!) previously.

> can put a lot of terminal symbols in your grammar you can control how
> much backtracking the grammar has to do. Backtracking in your lexer
> depends largely on the longest-match-first rules and your REs.
>
> For simple grammars it can sometimes be better to let the lexer be
> just a tokenizer and push some of the rules back into the grammar --

Currently, I'm experimenting with the lexer doing damn little
at all! E.g., things like:

\r return CR
\n return LF
[folks seem to more readily relate to seeing "CR LF" in messages]

\040 return SP
[eliminate the ambiguity of ' ' being a space or a coincidental tab]

[\200-\277] return NONASCII
[8b clear channel can see bogus "characters" if UART misconfigured]

. return (int) yytext[0];
[push everything up to the formal grammar]

But, doing this (last rule) leaves me worrying: "lex exists for a
reason; I've just made it redundant! :< "
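
Collected into an actual (f)lex rules section it would look about like
this (untested; CR, LF, SP and NONASCII assumed to be %token'd in the
yacc grammar):

%%
\r              { return CR; }
\n              { return LF; }
\040            { return SP; }
[\200-\277]     { return NONASCII; }
.               { return (int) yytext[0];  /* the matched character itself */ }
%%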

> your digit RE above, for example. But the bottom line is the number
> of messages you will have to process per unit of time. Unless you
> have a complex grammar I don't think your memory footprint is going
> to be much of an issue vs. a hand-coded parser.

I'm more concerned with rules that have long chains of terminals
that can force lots of (consecutive) states into the FSM. I've
not yet looked at the actual code generated to see how well
yacc "folds" those tables (implication analysis), nor whether it is
clever enough to shrink the "other" tables to their minimum sizes
(i.e., omit the "can't happen" state transitions).

For example, parsing a fixed width data field:

value: [1-9][0-9][0-9][0-9][0-9]'.'[0-9][0-9]
| SP [1-9][0-9][0-9][0-9]'.'[0-9][0-9]
| SP SP [1-9][0-9][0-9]'.'[0-9][0-9]
| SP SP SP [1-9][0-9]'.'[0-9][0-9]
| SP SP SP SP [1-9]'.'[0-9][0-9]
| undefined

Note these sorts of rules are more expensive than
a more typical:

value: [0-9]+.[0-9]+

As I say, lex exists for a reason. Which god am I angering by
failing to pay it tribute? :-/
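
For comparison, the lexer-side version of that fixed-width field
collapses to a single rule plus a length check (a sketch only,
untested; the 8-character width and the VALUE/BADFIELD token names
are assumptions):

\040*[1-9][0-9]*\.[0-9][0-9]   {
        if (yyleng != 8)              /* field is always 8 chars wide */
            return BADFIELD;
        yylval = (int)(100.0 * atof(yytext) + 0.5);  /* hundredths; atof() needs <stdlib.h> */
        return VALUE;
    }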
From: D Yuniskis on
Hi Walter,

Walter Banks wrote:
>
> D Yuniskis wrote:
>> Hi Walter,
>>
>> Walter Banks wrote:
>>> D Yuniskis wrote:
>>>
>>>> Does anyone have any rule-of-thumb guidelines for
>>>> the relative efficiencies of doing promotions in
>>>> the lexer vs the formal grammar? Both in terms of
>>>> time and space?
>>>>
>>>> E.g., imagine:
>>>>
>>>> [1-9][0-9]{2,4} return VALUE;
>>>>
>>>> in the lexer vs. the same explicit definition in
>>>> the *grammar*? (i.e., using ". return (int) yylex[0]"
>>>> in the lexer)
>>>>
>>>> Are there any easily identifiable criteria that affect
>>>> the tradeoff between the two approaches?
>>>>
>>>> Is there any (significant) difference in lex/yacc
>>>> implementations wrt this issue (e.g., vs flex/bison)?
>>>>
>>>> I can run some empirical tests but I am not expert enough
>>>> on either to know how to push the limits with test cases...
>>> Parsing is such a small part of a compiler that
>> Not part of a compiler -- please see my reply to Peter Keller's
>> post.
>>
>>> I would trade time differences for clarity especially
>> The point of using a formal grammar instead of an ad-hoc
>> implementation is exactly that -- to make the formats
>> of the messages parsed very obvious. And, to make it easy for
>> others to write similar parsers just by cutting and pasting
>> from an existing one.
>
> If I were doing it I would still make a similar comment
> about doing it for clarity. My personal approach would be
> regular expressions rather than the lexer, as these are
> more likely to be understood without error and can use
> common library code.

Agreed as to "clarity", ease of maintenance, etc.

But, you can push some stuff into the lexer instead of putting
it *all* in the yacc grammar. Roughly the same in terms of
readability:

[1-9] return NONZERO;
0 return ZERO;

with:

digit: NONZERO | ZERO
value: NONZERO digit digit digit digit '.' digit digit
| SP NONZERO digit digit digit '.' digit digit

(ad nauseum)

[this is admittedly a bad example]
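
Fleshed out with actions, the grammar-side version can even do the
"promotion" as it reduces (sketch only, untested; assumes the [1-9]
rule also does yylval = yytext[0] - '0'):

digit : ZERO              { $$ = 0; }
      | NONZERO           { $$ = $1; }
      ;
value : NONZERO digit digit digit digit '.' digit digit
          { $$ = ((($1 * 10 + $2) * 10 + $3) * 10 + $4) * 10 + $5;
            $$ = $$ * 100 + $7 * 10 + $8;  /* two decimal places */ }
      ;
/* ...and so on for the space-padded alternatives */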

My question regards the differences in the generated code
and memory requirements (e.g., transition tables in the FSM)
that follow from one approach versus the other.

(and, whether different lex/yacc implementations are better
or worse than others)

Keep in mind, lex and yacc were intended to generate code that
runs in resource rich environments...
From: George Neuner on
On Fri, 23 Jul 2010 01:47:11 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

>Hi,

Hi Don,

>Does anyone have any rule-of-thumb guidelines for
>the relative efficiencies of doing promotions in
>the lexer vs the formal grammar? Both in terms of
>time and space?

Point of clarification: a lexer specification *is* a formal grammar
... only for a simpler class of language (in the Chomsky sense).


>E.g., imagine:
>
>[1-9][0-9]{2,4} return VALUE;
>
>in the lexer vs. the same explicit definition in
>the *grammar*? (i.e., using ". return (int) yylex[0]"
>in the lexer)

For one thing, using (f)lex, it won't work if you are using a yacc or
bison parser above it. The parser expects the return value to be a
numeric token identifier ... returning an arbitrary integer extracted
from the input will cause a fatal error.

What you want to do is something like the following:

[1-9][0-9]{2,4} { yylval = atoi(yytext); return VALUE; }

The token's value is passed separately through a variable called
"yylval", which defaults to an integer but can be redefined to be a
union of differing types.
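
For mixed token types that usually ends up looking something like this
(a sketch; the member and token names are arbitrary):

/* in the grammar (.y) file */
%union {
    int     ival;
    double  dval;
}
%token <ival> COUNT
%token <dval> MEASUREMENT

/* in the lexer (.l) file, which #includes y.tab.h for YYSTYPE */
[1-9][0-9]{2,4}    { yylval.ival = atoi(yytext); return COUNT; }
[0-9]+\.[0-9]+     { yylval.dval = atof(yytext); return MEASUREMENT; }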

lex/yacc is before my time, but using flex/bison you can redefine the
default parser and lexer functions to include arbitrary parameters.
I've used this ability to, e.g., parse out of memory buffers rather
than from file streams.
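
Alternatively, flex's scan-string interface gets you a memory buffer
without touching the function signatures (a sketch, assuming a
non-reentrant scanner and a NUL-terminated message in msg):

YY_BUFFER_STATE bs = yy_scan_string(msg);   /* copies msg into a flex buffer */
yyparse();                                  /* or call yylex() in a loop     */
yy_delete_buffer(bs);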


>Are there any easily identifiable criteria that affect
>the tradeoff between the two approaches?

In general there is no performance difference between extracting
non-character data in the lexer vs extracting it at a higher level ...
you'll make the same number of passes over it.

However, the lexer does not have to preserve input characters across
calls to yylex(), so extraction at a higher level may mean making a
copy of the input characters.
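
Concretely, if the conversion is deferred to the parser the lexer has
to hand it a copy, something like (sketch; NUMSTRING and the sval
union member are assumptions):

[1-9][0-9]{2,4}   { yylval.sval = strdup(yytext);   /* yytext is clobbered by */
                    return NUMSTRING; }             /* the next yylex() call  */

whatever grammar action consumes it then owns the free().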


>Is there any (significant) difference in lex/yacc
>implementations wrt this issue (e.g., vs flex/bison)?

No ... your issue is outside the scope of either lexer or parser
generation.

flex and bison are MAJOR improvements on lex and yacc, but the
differences are optimizations of their internal DFA and PDA processes,
support for programmer extensions, and for bison, support for handling
more complex grammars: yacc is LALR only; bison supports LALR,
canonical LR and GLR (depending on version).


George