From: George Neuner on 24 Jul 2010 00:20

On Fri, 23 Jul 2010 17:57:56 -0700, D Yuniskis <not.going.to.be(a)seen.com> wrote:

>But, you can push some stuff into the lexer instead of putting
>it *all* in the yacc grammar.  Roughly the same in terms of
>readability:
>
>[1-9]   return NONZERO;
>0       return ZERO;
>
>with:
>
>digit: NONZERO | ZERO
>value: NONZERO DIGIT DIGIT DIGIT DIGIT '.' DIGIT DIGIT
>     | SP NONZERO DIGIT DIGIT DIGIT '.' DIGIT DIGIT
>
>(ad nauseam)
>
>[this is admittedly a bad example]
>
>My question regards the differences in the generated code
>and memory requirements (e.g., transition tables in the FSM)
>that are consequential to one approach over another.

The lexer has to examine each and every character of input, and there is
overhead for each call to it.  The more times the parsing code calls the
lexing code, the slower the parse becomes.  Regardless of implementation,
tokenization and lexical analysis of the input stream are the limiting
factors for parser performance.  In a typical optimizing compiler, lexing
accounts for 70-80% of the running time while parsing accounts for less
than 1%.

>Keep in mind, lex and yacc were intended to generate code that
>runs in resource rich environments...

No, they weren't.  lex and yacc are products of the 1970s, when the average
program executing on a mainframe had 32KB of RAM.  They were designed around
a particular space/speed performance point that was acceptable to programmers
and that the tools themselves could manage within the limited space available.

Newer tools like flex and bison have other options.  bison has even tighter
table-packing modes than yacc and can produce smaller parsers from the same
grammars.  But bison can also produce much faster parsers, or reentrant
parsers, or handle more complex grammars, etc.  Similarly, flex can make
smaller lexers than lex, or it can make lexers that run somewhat faster.

George
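For contrast with the per-digit rules quoted above, here is a minimal sketch of
pushing the whole fixed-width field into the lexer instead.  The token name
VALUE, the file name, and the yacc-generated header "y.tab.h" are assumptions
for illustration, not something taken from this thread:

  /* scanner.l -- recognize the entire "DDDDD.DD" / " DDDD.DD" field at once */
  %{
  #include <stdlib.h>
  #include "y.tab.h"      /* token ids from yacc -d (assumed header name) */
  %}
  %%
  [1-9][0-9]{4}\.[0-9]{2}       { yylval = atoi(yytext); return VALUE; }
  " "[1-9][0-9]{3}\.[0-9]{2}    { yylval = atoi(yytext); return VALUE; }
  .|\n                          { return (int)(unsigned char)yytext[0]; }
  %%
  /* atoi() keeps only the integer part here; a real lexer would convert
     the full fixed-point field into whatever type the grammar wants.    */

With this split the parser sees one VALUE token per field instead of eight
character tokens, so the digit-by-digit state chains discussed later in the
thread never show up in the yacc transition tables.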
From: D Yuniskis on 24 Jul 2010 02:20

Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 01:47:11 -0700, D Yuniskis
> <not.going.to.be(a)seen.com> wrote:
>
>> Does anyone have any rule-of-thumb guidelines for
>> the relative efficiencies of doing promotions in
>> the lexer vs the formal grammar?  Both in terms of
>> time and space?
>
> Point of clarification: a lexer specification *is* a formal grammar
> ... only for a simpler language (in the Chomsky sense).

Yes, but the grammars that lex can parse are quite simple.  It's mainly used
to do the "grunt work" (e.g., gobbling up groups of characters that the yacc
grammar would eventually consider as "nonterminals" -- NUMBER, STRING,
PATHNAME, etc.) that would consume lots of state transitions in the yacc FSM.

>> E.g., imagine:
>>
>> [1-9][0-9]{2,4}   return VALUE;
>>
>> in the lexer vs. the same explicit definition in
>> the *grammar*?  (i.e., using ". return (int) yytext[0]"
>> in the lexer)
>
> For one thing, using (f)lex, it won't work if you are using a yacc or
> bison parser above it.  The parser expects the return value to be a
> numeric token identifier ... returning an arbitrary integer extracted
> from the input will cause a fatal error.

[I can't speak for flex...]  lex begins assigning token values at 256 --
conveniently, so that you can pass any ASCII value (any 8-bit value, to be
pedantic) directly through.  E.g., "(int) yytext[0]" just takes the "current
character" (matching "dot") and returns it AS IF it was a "token".
(technically, the cast should be a bit more explicit to handle signed chars
properly)  So, the rule I cited essentially turns lex into a raw pipe...

> What you want to do is something like the following:
>
> [1-9][0-9]{2,4}  { yylval = atoi(yytext); return VALUE; }
>
> The token's value is passed separately through a variable called
> "yylval", which defaults to an integer but can be redefined to be a
> union of differing types.

Yes, I understand.  I was pointing out the difference between the lex+yacc
approach and the "dumb lex" + "do-it-all-in-yacc" approach.

I.e., the traditional approach would be to let lex do the grunt work of
gathering up all the "digits", converting them to an integer (or whatever
the grammar really wants), stuffing the value of that "integer" into u.int
and reducing the character sequence to a single non-terminal: INTEGER (etc.).

By contrast, I am comparing the approach where the individual digits are
passed directly through lex as tokens in their own right.  I.e., '0' is the
token having the value 0x30 (call it ZERO if you like); '1' is the token
having the value 0x31 ("ONE"), etc.  Then, letting yacc directly operate on
these "characters" (tokens) and fabricate its own concept of "integer:"
(vs. INTEGER).

So, my question is, how expensive does this get to be by doing that grunt
work in yacc (instead of lex)?  E.g., yacc has to track N different states
(for the N-digit "integer:") instead of responding to a single INTEGER token
in *a* particular state.

[grrr... this is clear in my head but I seem to be having a tough time
expressing it  :< ]

Consider: you can "scan" a 5-character sequence of digits with a counter and
a single state (if char is digit, stay in this state and increment count; if
char is not digit, move to some other state as defined by the grammar).  If,
however, yacc "can't count", then it effectively counts a 5-character
sequence of digits by:

state1: if digit, move to state 2, else state X
state2: if digit, move to state 3, else state ...
state3: if digit, move to state 4, else state ...
state4: if digit, move to state 5, else state ...
state5: if digit, move to state 6, else state ...

i.e., in state2, one digit has been accumulated; in state3, two digits have
been accumulated; ... until, finally, in state6, five (consecutive) digits
have been accumulated.

If each state requires a transition table with 256+ entries (for each of the
possible stimuli -- tokens -- that can be received *in* that state), then
having "state strings" like this gets expensive.  E.g., imagine a grammar
that parses:

sentence: "The quick brown fox stepped on the lazy dog's nose"

entirely within yacc (I am guessing that implication analysis won't let you
reduce that to anything less than strlen(...) discrete states!)

> lex/yacc is before my time, but using flex/bison you can redefine the
> default parser and lexer functions to include arbitrary parameters.
> I've used this ability to, e.g., parse out of memory buffers rather
> than from file streams.

Yes, this is an inherited capability.

>> Are there any easily identifiable criteria that affect
>> the tradeoff between the two approaches?
>
> In general there is no performance difference between extracting
> non-character data in the lexer vs extracting it at a higher level ...
> you'll make the same number of passes over it.

I'm concerned with the efficiency of the FSM generation (consider the
"sentence" above), and with how deep the pushdown stack gets to cover backing
out of "wrong promotions" that are attempted as yacc encounters new tokens
(which, in the case I am suggesting, are actually *characters*!).

> However, the lexer does not have to preserve input characters across
> calls to yylex(), so extraction at a higher level may mean making a
> copy of the input characters.
>
>> Is there any (significant) difference in lex/yacc
>> implementations wrt this issue (e.g., vs flex/bison)?
>
> No ... your issue is outside the scope of either lexer or parser
> generation.
>
> flex and bison are MAJOR improvements on lex and yacc, but the
> differences are optimizations of their internal DFA and PDA processes,
> support for programmer extensions, and for bison, support for handling
> more complex grammars: yacc is LALR, bison supports LALR, LR, SLR and
> GLR (depending on version).

The grammars are all relatively simple.  Rather, I was interested in any
differences in the automata they construct (for the previously illustrated
examples).  I should just build some bogus grammars and tweak them to see how
the tables change in size.  Perhaps I can deduce the basic behavior of the
FSM generator empirically (though I suspect these things are too clever for
that sort of simplistic experimentation).

[if I've still not managed to highlight the issue that is of interest to me,
let me know and I'll try to come up with some real examples to
*quantitatively* illustrate]
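To make the comparison concrete, here is a minimal sketch of the "raw pipe"
split being described; the file names and the five-digit field are
illustrative assumptions, and the grammar itself does the digit counting, one
rule position per expected digit:

  /* scanner.l -- the "raw pipe": every input character comes back as a token */
  %%
  .|\n    { return (int)(unsigned char)yytext[0]; }
  %%

  /* parser.y -- the state1..state5 chain above, written as a grammar:
     one position per digit, ten character-token alternatives per digit */
  %%
  integer : digit digit digit digit digit ;
  digit   : '0' | '1' | '2' | '3' | '4'
          | '5' | '6' | '7' | '8' | '9' ;
  %%

Counting N digits this way costs N rule positions (and the corresponding
parser states and table rows), which is exactly the growth being asked about.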
From: George Neuner on 26 Jul 2010 14:57

On Fri, 23 Jul 2010 23:20:50 -0700, D Yuniskis <not.going.to.be(a)seen.com> wrote:

>George Neuner wrote:
>
>> Point of clarification: a lexer specification *is* a formal grammar
>> ... only for a simpler language (in the Chomsky sense).
>
>Yes, but the grammars that lex can parse are quite simple.
>It's mainly used to do the "grunt work" ...

The grammars lex can parse are a superset of regular expressions.
http://en.wikipedia.org/wiki/Chomsky_hierarchy

lex adds to regex limited context sensitivity and a limited ability to do
counting.  Both of these abilities are simple hacks, however, and neither is
sufficient to promote the recognized language to the next Chomsky level.

>> For one thing, using (f)lex,
>
>[I can't speak for flex...]

Read (f)lex as shorthand for "flex or lex".

>lex begins assigning token values at 256 -- conveniently so that
>you can pass any ASCII value (any 8 bit value, to be pedantic)
>directly through.

So does flex.

>E.g., "(int) yytext[0]" just takes the
>"current character" (matching "dot") and returns it AS IF
>it was a "token".  (technically, the cast should be a bit more
>explicit to handle signed chars properly)

Yes, bison predefines 1..255 as tokens (I don't know about yacc).  However,
zero is an illegal token -- (f)lex returns zero to mean EOF.  So you can't
parse binary that includes zero by doing this.

>I.e., the traditional approach would be to let lex do the
>grunt work of gathering up all the "digits", converting them
>to an integer (or whatever the grammar really wants), stuffing
>the value of that "integer" into u.int and reducing the
>character sequence to a single non-terminal: INTEGER (etc.).
>
>By contrast, I am comparing the approach where the individual
>digits are passed directly through lex as tokens in their
>own right.  I.e., '0' is the token having the value 0x30
>(call it ZERO if you like); '1' is the token having the
>value 0x31 ("ONE"), etc.
>
>Then, letting yacc directly operate on these "characters"
>(tokens) and fabricate its own concept of "integer:"
>(vs. INTEGER).
>
>So, my question is, how expensive does this get to be by
>doing that grunt work in yacc (instead of lex)?  E.g.,
>yacc has to track N different states (for the N digit
>"integer:") instead of responding to a single INTEGER
>token in *a* particular state.

I don't think anyone has measured this.  I would expect the __raw__ regex
((f)lex) and PDA (yacc/bison) state machines to be roughly comparable in
performance.  The main difference is in where and when any additional user
code is executed.  For regex, the user code is always executed after the FSM
has finished, so there is no impact on recognition speed.  If you try to
recognize simple things in the PDA, however, the intermediate rules may have
to execute code.  Consider the difference between recognizing the word
"integer" and recognizing an actual integer value:

%%

input : word   { printf( "recognized word \"integer\"\n" ); }
      | value  { printf( "recognized value : %d\n", $1 ); }
      ;

word  : 'i' 'n' 't' 'e' 'g' 'e' 'r' ;

value : value digit  { $$ = ($1 * 10) + $2; }
      | digit
      ;

digit : '0' { $$ = 0; }
      | '1' { $$ = 1; }
      | '2' { $$ = 2; }
      | '3' { $$ = 3; }
      | '4' { $$ = 4; }
      | '5' { $$ = 5; }
      | '6' { $$ = 6; }
      | '7' { $$ = 7; }
      | '8' { $$ = 8; }
      | '9' { $$ = 9; }
      ;

%%

Accumulating any kind of value requires execution of user code.  To accumulate
an integer, you *must* construct a value for EACH digit individually and pass
it up the chain.
So, for example, the following won't work:

value : value digit  { $$ = ($1 * 10) + ($2 - '0'); }
      | digit        { $$ = $1 - '0'; }
      ;

because, by default, the current state identifier is passed up the chain ...
that is, the subrule digit will receive some bogus internal yacc value instead
of the input character.

>If, however, yacc "can't count", then it effectively
>counts a 5 character sequence of digit by:

As a matter of fact, yacc CAN count -- indirectly, via the PDA's stack --
which is why yacc can handle recursive grammars while lex cannot.  But I know
what you mean.

>The grammars are all relatively simple.  Rather, I was interested in
>any differences in the automata they construct (for the previously
>illustrated examples).

(f)lex constructs a pure state machine.  yacc/bison constructs a state machine
with a stack.  Aside from that, you are dealing with how they pass user values
around.

George
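One way to make that second form work -- a sketch, assuming the raw
pass-through scanner discussed earlier in the thread -- is to have the lexer
copy the character into yylval before returning it as the token number; the
default $$ = $1 action then hands the real input character up to the value
rules instead of a bogus internal value:

  /* scanner.l -- pass each character through as the token AND as its value */
  %%
  .|\n    { yylval = (unsigned char)yytext[0];
            return (int)(unsigned char)yytext[0]; }
  %%

  /* parser.y -- $1 and $2 now really do hold the input characters */
  value : value digit  { $$ = ($1 * 10) + ($2 - '0'); }
        | digit        { $$ = $1 - '0'; }
        ;
  digit : '0' | '1' | '2' | '3' | '4'
        | '5' | '6' | '7' | '8' | '9' ;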
From: Cesar Rabak on 26 Jul 2010 15:04

On 23/7/2010 05:47, D Yuniskis wrote:
> Hi,
[snipped]

Don,

Having arrived late in this thread, and having understood that you're trying
to use lex/yacc for a [comm] protocol, I wonder whether in your searches you
considered using tools closer to your problem at hand.  Something lighter
than Estelle, for example?

--
Cesar Rabak
GNU/Linux User 52247.  Get counted: http://counter.li.org/
From: D Yuniskis on 27 Jul 2010 15:27

Hi Cesar,

Cesar Rabak wrote:
> Having arrived late in this thread, and having understood that you're trying
> to use lex/yacc for a [comm] protocol, I wonder whether in your searches you
> considered using tools closer to your problem at hand.

<frown>  Hard to consider anything "lighter" than lex/yacc (without resorting
to ad hoc approaches).

> Something lighter than Estelle, for example?

I've never used Estelle.  But it seems a lot more heavy-handed than simply
specifying a grammar in lex/yacc.  I.e., it seems like it wants to extend
*beyond* just the "interface specification" and deeper into the interactions
between competing/cooperating processes that happen to be "communicating"
(gee, how's that for a mouthful?  :> )

I just want to create a simple template/framework that allows others to
create similar interfaces just by tweaking an existing one (since most folks
seem to go "deer-in-headlights" when faced with creating something "from
scratch").  lex/yacc can do that.  But the question is where the various bits
of the grammar should reside to most effectively use these capabilities (as
with most things, there are lots of different ways to approach it WITHIN the
lex/yacc framework; but there are consequences to each approach!).  Since I
am anticipating others will copy/modify what I have done, there is extra
incentive to "get it right" (lest *everyone* ends up repeating my mistakes!)
:-/