From: George Neuner on 27 Jul 2010 19:14

On Tue, 27 Jul 2010 12:27:14 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

><frown> Hard to consider anything "lighter" than lex/yacc
>(without resorting to ad-hoc approaches)

If all you need is regex, there is a simple preprocessing tool called
RE2C that lets you embed the expressions right into your C code. I
used it a number of years ago and it worked quite well. The project
has a new owner and supposedly there have been improvements since
then.
http://re2c.org/

If you can use a C++ compiler (even for writing C), Boost Spirit is a
lightweight library for regex and LL parsing. A number of Boost
libraries are function based rather than object based and intermix
easily with C code ... but I haven't used Spirit so I don't know
which category it is in.
http://www.boost.org/

George
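P.S. From memory, the re2c usage looks roughly like this -- take it
as a sketch only (the function is made up, and the configuration
directives may have changed under the new maintainer). You run the
file through re2c before compiling and it replaces the special
comment block with a hard-coded scanner:

  /* trivial example: does the buffer start with a run of digits?
     (assumes a NUL-terminated buffer so the scan always stops) */
  static int starts_with_number(const char *YYCURSOR)
  {
      const char *YYMARKER;   /* used by re2c if it must back up */

      /*!re2c
          re2c:define:YYCTYPE = char;
          re2c:yyfill:enable  = 0;

          [0-9]+   { return 1; }
          [^]      { return 0; }
      */
  }
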
From: D Yuniskis on 27 Jul 2010 22:36

Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 23:20:50 -0700, D Yuniskis
>> George Neuner wrote:
>>
>>> For one thing, using (f)lex,
>> [I can't speak for flex...]
>
> Read (f)lex as shorthand for "flex or lex"

Yes, I understood your notation. Rather, I was commenting that I
don't know if *flex* makes the same assumptions re: token "values"
(starting at 256).

>> lex begins assigning token values at 256 -- conveniently so that
>> you can pass any ASCII value (any 8 bit value, to be pedantic)
>> directly through.
>
> So does flex.

OK

>> E.g., "(int) yytext[0]" just takes the
>> "current character" (matching "dot") and returns it AS IF
>> it was a "token". (technically, the cast should be a bit more
>> explicit to handle signed chars properly)
>
> Yes, bison predefines 1..255 as tokens (I don't know about yacc).
> However, zero is an illegal token - (f)lex returns zero to mean EOF.
> So you can't parse binary that includes zero by doing this.

I'm not worried about binary values -- just ASCII (I don't
consider 'NUL' :> )

>> I.e., the traditional approach would be to let lex do the
>> grunt work of gathering up all the "digits", converting them
>> to an integer (or whatever the grammar really wants), stuffing
>> the value of that "integer" into u.int and reducing the
>> character sequence to a single non-terminal: INTEGER (etc.).
>>
>> By contrast, I am comparing the approach where the individual
>> digits are passed directly through lex as tokens in their
>> own right. I.e., '0' is the token having the value 0x30
>> (call it ZERO if you like); '1' is the token having the
>> value 0x31 ("ONE"), etc.
>>
>> Then, letting yacc directly operate on these "characters"
>> (tokens) and fabricate its own concept of "integer:"
>> (vs. INTEGER).
>>
>> So, my question is, how expensive does this get to be by
>> doing that grunt work in yacc (instead of lex)? E.g.,
>> yacc has to track N different states (for the N digit
>> "integer:") instead of responding to a single INTEGER
>> token in *a* particular state.
>
> I don't think anyone has measured this.

<frown> I was wondering if anyone had "looked under the hood"
enough to be able to posit a guess.

> I would expect the __raw__ regex ((f)lex) and PDA (yacc/bison) state
> machines to be roughly comparable in performance. The main difference
> is in where and when any additional user code is executed.

lex seems to like to "write spaghetti code" (an overstatement but,
contrasted to yacc's more rigid "tables", I think you know what I
mean). For the sorts of "traditional" tokens that it is meant to
handle, I think lex is probably a win. E.g.,

  [0-9]+      return INTEGER;
  [a-zA-Z]+   return WORD;

etc. yacc would probably not take a big hit, either, as it's just a
state looping back to itself repeatedly. Even ad hoc approaches can
be efficient with these. (i.e., lots of actions result in "same
state" transitions, so the total number of states is "fewer" than if
you had to *count* "digits" or "letters")

OTOH, if your grammar has more rigidly defined tokens like:

  [0-9]{4,4}   return INTEGER;

or (gasp):

  [1-9][0-9][0-9][0-9]    |
  ' '[1-9][0-9][0-9]      |
  ' ' ' '[1-9][0-9]       |
  ' ' ' ' ' '[0-9]

then there are a lot *more* state transitions involved.
Here, yacc's approach could get pricey (since you can't "cheat" and
wrap lots of character sequences into a *few* consecutive states
like

  {' '}*[1-9]?[0-9]+

-- which isn't a 1:1 replacement for the above). I.e., for rigid
message formats, you can end up with lots of sequential states, each
of which defines the *few* expected "inputs" that will advance to
the "next state" while many/most other "inputs" will signal a
(syntax) error.

> For regex, the user code is always executed after the FSM has finished
> so there is no impact on recognition speed. If you try to recognize
> simple things in the PDA however, the intermediate rules may have to
> execute code: consider the difference between recognizing the word
> "integer" and recognizing an actual integer value:
>
> %%
>
> input : word
>         {
>         printf( "recognized word \"integer\"\n" );
>         }
>       | value
>         {
>         printf( "recognized value : %d\n", $1 );
>         }
>       ;
>
> word : 'i' 'n' 't' 'e' 'g' 'e' 'r';
>
> value : value digit
>         {
>         $$ = ($1 * 10) + $2;
>         }
>       | digit
>       ;
>
> digit : '0' { $$ = 0; }
>       | '1' { $$ = 1; }
>       | '2' { $$ = 2; }
>       | '3' { $$ = 3; }
>       | '4' { $$ = 4; }
>       | '5' { $$ = 5; }
>       | '6' { $$ = 6; }
>       | '7' { $$ = 7; }
>       | '8' { $$ = 8; }
>       | '9' { $$ = 9; }
>       ;
>
> %%
>
> Accumulating any kind of value requires execution of user code. To
> accumulate an integer, you *must* construct a value for EACH digit
> individually and pass it up the chain.
>
> So for example, the following won't work:
>
> value : value digit
>         {
>         $$ = ($1 * 10) + ($2 - '0');
>         }
>       | digit
>         {
>         $$ = $1 - '0';
>         }
>       ;
>
> because by default the current state identifier is passed up the chain
> ... that is, the subrule digit will receive some bogus internal yacc
> value instead of the input character.

Yes, but you could do:

  integer: digit digit digit digit
           {
           $$ = ( (( (( ( ($1 - '0') * 10) + ($2 - '0')) * 10)
                    + ($3 - '0')) * 10) + ($4 - '0'));
           }

(assuming I've counted my parens properly :-/ )

>> If, however, yacc "can't count", then it effectively
>> counts a 5 character sequence of digit by:
>
> As a matter of fact, yacc CAN count - indirectly via the PDA's stack -
> which is why yacc can handle recursive grammars while lex cannot.
>
> But I know what you mean.
>
>> The grammars are all relatively simple. Rather, I was interested in
>> any differences in the automata they construct (for the previously
>> illustrated examples).
>
> (f)lex constructs a pure state machine. yacc/bison constructs a state
> machine with a stack. Aside from that you are dealing with how they
> pass user values around.

I'll have to build some toy grammars and move the promotions around
between lex and yacc and see what the resulting code looks like.
I'll try to get around to that in the next few days and post some
examples.

[currently preoccupied retiring a server and replacing another so I
live in fear of "forgetting something" if distracted while in the
process :-/ Unfortunately, I should have installed the Gb switch
*before* doing this as moving that much data at 100M is *painfully*
slow! :< Sheesh! To think back to the SneakerNet days... :-/ ]
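For reference, the two variants I have in mind look roughly like
this -- untested, just to show where the "promotion" from characters
to a value happens:

Variant A -- lex does the grunt work:

  lex:    [0-9]+   { yylval = atoi(yytext); return INTEGER; }
  yacc:   value : INTEGER ;

Variant B -- everything pushed through to yacc:

  lex:    .        { return (int)(unsigned char) yytext[0]; }
  yacc:   value : value digit   { $$ = ($1 * 10) + $2; }
                | digit
                ;
          (with digit '0'..'9' exactly as in your example)

(Variant A needs <stdlib.h> for atoi(); whitespace handling and the
rest of the scaffolding are elided in both.)
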
From: D Yuniskis on 28 Jul 2010 18:48

Hi George,

George Neuner wrote:
> On Fri, 23 Jul 2010 17:57:56 -0700, D Yuniskis
> <not.going.to.be(a)seen.com> wrote:
>
>> But, you can push some stuff into the lexer instead of putting
>> it *all* in the yacc grammar. Roughly the same in terms of
>> readability:
>>
>>   [1-9]   return NONZERO;
>>   0       return ZERO;
>>
>> with:
>>
>>   digit: NONZERO | ZERO
>>   value: NONZERO DIGIT DIGIT DIGIT DIGIT '.' DIGIT DIGIT
>>        | SP NONZERO DIGIT DIGIT DIGIT '.' DIGIT DIGIT
>>
>> (ad nauseam)
>>
>> [this is admittedly a bad example]
>>
>> My question regards the differences in the generated code
>> and memory requirements (e.g., transition tables in the FSM)
>> that are consequential to one approach over another.
>
> The lexer has to examine each and every character of input and there
> is overhead for each call to it. The more times the parsing code
> calls the lexing code, the slower the parse becomes.

Agreed. This suggests letting lex gobble up as much input as it
can (reasonably).

OTOH, the parser has more knowledge of what's going on than the
lexer. Once invoked, the lexer will gladly process lots of input...
THAT THE PARSER MIGHT CONSIDER "INAPPROPRIATE" (given context)!

It's this that I think will force me to move everything into yacc,
regardless of "efficiency". Thinking aloud: if context suggests to
yacc that <blah> is *required*, lex can merrily encounter <string>
in the input and patiently gobble up as many [A-Za-z]+ as happen to
be present -- EVEN THOUGH "return STRING" will cause yacc to signal
an error!

If, OTOH, lex simply passed each character (in essence) to yacc as
discrete tokens and let yacc promote each accordingly based on the
rules of the grammar, then yacc can detect invalid "messages"
earlier!

Recall, the goal here is to parse messages, not "text files" (e.g.,
"source code"). And, that processing is not "off-line"/batch
oriented. I.e., lex's "input()" will be pulling characters off of a
live input stream. That input stream is not "received" in nice
clean units (i.e., it's not like a text file where full lines can
be extracted at a time). E.g., it could *pause* "mid-message" for
an indeterminate amount of time (subject to the communication
protocol in place).

As a (trivial) example, consider a grammar that supports two
message types:

  "Hello, my name is <string>"
  | "<expr> <op> <expr>"

(where <expr> and <op> are what you suspect them to be)

[granted, a silly example but it should suffice to illustrate my
point]

If the parser has processed "Hello, my name is ", then it is
obviously awaiting <string>. If, OTOH, the input contains
"1234567.8" and has *stalled* at that point, then having lex
process <number> as "[0-9]+'.'[0-9]*" means lex will dutifully
hang, waiting for the next character (after the '8') to see if it
can fold that into its <number> rule. Only when *lex* has decided
where that token's boundary lies will it pass NUMBER to yacc.

OTOH, if the "1" had been passed to yacc, it would have already
decided that "Hello, my name is 1..." is not a valid sentence in
the grammar and signalled an error. (by contrast, if it has to wait
for lex -- and lex has to wait for a *paused* input stream -- then
the error is not detected until well after the fact!)

So, it looks like lex can only do context-independent promotions
*if* I want to keep the error path expedited.
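(Sketched in yacc terms -- untested, and with the letter/expr rules
abbreviated -- the character-at-a-time version of that toy grammar
would look something like:

  message : greeting
          | expr op expr
          ;

  greeting: 'H' 'e' 'l' 'l' 'o' ',' ' ' 'm' 'y' ' '
            'n' 'a' 'm' 'e' ' ' 'i' 's' ' ' string
          ;

  string  : letter
          | string letter
          ;

  letter  : 'a' | 'b' /* ... */ | 'z'
          | 'A' /* ... */ | 'Z'
          ;

  /* expr and op defined as you'd expect; elided here */

With single characters as the tokens, the parse of "Hello, my name
is 1" dies on the '1' the moment it arrives -- no waiting for a
token boundary.)
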
> Regardless of implementation, tokenization and lexical analysis of the
> input stream are the limiting factors for parser performance. In a
> typical optimizing compiler, lexing accounts for 70-80% of the running
> time while parsing accounts for less than 1%.
>
>> Keep in mind, lex and yacc were intended to generate code that
>> runs in resource rich environments...
>
> No they aren't. lex and yacc are products of the 1970's, when the
> average program executing on a mainframe had 32KB of RAM. They were
> designed around a particular space/speed performance point that was
> acceptable to programmers and which the tools themselves could manage
> in the limited space available.

But, they had secondary storage available, could make multiple
passes over the input, weren't time constrained, etc. I.e., they
could concentrate on *just* that one aspect.

> Newer tools like flex and bison have other options. bison has even
> tighter table packing modes than yacc and can produce smaller parsers
> using the same grammars. But bison can also produce much faster
> parsers, or reentrant parsers, or handle more complex grammars, etc.
> Similarly, flex can make smaller lexers than lex or it can make lexers
> that run somewhat faster.

The details of their respective (and relative) automata are needed
to sort out just how efficient each *might* be.
From: George Neuner on 29 Jul 2010 01:38

On Wed, 28 Jul 2010 15:48:11 -0700, D Yuniskis
<not.going.to.be(a)seen.com> wrote:

>Hi George,
>
>George Neuner wrote:
>>
>> The lexer has to examine each and every character of input and there
>> is overhead for each call to it. The more times the parsing code
>> calls the lexing code, the slower the parse becomes.
>
>Agreed. This suggests letting lex gobble up as much
>input as it can (reasonably).
>
>OTOH, the parser has more knowledge of what's going on
>than the lexer. Once invoked, the lexer will gladly
>process lots of input... THAT THE PARSER MIGHT CONSIDER
>"INAPPROPRIATE" (given context)!

That's not a convincing argument. First of all, every successful
non-human language and protocol is context free (in the Chomsky
sense). "Context free" does not mean the parser works without
knowledge of its surroundings ... it means that the syntax is such
that any legal "sentence" can have only one possible interpretation.

Secondly, to be efficient the parser needs to work at a higher level
than input characters. Lexers were created specifically so that
parsers can work with abstract "tokens" rather than strings.

>It's this that I think will force me to move everything
>into yacc, regardless of "efficiency". Thinking aloud:
>if context suggests to yacc that <blah> is *required*,
>lex can merrily encounter <string> in the input and
>patiently gobble up as many [A-Za-z]+ as happen to be
>present -- EVEN THOUGH "return STRING" will cause yacc
>to signal an error!

That's right. But if the surroundings (let's avoid the word
"context" to prevent Chomsky-based misunderstandings) dictate that
<blah> is required, then it would be a syntax error if the lexer
found a string.

>If, OTOH, lex simply passed each character (in essence)
>to yacc as discrete tokens and let yacc promote each
>accordingly based on the rules of the grammar, then
>yacc can detect invalid "messages" earlier!

But at the expense of a huge increase in grammar complexity and
correspondingly larger parsing code, increased processing per input
character, and possibly a dead programmer who, in deep despair, blew
his own brains out before finishing the job because he didn't
realize what he was getting himself into.

>Recall, the goal here is to parse messages, not "text
>files" (e.g., "source code"). And, that processing is not
>"off-line"/batch oriented. I.e., lex's "input()" will
>be pulling characters off of a live input stream.

Live input doesn't matter in any theoretical sense. As a practical
matter, however, if you must provide real-time feedback you simply
create an input-driven "push" parser rather than an
expectation-driven "pull" parser. In a push parser, the lexer drives
parsing, throwing tokens at the parser, which accumulates them until
they make sense. You'll get your error at the first "word" that
doesn't fit into any of the possible "sentences" in the match set.

yacc/lex can't create push parsers, but bison/flex and some other
modern tools like ANTLR and Yacc++ can.
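With bison, the calling side of a push parser looks something like
the sketch below. (I'm going from memory on the exact %define
spelling -- it has changed between bison versions -- and
get_next_char() here is just a stand-in for however you pull
characters or tokens off your stream.)

  /* in the grammar file: ask bison for a push parser */
  %define api.push-pull push

  /* in the code that owns the input stream: */
  yypstate *ps = yypstate_new();
  int status = YYPUSH_MORE;

  while (status == YYPUSH_MORE) {
      YYSTYPE lval;
      int tok = get_next_char(&lval);   /* stand-in for your input */

      status = yypush_parse(ps, tok, &lval);
  }
  yypstate_delete(ps);

  /* yypush_parse() keeps returning YYPUSH_MORE until the tokens seen
     so far either complete a sentence (status 0) or can no longer
     become one (status 1) -- i.e. the error surfaces at the first
     token that doesn't fit. */
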
>That input stream is not "received" in nice clean units
>(i.e., it's not like a text file where full lines can be
>extracted at a time). E.g., it could *pause* "mid-message"
>for an indeterminate amount of time (subject to the
>communication protocol in place).
>
>As a (trivial) example, consider a grammar that supports
>two message types:
>
>  "Hello, my name is <string>"
>  | "<expr> <op> <expr>"
>
>(where <expr> and <op> are what you suspect them to be)
>
>[granted, a silly example but it should suffice to illustrate
>my point]
>
>If the parser has processed "Hello, my name is ", then it
>is obviously awaiting <string>. If, OTOH, the input contains
>"1234567.8" and has *stalled* at that point, then having
>lex process <number> as "[0-9]+'.'[0-9]*" means lex will
>dutifully hang, waiting for the next character (after the '8')
>to see if it can fold that into its <number> rule. Only
>when *lex* has decided where that token's boundary lies
>will it pass NUMBER to yacc.

That's as it should be. The "words" of a "sentence" have to be
delineated. Parsing 1492 as "I've got '149' so far and I'm stuck
waiting for more input" is semantically wrong. Until you see
something that doesn't fit into the assumptions of the current match
set, you *cannot* know that you've matched anything.

Honestly, it doesn't matter whether you are matching characters or
abstract tokens. Whenever matching is applied over sequences, the
matching *process* itself is the limitation on what's possible.

>OTOH, if the "1" had been passed to yacc, it would have
>already decided that "Hello, my name is 1..." is not a
>valid sentence in the grammar and signalled an error.
>(by contrast, if it has to wait for lex -- and lex has to
>wait for a *paused* input stream -- then the error is
>not detected until well after the fact!)

You're going to have to take my word for it that you likely will be
unable to write a non-trivial grammar that can do that. Go ahead and
try, but if you blow your brains out don't say I didn't warn you. My
advice is to investigate push parsers instead.

>So, it looks like lex can only do context-independent
>promotions *if* I want to keep the error path expedited.

(f)lex has a feature that can help: start states (flex calls them
start "conditions"). Normally all the individual regex patterns are
coalesced into one giant DFA which matches input against all of them
simultaneously and stops when it has a match against any of them.
Start states group related patterns into sub-DFAs which are only
matched against when the lexer is in their related state.

Flex is more flexible 8) than lex in that it allows the same pattern
to belong to more than one state group (in lex you have to repeat
the pattern with a different state prefix). State changes are
entirely handled by user code, so, for example, the state could be a
parameter passed in by the parser.

Overall flex is a much better tool than lex. It is *much* more
flexible in the way it interfaces to the parser and in how it
handles input. Unlike lex, there are well-documented hooks into most
of the core functions. Flex lexers are intended to be customizable.
Flex also can create measurably faster lexers.

George
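P.S. A start-condition skeleton, just to make the idea concrete --
the state names, patterns and token names below are made up, and it
assumes the usual flex/bison pairing (token definitions coming from
y.tab.h):

  %{
  #include <stdlib.h>
  #include "y.tab.h"    /* token values and yylval from the parser */
  %}

  %x GREETING EXPRESSION

  %%

  <GREETING>[A-Za-z]+    { return STRING; }
  <EXPRESSION>[0-9]+     { yylval = atoi(yytext); return NUMBER; }
  <EXPRESSION>[-+*/]     { return yytext[0]; }
  <*>[ \t\n]             ;   /* ignore whitespace in any state */

  %%

  /* the parser (or any other user code) selects the active sub-DFA */
  void expect_greeting(void)   { BEGIN(GREETING); }
  void expect_expression(void) { BEGIN(EXPRESSION); }
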