From: Peter Olcott on 21 May 2010 10:34

"Chris F Clark" <cfc(a)shell01.TheWorld.com> wrote in message
news:sddk4qx95rv.fsf(a)shell01.TheWorld.com...
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:
>
>> I was wondering if there is a generally faster way to implement a DFA
>> recognizer than by using a state transition matrix.
>>
>> I already asked this question on several other groups and never
>> received any definitive answer.
>
> Did you ask on comp.compilers?
>
> In any case, looking at the answers here, it looks like one is just
> fiddling with the proportionality constant. A transition matrix and
> nested switch tables are exactly equivalent in the sense that both are
> linear (with slightly different constants); which constant is better
> depends upon how big your DFA is, how clever the compiler is, and what
> size of caches your machine has. For a small enough DFA the case
> statement will generally be faster. For most compilers, though,
> eventually, when one gets a large enough DFA, the optimizer and code
> generator "break", sometimes literally, but more often simply cannot
> track enough information to generate optimal code, and the table
> method actually gets faster.
>
> Note, case statements aren't your only option. Most DFAs can be
> converted into what looks like hand-coded programs. That takes *a lot*
> more work. That is in essence what people do when writing hand-written
> recursive descent parsers.
>
> ab* could be translated into something like:
>
>     if ( token() == 'a' ) {
>         while ( token() == 'b' ) ;
>     }
>     accept();

This would likely degrade into sequences of "if else if else if" for the
real-world use of creating a lexical analyzer that can recognize 50
keywords. That would seem to be slower than a state transition matrix /
case statement combination.

> Note, programs like this (which are hard to write correctly) *can*
> produce even faster lexers than case-statement ones, because this code
> is more similar to what the optimizer and code generator were
> originally written and tuned for.

Yeah, the case-statement one has non-trivial overhead.

> That is still not the fastest one can do, depending on the situation.
> It is probably the fastest if one is only feeding in inputs that
> match, but it might be slower at rejecting non-matching inputs than
> some non-obvious other methods. For example, the methodology of
> Boyer-Moore string searches can be applied to certain DFAs to quickly
> look for places where the DFA might fail. For accepting, one has to
> look at every input byte, but for rejecting there is no such
> requirement, and the BM technique is suggestive of how one can quickly
> find rejectable bytes in the input (if you don't have to process it
> linearly).
>
> However, since I assume you are looking for a simple DFA recognizer
> that processes its input left-to-right, and probably want one in a
> high-level language, you have already "found" your answer. If you are
> willing to use assembly language, find the paper by Thomas Pennello on
> "the Fastest LR Parser". The heart of an LR parser is essentially a
> DFA recognizer. They went from the state matrix to optimal VAX
> assembly language. Still, I suspect that for a large enough DFA a
> table interpreter could have been faster due to cache issues.

I don't understand this last part. It seems like the first one used a
state transition matrix, which is a table (and then recoded this in
assembly language), and the alternative faster one also used a table.
This looks like they are both using the same method, so how is one
faster than the other?

> Hope this helps,
> -Chris
>
> ******************************************************************************
> Chris Clark                  email: christopher.f.clark(a)compiler-resources.com
> Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
> 23 Bailey Rd                 voice: (508) 435-5016
> Berlin, MA 01503 USA         twitter: @intel_chris
> ------------------------------------------------------------------------------
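To make the comparison in this exchange concrete, here is a minimal sketch
(mine, not from the thread) of the two representations being discussed: the
same toy DFA for ab*, first as a state transition matrix driven by a small
interpreter loop, then folded into ordinary control flow. The function names,
the fixed 256-byte alphabet, and the accept convention are illustrative
assumptions.

    // Sketch only: a toy DFA for ab*, table-driven and then folded into code.
    #include <cstring>
    #include <cstddef>

    enum { REJECT = 0, START = 1, SAW_A = 2, NUM_STATES = 3 };

    static unsigned char table[NUM_STATES][256];    // next_state = table[state][byte]

    static void build_table() {
        std::memset(table, REJECT, sizeof table);   // unset entries go to the dead state
        table[START]['a'] = SAW_A;
        table[SAW_A]['b'] = SAW_A;
    }

    // Table interpreter: one indexed load per input byte; the code stays the
    // same size no matter how many states the DFA has.
    bool match_table(const char* s, std::size_t n) {
        int state = START;
        for (std::size_t i = 0; i < n; ++i)
            state = table[state][static_cast<unsigned char>(s[i])];
        return state == SAW_A;                      // SAW_A is the only accepting state
    }

    // The same DFA folded into control flow: no table loads at all, but the
    // code itself grows with the DFA (the "if else if else if" degradation
    // mentioned above).
    bool match_code(const char* s, std::size_t n) {
        std::size_t i = 0;
        if (i < n && s[i] == 'a') {
            ++i;
            while (i < n && s[i] == 'b') ++i;
            return i == n;                          // accept only at end of input
        }
        return false;
    }

For a 50-keyword lexer the folded-in version is many times larger than this
toy, which is exactly the code-size versus table-lookup trade-off the rest of
the thread turns on.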
From: Tim Little on 22 May 2010 06:58

On 2010-05-21, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
> I was thinking of the case of a lexical analyzer that needs to
> recognize fifty keywords. In this case most of the alphabet does not
> need to be explicitly represented; only those letters that are part of
> keywords to be matched need be represented.

Ah yes, with the default case taking the rest. Yes, that could be quite
compact. A special case making up a tiny fraction of all DFAs, but a
rather important one in practice :)

- Tim
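The "default case taking the rest" is usually realized by mapping all 256
possible byte values onto a small set of character classes before indexing
the transition table. The sketch below is my illustration of that
compression, not code from the thread; the class layout and the table sizes
are assumptions.

    // Sketch only: compress the transition matrix by classifying bytes first.
    // Class 0 is the default "never appears in any keyword" class, so the
    // table needs one column for it instead of a column per unused byte.
    enum { NUM_STATES = 64, NUM_CLASSES = 27 };     // illustrative sizes

    static unsigned char char_class[256];           // byte value -> class index
    static unsigned char transition[NUM_STATES][NUM_CLASSES];

    static void build_classes() {
        for (int b = 0; b < 256; ++b) char_class[b] = 0;        // default class
        for (int c = 'a'; c <= 'z'; ++c) char_class[c] = 1 + (c - 'a');
    }

    // Two small loads per byte instead of indexing a NUM_STATES x 256 matrix.
    int step(int state, unsigned char byte) {
        return transition[state][char_class[byte]];
    }

Letters that never occur in any of the fifty keywords can share the default
class as well, which is the compactness Tim describes.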
From: Chris F Clark on 22 May 2010 16:16

"Peter Olcott" <NoSpam(a)OCR4Screen.com> writes:
> I don't understand this last part. It seems like the first one used a
> state transition matrix, which is a table (and then recoded this in
> assembly language), and the alternative faster one also used a table.
> This looks like they are both using the same method, so how is one
> faster than the other?

In the assembly language version the table is encoded in the code: the
if-then-else sequences you were mentioning when I covered hand-written
recursive descent. There is essentially no table. The table is
completely folded into the code that interprets it (i.e. one has applied
partial evaluation). There is just the large assembly language program
that looks at bytes and decides what to do. If your lexer is small
enough (your 50-keyword example is likely to be), then this assembly
language program will fit in the cache of the processor and it will run
at maximal speed, because the cache will keep up with the processor and
the processor will deal with the bytes in the minimum number of cycles.

However, if the lexer is not small enough (when investigating this stuff
we were looking at "lexers" with hundreds of thousands of "keywords"),
the assembly language program grows to be multiple gigabytes and doesn't
fit into cache at all. Then writing a small interpreter that does fit
into cache, and building a cache-friendly data structure for the table,
makes a difference. You are going to have cache misses, but you pick a
representation that minimizes the number of them, because a cache miss
takes longer than the inner loop of the interpreter.

Do those two paragraphs make the distinction clear? I hope so, because
I'm going to throw another wrinkle into the problem.

Note, if your problem is small enough that cache isn't the issue, the
next level you look at is branch prediction, as you want the processor
to "guess" the right way on the branch instructions in your program. The
real advantage of merging your table into your code isn't simply
eliminating the array reads that fetch table entries. You actually want
to optimize that code. This is the big plus of partial evaluation. You
look at which branches are actually taken and optimize the code for the
most frequent of those sequences (and eliminate any code for branches
that can never be taken). This may cause you to unroll loops and do
other things that change the "shape" of the program, which is why the
hand-written recursive descent parser isn't simply one big loop with a
case statement inside. If you've seen "the", looking to see whether the
next character is "n" is fairly important (and likely to be a hit); not
so if you've seen "whil".

Again, this kind of optimization is relatively hard to automate, and you
are into the diminishing-returns level. (That said, we did some of that
in Yacc++, if you ask for the tables to be laid out as code. You can
tell it that you want it to unroll the code, and if you do, it sees not
only how one character of the token influences what it should do, but
looks at character sequences up to the unrolling limit you have set. It
does fairly minimal optimizations on that, but it does at least
eliminate the dead branches. A fairly typical one it eliminates is
spurious checks for EOF, which tend to dominate lexer time if you do
them on every character. That alone made investigating laying out the
code and optimizing it worthwhile. We figured out how to fold that
optimization back into the default interpreter and sped up average
lexing time by 10-25% even when one was just using tables.)

Hope this helps,
-Chris

******************************************************************************
Chris Clark                  email: christopher.f.clark(a)compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------
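One concrete example of the kind of check described above as dominating lexer
time is the per-character end-of-input test. A common way to remove it, shown
in the sketch below, is to terminate the buffer with a sentinel byte that can
never continue a token, so end-of-input falls out of the ordinary "this byte
does not extend the token" path. This is my own illustration of the idea; the
thread does not say this is how Yacc++ implements it, and the sentinel value
and helper names are assumptions.

    // Sketch only: drop the per-character EOF test by using a sentinel byte.
    #include <string>
    #include <cstddef>

    const char SENTINEL = '\0';          // assumed never to appear inside a token

    // Caller guarantees buf ends with SENTINEL, so the loop needs no separate
    // end-of-input test: the sentinel simply fails the "still inside the
    // token" check like any other non-identifier byte.
    std::size_t scan_identifier(const char* buf, std::size_t pos) {
        while (buf[pos] >= 'a' && buf[pos] <= 'z') ++pos;
        return pos;                      // first position past the identifier
    }

    std::string make_buffer(const std::string& source) {
        return source + SENTINEL;        // append the sentinel once, up front
    }

In a table-driven interpreter the same trick amounts to giving the sentinel
its own character class whose transitions all leave the current token, which
fits the description of folding the optimization back into the default
interpreter.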