From: Joseph M. Newcomer on 18 May 2010 21:07

The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
analysis all the time using DFAs and never once do I have a table. Years of experience
taught me that they are faster than anything lex or analogous lexer generators can do. But
I'll reserve judgment.
				joe

On Tue, 18 May 2010 17:36:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
>> It is important to distinguish between the *concept* of DFA and the implementation of "a
>> table-driven DFA" and the special case of "a table-driven DFA generated by the lex tool".
>>
>> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
>> lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
>> work at this point, of generating tight C code, and no table appears anywhere in his
>> implementation.
>> 				joe
>>
>I really, really doubt that no table is used in a lexical analyzer that
>is very, very fast. Some little include of an include that you missed is
>more likely. I can't accept this without direct proof.
>
>I can accept that he may have derived a parser that is very, very fast
>without the need for a table. This is more plausible.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 18 May 2010 21:42

On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
> analysis all the time using DFAs and never once do I have a table. Years of experience
> taught me that they are faster than anything lex or analogous lexer generators can do. But
> I'll reserve judgment.
> 				joe

How can you scan for dozens of keywords efficiently without a state
transition matrix? Without a state transition matrix it's got to be
something like if, else if, else if, et cetera.

With a state transition matrix you can simultaneously look for an
unlimited number of different character strings with zero comparisons
per byte.

>
> On Tue, 18 May 2010 17:36:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
>>> It is important to distinguish between the *concept* of DFA and the implementation of "a
>>> table-driven DFA" and the special case of "a table-driven DFA generated by the lex tool".
>>>
>>> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
>>> lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
>>> work at this point, of generating tight C code, and no table appears anywhere in his
>>> implementation.
>>> 				joe
>>>
>> I really, really doubt that no table is used in a lexical analyzer that
>> is very, very fast. Some little include of an include that you missed is
>> more likely. I can't accept this without direct proof.
>>
>> I can accept that he may have derived a parser that is very, very fast
>> without the need for a table. This is more plausible.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Mihai N. on 19 May 2010 01:34

> I can't accept this without direct proof.

Yet, we are all supposed to take everything *you* say without any proof.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Mihai N. on 19 May 2010 01:42

> You code in Word? Word is essentially user-hostile for writing code!

Actually, it is very nice. You set the spelling language to "C" and get
nice squiggles for all syntax errors. ;-)
Also, it has the great benefit that it does not compile and run your code,
which allows you to write perfect code and create super fast algorithms.
Compiling the code and running it destroys the perfection :-)
It's called "reality" and it is a bad thing.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Joseph M. Newcomer on 19 May 2010 02:18
See below...

On Tue, 18 May 2010 20:42:55 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
>> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
>> analysis all the time using DFAs and never once do I have a table. Years of experience
>> taught me that they are faster than anything lex or analogous lexer generators can do. But
>> I'll reserve judgment.
>> 				joe
>
>How can you scan for dozens of keywords efficiently without a state
>transition matrix? Without a state transition matrix it's got to be
>something like if, else if, else if, et cetera.
****
Your assumption that a "state transition matrix" is required shows that you have never
actually written a hard-coded lexer. I've written so many that I've lost count. I'll
knock one off as fast as I can type; in fact, typing speed is the limit to how long it
takes to create one.

How did you develop this fixation that a "state transition matrix" is required? That is
only one way to implement a DFA. A switch statement works well, too, and if-statements
are considered a viable technique. While a table-driven method is known to be one of the
implementations, it is far and away not the ONLY known implementation.

Nigel's experience was that a hard-coded recognizer without a table is blindingly fast,
MUCH faster than a lexer that uses a transition matrix to drive the recognition. As these
things go, using transition matrices is a rather "modern" invention in lexing. The
compiler for the CDC Star computer used pipelined floating point operations to create a
character class vector (because the Star was LOUSY at character manipulation but
blindingly fast at pipelined floating point).
So don't try to explain to me that table-driven lexers are the only possible
implementation; I was writing hard-coded lexers in 1966, and using table-driven lexers in
1967 that were created from a lexer generator written by Art Evans for his PhD
dissertation in 1961. We wrote a hand-coded lexer for the Bliss compiler. I've been
writing compilers of one sort or another for 45 years, with table-driven components,
hard-coded components, and everything in between. I've written sparse matrix packers
using "comb vectors" to compress parse tables (an invention of people at Karlsruhe
University in Germany). And I can write programs that can convert NDFA machines to DFA
machines, and then optimize the DFA to minimum states, and I used to give this as a
freshman programming exercise. So please stop trying to tell me how much I know or don't
know about lexers, parsers, and compiler implementations in general. I also worked
closely with one of the world's best language designers, and I learned a lot about
language design from him. Not enough to be as good as he is, but enough to know when I
see a bad design.

Do you know what a "perfect hash" is? I've built perfect hash algorithms. Do you know
where they are used, and why? I'm not some newbie at this business. I worked with
optimizing compiler technology from 1970 through 1983, and my Ph.D. was in optimal code
generation from table-driven code generators. I've built debuggers, byte code
interpreters, tree interpreters, threaded code interpreters, and machine code generators.
I wrote my first compiler in 1967, which took an Algol-like language and generated hard
code. I was one of the key developers of an interactive language from 1967-1969. My
first optimizing compiler was a student project in 1967 (I can date that exactly, because
the machine I wrote it for was decommissioned in 1968; our grade was inversely
proportional to code size, and I got an A+). I've forgotten more about compilers than
most people have ever learned.
The last compiler I wrote had not a trace of tables anywhere in it, and I wrote it three
years ago (I just checked the file dates). My first Windows project was a compiler which
had not a single table for either its lexer or parser, back in 1990. So I have no idea
why you think these representations are essential.

And my skills pale beside those of deep experts like Nigel Horspool; see, for example,
http://webhome.cs.uvic.ca/~nigelh/pubs.html
and I recommend the paper
http://webhome.cs.uvic.ca/~nigelh/Publications/fastparse.pdf
a 1990 paper wherein he shows that the performance of table-driven parsers can be improved
by very large factors (like 8x) by hard-coding the transitions in pure C code, without a
table anywhere in the resulting compiler. Don't try to explain why table-driven must be
faster than any alternative when it was proven 20 years ago that this is not true. I
suggest looking at Table 2 on page 19.

You asked for proof; here it is.
****

>With a state transition matrix you can simultaneously look for an
>unlimited number of different character strings with zero comparisons
>per byte.
****
So what's your complaint about adding lexer rules for localized digit sequences? You just
said it won't add any time...
				joe

>
>>
>> On Tue, 18 May 2010 17:36:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
>>>> It is important to distinguish between the *concept* of DFA and the implementation of "a
>>>> table-driven DFA" and the special case of "a table-driven DFA generated by the lex tool".
>>>>
>>>> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
>>>> lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
>>>> work at this point, of generating tight C code, and no table appears anywhere in his
>>>> implementation.
>>>> 				joe
>>>>
>>> I really, really doubt that no table is used in a lexical analyzer that
>>> is very, very fast. Some little include of an include that you missed is
>>> more likely. I can't accept this without direct proof.
>>>
>>> I can accept that he may have derived a parser that is very, very fast
>>> without the need for a table. This is more plausible.
>> Joseph M. Newcomer [MVP]
>> email: newcomer(a)flounder.com
>> Web: http://www.flounder.com
>> MVP Tips: http://www.flounder.com/mvp_tips.htm

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm