From: Joseph M. Newcomer on 22 May 2010 23:56 See below... On Sat, 22 May 2010 12:34:29 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/22/2010 11:38 AM, Joseph M. Newcomer wrote: >> See below... >> On Sat, 22 May 2010 08:33:43 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >> >>> On 5/22/2010 4:50 AM, Joseph M. Newcomer wrote: >>>> See below... >>>> On Fri, 21 May 2010 22:50:40 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >>>> >>>>> I already posted it you should be able to tell. I also checked on >>>>> another (more specialized) newsgroup for alternative designs. There were >>>>> only two basic alternatives provided. Joe said that hand coded lexers >>>>> can be faster than table driven ones. The evidence that he provided for >>>>> this is a paper that showed improvements to table driven parsers. >>>> **** >>>> And that implementation has a hard-coded lexer in it as well. I was searching for a paper >>>> that corresponded to Nigel's presentation to our working group, and that was the closest >>>> one to what he presented. The lexer is generated from the lex tables. It's hard code all >>>> the way down. Maybe you can go read some of his other papers. I'm not going to take the >>>> time to justify every little detail to you. >>>> joe >>>> **** >>> >>> There are only three basic designs for lexers that I uncovered: >>> (1) Case statement within case statement solutions >>> (2) Multiple sequences of if else if statements >>> (3) State transition matrix based solutions >> **** >> (4) The case-pair decoder, with the current state and character pair >That is (1) above. **** No, it is not, it is much flatter, and easier to write. I'm sorry you can't tell the difference, which is actually pretty significant! There is NO "case statement within case statement" in the model I showed you, only a very flat, single switch statement. Isn't it obvious by inspection? 
joe **** > >> >> Note that I already showed you this, so it would have taken zero effort to "uncover" it. >> Weren't you paying attention? I even wrote out the code, and it is not any of the above >> three cases! >> joe >> **** >>> >>> It seems that the latter one has far fewer instructions in its >>> instruction trace table, and in many cases (such as the UTF8_to_UTF32 >>> converter) is small enough to entirely fit within cache. >>> >>> Although you have convinced me that one can not be sure of the actual >>> relative performance without actual benchmark testing, I am still going >>> with the last option as the best choice. Others can feel free to >>> benchmark against this choice when I complete this code next week. >> Joseph M. Newcomer [MVP] >> email: newcomer(a)flounder.com >> Web: http://www.flounder.com >> MVP Tips: http://www.flounder.com/mvp_tips.htm Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 23 May 2010 00:32 On 5/22/2010 10:56 PM, Joseph M. Newcomer wrote: > See below... > On Sat, 22 May 2010 12:34:29 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: > >>>> There are only three basic designs for lexers that I uncovered: >>>> (1) Case statement within case statement solutions >>>> (2) Multiple sequences of if else if statements >>>> (3) State transition matrix based solutions >>> **** >>> (4) The case-pair decoder, with the current state and character pair >> That is (1) above. > **** > No, it is not, it is much flatter, and easier to write. I'm sorry you can't tell the > difference, which is actually pretty significant! There is NO "case statement within case > statement" in the model I showed you, only a very flat, single switch statement. Isn't it > obvious by inspection? > joe > **** Although I don't remember it at all, I can imagine it from the details that you just provided. For small DFAs it may be faster than the method that I propose. For larger ones it may be slower. I see no way around it taking more memory than the one that I propose.
From: Joseph M. Newcomer on 23 May 2010 01:16 See below... On Sat, 22 May 2010 23:32:26 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/22/2010 10:56 PM, Joseph M. Newcomer wrote: >> See below... >> On Sat, 22 May 2010 12:34:29 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >> > >>>>> There are only three basic designs for lexers that I uncovered: >>>>> (1) Case statement within case statement solutions >>>>> (2) Multiple sequences of if else if statements >>>>> (3) State transition matrix based solutions >>>> **** >>>> (4) The case-pair decoder, with the current state and character pair >>> That is (1) above. >> **** >> No, it is not, it is much flatter, and easier to write. I'm sorry you can't tell the >> difference, which is actually pretty significant! There is NO "case statement within case >> statement" in the model I showed you, only a very flat, single switch statement. Isn't it >> obvious by inspection? >> joe >> **** > >Although I don't remember it at all, I can imagine it from the details >that you just provided. For small DFAs it may be faster than the method >that I propose. For larger ones it may be slower. I see no way around it >not taking more memory that the one that I propose. **** What you are capable of seeing has no relevance in this discussion. If you create the code, and the total space consumed by the code and data is less than the space consumed by another implementation's code and data, as measured by inspection of the actual generated code, then the smaller, faster implementation is *potentially* better. Of course, you have to account for the asymmetry in performance of accessing code and data in doing this evaluation. But you are offering an opinion, not anything factual that it is possible to discuss. Only with running implementations that are carefully instrumented is it possible to make claims about performance. 
Nothing else matters, and I'm surprised that after months of my telling you this, you still don't understand it. Pointing to ANY two instances of code and saying "this instance has better performance than that instance" is a meaningless activity. Only by MEASURING the two instances and saying "According to the statistical performance data, code A has an average time of KA units of time with a standard deviation of SDA units, and code B has an average time of KB units of time with a standard deviation of SDB units" can you say anything meaningful about the execution of the code. Anything else is a complete waste of time. To quote a recent ad for the TV series "The Closer": "Evidence. You know, something we can look at and say AH!" Opinions are not evidence. What you are capable of seeing is irrelevant. Only actual code and its performance matter (note that because the constants used in the comparison are in the code stream, they are subject to code prefetch and predecode optimizations, instruction pipeline optimizations, etc., whereas data in an external table cannot be subject to the same optimizations). The point is, I don't KNOW the correct answer, but I'm willing to say so; you offer an ill-founded opinion based on the scope of your limited understanding, and claim it is inarguably correct. And you ignore nearly everything that might actually change the program behavior, such as the nature of modern computer architectures; you seem to still work in the delusional system that instructions execute sequentially, a fact which is known to have not been true for over 20 years in the x86 family. 
Then you assume that instruction execution times are constant, which is sort-of-true if you take 1 clock cycle as the execution time, and ignore L1/L2/memory timings, and then ignore the fact that in modern superscalars somewhere between 2 and 6 instructions can be concurrently executed in a single clock cycle, that there is speculative execution, branch prediction, asynchronous opportunistic instruction execution, and so on... Stop making your silly claims and produce some REAL data we can look at! You *may* be right. But we have ZERO *evidence* that you are, and all you offer as "proof" is your limited ability to reason about performance. You write a *design* and claim it is an *implementation*. You look at source code and pretend you understand how the generated instructions execute, and how the memory access patterns work! Most experts cannot look at the generated machine code and /a priori/ predict behavior, but you seem to claim you can do this just by looking at C/C++ source! Get a grip on life, man, and realize you haven't a CLUE how long it takes a loop to execute on a REAL machine, simply by looking at the C/C++ source (or even at the generated machine code). Note: I know one HELL of a lot more about the x86 than you do, and I know that *I* cannot predict the behavior, even if I spend days and days "desk checking" the generated machine code. So I'm not sure how someone with your limited understanding knows so much better how long it takes to execute. Maybe, just maybe, one of the chip architects at Intel who designed the chip could do this. But you don't possess that level of expertise, and neither do I, and probably no one outside the chip design group at Intel does. joe Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 23 May 2010 09:21 On 5/23/2010 12:16 AM, Joseph M. Newcomer wrote: > See below... > On Sat, 22 May 2010 23:32:26 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: > >> On 5/22/2010 10:56 PM, Joseph M. Newcomer wrote: >>> See below... >>> On Sat, 22 May 2010 12:34:29 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >>> >> >>>>>> There are only three basic designs for lexers that I uncovered: >>>>>> (1) Case statement within case statement solutions >>>>>> (2) Multiple sequences of if else if statements >>>>>> (3) State transition matrix based solutions >>>>> **** >>>>> (4) The case-pair decoder, with the current state and character pair >>>> That is (1) above. >>> **** >>> No, it is not, it is much flatter, and easier to write. I'm sorry you can't tell the >>> difference, which is actually pretty significant! There is NO "case statement within case >>> statement" in the model I showed you, only a very flat, single switch statement. Isn't it >>> obvious by inspection? >>> joe >>> **** >> >> Although I don't remember it at all, I can imagine it from the details >> that you just provided. For small DFAs it may be faster than the method >> that I propose. For larger ones it may be slower. I see no way around it >> not taking more memory that the one that I propose. > **** > What you are capable of seeing has no relevance in this discussion. If you create the > code, and the total space consumed by the code and data is less than the space consumed by > another implmentation's code and data, as measured by inspection of the actual generated > code, then the smaller, faster implementation is *potentially* better. Of course, you Not actually. The total amount of space of the method that you proposed would be much larger because it would either have to repeat the code in my twelve ActionCodes very many times or take the overhead of function calls, which would negate the speed improvement. 
One can know in advance very good estimates of actual relative performance. > have to account for the asymmetry in performance of accessing code and data in doing this > evaluation. But you are offering an opinion, not anything factual that it is possible to > discuss. Only with running implementations that are carefully instrumented is it possible > to make claims about performance. Nothing else matters, and I'm surprised that after I have found that not to be the case. Nothing else may be 100% completely definitive but these performance estimates are accurate enough to matter very much. > months of my telling you this, you still don't understand it. Pointing to ANY two > instances of code and saying "this instance has better performance than that instance" is > a meaningless activity. MEASURING the two instances and saying "According to the > statistical performance data, code A has an average time of KA units of time with a > standard deviation of SDA units, and code B has an average time of KB units of time with a > standard deviation of SDB units" can you say anything meaningful about the execution of > the code. Anyting else is a complete waste of time. I understand it and reject it as false. > > To quote a recent ad for the TV series "The Closer": "Evidence. You know, something we > can look at and say AH!" > > Opinions are not evidence. What you are capable of seeing is irrelevant. Only actual > code and its performance matter (note that because the constants used in the comparison > are in the code stream, they are subject to code prefetch and predecode optimizations, > instruction pipeline optimizations, etc., whereas data in an external table cannot be > subject to the same optimizations). The point is, I don't KNOW the correct answer, but > I'm willing to say so; you offer an ill-founded opinion based on the scope of your limited > understanding, and claim it is inarguably correct. 
And you ignore nearly everything that > might actually change the program behavior, such as the nature of modern computer > architectures; you seem to still work in the delusional system that instructions execute > sequentially, a fact which is known to have not been true for over 20 years in the x86 > family. Then you assume that instruction execution times are constant, which is > sort-of-true if you take 1 clock cycle as the execution time, and ignore L1/L2/memory > timings, and then ignore the fact that in modern superscalars somewhere between 2 and 6 > instructions can be concurrently executed in a single clock cycle, that there is > speculative execution, branch prediction, asynchronous opportunistic instruction > execution, and so on... Yes all that stuff can potentially throw off the estimates. I don't think that it is all that difficult to see which sequences of instructions can not be correctly executed in parallel. > > Stop making your silly claims and produce some REAL data we can look at! I will produce working code very soon. For data to look at I need a large file of UTF-8 Chinese. The source code will measure and report its own performance. It will take a UTF-8 file as command line input and produce a UTF-32 file as output. It will time the conversion duration using clock(). > > You *may* be right. But we have ZERO *evidence* that you are, and all you offer as > "proof" is your limited ability to reason about performance. You write a *design* and > claim it is an *implementation*. You look at source code and pretend you understand how > the generated instructions execute, and how the memory access patterns work! > > Most experts cannot look at the generated machine code and /a priori/ predict behavior, > but you seem to claim you can do this just by looking a C/C++ source! 
Get a grip on life, > man, and realize you haven't a CLUE how long it takes a loop to execute on a REAL machine, > simply by looking at the C/C++ source (or even at the generated machine code). No I don't bother to do this. It is most often the case that when one loop has triple the assembly language instructions (accomplishing the same functional end result) it will take longer than the one with fewer. Because of all the things you cited I would tend to agree that estimating how much longer it takes would be fruitless; only measurement can tell this. I would bet that 90% of the time, in the case of compiler-generated code (not hand-tweaked assembly language), if one loop was coded to produce the same functional result as another and has triple the assembly language instructions, it will take more time than the one with fewer instructions. I am not talking about code that was specifically designed to form a counterexample to this claim. I am talking about typical code that is written. > > Note: I know one HELL of a lot more about the x86 than you do, and I know that *I* cannot > predict the behavior, even if I spend days and days "desk checking" the generated machine > code. So I'm not sure how someone with your limited understanding knows so much better > how long it takes to execute. Maybe, just maybe, one of the chip architects and Intel who > designed the chip could do this. But you don't possess that level of expertise, and > neither do I, and probably no one outside the chip design group at Intel does. > joe You couldn't predict that the OS would not swap out the pages of my data when I could and did make this accurate prediction. I knew that the OS had far more RAM than it needed, and thus had no good reason to swap out this data. > Joseph M. Newcomer [MVP] > email: newcomer(a)flounder.com > Web: http://www.flounder.com > MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 23 May 2010 12:42
See below... On Sun, 23 May 2010 08:21:29 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/23/2010 12:16 AM, Joseph M. Newcomer wrote: >> See below... >> On Sat, 22 May 2010 23:32:26 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >> >>> On 5/22/2010 10:56 PM, Joseph M. Newcomer wrote: >>>> See below... >>>> On Sat, 22 May 2010 12:34:29 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >>>> >>> >>>>>>> There are only three basic designs for lexers that I uncovered: >>>>>>> (1) Case statement within case statement solutions >>>>>>> (2) Multiple sequences of if else if statements >>>>>>> (3) State transition matrix based solutions >>>>>> **** >>>>>> (4) The case-pair decoder, with the current state and character pair >>>>> That is (1) above. >>>> **** >>>> No, it is not, it is much flatter, and easier to write. I'm sorry you can't tell the >>>> difference, which is actually pretty significant! There is NO "case statement within case >>>> statement" in the model I showed you, only a very flat, single switch statement. Isn't it >>>> obvious by inspection? >>>> joe >>>> **** >>> >>> Although I don't remember it at all, I can imagine it from the details >>> that you just provided. For small DFAs it may be faster than the method >>> that I propose. For larger ones it may be slower. I see no way around it >>> not taking more memory that the one that I propose. >> **** >> What you are capable of seeing has no relevance in this discussion. If you create the >> code, and the total space consumed by the code and data is less than the space consumed by >> another implmentation's code and data, as measured by inspection of the actual generated >> code, then the smaller, faster implementation is *potentially* better. Of course, you > >Not actually. 
The total amount of space of the method that you proposed >would be much larger because it would either have to repeat the code in >my twelve ActionCodes very many times or take the overhead of function >calls, which would negate the speed improvement. One can know in advance >very good estimates of actual relative performance. **** You keep making these silly statements, for example, you are PRESUMING that "total amount of space" actually has MEANING! What part of "if you haven't MEASURED it you know NOTHING" have you failed to understand? You CANNOT know in advance which is going to be true, for the HUGE set of reasons I already outlined! **** > >> have to account for the asymmetry in performance of accessing code and data in doing this >> evaluation. But you are offering an opinion, not anything factual that it is possible to >> discuss. Only with running implementations that are carefully instrumented is it possible >> to make claims about performance. Nothing else matters, and I'm surprised that after > >I have found that bot to be the case. Nothing else may be 100% >completely definitive but these performance estimates are accurate >enough to matter very much. **** What part of "if you haven't MEASURED it you know NOTHING" have you failed to understand? You cannot sit there and say "I know it is faster" because you don't actually understand the problem. You still seem to work in the delusional system that code executes sequentially! That presumption alone brands you as totally clueless. **** > >> months of my telling you this, you still don't understand it. Pointing to ANY two >> instances of code and saying "this instance has better performance than that instance" is >> a meaningless activity. 
MEASURING the two instances and saying "According to the >> statistical performance data, code A has an average time of KA units of time with a >> standard deviation of SDA units, and code B has an average time of KB units of time with a >> standard deviation of SDB units" can you say anything meaningful about the execution of >> the code. Anyting else is a complete waste of time. > >I understand it and reject it as false. **** Then you are even more stupid than you have led us to believe. How can you say that you reject FACTS? How can you claim you know how an opportunistic asynchronous processor with multilevel caches and pipelines actually performs? Answer: because you are clueless, you think you know an answer. I know that if you asked any expert and didn't provide actual code to measure, said expert would be incompetent if he/she said "This code (pointing to some specific implementation) is faster". Talk to people who actually have to optimize code for modern architectures, and they will tell you about pipe-busting code sequences that can slow performance down by a factor of 5 or more. Are you absolutely sure you don't have one of these? **** > >> >> To quote a recent ad for the TV series "The Closer": "Evidence. You know, something we >> can look at and say AH!" >> >> Opinions are not evidence. What you are capable of seeing is irrelevant. Only actual >> code and its performance matter (note that because the constants used in the comparison >> are in the code stream, they are subject to code prefetch and predecode optimizations, >> instruction pipeline optimizations, etc., whereas data in an external table cannot be >> subject to the same optimizations). The point is, I don't KNOW the correct answer, but >> I'm willing to say so; you offer an ill-founded opinion based on the scope of your limited >> understanding, and claim it is inarguably correct. 
And you ignore nearly everything that >> might actually change the program behavior, such as the nature of modern computer >> architectures; you seem to still work in the delusional system that instructions execute >> sequentially, a fact which is known to have not been true for over 20 years in the x86 >> family. Then you assume that instruction execution times are constant, which is >> sort-of-true if you take 1 clock cycle as the execution time, and ignore L1/L2/memory >> timings, and then ignore the fact that in modern superscalars somewhere between 2 and 6 >> instructions can be concurrently executed in a single clock cycle, that there is >> speculative execution, branch prediction, asynchronous opportunistic instruction >> execution, and so on... > >Yes all that stuff can potentially throw off the estimates. I don't >think that it is all that difficult to see which sequences of >instructions can not be correctly executed in parallel. **** Gee, for someone who knows as little about architectures as you do, you have a massive amount of overconfidence. I've talked to modern compiler writers, who have worked REALLY hard to figure out how to optimize performance, and I'm reasonably certain you do not have their understanding of what is going on. Yet you know the outcome already! **** > >> >> Stop making your silly claims and produce some REAL data we can look at! > >I will produce working code very soon. For data to look at I need a >large file of UTF-8 Chinese. The source code will measure and report >its own performance. It will take a UTF-8 file as command line input and >produce a UTF-32 file as output. It will time the conversion duration >using clock(). **** You really are more stupid than I could imagine. You don't even understand the rudiments of performance analysis or instrumentation. clock() is for children who have never understood that it is about the worst possible way of measuring performance. 
Why do you think a resolution of 15ms is "good enough" for serious performance measurement? Unless you are measuring values which are hundreds of times larger than the clock resolution, you cannot trust the metrics. Of course, knowing the standard deviation is critical. Do you know how to maintain both mean and s.d. with only THREE variables? If not, you are probably unqualified to even start doing performance measurement. Nobody who is serious about performance measurement would ever trust clock() to give a meaningful performance metric. But then, you wouldn't know that. **** > >> >> You *may* be right. But we have ZERO *evidence* that you are, and all you offer as >> "proof" is your limited ability to reason about performance. You write a *design* and >> claim it is an *implementation*. You look at source code and pretend you understand how >> the generated instructions execute, and how the memory access patterns work! >> >> Most experts cannot look at the generated machine code and /a priori/ predict behavior, >> but you seem to claim you can do this just by looking a C/C++ source! Get a grip on life, >> man, and realize you haven't a CLUE how long it takes a loop to execute on a REAL machine, >> simply by looking at the C/C++ source (or even at the generated machine code). > >No I don't bother to do this. It is most often the case where one loop >has triple the assembly language instructions (accomplishing the same >functional end result) that it would take longer than the one with >fewer. Because of all the things you cited I would tend to agree that >estimating how much longer that it takes would be fruitless, only >measurement can tell this. **** Oh, you are SO naive it is almost cute. Sort of like a cute little puppy who keeps messing the carpet. Because you work in the delusional system that instructions execute sequentially and all instructions take the same amount of time, you think you can predict performance by counting instructions. 
This is just silly in modern architectures, and if you talk to people who write performance code for a living, both on x86s and supercomputers, you would have found out years ago (in my case, decades ago) that code size and code performance rarely correlate on high-performance architectures. Only on the 8088 and PDP-11 did these have any correlation. Larger code is often faster because it encourages speculative execution, prefetch, pipeline utilization and cache hit improvement. Smaller code is often slower. I have decades of experience that have proven this over and over. But you Just Don't Get It. So you go around making nonsense statements like "smaller code is faster" as if these are absolutely true under all possible conditions, and it is known that this is considered true only under certain optimal conditions. Maybe because I saw my first counterexample to this back around 1970 (that's forty years ago), I tend to distrust instruction-counting and code size as valid predictors of performance (back then, this applied only to machines in the > $10 million price range; now we have those same architectures on the x86 chip). Until you have HIGH-RESOLUTION measurements with STATISTICALLY SIGNIFICANT confidence intervals, you are just making empty noises and should be ignored. Every once in a while I realize, in spite of my OCD, that you are uneducable. *** > >I would bet the 90% of the time in the case of compiler generated code >(not hand tweaked assembly language) if one loop that was coded to >produce that same functional result as another and has triple the >assembly language instructions that it will more time than the one with >fewer instructions. I am not talking code that was specifically designed >to form a counter example to this claim. I am talking about typical code >that is written. **** See above. 
Such a statement says "on the average, for general algorithms, ones which are insensitive to hardware acceleration", whereas a DFA is a specialized algorithm and one which is likely to be readily optimized by catering to the hardware. So the statement has no meaning with respect to your problem. And I've seen compilers that generate larger code (it's called "loop unrolling", an optimization known for a half-century) that is nevertheless faster. joe **** > >> >> Note: I know one HELL of a lot more about the x86 than you do, and I know that *I* cannot >> predict the behavior, even if I spend days and days "desk checking" the generated machine >> code. So I'm not sure how someone with your limited understanding knows so much better >> how long it takes to execute. Maybe, just maybe, one of the chip architects and Intel who >> designed the chip could do this. But you don't possess that level of expertise, and >> neither do I, and probably no one outside the chip design group at Intel does. >> joe > >You couldn't predict that the OS would not swap out the pages of my data >when I could and did make this accurate prediction. I knew that the OS >had far more RAM than it needed, and thus had no good reason to swap out >this data. **** And I knew that some versions of Windows would swap out pages whether they needed to be swapped out or not. You did not use one of those versions, because Windows has obviously been fixed to not do that. The point is, you made the original statement without any evidence; I had years of evidence from Windows 4.0 and Windows 2000 that contradicted your statement. When you provided evidence, I acknowledged that Windows no longer swapped out pages gratuitously. But until we had evidence, I could only base my predictions on past performance. And I'm saying here that I do not KNOW the correct answer, but I know that YOU do not know it, either. joe **** > >> Joseph M. 
Newcomer [MVP] >> email: newcomer(a)flounder.com >> Web: http://www.flounder.com >> MVP Tips: http://www.flounder.com/mvp_tips.htm Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm