From: Alexandre Ferrieux on 22 Feb 2010 18:52

On Feb 22, 11:26 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Also, htclient is significantly faster/more efficient than geturl. I
> haven't figured out exactly why yet, but I think geturl spends too much
> time in string manipulations. Basically I would say htclient is good
> for developers, not so good for the casual user.

Tom, is htclient using the regexp techniques we explored together last fall, or is it parsing headers by hand?

-Alex
From: tom.rmadilo on 22 Feb 2010 19:46

On Feb 22, 3:52 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> Tom, is htclient using the regexp techniques we explored together last
> fall, or is it parsing headers by hand ?

Alex,

Thanks for asking. Unfortunately I was unable to write a regexp which actually parsed a generic header. I can only say that I gave it a good try. Either I have a serious deficiency in this area (highly likely) or it can't be done. I guess I should ask for help.

But I never advanced past the most important issue (IMO): finding the end of the current header while avoiding blocking the application. Even if I found the whole header, I would still be stuck with the job of parsing the header into tokens.

Unless and until I or someone else can produce a regular expression which can correctly parse all headers into tokens, I'm stuck with the current char-by-char code.

All I can say is that the current method seems to offer lots of flexibility and is relatively easy to understand. The regexp code seems to require lots of testing, and so far the testing eventually finds cases which are not handled correctly.

Of course I'm still stuck with the fact that htclient is faster than geturl and does not block. At some point I have to wonder if there is really anything wrong with my approach. (If unput were available at the Tcl level, about half my code would be unnecessary.)
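
For readers following along, the "find the end of the header without blocking" problem can be sketched roughly as below. This is not htclient's code. It assumes the header block ends at the first CRLFCRLF, and the proc name `feed` and its variable names are made up for illustration. The event-driven caller is shown only as comments because it needs a live socket.

```tcl
# Hypothetical helper: append one chunk to the buffer named by bufVar.
# Returns 1 once CRLFCRLF has been seen (storing the header block and
# any leftover body bytes in headVar/restVar), otherwise returns 0.
proc feed {bufVar chunk headVar restVar} {
    upvar 1 $bufVar buf $headVar head $restVar rest
    append buf $chunk
    set end [string first "\r\n\r\n" $buf]
    if {$end < 0} {
        return 0
    }
    # Keep the CRLF that terminates the last header line.
    set head [string range $buf 0 [expr {$end + 1}]]
    # Everything after the blank line belongs to the body.
    set rest [string range $buf [expr {$end + 4}] end]
    return 1
}

# Sketch of a non-blocking caller (assumes $sock is an open socket):
#   fconfigure $sock -blocking 0 -translation binary
#   fileevent $sock readable [list onReadable $sock]
#   proc onReadable {sock} {
#       global buf head rest
#       if {[feed buf [read $sock] head rest]} {
#           fileevent $sock readable {}
#       }
#   }
```

Because `read` on a non-blocking channel returns whatever is available, the application never waits on the socket. The cost is that the buffer is rescanned on each chunk, which is usually acceptable for header-sized input.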
From: Alexandre Ferrieux on 23 Feb 2010 04:44

On Feb 23, 1:46 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Thanks for asking. Unfortunately I was unable to write a regexp which
> actually parsed a generic header. I can only say that I gave it a good
> try. Either I have a serious deficiency in this area (highly likely)
> or it can't be done. I guess I should ask for help.
> But I never advanced past the most important issues (IMO): finding the
> end of the current header while avoiding blocking the application.
> Even if I found the whole header, I would still be stuck with the job
> of parsing the header into tokens.

Ah, but for this you can take a two-step approach: first identify the end of headers (CRLFCRLF) and then parse the accumulated blob. Of course you'll get false alarms when CRLFCRLFs are embedded in quoted strings, but our beloved regexp will detect quote imbalance. Moreover, it is easy to distinguish this situation from a more serious syntax error by re-checking with one additional quote character appended:

    [regexp $BIGREGEXP $blob]   -> 0
    [regexp $BIGREGEXP $blob\"] -> 1

detects a case of pure quote imbalance. In that case, continue appending to the blob, up to the next CRLFCRLF. Iterate.

> Unless and until I or someone else can produce a regular expression
> which can correctly parse all headers into tokens, I'm stuck with the
> current char-by-char code.

I'm interested in helping you pursue the regexp approach. I really think it holds the key to the fastest and most elegant solution.

> All I can say is that the current method seems to offer lots of
> flexibility and is relatively easy to understand. The regexp code
> seems to require lots of testing, and so far the testing eventually
> finds cases which are not handled correctly.

I admit a regexp might be tricky to read for the unaccustomed eye, but don't forget that the regexp compiler will catch additional inconsistencies (parenthesis imbalance) that a hand-written automaton will happily get away with... So all in all, I still believe regexps are more maintainable.

-Alex
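
The two-step approach described in this post could look roughly like the sketch below. The `HEADERS` pattern is only a small stand-in for `$BIGREGEXP`, which is not shown in the thread: it accepts `name: value` lines whose values may contain quoted strings, and it tolerates a blob that ends right after a quoted string so that the "append one quote" re-check can succeed. The proc name `splitHeaders` and its return format are assumptions for illustration.

```tcl
# Stand-in for $BIGREGEXP (an assumption, far simpler than real HTTP grammar).
set HEADERS {(?x)
    ^
    (?:
        [^\s:"]+ :                    # field name, then a colon
        (?: [^"\r\n] | "[^"]*" )*     # value: plain chars or quoted strings
        (?: \r\n | $ )                # line ends with CRLF or at end of blob
    )+
    (?: \r\n )?                       # optional blank line closing the block
    $
}

# Returns {ok headers rest}, {error blob {}} or {more {} {}}.
proc splitHeaders {data} {
    global HEADERS
    set from 0
    while {[set end [string first "\r\n\r\n" $data $from]] >= 0} {
        set blob [string range $data 0 [expr {$end + 3}]]
        if {[regexp $HEADERS $blob]} {
            return [list ok $blob [string range $data [expr {$end + 4}] end]]
        }
        if {![regexp $HEADERS "$blob\""]} {
            # Still invalid with a closing quote added: a real syntax error.
            return [list error $blob {}]
        }
        # Pure quote imbalance: this CRLFCRLF sat inside a quoted string,
        # so keep looking for the next one.
        set from [expr {$end + 1}]
    }
    return [list more {} {}]
}
```

A caller would accumulate input, call `splitHeaders` on the buffer, and read more data whenever the first element of the result is `more`.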
From: tom.rmadilo on 23 Feb 2010 11:19

On Feb 23, 1:44 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> I admit a regexp might be tricky to read for the unaccustomed eye, but
> don't forget that the regexp compiler will catch additional
> inconsistencies (parenthesis imbalance) that a hand-written automaton
> will happily get away with... So all in all, I still believe regexps
> are superiorly maintainable.

I have put my regexp testing code here:

http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=regexp/token.tcl

What I found so far is that the regexp does work, but that it is (so far) easily fooled by two mistakes which balance each other out. It also happily ignores what doesn't match and produces what does match. Anyway, the basic problem is that my regexps have parsed correct headers correctly, but also parse incorrect headers.

I'm also confused as to how to accumulate input without potentially blocking or causing other problems.

So the first thing needed is a regexp which can distinguish between correct and incorrect headers and parse correct headers into tokens.
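
One common reason a regexp "ignores what doesn't match and produces what does match" is that it is not anchored, so the engine is free to start the match partway through the input. The sketch below illustrates that effect on a single header line (without its CRLF). The field-name character class is a deliberate simplification, and the names `LOOSE`, `STRICT` and `parseLine` are hypothetical; this is not the code from token.tcl.

```tcl
# Same pattern twice: once unanchored, once anchored at both ends.
set LOOSE  {([A-Za-z0-9-]+):[ \t]*([^\r\n]*)}
set STRICT {^([A-Za-z0-9-]+):[ \t]*([^\r\n]*)$}

# Returns {name value} on a match, or an empty list otherwise.
proc parseLine {re line} {
    if {[regexp $re $line -> name value]} {
        return [list $name $value]
    }
    return {}
}

set bad "Content Type: text/html"   ;# space inside the field name

# Unanchored: matches from "Type" onward and silently skips "Content ".
puts [parseLine $LOOSE $bad]

# Anchored: the whole line must fit the pattern, so this is rejected.
puts [parseLine $STRICT $bad]

# A well-formed line is accepted by the anchored pattern.
puts [parseLine $STRICT "Content-Type: text/html"]
```

Anchoring does not address the other issue mentioned above (two mistakes that cancel each other out), which depends on the grammar encoded in the pattern itself.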
From: Alexandre Ferrieux on 23 Feb 2010 11:32
On Feb 23, 5:19 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> I have put down my regexp testing code here:
>
> http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=regex...
>
> What I found so far is that the regexp does work, but that it is (so
> far) easily fooled by two mistakes which balance each other out. It
> also happily ignores what doesn't match and produces what does match.
> Anyway, the basic problem is that my regexps have parsed correct
> headers correctly, but also parse incorrect headers.
>
> I'm also confused as to how I accumulate input without causing a
> potential blocking or other problems.
>
> So the first thing needed is a regexp which can distinguish between
> correct and incorrect headers and parse correct headers into tokens.

I suspect some kind of escaping hell in those dynamically built regexps... Why don't you start again from the constant, braced, commented regexp we built together?

-Alex
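
To illustrate what a "constant, braced, commented" regexp looks like in Tcl, here is a small sketch using the expanded syntax `(?x)`. It is not the regexp the two posters built together, which does not appear in this thread. The pattern matches a single `name=value` parameter whose value is either a bare token or a quoted string, and the names `PARAM` and `parseParam` are hypothetical.

```tcl
# Braces keep Tcl from touching backslashes, brackets and dollar signs,
# so the pattern reads exactly as the regexp engine will see it.
set PARAM {(?x)
    ^
    ([A-Za-z0-9-]+)            # parameter name (simplified token)
    =
    (?:
        ([A-Za-z0-9.-]+)       # either a bare token value
      |
        "((?:[^"\\]|\\.)*)"    # or a quoted string with backslash escapes
    )
    $
}

# Returns {name value}; raises an error on malformed input.
proc parseParam {s} {
    global PARAM
    if {![regexp $PARAM $s -> name bare quoted]} {
        error "malformed parameter: $s"
    }
    # Only one of the two value groups takes part in a match.
    if {$bare ne ""} {
        return [list $name $bare]
    }
    return [list $name $quoted]
}

puts [parseParam {charset="utf-8"}]
puts [parseParam charset=utf-8]
```

Building the same pattern inside double quotes would require doubling every backslash and escaping the brackets and dollar sign, which is the kind of escaping trouble the braced form avoids.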