From: Donal K. Fellows on 23 Feb 2010 11:33

On 23 Feb, 16:19, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> So the first thing needed is a regexp which can distinguish between
> correct and incorrect headers and parse correct headers into tokens.

IIRC, you're supposed to be able to clearly identify what is and what
isn't a header and to also be able to basic-tokenize the sequence of
headers into individual headers. (The key is that I think newlines
that don't terminate a header have to be followed by a space. IIRC
anyway.) If I'm wrong with that, then the HTTP spec is crappy because
you can bet that it's not just Tcl code that would find it easier to
do the parsing in the way that I describe.

> I'm also confused as to how I accumulate input without causing a
> potential blocking or other problems.

IIRC, there was [chan pending] added to 8.5 to allow you to handle
that sort of thing. Not an area I've experimented much in.

Donal.
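
The rule Donal recalls (a line break that continues a header is followed by whitespace) makes the basic split fairly mechanical. What follows is only a minimal sketch, assuming the whole header block has already been accumulated in a string. The name splitHeaders is hypothetical, and line breaks inside quoted strings are deliberately not handled:

```tcl
# Split a raw header block into individual header lines.
# Assumption: a CRLF followed by a space or tab continues the previous
# header; every other CRLF terminates one.
proc splitHeaders {blob} {
    # Drop the terminating blank line (CRLF CRLF) and anything after it.
    set end [string first "\r\n\r\n" $blob]
    if {$end >= 0} {
        set blob [string range $blob 0 [expr {$end - 1}]]
    }
    # Unfold continuation lines into a single space.
    regsub -all {\r\n[ \t]+} $blob " " blob
    # Each remaining CRLF now separates two headers.
    set headers {}
    foreach line [split [string map [list "\r\n" "\n"] $blob] "\n"] {
        if {$line ne ""} {
            lappend headers $line
        }
    }
    return $headers
}
```

This only separates headers from one another. It does not tokenize or validate the field values, which is the harder problem discussed in the rest of the thread.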
From: Alexandre Ferrieux on 23 Feb 2010 12:07

On Feb 23, 5:33 pm, "Donal K. Fellows" <donal.k.fell...(a)manchester.ac.uk> wrote:
> > I'm also confused as to how I accumulate input without causing a
> > potential blocking or other problems.
>
> IIRC, there was [chan pending] added to 8.5 to allow you to handle
> that sort of thing. Not an area I've experimented much in.

Not sure [chan pending] pushes the envelope of what was already
possible with non-blocking [gets] ;-)

-Alex
From: tom.rmadilo on 23 Feb 2010 12:28

On Feb 23, 8:32 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Feb 23, 5:19 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > On Feb 23, 1:44 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
> > wrote:
> > > On Feb 23, 1:46 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > > > > Tom, is htclient using the regexp techniques we explored together last
> > > > > fall, or is it parsing headers by hand ?
> > > >
> > > > Thanks for asking. Unfortunately I was unable to write a regexp which
> > > > actually parsed a generic header. I can only say that I gave it a good
> > > > try. Either I have a serious deficiency in this area (highly likely)
> > > > or it can't be done. I guess I should ask for help.
> > > > But I never advanced past the most important issues (IMO): finding the
> > > > end of the current header while avoiding blocking the application.
> > > > Even if I found the whole header, I would still be stuck with the job
> > > > of parsing the header into tokens.
> > >
> > > Ah, but for this you can take a two-step approach: first identify the
> > > end of headers (CRLFCRLF) and then parse the accumulated blob. Of
> > > course you'll get false alarms when CRLFCRLF's are embedded in quoted
> > > strings, but our beloved regexp will detect quote imbalance. Moreover,
> > > it is easy to distinguish this situation from a more serious syntax
> > > error by re-checking with an additional single quote:
> > >
> > > [regexp $BIGREGEXP $blob] -> 0
> > > [regexp $BIGREGEXP $blob\"] -> 1
> > >
> > > detects a case of pure quote imbalance. In that case, continue
> > > appending to the blob, up to the next CRLFCRLF. Iterate.
> > >
> > > > Unless and until I or someone else can produce a regular expression
> > > > which can correctly parse all headers into tokens, I'm stuck with the
> > > > current char-by-char code.
> > >
> > > I'm interested in helping you pursue the regexp approach. I really
> > > think it holds the key to the fastest and most elegant solution.
> > >
> > > > All I can say is that the current method seems to offer lots of
> > > > flexibility and is relatively easy to understand. The regexp code
> > > > seems to require lots of testing, and so far the testing eventually
> > > > finds cases which are not handled correctly.
> > >
> > > I admit a regexp might be tricky to read for the unaccustomed eye, but
> > > don't forget that the regexp compiler will catch additional
> > > inconsistencies (parenthesis imbalance) that a hand-written automaton
> > > will happily get away with... So all in all, I still believe regexps
> > > are superiorly maintainable.
> >
> > I have put down my regexp testing code here:
> >
> > http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=regex...
> >
> > What I found so far is that the regexp does work, but that it is (so
> > far) easily fooled by two mistakes which balance each other out. It
> > also happily ignores what doesn't match and produces what does match.
> > Anyway, the basic problem is that my regexps have parsed correct
> > headers correctly, but also parse incorrect headers.
> >
> > I'm also confused as to how I accumulate input without causing a
> > potential blocking or other problems.
> >
> > So the first thing needed is a regexp which can distinguish between
> > correct and incorrect headers and parse correct headers into tokens.
>
> I suspect some kind of escaping hell in those dynamically built
> regexps...
> Why don't you start back from the constant, braced, commented regexp
> we've built together ?

I've never seen this type of parsing issue solved with a single
regexp. I've tried taking your advice, but apparently I don't
understand it enough to produce the correct results.

As far as speed is concerned, it seems obvious to me that only
visiting each char once is the fastest possible algorithm. The only
problem is that the visiting is done in Tcl code and not in C code.
If this is the real issue (speed of Tcl), maybe a rewrite into C
would be the best solution?
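
The two-step check quoted above (match the blob, then match it again with one extra double quote appended) can be sketched roughly as follows. The real $BIGREGEXP is not shown in this thread, so HEADER_BLOB_RE below is a much simpler, hypothetical stand-in that only knows about field names, unquoted characters and quoted strings. The proc name classifyHeaderBlob is also an assumption made for illustration:

```tcl
# Hypothetical stand-in for $BIGREGEXP: constant, braced and commented,
# but nowhere near a full HTTP header grammar.
set HEADER_BLOB_RE {(?x)
    ^
    (?:
        [^\s:"]+ :                    # field name, then a colon
        (?: [^"\r\n]                  # an unquoted character, or
          | "(?: [^"\\] | \\. )*"     # a complete quoted string
        )*
        (?: \r\n )*                   # line terminator(s)
    )+
    $
}

# Classify a blob that was accumulated up to a CRLFCRLF.
#   complete   - the blob matches as it stands
#   unbalanced - it matches only with one more double quote appended,
#                so the CRLFCRLF was inside a quoted string
#   invalid    - it matches neither way
proc classifyHeaderBlob {blob} {
    global HEADER_BLOB_RE
    if {[regexp -- $HEADER_BLOB_RE $blob]} {
        return complete
    }
    if {[regexp -- $HEADER_BLOB_RE "$blob\""]} {
        return unbalanced
    }
    return invalid
}
```

On an unbalanced result the caller would keep appending input up to the next CRLFCRLF and classify again, as described in the quoted text. The stand-in pattern accepts many things a strict grammar would reject, so it illustrates the control flow only, not header validation.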
From: tom.rmadilo on 23 Feb 2010 12:47

On Feb 23, 8:33 am, "Donal K. Fellows" <donal.k.fell...(a)manchester.ac.uk> wrote:
> On 23 Feb, 16:19, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > So the first thing needed is a regexp which can distinguish between
> > correct and incorrect headers and parse correct headers into tokens.
>
> IIRC, you're supposed to be able to clearly identify what is and what
> isn't a header and to also be able to basic-tokenize the sequence of
> headers into individual headers. (The key is that I think newlines
> that don't terminate a header have to be followed by a space. IIRC
> anyway.) If I'm wrong with that, then the HTTP spec is crappy because
> you can bet that it's not just Tcl code that would find it easier to
> do the parsing in the way that I describe.

Right, my htclient code tokenizes the header field values as well as
headers. If you can't tokenize the field value, you can't really
distinguish between valid and invalid headers. Maybe a regexp exists
which can do this; right now I use an FSM to do the job. It works and
is guaranteed not to block.

> > I'm also confused as to how I accumulate input without causing a
> > potential blocking or other problems.
>
> IIRC, there was [chan pending] added to 8.5 to allow you to handle
> that sort of thing. Not an area I've experimented much in.

This is exactly what I do to avoid blocking: I read one char at a time
for headers or [chan pending] chars for the body. If we had [chan
unputs], I could eliminate about half my FSM code (needed to handle
<CR><LF>).
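
To make the char-at-a-time idea concrete, here is a small sketch of a state machine that watches for the <CR><LF><CR><LF> ending the headers. It is not the htclient code, which is not shown in this thread. The names eohStep and findEndOfHeaders and the state names are hypothetical, and it ignores quoted strings and header tokenizing entirely:

```tcl
# Advance the end-of-headers state machine by one character.
# States: text, cr, crlf, crlfcr, done.
proc eohStep {state char} {
    if {$char eq "\r"} {
        # A CR right after a complete CRLF may start the blank line.
        return [expr {$state eq "crlf" ? "crlfcr" : "cr"}]
    }
    if {$char eq "\n"} {
        switch -- $state {
            cr     { return crlf }
            crlfcr { return done }
        }
    }
    # Any other character, or an LF that does not follow a CR.
    return text
}

# Feed a string through the machine one character at a time.
# Returns the number of characters consumed up to and including the
# terminating CRLFCRLF, or -1 if the terminator has not been seen.
proc findEndOfHeaders {data} {
    set state text
    set count 0
    foreach char [split $data ""] {
        set state [eohStep $state $char]
        incr count
        if {$state eq "done"} {
            return $count
        }
    }
    return -1
}
```

In a readable-event handler each character would come from something like [chan read $chan 1] instead of a string. Because each character is visited once and nothing has to be pushed back onto the channel, the state is all that needs to survive between events.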
From: tom.rmadilo on 23 Feb 2010 13:08

On Feb 23, 9:07 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Feb 23, 5:33 pm, "Donal K. Fellows"
> <donal.k.fell...(a)manchester.ac.uk> wrote:
>
> > > I'm also confused as to how I accumulate input without causing a
> > > potential blocking or other problems.
> >
> > IIRC, there was [chan pending] added to 8.5 to allow you to handle
> > that sort of thing. Not an area I've experimented much in.
>
> Not sure [chan pending] pushes the envelope of what was already
> possible with non-blocking [gets] ;-)

I've not experimented with non-blocking [gets], but I understand that
you have to always check for failure and handle such situations. You
also [gets] chars, not octets. This doesn't match up with what signals
a readable event.

One thing which confuses me is that [chan pending] will sometimes
return zero after a readable event (50% of the time if the previous
read drained the buffer). If you don't read at least one char, the
buffer never fills up.
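
For comparison, the failure check that non-blocking [gets] requires is fairly small. The following is only a sketch: the names tryGetLine and onHeaderReadable and the globals headerLines and headersDone are hypothetical, and it assumes the channel was configured with -blocking 0, -translation crlf and -encoding iso8859-1 so that one char corresponds to one octet:

```tcl
# Try to take one complete line from a non-blocking channel.
# Returns a two-element list: a status (line, again or eof) and the text.
proc tryGetLine {chan} {
    if {[chan gets $chan line] >= 0} {
        return [list line $line]
    }
    if {[chan eof $chan]} {
        return [list eof ""]
    }
    # Neither a full line nor EOF: a partial line is buffered
    # ([chan blocked] reports 1 here), so wait for the next event.
    return [list again ""]
}

# Readable-event handler that accumulates header lines in a global list
# and flags the end of the headers at the first empty line or at EOF.
proc onHeaderReadable {chan} {
    global headerLines headersDone
    lassign [tryGetLine $chan] status line
    switch -- $status {
        line {
            if {$line eq ""} {
                set headersDone 1
            } else {
                lappend headerLines $line
            }
        }
        eof   { set headersDone 1 }
        again {}
    }
}
```

The handler would be registered with something like [chan event $sock readable [list onHeaderReadable $sock]]. Whether this is actually preferable to the [chan pending] approach is the open question in this thread; the sketch only shows what the failure handling looks like.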