From: Alexandre Ferrieux on 5 Nov 2009 02:50

On Nov 5, 2:47 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Anyway, I think you have somewhat valid complaints about this
> implementation, but I think they assume a perfect world where
> everybody is reasonable. I have a personality complex which doesn't
> allow me to trust myself, much less a foreign server.

Ah-ha. Discussion switching back to productive mode. Thanks.
Look again: this optimistic assumption is also prevalent in the
current http package.
I'd risk that many other widely used libraries across the industry
strike a similar balance between complexity and security. How many
unices have the kerberos extensions ? How many Solarisses have the C2
package ?

> I also have a previously declared bias against regular expressions. I

Be reassured, that was clear ;-)

> consider them brittle code which require much documentation, I call
> them miniature programs. If the regular expression must extend beyond
> a single token, the complexity increases dramatically.

Ah ? Is the equivalent [read 1;switch] automaton readable without
comments ?
Yes they are miniature programs. But the -expanded flag of [regexp]
does a lot to make them just as easy (or hard) to read as an
equivalent lower-level program, with two additional properties:

(1) They are guaranteed to stay within the bounds of regular
grammars. In this context it is a bonus, because it limits the
"spaghetti power" of the code as compared to an unconstrained Turing
complete programming language with counters, stack, etc.

(2) They get compiled to a very efficient form (a) by
determinization and (b) because the final parsing is done in C.

So yes, they are mini-programs with a universally known syntax, very
simple semantics, and a turbo compiler.
Then it's a matter of personal taste ;-)

> But I've never seen a regular expression which could handle escape
> sequences, since that requires counting, so I don't see how you cover
> quoted strings even as a stand-alone token. If we can get by that
> problem, maybe there is a regular expression solution.

Ha ! So you're having trouble understanding the regexp I sent ?
The re_syntax is written in English, which you're proficient at, and
is IMO very clear. So what ?
Maybe you should start by getting up to pace with regexps before
touting about their limitations...

One thing that might help you on that route: the assumption "escape
sequences require counting" is wrong. Indeed, they only require
counting _modulo 2_, which makes a (quoting) hell of a difference,
since it means a two-state finite automaton, as opposed to an
unbounded integer counter which is outside the regular realm.

Look closely at my regexp, and you'll find this tiny two-state
sub-automaton: \\. (a backslash followed by any one character). It
essentially handles the difference between an odd and an even number
of backslashes before a quote.

Now you mentioned nesting constructs (comments), which do need
context-free. However, I don't see them in RFC2616. Can you show me ?

-Alex
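
Editorial sketch, not code from the thread: the "counting modulo 2" point can be tried in isolation. The proc name is_quoted_string is hypothetical; the pattern reuses the quoted-string fragment of the regexp under discussion, anchored at both ends, and assumes that a backslash escapes whatever single character follows it.

```tcl
# Hypothetical helper (an assumption for illustration): returns 1 if $s is
# one complete double-quoted string, 0 otherwise.
proc is_quoted_string {s} {
    # [^"\\] consumes an ordinary character;
    # \\.    consumes a backslash together with the character it escapes.
    # Backslashes are therefore always eaten in pairs with their successor,
    # which is all the "parity" bookkeeping that is needed.
    return [regexp {^"(?:[^"\\]|\\.)*"$} $s]
}
```

With an even run of backslashes before the last quote, that quote closes the string; with an odd run, the last backslash swallows the quote and the match fails because no closing quote remains.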
From: tom.rmadilo on 5 Nov 2009 03:49

On Nov 4, 11:50 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Nov 5, 2:47 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > Anyway, I think you have somewhat valid complaints about this
> > implementation, but I think they assume a perfect world where
> > everybody is reasonable. I have a personality complex which doesn't
> > allow me to trust myself, much less a foreign server.
>
> Ah-ha. Discussion switching back to productive mode. Thanks.
> Look again: this optimistic assumption is also prevalent in the
> current http package.
> I'd risk that many other widely used libraries across the industry
> strike a similar balance between complexity and security. How many
> unices have the kerberos extensions ? How many Solarisses have the C2
> package ?

I don't know, but besides security, the headers are currently divided
into tokens/words with whitespace removed. In theory this should make
it easier to interpret the field values. I know it will simplify
working with the date header, as a quick example.

I'm not yet convinced it will be easy to deal with the complexity of
partial headers if there are illegal <CR> or <LF> chars. Basically you
read in an unknown chunk of data with [gets] and hope you can detect
errors.

OTOH, you seem convinced it'll work. It seems easy enough to
substitute in your ideas, so I'll give it a try.

> > But I've never seen a regular expression which could handle escape
> > sequences, since that requires counting, so I don't see how you cover
> > quoted strings even as a stand-alone token. If we can get by that
> > problem, maybe there is a regular expression solution.
>
> Ha ! So you're having trouble understanding the regexp I sent ?
> The re_syntax is written in English, which you're proficient at, and
> is IMO very clear. So what ?
> Maybe you should start by getting up to pace with regexps before
> touting about their limitations...

Did I complain about their limitations? I just think they take time to
develop and test; sometimes it is well worth the effort, sometimes it
isn't. I also agree that you should try to document them, which it
looks like the -expanded flag enables inline.

> One thing that might help you on that route: the assumption "escape
> sequences require counting" is wrong. Indeed, they only require
> counting _modulo 2_, which makes a (quoting) hell of a difference,
> since it means a two-state finite automaton, as opposed to an
> unbounded integer counter which is outside the regular realm.
>
> Look closely at my regexp, and you'll find this tiny two-state
> sub-automaton: \\. . It essentially handles the difference between an
> odd and an even number of backslashes before a quote.

Cool, really cool.

> Now you mentioned nesting constructs (comments), which do need
> context-free. However, I don't see them in RFC2616. Can you show me ?

From the rfc:

Fielding, et al.         Expires April 29, 2010         [Page 20]
Internet-Draft           HTTP/1.1, Part 1           October 2009
....
   Comments can be included in some HTTP header fields by surrounding
   the comment text with parentheses. Comments are only allowed in
   fields containing "comment" as part of their field value definition.

     comment        = "(" *( ctext / quoted-cpair / comment ) ")"
     ctext          = OWS / %x21-27 / %x2A-5B / %x5D-7E / obs-text
                    ; OWS / <VCHAR except "(", ")", and "\"> / obs-text

   The backslash character ("\") can be used as a single-character
   quoting mechanism within comment constructs:

     quoted-cpair   = "\" ( WSP / VCHAR / obs-text )

   Producers SHOULD NOT escape characters that do not require escaping
   (i.e., other than the backslash character "\" and the parentheses
   "(" and ")").

I haven't looked at which header fields allow comments, this seems
like something to avoid like the plague, if possible.
From: Donal K. Fellows on 5 Nov 2009 05:03

On 5 Nov, 08:49, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> I haven't looked at which header fields allow comments, this seems
> like something to avoid like the plague, if possible.

Looks like it's permitted in the Server, User-Agent, and Via fields.
Since comments can't include anything that looks like the start of
another header or the end of the headers (check the definition of
TEXT, which excludes all CTLs except for LWS, and that is used as a
line continuation which is unambiguously different) all the confusion
is confined to those specific headers, but they're all ones that never
need to be parsed; they're all just informational.

If you want to parse out the comments, do so. But they're not
important.

Donal.
From: tom.rmadilo on 5 Nov 2009 20:05

On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > Why not just produce the easy to read and understand regexp?
>
> Sure:
>
> proc match_chunk_header_line s {
>     if {![regexp -expanded {
>         ^
>         ([0-9a-fA-F]+)            # hex length
>         \s*
>         (?:                       # begin optional chunkext
>             ;\s*
>             [^\s=]+               # name is a token
>             =
>             (?:                   # begin alternative
>                 [^"\s;]+          # simple token
>             |                     # or
>                 "                 # opening quote
>                 ([^"\\]|\\.)*     # normal chars or escape sequences
>                 "                 # closing quote
>             )                     # end alternative
>             \s*
>         )*                        # end and iterate chunkext
>         \r                        # ah don't forget CR
>         $
>     } $s -> len]} {
>         error "Syntax error in chunk header"
>     }
>     scan $len %x len
>     return $len
> }

Alex,

So I'm looking into this more, but I've run into a problem. The above
regexp essentially validates the chunk-size, chunk-ext "header". Not
sure if it is really a header; fortunately it is a little easier (no
line folding allowed). It places the hexchars into "len" and discards
the remaining junk. This is okay in this situation, since we must
ignore stuff we don't understand. Being dumb is something of a
benefit.

So my question is with regular headers. Currently I find the field
name, which is a token, and divide the field value into tokens, words
and punctuation which can't be in a token, but doesn't need to be in a
word (a quoted string). For instance, a colon, equal sign or comma are
in this punctuation group.

How do you write a regexp which can divide a header into these
different parts: tokens, words, bare punctuation? I'm guessing I must
use [regexp -all -inline]
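
Editorial sketch, not an answer taken from the thread: one possible shape for the [regexp -all -inline] approach asked about above. The proc name split_field_value and the particular set of separator characters are assumptions chosen for illustration, not the full separator list of the HTTP grammar. Every group is non-capturing, so the result contains only the matched pieces, in order.

```tcl
# Hypothetical helper (an assumption for illustration): split a header
# field value into quoted strings, tokens, and single punctuation chars.
proc split_field_value {value} {
    set re {
        "(?:[^"\\]|\\.)*"       # a quoted string, escape pairs included
      | [^\s",;:=()<>@/?]+      # a token: a run of non-separator characters
      | \S                      # any other single non-blank character
    }
    # -expanded lets the pattern carry its own comments;
    # -all -inline returns every match as a list element.
    return [regexp -all -inline -expanded -- $re $value]
}
```

Whitespace matches none of the three alternatives, so it is simply skipped between pieces. An unterminated quote falls through to the last alternative and comes back as a lone punctuation character, which a caller could treat as a syntax error.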
From: Alexandre Ferrieux on 6 Nov 2009 05:38

On Nov 5, 9:49 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> OTOH, you seem convinced it'll work. It seems easy enough to
> substitute in your ideas, so I'll give it a try.

Thanks. I do appreciate it.

> > Now you mentioned nesting constructs (comments), which do need
> > context-free. However, I don't see them in RFC2616. Can you show me ?
>
> From the rfc:
> Fielding, et al. Expires April 29, 2010 [Page 20]

Ah, so it's not RFC2616, it's the Fielding HTTPbis IETF draft. You're
on the bleeding edge, eh ;-)

Anyway I think we all agree that ignoring those braindead nested
comments is the way to go, possibly with post-processing of a regular
approximation of the grammar. Among other things, this post-processing
can leverage Tcl's own parser by morphing the nesting constructs
(here, parentheses) into braces, and then calling [info complete].

But personally I'd happily [error UNSUPPORTED_STUPID_STANDARD].

-Alex
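
Editorial sketch of the [info complete] idea mentioned above, not code from the thread. The proc name comments_closed is hypothetical, and replacing literal braces with "_" is an assumption made only so that they cannot interfere with the check. A backslash before a parenthesis becomes a backslash before a brace, which Tcl's brace parser also skips, so the escape rule carries over unchanged.

```tcl
# Hypothetical helper (an assumption for illustration): returns 1 if every
# "(" in $s is closed, honouring backslash escapes, else 0.
proc comments_closed {s} {
    # Neutralise real braces, then turn parentheses into braces.
    # [string map] makes a single pass, so the new braces are not remapped.
    set mapped [string map [list \{ _ \} _ ( \{ ) \}] $s]
    # One extra pair of braces around the text keeps quotes, brackets
    # and "#" inside it from being read as Tcl syntax.
    return [info complete "{$mapped}"]
}
```

As far as I can tell, [info complete] only reports constructs that are still open, so this catches an unterminated comment but would not flag a stray closing parenthesis; that case would need a separate check.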