From: Alexandre Ferrieux on 4 Nov 2009 12:42

On Nov 4, 5:27 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> You can't apply regexp to something which must be parsed.

Running for QOTW, eh ?

By "must be parsed", do you by any chance mean "in the context-free
class, not the regular class" ? If yes, then what you're saying is that
the chunk-ext subsyntax (or even the whole RFC2616 syntax) has
_recursions_ which kick it out of the class of regular languages. Is
that the case ? A cursory look at the RFC doesn't show such recursion.

If you confirm that cursory look, then basic language theory shows that
the syntax is within reach of a finite automaton without a stack, hence
of [regexp].

If you don't, then produce the minimal set of rules showing a loop.

-Alex
From: tom.rmadilo on 4 Nov 2009 13:58

On Nov 4, 9:42 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> By "must be parsed", do you by any chance mean "in the context-free
> class, not the regular class" ?
> If yes, then what you're saying is that the chunk-ext subsyntax (or
> even the whole RFC2616 syntax) has _recursions_ which kick it out of
> the class of regular languages. Is that the case ? A cursory look at
> the RFC doesn't show such recursion.
>
> If you confirm that cursory look, then basic language theory shows
> that the syntax is within reach of a finite automaton without stack,
> hence of [regexp].
>
> If you don't, then produce the minimal set of rules showing a loop.

Why not just produce the easy-to-read-and-understand regexp?

But before you ever get to the regular expression, your approach
already contains a huge security flaw. Maybe look into the problem of
allowing <CR>, <LF>, <CR><LF> or <LF><CR> in field names or field
values. [gets] enables this type of attack.

http://www.packetstormsecurity.org/papers/general/whitepaper_httpresponse.pdf

Not really sure if this qualifies as a loop:

    "    (double quote)
    \"   (escaped double quote)
    \\"  (double quote)

Comments can also be nested. (Why are they even allowed?)

The HTTP message syntax is horrible. It represents malpractice.
Unfortunately the situation is made worse by clients, servers and
proxies not following the syntax rules.

I once found and fixed a bug in the Mozilla cookie-handling code. I
found it because I correctly implemented the standard, but the browser
couldn't handle quoted integers. But even with the fix now in place,
you still can't safely follow the standard, since there are millions of
old browsers which still contain the bug. The original bug was really
allowing the cookie code to do its own parsing. Eventually parsing gets
done everywhere, each time slightly differently.

While I was writing this code I read somewhere that Google's cookie
value violated the header syntax. So I was interested in testing it,
expecting to see it fail. But it didn't fail. Anyway, sometimes
patterns neither exactly cover the allowed syntax nor catch every
violation.
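[Editor's note: a minimal sketch, not part of the original thread, of the kind of check that defeats the <CR>/<LF> header-splitting attack described above. The proc name validate_header is invented for illustration.]

    # Reject any header field whose name is not an RFC 2616 token, or
    # whose value contains a bare CR or LF in any combination, so an
    # attacker cannot smuggle an extra header line or a premature
    # end-of-headers through the client.
    proc validate_header {name value} {
        if {![regexp {^[!#$%&'*+.^_`|~0-9A-Za-z-]+$} $name]} {
            error "illegal header field name: $name"
        }
        if {[regexp {[\r\n]} $value]} {
            error "CR/LF injection attempt in value of $name"
        }
        return "$name: $value"
    }

For example, validate_header X-Test "a\r\nInjected: evil" raises an
error instead of emitting two header lines.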
From: Alexandre Ferrieux on 4 Nov 2009 18:53

On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > If you don't, then produce the minimal set of rules showing a loop.
>
> Why not just produce the easy to read and understand regexp?

Sure:

    proc match_chunk_header_line s {
        if {![regexp -expanded {
            ^
            ([0-9a-fA-F]+)       # hex length
            \s*
            (?:                  # begin optional chunkext
               ;\s*
               [^\s=]+           # name is a token
               =
               (?:               # begin alternative
                  [^"\s;]+       # simple token
               |                 # or
                  "              # opening quote
                  ([^"\\]|\\.)*  # normal chars or escape sequences
                  "              # closing quote
               )                 # end alternative
               \s*
            )*                   # end and iterate chunkext
            \r                   # ah, don't forget the CR
            $
        } $s -> len]} {
            error "Syntax error in chunk header"
        }
        scan $len %x len
        return $len
    }

> But before you ever get to the regular expression, your approach
> already contains a huge security flaw. Maybe look into the problem of
> allowing <CR>, <LF>, <CR><LF> or <LF><CR> in field names or field
> values. [gets] enables this type of attack.

Ah, so the problem of [gets] is with line continuations. Right. But
less far-fetched than the initial statement about "HTTP being
intrinsically not line-oriented", eh ?

Two remarks though:

(1) The above regexp can be perturbed to also allow (and identify) an
unterminated open quote in a value; then it's just a matter of
reiterating [gets] and appending.

(2) All this nitpicking goes far beyond what the current http package
tries to do. As you yourself admitted, the HTTP standard nears insanity
in some of its details. But who on earth will be using such shenanigans
in a server ? Remember the situation is very different from your
server-side experience. A server doesn't attack a client ! (I sincerely
hope you see the asymmetry.)

> Not really sure if this qualifies as a loop:
>
>     "    (double quote)
>     \"   (escaped double quote)
>     \\"  (double quote)

No. What I call a loop here is a non-tail recursion: a loop in the
graph of rewrite rules, with nonempty terminals emitted on both sides
during a cycle. Read up on context-free vs regular grammars.

> Comments can also be nested. (Why are they even allowed?)

Ah: THIS is the true regexp-killer. But as you say, do we care ?

-Alex
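[Editor's note: why nested comments are the "regexp-killer": matching balanced parentheses requires a depth counter, which a finite automaton, and hence [regexp], lacks. A hand-rolled sketch, not from the thread; skip_comment is an invented name.]

    # Skip an RFC 2616 comment "( ... )", which may nest, starting at
    # index idx of string s (idx must point at the opening paren).
    # Returns the index just past the matching close paren. The depth
    # counter is exactly the bit of state a regular expression cannot
    # keep.
    proc skip_comment {s idx} {
        set depth 0
        set n [string length $s]
        for {set i $idx} {$i < $n} {incr i} {
            switch -exact -- [string index $s $i] {
                "("  { incr depth }
                ")"  { incr depth -1
                       if {$depth == 0} { return [expr {$i + 1}] } }
                "\\" { incr i }  ;# quoted-pair: skip the escaped char
            }
        }
        error "unterminated comment"
    }

For example, skip_comment "(a (nested) comment) rest" 0 returns 20, the
index just past the outer closing paren.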
From: tom.rmadilo on 4 Nov 2009 20:47

On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > Comments can also be nested. (Why are they even allowed?)
>
> Ah: THIS is the true regexp-killer. But as you say, do we care ?

You (the parser) either understand the syntax or reject the message.
I'm willing to reject. I would also like to reject line continuations,
but the <CR>, <LF> stuff has nothing to do with that. Apparently what
can happen is that buggy servers allow invalid chars in the headers,
and if the client accepts these invalid chars, the client could end up
submitting smuggled data using the client's credentials.

But the parsing issue seems more complex if you have to read data and
then fit it to a regular expression. I compare this to a tool like
flex. Flex allows regular expressions to delimit tokens, but flex has
several advantages. First, you get to match against many possible
regular expressions. Second, a regular-expression match triggers
arbitrary code. Third, you don't have to consume additional buffer
chars; there is no indexing.

The hard parts are quoted strings and comments. The insane syntax which
defines HTTP requires a combination of scanning, tokenizing and
parsing. Fortunately Tcl protects against most problems like buffer
overflows, but the HTTP protocol is uniquely vulnerable to
manipulation, and there doesn't seem to be any margin for error.

Anyway, I think you have somewhat valid complaints about this
implementation, but I think they assume a perfect world where everybody
is reasonable. I have a personality complex which doesn't allow me to
trust myself, much less a foreign server. I also have a previously
declared bias against regular expressions: I consider them brittle code
which requires much documentation; I call them miniature programs.

If the regular expression must extend beyond a single token, the
complexity increases dramatically. But I've never seen a regular
expression which could handle escape sequences, since that requires
counting, so I don't see how you cover quoted strings even as a
stand-alone token. If we can get past that problem, maybe there is a
regular-expression solution. Personally, something like running flex on
the input stream would be an ideal model, but in my experience you
still need special code for tokens which contain escapes. Example:

http://www.junom.com/gitweb/gitweb.perl?p=tnt.git;a=blob;f=packages/view/c/t3.l;h=4baf7
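[Editor's note: the flex-like model described above — several patterns, first match wins, each match firing action code and consuming its input — can itself be sketched in a few lines of Tcl. The rule set and proc name here are invented for illustration.]

    # A toy flex-style scanner: each rule is a {pattern tokentype}
    # pair; at every step the first anchored pattern that matches is
    # consumed, with no manual buffer indexing. Whitespace is matched
    # and discarded by the SKIP rule.
    proc scan_line {s} {
        set rules {
            {{^[0-9a-fA-F]+}                  HEXNUM}
            {{^"([^"\\]|\\.)*"}               QSTRING}
            {{^[!#$%&'*+.^_`|~0-9A-Za-z-]+}   TOKEN}
            {{^;}                             SEMI}
            {{^=}                             EQUALS}
            {{^[ \t]+}                        SKIP}
        }
        set out {}
        while {$s ne ""} {
            set matched 0
            foreach rule $rules {
                lassign $rule re type
                if {[regexp $re $s match]} {
                    if {$type ne "SKIP"} {
                        lappend out [list $type $match]
                    }
                    set s [string range $s [string length $match] end]
                    set matched 1
                    break
                }
            }
            if {!$matched} { error "scan error at: $s" }
        }
        return $out
    }

Note that the QSTRING rule is the same escaped-string idiom used in the
chunk-header regexp earlier in the thread, which is the crux of the
disagreement here.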
From: tom.rmadilo on 4 Nov 2009 21:15
On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> (2) all this nitpicking goes far beyond what the current http
> package tries to do. As you yourself admitted, the HTTP standard nears
> insanity in some of its details. But who on earth will be using such
> shenanigans in a server ? Remember the situation is very different
> from your server-side experience. A server doesn't attack a client !
> (I sincerely hope you see the asymmetry).

I think you underestimate the severity of the problem. Exploits exist
for many reasons, mostly ignoring the advice in the newest standards.
Ignoring stuff that you shouldn't have to worry about, however insane,
is what causes problems. Basically an HTTP message is unsanitized user
input; it should be examined closely. I'm not sure where your "a server
doesn't attack a client" optimism comes from. Ever heard of cross-site
scripting attacks? They are practically a growth industry.