From: Alexandre Ferrieux on
On Nov 4, 5:27 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> You can't apply regexp to something which must be parsed.

Running for QOTW, eh ?

By "must be parsed", do you by any chance mean "in the context-free
class, not the regular class" ?
If yes, then what you're saying is that the chunk-ext subsyntax (or
even the whole RFC2616 syntax) has _recursions_ which kick it out of
the class of regular languages. Is that the case ? A cursory look at
the RFC doesn't show such recursion.

If you confirm that cursory look, then basic language theory shows
that the syntax is within reach of a finite automaton without stack,
hence of [regexp].

If you don't, then produce the minimal set of rules showing a loop.

-Alex
From: tom.rmadilo on
On Nov 4, 9:42 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Nov 4, 5:27 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
>
>
> > You can't apply regexp to something which must be parsed.
>
> Running for QOTW, eh ?
>
> By "must be parsed", do you by any chance mean "in the context-free
> class, not the regular class" ?
> If yes, then what you're saying is that the chunk-ext subsyntax (or
> even the whole RFC2616 syntax) has _recursions_ which kick it out of
> the class of regular languages. Is that the case ? A cursory look at
> the RFC doesn't show such recursion.
>
> If you confirm that cursory look, then basic language theory shows
> that the syntax is within reach of a finite automaton without stack,
> hence of [regexp].
>
> If you don't, then produce the minimal set of rules showing a loop.

Why not just produce the easy to read and understand regexp?

But before you ever get to the regular expression, your approach
already contains a huge security flaw. Maybe look into the problem of
allowing <CR>, <LF>, <CR><LF> or <LF><CR> in field names or field
values. [gets] enables this type of attack.

http://www.packetstormsecurity.org/papers/general/whitepaper_httpresponse.pdf
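A minimal guard against that class of attack fits in a few lines of Tcl; this is only a sketch, and the proc name and policy are invented here, not taken from any existing package:

```tcl
# Reject any header field name or value containing a bare CR or LF,
# the raw material of response-splitting/smuggling attacks.
# Sketch only; name and policy are illustrative.
proc header_field_ok {name value} {
    expr {![regexp {[\r\n]} $name] && ![regexp {[\r\n]} $value]}
}
```

A client that runs every parsed field through a check like this, and drops the message on failure, closes the hole described in the paper above.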

Not really sure if this qualifies as a loop:

" (double quote)
\" (escaped double quote)
\\" (double quote)

Comments can also be nested. (Why are they even allowed?)

The HTTP message syntax is horrible. It represents malpractice.
Unfortunately the situation is made worse by clients, servers and
proxies not following the syntax rules.

I once found and fixed a bug in the mozilla cookie handling code. I
found it because I correctly implemented the standard, but the browser
couldn't handle quoted integers. But even with the fix now in place,
you still can't safely follow the standard, since there are millions
of old browsers which still contain the bug. But the original mistake
was really allowing the cookie code to do its own parsing: eventually
parsing gets done slightly differently everywhere.

While I was writing this code I read somewhere that google's cookie
value violated the header syntax. So I was interested in testing it to
see it fail. But it didn't fail. Anyway, sometimes a pattern neither
covers exactly the allowed syntax nor catches every violation.
From: Alexandre Ferrieux on
On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> On Nov 4, 9:42 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
> wrote:
>
>
>
>
>
> > On Nov 4, 5:27 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > > You can't apply regexp to something which must be parsed.
>
> > Running for QOTW, eh ?
>
> > By "must be parsed", do you by any chance mean "in the context-free
> > class, not the regular class" ?
> > If yes, then what you're saying is that the chunk-ext subsyntax (or
> > even the whole RFC2616 syntax) has _recursions_ which kick it out of
> > the class of regular languages. Is that the case ? A cursory look at
> > the RFC doesn't show such recursion.
>
> > If you confirm that cursory look, then basic language theory shows
> > that the syntax is within reach of a finite automaton without stack,
> > hence of [regexp].
>
> > If you don't, then produce the minimal set of rules showing a loop.
>
> Why not just produce the easy to read and understand regexp?

Sure:

proc match_chunk_header_line s {
    if {![regexp -expanded {
        ^
        ([0-9a-fA-F]+)     # hex length
        \s*
        (?:                # begin optional chunkext
          ;\s*
          [^\s=]+          # name is a token
          =
          (?:              # begin alternative
            [^"\s;]+       # simple token
          |                # or
            "              # opening quote
            ([^"\\]|\\.)*  # normal chars or escape sequences
            "              # closing quote
          )                # end alternative
          \s*
        )*                 # end and iterate chunkext
        \r                 # ah don't forget CR
        $
    } $s -> len]} {
        error "Syntax error in chunk header"
    }
    scan $len %x len
    return $len
}
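And to see it in action, a few sample calls (the chunk-header lines are invented here; the proc is repeated so the snippet runs standalone):

```tcl
# match_chunk_header_line as above; sample inputs are mine.
proc match_chunk_header_line s {
    if {![regexp -expanded {
        ^
        ([0-9a-fA-F]+)     # hex length
        \s*
        (?:                # begin optional chunkext
          ;\s*
          [^\s=]+          # name is a token
          =
          (?:              # begin alternative
            [^"\s;]+       # simple token
          |                # or
            "              # opening quote
            ([^"\\]|\\.)*  # normal chars or escape sequences
            "              # closing quote
          )                # end alternative
          \s*
        )*                 # end and iterate chunkext
        \r                 # ah don't forget CR
        $
    } $s -> len]} {
        error "Syntax error in chunk header"
    }
    scan $len %x len
    return $len
}

puts [match_chunk_header_line "1a;foo=bar\r"]         ;# 26
puts [match_chunk_header_line "0\r"]                  ;# 0
puts [match_chunk_header_line "ff;note=\"x;y\"\r"]    ;# 255
```

Note that a quoted value may legally contain `;`, which is why the quoted alternative is needed at all; a bad hex length such as "zz" simply raises the error.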

> But before you ever get to the regular expression, your approach
> already contains a huge security flaw. Maybe look into the problem of
> allowing <CR>, <LF>, <CR><LF> or <LF><CR> in field names or field
> values. [gets] enables this type of attack.

Ah, so the problem of [gets] is with line continuations. Right. But
far less far-fetched than the initial statement about "HTTP being
intrinsically not line-oriented", eh ?
Two remarks though:

   (1) the above regexp can be perturbed to also allow (and
identify) an unterminated open quote in a value, and then it's just a
matter of reiterating [gets] and appending.

(2) all this nitpicking goes far beyond what the current http
package tries to do. As you yourself admitted, the HTTP standard nears
insanity in some of its details. But who on earth will be using such
shenanigans in a server ? Remember the situation is very different
from your server-side experience. A server doesn't attack a client !
(I sincerely hope you see the asymmetry).


> Not really sure if this qualifies as a loop:
>
> " (double quote)
> \" (escaped double quote)
> \\" (double quote)

No. What I call a loop here is a non-tail recursion. That is, a loop
in the graph of rewrite rules, with nonempty terminals emitted on both
sides during a cycle. Read up on context-free vs regular grammars.

> Comments can also be nested. (Why are they even allowed?)

Ah: THIS is the true regexp-killer. But as you say, do we care ?
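Indeed, RFC 2616's `comment = "(" *( ctext | quoted-pair | comment ) ")"` is the one genuinely non-regular production. If one ever did care, a counting scanner is only a few lines of Tcl (a sketch; the proc name is mine):

```tcl
# Scan a (possibly nested) RFC 2616 comment starting at index i of s.
# Returns the index just past the closing ")", or -1 on malformed
# input. Illustrative only, not part of any shipped http package.
proc scan_comment {s i} {
    if {[string index $s $i] ne "("} { return -1 }
    set depth 0
    set n [string length $s]
    while {$i < $n} {
        set c [string index $s $i]
        if {$c eq "("} {
            incr depth
        } elseif {$c eq ")"} {
            incr depth -1
            if {$depth == 0} { return [incr i] }
        } elseif {$c eq "\\"} {
            incr i   ;# quoted-pair: skip the escaped character
        }
        incr i
    }
    return -1   ;# unbalanced parentheses: malformed comment
}
```

The depth counter is exactly the "stack" that a pure finite automaton, hence a regexp, lacks.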

-Alex
From: tom.rmadilo on
On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > Comments can also be nested. (Why are they even allowed?)
>
> Ah: THIS is the true regexp-killer. But as you say, do we care ?

You (the parser) either understand the syntax or reject the message.
I'm willing to reject. I would also like to reject line continuations,
but the <CR>, <LF> stuff has nothing to do with that. Apparently what
can happen is that buggy servers can allow invalid chars in the
headers, and if the client accepts these invalid chars the client
could end up submitting smuggled data using the client's credentials.

But the parsing issue seems more complex if you have to read data and
then fit it to a regular expression.

I compare this to a tool like flex. Flex allows regular expressions to
delimit tokens. But flex has many advantages. First, you get to match
against many possible regular expressions. Second, the regular
expression match triggers arbitrary code. Third, you don't have to
consume additional buffer chars; there is no indexing. The hard parts
are quoted strings and comments. But the insane syntax which defines
http requires a combination of scanning, tokenizing and parsing.
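That flex model (an ordered list of patterns, first match wins, each match firing an action) is easy to mimic in a few lines of Tcl; a toy sketch, not code from any existing package:

```tcl
# Toy flex-style scanner: rules is a flat list of {pattern action}
# pairs. At each position the first pattern that matches (anchored at
# the current position) wins, and its action is recorded with the
# matched text. Patterns must never match the empty string, or this
# loops forever.
proc tokenize {s rules} {
    set out {}
    set i 0
    while {$i < [string length $s]} {
        set rest [string range $s $i end]
        set matched 0
        foreach {pat action} $rules {
            if {[regexp "^(?:$pat)" $rest tok]} {
                lappend out [list $action $tok]
                incr i [string length $tok]
                set matched 1
                break
            }
        }
        if {!$matched} { error "no rule matches at index $i" }
    }
    return $out
}
```

With rules like `{[0-9a-f]+} NUM {;} SEMI {=} EQ {[a-z]+} TOK`, the chunk header `1a;xy=z` tokenizes into NUM, SEMI, TOK, EQ, TOK. Real flex adds the buffering, which is the part [gets] has to stand in for.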

Fortunately Tcl protects against most problems like buffer overflows,
but the http protocol is uniquely vulnerable to manipulation and there
doesn't seem to be any margin for error.

Anyway, I think you have somewhat valid complaints about this
implementation, but I think they assume a perfect world where
everybody is reasonable. I have a personality complex which doesn't
allow me to trust myself, much less a foreign server.

I also have a previously declared bias against regular expressions. I
consider them brittle code which requires much documentation; I call
them miniature programs. If a regular expression must extend beyond a
single token, the complexity increases dramatically.

But I've never seen a regular expression which could handle escape
sequences, since that requires counting, so I don't see how you cover
quoted strings even as a stand-alone token. If we can get past that
problem, maybe there is a regular expression solution.

Personally, I think something like running flex on the input stream
would be an ideal model, but in my experience you still need special
code for tokens which contain escapes.

Example:

http://www.junom.com/gitweb/gitweb.perl?p=tnt.git;a=blob;f=packages/view/c/t3.l;h=4baf7
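The "special code for tokens which contain escapes" can stay small in Tcl: delimit the token with a pattern, then unescape in a second pass. A sketch (the proc name is mine, not from any package):

```tcl
# Extract an HTTP quoted-string from the front of s and return its
# unescaped contents; error out if s doesn't start with one.
# Sketch only.
proc parse_quoted_string {s} {
    if {![regexp {^"((?:[^"\\]|\\.)*)"} $s -> body]} {
        error "not a quoted-string"
    }
    # Second pass: reduce each quoted-pair \X to plain X.
    regsub -all {\\(.)} $body {\1} body
    return $body
}
```

The pattern handles the escapes without any counting; only the unescaping itself needs the second pass.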

From: tom.rmadilo on
On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
>    (2) all this nitpicking goes far beyond what the current http
> package tries to do. As you yourself admitted, the HTTP standard nears
> insanity in some of its details. But who on earth will be using such
> shenanigans in a server ? Remember the situation is very different
> from your server-side experience. A server doesn't attack a client !
> (I sincerely hope you see the asymmetry).

I think you underestimate the severity of the problem. Exploits exist
for many reasons, mostly because implementors ignore the advice in the
newest standards. Ignoring stuff that you shouldn't have to worry
about, however insane, is exactly what causes problems.

Basically an http message is unsanitized user input. It should
probably be examined closely.

I'm not sure where your "server doesn't attack a client" optimism
comes from. Ever heard of cross-site scripting attacks? They are
practically a growth industry.