From: phil on
Hello everyone.

Reading the thread about http file events and such reminded me of the
beta tclhttp-1.1 client that was released many, many years ago on
sourceforge.

Maybe the things it does are plain wrong.
Or maybe the things it does can help.

Feel free to look and steal ideas:

http://tclhttp1-1.cvs.sourceforge.net/viewvc/tclhttp1-1/tclhttp1-1/

Phil
From: tom.rmadilo on
On Feb 24, 1:43 am, "Donal K. Fellows"
<donal.k.fell...(a)manchester.ac.uk> wrote:
> On 23 Feb, 17:47, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > Right, my htclient code tokenizes the header field values as well as
> > headers. If you can't tokenize the field value, you can't really
> > distinguish between valid and invalid headers. Maybe a regexp exists
> > which can do this, right now I use an fsm to do the job. It works and
> > is guaranteed not to block.
>
> I *think* that the natural way to process HTTP headers is as a multi-
> level grammar. First you read lines until you get a proper blank line
> (i.e., the end of the headers). Then you go back and split the block
> of headers into individual header "lines" (which may be multiple
> lines; I believe continuation lines must start with a space or tab).
> Then, if desired, you parse the individual header lines (at the very
> least, you need to work out what the name of the header line is, but
> that should be trivial).

My theory is that you have to lex/parse headers pretty much like you
parse Tcl source code: char-by-char. Alex believes that you can
unambiguously find line-endings and then apply a regular expression,
or some series of regular expressions to validate that each header
fits the basic definition. There is no doubt that you can easily
separate valid headers; the question is whether you can easily detect
invalid or malicious headers.



> > This is exactly what I do to avoid blocking: I read one char at a time
> > for headers or [chan pending] chars for the body.
>
> That's quite messy. With a non-blocking channel (if you're not non-
> blocking, it's time to change!) you can just do a [gets $ch lineVar]
> when the channel is readable. That will either read a full line (up to
> whatever is set as the line separator) and return the length of line
> (minus terminator) or return -1. If it returns -1, you've either got
> an EOF or you've exhausted the bytes available without being able to
> get a line (i.e., [fblocked $ch] will return 1 at that point). If
> you've blocked, you can just go back to sleep waiting in the event
> loop for some more bytes to arrive. Or if you're being careful, you
> can use [chan pending input $ch] to see whether the data that's
> accumulated in the input buffers - which must be a single incomplete
> line because you didn't read a complete one - has exceeded some
> threshold and you're going to kill the connection for being from a
> bunch of scumbags. You probably want to limit the number of complete
> lines you read too, for identical reasons.
>
> In short, having *some* buffering and memory allocation is OK. So long
> as it doesn't get out of hand, it's easier to let Tcl do all those
> bits for you. And trying to do all the parsing in a single level of
> FSM[*] is painful.
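
For reference, the [gets]-based approach described in the quoted text
might look roughly like this. It is only a sketch: onReadable,
headerLines, maxLineBytes and maxLines are made-up names, and the two
limits are arbitrary.

```tcl
# Sketch only; all names and limits here are assumptions for illustration.
set maxLineBytes 8192
set maxLines     100

proc onReadable {ch} {
    global headerLines maxLineBytes maxLines
    set n [gets $ch line]
    if {$n < 0} {
        if {[eof $ch]} {
            close $ch
        } elseif {[fblocked $ch] && [chan pending input $ch] > $maxLineBytes} {
            # The incomplete line has outgrown the limit: drop the peer.
            close $ch
        }
        # Otherwise return to the event loop and wait for more bytes.
        return
    }
    if {$n == 0} {
        # Blank line: end of the header block.
        fileevent $ch readable {}
        set headerLines($ch,done) 1
        return
    }
    lappend headerLines($ch) $line
    if {[llength $headerLines($ch)] > $maxLines} {
        close $ch
    }
}
```

It would be wired up with [fconfigure $ch -blocking 0] followed by
[fileevent $ch readable [list onReadable $ch]]. Note that it only
collects physical lines; unfolding continuation lines and parsing each
header are separate steps.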

I think you are overlooking two key issues:
1. [read $chan 1] is much more efficient than parsing a string
buffer with string commands.
2. If you put the unparsed header blobs into a string, which will be
very fast, you still need to lex/parse them, and unless you have a
bullet-proof regular expression, you have to do this parsing by hand,
using relatively slow string commands.

I have looked far and wide for any other HTTP server/client which uses
a regular expression to parse a generic header into tokens, and have
not found one. By contrast, it is easy to tokenize HTTP headers using
the char-at-a-time method.
The benefits are easy to demonstrate: the date header can use any of
three different formats. You could use three different regular
expressions to find a match, or you could just examine the length of
the list of tokens. Each format has a different number of tokens.
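
A minimal illustration of the token-count idea, assuming the tokens are
simply the whitespace-separated words of the field value (the real
tokenizer may split differently). dateFormat is a hypothetical name.

```tcl
# Sketch: pick the candidate date format from the number of
# whitespace-separated tokens. Each token still has to be validated.
proc dateFormat {value} {
    set tokens [regexp -all -inline {\S+} $value]
    switch -- [llength $tokens] {
        6 { return rfc1123 }
        4 { return rfc850 }
        5 { return asctime }
        default { return invalid }
    }
}
```

For example, "Sun, 06 Nov 1994 08:49:37 GMT" has six such tokens,
"Sunday, 06-Nov-94 08:49:37 GMT" has four, and
"Sun Nov  6 08:49:37 1994" has five.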

Another example is the case where a field value is actually a comma
separated list. But if a list item is a quoted text which contains a
comma, you need to distinguish the context of the comma.
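
To make the comma problem concrete, here is a rough char-at-a-time
sketch that splits a field value only on commas outside quoted text.
splitList is a hypothetical name, not code from htclient or Tcllib, and
it ignores everything except quoting and backslash escapes.

```tcl
# Sketch: three states track whether a comma is a separator or data.
proc splitList {value} {
    set items {}
    set cur ""
    set state plain
    foreach c [split $value ""] {
        switch -- $state {
            plain {
                if {$c eq ","} {
                    # Separator: finish the current item.
                    lappend items [string trim $cur]
                    set cur ""
                    continue
                }
                if {$c eq "\""} { set state quoted }
            }
            quoted {
                if {$c eq "\\"} {
                    set state escape
                } elseif {$c eq "\""} {
                    set state plain
                }
            }
            escape { set state quoted }
        }
        append cur $c
    }
    if {$state ne "plain"} { error "unterminated quoted string" }
    lappend items [string trim $cur]
    return $items
}
```

With this, [splitList {a, "b,c", d}] gives three items, the second
being "b,c" with its quotes still attached.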

My guess is that a regular expression which would work on a csv file,
converting a line of comma separated values (maybe quoted and escaped)
into a tcl list would be close to what is needed (except csv allows
newlines within quoted values).

Also, for some strange reason the quotes around quoted text in a
Set-Cookie/Cookie header are not part of the value, so they have to be
removed.

The regular expression idea only works if you can divide the header
block into individual headers. Many, many exploits are based upon the
apparent triviality of this task.
From: Alexandre Ferrieux on
On Feb 24, 6:11 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Alex, you are the only one suggesting this is a bug. And if you look
> closely, below you suggest it isn't a bug, just a big misunderstanding
> on my part. Right now I just call it consistent behavior.

Sorry you lost me completely.

You seemed to be complaining about infinite loops, and I believed you
were saying they were unavoidable by the current semantics of Tcl
primitives, hence my suggestion to transform vague complaints into a
bug report. Of course until this bug report is produced, *I* know Tcl
has no bugs in that area ;-)

> But I'm not using [gets], so the idiom doesn't apply. Here is how
> chunk data is read:
> Please note the absence of an explicit loop. Each fileevent is handled
> separately.

Oh, the loop is just an optimization of course.
But the code you show is the one accumulating chunk data, not the one
for headers.
Again you're shifting focus. Why ?


> I know that, I was just describing the behavior. If a readable
> fileevent were generated by accident and then [chan pending] reports
> zero bytes pending, it seems like eventually there would be bytes
> available. Instead, if I do [read $chan 0] and return, the next
> fileevent, and the next, etc. gives the same result.

[read $chan 0] does nothing... is that a kind of joke ?

> Of course I have no idea what the actual "readable event" was, but it
> can't be an error condition because I'm able to read from the
> channel.

Bottom line:
(1) I don't grasp what you're complaining about.
(2) I advise you to forget about [chan pending] for the time being
(3) Once you have your code working efficiently (regexp) based on
simple nonblocking IO, you can worry about DoS attacks, and we'll help
you inject [chan pending] at the key spot.

-Alex
From: tom.rmadilo on
On Feb 24, 10:27 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:

> Bottom line:
>  (1) I don't grasp what you're complaining about.

You think I'm complaining?

Earlier you said I had a "vague complaint" and that:
"You (as in tom) seemed to be complaining about infinite loops, and I
believed you
were saying they were unavoidable by the current semantics of Tcl
primitives"

And yet I included an example of how I avoid an infinite loop, by
reading one char when [chan pending] returns zero. In other words, I
examine what [chan pending] returns and purposefully avoid reading
zero bytes...because otherwise you get into an infinite loop. This is
purely a description of the facts. I have no idea if this is a bug,
and I don't care because, as you say it makes no sense to [read $chan
0]. Another point is that this situation never comes up when reading
headers, since I always read one byte. I don't consult [chan pending],
and I don't consult it because I just got a fileevent which caused the
callback to run. It is possible that the fileevent was triggered by
something other than data becoming available in the input buffer;
hopefully this will lead to an error in the callback and it will
quickly exit.
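
In outline, the body-reading callback described above does something
like the following. This is a simplified sketch rather than the actual
htclient code; onBody and bodyData are made-up names.

```tcl
# Sketch: read whatever is already buffered, but never ask for zero bytes.
proc onBody {ch} {
    global bodyData
    set n [chan pending input $ch]
    if {$n < 1} { set n 1 }
    append bodyData($ch) [read $ch $n]
    if {[eof $ch]} {
        fileevent $ch readable {}
        close $ch
    }
}
```

The channel is assumed to be non-blocking, with the callback installed
via [fileevent $ch readable [list onBody $ch]].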

The only somewhat puzzling detail of [chan pending] is that the C
level API "Tcl_InputBuffered" scales from 0, where 0 means the channel
isn't open for reading. The Tcl level API uses -1 to indicate the same
thing. Maybe the input buffer can never contain zero bytes? In general
I would think it impossible to map between these two representations.

>  (2) I advise you to forget about [chan pending] for the time being

You only advise this because of your complete misunderstanding of the
importance of [chan pending]. If it wasn't necessary, or useful, why
in the hell did the core team approve it? Why does the underlying C
API exist? If [chan pending] was not available, I would never have
even attempted to write a DoS-tolerant network client/reader in Tcl.
Tcl still needs timeouts on channel operations if developers ever want
to guarantee robust client/server application development.

>  (3) Once you have your code working efficiently (regexp) based on
> simple nonblocking IO, you can worry about DoS attacks

Alex, you are the one who is so entranced with this magical regular
expression. It only exists in your mind, or maybe you just want me to
believe it exists, who knows. You claim it exists. Yet there is no
evidence anywhere that a single regular expression, or a two-phase,
two-regular-expression test, will validate and tokenize a generic
header and reject all invalid headers.

A simpler task would be to create such a regexp for a comma separated
values file, or just one record of a csv. It is a close call whether a
csv file is harder to parse than an HTTP header (just the field-value
part). My guess is that the field-value is 2-5 times more difficult.

IMHO, such a regular expression would be infinitely more valuable than
an HTTP-header parser/re. Yet if you look at the available Tcl code
which parses csv, you don't find anything simple. Maybe Jeff Hobbs and
Andreas Kupries are as dense-headed as I am? Otherwise why write a
couple of hundred lines of code, with dozens of regexps, regsubs and
[string map]s? The Tcllib csv module is just waiting for your expert
analysis.

> and we'll help you inject [chan pending] at the key spot.

Why not inject a regular expression? You claim it is so easy. Please
prove I'm a complete idiot. Forget about the initial issue of
delimiting headers. Start with a header which contains the invalid
quoted string "abc\\"def" and the valid quoted string "abc\"def", now
handle any combination of these two. A correct result would barf on
the first "abc\\"def".
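
For comparison, a char-at-a-time check of those two strings is short.
The sketch below only verifies the quoting structure (opening quote,
backslash escapes, closing quote, nothing after it); isQuotedString is
a hypothetical name and the check is not a full field-value validator.

```tcl
# Sketch: return 1 if $s is exactly one quoted string, else 0.
proc isQuotedString {s} {
    set state start
    foreach c [split $s ""] {
        switch -- $state {
            start {
                if {$c ne "\""} { return 0 }
                set state quoted
            }
            quoted {
                if {$c eq "\\"} {
                    set state escape
                } elseif {$c eq "\""} {
                    set state done
                }
            }
            escape { set state quoted }
            done   { return 0 }
        }
    }
    return [expr {$state eq "done"}]
}
```

Here [isQuotedString {"abc\"def"}] returns 1, while
[isQuotedString {"abc\\"def"}] returns 0, because the escaped
backslash leaves the following quote to close the string and the
trailing def" is left over.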
From: Alexandre Ferrieux on
On Feb 25, 1:01 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> You think I'm complaining?
> [...]
> And yet I included an example of how I avoid an infinite loop,

Aw sh*t, I must be having an impedance mismatch with your prose.
Forget about who's complaining. Let somebody with a Tom-decoder
stacked on stdin take over this discussion.

> >  (2) I advise you to forget about [chan pending] for the time being
>
> You only advise this because of your complete misunderstanding of the
> importance of [chan pending].

Re-read: I advise *you*, not anybody else...

And yes, you must be right, my total misunderstanding of Tcl IO
internals is what lets me participate in their maintenance with a
fresh eye :)

> >  (3) Once you have your code working efficiently (regexp) based on
> > simple nonblocking IO, you can worry about DoS attacks
>
> Alex, you are the one who is so entranced with this magical regular
> expression. It only exists in your mind

No, I posted it here, and at the time you even admitted learning
something about the power of regexps. That's why I earnestly asked
whether you were still using it or not. And now you come back after a
brain-reformat of sorts, saying "really regexps suck, char-by-char
hard-coded FSMs in Tcl rock".

Back to square one. Don't count on me for the rest of the game.

-Alex