From: phil on 24 Feb 2010 12:41
Hello everyone.

Reading the thread about http file events and such reminded me of the beta tclhttp-1.1 client that was released many, many years ago on SourceForge.

Maybe the things it does are plain wrong. Or maybe the things it does can help. Feel free to look and steal ideas:

http://tclhttp1-1.cvs.sourceforge.net/viewvc/tclhttp1-1/tclhttp1-1/

Phil
From: tom.rmadilo on 24 Feb 2010 13:26
On Feb 24, 1:43 am, "Donal K. Fellows" <donal.k.fell...(a)manchester.ac.uk> wrote:
> On 23 Feb, 17:47, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > Right, my htclient code tokenizes the header field values as well as
> > headers. If you can't tokenize the field value, you can't really
> > distinguish between valid and invalid headers. Maybe a regexp exists
> > which can do this, right now I use an fsm to do the job. It works and
> > is guaranteed not to block.
>
> I *think* that the natural way to process HTTP headers is as a
> multi-level grammar. First you read lines until you get a proper blank
> line (i.e., the end of the headers). Then you go back and split the
> block of headers into individual header "lines" (which may be multiple
> lines; I believe continuation lines must start with a space or tab).
> Then, if desired, you parse the individual header lines (at the very
> least, you need to work out what the name of the header line is, but
> that should be trivial).

My theory is that you have to lex/parse headers pretty much like you parse Tcl source code: char-by-char. Alex believes that you can unambiguously find line-endings and then apply a regular expression, or some series of regular expressions, to validate that each header fits the basic definition.

There is no doubt that you can easily separate valid headers; the question is whether you can easily detect invalid or malicious headers.

> > This is exactly what I do to avoid blocking: I read one char at a time
> > for headers or [chan pending] chars for the body.
>
> That's quite messy. With a non-blocking channel (if you're not
> non-blocking, it's time to change!) you can just do a [gets $ch lineVar]
> when the channel is readable. That will either read a full line (up to
> whatever is set as the line separator) and return the length of line
> (minus terminator) or return -1. If it returns -1, you've either got
> an EOF or you've exhausted the bytes available without being able to
> get a line (i.e., [fblocked $ch] will return 1 at that point). If
> you've blocked, you can just go back to sleep waiting in the event
> loop for some more bytes to arrive. Or if you're being careful, you
> can use [chan pending input $ch] to see whether the data that's
> accumulated in the input buffers - which must be a single incomplete
> line because you didn't read a complete one - has exceeded some
> threshold and you're going to kill the connection for being from a
> bunch of scumbags. You probably want to limit the number of complete
> lines you read too, for identical reasons.
>
> In short, having *some* buffering and memory allocation is OK. So long
> as it doesn't get out of hand, it's easier to let Tcl do all those
> bits for you. And trying to do all the parsing in a single level of
> FSM[*] is painful.

I think you are overlooking two key issues:

1. [read $chan 1] is much more efficient than parsing a string buffer with string commands.
2. If you put the unparsed header blobs into a string, which will be very fast, you still need to lex/parse them, and unless you have a bullet-proof regular expression, you have to do this parsing by hand, using relatively slow string commands.

I have looked far and wide for any other HTTP server/client which uses a regular expression to parse a generic header into tokens.

However, it is easy to tokenize HTTP headers using the char-at-a-time method. The benefits are easy to demonstrate: the date header can use any of three different formats. You could use three different regular expressions to find a match, or you could just examine the length of the list of tokens. Each format has a different number of tokens.

Another example is the case where a field value is actually a comma separated list. But if a list item is a quoted text which contains a comma, you need to distinguish the context of the comma. My guess is that a regular expression which would work on a csv file, converting a line of comma separated values (maybe quoted and escaped) into a tcl list would be close to what is needed (except csv allows newlines within quoted values).

Also, for some strange reason the quotes around quoted text in a Set-Cookie/Cookie header are not part of the value, so they have to be removed.

The regular expression idea only works if you can divide the header block into individual headers. Many, many exploits are based upon the apparent triviality of this task.
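To make the "divide the header block into individual headers" step concrete, here is a minimal Tcl sketch of the second level of the grammar Donal describes: fold continuation lines (those starting with a space or tab) into the preceding header, then split each logical line at the first colon. The proc name `splitHeaders` is made up for illustration, and this is not the htclient code discussed in the thread. It assumes the whole block is already in memory, and it only rejects a few obviously malformed shapes; it does not validate field values.

```tcl
# Split a raw header block into a list of {name value} pairs.
# Continuation lines (leading space or tab) are folded into the
# previous header. Raises an error on a few malformed shapes.
# NOTE: splitHeaders is a hypothetical helper, not a library command.
proc splitHeaders {block} {
    set headers {}
    # Normalize CRLF to LF so [split] yields one element per line.
    foreach line [split [string map [list "\r\n" "\n"] $block] "\n"] {
        if {$line eq ""} {
            break ;# a blank line ends the header block
        }
        set first [string index $line 0]
        if {$first eq " " || $first eq "\t"} {
            # Continuation line: must follow an existing header.
            if {[llength $headers] == 0} {
                error "continuation line with no preceding header"
            }
            lassign [lindex $headers end] name value
            append value " " [string trim $line]
            lset headers end [list $name $value]
            continue
        }
        set colon [string first ":" $line]
        if {$colon < 1} {
            error "malformed header line: $line"
        }
        set name [string range $line 0 [expr {$colon - 1}]]
        if {[regexp {\s} $name]} {
            error "whitespace in header name: $name"
        }
        set value [string trim [string range $line [expr {$colon + 1}] end]]
        lappend headers [list $name $value]
    }
    return $headers
}
```

Usage: `splitHeaders "Host: example.com\r\nX-Long: first\r\n\tsecond\r\n\r\n"` returns two pairs, with the folded value `first second` for `X-Long`. Whether this step is really as trivial as it looks (bare CR, NUL bytes, oversized lines) is exactly the point in dispute above, so treat the checks here as a starting point only.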
From: Alexandre Ferrieux on 24 Feb 2010 13:27
On Feb 24, 6:11 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Alex, you are the only one suggesting this is a bug. And if you look
> closely, below you suggest it isn't a bug, just a big misunderstanding
> on my part. Right now I just call it consistent behavior.

Sorry you lost me completely. You seemed to be complaining about infinite loops, and I believed you were saying they were unavoidable by the current semantics of Tcl primitives, hence my suggestion to transform vague complaints into a bug report. Of course until this bug report is produced, *I* know Tcl has no bugs in that area ;-)

> But I'm not using [gets], so the idiom doesn't apply. Here is how
> chunk data is read:
> Please note the absence of an explicit loop. Each fileevent is handled
> separately.

Oh, the loop is just an optimization of course. But the code you show is the one accumulating chunk data, not the one for headers. Again you're shifting focus. Why?

> I know that, I was just describing the behavior. If a readable
> fileevent were generated by accident and then [chan pending] reports
> zero bytes pending, it seems like eventually there would be bytes
> available. Instead, if I do [read $chan 0] and return, the next
> fileevent, and the next, etc. gives the same result.

[read $chan 0] does nothing... is that a kind of joke?

> Of course I have no idea what the actual "readable event" was, but it
> can't be an error condition because I'm able to read from the
> channel.

Bottom line:
(1) I don't grasp what you're complaining about.
(2) I advise you to forget about [chan pending] for the time being
(3) Once you have your code working efficiently (regexp) based on simple nonblocking IO, you can worry about DoS attacks, and we'll help you inject [chan pending] at the key spot.

-Alex
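For readers following along, the "simple nonblocking IO" idiom under discussion (non-blocking [gets], with [chan pending input] as a guard against oversized lines, as Donal described it earlier in the thread) might be sketched as follows. This is an illustration under assumptions, not code from any poster: the proc names `readHeaders`, `onReadable` and `finish`, the state array keys, and both limits are invented here. It needs Tcl 8.5 or later for [chan pending].

```tcl
# Start reading header lines from chan; results land in the global
# array named by stateVar. All names and limits are hypothetical.
proc readHeaders {chan stateVar {maxLineBytes 8192} {maxLines 100}} {
    upvar #0 $stateVar state
    array set state [list lines {} status reading \
        maxLineBytes $maxLineBytes maxLines $maxLines]
    fconfigure $chan -blocking 0 -translation crlf
    fileevent $chan readable [list onReadable $chan $stateVar]
}

# One readable event handles at most one line; there is no loop.
proc onReadable {chan stateVar} {
    upvar #0 $stateVar state
    set n [gets $chan line]
    if {$n > 0} {
        # A complete, non-empty line arrived.
        lappend state(lines) $line
        if {[llength $state(lines)] > $state(maxLines)} {
            finish $chan $stateVar "too many lines"
        }
    } elseif {$n == 0} {
        # Blank line: end of the header block.
        finish $chan $stateVar done
    } elseif {[eof $chan]} {
        finish $chan $stateVar eof
    } elseif {[chan pending input $chan] > $state(maxLineBytes)} {
        # An incomplete line has grown past the limit.
        finish $chan $stateVar "line too long"
    }
    # Otherwise [fblocked $chan] is true: return to the event loop
    # and wait for more bytes.
}

# Stop listening and record why.
proc finish {chan stateVar status} {
    upvar #0 $stateVar state
    fileevent $chan readable {}
    set state(status) $status
}
```

A caller would run `readHeaders $sock ::hdr` and then `vwait ::hdr(status)`. Note that this only bounds memory use; as pointed out elsewhere in the thread, it does nothing about a peer that simply stops sending, which would need a separate timeout (for example via [after]).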
From: tom.rmadilo on 24 Feb 2010 19:01
On Feb 24, 10:27 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> Bottom line:
> (1) I don't grasp what you're complaining about.

You think I'm complaining? Earlier you said I had a "vague complaint" and that:

"You (as in tom) seemed to be complaining about infinite loops, and I believed you were saying they were unavoidable by the current semantics of Tcl primitives"

And yet I included an example of how I avoid an infinite loop, by reading one char when [chan pending] returns zero. In other words, I examine what [chan pending] returns and purposefully avoid reading zero bytes...because otherwise you get into an infinite loop. This is purely a description of the facts. I have no idea if this is a bug, and I don't care because, as you say, it makes no sense to [read $chan 0].

Another point is that this situation never comes up when reading headers, since I always read one byte. I don't consult [chan pending], and I don't consult it because I just got a fileevent which caused the callback to run. It is possible that the fileevent was triggered by something other than data becoming available in the input buffer; hopefully this will lead to an error in the callback and it will quickly exit.

The only somewhat puzzling detail of [chan pending] is that the C level API "Tcl_InputBuffered" scales from 0, where 0 means the channel isn't open for reading. The Tcl level API uses -1 to indicate the same thing. Maybe the input buffer can never contain zero bytes? In general I would think it impossible to map between these two representations.

> (2) I advise you to forget about [chan pending] for the time being

You only advise this because of your complete misunderstanding of the importance of [chan pending]. If it wasn't necessary, or useful, why in the hell did the core team approve it? Why does the underlying C API exist? If [chan pending] was not available, I would never have even attempted to write a DoS-tolerant network client/reader in Tcl. Tcl still needs timeouts on channel operations if developers ever want to guarantee robust client/server application development.

> (3) Once you have your code working efficiently (regexp) based on
> simple nonblocking IO, you can worry about DoS attacks

Alex, you are the one who is so entranced with this magical regular expression. It only exists in your mind, or maybe you just want me to believe it exists, who knows. You claim it exists. Yet there is no evidence anywhere that a single regular expression, or a two-phase, two-regular-expression test, will validate and tokenize a generic header and reject all invalid headers.

A simpler task would be to create such a regexp for a comma separated values file, or just one record of a csv. It is a close call if a csv file is harder to parse than an HTTP header (just the field-value part). My guess is that the field-value is 2-5 times more difficult.

IMHO, such a regular expression would be infinitely more valuable than an HTTP-header parser/re. Yet if you look at the available Tcl code which parses csv, you don't find anything simple. Maybe Jeff Hobbs and Andreas Kupries are as dense-headed as I am? Otherwise why write a couple of hundred lines of code, with dozens of regexps, regsubs and [string map]s? The Tcllib csv module is just waiting for your expert analysis.

> and we'll help you inject [chan pending] at the key spot.

Why not inject a regular expression? You claim it is so easy. Please prove I'm a complete idiot. Forget about the initial issue of delimiting headers. Start with a header which contains the invalid quoted string "abc\\"def" and the valid quoted string "abc\"def", now handle any combination of these two. A correct result would barf on the first "abc\\"def".
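To pin down what the two test strings above should do, here is a small char-at-a-time state machine in Tcl that splits a field value on commas outside quoted strings, honours backslash escapes inside quotes, and strips the surrounding quotes. It is a sketch only: `splitFieldValue` is an invented name, it works on an in-memory string rather than reading from a channel, and it covers quoting and commas only, not the full HTTP field-value grammar. It takes no position on whether a regular expression could do the same job.

```tcl
# Split a field value into items. States: plain, quoted, escape, closed.
# NOTE: splitFieldValue is a hypothetical helper for illustration.
proc splitFieldValue {value} {
    set items {}
    set item ""
    set state plain
    foreach c [split $value ""] {
        switch -- $state {
            plain {
                if {$c eq "\""} {
                    # A quote may only open an item, not appear mid-token.
                    if {[string trim $item] ne ""} {
                        error "quote inside unquoted token"
                    }
                    set item ""
                    set state quoted
                } elseif {$c eq ","} {
                    lappend items [string trim $item]
                    set item ""
                } else {
                    append item $c
                }
            }
            quoted {
                if {$c eq "\\"} {
                    set state escape
                } elseif {$c eq "\""} {
                    set state closed
                } else {
                    append item $c
                }
            }
            escape {
                # The character after a backslash is taken literally.
                append item $c
                set state quoted
            }
            closed {
                # After a closing quote only whitespace or a comma is legal.
                if {$c eq ","} {
                    lappend items $item
                    set item ""
                    set state plain
                } elseif {![string is space $c]} {
                    error "unexpected text after quoted string"
                }
            }
        }
    }
    if {$state in {quoted escape}} {
        error "unterminated quoted string"
    }
    if {$state eq "plain"} {
        set item [string trim $item]
    }
    lappend items $item
    return $items
}
```

With this sketch, `splitFieldValue {"abc\"def"}` yields the single item `abc"def`, while `splitFieldValue {"abc\\"def"}` raises an error: the escaped backslash ends the escape, the next quote closes the string, and the trailing `def"` is rejected.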
From: Alexandre Ferrieux on 25 Feb 2010 03:52
On Feb 25, 1:01 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> You think I'm complaining?
> [...]
> And yet I included an example of how I avoid an infinite loop,

Aw sh*t, I must be having an impedance mismatch with your prose. Forget about who's complaining. Let somebody with a Tom-decoder stacked on stdin take over this discussion.

> > (2) I advise you to forget about [chan pending] for the time being
>
> You only advise this because of your complete misunderstanding of the
> importance of [chan pending].

Re-read: I advise *you*, not anybody else... And yes, you must be right, my total misunderstanding of Tcl IO internals is what lets me participate in their maintenance with a fresh eye :)

> > (3) Once you have your code working efficiently (regexp) based on
> > simple nonblocking IO, you can worry about DoS attacks
>
> Alex, you are the one who is so entranced with this magical regular
> expression. It only exists in your mind

No, I posted it here, and at the time you even admitted learning something about the power of regexps. That's why I earnestly asked whether you were still using it or not. And now you come back after a brain-reformat of sorts, saying "really regexps suck, char-by-char hard-coded FSMs in Tcl rock". Back to square one. Don't count on me for the rest of the game.

-Alex