From: tom.rmadilo on 24 Oct 2009 21:45

Last week there was discussion of a problem with the http client which ships with Tcl. The cause of the problem (an ignored timeout) was code which places the channel into blocking mode while reading chunked-encoded transfers. Although it seems like a bad idea on first read, I think the underlying cause was the monolithic nature of the original code (before chunked handling was added).

Anyway, Alexandre Ferrieux suggested that I should contribute a new style (maybe coro), or a direct fix. I don't think a direct fix is possible, so I went with a new style. This new style does not use coroutines; although they seem interesting, I have no experience with them and I don't see the commands included in the Tcl distribution. My compromise is to require at least Tcl 8.5: I use [chan pending] and (unnecessarily) [lsearch -nocase -index].

[chan pending] allows you to see how much data is available after a fileevent. This allows you to safely read more than a single byte. Strangely, 50% of the time a [chan pending] after a fileevent returns zero, theoretically meaning that no data is available, even though one byte is guaranteed. If no bytes are read on that fileevent, the next fileevent is also followed by zero pending bytes. But if you read one byte, everything seems to work: the buffer fills up again, allowing a substantial read. (This read pattern is sketched below.)

The initial experimental version of htclient is available here:

http://www.junom.com/gitweb/gitweb.perl?p=htclient.git

This version prints debugging info (to a selected channel, default stderr) for every event. The above link includes several logs of different requests. The architecture is very different from the http package; the best way to explain it is to just try it out, look at the log files, or read the code. The only thing required to test it is to source the htclient.tcl file. The end of the file contains a url to fetch; you can change this url, or comment it out and repeat the command at the prompt.
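For illustration, here is a minimal sketch of that read pattern; the handler, buffer, and host names are illustrative, not htclient's actual code:

    # Sketch: on each readable event, check how much input is buffered.
    # When [chan pending] reports zero, read the one byte the fileevent
    # guarantees so the buffer fills again; otherwise drain what is there.
    proc onReadable {chan} {
        set n [chan pending input $chan]
        if {$n == 0} {
            append ::buf [read $chan 1]    ;# the guaranteed single byte
        } else {
            append ::buf [read $chan $n]   ;# safe: reads only buffered bytes
        }
    }

    set sock [socket www.example.com 80]
    chan configure $sock -blocking 0 -translation binary
    chan event $sock readable [list onReadable $sock]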
From: Alexandre Ferrieux on 27 Oct 2009 04:10

On Oct 25, 2:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Anyway, Alexandre Ferrieux suggested that I should contribute a new
> style (maybe coro), or a direct fix. I don't think a direct fix is
> possible, so I went with a new style.

Thank you for this. I admit I have not looked at the details, but the separation of the state machines looks cleaner than the current http package's (but then, the coverage is not the same yet, right? ;-).

Also, removing vwait from the library is a Good Idea (tm). It could be the responsibility of an extra layer to provide the blocking variety of ::http::geturl (with a vwait, of course) for compatibility, though; a sketch of such a wrapper follows this post.

A few random questions:

(1) You seem to be avoiding [gets] systematically. Why?

(2) I don't understand the necessity of [chan pending] in this code (are you targeting DoS attacks, as explained in the manpage?)

(3) You're parsing hex bytes (for chunk lengths) by hand. Any reason not to use [scan]?

(4) clientHeaders could benefit from becoming an array instead of a linearly-scanned list, especially when more headers are taken into account, don't you think?

-Alex
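A rough sketch of such a compatibility layer. Note the assumptions: that addClient returns a client id, and that the client sets a per-client done flag on completion; neither is a documented part of htclient's interface:

    # Blocking wrapper over the non-blocking client: run the event loop
    # until the request completes. ::htclient::clientDone($id) is an
    # assumed completion flag; clientContent($id) holds the accumulated
    # response body (per the discussion below).
    proc geturl_blocking {host port url} {
        set id [::htclient::addClient $host $port $url]
        vwait ::htclient::clientDone($id)
        return $::htclient::clientContent($id)
    }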
From: tom.rmadilo on 27 Oct 2009 15:04

On Oct 27, 1:10 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> Thank you for this. I admit I have not looked at the details, but the
> separation of the state machines looks cleaner than the current http
> package's (but then, the coverage is not the same yet, right ? ;-).

Absolutely right: this only does GET (actually not completely true) and doesn't have anywhere near the functionality of the current http client. I don't really think that the code which creates the request should be on the same level as the protocol. Here is the basic API:

::htclient::addClient host port url {method GET} {headers {}} {content {}} {version 1.1} {timeout 0} args

This automatically generates the host header, and if content is provided, it adds a content-length header. The version can be used to avoid chunked transfer encoding (use 1.0). So you could use POST or HEAD. POST probably works; HEAD would require some special code to realize that no content is expected, but it may work anyway, since the content-length header isn't currently used to find the end of content. There are a bunch of rules for figuring this out and I didn't care to spend any time on it.

> Also, removing vwait from the library is a Good Idea (tm). It could be
> the responsibility of an extra layer to provide the blocking variety
> of ::http::geturl (with a vwait of course) for compatibility though.

If you look at the next-to-last modification in the git repository, you will notice that I added a vwait on a per-client basis. This basically serializes requests, so I removed it. My suggestion is to use wish, or somehow make sure the event loop is already running (or do requests in batches?).

Something completely missing right now, and actually one of my original goals, is a non-blocking client with a timeout: there is no timeout. However, a fixed timeout value seems not exactly right. What might be more useful is a rate-limit monitor: if the short-term transfer rate falls below some value, consider aborting, or set a timer and check back later. Again, this doesn't need to be part of the protocol layer; there are probably a number of good ways to do this.

The current version of this code will likely become a server-testing version, and a production version would have nearly all the logging code removed. One log per event helps in code development and in understanding what is going on, but it is obviously not useful for real-world use.

> A few random questions:
>
> (1) you seem to be avoiding [gets] systematically. Why ?

The HTTP status line and headers must be parsed char by char. Many potential security issues are attributed to not correctly parsing headers before passing them on. I decided to spend the time parsing the headers into tokens/words. This doesn't completely validate the headers, each of which has a special form (who knows why), so each header will need an individual interpreter. For instance, the date header: there are three date formats accepted as valid. Parsing these different formats as a string is not pretty, but once the date is tokenized it is easy to figure out which type is in use, or whether it is invalid. To turn the tokens back into a date, a simple [format] can be used.
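For reference, the three accepted formats are the RFC 1123, RFC 850, and asctime dates (per RFC 2616). A quick illustration using [clock scan], as a stand-in for the token-based interpreter described above:

    # The three date formats an HTTP/1.1 recipient must accept; all
    # three examples denote the same instant, so all three calls print
    # the same value.
    foreach {fmt example} {
        {%a, %d %b %Y %T GMT}    {Sun, 06 Nov 1994 08:49:37 GMT}
        {%A, %d-%b-%y %T GMT}    {Sunday, 06-Nov-94 08:49:37 GMT}
        {%a %b %e %T %Y}         {Sun Nov  6 08:49:37 1994}
    } {
        puts [clock scan $example -format $fmt -gmt 1]
    }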
One security issue with headers has to do with whitespace between the field name and the colon. The standards say you have to reject headers of this form. I don't know if it is important for a client, but the code will serve as a template for a server or proxy. Also, tokenizing is required to remove "bad whitespace" and "optional whitespace" before either passing on the header or interpreting the content.

> (2) I don't understand the necessity of [chan pending] in this code
> (are you targeting DoS attacks as explained in the manpage ?)

Otherwise the [read] would block until all content is received. This could allow a DoS attack, although I'm not sure if it would keep other clients from continuing. At some point you have to stop reading content and start reading the <CR><LF> and then the chunk size<CR><LF>, so you can't just read everything at once.

> (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> not to use [scan] ?

I guess I could, wrap it in a [catch] and use [gets]? But then you have to figure out why it failed, and you get the same issue as with reading headers: it isn't safe. I also think that a production client would limit the chunk size and abort when the value gets too big. If you slowly increase the number, this is easy to check; if you scan first, I'm not sure what the possible problems are. (A sketch of this incremental check follows at the end of this post.) The main problem is that HTTP is not a line-oriented protocol, at least the current version isn't. Since you have to parse headers, reading one char at a time isn't really a problem (the read is from a buffer).

> (4) clientHeaders could benefit from becoming an array instead of a
> linearly-scanned list, especially when more headers are taken into
> account, don't you think ?

Unfortunately you have to maintain header order, and in some cases you have to combine same-named headers into a comma-separated list. But the basic reason is that the current format allows the developer to use [lsearch -inline -all -nocase] to search for all headers of a given name. I provided two examples to demonstrate how to search for headers.

Also, note that clientHeaders is itself an array: each client's headers are stored in clientHeaders($client). This gives additional code easy access to the headers if you know the clientID. There is another array, clientContentLength, with the same keys. As new data is appended to clientContent, the length is incremented. It should be easy to set up a trace on this variable and track progress.

Try running htclient with multiple requests before you do a vwait (or use wish). Tcl's very egalitarian event loop gives every client a turn, although the order gets scrambled over the history of the responses; but this just indicates that each client is independent of the others (I think?).
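A sketch of the incremental chunk-size check described above; the 1 MB cap, the helper name, and the error handling are illustrative choices, not htclient's code:

    # Hypothetical helper: accumulate the hex chunk size one character
    # at a time and abort as soon as the running value exceeds a cap,
    # instead of [scan]ning a whole (possibly hostile) line at once.
    # Assumes the digits are already buffered (cf. [chan pending]).
    proc readChunkSize {chan {maxChunk 1048576}} {
        set size 0
        while {[set c [read $chan 1]] ne "\r"} {
            if {![string is xdigit -strict $c]} {
                error "invalid chunk-size character: $c"
            }
            set size [expr {$size * 16 + "0x$c"}]
            if {$size > $maxChunk} {
                error "chunk size too large: $size"
            }
        }
        read $chan 1   ;# consume the <LF> that follows the <CR>
        return $size
    }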
From: tom.rmadilo on 27 Oct 2009 15:18

On Oct 27, 1:10 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> (4) clientHeaders could benefit from becoming an array instead of a
> linearly-scanned list, especially when more headers are taken into
> account, don't you think ?

Shoot, almost forgot: the set-cookie header cannot be combined with the comma operator; you have to maintain two separate keys with the same name. Also, the exact header field search is [lsearch -inline -all -nocase -index ...], which should make it easy to write specific header handling code.
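For example (the list shape here, with the field name at element 0, is an assumption inferred from the description, not taken from htclient's source):

    # Three parsed headers; each entry is a list whose first element is
    # the field name. The two set-cookie entries must remain separate.
    set headers {
        {content-type text/html}
        {set-cookie a=1}
        {set-cookie b=2}
    }
    # All set-cookie headers, matching the field name case-insensitively:
    set cookies [lsearch -inline -all -nocase -index 0 $headers set-cookie]
    # -> {set-cookie a=1} {set-cookie b=2}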
From: Alexandre Ferrieux on 30 Oct 2009 13:21
On Oct 27, 8:04 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > (1) you seem to be avoiding [gets] systematically. Why ?
>
> HTTP status line and headers must be parsed char by char. Many
> potential security issues are attributed to not correctly parsing
> headers before passing them on. [...]
> I don't know if it is important for a client, but the
> code will serve as a template for a server or proxy.

Ah, OK. Let me just point out that for a generic enabler like the HTTP client library in Tcl, the dominant pattern is very far from this corner case. I'd risk a 99.9%-0.1% balance between simple client and proxy. I'm saying this because this decision to work hard on the extreme case has a cost in code size and readability, which is too bad since you came here to improve on the existing code ;-)

> > (2) I don't understand the necessity of [chan pending] in this code
> > (are you targeting DoS attacks as explained in the manpage ?)
>
> Otherwise the [read] would block until all content is received.

Uh? You're already using nonblocking reads and fileevents. So?

> > (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> > not to use [scan] ?
>
> I guess I could, wrap it in a [catch] and use [gets]? You then have to
> figure out why it failed. You also get the same issue with reading
> headers. It isn't safe. I also think that a production client would
> limit the chunk size and abort when the value gets too big. If you
> slowly increase the number, this is easy to check, if you scan first
> I'm not sure what the possible problems are.

So really you're doing all this in case a _server_ would attack the client? That's your right, of course, but that dimension is orthogonal to what you set out to fix in the first place (the readability and modularity of the generic HTTP client library).

> The main problem is that HTTP is not a line oriented protocol, at
> least the current version isn't. Since you have to parse headers,
> reading one char at a time isn't really a problem (the read is from a
> buffer).

At some point, heavy char-by-char processing at script level will cost you more than what the buffered input saves in syscalls...

> > (4) clientHeaders could benefit from becoming an array instead of a
> > linearly-scanned list, especially when more headers are taken into
> > account, don't you think ?
>
> Unfortunately you have to maintain header order, and in some cases you
> have to combine same named headers into a comma separated list.

Header order, yes, but only within the multiple values of a given header, right? This is easily done with:

lappend arr($header) $value

> But
> the basic reason is the current format allows the developer to use
> [lsearch -inline -all -nocase] to search for all headers of a given
> name.

Normalize on writing and you get the same:

lappend arr([string toupper $header]) $value

-Alex
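In code, the normalize-on-write scheme suggested here amounts to this sketch (the array and field names are illustrative; set-cookie would still need the special handling Tom mentions):

    # Normalize field names on write; a plain array lookup then replaces
    # the linear [lsearch] scan, and [lappend] preserves the order of
    # multiple values within each field.
    array set hdr {}
    foreach {name value} {
        Content-Type text/html
        Set-Cookie   a=1
        set-cookie   b=2
    } {
        lappend hdr([string toupper $name]) $value
    }
    puts $hdr(SET-COOKIE)   ;# -> a=1 b=2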