From: tom.rmadilo on 30 Oct 2009 15:07

On Oct 30, 10:21 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Oct 27, 8:04 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > > (1) you seem to be avoiding [gets] systematically. Why ?
>
> > HTTP status line and headers must be parsed char by char. Many
> > potential security issues are attributed to not correctly parsing
> > headers before passing them on. [...] I don't know if it is
> > important for a client, but the code will serve as a template for a
> > server or proxy.
>
> Ah OK. Let me just point out that for a generic enabler like the HTTP
> client library in Tcl, the dominant pattern is very far from this
> corner case. I'd risk a 99.9%-0.1% balance between simple client and
> proxy. I'm saying this because this decision to work hard on the
> extreme case has a cost on code size and readability, which is too bad
> since you came here to improve on the existing code ;-)

Look at the code and see if you think it is more readable. Code size
is somewhat higher than I would like so that I can avoid maintaining
too much state. Handling the line-ending <CR><LF> sequences adds to
the length quite a bit; if a single <LF> were used, this would not be
the case. More length is added because of the way the standards are
written: depending on the parsing context, each char value falls into
a particular "class".

Just in case you are interested, I ran a few tests with
::http::geturl, compared to htclient. Grabbing a single file
(http://rmadilo.com/files/, 17k bytes), htclient was about 50% faster
(67k microseconds vs. 97k for geturl). However, when grabbing 10+
copies of the above URL, htclient is about twice as fast (48k
microseconds average). Approximately half of the 48k us is spent just
opening the connection (htOpen). Assuming that geturl spends the same
time opening (prior to data prep or sending anything), htclient is
about three times as fast for everything after the open (all prep,
sending and receiving data). The script is part of the distro (now
available in zip format) in the bin/test-http.tcl file. Maybe geturl
would perform better with Tk or wish, because in Tcl it runs in
serial mode, so I have no way to compare one-on-one.

Also, just in case, my experimental code already solves the
non-blocking chunked transfer encoding problem. The additional code
for that case was not very much.

> > > (2) I don't understand the necessity of [chan pending] in this code
> > > (are you targeting DoS attacks as explained in the manpage ?)
>
> > Otherwise the [read] would block until all content is received.
>
> Uh ? You're already using nonblocking reads and fileevents. So ?

How do you do chunked reads and signal when to stop, remove the
<cr><lf>, read the next chunk size, remove the <cr><lf>, and start
reading again?

Anyway, using [chan pending] does allow you to easily monitor the
progress of a download. It also, imho, simplifies code since you
always know the result of the last read. Although I'm still confused
as to why [chan pending] returns zero bytes half the time after a
read event.

> > > (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> > > not to use [scan] ?
>
> > I guess I could, wrap it in a [catch] and use [gets]? You then have to
> > figure out why it failed. You also get the same issue with reading
> > headers. It isn't safe. I also think that a production client would
> > limit the chunk size and abort when the value gets too big. If you
> > slowly increase the number, this is easy to check; if you scan first
> > I'm not sure what the possible problems are.
>
> So really you're doing all this in case a _server_ would attack the
> client ? That's your right of course, but that dimension is orthogonal
> to what you set out to fix in the first place (readability and
> modularity of the generic HTTP client library).
>
> > The main problem is that HTTP is not a line oriented protocol, at
> > least the current version isn't. Since you have to parse headers,
> > reading one char at a time isn't really a problem (the read is from a
> > buffer).
>
> At some point heavy char-by-char processing at script level will cost
> you more than what the buffered input saves in syscalls...

Like I said above, the standard requires this. You have to
"understand" the headers, which first requires parsing, removing
whitespace, quotes, comments, etc. Right now all I am doing is
parsing the headers into tokens, which will be much easier to use in
case someone is interested.

> > > (4) clientHeaders could benefit from becoming an array instead of a
> > > linearly-scanned list, especially when more headers are taken into
> > > account, don't you think ?
>
> > Unfortunately you have to maintain header order, and in some cases you
> > have to combine same-named headers into a comma separated list.
>
> Header order, yes, but only within the multiple values of a given
> header, right ?
> This is easily done with
>
>   lappend arr($header) $value
>
> > But the basic reason is the current format allows the developer to use
> > [lsearch -inline -all -nocase] to search for all headers of a given
> > name.
>
> Normalize on writing and you get the same:
>
>   lappend arr([string toupper $header]) $value

So you think this is faster? It isn't, for a few reasons: one is that
you have to know in advance which headers allow comma separated
lists. I'm not the one who wrote these crazy rules, but that is the
way they are. Also, don't forget that the set-cookie header (in
practice) doesn't allow this combining into a list. I also generally
think it is a bad idea to modify input; it is easier to leave stuff
alone.

Each client connection has exactly one variable to store headers, and
these are stored in an array. You suggest an array for each client
for headers, something which isn't easily passed around. Notice that
the htclient code makes zero use of upvar and adds zero pollution to
the global namespace.

My thought is that the only real reason to avoid per-char parsing, or
to use an array instead of a list for headers, is that it is done in
Tcl and not C. In C it would be obviously faster to use the
char-by-char algorithm: each char is processed at most once, and the
processing involves simple comparisons of char values or char class
values. I simply disagree that the algorithm needs to take into
account the relatively slower speed of the Tcl scripting language. I
also think that actually getting the parsing right requires this
method. But right now htclient is faster than ::http::geturl and
already handles chunked transfer encoding.
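[Editor's note: for readers following along, here is a rough sketch of
the [chan pending] pattern being described above. It is not the actual
htclient code; onData and parseChunk are names invented for this
illustration, and error handling is omitted.]

    # Consume only what is already buffered, so [read] returns
    # immediately instead of waiting on the network.
    proc onData {sock} {
        set n [chan pending input $sock]
        if {$n > 0} {
            # read exactly the buffered bytes, feed the state machine
            parseChunk $sock [read $sock $n]
        } elseif {[eof $sock]} {
            close $sock
        } else {
            # zero pending right after a readable event: reading a
            # single byte forces Tcl to refill the channel buffer
            parseChunk $sock [read $sock 1]
        }
    }
    chan configure $sock -blocking 0 -translation binary
    chan event $sock readable [list onData $sock]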
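[Editor's note: the ordered-list header format defended in point (4)
might look like the following sketch. The actual htclient storage
layout may differ; this only illustrates why a flat list preserves
order and keeps repeated set-cookie headers separate.]

    set clientHeaders {}
    lappend clientHeaders [list Set-Cookie "a=1; Path=/"]
    lappend clientHeaders [list Via "1.1 proxy1"]
    lappend clientHeaders [list Set-Cookie "b=2; Path=/"]

    # order is preserved, and every value of a given header is found
    # case-insensitively without modifying the input:
    lsearch -inline -all -nocase -index 0 $clientHeaders set-cookie
    # => {Set-Cookie {a=1; Path=/}} {Set-Cookie {b=2; Path=/}}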
From: Alexandre Ferrieux on 31 Oct 2009 12:44

On Oct 30, 8:07 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > > > (2) I don't understand the necessity of [chan pending] in this code
> > > > (are you targeting DoS attacks as explained in the manpage ?)
>
> > > Otherwise the [read] would block until all content is received.
>
> > Uh ? You're already using nonblocking reads and fileevents. So ?
>
> How do you do chunked reads and signal when to stop, remove the
> <cr><lf>, read the next chunk size, remove the <cr><lf> and start
> reading again?

With a state machine, fileevents, and nonblocking gets.

-Alex
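[Editor's note: a minimal sketch of what such a [gets]-driven chunk
decoder might look like, to make the suggestion concrete. This is not
code from either poster; the proc and variable names are invented,
and EOF/error handling is omitted.]

    proc onReadable {sock} {
        global state remain body
        if {$state eq "size"} {
            # nonblocking [gets] returns -1 until a full line is buffered
            if {[gets $sock line] < 0} {return}
            scan $line %x remain
            if {$remain == 0} {set state done; return}
            set state data
        }
        if {$state eq "data"} {
            set data [read $sock $remain]
            append body $data
            incr remain -[string length $data]
            if {$remain == 0} {set state crlf}
        }
        if {$state eq "crlf"} {
            # consume the empty line that terminates the chunk; any
            # input still buffered re-triggers the readable event
            if {[gets $sock line] < 0} {return}
            set state size
        }
    }
    set state size
    chan configure $sock -blocking 0 -translation binary
    chan event $sock readable [list onReadable $sock]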
From: tom.rmadilo on 31 Oct 2009 15:12

On Oct 31, 9:44 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Oct 30, 8:07 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > How do you do chunked reads and signal when to stop, remove the
> > <cr><lf>, read the next chunk size, remove the <cr><lf> and start
> > reading again?
>
> With a state machine, fileevents, and nonblocking gets.

HTTP data is binary, or maybe it is better to say it is opaque. The
chunked transfer encoding is specifically byte oriented, with a
well-defined structure. Nothing in the standards I have read
indicates that you should treat the data as line oriented. You could
argue that the headers are line oriented, but you still have to parse
them char-by-char, so the only possible "benefit" of using [gets]
would be to complicate the read/parse cycle.

I've also pointed out several times that since [gets] can fail, you
have to handle error conditions: again, this just adds new,
unnecessary states to the machine. One requirement of a state machine
is well-defined states and transition conditions. For instance, if
you do a [gets] to read the length of a chunk, you have these
potential error conditions:

1. The value contains non-hex chars. Example: "axbr". This will
   require you to catch your [scan] and recover (abort).

2. The [gets] could fail to find <CR><LF> before the buffer runs dry.
   This requires you to use [fblocked] for every call to [gets]
   before you can use the data. If [fblocked], you must loop back
   until you have your <CR><LF>. But because you didn't get all the
   data the first time, you have to establish a temporary buffer and
   append to that before your scan.

3. The scanned value could exceed the maximum integer size for the
   system. How will you detect this condition?

One basic premise of the state machines in this code is that they
identify all possible valid data in each state. If invalid data is
found, the state is set to an error state, with no need to use
[catch] and back up and try to identify what went wrong. I know ahead
of time that something will fail and stop before the failure.

But I noticed something more interesting about the buffered channels:
Tcl doesn't refill the input buffer until it is empty. You should
look at the log files for the chunked transfers. As the code performs
repeated [read]s for less than what is in the buffer, the buffer
continues to drain until it reaches zero. Then guess what? It stays
at zero unless you read one byte; then the buffer fills up again and
the process repeats. What that means is that the client could spend a
lot of time waiting for the TCP packet to make the round trip to the
foreign server if you try to read more bytes than are in the buffer.
In other words, the application may not be blocking exactly, but a
long multi-buffer read will delay other clients from reading their
buffers which already have data. But if it doesn't delay other
clients, then the only interpretation is that the read returned after
depleting the buffer, which is exactly what the code is doing now.

It shouldn't take you too long to write an alternative chunked state
machine; just give it the same name as the current machine and make
sure it starts with the same state.
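[Editor's note: to make error conditions 1 and 3 concrete, here is one
way a chunk-size line could be vetted before [scan]. This is an
illustrative sketch only; the proc name and the 7-hex-digit cap
(roughly a 268 MB chunk) are arbitrary choices, not htclient code.]

    proc chunkSize {line} {
        # [gets] with binary translation leaves the <CR> on the line
        set line [string trimright $line \r]
        # chunk-size is hex digits, optionally followed by ";extension"
        if {![regexp {^([0-9A-Fa-f]+)(;.*)?$} $line -> hex]} {
            return -code error "invalid chunk-size line: $line"
        }
        # bound the size up front instead of catching overflow later
        if {[string length $hex] > 7} {
            return -code error "chunk size too large"
        }
        scan $hex %x size
        return $size
    }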
From: mcccol on 31 Oct 2009 19:13

On Oct 25, 11:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Last week there was discussion of a problem with the http client which
> ships with Tcl. The cause of the problem (an ignored timeout) was code
> which places the channel into blocking mode while reading chunked
> encoded transfers. Although it seems like a bad idea on first read, I
> think the overall cause was the monolithic nature of the original code
> (before chunked handling was added).
>
> Anyway, Alexandre Ferrieux suggested that I should contribute a new
> style (maybe coro), or a direct fix. I don't think a direct fix is
> possible, so I went with a new style.

You might want to check out the following client, which is coro based:

http://wub.googlecode.com/svn-history/r1798/trunk/Client/HTTP.tcl

Colin.
From: tom.rmadilo on 31 Oct 2009 20:03
On Oct 31, 4:13 pm, mcccol <mcc...(a)gmail.com> wrote:
> On Oct 25, 11:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > Anyway, Alexandre Ferrieux suggested that I should contribute a new
> > style (maybe coro), or a direct fix. I don't think a direct fix is
> > possible, so I went with a new style.
>
> You might want to check out the following client, which is coro
> based:
>
> http://wub.googlecode.com/svn-history/r1798/trunk/Client/HTTP.tcl

I'm sure Alexandre will be interested. I'm simply playing around with
a non-blocking, fully event-driven HTTP client. The above client
requires WUB and other things which are not part of Tcl. If I were to
look outside of Tcl, I'd probably just recommend AOLserver's HTTP
client. Damn, I could have done that weeks ago.

Near the end of that file, something suspiciously dangerous:

    http://wub.googlecode.com/svn/trunk/Client/HTTP.tcl {set ::source} justcontent 1 ;# fetch the latest HTTP.tcl
    ....
    vwait ::source
    set source [subst -nocommands -novariables $source]

(note "http://wub.googlecode.com/svn/trunk/Client/HTTP.tcl" is a
command)

If $source contains something like this: $a([rm -rf /]), you're kinda
screwed.
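[Editor's note: if fetched text must be evaluated at all, a safe
interpreter is the usual Tcl containment idiom. This is a generic
sketch, not a patch to the file quoted above: a safe interp has
dangerous commands such as exec and file removed, so hostile input
fails rather than damaging the host.]

    set i [interp create -safe]
    if {[catch {interp eval $i $source} result]} {
        puts stderr "evaluation refused: $result"
    }
    interp delete $i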