From: tom.rmadilo on 30 Oct 2009 15:07

On Oct 30, 10:21 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Oct 27, 8:04 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > > (1) you seem to be avoiding [gets] systematically. Why ?
>
> > HTTP status line and headers must be parsed char by char. Many
> > potential security issues are attributed to not correctly parsing
> > headers before passing them on. [...] I don't know if it is
> > important for a client, but the code will serve as a template for a
> > server or proxy.
>
> Ah OK. Let me just point out that for a generic enabler like the HTTP
> client library in Tcl, the dominant pattern is very far from this
> corner case. I'd risk a 99.9%-0.1% balance between simple client and
> proxy. I'm saying this because this decision to work hard on the
> extreme case has a cost on code size and readability, which is too bad
> since you came here to improve on the existing code ;-)

Look at the code and see if you think it is more readable. Code size
is somewhat higher than I would like so that I can avoid maintaining
too much state. Handling the line-ending <CR><LF> sequences adds to
the length quite a bit; if a single <LF> were used, this would not be
the case. More length is added because of the way the standards are
written: depending on the parsing context, each char value falls into
a particular "class".

Just in case you are interested, I ran a few tests with
::http::geturl, compared to htclient. Grabbing a single file
(http://rmadilo.com/files/, 17k bytes), htclient was about 50% faster
(67k microseconds vs. 97k for geturl). However, when grabbing 10+
copies of the above URL, htclient is about twice as fast (48k
microseconds average). Approximately half of the 48k us is spent just
opening the connection (htOpen). Assuming that geturl spends the same
time opening (prior to data prep or sending anything), htclient is
about three times as fast for everything after the open (all prep,
sending and receiving data). The script is part of the distro (now
available in zip format) in the bin/test-http.tcl file. Maybe geturl
would perform better with Tk or wish, because in Tcl it runs in
serial mode, so I have no way to compare one-on-one.

Also, just in case, my experimental code already solves the
non-blocking chunked transfer encoding problem. The additional code
for that case was not very much.

> > > (2) I don't understand the necessity of [chan pending] in this code
> > > (are you targeting DoS attacks as explained in the manpage ?)
>
> > Otherwise the [read] would block until all content is received.
>
> Uh ? You're already using nonblocking reads and fileevents. So ?

How do you do chunked reads and signal when to stop, remove the
<cr><lf>, read the next chunk size, remove the <cr><lf>, and start
reading again?

Anyway, using [chan pending] does allow you to easily monitor the
progress of a download. It also, imho, simplifies code since you
always know the result of the last read. Although I'm still confused
as to why [chan pending] returns zero bytes half the time after a
read event.

> > > (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> > > not to use [scan] ?
>
> > I guess I could, wrap it in a [catch] and use [gets]? You then have to
> > figure out why it failed. You also get the same issue with reading
> > headers. It isn't safe. I also think that a production client would
> > limit the chunk size and abort when the value gets too big. If you
> > slowly increase the number, this is easy to check; if you scan first
> > I'm not sure what the possible problems are.
>
> So really you're doing all this in case a _server_ would attack the
> client ? That's your right of course, but that dimension is orthogonal
> to what you set out to fix in the first place (readability and
> modularity of the generic HTTP client library).
>
> > The main problem is that HTTP is not a line oriented protocol, at
> > least the current version isn't. Since you have to parse headers,
> > reading one char at a time isn't really a problem (the read is from a
> > buffer).
>
> At some point heavy char-by-char processing at script level will cost
> you more than what the buffered input saves in syscalls...

Like I said above, the standard requires this. You have to
"understand" the headers, which first requires parsing, removing
whitespace, quotes, comments, etc. Right now all I am doing is
parsing the headers into tokens, which will be much easier to use in
case someone is interested.

> > > (4) clientHeaders could benefit from becoming an array instead of a
> > > linearly-scanned list, especially when more headers are taken into
> > > account, don't you think ?
>
> > Unfortunately you have to maintain header order, and in some cases you
> > have to combine same-named headers into a comma separated list.
>
> Header order, yes, but only within the multiple values of a given
> header, right ?
> This is easily done with
>
>   lappend arr($header) $value
>
> > But the basic reason is the current format allows the developer to use
> > [lsearch -inline -all -nocase] to search for all headers of a given
> > name.
>
> Normalize on writing and you get the same:
>
>   lappend arr([string toupper $header]) $value

So you think this is faster? It isn't, for a few reasons: one is that
you have to know in advance which headers allow comma separated
lists. I'm not the one who wrote these crazy rules, but that is the
way they are. Also, don't forget that the set-cookie header (in
practice) doesn't allow this combining into a list. I also generally
think it is a bad idea to modify input; it is easier to leave stuff
alone.

Each client connection has exactly one variable to store headers, and
these are stored in an array. You suggest an array for each client
for headers, something which isn't easily passed around. Notice that
the htclient code makes zero use of upvar and adds zero pollution to
the global namespace.

My thought is that the only real reason to avoid per-char parsing, or
to use an array instead of a list for headers, is that it is done in
Tcl and not C. In C it would be obviously faster to use the
char-by-char algorithm: each char is processed at most once, and the
processing involves simple comparisons of char values or char class
values. I simply disagree that the algorithm needs to take into
account the relatively slower speed of the Tcl scripting language. I
also think that actually getting the parsing right requires this
method. But right now htclient is faster than ::http::geturl and
already handles chunked transfer encoding.
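[Editor's note: for readers following along, here is a rough sketch of
the [chan pending] pattern being described above. It is not the actual
htclient code; onData and parseChunk are names invented for this
illustration, and error handling is omitted.]

    # Consume only what is already buffered, so [read] returns
    # immediately instead of waiting on the network.
    proc onData {sock} {
        set n [chan pending input $sock]
        if {$n > 0} {
            # read exactly the buffered bytes, feed the state machine
            parseChunk $sock [read $sock $n]
        } elseif {[eof $sock]} {
            close $sock
        } else {
            # zero pending right after a readable event: reading a
            # single byte forces Tcl to refill the channel buffer
            parseChunk $sock [read $sock 1]
        }
    }
    chan configure $sock -blocking 0 -translation binary
    chan event $sock readable [list onData $sock]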
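[Editor's note: the ordered-list header format defended in point (4)
might look like the following sketch. The actual htclient storage
layout may differ; this only illustrates why a flat list preserves
order and keeps repeated set-cookie headers separate.]

    set clientHeaders {}
    lappend clientHeaders [list Set-Cookie "a=1; Path=/"]
    lappend clientHeaders [list Via "1.1 proxy1"]
    lappend clientHeaders [list Set-Cookie "b=2; Path=/"]

    # order is preserved, and every value of a given header is found
    # case-insensitively without modifying the input:
    lsearch -inline -all -nocase -index 0 $clientHeaders set-cookie
    # => {Set-Cookie {a=1; Path=/}} {Set-Cookie {b=2; Path=/}}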
From: Alexandre Ferrieux on 31 Oct 2009 12:44

On Oct 30, 8:07 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > > > (2) I don't understand the necessity of [chan pending] in this code
> > > > (are you targeting DoS attacks as explained in the manpage ?)
>
> > > Otherwise the [read] would block until all content is received.
>
> > Uh ? You're already using nonblocking reads and fileevents. So ?
>
> How do you do chunked reads and signal when to stop, remove the
> <cr><lf>, read the next chunk size, remove the <cr><lf> and start
> reading again?

With a state machine, fileevents, and nonblocking gets.

-Alex
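[Editor's note: a minimal sketch of what such a [gets]-driven chunk
decoder might look like, to make the suggestion concrete. This is not
code from either poster; the proc and variable names are invented,
and EOF/error handling is omitted.]

    proc onReadable {sock} {
        global state remain body
        if {$state eq "size"} {
            # nonblocking [gets] returns -1 until a full line is buffered
            if {[gets $sock line] < 0} {return}
            scan $line %x remain
            if {$remain == 0} {set state done; return}
            set state data
        }
        if {$state eq "data"} {
            set data [read $sock $remain]
            append body $data
            incr remain -[string length $data]
            if {$remain == 0} {set state crlf}
        }
        if {$state eq "crlf"} {
            # consume the empty line that terminates the chunk; any
            # input still buffered re-triggers the readable event
            if {[gets $sock line] < 0} {return}
            set state size
        }
    }
    set state size
    chan configure $sock -blocking 0 -translation binary
    chan event $sock readable [list onReadable $sock]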
From: tom.rmadilo on 31 Oct 2009 15:12

On Oct 31, 9:44 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Oct 30, 8:07 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > How do you do chunked reads and signal when to stop, remove the
> > <cr><lf>, read the next chunk size, remove the <cr><lf> and start
> > reading again?
>
> With a state machine, fileevents, and nonblocking gets.

HTTP data is binary, or maybe it is better to say it is opaque. The
chunked transfer encoding is specifically byte oriented, with a
well-defined structure. Nothing in the standards I have read
indicates that you should treat the data as line oriented. You could
argue that the headers are line oriented, but you still have to parse
them char-by-char, so the only possible "benefit" of using [gets]
would be to complicate the read/parse cycle.

I've also pointed out several times that since [gets] can fail, you
have to handle error conditions: again, this just adds new,
unnecessary states to the machine. One requirement of a state machine
is well-defined states and transition conditions. For instance, if
you do a [gets] to read the length of a chunk, you have these
potential error conditions:

1. The value contains non-hex chars. Example: "axbr". This will
   require you to catch your [scan] and recover (abort).

2. The [gets] could fail to find <CR><LF> before the buffer runs dry.
   This requires you to use [fblocked] for every call to [gets]
   before you can use the data. If [fblocked], you must loop back
   until you have your <CR><LF>. But because you didn't get all the
   data the first time, you have to establish a temporary buffer and
   append to that before your scan.

3. The scanned value could exceed the maximum integer size for the
   system. How will you detect this condition?

One basic premise of the state machines in this code is that they
identify all possible valid data in each state. If invalid data is
found, the state is set to an error state, with no need to use
[catch] and back up and try to identify what went wrong. I know ahead
of time that something will fail and stop before the failure.

But I noticed something more interesting about the buffered channels:
Tcl doesn't refill the input buffer until it is empty. You should
look at the log files for the chunked transfers. As the code performs
repeated [read]s for less than what is in the buffer, the buffer
continues to drain until it reaches zero. Then guess what? It stays
at zero unless you read one byte; then the buffer fills up again and
the process repeats. What that means is that the client could spend a
lot of time waiting for the TCP packet to make the round trip to the
foreign server if you try to read more bytes than are in the buffer.
In other words, the application may not be blocking exactly, but a
long multi-buffer read will delay other clients from reading their
buffers which already have data. But if it doesn't delay other
clients, then the only interpretation is that the read returned after
depleting the buffer, which is exactly what the code is doing now.

It shouldn't take you too long to write an alternative chunked state
machine; just give it the same name as the current machine and make
sure it starts with the same state.
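[Editor's note: to make error conditions 1 and 3 concrete, here is one
way a chunk-size line could be vetted before [scan]. This is an
illustrative sketch only; the proc name and the 7-hex-digit cap
(roughly a 268 MB chunk) are arbitrary choices, not htclient code.]

    proc chunkSize {line} {
        # [gets] with binary translation leaves the <CR> on the line
        set line [string trimright $line \r]
        # chunk-size is hex digits, optionally followed by ";extension"
        if {![regexp {^([0-9A-Fa-f]+)(;.*)?$} $line -> hex]} {
            return -code error "invalid chunk-size line: $line"
        }
        # bound the size up front instead of catching overflow later
        if {[string length $hex] > 7} {
            return -code error "chunk size too large"
        }
        scan $hex %x size
        return $size
    }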
From: mcccol on 31 Oct 2009 19:13

On Oct 25, 11:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Last week there was discussion of a problem with the http client which
> ships with Tcl. The cause of the problem (an ignored timeout) was code
> which places the channel into blocking mode while reading chunked
> encoded transfers. Although it seems like a bad idea on first read, I
> think the overall cause was the monolithic nature of the original code
> (before chunked handling was added).
>
> Anyway, Alexandre Ferrieux suggested that I should contribute a new
> style (maybe coro), or a direct fix. I don't think a direct fix is
> possible, so I went with a new style.

You might want to check out the following client, which is coro based:

http://wub.googlecode.com/svn-history/r1798/trunk/Client/HTTP.tcl

Colin.
From: tom.rmadilo on 31 Oct 2009 20:03
On Oct 31, 4:13 pm, mcccol <mcc...(a)gmail.com> wrote:
> On Oct 25, 11:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> > Anyway, Alexandre Ferrieux suggested that I should contribute a new
> > style (maybe coro), or a direct fix. I don't think a direct fix is
> > possible, so I went with a new style.
>
> You might want to check out the following client, which is coro
> based:
>
> http://wub.googlecode.com/svn-history/r1798/trunk/Client/HTTP.tcl

I'm sure Alexandre will be interested. I'm simply playing around with
a non-blocking, fully event-driven HTTP client. The above client
requires WUB and other things which are not part of Tcl. If I were to
look outside of Tcl, I'd probably just recommend AOLserver's HTTP
client. Damn, I could have done that weeks ago.

Near the end of that file, something suspiciously dangerous:

    http://wub.googlecode.com/svn/trunk/Client/HTTP.tcl {set ::source} justcontent 1 ;# fetch the latest HTTP.tcl
    ....
    vwait ::source
    set source [subst -nocommands -novariables $source]

(note "http://wub.googlecode.com/svn/trunk/Client/HTTP.tcl" is a
command)

If $source contains something like this: $a([rm -rf /]), you're kinda
screwed.
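[Editor's note: if fetched text must be evaluated at all, a safe
interpreter is the usual Tcl containment idiom. This is a generic
sketch, not a patch to the file quoted above: a safe interp has
dangerous commands such as exec and file removed, so hostile input
fails rather than damaging the host.]

    set i [interp create -safe]
    if {[catch {interp eval $i $source} result]} {
        puts stderr "evaluation refused: $result"
    }
    interp delete $i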