Line oriented protocols vs. [gets] [TCL]

From: tom.rmadilo on 2 Jul 2010 17:06

On Jul 1, 4:45 pm, "Donal K. Fellows"
<donal.k.fell...(a)manchester.ac.uk> wrote:
> On 01/07/2010 20:39, tom.rmadilo wrote:
>
> > At a minimum, Tcl's I/O API lacks two features: timed wait on blocking
> > channels
>
> Too bad the OS's own API doesn't allow for that (except by sending a
> signal to interrupt, which is a *very* crude method). You need to use a
> non-blocking channel and some of the other facilities that you *do* have.

Umm, select() has a timeout. Some platforms offer an ioctl option of
the number of bytes available in the kernel (add that to the buffer
bytes and you have available binary bytes which can be read without
blocking).

> > and max byte/char reads on any channel (allowing single call
> > protection against overflow).
>
> You need [chan pending] for that, added in 8.5. That lets you see how
> much is currently buffered inside Tcl.

Unfortunately [chan pending] only returns the number of bytes
available inside Tcl, like you say, but a read event is caused by data
arriving outside of Tcl, but not yet in the Tcl buffers. This results
in [chan pending] returning zero bytes 50% of the time. I just read
one byte, since this is guaranteed to be available (in -translation
binary).

> Combined with non-blocking
> channels and [after] events, that lets you do safe reading of lines with
> [gets]. The code to do it is more than I'm willing to write at around
> midnight. :-)

Wow! Does anyone have time to enlighten the Tcl world? This is
something which should have been done with the introduction of the
code (in designing the code), not as an answer to an idiot posting on
comp.lang.tcl.

> > Plus the ability to do both: wait for n bytes, on timeout return the
> > number of bytes received (plus some pointer to the bytes). Note that
> > when a channel becomes readable, you can read at least one byte.
>
> Actually, when a channel becomes available you know that a [read] of one
> byte will not block but not that a byte is available; a closed channel
> is the other main source of such events. When the channel is
> non-blocking, you know that you'll always only get bytes or characters
> from the data that is available (which might or might not involve a call
> to the OS, depending on what is actually buffered).

In -translation binary mode, one byte is available and can be read. Of
course, the read event could instead signal a closed channel or some
error rather than a byte.

> > But how many bytes can you read without blocking for an additional
> > network I/O operation? Also, Tcl includes channel errors as readable
> > events, so you have to check for that as well.
>
> Actually, it all works rather well (especially if you use 8.6's
> coroutines to hide the details, cf. http://wiki.tcl.tk/22231). You can
> cap the amount that you buffer in non-blocking mode (i.e., to no more
> than a fixed amount more than some limit you can decide) and you can
> handle timeouts any way you want.

Right, not byteing on that one. I hope nobody thinks coroutines have
solved any I/O issue. The code referenced may be an example of
coroutines, but it does not solve or extend the API to solve existing
I/O issues.

> > Also, there is a strange combination of using bytes vs. chars inputs
> > in various Tcl API. I can't figure out how you could write a valid
> > program which seeks a UTF-8 file (at the Tcl script level).
>
> [seek] and [tell] always work with byte addresses, but they *are* aware
> of what's in Tcl's buffers. If you're not going to the start or the end
> of the file, you need to get to the point where you want to remember and
> use [tell] to remember it so you can [seek] back there again. It tends
> to be fairly rare that they're used in text files; they're just not that
> useful with variable length records.

Remember? At least with UTF-8 you could seek forward or backward to a
char boundary; most encodings don't have this synchronization
potential. But you have to seek by bytes.
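
[Editor's sketch] The resynchronization property mentioned above can be illustrated at the script level. This is only a sketch under stated assumptions: the proc name nextCharBoundary is hypothetical, the channel must already be in binary mode, and the file is assumed to contain valid UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```tcl
# Hypothetical helper (not part of Tcl): starting at an arbitrary byte
# offset, move forward to the next UTF-8 character boundary and return
# that byte position. The channel must be opened in binary mode.
proc nextCharBoundary {chan offset} {
    chan seek $chan $offset start
    while {[set b [chan read $chan 1]] ne ""} {
        binary scan $b cu byte
        # Continuation bytes look like 10xxxxxx; any other byte starts
        # a character, so step back over it and stop.
        if {($byte & 0xC0) != 0x80} {
            chan seek $chan -1 current
            break
        }
    }
    return [chan tell $chan]
}
```

The caller still has to remember or compute a byte offset, as Donal says; the sketch only shows that a bad offset can be corrected by scanning a few bytes.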

> > My idea: why not make it easy to implement generic protocols in Tcl,
> > while still assuming that the C or C++ version will be faster? We
> > don't even have the tools for efficient I/O mixed with application
> > state changes. What we do have is relative immunity from buffer
> > overflow and many other issues affecting languages such as C, C++,
> > Java, .NET, etc.

> Have you measured the inefficiency, or is this supposition?

My code is fast, the current http::geturl is a dog. I was hoping to
improve performance of my code, not the dog code currently offered as
a standard Tcl package.

I decided to make a one char change to my htclient code: set the
client connection to blocking (I had been using non-blocking). Guess
what? htclient performs the same as if sockets were in non-blocking
mode.

http://rmadilo.com/files/htclient/

I should point out that Tcl offers something not available in the
typical OS: non-application-blocking client connect. This allows
multiple connects to be performed without blocking the application.
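
[Editor's sketch] For readers unfamiliar with the feature, a non-blocking client connect is requested with [socket -async]. The following is a minimal sketch, not code from htclient; the proc names connectAsync and connectReady and the callback signature (socket, error message) are assumptions made for illustration.

```tcl
# Hypothetical helpers: start a connect without blocking the event loop.
# $onDone is a command prefix called as: {*}$onDone socket errorMessage
proc connectAsync {host port onDone} {
    set sock [socket -async $host $port]
    # The socket becomes writable once the connect succeeds or fails.
    chan event $sock writable [list connectReady $sock $onDone]
    return $sock
}

proc connectReady {sock onDone} {
    chan event $sock writable {}
    # -error reports (and clears) a pending socket error, if any.
    set err [chan configure $sock -error]
    if {$err ne ""} {
        close $sock
        {*}$onDone "" $err
    } else {
        {*}$onDone $sock ""
    }
}
```

Several calls to connectAsync can be outstanding at once; each callback fires from the event loop when its own connection resolves.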

From: Donal K. Fellows on 3 Jul 2010 04:56

On 2 July, 22:06, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Wow! Does anyone have time to enlighten the Tcl world? This is
> something which should have been done with the introduction of the
> code (in designing the code), not as an answer to an idiot posting on
> comp.lang.tcl.

Now that I'm (much!) more awake, see the code on the Wiki at
http://wiki.tcl.tk/19667 which should show how it is done. It checks
for both the nefarious case where someone is sending too much at once
(i.e., where the buffered data gets to be more than a kilobyte without
producing a line) and the case where they're dribbling bytes across
(that requires timeouts, of course). The code is designed so that it
delivers each complete line to "user" code through a callback. It
should be fairly easy to adjust the anti-nefariousness policies too;
the line timeout and the buffer limits are straightforward to find.

Donal.
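
[Editor's sketch] The approach described in this post can be outlined roughly as follows. This is not the code from the Wiki page; it is a minimal sketch assuming Tcl 8.5 or later, and the namespace safeline, its proc names, and the callback signature (event kind, data) are hypothetical. It combines a non-blocking channel, [chan pending] as the buffer limit, and an [after] timer as the per-line timeout.

```tcl
namespace eval safeline {
    variable timer          ;# array: channel -> pending [after] id
    array set timer {}
}

# Start delivering lines from $chan to $callback, which is called as
# {*}$callback line $text, or {*}$callback eof|overflow|timeout {}.
proc safeline::start {chan callback {limit 1024} {timeoutMs 5000}} {
    variable timer
    chan configure $chan -blocking 0
    set timer($chan) [after $timeoutMs \
        [list ::safeline::stop $chan $callback timeout]]
    chan event $chan readable \
        [list ::safeline::onReadable $chan $callback $limit $timeoutMs]
}

# Cancel the timer, close the channel and report why reading ended.
proc safeline::stop {chan callback reason} {
    variable timer
    after cancel $timer($chan)
    unset timer($chan)
    chan close $chan
    {*}$callback $reason {}
}

proc safeline::onReadable {chan callback limit timeoutMs} {
    variable timer
    if {[chan gets $chan line] >= 0} {
        # A complete line arrived: restart the line timeout, deliver it.
        after cancel $timer($chan)
        set timer($chan) [after $timeoutMs \
            [list ::safeline::stop $chan $callback timeout]]
        {*}$callback line $line
    } elseif {[chan eof $chan]} {
        stop $chan $callback eof
    } elseif {[chan pending input $chan] > $limit} {
        # Too much buffered data without a line terminator.
        stop $chan $callback overflow
    }
    # Otherwise only a partial line is buffered; wait for more data.
}
```

The limit and timeout defaults here (1024 bytes, 5 seconds) are placeholders chosen for the sketch, not values taken from the Wiki code.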

From: tom.rmadilo on 3 Jul 2010 21:07

On Jul 3, 1:56 am, "Donal K. Fellows"
<donal.k.fell...(a)manchester.ac.uk> wrote:
> On 2 July, 22:06, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> > Wow! Does anyone have time to enlighten the Tcl world? This is
> > something which should have been done with the introduction of the
> > code (in designing the code), not as an answer to an idiot posting on
> > comp.lang.tcl.
>
> Now that I'm (much!) more awake, see the code on the Wiki at
> http://wiki.tcl.tk/19667 which should show how it is done. It checks
> for both the nefarious case where someone is sending too much at once
> (i.e., where the buffered data gets to be more than a kilobyte without
> producing a line) and the case where they're dribbling bytes across
> (that requires timeouts, of course). The code is designed so that it
> delivers each complete line to "user" code through a callback. It
> should be fairly easy to adjust the anti-nefariousness policies too;
> the line timeout and the buffer limits are straightforward to find.

Since starting this thread, I stumbled across a Bell Labs paper which
poses the same question I originally started with:

http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

The sam text editor is a product of this concept:

http://en.wikipedia.org/wiki/Sam_(text_editor)

My only interest in this product is in the variable and context-
oriented selection of text. Tcl's [gets] is line oriented, but even
line-oriented protocols somehow work around the distinction between
records and lines defined as something which ends in <CR>, <LF> or
<CR><LF>. Sam still misses out on some of the complexity of protocol
messages, but it is somewhat closer to what I'm looking for.

Here is a short quote from the Wikipedia article:

"Sam's command syntax is formally similar to ed's or ex's, containing
(structural-) regular-expression-based conditional and loop functions
and scope addressing, even sharing some of ed's syntax for such
functions. But while ed's commands are line-oriented, sam's are
selection-oriented. Selections are contiguous strings of text (which
may span multiple lines), and are specified either with the mouse (by
"sweeping" it over a region of text) or by a pattern match. Sam's
commands take such selections as basic, more or less as other Unix
tools treat lines; thus, multi-line and sub-line patterns are as
naturally handled by Sam as whole-line patterns are by ed, vi, awk,
Perl, etc. This is implemented through a model called "structural
regular expressions," which can recursively apply regular-expression
matching to obtain other (sub)selections within a given selection. In
this way, sam's command set can be applied to substrings that are
identified by arbitrarily complex context.

"Sam extends its basic text-editing command set to handling of
multiple files, providing similar pattern-based conditional and loop
commands for filename specification. Any sequence of text-editing
commands may be applied as a unit to each such specification."

The main issue is handling multiple possible regular expressions at
the same time and being able to sub-divide regex's into groups/
contexts.

Otherwise Tcl offers a stark choice: use a predefined concept of a
line and use that as the input to your protocol interpreter, or read
some number of bytes and apply a similar algorithm. The only important
point is that a protocol's state can change at any particular byte, so
the only generic algorithm which can handle any protocol must be able
to execute arbitrary code and change state at every byte boundary.
Tcl's regular expressions do not offer this ability. But it might be
possible to build this into a Tcl channel so that reading a channel,
transforming input, etc., might be specified at the Tcl script level,
but still leverage the I/O efficiency of low level C code.
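
[Editor's sketch] To make the "state can change at any byte" point concrete, here is a small script-level state machine. It is only an illustration, not a proposal for the channel-level facility described above: the proc name feedBytes, the state names, and the record format (records end in <CR><LF>, while a bare <CR> is ordinary data) are all assumptions. State is kept in a caller-owned array so that input can arrive in arbitrary chunks.

```tcl
# Hypothetical helper: feed a chunk of data through a two-state machine.
# stateVar names an array with elements "state" (text|sawCR) and "buf".
# $onRecord is a command prefix called with each complete record.
proc feedBytes {stateVar data onRecord} {
    upvar 1 $stateVar st
    foreach ch [split $data ""] {
        if {$st(state) eq "sawCR"} {
            if {$ch eq "\n"} {
                # <CR><LF> seen: the record is complete.
                {*}$onRecord $st(buf)
                set st(buf) ""
                set st(state) text
                continue
            }
            # The earlier <CR> was not a terminator; keep it as data.
            append st(buf) "\r"
            set st(state) text
        }
        if {$ch eq "\r"} {
            set st(state) sawCR
        } else {
            append st(buf) $ch
        }
    }
}
```

A caller would initialize the array with `array set p {state text buf ""}` and call feedBytes from a readable handler with whatever [read] returned. Doing this per character in script is exactly the cost the post hopes to push down into C.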

From: Uwe Klein on 4 Jul 2010 05:19

tom.rmadilo wrote:
> But while ed's commands are line-oriented, sam's are
> selection-oriented. Selections are contiguous strings of text (which
> may span multiple lines), and are specified either with the mouse (by
> "sweeping" it over a region of text) or by a pattern match.

Actually the (s)ed suite allows you to append to the matchbuffer.

patterns spanning embedded linebreaks are thus possible.

never used it much
but it would be my hammer to go at, for example, multiline mail-header items

uwe
