From: Alexandre Ferrieux on
On Nov 5, 2:47 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Anyway, I think you have somewhat valid complaints about this
> implementation, but I think they assume a perfect world where
> everybody is reasonable. I have a personality complex which doesn't
> allow me to trust myself, much less a foreign server.

Ah-ha. Discussion switching back to productive mode. Thanks.
Look again: this optimistic assumption is also prevalent in the
current http package.
I'd risk that many other widely used libraries across the industry
strike a similar balance between complexity and security. How many
unices have the kerberos extensions ? How many Solarisses have the C2
package ?

> I also have a previously declared bias against regular expressions. I

Be reassured, that was clear ;-)

> consider them brittle code which require much documentation, I call
> them miniature programs. If the regular expression must extend beyond
> a single token, the complexity increases dramatically.

Ah ? Is the equivalent [read 1;switch] automaton readable without
comments ?
Yes they are miniature programs. But the -expanded flag of [regexp]
does a lot to make them just as easy (or hard) to read as an
equivalent lower-level program, with two additional properties:

(1) They are guaranteed to stay within the bounds of regular
grammars. In this context it is a bonus, because it limits the
"spaghetti power" of the code as compared to an unconstrained Turing
complete programming language with counters, stack, etc.

(2) They get compiled to a very efficient form (a) by
determinization and (b) because the final parsing is done in C.

So yes, they are mini-programs with a universally known syntax, very
simple semantics, and a turbo compiler.
Then it's a matter of personal taste ;-)

> But I've never seen a regular expression which could handle escape
> sequences, since that requires counting, so I don't see how you cover
> quoted strings even as a stand-alone token. If we can get by that
> problem, maybe there is a regular expression solution.

Ha ! So you're having trouble understanding the regexp I sent ?
The re_syntax is written in English, which you're proficient at, and
is IMO very clear. So what ?
Maybe you should start by getting up to speed with regexps before
complaining about their limitations...

One thing that might help you on that route: the assumption "escape
sequences require counting" is wrong. Indeed, they only require
counting _modulo 2_, which makes a (quoting) hell of a difference,
since it means a two-state finite automaton, as opposed to an
unbounded integer counter which is outside the regular realm.

Look closely at my regexp, and you'll find this tiny two-state sub-
automaton: \\. . It essentially handles the difference between an odd
and an even number of backslashes before a quote.
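
To make the "counting modulo 2" point concrete, here is a minimal sketch that puts the regexp form next to a hand-written automaton. The proc names is_quoted_re and is_quoted_fsm are made up for this illustration and are not part of any package. The only thing the hand-written loop has to remember between characters is one bit, which is why the regexp stays within regular grammars.

```tcl
# Regexp form: an opening quote, then any run of "ordinary char" or
# "backslash followed by any char", then the closing quote.
proc is_quoted_re {s} {
    return [regexp {^"([^"\\]|\\.)*"$} $s]
}

# Hand-written equivalent. The whole state is the flag "escaped":
# 1 means the previous character was a backslash that quotes this one.
proc is_quoted_fsm {s} {
    set n [string length $s]
    if {$n < 2 || [string index $s 0] ne "\""} {
        return 0
    }
    set escaped 0
    for {set i 1} {$i < $n} {incr i} {
        set c [string index $s $i]
        if {$escaped} {
            set escaped 0   ;# this char is quoted, whatever it is
        } elseif {$c eq "\\"} {
            set escaped 1
        } elseif {$c eq "\""} {
            # unescaped quote: it must be the very last character
            return [expr {$i == $n - 1}]
        }
    }
    return 0   ;# input ended before the closing quote
}
```

Both procs accept a string whose final quote is preceded by an even number of backslashes and reject one where that number is odd, with no integer counter anywhere.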

Now you mentioned nesting constructs (comments), which do need a
context-free grammar. However, I don't see them in RFC2616. Can you
show me ?

-Alex
From: tom.rmadilo on
On Nov 4, 11:50 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Nov 5, 2:47 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
>
>
> > Anyway, I think you have somewhat valid complaints about this
> > implementation, but I think they assume a perfect world where
> > everybody is reasonable. I have a personality complex which doesn't
> > allow me to trust myself, much less a foreign server.
>
> Ah-ha. Discussion switching back to productive mode. Thanks.
> Look again: this optimistic assumption is also prevalent in the
> current http package.
> I'd risk that many other widely used libraries across the industry
> strike a similar balance between complexity and security. How many
> unices have the kerberos extensions ? How many Solarisses have the C2
> package ?
>

I don't know, but besides security, the headers are currently divided
into tokens/words with whitespace removed. In theory this should make
it easier to interpret the field values. I know it will simplify
working with the date header, as a quick example.

I'm not yet convinced it will be easy to deal with the complexity of
partial headers if there are illegal <CR> or <LF> chars. Basically you
read in an unknown chunk of data with [gets] and hope you can detect
errors.
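
A minimal sketch of what that detection could look like, under one assumption: the channel is configured with -translation lf (or binary), so that [gets] splits on LF only and leaves every CR in the data. The proc name strip_crlf is hypothetical.

```tcl
# Take one line as returned by [gets], check that it ended in CRLF,
# and reject any other CR left inside it. Returns the line without
# its trailing CR.
proc strip_crlf {line} {
    if {[string index $line end] ne "\r"} {
        # either a bare LF ended the line, or the line is empty
        error "line not terminated by CRLF"
    }
    set body [string range $line 0 end-1]
    if {[string first "\r" $body] >= 0} {
        error "bare CR inside header line"
    }
    return $body
}
```

A bare LF inside a header shows up here as a line that lacks its trailing CR, so both illegal cases end in an error rather than in a silently split header.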

OTOH, you seem convinced it'll work. It seems easy enough to
substitute in your ideas, so I'll give it a try.

> > But I've never seen a regular expression which could handle escape
> > sequences, since that requires counting, so I don't see how you cover
> > quoted strings even as a stand-alone token. If we can get by that
> > problem, maybe there is a regular expression solution.
>
> Ha ! So you're having trouble understanding the regexp I sent ?
> The re_syntax is written in English, which you're proficient at, and
> is IMO very clear. So what ?
> Maybe you should start by getting up to speed with regexps before
> complaining about their limitations...

Did I complain about their limitations? I just think they take time to
develop and test. Sometimes it is well worth the effort, sometimes it
isn't. I also agree that you should try to document them, which it
looks like the -expanded flag lets you do inline.

> One thing that might help you on that route: the assumption "escape
> sequences require counting" is wrong. Indeed, they only require
> counting _modulo 2_, which makes a (quoting) hell of a difference,
> since it means a two-state finite automaton, as opposed to an
> unbounded integer counter which is outside the regular realm.
>
> Look closely at my regexp, and you'll find this tiny two-state sub-
> automaton: \\. . It essentially handles the difference between an odd
> and an even number of backslashes before a quote.

Cool, really cool.

> Now you mentioned nesting constructs (comments), which do need
> context-free. However, I don't see them in RFC2616. Can you show me ?

From the rfc:

Fielding, et al.         Expires April 29, 2010                [Page 20]

Internet-Draft              HTTP/1.1, Part 1                October 2009

....

Comments can be included in some HTTP header fields by surrounding
the comment text with parentheses. Comments are only allowed in
fields containing "comment" as part of their field value
definition.

   comment        = "(" *( ctext / quoted-cpair / comment ) ")"
   ctext          = OWS / %x21-27 / %x2A-5B / %x5D-7E / obs-text
                  ; OWS / <VCHAR except "(", ")", and "\"> / obs-text

The backslash character ("\") can be used as a single-character
quoting mechanism within comment constructs:

quoted-cpair = "\" ( WSP / VCHAR / obs-text )

Producers SHOULD NOT escape characters that do not require escaping
(i.e., other than the backslash character "\" and the parentheses "("
and ")").

I haven't looked at which header fields allow comments, this seems
like something to avoid like the plague, if possible.
From: Donal K. Fellows on
On 5 Nov, 08:49, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> I haven't looked at which header fields allow comments, this seems
> like something to avoid like the plague, if possible.

Looks like it's permitted in the Server, User-Agent, and Via fields.
Comments can't include anything that looks like the start of another
header or the end of the headers: check the definition of TEXT, which
excludes all CTLs except LWS, and LWS is only used as a line
continuation, which is unambiguously different. So all the confusion
is confined to those specific headers, and they're all ones that never
need to be parsed; they're just informational.

If you want to parse out the comments, do so. But they're not
important.

Donal.
From: tom.rmadilo on
On Nov 4, 3:53 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Nov 4, 7:58 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:

> > Why not just produce the easy to read and understand regexp?

> Sure:
>
>   proc match_chunk_header_line s {
>     if {![regexp -expanded {
>        ^
>        ([0-9a-fA-F]+) # hex length
>        \s*
>        (?:       # begin optional chunkext
>         ;\s*
>         [^\s=]+  # name is a token
>         =
>         (?:      # begin alternative
>          [^"\s;]+ #  simple token
>         |        # or
>          "       #  opening quote
>          ([^"\\]|\\.)* # normal chars or escape sequences
>          "       #  closing quote
>         )        # end alternative
>         \s*
>        )*        # end and iterate chunkext
>        \r        # ah don't forget CR
>        $
>      } $s -> len]} {
>        error "Syntax error in chunk header"
>      }
>      scan $len %x len
>      return $len
>    }

Alex,

So I'm looking into this more, but I've run into a problem. The above
regexp essentially validates the chunk-size / chunk-ext "header". I'm
not sure it is really a header, but fortunately it is a little easier
to handle (no line folding allowed). The regexp places the hex digits
into "len" and discards the rest. That is okay in this situation, since
we must ignore stuff we don't understand. Being dumb is something of a
benefit.

So my question is about regular headers. Currently I find the field
name, which is a token, and divide the field value into tokens, words
(quoted strings), and bare punctuation: characters which can't be in a
token but don't need to be inside a word. For instance, a colon, an
equal sign or a comma is in this punctuation group.

How do you write a regexp which can divide a header into these
different parts: tokens, words, bare punctuation?

I'm guessing I must use [regexp -all -inline].
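
One possible shape for such a splitter, as a sketch only. The proc name split_field_value is made up, the token character class is taken from the RFC 2616 definition of token, and "bare punctuation" is approximated as any other single non-space character. It relies on the fact that a Tcl regexp with top-level alternation prefers the longest match, so a whole token or quoted string always wins over the one-character fallback.

```tcl
# Split a header field value into a flat list of tokens, quoted
# strings (kept with their quotes and escapes) and single punctuation
# characters. Whitespace between items is dropped.
proc split_field_value {value} {
    return [regexp -all -inline -expanded {
        " (?: [^"\\] | \\. )* "          # word: quoted string with escapes
      | [-!#$%&'*+.^_`|~0-9A-Za-z]+      # token
      | \S                               # anything else, one char at a time
    } $value]
}
```

For the value text/html; charset="utf-8" this returns seven elements: text, /, html, ;, charset, =, and "utf-8". An unterminated quote is not an error in this sketch; the lone quote simply comes out as punctuation, so a caller that cares has to check for it.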
From: Alexandre Ferrieux on
On Nov 5, 9:49 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> OTOH, you seem convinced it'll work. It seems easy enough to
> substitute in your ideas, so I'll give it a try.

Thanks. I do appreciate it.

>
> > Now you mentioned nesting constructs (comments), which do need
> > context-free. However, I don't see them in RFC2616. Can you show me ?
>
> From the rfc:
> Fielding, et al. Expires April 29, 2010 [Page 20]

Ah, so it's not RFC2616, it's the Fielding HTTPbis IETF draft. You're
on the bleeding edge, eh ;-)

Anyway I think we all agree that ignoring those braindead nested
comments is the way to go, possibly with post-processing of a regular
approximation of the grammar. Among other things, this post-processing
can leverage Tcl's own parser by morphing the nesting constructs
(here, parentheses) into braces, and then calling [info complete]. But
personally I'd happily [error UNSUPPORTED_STUPID_STANDARD].

-Alex
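
For the record, one way the parentheses-to-braces idea could be sketched. This is an interpretation of the suggestion above, not code from the thread, and the proc name comment_end is hypothetical. It blanks out escaped characters and everything else that is not a parenthesis, turns the parentheses into braces, and then asks [info complete] whether each candidate prefix has any unclosed brace left.

```tcl
# Return the index of the ")" that closes the comment starting at
# index 0 of $s, or -1 if $s does not start with "(" or the comment
# is never closed.
proc comment_end {s} {
    if {[string index $s 0] ne "("} {
        return -1
    }
    # 1. Blank out quoted-pairs so an escaped parenthesis does not count.
    regsub -all {\\.} $s {xx} t
    # 2. Blank out everything that is not a parenthesis. This also gets
    #    rid of any character that Tcl's own parser treats specially.
    regsub -all {[^()]} $t {x} t
    # 3. Morph parentheses into braces. Lengths are unchanged, so an
    #    index into the morphed string is also an index into $s.
    set t [string map [list ( "\{" ) "\}"] $t]
    # 4. The first prefix ending on a close brace with nothing left
    #    unclosed is the whole comment.
    set start 0
    while {[set i [string first "\}" $t $start]] >= 0} {
        if {[info complete [string range $t 0 $i]]} {
            return $i
        }
        set start [expr {$i + 1}]
    }
    return -1
}
```

The scan is quadratic in the worst case, which should not matter for header-sized input. A regular approximation of the grammar would find where a comment starts, and this post-processing step would then find where it really ends.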