From: Alexandre Ferrieux on
On Nov 6, 2:05 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> So I'm looking into this more, but I've run into a problem. The above
> regexp essentially validates the chunk-size, chunk-ext "header". Not
> sure if it is really a header; fortunately it is a little easier (no
> line folding allowed). It places the hexchars into "len" and discards
> the remaining junk. This is okay in this situation, since we must
> ignore stuff we don't understand. Being dumb is something of a
> benefit.

Yes, sir :-)

> So my question is about regular headers. Currently I find the field
> name, which is a token, and divide the field value into tokens, words,
> and punctuation that can't appear in a token but doesn't need to be
> inside a word (a quoted string). For instance, a colon, equals sign, or
> comma belongs to this punctuation group.
>
> How do you write a regexp which can divide a header into these
> different parts: tokens, words, bare punctuation?
>
> I'm guessing I must use [regexp -all -inline]

You're guessing right. The trick is to do it in two steps:

(a) one [regexp], either with -indices or with a top-level capturing
subexpression, to identify the "repetitive" part:

regexp {REGEXP-FOR-HEADER:(REGEXP-FOR-WHOLE-VALUE)} $s -> value

(b) one [regexp -all -inline] on the repetitive part, yielding a list
of matches:

set l [regexp -all -inline {REGEXP-FOR-SEPARATED-QUOTED-OR-UNQUOTED-WORD} $value]
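A minimal sketch of the two steps, assuming a made-up sample header and deliberately simplified patterns (the header regexp and `wordRE` below are illustrative stand-ins, not the full HTTP grammar):

```tcl
# Hypothetical sample header; the patterns are simplified stand-ins.
set s {Content-Type: text/html; charset="utf-8"}

# Step (a): split off the field name and capture the whole value.
if {![regexp {^([^:\s]+):\s*(.*)$} $s -> name value]} {
    error "not a header line: $s"
}

# Step (b): pull every word out of the value, one match per list element.
# A word here is a quoted string, a run of non-separator characters,
# or a single separator character.
set wordRE {"[^"]*"|[^\s";=/,]+|[;=/,]}
set l [regexp -all -inline $wordRE $value]

puts "name  = $name"
puts "words = $l"
```

Since `wordRE` has no capturing parentheses, the list in `l` contains exactly one element per word.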

If you have trouble writing this
REGEXP-FOR-SEPARATED-QUOTED-OR-UNQUOTED-WORD, ask again (I'm a bit busy
right now).

-Alex

From: tom.rmadilo on
On Nov 6, 3:14 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:

> If you have trouble writing this
> REGEXP-FOR-SEPARATED-QUOTED-OR-UNQUOTED-WORD, ask again (I'm a bit busy
> right now).

I apologize in advance for the length of this post.

Here's a simple test script which seems to correctly "tokenize" a
header which is valid. I'm stuck on how to get it to work when the
header isn't.

# token.tcl

# \x21 !
# \x23 #
# \x24 $
# \x25 %
# \x26 &
# \x27 '
# \x2A *
# \x2B +
# \x2D -
# \x2E .
# \x30-\x39 0-9
# \x41-\x5A A-Z
# \x5E ^
# \x5F _
# \x60 `
# \x61-\x7A a-z
# \x7C |
# \x7E ~

set token {[\x21\x23-\x27\x2A-\x2B\x2D\x2E\x30-\x39\x41-\x5A\x5E-\x7A\x7C\x7E]+}

# useless:
set wsp {(?:[\x09\x20]+)}

# comment and quoted-string delimiters:
# \x22 "
# \x28 (
# \x29 )
# \x5C \

# stuff that terminates/divides tokens:
# \x2C ,
# \x2F /
# \x3A :
# \x3B ;
# \x3C <
# \x3D =
# \x3E >
# \x3F ?
# \x40 @
# \x5B [
# \x5D ]
# \x7B \{
# \x7D \}

set unquotedPunct {[\x2C\x2F\x3A-\x40\x5B\x5D\x7B\x7D]}

set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF])*"}

set allRegexp "$token|$unquotedPunct|$quotedText"

set strings {abcd ef gh kkk--$+ i*& % o9}

foreach string $strings {
    # allRegexp has no capturing groups, so only the overall match is reported
    if {[regexp $allRegexp $string x]} {
        puts "string = <$string> x = <$x>"
    } else {
        puts "string = <$string> did not match"
    }
}

set result [regexp -all -inline $allRegexp $strings]

puts "result = $result"

# End token.tcl

I have specifically disallowed "quoted-pairs" like \" and \\ in quoted-
text (actually I just disallow \ and "), I can't figure out what else
to "or" into the quotedText expression. The problem is that the regexp
just pulls out what matches and throws away what doesn't.

Maybe there is a global indication that something didn't match?

Otherwise, this regexp correctly divides the headers into words.

From: Alexandre Ferrieux on
On Nov 6, 11:16 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> I have specifically disallowed "quoted-pairs" like \" and \\ in quoted-
> text (actually I just disallow \ and "), I can't figure out what else
> to "or" into the quotedText expression. The problem is that the regexp
> just pulls out what matches and throws away what doesn't.
> Maybe there is a global indication that something didn't match?

Not sure I understand the problem you are describing; however there
are two things that might help:

(1) separate the tasks of global validation and of data extraction:

if {![regexp {^(?:TOKEN|PUNCT|QUOTED|WSP)*$} $s]} barf
set l [regexp -all -inline {TOKEN|PUNCT|QUOTED|WSP} $s]
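As a rough sketch of (1), with small stand-in patterns substituted for TOKEN, PUNCT, QUOTED and WSP (these four patterns and the `tokenize` proc are assumptions for illustration, not the real grammar):

```tcl
# Stand-in patterns, assumed for illustration only.
set TOKEN  {[[:alnum:]_.-]+}
set PUNCT  {[,/:;=]}
set QUOTED {"[^"\\]*"}
set WSP    {[ \t]+}
set item   "$TOKEN|$PUNCT|$QUOTED|$WSP"

# Hypothetical helper: validate the whole string first, then extract.
proc tokenize {s} {
    global item
    # Global validation: the string must be nothing but a run of items.
    if {![regexp "^(?:$item)*\$" $s]} {
        error "malformed header value: $s"
    }
    # Data extraction: one list element per item (whitespace included).
    return [regexp -all -inline $item $s]
}

puts [tokenize {a=b; c="d e"}]
puts [catch {tokenize {a="oops}} msg]
puts $msg
```

The unterminated quote in the second call cannot be consumed by any alternative, so the anchored validation fails instead of the bad character being silently skipped.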

(2) In the above, allow " and \ only inside QUOTED, using my earlier
construct:

" # opening quote
([^"\\]|\\.)* # normal chars or escape sequences
" # closing quote

Also, in your code I see sometimes: (?:[FOO])
Note that the (?: ) are useless in this case, since [FOO] is a single
char.
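Going back to the QUOTED construct in (2), here is a sketch of how it could be tried on its own using the -expanded switch, so the comments can stay inside the pattern. The anchors, the non-capturing group, the `isQuoted` proc and the sample strings are assumptions added for the test:

```tcl
# Anchored so that the whole string must be a single quoted string.
set quotedRE {
    ^ "                  # opening quote
    (?: [^"\\] | \\. )*  # normal chars or escape sequences
    " $                  # closing quote
}

# Hypothetical helper: 1 if s is one well-formed quoted string, else 0.
proc isQuoted {s} {
    global quotedRE
    return [regexp -expanded -- $quotedRE $s]
}

foreach s [list {"plain"} {"with \" escaped quote"} {"ends with backslash \\"} {"dangling \"}] {
    puts "[isQuoted $s] <$s>"
}
```

The last sample is rejected because its final backslash swallows the closing quote as an escape sequence, leaving the string unterminated.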

-Alex

From: tom.rmadilo on
On Nov 6, 3:29 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:

> Also, in your code I see sometimes: (?:[FOO])
> Note that the (?: ) are useless in this case, since [FOO] is a single
> char.

I don't want to take too much of your time with this. The original
(?:...) was like this (among others):

set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\\\|\\\")*"}

Anyway, without the (?:) I get extra matches. Sometimes the last
matching char, sometimes a full double match.

% regexp -inline -all $allRegexp { x = "this\\\" okay"}
x = {"this\\\" okay"} ; # as expected
% regexp -inline -all $allRegexp { x = "this\\" okay"}
x = {"this\\"} okay ; # failed, as expected (but how to detect)
% regexp -inline -all $allRegexp { x = "this\ okay"}
x = this okay ; # failed, because it isn't defined, but still looks like success

If I remove the ?:, I get extra reporting:

% regexp -inline -all $allRegexp {abcd ef gh kkk--$+ i*& % o9}
abcd {} ef {} gh {} {kkk--$+} {} i*& {} % {} o9 {}

I'm guessing that the global validation route would squeeze illegal
sequences so they must pop out and create the barf. I'll look into
that.
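One possible way to see where an illegal sequence "pops out" is to match the longest valid prefix and look at where it stops. This is only a sketch: the `wsp` alternative, the `[^"\\]|\\.` form of `quotedText` and the `firstBadIndex` proc are assumptions, not part of the test script above.

```tcl
set token         {[\x21\x23-\x27\x2A-\x2B\x2D\x2E\x30-\x39\x41-\x5A\x5E-\x7A\x7C\x7E]+}
set unquotedPunct {[\x2C\x2F\x3A-\x40\x5B\x5D\x7B\x7D]}
set quotedText    {"(?:[^"\\]|\\.)*"}
set wsp           {[\x09\x20]+}
set item "$token|$unquotedPunct|$quotedText|$wsp"

# Hypothetical helper: returns -1 if s is valid from start to end,
# otherwise the index of the first character no alternative can consume.
proc firstBadIndex {s} {
    global item
    # The starred group can match the empty string, so this always
    # succeeds; an empty match is reported as the pair "0 -1".
    regexp -indices "^(?:$item)*" $s span
    set next [expr {[lindex $span 1] + 1}]
    return [expr {$next == [string length $s] ? -1 : $next}]
}

puts [firstBadIndex { x = "this\\\" okay"}]
puts [firstBadIndex { x = "this\\" okay"}]
```

With these patterns the first input is valid throughout, while the second stops at the trailing unmatched quote.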

From: Alexandre Ferrieux on
On Nov 7, 1:57 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> On Nov 6, 3:29 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
> wrote:
>
> > Also, in your code I see sometimes: (?:[FOO])
> > Note that the (?: ) are useless in this case, since [FOO] is a single
> > char.
>
> I don't want to take too much of your time with this. The original
> (?:...) was like this (among others):
>
> set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\\\|\\\")*"}

Ah OK, there was an OR of several regexp, that justifies the (?:).
Only the final regexp with a single [...] doesn't need to be
parenthesized, since non-capturing parentheses are a no-op for an
atomic regexp.

> Anyway, without the (?:) I get extra matches. Sometimes the last
> matching char, sometimes a full double match.

No, I mean "remove (?:...)" , not "remove ?:".

Of course if you just remove ?:, you turn a non-capturing pair into a
capturing one, hence the extra reports.
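A small sketch of the difference, using a toy pattern and input chosen just for illustration:

```tcl
set s {ab "cd"}

# Non-capturing group: one list element per match.
set nc [regexp -all -inline {[a-z]+|"(?:[a-z])*"} $s]

# Capturing group: each match is followed by the group's value, which is
# empty when the group did not take part in that match.
set cap [regexp -all -inline {[a-z]+|"([a-z])*"} $s]

# Bare bracket expression: no parentheses needed at all.
set bare [regexp -all -inline {[a-z]+|"[a-z]*"} $s]

puts "non-capturing: $nc"
puts "capturing:     $cap"
puts "bare:          $bare"
```

The capturing variant returns four elements instead of two: an empty string after the first match, and the last character matched by the group after the second.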

> I'm guessing that the global validation route would squeeze illegal
> sequences so they must pop out and create the barf. I'll look into
> that.

Yes, that's the principle. If you encounter problems, please give
examples of offending input strings and wanted output.

-Alex

