From: Alexandre Ferrieux on 6 Nov 2009 06:14

On Nov 6, 2:05 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> So I'm looking into this more, but I've run into a problem. The above
> regexp essentially validates the chunk-size, chunk-ext "header". Not
> sure if it is really a header, fortunately it is a little easier (no
> line folding allowed). It places the hexchars into "len" and discards
> the remaining junk. This is okay in this situation, since we must
> ignore stuff we don't understand. Being dumb is something of a
> benefit.

Yes, sir :-)

> So my question is with regular headers. Currently I find the field
> name, which is a token and divide the field value into tokens, words
> and punctuation which can't be in a token, but doesn't need to be in a
> word (a quoted string). For instance, a colon, equal sign or comma are
> in this punctuation group.
>
> How you write a regexp which can divide a header into these different
> parts: tokens, words, bare punctuation?
>
> I'm guessing I must use [regexp -all -inline]

You're guessing right. The trick is to do it in two steps:

(a) one [regexp -indices] or with a toplevel capturing subexpression
to identify the "repetitive" part:

    regexp {REGEXP-FOR-HEADER:(REGEXP-FOR-WHOLE-VALUE)} $s -> value

(b) one [regexp -all -inline] on the repetitive part, yielding a list
of matches:

    set l [regexp -all -inline {REGEXP-FOR-SEPARATED-QUOTED-OR-UNQUOTED-WORD} $value]

If you have trouble writing this REGEXP-FOR-SEPARATED-QUOTED-OR-UNQUOTED-WORD,
ask again (I'm a bit busy right now).

-Alex
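The two steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample header line, the variable names (`s`, `name`, `value`, `wordRe`, `words`) and the simplified character classes are made up for the example and are not the regexps used in the thread, and the quoted-string alternative deliberately ignores backslash escapes.

```tcl
# Hypothetical input line, made up for illustration.
set s {Content-Type: text/html; charset="utf-8"}

# Step (a): one regexp with capturing subexpressions isolates the
# field name and the whole field value.
# Assumption: the field-name class is a simplification, not the full token set.
if {![regexp {^([-A-Za-z0-9]+):[ \t]*(.*)$} $s -> name value]} {
    error "not a header line: $s"
}

# Step (b): one [regexp -all -inline] over the value returns every word.
# Alternatives: quoted string | run of token characters | one punctuation char.
set wordRe {"[^"\\]*"|[-A-Za-z0-9!#$%&'*+.^_`|~]+|[,/:;<=>?@\[\]{}]}
set words [regexp -all -inline $wordRe $value]

puts "name  = $name"
puts "words = $words"
```

Note that whitespace between words matches none of the alternatives, so `regexp -all` simply skips over it. The same skipping applies to any other unmatched character, which is why this step alone cannot detect a malformed value.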
From: tom.rmadilo on 6 Nov 2009 17:16

On Nov 6, 3:14 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> If you have trouble writing this REGEXP-FOR-SEPARATED-QUOTED-OR-
> UNQUOTED-WORD, ask again (I'm a bit busy right now).

I apologize in advance for the length of this post.

Here's a simple test script which seems to correctly "tokenize" a
header which is valid. I'm stuck on how to get it to work when the
header isn't.

# token.tcl
# \x21 !
# \x23 #
# \x24 $
# \x25 %
# \x26 &
# \x27 '
# \x2A *
# \x2B +
# \x2D -
# \x2E .
# \x30-\x39 0-9
# \x41-\x5A A-Z
# \x5E ^
# \x5F _
# \x60 `
# \x61-\x7A a-z
# \x7C |
# \x7E ~
set token {[\x21\x23-\x27\x2A-\x2B\x2D\x2E\x30-\x39\x41-\x5A\x5E-\x7A\x7C\x7E]+}

# useless:
set wsp {(?:[\x09\x20]+)}

# comment and quoted-string delimiters:
# \x22 "
# \x28 (
# \x29 )
# \x5C \

# stuff that terminates/divides tokens:
# \x2C ,
# \x2F /
# \x3A :
# \x3B ;
# \x3C <
# \x3D =
# \x3E >
# \x3F ?
# \x40 @
# \x5B [
# \x5D ]
# \x7B \{
# \x7D \}
set unquotedPunct {[\x2C\x2F\x3A-\x40\x5B\x5D\x7B\x7D]}

set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF])*"}

set allRegexp "$token|$unquotedPunct|$quotedText"

set strings {abcd ef gh kkk--$+ i*& % o9}

foreach string $strings {
    if {[regexp $allRegexp $string x -> y]} {
        puts "string = <$string> x = <$x> y = <$y>"
    } else {
        puts "string = <$string> did not match"
    }
}

set result [regexp -all -inline $allRegexp $strings]
puts "result = $result"
# End token.tcl

I have specifically disallowed "quoted-pairs" like \" and \\ in
quoted-text (actually I just disallow \ and "), I can't figure out
what else to "or" into the quotedText expression. The problem is that
the regexp just pulls out what matches and throws away what doesn't.
Maybe there is a global indication that something didn't match?

Otherwise, this regexp correctly divides the headers into words.
From: Alexandre Ferrieux on 6 Nov 2009 18:29

On Nov 6, 11:16 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> I have specifically disallowed "quoted-pairs" like \" and \\ in quoted-
> text (actually I just disallow \ and "), I can't figure out what else
> to "or" into the quotedText expression. The problem is that the regexp
> just pulls out what matches and throws away what doesn't.
> Maybe there is a global indication that something didn't match?

Not sure I understand the problem you are describing; however there
are two things that might help:

(1) separate the tasks of global validation and of data extraction:

    if {![regexp {^(?:TOKEN|PUNCT|QUOTED|WSP)*$} $s]} barf
    set l [regexp -all -inline {TOKEN|PUNCT|QUOTED|WSP} $s]

(2) In the above, allow " and \ only inside QUOTED, using my earlier
construct:

    "               # opening quote
    ([^"\\]|\\.)*   # normal chars or escape sequences
    "               # closing quote

Also, in your code I see sometimes: (?:[FOO])
Note that the (?: ) are useless in this case, since [FOO] is a single
char.

-Alex
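A minimal sketch of points (1) and (2) combined, assuming simplified stand-ins for TOKEN, PUNCT, QUOTED and WSP. The proc name `tokenize`, the variable names and the character classes are hypothetical and chosen only for illustration; they are not the thread's actual regexps.

```tcl
# Building blocks (simplified assumptions, not a complete header grammar).
set token  {[-A-Za-z0-9!#$%&'*+.^_`|~]+}
set punct  {[,/:;<=>?@\[\]{}]}
set quoted {"(?:[^"\\]|\\.)*"}   ;# " and \ are only allowed inside quotes
set wsp    {[ \t]+}
set any    "$token|$punct|$quoted|$wsp"

# Hypothetical helper: returns the list of words in a field value, or
# raises an error if some part of the value matches no alternative.
proc tokenize {value} {
    global any wsp
    # (1) Global validation: the whole string must be a sequence of known pieces.
    if {![regexp "^(?:$any)*\$" $value]} {
        error "malformed header value: $value"
    }
    # Data extraction: same alternatives; whitespace runs are dropped afterwards.
    set words {}
    foreach w [regexp -all -inline $any $value] {
        if {![regexp "^(?:$wsp)\$" $w]} {
            lappend words $w
        }
    }
    return $words
}
```

With this split, `tokenize {x = "this\\\" okay"}` should yield three words, while `tokenize {x = "this\\" okay"}` should raise an error, because the trailing lone quote matches none of the alternatives and the anchored validation regexp therefore fails.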
From: tom.rmadilo on 6 Nov 2009 19:57

On Nov 6, 3:29 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> Also, in your code I see sometimes: (?:[FOO])
> Note that the (?: ) are useless in this case, since [FOO] is a single
> char.

I don't want to take too much of your time with this. The original
(?:...) was like this (among others):

set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\\\|\\\")*"}

Anyway, without the (?:) I get extra matches. Sometimes the last
matching char, sometimes a full double match.

% regexp -inline -all $allRegexp { x = "this\\\" okay"}
x = {"this\\\" okay"} ; # as expected
% regexp -inline -all $allRegexp { x = "this\\" okay"}
x = {"this\\"} okay ; # failed, as expected (but how to detect)
% regexp -inline -all $allRegexp { x = "this\ okay"}
x = this okay ; # failed, because it isn't defined, but still looks like success

If I remove the ?:, I get extra reporting:

% regexp -inline -all $allRegexp {abcd ef gh kkk--$+ i*& % o9}
abcd {} ef {} gh {} {kkk--$+} {} i*& {} % {} o9 {}

I'm guessing that the global validation route would squeeze illegal
sequences so they must pop out and create the barf. I'll look into
that.
From: Alexandre Ferrieux on 7 Nov 2009 04:41

On Nov 7, 1:57 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> On Nov 6, 3:29 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
> wrote:
>
> > Also, in your code I see sometimes: (?:[FOO])
> > Note that the (?: ) are useless in this case, since [FOO] is a single
> > char.
>
> I don't want to take too much of your time with this. The original
> (?:...) was like this (among others):
>
> set quotedText {"(?:[\x09\x20\x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\\\|\\\")*"}

Ah OK, there was an OR of several regexp, that justifies the (?:).
Only the final regexp with a single [...] doesn't need to be
parenthesized, since non-capturing parentheses are a no-op for an
atomic regexp.

> Anyway, without the (?:) I get extra matches. Sometimes the last
> matching char, sometimes a full double match.

No, I mean "remove (?:...)" , not "remove ?:". Of course if you just
remove ?:, you turn a non-capturing pair into a capturing one, hence
the extra reports.

> I'm guessing that the global validation route would squeeze illegal
> sequences so they must pop out and create the barf. I'll look into
> that.

Yes, that's the principle. If you encounter problems, please give
examples of offending input strings and wanted output.

-Alex
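The difference between removing "(?:...)" and removing only "?:" can be seen in a small experiment. This is a sketch with a made-up input string and deliberately tiny regexps (assumptions for illustration, not the thread's expressions); the variable names are hypothetical.

```tcl
# Hypothetical input: one bare word and one quoted word.
set s {ab "xy"}

# Non-capturing group: one list element per match.
set nonCapturing [regexp -all -inline {[a-z]+|"(?:[^"]*)"} $s]

# Only "?:" removed: the group now captures, so every match is followed
# by an extra element for the group (empty when the group took no part).
set capturing [regexp -all -inline {[a-z]+|"([^"]*)"} $s]

# Parentheses removed entirely: a single bracket expression is already
# atomic, so the result has the same shape as the non-capturing case.
set bare [regexp -all -inline {[a-z]+|"[^"]*"} $s]

puts "non-capturing: $nonCapturing"
puts "capturing:     $capturing"
puts "bare:          $bare"
```

In the capturing variant the list should alternate between whole matches and submatches, which corresponds to the extra `{}` elements reported earlier in the thread.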