From: Alexandre Ferrieux on
On Feb 22, 11:26 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Also, htclient is significantly faster/more efficient than geturl. I
> haven't figured out exactly why yet, but I think geturl spends too much
> time in string manipulations. Basically I would say htclient is good
> for developers, not so good for the casual user.

Tom, is htclient using the regexp techniques we explored together last
fall, or is it parsing headers by hand?

-Alex

From: tom.rmadilo on
On Feb 22, 3:52 pm, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> Tom, is htclient using the regexp techniques we explored together last
> fall, or is it parsing headers by hand?

Alex,

Thanks for asking. Unfortunately I was unable to write a regexp which
actually parsed a generic header. I can only say that I gave it a good
try. Either I have a serious deficiency in this area (highly likely)
or it can't be done. I guess I should ask for help.

But I never advanced past the most important issues (IMO): finding the
end of the current header while avoiding blocking the application.
Even if I found the whole header, I would still be stuck with the job
of parsing the header into tokens.

Unless and until I or someone else can produce a regular expression
which can correctly parse all headers into tokens, I'm stuck with the
current char-by-char code.

All I can say is that the current method seems to offer lots of
flexibility and is relatively easy to understand. The regexp code
seems to require lots of testing, and so far the testing eventually
finds cases which are not handled correctly.

Of course I'm still stuck with the fact that htclient is faster than
geturl and does not block. At some point I have to wonder if there is
really anything wrong with my approach. (If unput was available at the
Tcl level about half my code would be unnecessary.)

From: Alexandre Ferrieux on
On Feb 23, 1:46 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> Thanks for asking. Unfortunately I was unable to write a regexp which
> actually parsed a generic header. I can only say that I gave it a good
> try. Either I have a serious deficiency in this area (highly likely)
> or it can't be done. I guess I should ask for help.
> But I never advanced past the most important issues (IMO): finding the
> end of the current header while avoiding blocking the application.
> Even if I found the whole header, I would still be stuck with the job
> of parsing the header into tokens.

Ah, but for this you can take a two-step approach: first identify the
end of headers (CRLFCRLF) and then parse the accumulated blob. Of
course you'll get false alarms when CRLFCRLF's are embedded in quoted
strings, but our beloved regexp will detect quote imbalance. Moreover,
it is easy to distinguish this situation from a more serious syntax
error by re-checking with one extra double quote appended:

[regexp $BIGREGEXP $blob] -> 0
[regexp $BIGREGEXP $blob\"] -> 1

detects a case of pure quote imbalance. In that case, continue
appending to the blob, up to the next CRLFCRLF. Iterate.
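
A minimal sketch of that double check, assuming a much-simplified
stand-in for $BIGREGEXP (the real expression is not shown in this
thread). The names HDRS and classify are hypothetical, and the grammar
below is only loose enough to show the idea:

```tcl
# Simplified stand-in for $BIGREGEXP (an assumption, not the real one):
# "name: value" lines separated by CRLF, where a value is made of plain
# characters or double-quoted strings that may themselves contain CRLF.
set HDRS {(?x)
    ^
    [^\s:"]+ :                            # first field name
    (?: [^"\r\n] | "(?:[^"\\]|\\.)*" )*   # plain chars or quoted strings
    (?:
        \r\n
        [^\s:"]+ :                        # further header lines
        (?: [^"\r\n] | "(?:[^"\\]|\\.)*" )*
    )*
    (?: \r\n )*                           # trailing CRLFs
    $
}

# Classify a blob that was cut at a CRLFCRLF:
#   complete    the blob matches as it stands
#   open-quote  it only matches once one double quote is appended, so the
#               CRLFCRLF was inside a quoted string: keep accumulating
#   error       neither form matches: a more serious syntax error
proc classify {blob} {
    global HDRS
    if {[regexp $HDRS $blob]} {
        return complete
    }
    if {[regexp $HDRS "${blob}\""]} {
        return open-quote
    }
    return error
}
```

On open-quote the caller would keep appending input up to the next
CRLFCRLF and call classify again.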

> Unless and until I or someone else can produce a regular expression
> which can correctly parse all headers into tokens, I'm stuck with the
> current char-by-char code.

I'm interested in helping you pursue the regexp approach. I really
think it holds the key to the fastest and most elegant solution.

> All I can say is that the current method seems to offer lots of
> flexibility and is relatively easy to understand. The regexp code
> seems to require lots of testing, and so far the testing eventually
> finds cases which are not handled correctly.

I admit a regexp might be tricky to read for the unaccustomed eye, but
don't forget that the regexp compiler will catch additional
inconsistencies (parenthesis imbalance) that a hand-written automaton
will happily get away with... So all in all, I still believe regexps
are superiorly maintainable.
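
A small illustration of that point, as a sketch: Tcl refuses an
unbalanced pattern the first time it is used, whatever the input. The
helper name compiles is hypothetical.

```tcl
# Returns 1 if Tcl accepts the pattern, 0 if the regexp compiler rejects it.
# The pattern is compiled on first use, so matching against an empty
# string is enough to trigger the check.
proc compiles {re} {
    expr {![catch {regexp -- $re ""}]}
}

set good {^([^:]+):(.*)$}
set bad  {^([^:]+:(.*$}   ;# closing parentheses forgotten

foreach re [list $good $bad] {
    if {[compiles $re]} {
        puts "accepted: $re"
    } else {
        puts "rejected: $re"
    }
}
```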

-Alex

From: tom.rmadilo on
On Feb 23, 1:44 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> I admit a regexp might be tricky to read for the unaccustomed eye, but
> don't forget that the regexp compiler will catch additional
> inconsistencies (parenthesis imbalance) that a hand-written automaton
> will happily get away with... So all in all, I still believe regexps
> are superiorly maintainable.

I have put down my regexp testing code here:

http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=regexp/token.tcl

What I found so far is that the regexp does work, but that it is (so
far) easily fooled by two mistakes which balance each other out. It
also happily ignores what doesn't match and produces what does match.
Anyway, the basic problem is that my regexps have parsed correct
headers correctly, but also parse incorrect headers.

I'm also confused as to how to accumulate input without potentially
blocking or causing other problems.
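
One possible way to accumulate input without blocking, sketched under
the assumption that the socket is set to non-blocking mode and driven
by fileevent. The names appendChunk, onReadable, pending and
headersDone are hypothetical, and this version ignores the quoted
CRLFCRLF case discussed earlier:

```tcl
# Append a chunk to the named buffer. Returns an empty list while the
# header block is still incomplete, otherwise a one-element list holding
# the header text; the buffer then contains whatever followed the blank line.
proc appendChunk {bufVar chunk} {
    upvar 1 $bufVar buf
    append buf $chunk
    set end [string first "\r\n\r\n" $buf]
    if {$end < 0} {
        return {}
    }
    set headers [string range $buf 0 [expr {$end + 1}]]
    set buf     [string range $buf [expr {$end + 4}] end]
    return [list $headers]
}

# Readable handler: on a non-blocking channel, read returns at once with
# whatever bytes are available, so the event loop is never stalled.
proc onReadable {sock} {
    global pending
    set chunk [read $sock]
    if {$chunk eq "" && [eof $sock]} {
        close $sock
        return
    }
    set result [appendChunk pending($sock) $chunk]
    if {[llength $result]} {
        fileevent $sock readable {}
        headersDone $sock [lindex $result 0]   ;# hypothetical callback
    }
}

# Wiring, assuming $sock is an open client socket:
#   fconfigure $sock -blocking 0 -translation binary
#   fileevent  $sock readable [list onReadable $sock]
```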

So the first thing needed is a regexp which can distinguish between
correct and incorrect headers and parse correct headers into tokens.
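
A sketch of why an unanchored scan accepts bad input while an anchored
pattern does not. The token and line grammars here are deliberately
simplified assumptions, not the HTTP grammar, and the proc names are
hypothetical:

```tcl
# Unanchored scan: returns every run that looks like a token and
# silently skips anything in between.
proc scanTokens {line} {
    regexp -all -inline {[A-Za-z0-9/-]+} $line
}

# Anchored check: the whole line has to fit "name: value(; attr=val)*",
# so stray characters make the match fail instead of being ignored.
proc validLine {line} {
    regexp {^[A-Za-z0-9-]+:[ \t]*[A-Za-z0-9/-]+(?:;[ \t]*[A-Za-z0-9-]+=[A-Za-z0-9-]+)*$} $line
}

set good "Content-Type: text/html; charset=utf-8"
set bad  "Content-Type: text/html;; charset"

foreach line [list $good $bad] {
    puts "valid=[validLine $line] tokens=[scanTokens $line]"
}
```

Validating with the anchored pattern first and extracting tokens only
from lines that pass keeps the two jobs separate.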

From: Alexandre Ferrieux on
On Feb 23, 5:19 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
> I have put down my regexp testing code here:
>
> http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=regex...
>
> What I found so far is that the regexp does work, but that it is (so
> far) easily fooled by two mistakes which balance each other out. It
> also happily ignores what doesn't match and produces what does match.
> Anyway, the basic problem is that my regexps have parsed correct
> headers correctly, but also parse incorrect headers.
>
> I'm also confused as to how to accumulate input without potentially
> blocking or causing other problems.
>
> So the first thing needed is a regexp which can distinguish between
> correct and incorrect headers and parse correct headers into tokens.

I suspect some kind of escaping hell in those dynamically built
regexps...
Why don't you start again from the constant, braced, commented regexp
we've built together?
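
To illustrate the difference, here is a toy pattern written both ways.
It is only a sketch and not the regexp from last fall; the variable
names are hypothetical:

```tcl
# Built dynamically inside double quotes: every backslash, bracket and
# dollar sign has to survive Tcl substitution before the regexp engine
# ever sees it.
set name "\[^\\s:\]+"
set dyn  "^($name):\\s*(.*)\$"

# The same pattern as a constant in braces, using expanded syntax (?x)
# so that white space is ignored and comments are allowed.
set const {(?x)
    ^ ( [^\s:]+ )    # field name
    : \s*
    ( .* ) $         # field value
}

foreach re [list $dyn $const] {
    if {[regexp $re "Host: example.org" -> n v]} {
        puts "name=$n value=$v"
    }
}
```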

-Alex