From: Helmut Richter on 12 Feb 2010 06:40

For a seemingly simple problem with regular expressions I tried out several solutions. One of them seems to be working now, but I would like to learn why the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;, but not within markup. The following code reads and consumes a variable $inbuf0 and builds up a variable $inbuf with the result.

Solution 1:

    while ($inbuf0) {
      $inbuf0 =~ /^(?:                   # skip initial sequences of
          [^<\&#\$\\]+                   # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>
                                         # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>
                                         # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));
                                         # entity or character references
        | <!--(?:.|\n)*?-->              # comments
        | <[?](?:.|\n)*?[?]>             # processing instructions, etc.
        )*/x;
      $inbuf .= $&;
      $inbuf0 = $';
      if ($inbuf0) {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
      };
    };

Here the regexp eats up the maximal initial string (note the * at the end of the regexp) that need not be processed and then processes the first character of the remainder.

In this version, it sometimes works and sometimes blows up with a segmentation fault.

Another version has * instead of + at the "harmless characters". That one does not try all alternatives, as the first one always matches; that is, the * at the end of the regexp is not used in this case.

Yet another version has nothing instead of + at the "harmless characters", thus eating zero or one character per iteration of the final *. This should have the same net effect, but it always blows up with a segmentation fault.
Solution 2:

    while ($inbuf0) {
      if ($inbuf0 =~ /^                  # skip initial
          [^<\&#\$\\]+                   # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>
                                         # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>
                                         # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));
                                         # entity or character references
        | <!--(?:.|\n)*?-->              # comments
        | <[?](?:.|\n)*?[?]>             # processing instructions, etc.
        /x) {
        $inbuf .= $&;
        $inbuf0 = $';
      } else {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
      };
    };

Here the regexp eats up an initial string, typically not maximal (note the absence of * at the end of the regexp), that need not be processed and, if nothing has been found, processes the first character of the input.

This version runs considerably slower, by a factor of three, but has so far not yielded segmentation faults. I am using it now.

I am sure there are lots of other ways to do it. What knowledge would have saved me the numerous trial-and-error cycles and let me get it right from the beginning?

--
Helmut Richter
From: Peter Makholm on 12 Feb 2010 07:09

Helmut Richter <hhr-m(a)web.de> writes:

> For a seemingly simple problem with regular expressions I tried out several
> solutions. One of them seems to be working now, but I would like to learn why
> the solutions behave differently. Perl is 5.8.8 on Linux.

The regexp engine in perl 5.8.8 is implemented by recursion. This is known to cause segmentation faults on some occasions. See

http://www.nntp.perl.org/group/perl.perl5.porters/2006/05/msg113036.html

Upgrading to perl 5.10 solves this issue by making the regexp engine iterative instead.

> The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
> but not within markup. The following code reads and consumes a variable
> $inbuf0 and builds up a variable $inbuf with the result.

Trying to handle XML and HTML correctly by parsing it with regular expressions isn't recommended at all. I would use some XML parser and walk through the DOM and change the content of text nodes with the trivial substitution on each text node.

//Makholm
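Editor's sketch, not from the thread: if the recursion explanation above applies, the constructs most likely to drive deep recursion on 5.8.x are the `(...)*` wrapped around the whole alternation and the `(?:.|\n)*?` groups. A minimal way to avoid both, while keeping the regex approach, is to match one token per iteration with `\G` and `/gc` and to use `.` under `/s` for comments and processing instructions. This also avoids `$&` and `$'`, and with them the `$inbuf0 = $'` assignment that copies the rest of the input on every iteration. The names `$MARKUP` and `escape_outside_markup` are made up for this illustration; the token patterns are taken from the question.

```perl
use strict;
use warnings;

# One markup token per match; no quantifier wraps the alternation.
my $MARKUP = qr{
      <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>       # start tag
    | </[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>     # end tag
    | &(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*
        |\#(?:[0-9]+|x[0-9A-Fa-f]+));                        # entity or character reference
    | <!--.*?-->                                             # comment
    | <[?].*?[?]>                                            # processing instruction
}xs;

# Hypothetical helper: returns a copy of $in with every character that is
# neither harmless nor part of a recognised markup token replaced by &#N;.
sub escape_outside_markup {
    my ($in) = @_;
    my $out = '';
    pos($in) = 0;
    while (pos($in) < length $in) {
        if ($in =~ /\G([^<&#\$\\]+)/gc) {      # run of harmless characters
            $out .= $1;
        }
        elsif ($in =~ /\G($MARKUP)/gc) {       # one complete markup token
            $out .= $1;
        }
        else {                                 # anything else: escape one character
            $in =~ /\G(.)/gcs;
            $out .= '&#' . ord($1) . ';';
        }
    }
    return $out;
}

print escape_outside_markup('a#b <a href="#x">$</a> &amp; \\'), "\n";
# prints: a&#35;b <a href="#x">&#36;</a> &amp; &#92;
```

Because each pattern is written as `\G(...)`, the anchor applies to the alternation as a whole. In a pattern of the form `/^ A | B | C/x` without a group, `^` anchors only the first alternative, and the remaining alternatives may match anywhere in the string.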
From: Jürgen Exner on 12 Feb 2010 11:03

Helmut Richter <hhr-m(a)web.de> wrote:

>For a seemingly simple problem with regular expressions I tried out several
>solutions. One of them seems to be working now, but I would like to learn why
>the solutions behave differently. Perl is 5.8.8 on Linux.
>
>The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
>but not within markup.
[...]

You may want to read up on the Chomsky hierarchy. HTML is not a regular language but a context-free language. Therefore it cannot be parsed by a regular engine. Granted, Perl's regular expressions have extensions that make them significantly more powerful than a formal regular engine, but they are still the wrong tool for the job.

Use any standard HTML parser to dissect your file into its components and then apply your substitution to those components where you want it applied.

jue
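Editor's sketch, not from the thread: one way to follow this advice without building a DOM is an event-based tokenizer. The example below assumes the CPAN module HTML::Parser is installed (it is not part of core Perl). It copies tags, comments, declarations and processing instructions through verbatim and applies the substitution only to text events. The function name `escape_text_nodes` is made up for this illustration. Existing entity references inside text are left alone, since the `text` argspec delivers them undecoded.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

# Hypothetical helper: escape # $ \ in text, copy all markup unchanged.
sub escape_text_nodes {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Text between tags: keep entity references, escape the three characters.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s{(&\#?\w+;)|([#\$\\])}
                      {defined $1 ? $1 : '&#' . ord($2) . ';'}ge;
            $out .= $text;
        }, 'text' ],
        # Every other event: append the original source text as-is.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);   # deliver each run of text in one piece
    $p->parse($html);
    $p->eof;
    return $out;
}

print escape_text_nodes('<a href="#top">5 $ # &amp;</a>'), "\n";
```

HTML::Parser tokenizes leniently rather than rejecting input that is not well-formed. Note that this sketch treats the content of script and style elements like any other text, which may not be what is wanted.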
From: Helmut Richter on 12 Feb 2010 11:52

On Fri, 12 Feb 2010, wrote:

> You may want to read up on the Chomsky hierarchy. HTML is not a regular
> language but a context-free language. Therefore it cannot be parsed by a
> regular engine.

But the distinction between markup and non-markup is. The only parenthesis-like structure I have found so far is the nesting of brackets in <![CDATA[ ... ]]>, but this is also regular, as ]]> cannot occur inside.

*If* I were interested in the semantics of the tags, I would probably follow the advice given here to use an XML analyser, provided I keep control of what to do when the input is not well-formed XML. Just being told "your data is not okay, so cannot do anything with it" would not suffice: even in an environment where the end user has full control of everything, it is not always the best idea to have him fix every error before proceeding; sometimes it is better to leave errors in the input and fix them at a later step.

--
Helmut Richter
From: Dr.Ruud on 12 Feb 2010 13:41
Helmut Richter wrote:

> [again parsing the wrong way]

Is there a newsgroup or mailing list that we can refer "them" to? I am sure that we are well past our monthly share already.

--
Ruud