Help with regular expression [Perl]

Prev: FAQ 4.1 Why am I getting long decimals (eg, 19.9499999999999) instead of the numbers I should be getting (eg, 19.95)?
Next: FAQ 3.24 Why don't Perl one-liners work on my DOS/Mac/VMS system?

From: sln on 3 Aug 2010 14:20

On Sun, 1 Aug 2010 19:47:18 +0000 (UTC), Mark Hobley <markhobley(a)yahoo.donottypethisbit.co> wrote:

>On Sun, 25 Jul 2010 22:04:25 +0200, Peter J. Holzer wrote:
>
>> Is this a match?
>>
>> (((1 + 2) * (3 +4)))
>
>Yes. That is a match.

Then
((3 * bar) + ((foo))) - This is a match
((3 * bar) + ((foo))bar) - This is a match.
is matched by ((foo)) only ...
It can't be both ways.

You have 2 requirements, balanced double parenths
that does not include balanced single parenths.

Pretty easy actually. Do you know regular expressions?

-sln

--------------------
use strict;
use warnings;

# Require (())
my $x = quotemeta '((';
my $y = quotemeta '))';

# Require not ()
my $m = quotemeta '(';
my $n = quotemeta ')';

my $regex = qr/
(
$x
(?:
(
$m
(?:
(?>(?:(?!$x|$y|$m|$n).)+)
| (?2)
)*
$n
)
|
(?>(?:(?!$x|$y|$m|$n).)+)
| (?1)
)*
$y
)
/xs;

while (<DATA>) {
if (/$regex/) {
print "$1\n";
}
}

__DATA__

I need a regular expression with the following properties.
I need to match text (typically, though not necessarily expressions)
enclosed within double parentheses. However, I do not want to match nested
single parentheses enclosed text.

So ((*)) is a match, but ((*)*(*)) is not a match.
Here are some examples to illustrate this.

((FOO)) - This is a match
(()) - This is a match
((3 + 2)) - This is a match
((3 + 2) + (2 * foo)) - This is not a match
((3 * bar) + ((foo))) - This is a match
((3 * bar) + ((foo))bar) - This is a match.

I hope that lot makes sense.

Thanks in advance to anyone who can help.

--
Mark Hobley
Linux User: #370818 http://markhobley.yi.org/

--- news://freenews.netfront.net/ - complaints: news(a)netfront.net ---

On Sun, 25 Jul 2010 22:04:25 +0200, Peter J. Holzer wrote:

> Is this a match?
>
> (((1 + 2) * (3 +4)))

Yes. That is a match.

--
Mark Hobley
Linux User: #370818 http://markhobley.yi.org/

--- news://freenews.netfront.net/ - complaints: news(a)netfront.net ---

From: Ted Zlatanov on 3 Aug 2010 14:42

On Tue, 3 Aug 2010 18:40:09 +0200 Helmut Richter <hhr-m(a)web.de> wrote:

HR> On Tue, 3 Aug 2010, Ted Zlatanov wrote:
>> I don't think even a grammar will help. The requirements are
>> fundamentally broken because there's more than one way to interpret
>> nested parens.

HR> I do not think so:

HR> Let X be the regular language of nonempty words not containing any
HR> parentheses. Then the language L of words that are double-parenthesis
HR> enclosed is:

HR> L -> (( inside ))
HR> inside -> inside1 | inside2
HR> inside1 -> X | inside1 single-paren | inside1 X
HR> inside2 -> single-paren X | single-paren single-paren | inside2 X
HR> | inside2 single-paren
HR> single-paren -> ( inside ) | ( )

HR> "inside1" should be the language of all properly nested strings that do not
HR> begin with "(", and "inside2" the language of all properly nested strings that
HR> begin with "(" except when the last token is the matching ")".

HR> Not that I find that grammar pretty or easy to parse -- but at least it is not
HR> ambiguous.

I didn't parse the requirements that way, but that's probably my error.
Thanks for explaining.

Ted

From: sln on 3 Aug 2010 14:57

On Tue, 03 Aug 2010 10:54:26 -0500, Ted Zlatanov <tzz(a)lifelogs.com> wrote:

>On Mon, 2 Aug 2010 00:10:44 +0200 "Peter J. Holzer" <hjp-usenet2(a)hjp.at> wrote:
>
>PJH> Then the problem cannot be solved with a real regular expression.
>...
>PJH> I agree with Eric: Write a proper grammar and use that to parse your
>PJH> expressions. If you've ever heard of BNF, using Parse::Yapp or
>PJH> Parse::RecDescent shouldn't be too hard (I prefer the former, although
>PJH> the docs assume that you are already familiar with yacc).
>
>I don't think even a grammar will help. The requirements are
>fundamentally broken because there's more than one way to interpret
>nested parens. The OP should explain what he's trying to do and give
>real-world examples he needs parsed.
>
>Also, a grammar is pretty slow compared to regular expressions. So I
>always hesitate before recommending it for anything except
>low-throughput situations, e.g. input submitted by a user or small
>files.
>
>Ted

I don't think the requirements are broken at all.
As soon as the OP said that neither of the double
parenths can be part of an inner single parenth,
it pretty much made it complete. Trivial or not
I believe its a complete req, that can be done with
a simple regular expression.

The match will satisfy the requirements, however,
outlier parenths may not be balanced relative to the
match. Though, additional expressions could be added
to balance the complete text.

-sln

From: Helmut Richter on 3 Aug 2010 16:11

On Tue, 3 Aug 2010, Ted Zlatanov wrote:

> On Tue, 3 Aug 2010 18:40:09 +0200 Helmut Richter <hhr-m(a)web.de> wrote:
>
> HR> On Tue, 3 Aug 2010, Ted Zlatanov wrote:
> >> I don't think even a grammar will help. The requirements are
> >> fundamentally broken because there's more than one way to interpret
> >> nested parens.
>
> HR> I do not think so:
>
> HR> Let X be the regular language of nonempty words not containing any
> HR> parentheses.

No, this is an error, albeit easy to fix. X should be the language of
one-token words where the token is not a parenthesis. Otherwise
concatenating the X would introduce an ambiguity.

> HR> Then the language L of words that are double-parenthesis
> HR> enclosed is:
>
> HR> L -> (( inside ))
> HR> inside -> inside1 | inside2
> HR> inside1 -> X | inside1 single-paren | inside1 X
> HR> inside2 -> single-paren X | single-paren single-paren | inside2 X
> HR> | inside2 single-paren
> HR> single-paren -> ( inside ) | ( )
>
> HR> "inside1" should be the language of all properly nested strings that do not
> HR> begin with "(", and "inside2" the language of all properly nested strings that
> HR> begin with "(" except when the last token is the matching ")".
>
> HR> Not that I find that grammar pretty or easy to parse -- but at least it is not
> HR> ambiguous.

Well, it *is* easy to parse: nearly LR(0) with the only exception that
there is a minor shift-reduce conflict when ")" is encountered. So it is
certainly LR(1). Writing the L rule as two rules makes it even a
precedence grammar for several notions of precedence. This allows writing
a parser by hand, whereas LR parsers should better be generated.

> I didn't parse the requirements that way, but that's probably my error.

Well, when I set up the grammar, there were ambiguities of interpretation
of the requirements. This is just my interpretation. But at least I chose
it because I found it to be the most plausible, and not in order that
parsing be possible.

I am not sure the extended notion of regexp in perl, which goes beyond
regular languages, cannot be used to parse such a thing. After all, regexp
handling involves backtracking, which is not normally considered a good
technique in context-free parsing.

--
Helmut Richter

From: sln on 3 Aug 2010 17:03

On Tue, 3 Aug 2010 22:11:16 +0200, Helmut Richter <hhr-m(a)web.de> wrote:

>I am not sure the extended notion of regexp in perl, which goes beyond
>regular languages, cannot be used to parse such a thing. After all, regexp
>handling involves backtracking, which is not normally considered a good
>technique in context-free parsing.

But mixing expressions without backtracking with expressions with
back tracking is a feature of extended notation.

-sln

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: FAQ 4.1 Why am I getting long decimals (eg, 19.9499999999999) instead of the numbers I should be getting (eg, 19.95)?
Next: FAQ 3.24 Why don't Perl one-liners work on my DOS/Mac/VMS system?