How to print all expressions that match a regular expression [Python]

Prev: intolerant HTML parser
Next: how to fix bugs (python)

From: Alf P. Steinbach on 6 Feb 2010 19:51

* Steven D'Aprano:
> On Sat, 06 Feb 2010 16:05:15 -0800, hzhuo1(a)gmail.com wrote:
>
>> Thanks for your reply.
>> So there isn't such a routine just because some of the regular
>> expressions cannot be enumerated. However, some of them can be
>> enumerated. I guess I have to write a function myself.
>
> How do you expect to tell the ones that can be enumerated apart from
> those that can't be?
>
> Regular expressions are programs in a "regex" programming language. What
> you are asking for is the same as saying:
>
> "Is there a program that can enumerate every possible set of data that is
> usable as valid input for a given program?"
>
> This, in turn, is equivalent to the Halting Problem -- if you can solve
> one, you can solve the other. You might like to google on the Halting
> Problem before you spend too much time on this.

Hm, well, text editors /regularly/ do repeated regular expression searches,
producing match after match after match, on request.

To use that /expression/, it seems that Theory is yet again up against Hard Reality.

In such a contest where something doesn't quite /match/, is the Map, the
Terrain, or perhaps the Interpretation of how the Map applies to Terrain, at fault?

> (Note, however, it isn't necessary to solve the Halting Problem for *all*
> cases in order to have a useful Endless Loop Detector program.)
>
> Why do you think you need this? Seems to me you're starting on an
> extraordinarily difficult job. I hope the benefit is equally
> extraordinary.

Depending on the application there may be more efficient ways than applying a
general purpose regexp matcher.

Don't know about modern *nix but in the old days there were different greps for
different purposes, egrep, fgrep, whatever.

Aside: the only article by Niklaus Wirth that I can remember reading was about
how to transform algorithms to more efficient ones by exploiting the invariants,
and one of his examples was simple text searching, where you can advance the
pattern a number of characters depending on the current non-matching character.

> [Aside: Python regexes aren't Turing Complete. I'm not sure about Perl
> regexes. Either way, this might actually be less difficult than the
> Halting Problem, as in "amazingly difficult" rather than "impossible".]

Cheers,

- Alf

From: Steven D'Aprano on 6 Feb 2010 20:28

On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:

>> Regular expressions are programs in a "regex" programming language.
>> What you are asking for is the same as saying:
>>
>> "Is there a program that can enumerate every possible set of data that
>> is usable as valid input for a given program?"
>>
>> This, in turn, is equivalent to the Halting Problem -- if you can solve
>> one, you can solve the other. You might like to google on the Halting
>> Problem before you spend too much time on this.
>
> Hm, well, text editors /regularly/ do repeated regular expression
> searches, producing match after match after match, on request.

I think you have completely misunderstood what I'm saying.

I'm not saying that you can't *run* a regular expression against text and
generate output. That truly would be a stupid thing to say, because I
clearly can do this:

>>> import re
>>> mo = re.search("p.rr.t",
.... "Some text containing parrots as well as other things")
>>> mo.group()
'parrot'

As you point out, it's not hard to embed a regex interpreter inside a
text editor or other application, or to call an external library.

What is difficult, and potentially impossible, is to take an arbitrary
regular expression such as "p.rr.t" (the program in the regex language)
and generate every possible data ("parrot", "pbrrat", ...) that would
give a match when applied to that regular expression.

Now, in this case, my example is very simple, and it would be easy to
enumerate every possible data: there's only 65025 of them, limiting to
the extended ASCII range excluding NUL (1-255). But for an arbitrary
regex, it won't be that easy. Often it will be unbounded: the example of
enumerating every string that matches .* has already been given.

The second problem is, generating the data which gives the output you
want is potentially very, very, difficult, potentially as difficult as
finding collisions in cryptographic hash functions:

"Given the function hashlib.sha256, enumerate all the possible inputs
that give the hexadecimal result
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."

This too is unbounded, but you'll have your work cut out just to find
*one* match, let alone an infinite number of them.

(In this specific example, your best bet is to try a crib: knowing what
newsgroup this is, and knowing what I've written in the past, the message
is predictable for being totally unexpected. And yes, that's a hint. A
shiny penny for the first person to guess what it is.)

I'm suggesting that, in general, there's no way to tell in advance which
regexes will be easy and which will be hard, and even when they are easy,
the enumeration will often be infinite.

--
Steven

From: MRAB on 6 Feb 2010 20:48

Alf P. Steinbach wrote:
> * Steven D'Aprano:
>> On Sat, 06 Feb 2010 16:05:15 -0800, hzhuo1(a)gmail.com wrote:
>>
>>> Thanks for your reply.
>>> So there isn't such a routine just because some of the regular
>>> expressions cannot be enumerated. However, some of them can be
>>> enumerated. I guess I have to write a function myself.
>>
>> How do you expect to tell the ones that can be enumerated apart from
>> those that can't be?
>>
>> Regular expressions are programs in a "regex" programming language.
>> What you are asking for is the same as saying:
>>
>> "Is there a program that can enumerate every possible set of data that
>> is usable as valid input for a given program?"
>>
>> This, in turn, is equivalent to the Halting Problem -- if you can
>> solve one, you can solve the other. You might like to google on the
>> Halting Problem before you spend too much time on this.
>
> Hm, well, text editors /regularly/ do repeated regular expression
> searches, producing match after match after match, on request.
>
[snip]
I'm not sure you understood what the OP was requesting: a way of
generating the strings which would match a given regex.

From: Grant Edwards on 6 Feb 2010 21:44

On 2010-02-06, Roy Smith <roy(a)panix.com> wrote:

>> I am a fresh man with python. I know there is regular expressions in
>> Python. What I need is that given a particular regular expression,
>> output all the matches.
[..]

> Please enumerate all the strings which match ".*". Use additional sheets
> of paper if needed.

And be sure to show your work.

--
Grant

From: Alf P. Steinbach on 6 Feb 2010 21:53

* Steven D'Aprano:
> On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:
>
>>> Regular expressions are programs in a "regex" programming language.
>>> What you are asking for is the same as saying:
>>>
>>> "Is there a program that can enumerate every possible set of data that
>>> is usable as valid input for a given program?"
>>>
>>> This, in turn, is equivalent to the Halting Problem -- if you can solve
>>> one, you can solve the other. You might like to google on the Halting
>>> Problem before you spend too much time on this.
>> Hm, well, text editors /regularly/ do repeated regular expression
>> searches, producing match after match after match, on request.
>
> I think you have completely misunderstood what I'm saying.

Yes.

> I'm not saying that you can't *run* a regular expression against text and
> generate output. That truly would be a stupid thing to say, because I
> clearly can do this:
>
>>>> import re
>>>> mo = re.search("p.rr.t",
> ... "Some text containing parrots as well as other things")
>>>> mo.group()
> 'parrot'
>
> As you point out, it's not hard to embed a regex interpreter inside a
> text editor or other application, or to call an external library.
>
> What is difficult, and potentially impossible, is to take an arbitrary
> regular expression such as "p.rr.t" (the program in the regex language)
> and generate every possible data ("parrot", "pbrrat", ...) that would
> give a match when applied to that regular expression.

Hm, that's not difficult to do, it's just like counting, but it's rather
meaningless since either the output is trivial or it's in general exponential or
infinite.

So it seems we both misunderstood the problem.

I didn't read the top level article until now, and reading it, I can't make
sense of it.

It sounds like some kind of homework problem, but without the constraints that
surely would be there in a homework problem.

> Now, in this case, my example is very simple, and it would be easy to
> enumerate every possible data: there's only 65025 of them, limiting to
> the extended ASCII range excluding NUL (1-255). But for an arbitrary
> regex, it won't be that easy. Often it will be unbounded: the example of
> enumerating every string that matches .* has already been given.
>
> The second problem is, generating the data which gives the output you
> want is potentially very, very, difficult, potentially as difficult as
> finding collisions in cryptographic hash functions:
>
> "Given the function hashlib.sha256, enumerate all the possible inputs
> that give the hexadecimal result
> 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."

I tried some "parrot" variants but no dice. :-(

[snip]

> I'm suggesting that, in general, there's no way to tell in advance which
> regexes will be easy and which will be hard, and even when they are easy,
> the enumeration will often be infinite.

I agree about the (implied) meaningless, exponential/infinite output, which
means that possibly that's not what the OP meant, but disagree about the
reasoning about "no way": really, regular expressions are /very/ limited so it's
not hard to compute up front the number of strings it can generate from some
given character set, in time linear in the length of the regexp.

Essentially, any regexp that includes '+' or '*' (directly or via e.g. notation
that denotes "digit sequence") yields an infinite number of strings.

And otherwise the regexp is of the form ABCDE..., where A, B, C etc are parts
that each can generate a certain finite number of strings; multiplying these
numbers gives the total number of strings that the regexp can generate.

Cheers,

- Alf

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: intolerant HTML parser
Next: how to fix bugs (python)