From: Chris McDonald on
Hello All,

[please excuse the Subject: line, as I'm unsure of the best description]

I'm seeking pointers to a C library that provides basic-regular-expression (BRE)
pattern matching *and* permits me to define the equality of atoms.

C's standard qsort() function is able to sort vectors of objects by
calling-back to the user to ask about the relative order of two objects.
To permit sorting of arbitrary objects, the caller passes to qsort()
the length of each object, and two pointers are passed back to the
user-provided comparison function.
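
For readers less familiar with the qsort() convention described above, here is a minimal sketch. The struct point type and the names compare_points and sort_points are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object type, used only for illustration. */
struct point { int x, y; };

/* Comparison callback: qsort() hands back two pointers into the vector.
   Orders points by x, then by y. */
static int compare_points(const void *a, const void *b)
{
    const struct point *p = a, *q = b;

    if (p->x != q->x)
        return (p->x > q->x) - (p->x < q->x);
    return (p->y > q->y) - (p->y < q->y);
}

/* Sort n points in place. qsort() is told only the element count, the
   size of each element, and which function to call back. */
static void sort_points(struct point *v, size_t n)
{
    assert(v != NULL || n == 0);
    qsort(v, n, sizeof v[0], compare_points);
}
```

qsort() never inspects the objects itself; everything it learns about them comes through the callback, which is the property being asked for in a pattern matcher.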

I'm seeking something similar for regular expressions, appreciating that
some features (such as back-references) may become impossible.

At the heart of RE implementations are hundreds of inline comparisons
to check if char1 == char2. However, I would like:

- char1 and char2 to be my objects, not characters, and
- for my comparison function to be called each time == is required.

Because we're no longer comparing characters, we can't simply provide
them in the RE to be matched. Thus I'm imagining a mechanism where
identifiers in the pattern represent members of the input alphabet.

For example, we have an alphabet vector alphabet[] = {obj1, obj2, obj3},
and we're seeking the regular expression "1.*[23]", where the 1, 2, and 3
represent the 1st, 2nd, and 3rd objects from the alphabet. A call to

    int match(const char *pattern,
              size_t objectSize,
              void *inputAlphabet, size_t nAlphabet,
              void *inputVector, size_t lenInput,
              int (*compareObjects)(const void *, const void *));

will call my compareObjects() function many times, and it will return 0/1.
Even better (for my application) would be if the user-code retained its
own alphabet, reducing this to:

    int match(const char *pattern,
              void *inputVector, size_t lenInput, size_t objectSize,
              int (*compareObjects)(int alphabetIndex, const void *element));

or even, best of all:

    int match(const char *pattern,
              size_t lenInput,
              int (*compareObjects)(int alphabetIndex, int inputIndex));

Using this last approach, the value of alphabetIndex could represent a
private function/predicate requiring evaluation (which match() certainly
doesn't care about) e.g. (ignoring errors):

    int compareObjects(int alphabetIndex, int inputIndex) {
        return (predicates[alphabetIndex])(inputVector[inputIndex]);
    }
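
To make the last prototype concrete, here is a minimal sketch of such a matcher. It is not taken from any existing library. It assumes a well-formed pattern built only from the digits 1-9 (alphabet indices), '.', '*' and bracket lists such as [23], and it requires the whole input to match. The helper names and the demo data are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*compare_fn)(int alphabetIndex, int inputIndex);

/* Length in bytes of the atom starting at p: "[...]" or one character. */
static size_t atom_len(const char *p)
{
    if (*p == '[') {
        const char *close = strchr(p, ']');
        if (close != NULL)
            return (size_t)(close - p) + 1;
    }
    return 1;
}

/* Does the atom of length n at p accept input element i?  Every
   comparison is delegated to the caller's callback. */
static int atom_matches(const char *p, size_t n, int i, compare_fn cmp)
{
    if (*p == '.')
        return 1;
    if (*p == '[' && n > 1) {
        for (size_t k = 1; k + 1 < n; k++)
            if (cmp(p[k] - '0', i))
                return 1;
        return 0;
    }
    return cmp(*p - '0', i);
}

/* Backtracking match of pattern p against input elements i..len-1. */
static int match_here(const char *p, size_t i, size_t len, compare_fn cmp)
{
    if (*p == '\0')
        return i == len;
    size_t n = atom_len(p);
    if (p[n] == '*') {
        for (;;) {      /* try zero or more repetitions, shortest first */
            if (match_here(p + n + 1, i, len, cmp))
                return 1;
            if (i == len || !atom_matches(p, n, (int)i, cmp))
                return 0;
            i++;
        }
    }
    if (i < len && atom_matches(p, n, (int)i, cmp))
        return match_here(p + n, i + 1, len, cmp);
    return 0;
}

int match(const char *pattern, size_t lenInput, compare_fn compareObjects)
{
    assert(pattern != NULL && compareObjects != NULL);
    return match_here(pattern, 0, lenInput, compareObjects);
}

/* Demo data (hypothetical): the input is a vector of ints, and alphabet
   index 1 means "negative", 2 means "zero", 3 means "positive". */
static const int demo_input[] = { -5, 7, 0, -2, 9 };

static int demo_compare(int alphabetIndex, int inputIndex)
{
    int v = demo_input[inputIndex];

    switch (alphabetIndex) {
    case 1:  return v < 0;
    case 2:  return v == 0;
    case 3:  return v > 0;
    default: return 0;
    }
}
```

With this demo data, match("1.*[23]", 5, demo_compare) succeeds, because the input starts with a negative value and ends with a positive one. Backtracking is exponential in the worst case; a production matcher would more likely compile the pattern to an automaton and call the callback on each transition.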

Googling has uncovered

- the TRE library (http://laurikari.net/tre/documentation/reguexec/)
- and Ragel (http://www.complang.org/ragel/), promising

but neither quite, or easily, meets my requirements.

I'm not expecting the perfect library, and am quite willing to investigate
and modify. Does anyone know of any suitable/similar library?

Thanks in advance,

______________________________________________________________________________
Dr Chris McDonald E: chris(a)csse.uwa.edu.au
Computer Science & Software Engineering W: http://www.csse.uwa.edu.au/~chris
The University of Western Australia, M002 T: +618 6488 2533
Crawley, Western Australia, 6009 F: +618 6488 1089
--
comp.lang.c.moderated - moderation address: clcm(a)plethora.net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry.
From: Jasen Betts on
On 2010-01-21, Chris McDonald <chris(a)csse.uwa.edu.au> wrote:
> Hello All,
>
> [please excuse the Subject: line, as I'm unsure of the best description]
>
> I'm seeking pointers to a C library that provides basic-regular-expression (BRE)
> pattern matching *and* permits me to define the equality of atoms.

> Dr Chris McDonald E: chris(a)csse.uwa.edu.au
> Computer Science & Software Engineering W: http://www.csse.uwa.edu.au/~chris

maybe get a postgrad student to write you one :)

From: Pascal J. Bourguignon on
Chris McDonald <chris(a)csse.uwa.edu.au> writes:
> I'm seeking pointers to a C library that provides basic-regular-expression (BRE)
> pattern matching *and* permits me to define the equality of atoms.

By the way, using regex(3), you can easily define the "equality" of
your atoms and match regular expressions for any kind of objects, as
long as you have less than 256 classes of objects.
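
One possible reading of this suggestion, as a sketch only: classify each object into a single byte, build a string of class bytes, and hand that string to regcomp()/regexec(). The classify() function, the letter codes and the match_classes() name are assumptions made for illustration. Letters are used as class codes so that no class collides with NUL or a regex metacharacter.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical classifier: maps an object (here an int) to a class letter.
   The user's notion of "equality" lives entirely in this function. */
static char classify(int v)
{
    return v < 0 ? 'a' : v == 0 ? 'b' : 'c';
}

/* Returns 1 if the class string of the input matches the BRE,
   0 if it does not, and -1 on error. */
static int match_classes(const char *bre, const int *input, size_t n)
{
    regex_t re;
    char *text;
    int rc;

    assert(bre != NULL && (input != NULL || n == 0));

    text = malloc(n + 1);
    if (text == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        text[i] = classify(input[i]);
    text[n] = '\0';

    /* No REG_EXTENDED flag, so the pattern is compiled as a BRE. */
    if (regcomp(&re, bre, REG_NOSUB) != 0) {
        free(text);
        return -1;
    }
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    free(text);

    return rc == 0 ? 1 : rc == REG_NOMATCH ? 0 : -1;
}
```

For the input {-5, 7, 0, -2, 9} the class string is "acbac", so the earlier example pattern "1.*[23]" would be written "^a.*[bc]$" here. The trade-off is that each object is classified once up front, rather than compared on demand as the matcher runs.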


--
__Pascal Bourguignon__ http://www.informatimago.com/
From: Victor Porton on
On Jan 24, 8:50 pm, p...(a)informatimago.com (Pascal J. Bourguignon)
wrote:
> Chris McDonald <ch...(a)csse.uwa.edu.au> writes:
> > I'm seeking pointers to a C library that provides basic-regular-expression (BRE)
> > pattern matching *and* permits me to define the equality of atoms.
>
> By the way, using regex(3), you can easily define the "equality" of
> your atoms and match regular expressions for any kind of objects, as
> long as you have less than 256 classes of objects.

One more (maybe stupid) idea: Use UTF-8 to encode more than 256
objects.
From: Pascal J. Bourguignon on
Victor Porton <porton.victor(a)gmail.com> writes:

> On Jan 24, 8:50 pm, p...(a)informatimago.com (Pascal J. Bourguignon)
> wrote:
>> Chris McDonald <ch...(a)csse.uwa.edu.au> writes:
>> > I'm seeking pointers to a C library that provides basic-regular-expression (BRE)
>> > pattern matching *and* permits me to define the equality of atoms.
>>
>> By the way, using regex(3), you can easily define the "equality" of
>> your atoms and match regular expressions for any kind of objects, as
>> long as you have less than 256 classes of objects.
>
> One more (maybe stupid) idea: Use UTF-8 to encode more than 256
> objects.

That could work, but you have to be extra careful when composing the
regular expression. That is feasible, since you would need an API to
generate the regular expression from arbitrary objects in any case.

Specifically, if you want to match "é*" you actually get the UTF-8
string: char regexp[]={195,169,42,0}; which doesn't mean the same
thing, because the star then applies only to the final byte, 169.
Unfortunately, it's not a simple matter of using groups:
"\\(é\\)*" {92, 40, 195, 169, 92, 41, 42, 0}, since adding a group
shifts the numbers of all the following groups, so you have to
compensate. You have similar problems in brackets: "[eé]"
doesn't mean what you want, so you have to convert it to an alternative:
"\\(e\\|é\\)".
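
As a sketch of what the pattern-building API might do for a single atom, assuming the matcher works byte by byte: encode the atom's code point as UTF-8 and wrap any multi-byte sequence in \( \) so that a following '*' applies to the whole sequence. The names utf8_encode and emit_atom are hypothetical. ASCII metacharacters are not escaped here, and the group renumbering mentioned above is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp (assumed valid, <= 0x10FFFF) as UTF-8 into out.
   Returns the number of bytes written (1 to 4). */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Write one BRE atom for cp into out, NUL-terminated.  Multi-byte atoms
   become \(bytes\); single-byte atoms are copied as-is.  out must have
   room for at least 9 bytes.  Returns the length written, excluding NUL. */
static size_t emit_atom(unsigned long cp, char *out)
{
    size_t n = 0;

    assert(out != NULL);
    if (cp < 0x80) {
        out[n++] = (char)cp;
    } else {
        out[n++] = '\\';
        out[n++] = '(';
        n += utf8_encode(cp, out + n);
        out[n++] = '\\';
        out[n++] = ')';
    }
    out[n] = '\0';
    return n;
}
```

For the code point 0xE9 (é), emit_atom() produces the bytes {92, 40, 195, 169, 92, 41}, which is the grouped form shown above without the trailing star.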


--
__Pascal Bourguignon__ http://www.informatimago.com/