From: Keith Thompson on
Victor Porton <porton.victor(a)gmail.com> writes:
> On Jan 24, 8:50 pm, p...(a)informatimago.com (Pascal J. Bourguignon)
> wrote:
>> Chris McDonald <ch...(a)csse.uwa.edu.au> writes:
>> > I'm seeking pointers to a C library that provides
>> > basic-regular-expression (BRE) pattern matching *and* permits me
>> > to define the equality of atoms.
>>
>> By the way, using regex(3), you can easily define the "equality" of
>> your atoms and match regular expressions for any kind of objects, as
>> long as you have less than 256 classes of objects.
>
> One more (maybe stupid) idea: Use UTF-8 to encode more than 256
> objects.

Another (maybe stupid) idea: encode your objects as sequences
of characters, defined so that the boundary between one encoded
object and the next is unambiguous. Then write regular expressions
that operate on ordinary strings that happen to encode sequences
of objects.

I strongly suspect this is going to be impractical, but I thought I'd
throw it out there anyway.
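
A rough sketch of the encoding idea above, using POSIX regex(3) with
extended syntax. The fixed-width encoding (four hex digits per object id),
the names encode_ids and ids_match, and the buffer sizes are all assumptions
made for illustration; nothing here comes from an existing library. Fixed
width is only one way to keep object boundaries unambiguous.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

#define ID_WIDTH 4  /* assumed: every object id is written as 4 hex digits */

/* Encode ids[0..n-1] into buf as a NUL-terminated string of fixed-width
 * hex fields. Returns 0 on success, -1 if buf is too small. */
static int encode_ids(const unsigned *ids, size_t n, char *buf, size_t bufsize)
{
    if (bufsize < n * ID_WIDTH + 1)
        return -1;
    for (size_t i = 0; i < n; i++) {
        assert(ids[i] <= 0xffffu);  /* must fit in ID_WIDTH hex digits */
        snprintf(buf + i * ID_WIDTH, ID_WIDTH + 1, "%04x", ids[i]);
    }
    buf[n * ID_WIDTH] = '\0';
    return 0;
}

/* Match an extended regular expression against the encoded sequence.
 * Returns 1 on match, 0 on no match, -1 on error. */
static int ids_match(const char *ere, const unsigned *ids, size_t n)
{
    char buf[1024];
    regex_t re;
    int rc;

    if (encode_ids(ids, n, buf, sizeof buf) != 0)
        return -1;
    if (regcomp(&re, ere, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, buf, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this encoding, the original poster's "1.*[23]" would be written as
"^0001([0-9a-f]{4})*(0002|0003)$". The pattern has to be anchored and every
piece of it has to consume whole four-character fields; an unanchored
"0002" can match across the boundary of two neighbouring objects (for
example ids 0x1000 and 0x2abc encode to "10002abc"). That bookkeeping is a
large part of why the approach gets unwieldy.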

--
Keith Thompson (The_Other_Keith) kst-u(a)mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
--
comp.lang.c.moderated - moderation address: clcm(a)plethora.net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry.
From: Dag-Erling Smørgrav on
pjb(a)informatimago.com (Pascal J. Bourguignon) writes:
> Victor Porton <porton.victor(a)gmail.com> writes:
> > One more (maybe stupid) idea: Use UTF-8 to encode more than 256
> > objects.
> That could work, but you have to be extra careful when composing the
> regular expression.

Unless (as I think Victor assumed) your re library supports UTF-8 and
you loaded a UTF-8 locale.

DES
--
Dag-Erling Smørgrav - des(a)des.no
From: Moi on
On Thu, 21 Jan 2010 11:02:08 -0600, Chris McDonald wrote:

> Hello All,
>
> [please excuse the Subject: line, as I'm unsure of the best description]
>
> I'm seeking pointers to a C library that provides
> basic-regular-expression (BRE) pattern matching *and* permits me to
> define the equality of atoms.
>
> C's standard qsort() function is able to sort vectors of objects by
> calling-back to the user to ask about the relative order of two objects.
> To permit sorting of arbitrary objects, the caller passes to qsort() the
> length of each object, and two pointers are passed back to the
> user-provided comparison function.
>
> I'm seeking something similar for regular-expressions, appreciating that
> some features (such as back-patterns) may become impossible.
>
> At the heart of RE implementations are hundreds of inline comparisons to
> check if char1 == char2. However, I would like:
>
> - char1 and char2 to be my objects, not characters, and
> - for my comparison function to be called each time == is required.
>
> Because we're no longer comparing characters, we can't simply provide
> them in the RE to be matched. Thus I'm imagining a mechanism where
> identifiers in the pattern represent members of the input alphabet.
>
> For example, we have an alphabet vector of alphabet[] = {obj1, obj2,
> obj3}, and we're seeking the regular expression "1.*[23]" where the
> 1,2, and 3 represent the 1st, 2nd, 3rd objects from the alphabet. A
> call to
>

What is the expected size of the alphabet?
If it is < 256, creating a mapping between your "objects" and characters would be trivial.
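
A minimal sketch of such a mapping on top of POSIX regex(3), assuming a
made-up object type (struct shape) and a user-supplied classify() function
that stands in for the "equality" callback: two objects are equal exactly
when they fall into the same class. For readability the sketch maps classes
to the letters 'a' through 'z', so it handles only 26 classes; mapping to
arbitrary byte values would allow more but would require escaping any
metacharacters in the pattern. All names here are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical object type; only its class matters for matching. */
struct shape { int sides; };

/* User-supplied classification: class 0 = triangle, 1 = square, 2 = other. */
static unsigned classify(const struct shape *s)
{
    return s->sides == 3 ? 0u : s->sides == 4 ? 1u : 2u;
}

/* Encode objs[0..n-1] as a NUL-terminated string, one letter per object.
 * Returns 0 on success, -1 if buf is too small. */
static int encode(const struct shape *objs, size_t n, char *buf, size_t bufsize)
{
    if (bufsize < n + 1)
        return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned k = classify(&objs[i]);
        assert(k < 26);             /* this sketch only uses 'a'..'z' */
        buf[i] = (char)('a' + k);
    }
    buf[n] = '\0';
    return 0;
}

/* Match a basic regular expression over class letters against the objects.
 * Returns 1 on match, 0 on no match, -1 on error. */
static int objects_match(const char *bre, const struct shape *objs, size_t n)
{
    char buf[256];
    regex_t re;
    int rc;

    if (encode(objs, n, buf, sizeof buf) != 0)
        return -1;
    if (regcomp(&re, bre, REG_NOSUB) != 0)  /* no REG_EXTENDED: BRE syntax */
        return -1;
    rc = regexec(&re, buf, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

In this notation the pattern "1.*[23]" from the original post becomes
"a.*[bc]", with 'a', 'b' and 'c' naming the first, second and third class.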

For larger alphabet sizes, you'd need to start with regexp and try to alter its
basic unit from char to int, with special attention to the metacharacters.
(I guess I'd make them negative.)
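
To give a feel for what "int as the basic unit" might look like, here is a
small sketch. It is not a modified copy of any real regexp library; it is
an adaptation of the well-known compact matcher from Kernighan and Pike's
"The Practice of Programming", changed to work on int arrays. Symbols are
assumed to be positive ints, 0 terminates both pattern and text, and the
metacharacters are negative constants whose names (RE_ANY and so on) are
invented here. Bracket expressions are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Symbols are positive ints; 0 ends a pattern or text; metacharacters are
 * negative so they can never collide with a symbol. */
enum { RE_END = 0, RE_ANY = -1, RE_STAR = -2, RE_BOL = -3, RE_EOL = -4 };

static int match_here(const int *re, const int *text);

/* Match c* followed by re at the start of text (shortest match first). */
static int match_star(int c, const int *re, const int *text)
{
    do {
        if (match_here(re, text))
            return 1;
    } while (*text != RE_END && (*text++ == c || c == RE_ANY));
    return 0;
}

/* Match re at the start of text. */
static int match_here(const int *re, const int *text)
{
    if (re[0] == RE_END)
        return 1;
    if (re[1] == RE_STAR)
        return match_star(re[0], re + 2, text);
    if (re[0] == RE_EOL && re[1] == RE_END)
        return *text == RE_END;
    if (*text != RE_END && (re[0] == RE_ANY || re[0] == *text))
        return match_here(re + 1, text + 1);
    return 0;
}

/* Search for re anywhere in text. Returns 1 on match, 0 otherwise. */
static int int_match(const int *re, const int *text)
{
    assert(re != NULL && text != NULL);
    if (re[0] == RE_BOL)
        return match_here(re + 1, text);
    do {
        if (match_here(re, text))
            return 1;
    } while (*text++ != RE_END);
    return 0;
}
```

A pattern such as { RE_BOL, 1000, RE_ANY, RE_STAR, 3000, RE_EOL, 0 } then
reads as "^1000.*3000$" over an alphabet far larger than 256 symbols. The
test re[0] == *text is the single place where a user-supplied comparison
callback could be substituted.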

If your alphabet is not too big, and your regex not too complex,
mapping the alphabet to "tokens" and using yacc/bison seems a possibility.

HTH,
AvK