Simple regex question [Shell]


From: Seebs on 23 Jan 2010 22:43

On 2010-01-24, Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
> You can do all that with the upthread mentioned globbing mechanisms in
> Kornshell, in bash (with extended globbing), and I think in zsh as well;
> use these constructs respectively: *(...) ?([ ]) ?(...)

I'm not sure that even ksh can do everything posix REs can.

> Your point seems to be that it's not possible in bourne shell and older
> bash'es, and it's supposedly not defined in POSIX. Granted.

Yeah. Which is to say, the standard shell glob mechanisms lack key
components of what makes regexes what they are.

I now also realize that you're probably making a point about the
existence of the term "regular expression" both as a general term for
a linguistic category and as the name for the POSIX pattern matching
used by awk, sed, etcetera.

In the context of shell programming, "regular expression" usually
refers specifically to the set of closely related regular expression
languages used by sed and awk, and shell patterns are not one of
those languages. Since one occasionally encounters people who actually
think that shell globs and "regular expressions" are the same thing, I
didn't pick up on that.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!

From: Kaz Kylheku on 23 Jan 2010 22:48

On 2010-01-24, Sven Mascheck <mascheck(a)email.invalid> wrote:
> PS: why are they characteristically different:
> The motivation for globbing was *intuitive* handling of file names
> - sometimes overlooked but important: globbing uses implicit anchors.

In this regard, globs are faithful to the concept of a regular
expression.

Mathematical regular expressions do not search; they test set
membership.

A regex describes a finite automaton which must accept each string in
the given set from beginning to end. I.e. for any string w in the
language L(R) of the regular expression R, that string is
accepted by the corresponding automaton, which means that
after the characters of w are fed to that automaton, it is in
an accepting state.

A glob used on the command line tests exactly this: does the filename,
taken as a complete string, belong to the set of filenames described by
the expression?
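
A minimal sketch of the set-membership reading, assuming a POSIX shell.
The helper name glob_member is made up for illustration and does not come
from the thread. It relies on case matching a pattern against the whole
word rather than searching inside it.

```shell
# glob_member STRING PATTERN
# Hypothetical helper: succeeds only if the entire STRING matches PATTERN.
glob_member() {
    case $1 in
        $2) return 0 ;;   # unquoted $2 is treated as a pattern
        *)  return 1 ;;
    esac
}

glob_member "abc"    "abc"  && echo "abc is in the set described by abc"
glob_member "xyzabc" "*abc" && echo "xyzabc is in the set described by *abc"
glob_member "xyzabc" "abc"  || echo "xyzabc is not in the set described by abc"
```

The last call fails because nothing is searched for: the pattern abc
describes exactly one string, and "xyzabc" is not it.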

The ability to search for a matching substring is an extended
application of regular expressions.

It's not what makes them regular expressions.

Moreover, glob patterns /are/ in fact employed in a ``de anchored''
searching situation. Namely, the ${VAR%pattern} expansion syntax, in
all its variations. If FOO contains "xyzabc" then ${FOO%a*} will trim
off the "abc" part, yielding "xyz". Clearly, the "a" is not anchored.
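
For reference, a small sketch of the four trimming expansions, assuming a
POSIX shell. FOO is the value from the example; BAR and the result
variable names are made up here to show how the shortest and longest
forms differ.

```shell
FOO=xyzabc
trimmed=${FOO%a*}         # shortest suffix matching a* is "abc"
echo "$trimmed"           # xyz

BAR=xyzabcabc             # assumed value, chosen so the variants differ
short_suffix=${BAR%a*}    # removes "abc"      -> xyzabc
long_suffix=${BAR%%a*}    # removes "abcabc"   -> xyz
short_prefix=${BAR#*b}    # removes "xyzab"    -> cabc
long_prefix=${BAR##*b}    # removes "xyzabcab" -> c
echo "$short_suffix $long_suffix $short_prefix $long_prefix"
```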

From: Seebs on 23 Jan 2010 23:35

On 2010-01-24, Kaz Kylheku <kkylheku(a)gmail.com> wrote:
> Moreover, glob patterns /are/ in fact employed in a ``de anchored''
> searching situation. Namely, the ${VAR%pattern} expansion syntax, in
> all its variations. If FOO contains "xyzabc" then ${FOO%a*} will trim
> off the "abc" part, yielding "xyz". Clearly, the "a" is not anchored.

Actually, it is -- that's why you need a * after it. %foo is anchored at
the end, #foo at the beginning.

But yeah, globs can be used without anchoring, and regexes can be used
anchored -- expr regexes, as I recall, are anchored on the left...
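
A quick sketch of both halves of that, assuming a POSIX shell and the
standard expr utility. The variable names are only for illustration.

```shell
FOO=xyzabc

# % is anchored at the end, # at the beginning: without a * the
# pattern has to match the whole suffix or prefix being removed.
no_star=${FOO%a}        # "a" is not a suffix of xyzabc -> unchanged
with_star=${FOO%a*}     # "abc" is a suffix matching a* -> xyz
prefix_miss=${FOO#y}    # "y" is not a prefix           -> unchanged
prefix_hit=${FOO#x}     # "x" is a prefix               -> yzabc
echo "$no_star $with_star $prefix_miss $prefix_hit"

# expr STRING : REGEX anchors the regex at the start of STRING and
# prints the number of characters matched (0 if there is no match).
left=$(expr "$FOO" : 'xyz')               # 3
middle=$(expr "$FOO" : 'abc') || true     # 0; expr exits non-zero here
leading_any=$(expr "$FOO" : '.*abc')      # 6
echo "$left $middle $leading_any"
```

The "|| true" is only there because expr returns a failing exit status
when the result is 0, which would stop a script running under set -e.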

-s

From: Janis Papanagnou on 24 Jan 2010 00:11

Seebs wrote:
>
> I'm not sure that even ksh can do everything posix REs can.

I am quite sure it can. Conversely, I think the regexp library will at
least have problems emulating ksh's !(...) construct.
Ever tried? In general you'll get extremely bulky results here!
But the class of languages (regular languages) is the same, anyway.[*]

[*] N.B. Newer ksh's also support back-references in their expressions,
so strictly speeking, with that feature, they exceed the Chomsky-3 grammar
class as well (analogous to other libraries with backreference extensions).
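
A rough sketch of the !(...) point, assuming bash with extended globbing
(ksh spells the glob the same way). The function names are made up, and
the ERE is a hand-written equivalent for this one case, given as an
assumption rather than a canonical translation. It tests "does not end
in .c" both ways.

```shell
#!/usr/bin/env bash
# Assumes bash; extglob is enabled explicitly for older versions.
shopt -s extglob

# Glob version: everything that does not match *.c
not_c_glob() { [[ $1 == !(*.c) ]]; }

# ERE version: the complement has to be spelled out case by case:
# at most one character, or last character is not "c", or the last
# character is "c" but the one before it is not ".".
not_c_regex() {
    local re='^(.?|.*[^c]|.*[^.]c)$'
    [[ $1 =~ $re ]]
}

for name in main.c main.h a.cc c .c; do
    if not_c_glob "$name";  then g=yes; else g=no; fi
    if not_c_regex "$name"; then r=yes; else r=no; fi
    printf '%-8s glob=%s regex=%s\n' "$name" "$g" "$r"
done
```

Even for a two-character suffix the regex needs three alternatives, and
the number of cases grows with the length of the pattern being negated.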

>
> I now also realize that you're probably making a point to do with the
> existence of the term "regular expression" both as a general term for
> a linguistic category, as well as the name for the POSIX pattern-matching
> used by awk/sed, etcetera.

Right.

Not only the regexp library that you mention here; the (extended) globbing
as well. Both can be categorized under that term. In other words, in a Unix
context the regexp library lays claim (pars pro toto) to being the sole real
regular expression parser, but that claim is not justified given the
existing (extended) globbing.

Janis

From: Seebs on 24 Jan 2010 01:01

On 2010-01-24, Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
> Not only the regexp library that you mention here; the (extended) globbing
> as well. Both can be categorized under that term. In other words, in a Unix
> context the regexp library lays claim (pars pro toto) to being the sole real
> regular expression parser, but that claim is not justified given the
> existing (extended) globbing.

Mostly, it's that "regexp" doesn't really mean "the formal computer
science term regular expression" but "this particular set of closely
related instances of that term".

-s