match across line using grep [Debian]

From: Andre Majorel on 3 Aug 2010 08:50

On 2010-08-03 19:37 +0800, Zhang Weiwu wrote:
> On 2010年08月03日 17:53, Andre Majorel wrote:
> >> > $ printf 'a\nb' | grep -zo a.*b
> >> >
> >> > (The above should output something /if/ -z would make egrep
> >> > not consider \n as string terminator. But it has produced no
> >> > output)
> >>
> > But grep -z does. This would seem to be an undocumented
> > limitation of -o.
> >
>
> No it doesn't.
>
> $ printf 'a\nb' | grep -z 'a.*b'
> $

You're welcome. What version of grep ?

--
André Majorel <http://www.teaser.fr/~amajorel/>
If the Debian project published their users' email addresses,
we'd be getting spam. So I'm glad they don't.

--
To UNSUBSCRIBE, email to debian-user-REQUEST(a)lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster(a)lists.debian.org
Archive: http://lists.debian.org/20100803123949.GA4007(a)aym.net2.nerim.net

From: Bob McGowan on 3 Aug 2010 13:10

On 08/03/2010 05:39 AM, Andre Majorel wrote:
> On 2010-08-03 19:37 +0800, Zhang Weiwu wrote:
>> On 2010年08月03日 17:53, Andre Majorel wrote:
>>>>> $ printf 'a\nb' | grep -zo a.*b
>>>>>
>>>>> (The above should output something /if/ -z would make egrep
>>>>> not consider \n as string terminator. But it has produced no
>>>>> output)
>>>>
>>> But grep -z does. This would seem to be an undocumented
>>> limitation of -o.
>>>
>>
>> No it doesn't.
>>
>> $ printf 'a\nb' | grep -z 'a.*b'
>> $
>
> You're welcome. What version of grep ?
>

The -z "sort of" does/doesn't work for me. If I do this:

$ perl -e 'print "a\nb\0"'| grep -z 'a.*b'
$

There's no output. But change it like this:

$ perl -e 'print "a\nb\0"'| grep -z 'a'
a
b$

It found, and printed, the newline-containing string. I suspect
the regex engine is still honoring the '. (dot) does not match newline'
convention but is OK with literal characters, if present.

If, instead of using the '.*' pattern, I embed a literal newline, it
also works:

$ perl -e 'print "a\nb\0"'| grep -z 'a
> b'
a
b$

And just to prove the point, it does work with multiple NUL-terminated
records:

$ perl -e 'print "a\nb\0not here\0"'| grep -z 'a
> b'
a
b$

I'm using GNU grep 2.5.3
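
One caveat about the literal-newline test above, offered as a hedged
sketch rather than a diagnosis of grep 2.5.3: POSIX grep treats a
newline inside the pattern argument as a separator between alternative
patterns, so 'a<newline>b' may be read as "a or b" rather than
"a, newline, b". The snippet below assumes a grep that supports -z and
-c; the sample data and variable names are invented for illustration.
It also tries the bracket expression [[:space:]] as a way to match the
newline without relying on '.'.

```shell
# Pattern containing a literal newline: "a", newline, "b".
nl_pattern=$(printf 'a\nb')

# Three NUL-terminated records, none of which contains a newline.
# If the newline in the pattern separates alternatives, the records
# "a" and "b" both match and the count is 2 rather than 0.
alt_count=$(printf 'a\0b\0c\0' | grep -cz -e "$nl_pattern")
echo "records matched by newline pattern: $alt_count"

# A bracket expression that includes newline, applied to one record
# that really does contain "a", newline, "b".
space_count=$(printf 'a\nb\0' | grep -cz -e 'a[[:space:]]b')
echo "records matched by a[[:space:]]b: $space_count"
```

If the first count comes out as 2, the earlier literal-newline result
could equally be explained by the record simply containing an 'a'.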

--
Bob McGowan

--
Archive: http://lists.debian.org/4C584A92.70102(a)symantec.com

From: Andre Majorel on 3 Aug 2010 14:30

On 2010-08-03 09:57 -0700, Bob McGowan wrote:
> On 08/03/2010 05:39 AM, Andre Majorel wrote:
> > On 2010-08-03 19:37 +0800, Zhang Weiwu wrote:
> >> On 2010年08月03日 17:53, Andre Majorel wrote:
> >>>>> $ printf 'a\nb' | grep -zo a.*b
> >>>>>
> >>>>> (The above should output something /if/ -z would make egrep
> >>>>> not consider \n as string terminator. But it has produced no
> >>>>> output)
> >>>>
> >>> But grep -z does. This would seem to be an undocumented
> >>> limitation of -o.
> >>
> >> No it doesn't.
> >>
> >> $ printf 'a\nb' | grep -z 'a.*b'
> >> $
> >
> > You're welcome. What version of grep ?
>
> The -z "sort of" does/doesn't work for me. If I do this:
>
> $ perl -e 'print "a\nb\0"'| grep -z 'a.*b'
> $

$ printf 'a\nb\0'| grep -z 'a.*b'
a
b$ grep --version
GNU grep 2.5.3

Fun, eh ? Maybe the answer is in there :

$ locale
LANG=
LC_CTYPE=en_US
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE=C
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

> There's no output. But change it like this:
>
> $ perl -e 'print "a\nb\0"'| grep -z 'a'
> a
> b$
>
> It found, and printed, the newline-containing string. I suspect
> the regex engine is still honoring the '. (dot) does not match newline'
> convention but is OK with literal characters, if present.

My grep -z acts as if it used a regexp engine where "." matches
newline. Only when -o is in effect and the match contains a newline
is there no output. But the exit status is still good:

$ printf 'a\nb\0'| (grep -z 'a.*b' && printf 'st=%d chars=' $? >&2) | wc -c
st=0 chars=4
$ printf 'a\nb\0'| (grep -oz 'a.*b' && printf 'st=%d chars=' $? >&2) | wc -c
st=0 chars=0
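
The same check can be written so that the exit status and the byte
count are captured separately instead of being interleaved on stderr
and stdout. This is only a sketch: the function name probe and its
argument layout are made up here, and the numbers it prints for
'a.*b' will depend on the grep version and locale, as this thread
shows.

```shell
# probe OPTS PATTERN: run grep on a fixed sample record and report the
# exit status and the number of bytes written to stdout.
probe() {
    tmp=$(mktemp) || return 2
    st=0
    # $1 is left unquoted on purpose so that "-o -z" splits into two options.
    printf 'a\nb\0' | grep $1 -e "$2" > "$tmp" || st=$?
    bytes=$(( $(wc -c < "$tmp") ))   # arithmetic strips any padding from wc
    rm -f "$tmp"
    printf 'st=%d chars=%d\n' "$st" "$bytes"
}

probe -z  'a.*b'
probe -oz 'a.*b'
```

A status of 0 together with a count of 0 reproduces the odd -o case
reported above.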

--
André Majorel <http://www.teaser.fr/~amajorel/>
No one ever sends you any email ? Report a bug in Debian !

--
Archive: http://lists.debian.org/20100803182837.GC4007(a)aym.net2.nerim.net

From: Bob McGowan on 3 Aug 2010 17:00

On 08/03/2010 11:28 AM, Andre Majorel wrote:
> On 2010-08-03 09:57 -0700, Bob McGowan wrote:
>> On 08/03/2010 05:39 AM, Andre Majorel wrote:
>>> On 2010-08-03 19:37 +0800, Zhang Weiwu wrote:
>>>> On 2010年08月03日 17:53, Andre Majorel wrote:
>>>>>>> $ printf 'a\nb' | grep -zo a.*b
>>>>>>>

<--deleted-->

> Fun, eh ? Maybe the answer is in there :
>
> $ locale
> LANG=
> LC_CTYPE=en_US
> LC_NUMERIC="POSIX"
> LC_TIME="POSIX"
> LC_COLLATE=C
> LC_MONETARY="POSIX"
> LC_MESSAGES="POSIX"
> LC_PAPER="POSIX"
> LC_NAME="POSIX"
> LC_ADDRESS="POSIX"
> LC_TELEPHONE="POSIX"
> LC_MEASUREMENT="POSIX"
> LC_IDENTIFICATION="POSIX"
> LC_ALL=

This does appear to be the "issue". My settings are:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

>
>> There's no output. But change it like this:
>>
>> $ perl -e 'print "a\nb\0"'| grep -z 'a'
>> a
>> b$
>>
>> It found, and printed, the newline-containing string. I suspect
>> the regex engine is still honoring the '. (dot) does not match newline'
>> convention but is OK with literal characters, if present.
>

I did a sub-shell and reset all the variables to match yours, and,
bingo, the wildcard worked.

Looking through the list of names, nothing stands out as the 'obvious'
single cause. In fact, the LC_ names all seem to be specific to things
that would not necessarily impact the regex operation.

So, I picked LANG as a starting point and reset it, *only*, to empty.
And got lucky. That is, apparently, the variable that affects how the
regex is handled.
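
A possible refinement, stated as an assumption rather than a finding:
in locale(1) output, values shown in double quotes are implied by LANG
(or LC_ALL) rather than set explicitly, so emptying LANG also moves
LC_CTYPE from "en_US.UTF-8" to "POSIX". The effect may therefore come
from the character encoding in LC_CTYPE and not from LANG as such. The
sketch below forces every category at once with LC_ALL for a single
command; the function name bytes_in_locale is invented, and
en_US.UTF-8 is simply the locale name from the listing above, which
may not be installed everywhere.

```shell
# bytes_in_locale LOCALE PATTERN: byte count of grep -z output for a
# fixed sample record, with every locale category forced to LOCALE.
bytes_in_locale() {
    n=$(printf 'a\nb\0' | LC_ALL="$1" grep -z -e "$2" | wc -c)
    echo "$(( n ))"
}

for loc in C en_US.UTF-8; do
    printf '%s: %s bytes\n' "$loc" "$(bytes_in_locale "$loc" 'a.*b')"
done
```

Differing byte counts between the two lines would point at the locale;
setting LC_CTYPE alone instead of LC_ALL would narrow it down further.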

--
Bob McGowan
Symantec
US Internationalization

--
Archive: http://lists.debian.org/4C588249.8010604(a)symantec.com

From: Zhang Weiwu on 5 Aug 2010 21:50

On 2010年08月04日 04:55, Bob McGowan wrote:
> In fact, the LC_ names all seem to be specific to things
> that would not necessarily impact the regex operation.
>
That is not entirely true. The encoding part might. If it is UTF-8, in
theory, [:digit:] should match more than 0-9. It might, for example,
match 一-十 (Chinese digits).
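
Whether [:digit:] really matches more than 0-9 can be checked directly
in whatever locale is in effect. The sketch below is only a probe: the
function name digit_lines and the two sample lines are invented, and
the script is assumed to be saved as UTF-8. POSIX restricts the digit
class in locale definitions to 0 through 9, so a count of 1 would not
be surprising.

```shell
# digit_lines: count input lines that contain a [[:digit:]] character
# under the locale currently in effect.
digit_lines() {
    grep -c '[[:digit:]]'
}

# One line with an ASCII digit, one with the CJK numeral for five.
printf '5\n五\n' | digit_lines
```

A result of 2 would mean the CJK numeral was classified as a digit.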

--
Archive: http://lists.debian.org/4C5B6A10.3070702(a)realss.com
