Please help with this one-line sed [Unix Programming]


From: kolmogolov on 7 Mar 2010 07:10

Hi,

I am using a one-line sed script:

sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p

to extract the URL field and
the description field from the Firefox bookmark file:

(Oh dear ... I hope the Google editor does not
insert newlines on cut-and-paste.)

<DT><A HREF="http://en.wikipedia.org/wiki/Bridal_Chorus"
ADD_DATE="126795746
6" LAST_CHARSET="UTF-8" ID="rdf:#$a2RoK">Bridal Chorus - Wikipedia,
the free enc
yclopedia</A>
<DT><A HREF="http://de.wikipedia.org/wiki/Mittelhochdeutsch"
ADD_DATE="12679
59991" ICON="data:image/x-
icon;base64,AAABAAEAEBAQAAEABAAoAQAAFgAAACgAAAAQAAAAIA
AAAAEABAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAEAgQAhIOEAMjHyABIR0gA6ejpAGlqaQCpqKkAKC
goAPz9/
AAZGBkAmJiYANjZ2ABXWFcAent6ALm6uQA8OjwAiIiIiIiIiIiIiI4oiL6IiIiIgzuIV4iIiI
hndo53KIiIiB/WvXoYiIiIfEZfWBSIiIEGi/
foqoiIgzuL84i9iIjpGIoMiEHoiMkos3FojmiLlUipYl
iEWIF
+iDe0GoRa7D6GPbjcu1yIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIgAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
LAST_CHARSET="UT
F-8" ID="rdf:#$Y2RoK">Mittelhochdeutsch <80><93> Wikipedia</A>

However, I got only the output for the first line:

http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia,
the free encyclopedia"

while it failed to match on the second line.

Thanks in advance for any hints
regards

From: Jens Thoms Toerring on 7 Mar 2010 09:31

kolmogolov(a)gmail.com <kolmogolov(a)gmail.com> wrote:
> I am using a one-line sed script:

> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p

> to extract the URL field and
> the description field from the Firefox bookmark file:

> (Oh dear ... I hope the Google editor does not
> insert newlines on cut-and-paste.)

> <DT><A HREF="http://en.wikipedia.org/wiki/Bridal_Chorus"
> ADD_DATE="126795746
> 6" LAST_CHARSET="UTF-8" ID="rdf:#$a2RoK">Bridal Chorus - Wikipedia,
> the free enc
> yclopedia</A>
> <DT><A HREF="http://de.wikipedia.org/wiki/Mittelhochdeutsch"
> ADD_DATE="12679
> 59991" ICON="data:image/x-
> icon;base64,AAABAAEAEBAQAAEABAAoAQAAFgAAACgAAAAQAAAAIA
> AAAAEABAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAEAgQAhIOEAMjHyABIR0gA6ejpAGlqaQCpqKkAKC
> goAPz9/
> AAZGBkAmJiYANjZ2ABXWFcAent6ALm6uQA8OjwAiIiIiIiIiIiIiI4oiL6IiIiIgzuIV4iIiI
> hndo53KIiIiB/WvXoYiIiIfEZfWBSIiIEGi/
> foqoiIgzuL84i9iIjpGIoMiEHoiMkos3FojmiLlUipYl
> iEWIF
> +iDe0GoRa7D6GPbjcu1yIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIgAAAAAAAAAAAAAAAAAAAAAAA
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
> LAST_CHARSET="UT
> F-8" ID="rdf:#$Y2RoK">Mittelhochdeutsch <80><93> Wikipedia</A>

> However, I got only the output for the first line:

> http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia,
> the free encyclopedia"

> while it failed to match on the second line.

Does it fail completely? It's a bit hard to say since, of course,
the lines have become split up ;-) But if it's just that part of
the description field of the second line isn't output correctly,
then there's a simple fix. Instead of

> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p'

use

sed -n -e 's/^.*HREF="\([^"]*\)[^>]*>\(.*\)<\/[aA]>$/\1 "\2"/p'

since your '.*>' part will match everything up to (and including)
the '<93>' in the description field (remember it's greedy matching,
i.e. as much as possible is matched), so what's in \2 is then
just ' Wikipedia'. With '[^>]*>' you instead match only up to the
very first '>' encountered after the URL.
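
The difference can be reproduced on simplified data. This is only a sketch: the bookmark line below is made up (one entry per line, with '<80><93>' as literal ASCII text rather than real non-ASCII bytes), and the variable names are hypothetical.

```shell
# Simplified stand-in for one bookmark entry (hypothetical data).
line='<DT><A HREF="http://example.org/" ID="x">Text <80><93> More text</A>'

# Greedy '.*>' runs up to the LAST usable '>', i.e. the one closing '<93>',
# so \2 only holds the tail of the description.
greedy=$(printf '%s\n' "$line" |
  sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p')

# '[^>]*>' stops at the FIRST '>' after the URL, so \2 is the full description.
fixed=$(printf '%s\n' "$line" |
  sed -n -e 's/^.*HREF="\([^"]*\)[^>]*>\(.*\)<\/[aA]>$/\1 "\2"/p')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

With the greedy pattern the description comes out as " More text"; with '[^>]*>' it is the complete text between the tags.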

Regards, Jens
--
\ Jens Thoms Toerring ___ jt(a)toerring.de
\__________________________ http://toerring.de

From: Ben Bacarisse on 7 Mar 2010 10:02

"kolmogolov(a)gmail.com" <kolmogolov(a)gmail.com> writes:

> I am using a one-line sed script:
>
> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p

Missing ' at the end, I think.

> to extract the URL field and
> the description field from the Firefox bookmark file:
>
> (Oh dear ... I hope the Google editor does not
> insert newlines on cut-and-paste.)

<snip data>

> However, I got only the output for the first line:
>
> http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia,
> the free encyclopedia"
>
> while it failed to match on the second line.

It's not clear what the problem is because of the messed-up lines. If
I repair them as I think they should be repaired, I do get something
from the second line, though the pattern does not match what you want
it to. The '.*>' part after the HREF will match more than just what is
left in this tag, and your second line has this form:

...HREF="..." stuff>Text <80><93> More text</A>

so your \2 only starts after the > of <93>. Construct some simple
data with the same shape so you can post a real non-working example.
Alternatively post the same data but explicitly describe where the
lines end in your data file.

--
Ben.

From: kolmogolov on 8 Mar 2010 04:41

On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com>
wrote:

(commenting my own follow-up)

> It turns out that my sed(1) got confused by the 0xe2 0x80 and 0x93
> characters in the description-field.
>
> I recall that I just replaced the grep(1) in one of my scripts by
> some ``agrep(1)'' for files containing iso-8859-1 characters...

That is, my GNU grep(1) did not match a German umlaut in ISO-8859-1
by a single dot.

From: Jens Thoms Toerring on 8 Mar 2010 05:30

kolmogolov(a)gmail.com <kolmogolov(a)gmail.com> wrote:
> On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com>
> wrote:

> (commenting my own follow-up)

> > It turns out that my sed(1) got confused by the 0xe2 0x80 and 0x93
> > characters in the description-field.
> >
> > I recall that I just replaced the grep(1) in one of my scripts by
> > some ``agrep(1)'' for files containing iso-8859-1 characters...

> That is, my GNU grep(1) did not match a German umlaut in ISO-8859-1
> by a single dot.

It might be an encoding issue - if you e.g. have set your shell
to UTF-8 (or the file into which you put the grep command is
in UTF-8) but the file you're running grep on is in iso-8859-1
then grep will try to match the UTF-8 characters and, of course,
miss the characters that are in a different encoding. In the
iso-8859-1 file e.g. the character 'ä' will be represented by
the value 0xe4 while in UTF-8 it's actually two bytes, 0xC3
followed by 0xA4, so grep has no chance to figure out that
they are supposed to represent the same character. The same
holds, of course, for sed. That it still works with agrep is
probably due to agrep also accepting approximate matches.
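
The byte values can be checked directly. A minimal sketch, assuming od(1) and iconv(1) are installed; the 'ä' is produced with an octal escape so the result does not depend on the encoding of the script itself, and the variable names are made up for illustration.

```shell
# 'ä' as the single ISO-8859-1 byte 0xE4 (octal 344).
latin1_hex=$(printf '\344' | od -An -tx1 | tr -d ' \n')

# The same character converted to UTF-8 becomes two bytes, 0xC3 0xA4.
utf8_hex=$(printf '\344' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

printf 'ISO-8859-1: %s\nUTF-8:      %s\n' "$latin1_hex" "$utf8_hex"
```

A pattern typed in a UTF-8 environment contains the two-byte form, which never occurs in the ISO-8859-1 file.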

Since I didn't find a way to match binary data (and you
would have to know how the characters you're interested in are
stored as binary values), the simplest solution might be to
convert the file to the character encoding used for running sed
and afterwards back to the original encoding. So if the file
'in.txt' is in ISO-8859-1 but you "operate" in UTF-8, then the
following command will first convert the file to UTF-8, then
run sed on it, replacing 'ä' by 'ö', and then re-convert the
results back to ISO-8859-1:

iconv -f ISO-8859-1 -t UTF-8 in.txt | sed 's/ä/ö/' | \
iconv -f UTF-8 -t ISO-8859-1 - > out.txt

Sorry, but until everybody starts using UTF-8 only everywhere
things are probably going to remain that messy...
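
An alternative that is not mentioned in the thread, offered only as a sketch: running sed in the C locale makes it treat every byte as one character, so a single dot can match a byte such as 0xE4 without any conversion. This assumes a sed that honours LC_ALL (GNU sed does); the test line is made up.

```shell
# A line with the lone ISO-8859-1 byte 0xE4 (octal 344) between 'a' and 'b'.
# In the C locale each byte counts as one character, so '.' matches it.
result=$(printf 'a\344b\n' | LC_ALL=C sed -n 's/^a.b$/matched/p')

printf '%s\n' "$result"
```

The trade-off is that in the C locale a multi-byte UTF-8 character counts as several characters, so this suits patterns that only need to skip over the non-ASCII bytes, not ones that name them.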

Regards, Jens
--
\ Jens Thoms Toerring ___ jt(a)toerring.de
\__________________________ http://toerring.de
