From: kolmogolov on 7 Mar 2010 07:10

Hi,

I am using a one-line sed script:

sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p

to extract the URL-field and the description-field from the firefox bookmark-file:

(Oh dear, .... upon cut-and-paste I hope the google-editor is not inserting newlines.....)

<DT><A HREF="http://en.wikipedia.org/wiki/Bridal_Chorus"
ADD_DATE="126795746
6" LAST_CHARSET="UTF-8" ID="rdf:#$a2RoK">Bridal Chorus - Wikipedia,
the free enc
yclopedia</A>
<DT><A HREF="http://de.wikipedia.org/wiki/Mittelhochdeutsch"
ADD_DATE="12679
59991" ICON="data:image/x-
icon;base64,AAABAAEAEBAQAAEABAAoAQAAFgAAACgAAAAQAAAAIA
AAAAEABAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAEAgQAhIOEAMjHyABIR0gA6ejpAGlqaQCpqKkAKC
goAPz9/
AAZGBkAmJiYANjZ2ABXWFcAent6ALm6uQA8OjwAiIiIiIiIiIiIiI4oiL6IiIiIgzuIV4iIiI
hndo53KIiIiB/WvXoYiIiIfEZfWBSIiIEGi/
foqoiIgzuL84i9iIjpGIoMiEHoiMkos3FojmiLlUipYl
iEWIF
+iDe0GoRa7D6GPbjcu1yIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIgAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
LAST_CHARSET="UT
F-8" ID="rdf:#$Y2RoK">Mittelhochdeutsch <80><93> Wikipedia</A>

However, I got only the output for the first line:

http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia, the free encyclopedia"

while it failed to match on the second line.

Thanks in advance for any hints
regards
From: Jens Thoms Toerring on 7 Mar 2010 09:31

kolmogolov(a)gmail.com <kolmogolov(a)gmail.com> wrote:
> I am using a one-line sed script:
> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p
> to extract the URL-field and
> the description-field from the firefox bookmark-file:
> (Oh dear, .... upon cut-and-paste I hope the
> google-editor is not inserting newlines.....)
> <DT><A HREF="http://en.wikipedia.org/wiki/Bridal_Chorus"
> ADD_DATE="126795746
> 6" LAST_CHARSET="UTF-8" ID="rdf:#$a2RoK">Bridal Chorus - Wikipedia,
> the free enc
> yclopedia</A>
> <DT><A HREF="http://de.wikipedia.org/wiki/Mittelhochdeutsch"
> ADD_DATE="12679
> 59991" ICON="data:image/x-
> icon;base64,AAABAAEAEBAQAAEABAAoAQAAFgAAACgAAAAQAAAAIA
> AAAAEABAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAEAgQAhIOEAMjHyABIR0gA6ejpAGlqaQCpqKkAKC
> goAPz9/
> AAZGBkAmJiYANjZ2ABXWFcAent6ALm6uQA8OjwAiIiIiIiIiIiIiI4oiL6IiIiIgzuIV4iIiI
> hndo53KIiIiB/WvXoYiIiIfEZfWBSIiIEGi/
> foqoiIgzuL84i9iIjpGIoMiEHoiMkos3FojmiLlUipYl
> iEWIF
> +iDe0GoRa7D6GPbjcu1yIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIgAAAAAAAAAAAAAAAAAAAAAAA
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
> LAST_CHARSET="UT
> F-8" ID="rdf:#$Y2RoK">Mittelhochdeutsch <80><93> Wikipedia</A>
> However, I got only the output for the first line:
> http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia,
> the free encyclopedia"
> while it failed to match on the second line.

Does it fail completely? It's a bit hard to say since, of course, the lines have become split up ;-) But if it's just that part of the description field of the second line isn't output correctly, then there's a simple fix. Instead of

> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p'

use

sed -n -e 's/^.*HREF="\([^"]*\)[^>]*>\(.*\)<\/[aA]>$/\1 "\2"/p'

since your '.*>' part will match everything up to (and including) the '<93>' in the description field (remember it's greedy matching, i.e. as much as possible is matched), and what ends up in \2 is then just ' Wikipedia'. With '[^>]*>' you instead match only up to the very first '>' encountered after the URL.

Regards, Jens
--
  \   Jens Thoms Toerring  ___      jt(a)toerring.de
   \__________________________      http://toerring.de
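
The difference between the greedy '.*>' and the bounded '[^>]*>' can be tried out on a cut-down line of the same shape. This is only a sketch: the sample line and the example.org URL are made up, and '<80><93>' is typed here as literal characters, the way it appears in the posts above (the real bookmark file may hold raw bytes at that spot instead).

```shell
# Made-up test line with the same shape as the second bookmark entry.
line='<DT><A HREF="http://example.org/page" ID="x">Text <80><93> More text</A>'

# Original pattern: the greedy '.*>' runs on to the last '>' that still
# leaves room for '</A>' at the end, i.e. the '>' of '<93>'.
greedy=$(printf '%s\n' "$line" |
  sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p')

# Suggested pattern: '[^>]*>' stops at the first '>' after the URL,
# which is the end of the opening <A ...> tag.
fixed=$(printf '%s\n' "$line" |
  sed -n -e 's/^.*HREF="\([^"]*\)[^>]*>\(.*\)<\/[aA]>$/\1 "\2"/p')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

With the first pattern the description is cut down to what follows '<93>'; with the second it comes out whole.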
From: Ben Bacarisse on 7 Mar 2010 10:02

"kolmogolov(a)gmail.com" <kolmogolov(a)gmail.com> writes:

> I am using a one-line sed script:
>
> sed -n -e 's/^.*HREF="\([^"]*\)".*>\(.*\)<\/[aA]>$/\1 "\2"/p

Missing ' at the end, I think.

> to extract the URL-field and
> the description-field from the firefox bookmark-file:
>
> (Oh dear, .... upon cut-and-paste I hope the
> google-editor is not inserting newlines.....)
<snip data>
> However, I got only the output for the first line:
>
> http://en.wikipedia.org/wiki/Bridal_Chorus "Bridal Chorus - Wikipedia,
> the free encyclopedia"
>
> while it failed to match on the second line.

It's not clear what the problem is because of the messed up lines. If I repair them as I think they should be repaired, I do get something from the second line, though the pattern does not match what you want it to. The '.*>' part after the HREF will match more than just what is left in this tag, and your second line has this form:

  ...HREF="..." stuff>Text <80><93> More text</A>

so your \2 only starts after the > of <93>.

Construct some simple data with the same shape so you can post a real non-working example. Alternatively, post the same data but explicitly describe where the lines end in your data file.

--
Ben.
From: kolmogolov on 8 Mar 2010 04:41

On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com> wrote:

(commenting my own follow-up)

> It turns out that my sed(1) got confused by the 0xe2 0x80 and 0x93
> characters in the description-field.
>
> I recall that I just replaced the grep(1) in one of my scripts by
> some ``agrep(1)'' for files containing iso-8859-1 characters...

That is, my GNU grep(1) did not match a German umlaut in 8859-1 by a single dot.
From: Jens Thoms Toerring on 8 Mar 2010 05:30
kolmogolov(a)gmail.com <kolmogolov(a)gmail.com> wrote:
> On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com>
> wrote:
> (commenting my own follow-up)
> > It turns out that my sed(1) got confused by the 0xe2 0x80 and 0x93
> > characters in the description-field.
> >
> > I recall that I just replaced the grep(1) in one of my scripts by
> > some ``agrep(1)'' for files containing iso-8859-1 characters...
> That is, my GNU grep(1) did not match a German umlaut in 8859-1
> by a single dot.

It might be an encoding issue: if you have, e.g., set your shell to UTF-8 (or the file into which you put the grep command is in UTF-8) but the file you're running grep on is in ISO-8859-1, then grep will try to match the UTF-8 characters and, of course, miss the characters that are in a different encoding. In the ISO-8859-1 file the character 'ä', for example, is represented by the single byte 0xE4, while in UTF-8 it's two bytes, 0xC3 followed by 0xA4, so grep has no chance to figure out that they are supposed to represent the same character. The same holds, of course, for sed. That it still works with agrep is probably due to agrep also accepting approximate matches.

Since I didn't find a way to match binary data (and you would have to know how the characters you're interested in are stored as binary values), the simplest solution might be to convert the file to the character encoding used for running sed and afterwards back to the original encoding. So if the file 'in.txt' is in ISO-8859-1 but you "operate" in UTF-8, then the following command will first convert the file to UTF-8, then run sed on it, replacing 'ä' by 'ö', and then re-convert the result back to ISO-8859-1:

iconv -f ISO-8859-1 -t UTF-8 in.txt | sed 's/ä/ö/' | \
  iconv -f UTF-8 -t ISO-8859-1 - > out.txt

Sorry, but until everybody starts using UTF-8 only everywhere, things are probably going to remain that messy...

Regards, Jens
--
  \   Jens Thoms Toerring  ___      jt(a)toerring.de
   \__________________________      http://toerring.de
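
The byte values and the iconv round trip described above can be checked on a single word instead of a whole file. This is a rough sketch under a few assumptions: iconv and od are installed, the sample word "Mädchen" is made up, and the helper names (latin1_word, hex) are only for illustration. The bytes are written as octal escapes so that the result does not depend on the encoding of the terminal or script file.

```shell
# Made-up sample word "Mädchen" in ISO-8859-1: 'ä' is the single byte 0xE4
# (octal 344).
latin1_word() { printf 'M\344dchen\n'; }

# Hex dump of stdin without offsets or whitespace, for comparing bytes.
hex() { od -An -tx1 | tr -d ' \n'; }

# The same 'ä' converted to UTF-8 becomes two bytes, 0xC3 0xA4.
utf8_a=$(printf '\344' | iconv -f ISO-8859-1 -t UTF-8 | hex)

# sed script equivalent to 's/ä/ö/' typed in a UTF-8 terminal:
# UTF-8 'ä' = 0xC3 0xA4 (octal 303 244), UTF-8 'ö' = 0xC3 0xB6 (303 266).
script=$(printf 's/\303\244/\303\266/')

# Applied directly to the ISO-8859-1 data the pattern never matches,
# so the bytes pass through unchanged.
direct=$(latin1_word | sed "$script" | hex)

# Round trip: convert to UTF-8, edit, convert back to ISO-8859-1.
roundtrip=$(latin1_word | iconv -f ISO-8859-1 -t UTF-8 | sed "$script" |
  iconv -f UTF-8 -t ISO-8859-1 | hex)

printf 'utf8 a:    %s\n' "$utf8_a"
printf 'direct:    %s\n' "$direct"
printf 'roundtrip: %s\n' "$roundtrip"
```

In the round-trip result the second byte should have changed from e4 ('ä') to f6 ('ö' in ISO-8859-1), while the direct run leaves it at e4.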