From: kolmogolov on 9 Mar 2010 01:59

On Mar 8, 6:30 pm, j...(a)toerring.de (Jens Thoms Toerring) wrote:
> kolmogo...(a)gmail.com <kolmogo...(a)gmail.com> wrote:
> > On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com>
> > wrote:
> > (commenting my own follow-up)
> > > It turns out that my sed(1) got confused by the 0xe2 0x80 and 0x93
> > > characters in the description-field.
> >
> > > I recall that I just replaced the grep(1) in one of my scripts by
> > > some ``agrep(1)'' for files containing iso-8859-1 characters...
> > That is, my GNU grep(1) did not match a German umlaut in 8859-1
> > by a single dot.
>
> It might be an encoding issue - if you e.g. have set your shell
> to UTF-8 (or the file into which you put the grep command is
> in UTF-8) but the file you're running grep on is in iso-8859-1
> then grep will try to match the UTF-8 characters and, of course,
> miss the characters that are in a different encoding. In the
> iso-8859-1 file e.g. the character 'ä' will be represented by
> the value 0xe4 while in UTF-8 it's actually two bytes, 0xC3
> followed by 0xA4, so grep has no chance to figure out that
> they are supposed to represent the same character. The same
> holds, of course, for sed. That it still works with agrep is
> probably due to agrep also accepting approximate matches.
>
> Since I didn't find a way to match binary data (and you
> would have to know how the characters you're interested in are
> stored in binary values) the simplest solution might be to
> convert the file to the character encoding used for running sed
> and afterwards back to the original encoding. So if the file
> 'in.txt' is in ISO-8859-1 but you "operate" in UTF-8, then the
> following command will first convert the file to UTF-8, then
> run sed on it, replacing 'ä' by 'ö', and then re-convert the
> results back to ISO-8859-1:
>
>   iconv -f ISO-8859-1 -t UTF-8 in.txt | sed 's/ä/ö/' | \
>   iconv -f UTF-8 -t ISO-8859-1 - > out.txt
>

Thanks a lot!
So, I solved both the grep(1) and sed(1) problems by inserting a
single line in my scripts:

  LC_CTYPE=en_US.iso88591

to trick them into treating the data as one-byte characters
while leaving my shell working environment intact. Since lots
of my private documents are encoded in the Big5 charset and my
shell always has

  LANG=C
  LC_CTYPE=zh_TW.Big5
  LC_NUMERIC="C"
  ....

I'm simply not yet ready to switch over...

Happy end!

regards
Rudi
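
For later readers, here is a minimal sketch of the byte-level mismatch discussed in this thread. It is not taken from the original posts: the helper names `hex_bytes` and `latin1_a_umlaut` are made up for illustration, and it uses `LC_ALL=C` instead of `en_US.iso88591` only because the C locale exists on every system, while the ISO-8859-1 locale may not be installed. It assumes `od`, `iconv` and a reasonably recent GNU grep are available.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the original thread.

# Print the bytes read from stdin as space-separated hex values.
hex_bytes() {
    # Unquoted on purpose: word splitting collapses od's padding.
    echo $(od -An -tx1)
}

# Emit 'ä' as stored in an ISO-8859-1 file: the single byte 0xE4 (octal 344).
latin1_a_umlaut() {
    printf '\344'
}

# One byte in ISO-8859-1 ...
latin1_a_umlaut | hex_bytes                               # prints: e4

# ... but two bytes once converted to UTF-8.
latin1_a_umlaut | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes   # prints: c3 a4

# In a single-byte locale a dot matches the lone 0xE4 byte.
# LC_ALL=C is an assumption standing in for LC_CTYPE=en_US.iso88591.
printf 'M\344rz\n' | LC_ALL=C grep -c '^M.rz$'            # prints: 1
```

Note that `LC_ALL`, when set, takes precedence over `LC_CTYPE`, which in turn takes precedence over `LANG`. If a script sets only `LC_CTYPE` and the calling environment already exports `LC_ALL`, the override will have no effect.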