From: Guillaume Dargaud on 30 May 2010 17:37

Hello all,

I'm playing with grep/sed on ISO-8859-1 encoded files, and I notice that . (the dot) doesn't seem to match accented chars, leading to some pretty unexpected results.

I know that internationalization and encodings are a hornet's nest, so I'm seeking some advice here...

-- 
Guillaume Dargaud
http://www.gdargaud.net/
From: Ben Bacarisse on 30 May 2010 19:46

Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:

> I'm playing with grep/sed on ISO-8859-1 encoded files,

There is a mismatch between the subject line and this remark. UTF usually means UTF-8, which is a multi-byte encoding used for Unicode. ISO-8859-1 is a single-byte character set.

> and I notice that . (the dot) doesn't seem to match accented chars,
> leading to some pretty unexpected results.

I don't see this problem with either UTF-8 encoded files or with ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the character encoding from the environment (specifically LANG). I don't say this to "show off", just to point out that it does seem to work.

It may simply be that you have a mismatch between the setting of LANG and the encoding in the file. You can change LANG for just one command like this:

    LANG=en_GB.iso-8859-1 grep c.d data

> I know that internationalization and encodings are a hornet's nest,
> so I'm seeking some advice here...

It can be. Best start with an example. Probably the only way for everyone to know exactly what you have in the data file is to post a hex dump of it (keep it short). Post the value of $LANG and the command line that does not do what you expect. Initially, avoid command lines that use anything but "plain" characters.

-- 
Ben.
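The kind of report asked for above could be produced roughly as follows. This is only a sketch, not part of the original post: the function names and the three-byte sample are made up for illustration, and `LC_ALL=C` stands in for a named ISO-8859-1 locale because the C locale exists on every system while `en_GB.iso-8859-1` may not be installed.

```shell
#!/bin/sh
# Hypothetical sample: "c", byte 0xE9 (e-acute in ISO-8859-1), "d", newline.
make_sample() {
    printf 'c\351d\n'
}

# Hex dump of the sample as space-separated bytes on one line.
dump_sample() {
    make_sample | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Count matching lines with the locale forced for this one command only.
count_matches_c_locale() {
    make_sample | LC_ALL=C grep -c 'c.d'
}

dump_sample
echo
count_matches_c_locale
```

Posting the dump together with the output of `echo "$LANG"` and the exact command line lets readers compare the bytes in the file with the encoding the tool was told to expect.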
From: Thomas 'PointedEars' Lahn on 31 May 2010 05:26

Ben Bacarisse wrote:

> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>
> There is a mismatch between the subject line and this remark. UTF
> usually means UTF-8

UTF (usually) means Unicode Transformation Format. Nothing more, nothing less.

> which is a multi-byte encoding used for Unicode.

The truth of this statement is questionable. While most of the characters in the Unicode character set require more than one UTF-8 code unit (so more than 8 bits, or 1 byte) to be encoded, there are characters (those below U+0080) that require only one UTF-8 code unit, so 8 bits, or 1 byte, to be encoded.

<http://unicode.org/faq/>
<http://rishida.net/tools/conversion/>

> ISO-8859-1 is a single-byte character set.

Now you are obviously confusing character set and encoding. Further good, less formal, reading on that (if you ignore some sentiments):

<http://www.joelonsoftware.com/articles/Unicode.html>

>> and I notice that . (the dot) doesn't seem to match accented chars,
>> leading to some pretty unexpected results.
>
> I don't see this problem with either UTF-8 encoded files or with
> ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the
> character encoding from the environment (specifically LANG).

GNU grep(1) uses the character encoding specified by the environment variables LC_ALL, LC_CTYPE, or LANG. RTFM.

PointedEars
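The three variables named above are consulted in a fixed order: LC_ALL first, then LC_CTYPE, then LANG, skipping any that are unset or empty. A minimal sketch of that lookup follows; `effective_ctype` is a hypothetical helper name, and the final fallback to `C` is an assumption, since POSIX leaves the default locale implementation-defined.

```shell
#!/bin/sh
# Print the locale name that governs character handling (LC_CTYPE category).
# LC_ALL wins if set and non-empty, then LC_CTYPE, then LANG.
# Falling back to "C" when none is set is an assumption for illustration.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype
```

This is why setting only LANG for one command has no effect when LC_ALL or LC_CTYPE is already exported in the environment. The standard `locale` utility, run without arguments, prints the values actually in effect.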
From: pk on 31 May 2010 06:55

Ben Bacarisse wrote:

> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>
>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>
> There is a mismatch between the subject line and this remark. UTF
> usually means UTF-8, which is a multi-byte encoding used for Unicode.
> ISO-8859-1 is a single-byte character set.
>
>> and I notice that . (the dot) doesn't seem to match accented chars,
>> leading to some pretty unexpected results.
>
> I don't see this problem with either UTF-8 encoded files or with
> ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the
> character encoding from the environment (specifically LANG). I don't
> say this to "show off", just to point out that it does seem to work.
>
> It may simply be that you have a mismatch between the setting of LANG
> and the encoding in the file. You can change LANG for just one command
> like this:
>
>     LANG=en_GB.iso-8859-1 grep c.d data
>
>> I know that internationalization and encodings are a hornet's nest,
>> so I'm seeking some advice here...
>
> It can be. Best start with an example. Probably the only way for
> everyone to know exactly what you have in the data file is to post a hex
> dump of it (keep it short). Post the value of $LANG and the command
> line that does not do what you expect. Initially, avoid command lines
> that use anything but "plain" characters.

I'm not sure this matters, but here's what info sed says (in the "bugs that are not bugs" section):

    `s/.*//' does not clear pattern space

    This happens if your input stream includes invalid multibyte sequences. POSIX mandates that such sequences are _not_ matched by `.', so that `s/.*//' will not clear pattern space as you would expect. In fact, there is no way to clear sed's buffers in the middle of the script in most multibyte locales (including UTF-8 locales). For this reason, GNU `sed' provides a `z' command (for `zap') as an extension.

    To work around these problems, which may cause bugs in shell scripts, set the `LC_COLLATE' and `LC_CTYPE' environment variables to `C'.
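The workaround from the sed manual can be tried on a single line, roughly as sketched below. The function names and the sample bytes are made up for illustration. `LC_ALL=C` is used here rather than the two variables the manual names, on the assumption that LC_ALL may already be set in the environment, in which case it would override both LC_COLLATE and LC_CTYPE.

```shell
#!/bin/sh
# Hypothetical input: "a", a lone 0xE9 byte (valid ISO-8859-1, but an
# invalid sequence when interpreted as UTF-8), "b", newline.
latin1_line() {
    printf 'a\351b\n'
}

# Clear the pattern space with the C locale forced for this one command,
# so that every byte counts as one character and `.` can match it.
clear_line_c_locale() {
    latin1_line | LC_ALL=C sed 's/.*//'
}

# Count the bytes left on the line; only the trailing newline should remain.
clear_line_c_locale | wc -c
```

Running the same `sed` command under a UTF-8 locale is the case the manual warns about: there the 0xE9 byte is an invalid sequence and `.*` is not expected to match across it.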
From: Janis Papanagnou on 31 May 2010 09:03

Thomas 'PointedEars' Lahn wrote:

> Ben Bacarisse wrote:
>
>> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>> There is a mismatch between the subject line and this remark. UTF
>> usually means UTF-8
>
> [...]
>
>> ISO-8859-1 is a single-byte character set.
>
> Now you are obviously confusing character set and encoding.

ISO/IEC 8859-1: "8-bit single-byte coded graphic character sets"

Omitting the word "coded" is no fault, because it is clear that the coding is meant if Ben says "single byte"[*]. I am sure no one (but you) is confusing anything here. Every "character set" needs some encoding specified; all character sets (ASCII, Latin 1, EBCDIC, etc.) do that. We're not talking about the visual graphemes or abstract characters, but about the coupling of those with their respective encoding when we speak of "character sets".

But, anyway, Ben's main point here was that there's a mismatch in the OP's posting.

Janis

[*] There's a better nitpick here, BTW; a byte is not generally defined as an 8-bit quantity, so it's better to say 8-bit byte or octet.