Prev: Pattern matching
Next: Daemons and hangups
From: Lao Ming on 7 Jul 2010 22:44

When I download a text file containing octal characters

e.g. \342\200\231

as in: can\342\200\231t or don\342\200\231t

is there a way to replace these with their ASCII equivalent
from the shell with sed, perl or awk?

Thanks.
From: Janis Papanagnou on 8 Jul 2010 04:13

Lao Ming wrote:
> When I download a text file containing octal characters
>
> e.g. \342\200\231
>
> as in: can\342\200\231t or don\342\200\231t
>
> is there a way to replace these with their ASCII equivalent
> from the shell with sed, perl or awk?

I fear there might not be an ASCII equivalent if the file uses an encoding of some character set other than ASCII. You'll first have to find out which encoding was used. Then the program iconv may help you convert the data.

Janis

> Thanks.
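
One way to test a guess about the encoding is to ask iconv to convert the file to the same encoding and throw the output away: iconv exits non-zero when it meets a byte sequence that is invalid in the source encoding. The sketch below assumes the guess is UTF-8 (nothing in the post above establishes that) and an iconv that accepts the POSIX -f/-t options. The function name is_utf8 is made up for illustration.

```shell
# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# Assumes an iconv that accepts the POSIX -f and -t options.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

It could be used as `is_utf8 my-input-file && echo "decodes as UTF-8"`. A pure ASCII file passes too, since ASCII is a subset of UTF-8, so a success only says the guess is consistent with the data, not that it is right.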
From: Ben Bacarisse on 8 Jul 2010 08:23

Lao Ming <laomingliu(a)gmail.com> writes:
> When I download a text file containing octal characters
>
> e.g. \342\200\231
>
> as in: can\342\200\231t or don\342\200\231t
>
> is there a way to replace these with their ASCII equivalent
> from the shell with sed, perl or awk?

The example is a useful one. \342\200\231 is the UTF-8 encoding of a "right single quote", which Unicode recommends as the character to use for an apostrophe. It is therefore very likely that the file is UTF-8 encoded.

When you say the file contains octal characters, it is not clear whether you are showing us the octal values of the characters or whether the file really has a backslash followed by three digits. In other words, does \342\200\231 represent 3 or 12 octets?

If (as is likely) it is the former, then iconv (with //translit) is the place to start. You may run into trouble when there are characters in the file that have no obvious ASCII equivalent, but that is another problem.

  iconv --from=utf-8 --to=ascii//translit my-input-file

-- 
Ben.