From: Peter Billam on 29 Apr 2010 07:36

I'm confused... "perldoc -f length" says that if the EXPR is in Unicode, you will get the number of characters, not the number of bytes, which is what I would want. But (in a one-line demo of a problem I have in a much larger module):

  $> perl -e '$l=length "ö"; print "length=$l\n";'
  length=2

But I want to see length=1 here... (in case your news-client doesn't do utf8, that string was an o-umlaut) I'm using v5.10.1 on debian squeeze and everything else works fine in utf8.

Regards, Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html

From: Helmut Richter on 29 Apr 2010 08:59

On Thu, 29 Apr 2010, Peter Billam wrote:

> $> perl -e '$l=length "ö"; print "length=$l\n";'
> length=2
>
> But I want to see length=1 here... (in case your news-client
> doesn't do utf8, that string was an o-umlaut) I'm using v5.10.1
> on debian squeeze and everything else works fine in utf8.

What happens: What you input is two bytes long, and perl does not know that the two bytes are meant as one character. perl sees the two characters "ö". If you output them as Unicode, you will even see them:

  perl -e '$l=length "ö"; binmode (STDOUT, "utf8"); print "length=$l === ö\n";'

yields

  length=2 === ö

That is, in your original command the binary output of the binary string "ö" was two errors that compensated each other.

What you mean: The input file is already to be interpreted as UTF-8. You should tell perl so:

  perl -e 'use utf8; $l=length "ö"; print "length=$l\n";'

--
Helmut Richter

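The difference between the two counts can be reproduced without depending on how the source file or the terminal is encoded. The following is a minimal sketch, not part of the original posts: it writes the o-umlaut as its two UTF-8 bytes and uses the core Encode module to turn them into a character string. The sub name lengths is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (number of bytes, number of characters)
# for a byte string that holds UTF-8 encoded text.
sub lengths {
    my ($bytes) = @_;
    my $n_bytes = length $bytes;              # counted before decoding
    my $chars   = decode('UTF-8', $bytes);    # byte string -> character string
    return ($n_bytes, length $chars);
}

# "\xC3\xB6" is the UTF-8 encoding of o-umlaut (U+00F6).
my ($n_bytes, $n_chars) = lengths("\xC3\xB6");
print "bytes=$n_bytes chars=$n_chars\n";      # prints "bytes=2 chars=1"
```

Until the bytes are decoded (or the source is declared UTF-8 with "use utf8"), length can only count bytes, which is the length=2 result seen above.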
From: Helmut Richter on 29 Apr 2010 09:00

On Thu, 29 Apr 2010, Helmut Richter wrote:

> The input file is already to be interpreted as UTF-8. You should tell perl so:

Better: The source file ...

--
Helmut Richter

From: Peter Billam on 29 Apr 2010 10:54

On 2010-04-29, Helmut Richter <hhr-m(a)web.de> wrote:
> On Thu, 29 Apr 2010, Helmut Richter wrote:
>> The input file is already to be interpreted as UTF-8.
>> You should tell perl so:
>
> Better: The source file ...

But if I tell perl that the source file is in utf8, then though it gets the length right :-) it can't print the string out :-(

  $> perl -e 'use utf8; $s="ö"; $l=length $s; print "length $s =$l\n";'
  length =1

(likewise if I use the code-point: $s="\x{00f6}";)

OTOH if I don't use "use utf8" then perl prints correctly :-) but gets the length wrong :-(

  $> perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
  length ö =2

I can't really afford to set the binmode explicitly; the "length" code and some "print"s are actually in a module, and the strings are passed to it from some calling program. So when I code the module I don't know in advance what program is going to be calling it, and whether it's printing into a utf environment. Does the module really have to test every string and inspect $ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?

I had been reading perldoc perluniintro:

  Starting from Perl 5.8.0, the use of "use utf8" is needed only in
  much more restricted circumstances. In earlier releases the "utf8"
  pragma was used to declare that operations in the current block or
  file would be Unicode-aware. This model was found to be wrong,
  or at least clumsy: the "Unicodeness" is now carried with the data,
  instead of being attached to the operations.

so why is the "print" wrong, if the "Unicodeness" is carried with the data ? Perl should know if it's in a utf environment and printing to a utf8 device; python does, and so does vi, less, slrn, alpine, firefox and everything else I use (except fmt).

Sorry for being so confused, I realise this must be old stuff :-(

Peter

--
Peter Billam www.pjb.com.au www.pjb.com.au/comp/contact.html

From: Helmut Richter on 29 Apr 2010 12:02

On Thu, 29 Apr 2010, Peter Billam wrote:

> I can't really afford to set the binmode explicitly; the "length"
> code and some "print"s are actually in a module, and the strings
> are passed to it from some calling program. So when I code the
> module I don't know in advance what program is going to
> be calling it, and whether it's printing into a utf environment.
> Does the module really have to test every string and inspect
> $ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?
> I had been reading perldoc perluniintro:
>
> Starting from Perl 5.8.0, the use of "use utf8" is needed only in
> much more restricted circumstances. In earlier releases the "utf8"
> pragma was used to declare that operations in the current block or
> file would be Unicode-aware. This model was found to be wrong,
> or at least clumsy: the "Unicodeness" is now carried with the data,
> instead of being attached to the operations.
>
> so why is the "print" wrong, if the "Unicodeness" is carried with
> the data ?

I find the term "Unicodeness" confusing, much more than the distinction of "character strings" vs. "byte strings" (as in http://perldoc.perl.org/perlunitut.html).

It is *you*, the programmer, who has to know whether strings are meant as strings of characters or as strings of bytes. Obviously, your strings are strings of characters. Whether perl stores them as Unicode or as anything else is not your problem; you cannot know and you need not know.

Now, when you read from a file or write to a file, it is suddenly important that you know what encoding is to be used in that file, because the character strings whose internal encoding you do not know must be constructed from the bytes in the file (or, on writing, they must be stored as bytes in the file). As the encoding used in the file cannot be determined reliably from the name or the contents of the file, it is you who has to tell perl, either by explicitly decoding/encoding the strings from/to that encoding, or by specifying the encoding as a layer on open/binmode. This is *also* true for STDIN/STDOUT/STDERR.

The open pragma <http://perldoc.perl.org/open.html> might assist you in selecting the right layers depending on the locale, provided the locale correctly specifies the encoding, which is by no means guaranteed (e.g. the encoding may change from one window to another without being reflected in the locale environment variables). I have no experience with the open pragma, though, so you have to find your way through it.

The utf8 pragma has no effect whatsoever on what the program does. It affects only the interpretation of the bytes in the source code. If your source code is in UTF-8 and contains "ö", you should use the utf8 pragma if this "ö" means one character, and you should not use it if it means two bytes (which in turn will be interpreted as two characters when you (ab)use this byte string in a context where a character string is needed).

> Perl should know if it's in a utf environment and
> printing to a utf8 device; python does, and so does vi, less,
> slrn, alpine, firefox and everything else I use (except fmt).

Whether perl's choice not to guess the encoding unless told is a good one is a matter of opinion. It can be tedious in environments where the same encoding is used everywhere, including all files and all databases, but it can save your application if this requirement is not met.

I hope that was of some help.

--
Helmut Richter

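The two options described in this post, encoding explicitly or putting a layer on the handle, might look like the following sketch. It is an illustration under assumptions rather than a recommendation from the thread: it assumes the output should be UTF-8, and the sub name to_utf8_bytes is hypothetical. The string is written with the \x{00f6} escape so the example does not depend on the encoding of the source file; with "use utf8", a literal "ö" in UTF-8 source should give the same character string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00F6 (o-umlaut), as a character string.
my $s = "\x{00f6}";

# Option 1 (hypothetical helper): encode explicitly, producing the bytes
# that would be written to a UTF-8 file or terminal.
sub to_utf8_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $bytes = to_utf8_bytes($s);
printf "characters=%d bytes=%d\n", length($s), length($bytes);

# Option 2: put an encoding layer on the handle and print character
# strings directly; perl encodes them on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print "length $s = ", length($s), "\n";
```

For a module that cannot know the caller's environment, one possible arrangement is to accept and return character strings and leave the choice of layer to the calling program. Without any layer, perl writes U+00F6 as the single byte 0xF6, which is not valid UTF-8 by itself, and that would explain the missing character in the "use utf8" example earlier in the thread.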