From: Dotan Cohen on 3 Aug 2010 14:53

On Tue, Aug 3, 2010 at 18:41, Dave Angel <davea(a)ieee.org> wrote:
> I don't understand your wording. Certainly the server launches the python
> script, and captures stdout. It then sends that stream of bytes out over
> tcp/ip to the waiting browser. You ask when does it become html ? I don't
> think the question has meaning.
>

HTML is just plain text. So the answer to the question is that ideally, the plain text that is sent to stdout would already be HTML.

print ( "<title>My Greek Page</title>\n" )

--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
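As a rough illustration of "the text sent to stdout is already HTML", here is a minimal sketch of a CGI-style response built as bytes. It is written for Python 3 (the thread itself uses Python 2), and the function name `build_page`, the page layout, and the choice of UTF-8 are assumptions made for the example, not something taken from the original script.

```python
import html

def build_page(title, body):
    """Return a complete CGI response (header, blank line, HTML) as UTF-8 bytes."""
    # Hypothetical page layout; html.escape guards against stray < > & in the text.
    document = (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\">"
        "<title>{}</title></head>\n"
        "<body><p>{}</p></body></html>\n"
    ).format(html.escape(title), html.escape(body))
    # The header names the same encoding that is used to produce the bytes.
    header = "Content-Type: text/html; charset=utf-8\r\n\r\n"
    return (header + document).encode("utf-8")
```

In a CGI script the result would typically be written with `sys.stdout.buffer.write(build_page("My Greek Page", "καλημέρα"))`, so that the bytes reach the web server exactly as encoded.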
From: MRAB on 3 Aug 2010 15:04

Dave Angel wrote:
> Νίκος wrote:
>>> On 3 Αύγ, 18:41, Dave Angel <da...(a)ieee.org> wrote:
>>>
>>>> Different encodings equal different ways of storing the data to the
>>>> media, correct?
>>>>
>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>> possible characters. Further, even the subset of characters that *do*
>>> take one byte are different for different encodings. So you need to tell
>>> the editor what encoding you want to use.
>>>
>>
>> For example an 'a' char in iso-8859-1 is stored different than an 'a'
>> char in iso-8859-7 and an 'a' char of utf-8 ?
>>
>>
>>
> Nope, the ASCII subset is identical. It's the ones between 80 and ff
> that differ, and of course not all of those. Further, some of the codes
> that are one byte in 8859 are two bytes in utf-8.
>
> You *could* just decide that you're going to hardwire the assumption
> that you'll be dealing with a single character set that does fit in 8
> bits, and most of this complexity goes away. But if you do that, do
> *NOT* use utf-8.
>
> But if you do want to be able to handle more than 256 characters, or
> more than one encoding, read on.
>
> Many people confuse encoding and decoding. A unicode character is an
> abstraction which represents a raw character. For convenience, the first
> 128 code points map directly onto the 7 bit encoding called ASCII. But
> before Unicode there were several other extensions to 256, which were
> incompatible with each other. For example, a byte which might be a
> European character in one such encoding might be a kata-kana character
> in another one. Each encoding was 8 bits, but it was difficult for a
> single program to handle more than one such encoding.
>

One encoding might be ASCII + accented Latin, another ASCII + Greek, another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin + Greek then you'd need more than 1 byte per character.

If you're working with multiple alphabets it gets very messy, which is where Unicode comes in. It contains all those characters, and UTF-8 can encode all of them in a straightforward manner.

> So along comes unicode, which is typically implemented in 16 or 32 bit
> cells. And it has an 8 bit encoding called utf-8 which uses one byte for
> the first 192 characters (I think), and two bytes for some more, and
> three bytes beyond that.
>
[snip]

In UTF-8 the first 128 codepoints are encoded to 1 byte.
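The points made in this exchange can be checked directly. The sketch below is written for Python 3 (an assumption, since the thread uses Python 2); the helper name `utf8_len` and the sample characters are chosen only for illustration.

```python
def utf8_len(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# Code points below 0x80 take one byte; higher ones take two, three or four.
for ch in ("a", "\x7f", "\x80", "α", "€", "\U0001F600"):
    print("U+{:04X} -> {} byte(s)".format(ord(ch), utf8_len(ch)))

# The ASCII subset is stored identically in all three encodings discussed.
ascii_bytes = {enc: "a".encode(enc) for enc in ("iso-8859-1", "iso-8859-7", "utf-8")}

# A byte in the 0x80-0xff range means different things in different 8-bit encodings.
high_byte = {enc: b"\xe1".decode(enc) for enc in ("iso-8859-1", "iso-8859-7")}
```

Here `ascii_bytes` holds the same single byte three times, while `high_byte` maps the one byte 0xE1 to an accented Latin letter in ISO-8859-1 and to a Greek letter in ISO-8859-7.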
From: Dave Angel on 3 Aug 2010 16:41

MRAB wrote:
> Dave Angel wrote:
>> Νίκος wrote:
>>>> On 3 Αύγ, 18:41, Dave Angel <da...(a)ieee.org> wrote:
>>>>> Different encodings equal different ways of storing the data to the
>>>>> media, correct?
>>>> Exactly. The file is a stream of bytes, and Unicode has more than 256
>>>> possible characters. Further, even the subset of characters that *do*
>>>> take one byte are different for different encodings. So you need to
>>>> tell the editor what encoding you want to use.
>>>
>>> For example an 'a' char in iso-8859-1 is stored different than an 'a'
>>> char in iso-8859-7 and an 'a' char of utf-8 ?
>>>
>>>
>> Nope, the ASCII subset is identical. It's the ones between 80 and ff
>> that differ, and of course not all of those. Further, some of the
>> codes that are one byte in 8859 are two bytes in utf-8.
>>
>> You *could* just decide that you're going to hardwire the assumption
>> that you'll be dealing with a single character set that does fit in 8
>> bits, and most of this complexity goes away. But if you do that, do
>> *NOT* use utf-8.
>>
>> But if you do want to be able to handle more than 256 characters, or
>> more than one encoding, read on.
>>
>> Many people confuse encoding and decoding. A unicode character is an
>> abstraction which represents a raw character. For convenience, the
>> first 128 code points map directly onto the 7 bit encoding called
>> ASCII. But before Unicode there were several other extensions to 256,
>> which were incompatible with each other. For example, a byte which
>> might be a European character in one such encoding might be a
>> kata-kana character in another one. Each encoding was 8 bits, but it
>> was difficult for a single program to handle more than one such
>> encoding.
>>
> One encoding might be ASCII + accented Latin, another ASCII + Greek,
> another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
> Greek then you'd need more than 1 byte per character.
>
> If you're working with multiple alphabets it gets very messy, which is
> where Unicode comes in. It contains all those characters, and UTF-8 can
> encode all of them in a straightforward manner.
>
>> So along comes unicode, which is typically implemented in 16 or 32
>> bit cells. And it has an 8 bit encoding called utf-8 which uses one
>> byte for the first 192 characters (I think), and two bytes for some
>> more, and three bytes beyond that.
>>
> [snip]
> In UTF-8 the first 128 codepoints are encoded to 1 byte.
>
>

Thanks for the correction. As I said, I wasn't sure. I did a utf-8 encoder and decoder about a dozen years ago, and I remember parts of it use the top two bits specially. But I've checked now, and you're right, the cutoff is 7f.

DaveA
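The remark about the top two bits can be illustrated with a short sketch (Python 3 assumed; the function name `classify_utf8_bytes` and its labels are made up for this example). In UTF-8, a byte below 0x80 stands alone, a byte whose top two bits are `10` continues a multi-byte sequence, and a byte whose top two bits are `11` starts one.

```python
def classify_utf8_bytes(text):
    """Label each UTF-8 byte of `text` according to its leading bits."""
    labels = []
    for b in text.encode("utf-8"):
        if b < 0x80:              # 0xxxxxxx: a one-byte (ASCII) character
            labels.append((hex(b), "single"))
        elif b >> 6 == 0b10:      # 10xxxxxx: continuation byte
            labels.append((hex(b), "continuation"))
        else:                     # 11xxxxxx: first byte of a multi-byte sequence
            labels.append((hex(b), "lead"))
    return labels
```

For instance, `classify_utf8_bytes("aα")` labels the byte for `a` as single and the two bytes for `α` as a lead byte followed by a continuation byte, which is consistent with the cutoff at 7f.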
From: Νίκος on 3 Aug 2010 21:41

> On 3 Αύγ, 21:00, Dave Angel <da...(a)ieee.org> wrote:

> A string is an object containing characters. A string literal is one of
> the ways you create such an object. When you create it that way, you
> need to make sure the compiler knows the correct encoding, by using the
> encoding: line at beginning of file.

mymessage = "καλημέρα" <==== string
mymessage = u"καλημέρα" <==== string literal?

So, a string literal is one of the encodings I use to create a string object?

Can the encoding of a python script file be iso-8859-7, which means the file contents are saved to the hdd as greek-iso, while the part holding this variable's value, mymessage = u"καλημέρα", is saved as utf-8? Or the opposite: have the file saved as utf-8 but one variable's value in greek encoding?

Encodings still give me headaches. I try to understand them as different ways to store data in a media.

Tell me something. What encoding should I pick for my scripts, knowing that they only contain english + greek chars? iso-8859-7 or utf-8, and why?

Can I save the string, let's say "Νίκος", in different encodings and still have it print out correctly in the browser?

ascii = the standard english character set only, right?

> The web server wraps a few characters before and after your html stream,
> but it shouldn't touch the stream itself.

So the python compiler using the cgi module is the one that is producing the html output, which is sent to the web server immediately afterwards, right?

> > For example if i say mymessage = "καλημέρα" and then i say mymessage = u"καλημέρα" then the 1st one is a greek encoding variable while the
> > 2nd is a utf-8 one?
>
> No, the first is an 8 bit copy of whatever bytes your editor happened to
> save.

But since mymessage = "καλημέρα" is a string containing greek characters, why doesn't the editor save it as such?

It reminds me of variables and values, where if you say

a = 5 , a instantly becomes an integer variable
while
a = 'hello' , a instantly becomes a string variable

> mymessage = u"καλημέρα"
>
> creates an object that is *not* encoded.

Because it isn't saved by the editor yet? In what state is this object before it gets encoded? And does it get encoded the minute I tell the editor to save the file?

> Encoding is taking the unicode
> stream and representing it as a stream of bytes, which may or may not have
> more bytes than the original has characters.

So what this line, mymessage = u"καλημέρα", does is tell the browser that when it's time to save the whole file, it should save this string as utf-8?

If yes, then if I were to save the above string in greek encoding, how was I supposed to write it?

Also, if you use the 'coding line' at the beginning of the file, is there a need for using the u literal?

> I personally haven't done any cookie code. If I were debugging this, I'd
> factor out the multiple parts of that if statement, and find out which
> one isn't true. From here I can't guess.

I did what you say and found out that both parts of the if condition were always false; that's why the if code block never got executed.

And it is always false because the cookie never gets set.

So can you please tell me why this line

cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 ) #this cookie will expire in a year

never created a cookie?
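On the cookie question, the thread does not show how `cookie` was created or where it is printed, so the following is only a sketch under assumptions: it supposes a `SimpleCookie` from the standard library (`http.cookies` in Python 3, the `Cookie` module in Python 2), and the helper name `visitor_cookie_header` is hypothetical. With that class, the value and the lifetime are set separately. Assigning a tuple would most likely store the tuple's text as the value and set no expiry. The resulting header line also has to be sent before the blank line that ends the CGI headers.

```python
from http.cookies import SimpleCookie

ONE_YEAR = 60 * 60 * 24 * 365  # seconds, same arithmetic as in the question

def visitor_cookie_header(name, max_age_seconds=ONE_YEAR):
    """Build a Set-Cookie header line for a 'visitor' cookie."""
    cookie = SimpleCookie()
    cookie["visitor"] = name                        # the cookie value itself
    cookie["visitor"]["max-age"] = max_age_seconds  # lifetime as an attribute
    cookie["visitor"]["path"] = "/"                 # assumption: valid site-wide
    return cookie.output()
```

A CGI script would print `visitor_cookie_header("nikos")` together with the `Content-Type` header, followed by an empty line, and only then the HTML body.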
From: Benjamin Kaplan on 3 Aug 2010 22:36
2010/8/3 Νίκος <nikos.the.gr33k(a)gmail.com>:
>> On 3 Αύγ, 21:00, Dave Angel <da...(a)ieee.org> wrote:
>
>> A string is an object containing characters. A string literal is one of
>> the ways you create such an object. When you create it that way, you
>> need to make sure the compiler knows the correct encoding, by using the
>> encoding: line at beginning of file.
>
>
> mymessage = "καλημέρα" <==== string
> mymessage = u"καλημέρα" <==== string literal?

Not quite. A literal is the actual string in the file, those letters between the quotes:

"καλημέρα" <=== String literal (a literal value of the string/str type)
u"καλημέρα" <=== Unicode literal (a literal value of the Unicode type. The bytes on the page will be converted to unicode using the file's encoding)
mymessage <==== String (not literal, because it's a value)

>
> So, a string literal is one of the encodings I use to create a string
> object?
>
> Can the encoding of a python script file be iso-8859-7, which means
> the file contents are saved to the hdd as greek-iso, while the part holding
> this variable's value, mymessage = u"καλημέρα", is saved as utf-8? Or the
> opposite:
>

The compiler does not see u"καλημέρα" on the page. All it sees is the bytes

['0x75', '0x22', '0xea', '0xe1', '0xeb', '0xe7', '0xec', '0xdd', '0xf1', '0xe1', '0x22']

Now the compiler knows that the sequence 0x75 0x22 (Stuff) 0x22 means to create a Unicode literal. So it takes those bytes ('0xea', '0xe1', '0xeb', '0xe7', '0xec', '0xdd', '0xf1', '0xe1') and decodes them using the page's encoding, in your case ISO-8859-7. At this point, they don't have an encoding. They aren't bytes as far as you are concerned, they are code points. Internally, they're stored as either UTF-16 or UTF-32 depending on how Python was compiled, but that doesn't matter. You can treat them as if they are characters.

> have the file saved as utf-8 but one variable's value in greek
> encoding?
>

Sure you can. A unicode literal will always have the encoding of the file. But a string is just a sequence of bytes (forget about the characters that show up on the page for now). If you do

"\xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xad\xcf\x81\xce\xb1".decode('UTF-8')

then Python will take that sequence of bytes and interpret them as UTF-8. That will give you the same Unicode string you started out with: u"καλημέρα"

> Encodings still give me headaches. I try to understand them as
> different ways to store data in a media.
>
> Tell me something. What encoding should I pick for my scripts, knowing
> that they only contain english + greek chars?
> iso-8859-7 or utf-8, and why?
>
> Can I save the string, let's say "Νίκος", in different encodings and still
> have it print out correctly in the browser?
>
> ascii = the standard english character set only, right?
>

Yes.

>> The web server wraps a few characters before and after your html stream,
>> but it shouldn't touch the stream itself.
>
> So the python compiler using the cgi module is the one that is
> producing the html output, which is sent to the web server immediately
> afterwards, right?
>
>
>> > For example if i say mymessage = "καλημέρα" and then i say mymessage = u"καλημέρα" then the 1st one is a greek encoding variable while the
>> > 2nd is a utf-8 one?

No. They both are in whatever encoding your file is using. But the first one will be interpreted as a sequence of bytes. The second one will be interpreted as a sequence of characters. For a single-byte encoding like ISO-8859-7, it doesn't make a difference. But if you were to encode it in UTF-8, the first one would have a length of 16 (because the Greek characters are all 2 bytes) and the 2nd one would have a length of 8.

>>
>> No, the first is an 8 bit copy of whatever bytes your editor happened to
>> save.
>
> But since mymessage = "καλημέρα" is a string containing greek
> characters, why doesn't the editor save it as such?
>

Because you don't save characters, you save bytes.

\xce\xba\xce\xb1\xce\xbb\xce\xb7\xce\xbc\xce\xad\xcf\x81\xce\xb1 is your String in UTF-8
\xea\xe1\xeb\xe7\xec\xdd\xf1\xe1 is that exact same string in ISO-8859-7

They are two different ways of representing the same characters.

> It reminds me of variables and values, where if you say
>
> a = 5 , a instantly becomes an integer variable
> while
> a = 'hello' , a instantly becomes a string variable
>
>
>> mymessage = u"καλημέρα"
>>
>> creates an object that is *not* encoded.
>
> Because it isn't saved by the editor yet? In what state is this object
> before it gets encoded?
> And does it get encoded the minute I tell the editor to save the file?
>
>> Encoding is taking the unicode
>> stream and representing it as a stream of bytes, which may or may not have
>> more bytes than the original has characters.
>
>
> So what this line, mymessage = u"καλημέρα", does is tell the browser
> that when it's time to save the whole file, it should save this string as
> utf-8?
>
> If yes, then if I were to save the above string in greek encoding, how
> was I supposed to write it?
>
> Also, if you use the 'coding line' at the beginning of the file, is there
> a need for using the u literal?
>
>> I personally haven't done any cookie code. If I were debugging this, I'd
>> factor out the multiple parts of that if statement, and find out which
>> one isn't true. From here I can't guess.
>
> I did what you say and found out that both parts of the if condition
> were always false; that's why the if code block never got executed.
>
> And it is always false because the cookie never gets set.
>
> So can you please tell me why this line
>
> cookie['visitor'] = ( 'nikos', time() + 60*60*24*365 ) #this cookie
> will expire in a year
>
> never created a cookie?
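The byte sequences quoted in this reply can be reproduced with a few lines. This is a sketch in Python 3 (an assumption, since the thread is about Python 2): the thread's 8-bit `str` corresponds to `bytes` here and its `unicode` type to `str`. The variable names are chosen only for the example.

```python
text = "καλημέρα"                          # eight characters (code points)

utf8_bytes = text.encode("utf-8")          # encoding: characters -> bytes
greek_bytes = text.encode("iso-8859-7")

# The same characters, stored two different ways.
print(len(text), len(utf8_bytes), len(greek_bytes))

# Decoding goes the other way: bytes -> characters. Each byte string gives
# back the original text only when decoded with the encoding that produced it.
from_utf8 = utf8_bytes.decode("utf-8")
from_greek = greek_bytes.decode("iso-8859-7")
```

Both `from_utf8` and `from_greek` equal `text`, which is why either encoding can produce a correct page as long as the bytes and the declared charset agree.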