From: JB on 17 May 2010 18:10

I'm working on our company intranet's webapp and have a question about properly handling user input that is causing encoding issues. Some of the users take notes in Microsoft Office and copy/paste them into textareas of the webapp. Some of the characters from Word, such as hyphens (–) and apostrophes (’), are in an odd encoding. When passed to the database using sqlalchemy they appear as – and other characters.

What's the proper handling (conversion?) of user input before it gets to my database? Do I need to start making a list of the offending characters and .replace them? Or is there a means to decode/encode the user input to something more generic?

Thanks for your time.
From: Neil Hodgson on 17 May 2010 19:05

JB:

> as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> to the database using sqlalchemy they appear as – and other
> characters.

The encoding is UTF-8. Normally the best way to handle encodings is to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible and perform most processing in Unicode.

Neil
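The symptom JB describes can be reproduced in a few lines, which helps confirm where the bytes are being misread. The sketch below assumes Python 3 (where str is Unicode) and uses two hypothetical helper names, mojibake and repair, that are not part of the thread. It shows that an en dash or curly apostrophe encoded as UTF-8 and then displayed as Windows-1252 produces the "â€" sequences from the question.

```python
# Assumption: Python 3. The helper names are illustrative only.

def mojibake(text):
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled):
    """Reverse the misreading above. Only valid for text garbled this exact way."""
    return garbled.encode("cp1252").decode("utf-8")


en_dash = "\u2013"        # the "hyphen" Word inserts
apostrophe = "\u2019"     # the curly apostrophe Word inserts

garbled_dash = mojibake(en_dash)
garbled_apostrophe = mojibake(apostrophe)

print(repr(garbled_dash))        # three characters starting with 'â'
print(repr(garbled_apostrophe))
print(repair(garbled_dash) == en_dash)
```

The repair function is a diagnostic aid rather than a recommended fix; decoding the input correctly when it arrives avoids the garbling in the first place.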
From: Bryan on 17 May 2010 22:38

Neil Hodgson wrote:
> JB:
>
> > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> > to the database using sqlalchemy they appear as – and other
> > characters.
>
> The encoding is UTF-8. Normally the best way to handle encodings is
> to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
> and perform most processing in Unicode.

Good advice to work in Unicode (and in Python 3.X str is unicode), but I'd guess the encoding he's getting is "Windows-1252". The default character set of HTTP is ISO-8859-1, but Microsoft likes to use Windows-1252 in its place.

What to do about it? First, try specifying utf-8 in the form containing the textarea, as in

    <form action="process.cgi" accept-charset="utf-8">

Note that specifying ISO-8859-1 will not work, in that Microsoft will still use Windows-1252. I've heard they've gotten better at supporting utf-8, but I haven't tested.

When a request comes in, check for a Content-Type header that names the character set, which should be:

    Content-Type: application/x-www-form-urlencoded; charset=utf-8

Then you can decode to a unicode object as Neil Hodgson explained.

In case you still have to deal with Windows-1252, Python knows how to translate Windows-1252 to the best fit in Unicode. In current Python 2.x:

    ustring = unicode(raw_string, 'Windows-1252')

In Python 3.X, what comes from a socket is bytes, and str means unicode:

    ustring = str(raw_bytes, 'Windows-1252')

Of course this all assumes that JB's database likes Unicode. If it chokes, then alternatives include encoding back to utf-8 and storing as binary, or translating characters to some best fit in the set the database supports.

--
--Bryan Olson
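The steps Bryan outlines (read the charset from the Content-Type header, decode as UTF-8 when possible, fall back to Windows-1252) might be combined roughly as follows. This is a minimal sketch assuming Python 3 and raw request bytes already in hand. The function names charset_from_content_type and decode_user_input are hypothetical, and treating a declared ISO-8859-1 as Windows-1252 is an assumption based on Bryan's remark, not something a particular framework does for you.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_user_input(raw_bytes, declared_charset=None):
    """Decode submitted bytes to str, preferring the declared charset.

    Order tried: declared charset (if any), then UTF-8, then Windows-1252.
    """
    candidates = []
    if declared_charset:
        name = declared_charset.lower()
        # Assumption: a declared ISO-8859-1 is really Windows-1252.
        if name in ("iso-8859-1", "latin-1", "latin1"):
            name = "windows-1252"
        candidates.append(name)
    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw_bytes.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid in that encoding.
            continue

    # Last resort: Windows-1252, replacing the few bytes it leaves undefined.
    return raw_bytes.decode("windows-1252", errors="replace")


header = "application/x-www-form-urlencoded; charset=utf-8"
charset = charset_from_content_type(header)
text = decode_user_input(b"\xe2\x80\x93", charset)
print(charset, repr(text))
```

Trying UTF-8 before Windows-1252 is a deliberate choice: bytes that are valid UTF-8 are rarely meaningful Windows-1252 text, whereas nearly any byte sequence decodes as Windows-1252, so the reverse order would never reach the UTF-8 branch.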