Prev: Email in 2.6.4
Next: extracting unicode text from pdfs
From: Steven D'Aprano on 24 May 2010 08:13

Do unicode.lower() or unicode.upper() ever change the length of the string?

The Unicode standard allows for case conversions that change length, e.g. sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:

>>> s = "Paßstraße"
>>> s.upper()
'PAßSTRAßE'

The more I think about this, the more I think that upper/lower/title case conversions should change length (at least sometimes) and if Python doesn't do so, that's a bug. Any thoughts?

--
Steven

From: Mark Dickinson on 24 May 2010 08:20

On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?

From looking at the source, in particular the fixupper and fixlower functions in Objects/unicodeobject.c [1], it looks like not: they do a simple character-by-character replacement.

[1] http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup

--
Mark

From: Mark Dickinson on 24 May 2010 08:44

On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?
>
> The Unicode standard allows for case conversions that change length, e.g.
> sharp-S in German should convert to SS:
>
> http://unicode.org/faq/casemap_charprop.html#6
>
> but I see that Python doesn't do that:
>
> >>> s = "Paßstraße"
> >>> s.upper()
> 'PAßSTRAßE'
>
> The more I think about this, the more I think that upper/lower/title case
> conversions should change length (at least sometimes) and if Python
> doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods are using the Simple_{Upper,Lower,Title}case_Mapping functions described at http://www.unicode.org/Public/5.1.0/ucd/UCD.html (fields 12, 13 and 14 of the Unicode data). You can see this in the source in Tools/unicode/makeunicodedata.py, which is the Python code that generates the database of Unicode properties. It contains code like:

    if record[12]:
        upper = int(record[12], 16)
    else:
        upper = char
    if record[13]:
        lower = int(record[13], 16)
    else:
        lower = char
    if record[14]:
        title = int(record[14], 16)

... and so on.

I agree that it might be desirable for these operations to produce the multicharacter equivalents. That idea looks like a tough sell, though: apart from backwards compatibility concerns (which could probably be worked around somehow), it looks as though it would require significant effort to implement.

--
Mark

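As a rough illustration of the simple mappings described above (not the actual makeunicodedata.py code), the sketch below parses two sample records laid out like UnicodeData.txt lines and pulls out fields 12, 13 and 14. The function name, the sample records (comment fields left empty) and the titlecase fallback to the uppercase mapping are assumptions made for the example. Written for Python 3.

```python
# Sample records in the semicolon-separated UnicodeData.txt layout.
# Fields 12, 13 and 14 hold the simple upper/lower/title case mappings.
SAMPLE_RECORDS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]


def simple_case_mappings(line):
    """Return (char, upper, lower, title) code points for one record.

    Hypothetical helper: an empty case field means "no simple mapping",
    so the value falls back to the character itself. The title fallback
    to `upper` is an assumption, not taken from the thread.
    """
    record = line.split(";")
    char = int(record[0], 16)
    upper = int(record[12], 16) if record[12] else char
    lower = int(record[13], 16) if record[13] else char
    title = int(record[14], 16) if record[14] else upper
    return char, upper, lower, title


for rec in SAMPLE_RECORDS:
    char, upper, lower, title = simple_case_mappings(rec)
    print("U+%04X -> upper U+%04X, lower U+%04X, title U+%04X"
          % (char, upper, lower, title))
```

For the sharp-S record all three case fields are empty, so every simple mapping falls back to the character itself, which is consistent with the 'PAßSTRAßE' result shown earlier in the thread.
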
From: MRAB on 24 May 2010 10:42

Mark Dickinson wrote:
> On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-cybersource.com.au> wrote:
>> Do unicode.lower() or unicode.upper() ever change the length of the
>> string?
>>
>> The Unicode standard allows for case conversions that change length, e.g.
>> sharp-S in German should convert to SS:
>>
>> http://unicode.org/faq/casemap_charprop.html#6
>>
>> but I see that Python doesn't do that:
>>
>> >>> s = "Paßstraße"
>> >>> s.upper()
>> 'PAßSTRAßE'
>>
>> The more I think about this, the more I think that upper/lower/title case
>> conversions should change length (at least sometimes) and if Python
>> doesn't do so, that's a bug. Any thoughts?
>
> Digging a bit deeper, it looks like these methods are using the
> Simple_{Upper,Lower,Title}case_Mapping functions described at
> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
> of the unicode data; you can see this in the source in
> Tools/unicode/makeunicodedata.py, which is the Python code that
> generates the database of unicode properties. It contains code like:
>
>     if record[12]:
>         upper = int(record[12], 16)
>     else:
>         upper = char
>     if record[13]:
>         lower = int(record[13], 16)
>     else:
>         lower = char
>     if record[14]:
>         title = int(record[14], 16)
>
> ... and so on.
>
> I agree that it might be desirable for these operations to produce the
> multicharacter equivalents. That idea looks like a tough sell,
> though: apart from backwards compatibility concerns (which could
> probably be worked around somehow), it looks as though it would
> require significant effort to implement.

If we were to make such a change, I think we should also cater for locale-specific case changes (passing the locale to 'upper', 'lower' and 'title').

For example, normally "i".upper() returns "I", but in Turkish "i".upper() should return "İ" (the uppercase version of lowercase dotted i is uppercase dotted I).

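A minimal sketch of what passing a locale to 'upper' might look like, assuming Python 3 strings. The function name, the `language` parameter and the set of language codes are hypothetical; nothing like this exists in the str/unicode API discussed in the thread, and real locale-aware casing has more rules than this single one.

```python
# Languages that distinguish dotted and dotless i (illustrative set only).
DOTTED_I_LANGUAGES = {"tr", "az"}

CAPITAL_I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def upper_for_language(text, language=None):
    """Uppercase `text`, mapping 'i' to dotted capital I for Turkic languages.

    Hypothetical helper: only the dotted-i rule is handled; everything
    else is delegated to the built-in upper().
    """
    if language in DOTTED_I_LANGUAGES:
        # Replace before calling upper(), which would otherwise turn
        # 'i' into plain 'I'.
        text = text.replace("i", CAPITAL_I_WITH_DOT)
    return text.upper()


print(upper_for_language("i"))        # default behaviour
print(upper_for_language("i", "tr"))  # Turkish behaviour
```

The language is passed explicitly here instead of being read from the process locale, so the result does not depend on global state.
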
From: Terry Reedy on 24 May 2010 14:01

On 5/24/2010 10:42 AM, MRAB wrote:
> Mark Dickinson wrote:
>> Digging a bit deeper, it looks like these methods are using the
>> Simple_{Upper,Lower,Title}case_Mapping functions described at
>> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
>> of the unicode data; you can see this in the source in
>> Tools/unicode/makeunicodedata.py, which is the Python code that
>> generates the database of unicode properties. It contains code like:
>>
>>     if record[12]:
>>         upper = int(record[12], 16)
>>     else:
>>         upper = char
>>     if record[13]:
>>         lower = int(record[13], 16)
>>     else:
>>         lower = char
>>     if record[14]:
>>         title = int(record[14], 16)
>>
>> ... and so on.
>>
>> I agree that it might be desirable for these operations to produce the
>> multicharacter equivalents. That idea looks like a tough sell,
>> though: apart from backwards compatibility concerns (which could
>> probably be worked around somehow), it looks as though it would
>> require significant effort to implement.
>
> If we were to make such a change, I think we should also cater for
> locale-specific case changes (passing the locale to 'upper', 'lower' and
> 'title').
>
> For example, normally "i".upper() returns "I", but in Turkish
> "i".upper() should return "İ" (the uppercase version of lowercase dotted
> i is uppercase dotted I).

Given that the current (simple) functions implement standard-defined functions, I think any change should be to *add* new 'complex-case-change' functions.

Terry Jan Reedy
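
One way to picture a separate 'complex-case-change' function is a small table of multi-character mappings consulted before the per-character conversion. This is only a sketch under stated assumptions: the name full_upper and the two-entry table are hypothetical, the table is a tiny hand-picked subset of the kind of data the Unicode standard defines, and the code is written for Python 3.

```python
# Hand-picked multi-character uppercase mappings (illustrative subset).
SPECIAL_UPPER = {
    "\u00df": "SS",  # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",  # LATIN SMALL LIGATURE FI
}


def full_upper(text):
    """Uppercase `text`, allowing one character to expand to several.

    Hypothetical helper: characters in SPECIAL_UPPER use the table,
    everything else goes through the built-in per-character upper().
    """
    return "".join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)


s = "Pa\u00dfstra\u00dfe"  # "Paßstraße"
result = full_upper(s)
print(result, len(s), len(result))
```

Because the existing methods are left alone, code that relies on upper() preserving the string length keeps working, and callers opt in to the length-changing behaviour.
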