Can upper() or lower() ever change the length of a string? [Python]

From: Steven D'Aprano on 24 May 2010 08:13

Do unicode.lower() or unicode.upper() ever change the length of the
string?

The Unicode standard allows for case conversions that change length, e.g.
sharp-S in German should convert to SS:

http://unicode.org/faq/casemap_charprop.html#6

but I see that Python doesn't do that:

>>> s = "Paßstraße"
>>> s.upper()
'PAßSTRAßE'

The more I think about this, the more I think that upper/lower/title case
conversions should change length (at least sometimes) and if Python
doesn't do so, that's a bug. Any thoughts?
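
One way to check this empirically on a given interpreter is to scan every code point and collect those whose converted form is not exactly one character long. This is only a minimal sketch, assuming Python 3 (where str is Unicode and chr() covers the full range); the helper name length_changing_chars is made up for illustration, and what it reports depends on the Python version and the Unicode database it ships with.

```python
import sys

def length_changing_chars(method_name):
    """Return (char, converted) pairs whose case conversion is not one character long."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        # Look up str.upper / str.lower / str.title by name and apply it.
        converted = getattr(ch, method_name)()
        if len(converted) != 1:
            found.append((ch, converted))
    return found

upper_hits = length_changing_chars("upper")
print(len(upper_hits), "code points change length under upper()")
```

On Python 3.3 and later, which apply the full case mappings, this list is non-empty and includes ß, so len(s.upper()) can exceed len(s) there; the 2.x interpreter shown above behaves differently.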

--
Steven

From: Mark Dickinson on 24 May 2010 08:20

On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?

From looking at the source, in particular the fixupper and fixlower
functions in Objects/unicodeobject.c [1], it looks like not: they do a
simple character-by-character replacement.
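
To illustrate why a character-by-character replacement cannot change the length, here is a rough Python sketch of the idea. It is not the C code from the interpreter; the table SIMPLE_UPPER and the function simple_upper are hypothetical stand-ins, and the table covers only the letters needed for the example.

```python
# Hypothetical one-to-one table standing in for the simple uppercase data.
# Note that it has no entry for "ß", so that character passes through unchanged.
SIMPLE_UPPER = {"a": "A", "e": "E", "p": "P", "r": "R", "s": "S", "t": "T"}

def simple_upper(text, table=SIMPLE_UPPER):
    """Uppercase text one character at a time; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

s = "Paßstraße"
result = simple_upper(s)
print(result)
print(len(s) == len(result))
```

Because every input character yields exactly one output character, the result always has the same length as the input.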

[1] http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup
--
Mark

From: Mark Dickinson on 24 May 2010 08:44

On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?
>
> The Unicode standard allows for case conversions that change length, e.g.
> sharp-S in German should convert to SS:
>
> http://unicode.org/faq/casemap_charprop.html#6
>
> but I see that Python doesn't do that:
>
> >>> s = "Paßstraße"
> >>> s.upper()
> 'PAßSTRAßE'
>
> The more I think about this, the more I think that upper/lower/title case
> conversions should change length (at least sometimes) and if Python
> doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods use the
Simple_{Upper,Lower,Title}case_Mapping properties described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html, that is, fields 12,
13 and 14 of the Unicode data. You can see this in the source in
Tools/unicode/makeunicodedata.py, which is the Python code that
generates the database of Unicode properties. It contains code like:

    if record[12]:
        upper = int(record[12], 16)
    else:
        upper = char
    if record[13]:
        lower = int(record[13], 16)
    else:
        lower = char
    if record[14]:
        title = int(record[14], 16)

... and so on.
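
As a self-contained illustration of that logic, the sketch below parses semicolon-separated lines laid out like UnicodeData.txt and pulls out fields 12, 13 and 14. The function name simple_case_mappings and the two sample lines are assumptions made for the example, and since the excerpt above is cut off, the titlecase fallback here is a guess rather than what makeunicodedata.py necessarily does.

```python
def simple_case_mappings(line):
    """Parse one UnicodeData.txt-style line into (char, upper, lower, title) code points."""
    record = line.strip().split(";")
    char = int(record[0], 16)
    # An empty field means there is no simple mapping; use the character itself.
    upper = int(record[12], 16) if record[12] else char
    lower = int(record[13], 16) if record[13] else char
    # Assumption: with no titlecase field, fall back to the uppercase value.
    title = int(record[14], 16) if record[14] else upper
    return char, upper, lower, title

# Sample lines written in the 15-field UnicodeData.txt layout.
SAMPLE_LINES = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]

for line in SAMPLE_LINES:
    char, upper, lower, title = simple_case_mappings(line)
    print(chr(char), chr(upper), chr(lower), chr(title))
```

Each field holds at most one code point, so a table built this way has no room for a mapping such as ß to SS.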

I agree that it might be desirable for these operations to produce the
multicharacter equivalents. That idea looks like a tough sell,
though: apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.

--
Mark

From: MRAB on 24 May 2010 10:42

Mark Dickinson wrote:
> On May 24, 1:13 pm, Steven D'Aprano <st...(a)REMOVE-THIS-
> cybersource.com.au> wrote:
>> Do unicode.lower() or unicode.upper() ever change the length of the
>> string?
>>
>> The Unicode standard allows for case conversions that change length, e.g.
>> sharp-S in German should convert to SS:
>>
>> http://unicode.org/faq/casemap_charprop.html#6
>>
>> but I see that Python doesn't do that:
>>
>> >>> s = "Paßstraße"
>> >>> s.upper()
>> 'PAßSTRAßE'
>>
>> The more I think about this, the more I think that upper/lower/title case
>> conversions should change length (at least sometimes) and if Python
>> doesn't do so, that's a bug. Any thoughts?
>
> Digging a bit deeper, it looks like these methods are using the
> Simple_{Upper,Lower,Title}case_Mapping functions described at
> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
> of the unicode data; you can see this in the source in Tools/unicode/
> makeunicodedata.py, which is the Python code that generates the
> database of unicode properties. It contains code like:
>
>     if record[12]:
>         upper = int(record[12], 16)
>     else:
>         upper = char
>     if record[13]:
>         lower = int(record[13], 16)
>     else:
>         lower = char
>     if record[14]:
>         title = int(record[14], 16)
>
> ... and so on.
>
> I agree that it might be desirable for these operations to produce the
> multicharacter equivalents. That idea looks like a tough sell,
> though: apart from backwards compatibility concerns (which could
> probably be worked around somehow), it looks as though it would
> require significant effort to implement.
>
If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').

For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "İ" (the uppercase version of lowercase dotted
i is uppercase dotted I).
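
A language-aware variant could be sketched as a thin wrapper around the existing methods. This is only an illustration, assuming Python 3: the functions upper_for_language and lower_for_language and the TURKIC_LANGUAGES set are hypothetical names, not part of any existing API, and only the dotted/dotless i rule is handled.

```python
# Language codes that use the dotted/dotless i distinction (assumed set for this sketch).
TURKIC_LANGUAGES = {"tr", "az"}

def upper_for_language(text, language=None):
    """Uppercase text, mapping 'i' to capital dotted I for Turkic languages."""
    if language in TURKIC_LANGUAGES:
        # U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
        text = text.replace("i", "\u0130")
    return text.upper()

def lower_for_language(text, language=None):
    """Lowercase text, mapping 'I' to dotless i for Turkic languages."""
    if language in TURKIC_LANGUAGES:
        # U+0131 is LATIN SMALL LETTER DOTLESS I.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()

print(upper_for_language("istanbul"))
print(upper_for_language("istanbul", "tr"))
print(lower_for_language("ISTANBUL", "tr"))
```

Passing the language explicitly, rather than reading a process-wide locale setting, keeps the result of each call predictable.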

From: Terry Reedy on 24 May 2010 14:01

On 5/24/2010 10:42 AM, MRAB wrote:
> Mark Dickinson wrote:

>> Digging a bit deeper, it looks like these methods are using the
>> Simple_{Upper,Lower,Title}case_Mapping functions described at
>> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
>> of the unicode data; you can see this in the source in Tools/unicode/
>> makeunicodedata.py, which is the Python code that generates the
>> database of unicode properties. It contains code like:
>>
>>     if record[12]:
>>         upper = int(record[12], 16)
>>     else:
>>         upper = char
>>     if record[13]:
>>         lower = int(record[13], 16)
>>     else:
>>         lower = char
>>     if record[14]:
>>         title = int(record[14], 16)
>>
>> ... and so on.
>>
>> I agree that it might be desirable for these operations to produce the
>> multicharacter equivalents. That idea looks like a tough sell,
>> though: apart from backwards compatibility concerns (which could
>> probably be worked around somehow), it looks as though it would
>> require significant effort to implement.
>>
> If we were to make such a change, I think we should also cater for
> locale-specific case changes (passing the locale to 'upper', 'lower' and
> 'title').
>
> For example, normally "i".upper() returns "I", but in Turkish
> "i".upper() should return "İ" (the uppercase version of lowercase dotted
> i is uppercase dotted I).

Given that the current (simple) functions implement standard-defined
functions, I think any change should be to *add* new
'complex-case-change' functions.
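
What that might look like, as a hedged Python 3 sketch: a length-preserving function kept alongside a separate one that applies multi-character expansions. The names upper_simple and upper_full and the two-entry SPECIAL_UPPER table are invented for illustration, and upper_simple only approximates the standard's simple mapping by leaving alone any character that the interpreter would expand.

```python
# Hypothetical excerpt of multi-character uppercase mappings (sharp s, fi ligature).
SPECIAL_UPPER = {"\u00df": "SS", "\ufb01": "FI"}

def upper_simple(text):
    """One-to-one uppercase: never changes the length of text."""
    out = []
    for ch in text:
        converted = ch.upper()
        # Keep the character unchanged when the interpreter's mapping would expand it.
        out.append(converted if len(converted) == 1 else ch)
    return "".join(out)

def upper_full(text, special=SPECIAL_UPPER):
    """Uppercase with multi-character expansions; may lengthen text."""
    return "".join(special.get(ch) or upper_simple(ch) for ch in text)

s = "Paßstraße"
print(upper_simple(s), len(upper_simple(s)))
print(upper_full(s), len(upper_full(s)))
```

Keeping both functions lets callers that rely on the length staying fixed continue to use the one-to-one version.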

Terry Jan Reedy
