anomalies in capitalization in String functions [Lisp]


From: Tamas K Papp on 2 Mar 2010 11:33

On Tue, 02 Mar 2010 15:33:00 +0000, Tamas K Papp wrote:

> On Tue, 02 Mar 2010 07:27:16 -0800, Jerry Boetje wrote:
>
>> The spec for STRING-CAPITALIZE is defined to break into words where: "a
>> ``word'' is defined to be a consecutive subsequence consisting of
>> alphanumeric characters". This gives interesting results such as
>> "don't" => "Don'T". Any 4th-grader would know that the right
>> capitalization is "Don't". In CLforJava, we use the Unicode definitions
>> for breaking, and we get "Don't". Any thoughts about changing this
>> weirdness? Please, no "but, but it's the specification" comments. I get
>> the spec. This gets more into a transition from the 1980's definition
>> of characters and strings and into the Unicode world. I'd rather talk
>> about the world of today and what we can do about it.
>
> The obvious solution seems to be writing and using your own function to
> capitalize strings (which would be the usual approach to cases where the
> standard is clear, but you don't like it).

It also appears that the authors of the HS knew about this:

http://www.lispworks.com/documentation/HyperSpec/Body/f_stg_up.htm

Examples:

(string-capitalize "DON'T!") => "Don'T!" ;not "Don't!"
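
As a rough sketch of the "write your own function" route suggested above: the helper below treats an apostrophe flanked by alphanumeric characters as part of the word. CAPITALIZE-WORDS is a hypothetical name, not anything from the standard, and the apostrophe rule is an English-centric assumption rather than the Unicode word-breaking rules.

```lisp
;; Hypothetical helper, not part of the standard. Unlike STRING-CAPITALIZE,
;; an apostrophe between two alphanumeric characters does not end a word.
(defun capitalize-words (string)
  "Return a fresh copy of STRING with each word capitalized."
  (let ((result (copy-seq (string string)))
        (in-word nil))
    (dotimes (i (length result) result)
      (let ((ch (char result i)))
        (cond ((alphanumericp ch)
               ;; First character of a word is upcased, the rest downcased.
               (setf (char result i)
                     (if in-word (char-downcase ch) (char-upcase ch)))
               (setf in-word t))
              ((and in-word
                    (char= ch #\')
                    (< (1+ i) (length result))
                    (alphanumericp (char result (1+ i))))
               ;; Apostrophe inside a word: stay in the word.
               t)
              (t (setf in-word nil)))))))

(capitalize-words "DON'T!")   ; => "Don't!"
```

This only addresses the English contraction case; a phrase such as "coup d'etat" would come out as "Coup D'etat", so it is not a general fix.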

From: Raymond Toy on 2 Mar 2010 11:54

On 3/2/10 10:27 AM, Jerry Boetje wrote:
> The spec for STRING-CAPITALIZE is defined to break into words where:
> "a ``word'' is defined to be a consecutive subsequence consisting of
> alphanumeric characters". This gives interesting results such as
> "don't" => "Don'T". Any 4th-grader would know that the right
> capitalization is "Don't". In CLforJava, we use the Unicode
> definitions for breaking, and we get "Don't". Any thoughts about
> changing this weirdness? Please, no "but, but it's the specification"
> comments. I get the spec. This gets more into a transition from the
> 1980's definition of characters and strings and into the Unicode
> world. I'd rather talk about the world of today and what we can do
> about it.

FWIW, CMUCL adds two new keyword arguments to STRING-CAPITALIZE. One is
:UNICODE-WORD-BREAK to indicate that the Unicode word breaking rules
would be used. The other is :CASING, taking the values of :FULL or
:SIMPLE, to indicate how casing is to be done.

This was done mostly to preserve backward compatibility with existing
code, but allowing new behavior.

Ray

From: Jerry Boetje on 2 Mar 2010 12:01

On Mar 2, 11:33 am, Tamas K Papp <tkp...(a)gmail.com> wrote:
> On Tue, 02 Mar 2010 15:33:00 +0000, Tamas K Papp wrote:
> > On Tue, 02 Mar 2010 07:27:16 -0800, Jerry Boetje wrote:
>
> >> The spec for STRING-CAPITALIZE is defined to break into words where: "a
> >> ``word'' is defined to be a consecutive subsequence consisting of
> >> alphanumeric characters". This gives interesting results such as
> >> "don't" => "Don'T". Any 4th-grader would know that the right
> >> capitalization is "Don't". In CLforJava, we use the Unicode definitions
> >> for breaking, and we get "Don't". Any thoughts about changing this
> >> weirdness? Please, no "but, but it's the specification" comments. I get
> >> the spec. This gets more into a transition from the 1980's definition
> >> of characters and strings and into the Unicode world. I'd rather talk
> >> about the world of today and what we can do about it.
>
> > The obvious solution seems to be writing and using your own function to
> > capitalize strings (which would be the usual approach to cases where the
> > standard is clear, but you don't like it).
>
> It also appears that the authors of the HS knew about this:
>
> http://www.lispworks.com/documentation/HyperSpec/Body/f_stg_up.htm
>
> Examples:
>
> (string-capitalize "DON'T!") => "Don'T!" ;not "Don't!"

Yes, yes. I read the spec. I don't need to be reminded of that. We go
through the spec A LOT. But, if I understand Pascal's comment, the
rest of the world has gone (way) ahead of the CL spec. I'm
pointing out that perhaps we should move CL into the current world,
which means Unicode. Unicode isn't just an encoding. It defines
upcase, downcase, and capitalize operations, and also ways to sort.
In some places in Unicode, the glyph for upcase is different from that
of capitalize, if by capitalize you mean titlecase. Perhaps we should
define a whole suite of functions (after all, there are only 902
functions defined. What's another 20 or so?). In our implementation,
we can mix characters from any number of other Unicode blocks in text
or symbols. We can also use the numeric characters from most of those
blocks, including non-positional numbers. Unicode is very powerful (as
is CL), but leaving CL in the dust of ASCII does it a disservice.

From: Jerry Boetje on 2 Mar 2010 12:11

On Mar 2, 11:54 am, Raymond Toy <toy.raym...(a)gmail.com> wrote:
> On 3/2/10 10:27 AM, Jerry Boetje wrote:
>
> > The spec for STRING-CAPITALIZE is defined to break into words where:
> > "a ``word'' is defined to be a consecutive subsequence consisting of
> > alphanumeric characters". This gives interesting results such as
> > "don't" => "Don'T". Any 4th-grader would know that the right
> > capitalization is "Don't". In CLforJava, we use the Unicode
> > definitions for breaking, and we get "Don't". Any thoughts about
> > changing this weirdness? Please, no "but, but it's the specification"
> > comments. I get the spec. This gets more into a transition from the
> > 1980's definition of characters and strings and into the Unicode
> > world. I'd rather talk about the world of today and what we can do
> > about it.
>
> FWIW, CMUCL adds two new keyword arguments to STRING-CAPITALIZE. One is
> :UNICODE-WORD-BREAK to indicate that the Unicode word breaking rules
> would be used. The other is :CASING, taking the values of :FULL or
> :SIMPLE, to indicate how casing is to be done.
>
> This was done mostly to preserve backward compatibility with existing
> code, but allowing new behavior.
>
> Ray

I think this is a good step into Unicode, but it also has the feel of
a band-aid. Here's another thought, which is probably as icky: we
could have a special variable, say *character-system*, that has at
least 2 values: :standard (gets the current standard behavior) and
:unicode. Binding it to one of those values changes the behavior of
the string functions.
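
A minimal sketch of how that could look, under loud assumptions: *CHARACTER-SYSTEM* is the name proposed above, but CAPITALIZE-STRING and APOSTROPHE-AWARE-CAPITALIZE are hypothetical, and the :unicode branch is only a stand-in that keeps apostrophes inside words. It does not implement the actual Unicode word-breaking rules, and no implementation is claimed to work this way.

```lisp
;; Sketch only: these names are hypothetical and not part of the standard.
(defvar *character-system* :standard
  "Either :STANDARD (ANSI behavior) or :UNICODE (stand-in behavior).")

(defun apostrophe-aware-capitalize (string)
  "Stand-in for a Unicode-aware capitalizer: downcase the character that
follows an apostrophe when the apostrophe itself follows a letter."
  ;; COPY-SEQ because STRING-CAPITALIZE may return its argument unchanged.
  (let ((result (copy-seq (string-capitalize string))))
    (loop for i from 1 below (1- (length result))
          when (and (char= (char result i) #\')
                    (alpha-char-p (char result (1- i))))
            do (setf (char result (1+ i))
                     (char-downcase (char result (1+ i)))))
    result))

(defun capitalize-string (string)
  "Dispatch on the dynamic value of *CHARACTER-SYSTEM*."
  (ecase *character-system*
    (:standard (string-capitalize string))
    (:unicode (apostrophe-aware-capitalize string))))

(capitalize-string "don't")            ; => "Don'T"
(let ((*character-system* :unicode))
  (capitalize-string "don't"))         ; => "Don't"
```

Because the variable is special, the binding affects every call made within its dynamic extent, including calls buried in library code, which is both the appeal and the risk of this design.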

From: Kyle M on 2 Mar 2010 12:22

On Mar 2, 11:05 am, Pascal Costanza <p...(a)p-cos.net> wrote:
> On 02/03/2010 16:27, Jerry Boetje wrote:
>
> > The spec for STRING-CAPITALIZE is defined to break into words where:
> > "a ``word'' is defined to be a consecutive subsequence consisting of
> > alphanumeric characters". This gives interesting results such as
> > "don't" => "Don'T". Any 4th-grader would know that the right
> > capitalization is "Don't". In CLforJava, we use the Unicode
> > definitions for breaking, and we get "Don't". Any thoughts about
> > changing this weirdness? Please, no "but, but it's the specification"
> > comments. I get the spec. This gets more into a transition from the
> > 1980's definition of characters and strings and into the Unicode
> > world. I'd rather talk about the world of today and what we can do
> > about it.
>
> Even in the world of today, not everybody speaks only English.
>
> Pascal
>
> --
> My website:http://p-cos.net
> Common Lisp Document Repository:http://cdr.eurolisp.org
> Closer to MOP & ContextL:http://common-lisp.net/project/closer/

Good point. That makes me curious though: in what languages is the
spec's rule correct?

It's broken in English. In Arabic translations we use "Qur'an" and not
"Qur'An", so it's broken in Arabic. I don't know much French, but I
thought "coup d'etat" became "Coup d'Etat", so isn't it wrong in
French too (albeit for different reasons)?

Even in English, you can't fix it for most proper nouns. My last name,
for example.

This function just stinks. Someone should just deprecate it, forget
about it, take it out back and shoot it.
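
For reference, here is what the standard rule (a word is a run of alphanumeric characters) gives for the kinds of input mentioned above. The surname is a made-up illustration, not anyone's actual name.

```lisp
;; Results follow directly from the ANSI definition of STRING-CAPITALIZE.
(string-capitalize "don't")        ; => "Don'T"
(string-capitalize "coup d'etat")  ; => "Coup D'Etat"
(string-capitalize "macdonald")    ; => "Macdonald" (hypothetical surname)
```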
