anomalies in capitalization in String functions [Lisp]

From: Pascal J. Bourguignon on 2 Mar 2010 15:48

Jerry Boetje <jerryboetje(a)mac.com> writes:

> On Mar 2, 11:54 am, Raymond Toy <toy.raym...(a)gmail.com> wrote:
>> On 3/2/10 10:27 AM, Jerry Boetje wrote:
>>
>> > The spec for STRING-CAPITALIZE is defined to break into words where:
>> > "a ``word'' is defined to be a consecutive subsequence consisting of
>> > alphanumeric characters". This gives interesting results such as
>> > "don't" => "Don'T". Any 4th-grader would know that the right
>> > capitalization is "Don't". In CLforJava, we use the Unicode
>> > definitions for breaking, and we get "Don't". Any thoughts about
>> > changing this weirdness? Please, no "but, but it's the specification"
>> > comments. I get the spec. This gets more into a transition from the
>> > 1980's definition of characters and strings and into the Unicode
>> > world. I'd rather talk about the world of today and what we can do
>> > about it.
>>
>> FWIW, CMUCL adds two new keyword arguments to STRING-CAPITALIZE. One is
>> :UNICODE-WORD-BREAK to indicate that the Unicode word breaking rules
>> would be used. The other is :CASING, taking the values of :FULL or
>> :SIMPLE, to indicate how casing is to be done.
>>
>> This was done mostly to preserve backward compatibility with existing
>> code, but allowing new behavior.
>>
>> Ray
>
> I think this is a good step into Unicode, but it also has the feel of
> bandaid. Here's another thought, which is probably as icky, but we
> could have a special variable, say *character-system* that has at
> least 2 values: :standard (gets current std) and :unicode. Binding
> this to one of those values changes the behavior of the string
> functions.

Indeed. I'm told by API designers that it is better to provide two
different functions to do different things than to provide a single
function with 'flags'.

Now, CL has a lot of functions taking keywords such as :test, :key, or
even functions such as peek-char whose behavior changes according to the
kind of arguments they get. If this were the only function that needed
Unicode-specific behavior, I could agree to some overloading through
a non-standard keyword argument. But there are actually a lot of
features to be implemented for Unicode, and this definitely cries out
for a separate, higher-level library.

Jerry, I must concur with the majority here: leave CL:STRING-CAPITALIZE
alone, and provide a UNICODE:TEXT-CAPITALIZE and whatever else you need
to process Unicode or natural-language texts.
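
A rough sketch of what such a separate function might look like. The
package name TEXT-CASE-SKETCH and the helper WORD-CONSTITUENT-P are
made up for illustration, and the word-break rule (an apostrophe between
two letters stays inside the word) is only a small stand-in for the real
Unicode word-breaking rules, not an implementation of them.

```lisp
;; Hypothetical package and function names; not part of any existing library.
(defpackage :text-case-sketch
  (:use :common-lisp)
  (:export #:text-capitalize))

(in-package :text-case-sketch)

(defun word-constituent-p (string index)
  "True if the character at INDEX is alphanumeric, or is an apostrophe
with a letter on each side."
  (let ((ch (char string index)))
    (or (alphanumericp ch)
        (and (char= ch #\')
             (plusp index)
             (< (1+ index) (length string))
             (alpha-char-p (char string (1- index)))
             (alpha-char-p (char string (1+ index)))))))

(defun text-capitalize (string)
  "Return a fresh copy of STRING with each word capitalized. Unlike
CL:STRING-CAPITALIZE, an apostrophe between two letters does not end a word."
  (let ((result (copy-seq string))
        (in-word nil))
    (dotimes (i (length result) result)
      (cond ((not (word-constituent-p string i))
             (setf in-word nil))
            (in-word
             ;; Inside a word: force lowercase (a no-op for the apostrophe).
             (setf (char result i) (char-downcase (char result i))))
            (t
             ;; First character of a word: force uppercase.
             (setf (char result i) (char-upcase (char result i))
                   in-word t))))))
```

With this, (text-case-sketch:text-capitalize "don't stop") returns
"Don't Stop", while CL:STRING-CAPITALIZE is left untouched for existing
code.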

--
__Pascal Bourguignon__
http://www.informatimago.com

From: Raymond Toy on 2 Mar 2010 16:02

On 3/2/10 12:11 PM, Jerry Boetje wrote:
> On Mar 2, 11:54 am, Raymond Toy <toy.raym...(a)gmail.com> wrote:
>> On 3/2/10 10:27 AM, Jerry Boetje wrote:
>>
>>> The spec for STRING-CAPITALIZE is defined to break into words where:
>>> "a ``word'' is defined to be a consecutive subsequence consisting of
>>> alphanumeric characters". This gives interesting results such as
>>> "don't" => "Don'T". Any 4th-grader would know that the right
>>> capitalization is "Don't". In CLforJava, we use the Unicode
>>> definitions for breaking, and we get "Don't". Any thoughts about
>>> changing this weirdness? Please, no "but, but it's the specification"
>>> comments. I get the spec. This gets more into a transition from the
>>> 1980's definition of characters and strings and into the Unicode
>>> world. I'd rather talk about the world of today and what we can do
>>> about it.
>>
>> FWIW, CMUCL adds two new keyword arguments to STRING-CAPITALIZE. One is
>> :UNICODE-WORD-BREAK to indicate that the Unicode word breaking rules
>> would be used. The other is :CASING, taking the values of :FULL or
>> :SIMPLE, to indicate how casing is to be done.
>>
>> This was done mostly to preserve backward compatibility with existing
>> code, but allowing new behavior.
>>
>> Ray
>
> I think this is a good step into Unicode, but it also has the feel of
> bandaid. Here's another thought, which is probably as icky, but we
> could have a special variable, say *character-system* that has at
> least 2 values: :standard (gets current std) and :unicode. Binding
> this to one of those values changes the behavior of the string
> functions.

This seems like a bandaid too. I've forgotten the details, but isn't
there also the casing issue? What case do you want when you capitalize?
Should it be the upper case or title case character?

After trying to debug code with these variables everywhere (maxima),
I've decided that it really sucks, and it's much better to have them as
parameters. Stack traces and everything have the information needed to
figure out what's going on. :-)
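
To make the contrast concrete, here is a small sketch of the two styles.
All names (*WORD-BREAK-STYLE*, CAPITALIZE-WORDS, CAPITALIZE-BY-VARIABLE,
and the :APOSTROPHE-AWARE value) are hypothetical, and the
apostrophe handling is a simplified stand-in rather than the Unicode
rules.

```lisp
;; Hypothetical names throughout; a sketch, not any implementation's API.
(defvar *word-break-style* :standard
  "Global switch in the spirit of the proposed *character-system*.")

(defun capitalize-words (string &key (style :standard))
  "Capitalize STRING. STYLE :STANDARD behaves like STRING-CAPITALIZE;
:APOSTROPHE-AWARE also lowercases a character that follows letter+apostrophe."
  ;; COPY-SEQ because STRING-CAPITALIZE may return its argument unchanged.
  (let ((result (copy-seq (string-capitalize string))))
    (ecase style
      (:standard result)
      (:apostrophe-aware
       (loop for i from 2 below (length result)
             when (and (char= (char result (1- i)) #\')
                       (alpha-char-p (char result (- i 2))))
               do (setf (char result i) (char-downcase (char result i))))
       result))))

(defun capitalize-by-variable (string)
  "Same call, different result, depending on a binding made elsewhere."
  (capitalize-words string :style *word-break-style*))

;; Style 1: the behaviour is decided by whoever bound the variable.
(capitalize-by-variable "don't")                      ; => "Don'T"
(let ((*word-break-style* :apostrophe-aware))
  (capitalize-by-variable "don't"))                   ; => "Don't"

;; Style 2: the choice is visible at the call site and in a backtrace.
(capitalize-words "don't" :style :apostrophe-aware)   ; => "Don't"
```

In the second style the argument shows up in the stack frame; in the
first you have to go hunting for whoever established the binding.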

I don't have a good solution. You could just do what you want and find
out how loudly people complain when string-capitalize is different from
the spec. :-)

Ray

From: Barry Margolin on 2 Mar 2010 17:55

In article <hmjmj6$uf3$1(a)news.eternal-september.org>,
Tim Bradshaw <tfb(a)tfeb.org> wrote:

> On 2010-03-02 17:01:16 +0000, Jerry Boetje said:
>
> > I'm
> > pointing out that perhaps we should move CL into the current world -
> > which means Unicode.
>
> However, as several people have pointed out to you *changing the way
> STRING-CAPITALIZE works by default is an INCOMPATIBLE CHANGE*, and, you
> know what, we don't like those, because they break our code. If you
> want this, define a standard suite of functions which are not in the CL
> package and which have the behaviour you want. How hard can this be to
> understand?

What's the chance that any application is actually depending on the
current behavior? More likely, applications that might feed a word like
"don't" to a capitalization function avoid using the built-in function
because it doesn't work correctly. So actual use is mostly limited to
cases where a change in the definition would be compatible, and the
change would then make the function usable in a wider variety of cases.

But since there are no current plans to revise CL, what's the point of
this discussion? Even if we all agree on what STRING-CAPITALIZE
*should* do, it's not going to change anything.

--
Barry Margolin, barmar(a)alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***

From: Barry Margolin on 2 Mar 2010 18:00

In article
<4928966c-c38c-456c-94d6-502ef931bbb2(a)f8g2000yqn.googlegroups.com>,
Jerry Boetje <jerryboetje(a)mac.com> wrote:

> On Mar 2, 11:54 am, Raymond Toy <toy.raym...(a)gmail.com> wrote:
> > On 3/2/10 10:27 AM, Jerry Boetje wrote:
> >
> > > The spec for STRING-CAPITALIZE is defined to break into words where:
> > > "a ``word'' is defined to be a consecutive subsequence consisting of
> > > alphanumeric characters". This gives interesting results such as
> > > "don't" => "Don'T". Any 4th-grader would know that the right
> > > capitalization is "Don't". In CLforJava, we use the Unicode
> > > definitions for breaking, and we get "Don't". Any thoughts about
> > > changing this weirdness? Please, no "but, but it's the specification"
> > > comments. I get the spec. This gets more into a transition from the
> > > 1980's definition of characters and strings and into the Unicode
> > > world. I'd rather talk about the world of today and what we can do
> > > about it.
> >
> > FWIW, CMUCL adds two new keyword arguments to STRING-CAPITALIZE. One is
> > :UNICODE-WORD-BREAK to indicate that the Unicode word breaking rules
> > would be used. The other is :CASING, taking the values of :FULL or
> > :SIMPLE, to indicate how casing is to be done.
> >
> > This was done mostly to preserve backward compatibility with existing
> > code, but allowing new behavior.
> >
> > Ray
>
> I think this is a good step into Unicode, but it also has the feel of
> bandaid. Here's another thought, which is probably as icky, but we
> could have a special variable, say *character-system* that has at
> least 2 values: :standard (gets current std) and :unicode. Binding
> this to one of those values changes the behavior of the string
> functions.

What does the coding system have to do with this issue? No matter what
coding system you use, apostrophe is not an alphanumeric character. The
issue is the rule for finding word boundaries, not the coding system.
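
This is easy to check with standard functions alone. The helper
STANDARD-WORDS below is a hypothetical name, written only to show where
the spec's definition of a "word" puts the boundaries.

```lisp
(defun standard-words (string)
  "Return the maximal runs of alphanumeric characters in STRING, i.e. the
words as the STRING-CAPITALIZE definition sees them."
  (loop with start = nil
        with words = '()
        for i from 0 to (length string)
        ;; At i = length we pretend to see a non-word character so the
        ;; last word gets flushed.
        for wordp = (and (< i (length string))
                         (alphanumericp (char string i)))
        do (cond ((and wordp (not start))
                  (setf start i))
                 ((and (not wordp) start)
                  (push (subseq string start i) words)
                  (setf start nil)))
        finally (return (nreverse words))))

(alphanumericp #\')                 ; => NIL
(standard-words "don't stop")       ; => ("don" "t" "stop")
(string-capitalize "don't stop")    ; => "Don'T Stop"
```

The apostrophe splits "don't" into two words, and each one gets its
first character upcased.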

In this case, I think CL just wanted to have a very simple function, to
fit alongside STRING-DOWNCASE and STRING-UPCASE. Proper capitalization
is a somewhat complex matter, and if you're writing a text editor or
word processor you're probably in a better position to decide what to do
for the domain.

--
Barry Margolin, barmar(a)alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***

From: Thomas A. Russ on 2 Mar 2010 19:31

Tamas K Papp <tkpapp(a)gmail.com> writes:

> These kinds of string manipulations are hairy. I am not surprised that
> the HS didn't try to get it right, I am surprised that it ended up in
> the HS _at all_.

Well, there are also all the odd English-specific items in the FORMAT
directives. After all, the ~R and ~P items are sort of surprising to
see.

But I guess string-capitalize was supposed to be a bit more useful. At
the time, of course, some of these were more useful as cute hacks than
for heavy-duty professional use. On the other hand, it did make it
easier to produce nicer sounding error or information messages.
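
For instance, using only standard FORMAT directives (the message text
itself is just an illustration):

```lisp
;; ~R prints an integer as an English cardinal number.
(format nil "~R" 42)                        ; => "forty-two"

;; ~:P backs up one argument and prints "s" unless that argument is 1.
(format nil "~D error~:P found" 1)          ; => "1 error found"
(format nil "~D error~:P found" 3)          ; => "3 errors found"

;; ~@( ... ~) capitalizes the first word of the enclosed output.
(format nil "~@(~R~) warning~:P" 2)         ; => "Two warnings"
```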

One must just be careful to write formally and avoid contractions.
And possessives, oops. ;-)

--
Thomas A. Russ, USC/Information Sciences Institute
