anomalies in capitalization in String functions [Lisp]


From: Tamas K Papp on 2 Mar 2010 12:38

On Tue, 02 Mar 2010 09:22:32 -0800, Kyle M wrote:

> On Mar 2, 11:05 am, Pascal Costanza <p...(a)p-cos.net> wrote:
>> On 02/03/2010 16:27, Jerry Boetje wrote:
>>
>> > The spec for STRING-CAPITALIZE is defined to break into words where:
>> > "a ``word'' is defined to be a consecutive subsequence consisting of
>> > alphanumeric characters". This gives interesting results such as
>> > "don't" => "Don'T". Any 4th-grader would know that the right
>> > capitalization is "Don't". In CLforJava, we use the Unicode
>> > definitions for breaking, and we get "Don't". Any thoughts about
>> > changing this weirdness? Please, no "but, but it's the specification"
>> > comments. I get the spec. This gets more into a transition from the
>> > 1980's definition of characters and strings and into the Unicode
>> > world. I'd rather talk about the world of today and what we can do
>> > about it.
>>
>> Even in the world of today, not everybody speaks only English.
>>
>> Pascal
>>
>> --
>> My website:http://p-cos.net
>> Common Lisp Document Repository:http://cdr.eurolisp.org Closer to MOP &
>> ContextL:http://common-lisp.net/project/closer/
>
> Good point. That makes me curious though: in what languages is the
> spec's rule correct?

It is correct in Hungarian (my native language), but I doubt that the
authors of the spec had that application in mind :-)

> Even in English, you can't fix it for most proper nouns. My last name,
> for example.
>
> This function just stinks. Someone should just deprecate it, forget
> about it, take it out back and shoot it.

Or (and I realize that this might sound crazy) you could just write
another function which does what you want and forget about the whole
issue.
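
A minimal sketch of what such a separate function might look like. The name CAPITALIZE-WORDS and the apostrophe rule are assumptions made for illustration; they come from neither the spec nor any existing library. It behaves like STRING-CAPITALIZE, except that an apostrophe sitting between two alphanumeric characters does not end the word.

```lisp
;; Hypothetical helper, deliberately NOT in the CL package.
;; Assumption: an apostrophe directly followed by an alphanumeric
;; character, while we are already inside a word, continues that word.
(defun capitalize-words (string)
  "Return a fresh copy of STRING with each word capitalized."
  (let ((result (copy-seq (string string)))
        (in-word nil))
    (dotimes (i (length result) result)
      (let ((ch (char result i)))
        (cond ((alphanumericp ch)
               ;; First character of a word is upcased, the rest downcased.
               (setf (char result i)
                     (if in-word (char-downcase ch) (char-upcase ch)))
               (setf in-word t))
              ;; Apostrophe inside a word: leave it alone, stay in the word.
              ((and in-word
                    (char= ch #\')
                    (< (1+ i) (length result))
                    (alphanumericp (char result (1+ i)))))
              ;; Any other character ends the current word.
              (t (setf in-word nil)))))))

(capitalize-words "don't")   ; => "Don't"
```

This only covers the English contraction case raised in the thread. Proper nouns and other languages would need their own rules, which is rather the point of keeping it outside the standard function.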

These kinds of string manipulations are hairy. I am not surprised that
the HS didn't try to get it right; I am surprised that it ended up in
the HS _at all_.

Tamas

From: Tim Bradshaw on 2 Mar 2010 13:49

On 2010-03-02 17:01:16 +0000, Jerry Boetje said:

> I'm
> pointing out that perhaps we should move CL into the current world -
> which means Unicode.

However, as several people have pointed out to you *changing the way
STRING-CAPITALIZE works by default is an INCOMPATIBLE CHANGE*, and, you
know what, we don't like those, because they break our code. If you
want this, define a standard suite of functions which are not in the CL
package and which have the behaviour you want. How hard can this be to
understand?

From: Thomas A. Russ on 2 Mar 2010 12:36

Jerry Boetje <jerryboetje(a)mac.com> writes:
> ... leaving CL in the dust of ASCII does it a
> disservice.

Actually, CL doesn't require ASCII at all. In fact, there are
parts of the specification that specifically allow for other character
encodings. (At the time, EBCDIC was a prime competitor to ASCII.) So,
there is no specification of exactly what CHAR-CODE produces, and it is
explicitly allowed for the character codes of letters to be
non-contiguous.
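
To illustrate what that means in practice, here is a small sketch contrasting a code-range test with the standard predicates. Both function names are made up for this example. The first function bakes in an assumption about how CHAR-CODE lays out the letters; the second relies only on what the standard defines.

```lisp
;; Hypothetical, non-portable: assumes the codes of A..Z form one
;; unbroken range containing nothing else, which the standard does
;; not promise (EBCDIC, for instance, has gaps between letter groups).
(defun code-range-upper-p (ch)
  (<= (char-code #\A) (char-code ch) (char-code #\Z)))

;; Hypothetical, portable: asks the implementation instead of
;; inspecting character codes.
(defun portable-upper-p (ch)
  (upper-case-p ch))

;; Case conversion likewise goes through CHAR-UPCASE / CHAR-DOWNCASE
;; rather than adding or subtracting a fixed code offset.
(defun portable-flip-case (ch)
  (cond ((upper-case-p ch) (char-downcase ch))
        ((lower-case-p ch) (char-upcase ch))
        (t ch)))
```

Code written in the second style should carry over unchanged to an implementation whose characters are Unicode.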

So, in principle, there is nothing that prevents a conforming Common
Lisp implementation from adopting Unicode as the character encoding
system. In fact, there are several implementations that offer Unicode
strings.

Now this still doesn't address the specific rule about the behavior of
STRING-CAPITALIZE, which would still remain. It also doesn't address
the issue that in Unicode there are some differences between upper case
characters and capitalized characters. (IIRC there are 3 strange cases
where they are different.)

So, you would still want to have some Unicode-specific functions. But
it seems to me (without deep study) that a Unicode-based Lisp would be
conforming and also a good idea.

--
Thomas A. Russ, USC/Information Sciences Institute

From: Jerry Boetje on 2 Mar 2010 14:24

On Mar 2, 1:49 pm, Tim Bradshaw <t...(a)tfeb.org> wrote:
> On 2010-03-02 17:01:16 +0000, Jerry Boetje said:
>
> > I'm
> > pointing out that perhaps we should move CL into the current world -
> > which means Unicode.
>
> However, as several people have pointed out to you *changing the way
> STRING-CAPITALIZE works by default is an INCOMPATIBLE CHANGE*, and, you
> know what, we don't like those, because they break our code. If you
> want this, define a standard suite of functions which are not in the CL
> package and which have the behaviour you want. How hard can this be to
> understand?

Let's see... How many major revisions of FORTRAN and COBOL - 2 of the
original 3 survivors - have there been that are incompatible in at
least small ways - revisions necessitated by the environment? Those
constituents had to make changes to go forward because the world
changed - so they changed. Here we are over 20 years later and we are
clinging to a document that hasn't changed since its inception. As to
making yet another module, hey - characters and strings are pretty
basic. And they changed in the world but not in the specification.
Yes, people make incompatible changes because the environment made
incompatible changes. The world is NOT just English (actually American
- it changed that much) any more.

Just to ratchet up the heat, another major change should be in the file
system. Gee, we have networks now. Yes, there are packages that handle
networks. But if the file system were to be changed to deal with URIs,
a lot of things would be much easier. Everyone knows how to
(write ...mumble.. a-URI-stream) if they thought about it. We tried
it, and it works. It would be easier if we could make a few changes in
the definition.

Look, I love this language as well as anyone on this board. I've sent
out hundreds of students knowing CL (and how to build one). There are
many, many fine packages that build on the strength of CL. But some
things are just a waste of time - like making a Unicode package rather
than making a few basic changes to the spec.

From: Pascal J. Bourguignon on 2 Mar 2010 15:44

Kyle M <kylemcg(a)gmail.com> writes:

> On Mar 2, 11:05 am, Pascal Costanza <p...(a)p-cos.net> wrote:
>> On 02/03/2010 16:27, Jerry Boetje wrote:
>>
>> > The spec for STRING-CAPITALIZE is defined to break into words where:
>> > "a ``word'' is defined to be a consecutive subsequence consisting of
>> > alphanumeric characters". This gives interesting results such as
>> > "don't" => "Don'T". Any 4th-grader would know that the right
>> > capitalization is "Don't". In CLforJava, we use the Unicode
>> > definitions for breaking, and we get "Don't". Any thoughts about
>> > changing this weirdness? Please, no "but, but it's the specification"
>> > comments. I get the spec. This gets more into a transition from the
>> > 1980's definition of characters and strings and into the Unicode
>> > world. I'd rather talk about the world of today and what we can do
>> > about it.
>>
>> Even in the world of today, not everybody speaks only English.
>
> Good point. That makes me curious though: in what languages is the
> spec's rule correct?

Possibly in none, and that's the point!

STRING-CAPITALIZE is not ENGLISH-TEXT-CAPITALIZE or
ARABIC-TEXT-CAPITALIZE, and it is definitely NOT
UNICODE-STRING-CAPITALIZE.

> It's broken in English. In Arabic translations we use "Qu'ran" and not
> "Qu'Ran", so it's broken in Arabic. I don't know much French, but I
> thought "coup d'etat" became "Coup d'Etat", so isn't it wrong in
> French too (albeit for different reasons)?
>
> Even in English, you can't fix it for most proper nouns. My last name,
> for example.
>
> This function just stinks. Someone should just deprecate it, forget
> about it, take it out back and shoot it.

It's a technical specification, on which programs can rely to process
strings in a deterministic way.

It is up to libraries or applications to implement natural language
processing, including grammatically and typographically correct
capitalizations for the various natural languages. But it is not
acceptable to change the meaning of a low-level function such as
STRING-CAPITALIZE: this could break a lot of programs (protocols, file
formats, "smart" algorithms, whatever).
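
For reference, a short sketch of the deterministic behaviour in question. The first two inputs are taken from the thread and from the spec's own examples; the identifier-style input is an assumed illustration of the kind of non-linguistic use a program might rely on.

```lisp
;; A word is a maximal run of alphanumeric characters; every other
;; character, including the apostrophe, ends the word.
(string-capitalize "don't")                     ; => "Don'T"
(string-capitalize "elm 13c arthur;fig don't")  ; => "Elm 13c Arthur;Fig Don'T"

;; Illustrative assumption: a program turning a lower-case,
;; hyphen-separated name into a header-style one depends on exactly
;; this rule.
(string-capitalize "content-type")              ; => "Content-Type"
```

A natural-language-aware redefinition would give a different answer for the first two inputs, and any program that depends on the current results would silently change behaviour.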

--
__Pascal Bourguignon__
http://www.informatimago.com
