Collation settings [Server Configuration]

Prev: very slow SQL client
Next: SQL Server Output Buffer

From: Jeroen Mostert on 3 Dec 2009 19:51

Jay wrote:
> "Jay" <spam(a)nospam.org> wrote in message
> news:%23Qj%23YYGdKHA.1640(a)TK2MSFTNGP06.phx.gbl...
>> I'm going through a 2005 Install lesson and it is saying to change the
>> default collation to: Latin1_General with a Binary sort order.
>>
>> Well Latin1_General is normal enough, but why would they want to change
>> the default sort order?
>
> OK, the answer I'm getting is vague, to say the least. However, the
> implication is that: if you're not using Unicode, there will be small
> system-wide performance improvement in character operations - basically for
> free.
>
....well, no. See my other post for more on this. You'll get performance
improvements, but by no measure can this be called "free".

> All this for specifying: SQL_Latin1_General_CP1_CI_AS_BIN, rather than
> SQL_Latin1_General_CP1_CI_AS.
>
There is no collation SQL_Latin1_General_CP1_CI_AS_BIN. Accent- and
case-sensitivity is meaningless for binary collations.

Nor, for that matter, is there a collation SQL_Latin1_General_CP1_BIN, for
some reason (there are _BIN collations for the other legacy code pages, like
SQL_Latin1_General_CP437_BIN). The collation in question is
Latin1_General_BIN, which is one of the newer collations based on the
implementation in Windows, rather than SQL Server's own.

> I assume, but do not know, that if the binary sort order is specified and
> you use Unicode, there would be some slight performance degradation, but no
> errors.
>
Both Unicode and non-Unicode comparisons will go faster, as both comparison
types will start using code point order rather than language-specific order.

> My problem is that the accent-e is a Unicode character (I think)

It's also a Latin-1 character, a Latin-15 character, a Windows-1252
character, etcetera... It's a common misconception, but there is no division
between "ordinary" characters and "Unicode" characters.

As far as SQL Server goes, a field is either (VAR)CHAR or N(VAR)CHAR. A
(VAR)CHAR field can only contain characters present in the code page
specified by the collation (in the case of Latin1_General, that's
Windows-1252 if I'm not mistaken). An N(VAR)CHAR field can contain all
characters in Unicode, regardless of collation. Collation does affect sort
order for Unicode fields, and it also specifies the code page used when
converting the field to a (VAR)CHAR field.

--
J.

From: Jay on 3 Dec 2009 21:36

Wow, that's a lot. You seem to be a bit passionate on the subject, but I
can't see anything wrong with your reasoning.

Thanks,
Jay

"Jeroen Mostert" <jmostert(a)xs4all.nl> wrote in message
news:4b185783$0$22934$e4fe514c(a)news.xs4all.nl...
> Jay wrote:
>> I'm going through a 2005 Install lesson and it is saying to change the
>> default collation to: Latin1_General with a Binary sort order.
>>
>> Well Latin1_General is normal enough, but why would they want to change
>> the default sort order?
> Because they don't care about their end users. :-)
>
> The binary collations relieve SQL Server from thinking about things like
> accent- and case-sensitivity: they compare characters on their code
> points, not strings on their language sorting order. Under a binary
> collation, 'TEST' < 'Test' because 'E' < 'e', and that's because 69 < 101.
> This speeds up string comparisons.
>
> The problem is that if you set this as the default it affects
> *everything*, including the system catalogs. In particular, table names
> and column names then use binary collation as well, so everything becomes
> case- and accent-sensitive (everything-sensitive, really). Your statements
> must have been written to take this into account. Worse, it allows for
> awful mistakes like creating a table "Foo" and a table "FOO" and a table
> "foo", or a table "cr�pes" with the "�" a single LATIN SMALL LETTER E WITH
> CIRCUMFLEX and a table "cr�pes" with the "�" a LATIN SMALL LETTER E
> followed by a COMBINING CIRCUMFLEX ACCENT. Under a binary collation, these
> are all different strings and so they are admissible as unique names. Good
> luck sorting that out (no pun intended).
>
> This is also not friendly on the user data. Any ORDER BY will sort by code
> point values[*], and that order matches the sorting rules of no natural
> language on Earth (that includes English). There's little point to being
> faster if users can't find what they're looking for or if they can
> introduce logical inconsistencies by inserting values that differ in code
> points but not in characters. You can avert this by including an explicit
> COLLATE clause in your queries, but if you don't have an index on a
> computed field using this collation then I hope you like index scans and
> explicit sort operations, because that's what you're going to get.
>
> It is generally much more practicable to use a sensible non-binary
> collation as the default (like Latin1_General_CI_AS) and use the binary
> collations (like Latin1_General_BIN) only for specialized fields where the
> performance gain makes sense and you can afford true binary lookup. In
> those cases it still nearly always makes sense to have a separate computed
> field that uses a non-binary collation, for purposes of human interaction.
>
> --
> J.
> [*]This is not strictly true because SQL Server 2005's binary collations
> are sort-of broken: the first character is compared according to code
> point, but the remainder is compared by comparing the bytes. This has
> quite unintuitive results, and SQL Server 2008 fixes this with the new
> _BIN2 collations that do compare by code point for all characters.

From: Jeroen Mostert on 4 Dec 2009 14:44

Jay wrote:
> Wow, that's a lot. You seem to be a bit passionate on the subject

There's no malice in it, I assure you. I haven't been terrorized by hundreds
of configurations that used binary collations. :-) I have only tried such a
configuration once, and that was enough to convince me to learn more about
how collations work and when it is and isn't a good idea to use a binary
collation.

--
J.

From: Jay on 4 Dec 2009 22:20

OK, I'll bite. When is it good to use a binary collation?

"Jeroen Mostert" <jmostert(a)xs4all.nl> wrote in message
news:4b1966b8$0$22913$e4fe514c(a)news.xs4all.nl...
> Jay wrote:
>> Wow, that's a lot. You seem to be a bit passionate on the subject
>
> There's no malice in it, I assure you. I haven't been terrorized by
> hundreds of configurations that used binary collations. :-) I have only
> tried such a configuration once, and that was enough to convince me to
> learn more about how collations work and when it is and isn't a good idea
> to use a binary collation.
>
> --
> J.

From: Jeroen Mostert on 5 Dec 2009 05:11

Jay wrote:
> OK, I'll bite. When is it good to use a binary collation?
>
I hoped my post covered that, but OK...

A binary collation gives you two things a non-binary collation doesn't: the
ability to have values that are unique by code points (not characters) and a
performance gain in comparisons (which speeds up sorting and indexing in
particular). The first is typically a drawback, but it can be a benefit. For
example, if you *want* to be able to store "Test", "TEST" and "Test" (the
latter is "Test" with a ZERO-WIDTH NON-BREAKING SPACE appended to it, but
you can't see that and I didn't bother to actually type it :-)) and have
them all compare differently, a binary collation will do that.

If your system stores only data generated by other systems, this can be
useful -- you may need to be able to store data *exactly* as you received
it, but still have a unique key. For example, I maintain a data warehouse
that's filled by Integration Services. Internally, Integration Services uses
what is effectively a binary collation to determine if two strings are
unique, so my database field better be as well, or they won't agree on
what's unique. Similarly, other external systems may not (be able to) use
SQL Server's collation system. The least common denominator in that case is
a binary collation.

Even so, the uses for binary collations are pretty specialist. They're not
the simple performance boosters they're often presented as. Using them
indiscriminately can lead to data integrity and usability problems.

--
J.

First | Prev | Next | Last
Pages: 1 2 3
Prev: very slow SQL client
Next: SQL Server Output Buffer