Supporting data from sources around the
world, as well as clients from various cultures, is a common
requirement these days. SQL Server has been designed to support these
international data sets and clients out of the box. The two main
features we will discuss here are collations, which govern how character
data is stored, sorted, and compared, and support for returning error
messages to clients in an appropriate language.
Understanding Collations
Collations are the
mechanism by which we tell SQL Server which characters (letters and
symbols) we want to be able to work with, and how it should sort and
compare values based on those characters. We have the ability to define
these collation
settings for each instance, database, column, and expression. To truly
understand collations, we first need to understand how SQL Server stores
and retrieves data using data types; next we need to understand the
difference between non-Unicode and Unicode data; and finally we need to
understand how SQL Server sorts and compares character values.
Computers, and thus SQL
Server, work with and store data as binary values. We humans don’t have
this capability (at least not most of us). When we ask SQL Server to
store data for us in a database, SQL is responsible for properly
transforming the values (words, letters, numbers, dates, etc.) that we
give it into binary sequences that can be written to disk. When we later
ask SQL Server to retrieve the values for us, it reverses the process
and transforms the stored binary values back into the words, letters,
numbers, or dates we originally gave it. SQL Server uses data types (int, datetime, char, nchar,
etc.) to perform and manage that transformation in a consistent and
formal manner. There is a set of data types specifically for storing
character data; they include char, varchar, nchar, and nvarchar.
The difference between char and nchar as well as varchar and nvarchar is how many bytes each stored character takes, and from that, the range of possible character symbols that can be stored. The char and varchar
types normally store characters from a limited set of 256 possible
characters (determined by the selected “character set” or “code page”),
with each character taking a single byte of storage. The nchar and nvarchar
data types store characters from the Unicode character set, which
contains more than 65,000 possible characters, with each character
stored using two bytes.
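The storage difference is easy to see with the DATALENGTH function, which returns the size of a value in bytes (a quick sketch; any database will do, and the variable names are illustrative):

```sql
-- char stores one byte per character; nchar stores two.
DECLARE @c  char(10)  = 'Hola';
DECLARE @nc nchar(10) = N'Hola';
SELECT DATALENGTH(@c)  AS CharBytes,   -- 10 bytes (fixed length, 1 byte per character)
       DATALENGTH(@nc) AS NCharBytes;  -- 20 bytes (fixed length, 2 bytes per character)
```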
SQL Server makes numerous character sets from the ANSI and ISO standards bodies available for the non-Unicode char and varchar
data types. Most of those character sets define a single byte per
character. Some Asian character sets require two bytes per character
because they have so many characters in their language. We call a
non-Unicode character set that requires two bytes a Double Byte
Character Set (or DBCS). With a single byte we can represent a total of
256 possible distinct values. A character set maps each of those possible
values to a given character. The most common ANSI code page used on
North American systems is code page 1252, also referred to as Latin1. Table 1
shows a sample of binary values, their decimal equivalents, and the
character that is mapped to it. Because different cultures and languages
have different symbols, there is a problem fitting all symbols into a
single set of 256. It is for that reason that we require multiple
character sets. We need to then instruct SQL Server which specific
character set we wish to use for each char and varchar value. Each char or varchar value can be a maximum of 8000 bytes. With each character taking one byte, that means they can store 8000 characters. The varchar(max) large object data type (the successor to the old text data type) can store up to 2GB (2^31-1 bytes) of data, or 2^31-1 (more than two billion) characters.
Table 1. Sample Code Page 1252 Character Set Values
Binary Value | Decimal Value | ASCII Character
---|---|---
01000001 | 65 | A
01000010 | 66 | B
01000011 | 67 | C
01100001 | 97 | a
01100010 | 98 | b
01100011 | 99 | c
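The mapping in Table 1 can be verified directly with the built-in ASCII and CHAR functions, which convert between a character and its decimal code-point value:

```sql
-- ASCII() returns the decimal value of a character;
-- CHAR() returns the character mapped to a decimal value.
SELECT ASCII('A') AS DecimalValue,  -- 65
       CHAR(65)   AS Character;    -- 'A'
```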
SQL Server also makes available the set of characters from the Unicode Consortium’s Unicode character set for the nchar and nvarchar
data types. Unicode stores each character as two bytes. With two bytes
available, the Unicode Character set can represent 65,536 possible
distinct values. With all those characters, we can represent all the
symbols from the world’s languages in a single character set—hence the
name Unicode. This makes it an obvious choice when supporting
international data sources and clients. Of course, the convenience comes
with a cost. Like the char and varchar data types, the nchar and nvarchar
data types can store values up to 8000 bytes. Each character, however,
takes two bytes of storage, meaning they can store a total of 4000
characters each (or 2^30-1 characters, just over one billion, for nvarchar(max)). It should be mentioned that the "n" in front of the nchar and nvarchar data types comes from the SQL standard's name for Unicode, which is "National Character."
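One practical consequence is worth noting here: Unicode string literals must be written with the N prefix. Without it, SQL Server interprets the literal as non-Unicode using the current code page before storing or comparing it, which can silently lose characters. A sketch (the exact substitute characters returned depend on the session's code page):

```sql
-- Without the N prefix, characters outside the current code page
-- are replaced (typically with '?'); with it, they survive intact.
SELECT 'こんにちは'  AS WithoutPrefix,
       N'こんにちは' AS WithPrefix;
```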
So to recap, we can represent
both non-Unicode and Unicode character data. The good news is that we
don’t have to decide to make exclusive use of non-Unicode or Unicode
data. Both are supported at all times. The char and varchar data types store non-Unicode data using a code page we specify. The nchar and nvarchar data types store Unicode data.
The next step is to
understand how SQL Server sorts and compares data. Specifically, how
does SQL Server sort or compare two “similar” values like an uppercase
letter “A” and a lowercase letter “a”? The answer is that it depends. It
depends on our preferences for case sensitivity (A vs. a), accent
sensitivity (à vs. å), Japanese character kana sensitivity (Hiragana vs.
Katakana), and width sensitivity (single-byte A vs. double-byte A). We
could also choose to have SQL Server just sort and compare based on the
binary values themselves rather than their character representations.
The case sensitivity
we choose determines whether SQL Server treats uppercase “A” and
lowercase “a” as equal or as distinct. The “case-insensitive” option
implies that they should be treated as equals, whereas “case sensitive”
implies that they should be treated as distinct. The case-insensitive
option actually costs a little more in terms of processing because SQL
Server has to treat two values that are not the same as though they were.
We can cut that cost somewhat by
choosing a case-sensitive sort order. We can cut costs even more by
choosing to sort simply by the binary values rather than by the
characters they represent. Each cost-cutting step, while improving
performance, adds difficulty for us as to how we specify character
literals. For example, if we had a case-sensitive or binary collation
selected, "This" would not equal "this" because they are not the exact
same characters. The accent, kana, and width sensitivity choices follow
the same pattern.
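The effect of these sensitivity choices can be demonstrated by applying the COLLATE clause directly to a comparison, overriding the session's default for just that expression (a minimal sketch using two standard Windows collations):

```sql
-- The same comparison evaluated under a case-sensitive (CS) and a
-- case-insensitive (CI) collation gives different answers.
SELECT CASE WHEN 'This' = 'this' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'distinct' END AS CaseSensitiveResult,   -- 'distinct'
       CASE WHEN 'This' = 'this' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'distinct' END AS CaseInsensitiveResult; -- 'equal'
```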
The sorting of
character values is also determined by the selected collation. Different
languages or cultures may sort certain values differently. For example,
some Spanish-speaking cultures may sort the "ch" digraph as a single letter that comes after cz rather than between ce and ci, as would be done in English. Under such a collation, churro would come after Czech rather than before it, as would be expected in U.S. English.
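We can see this directly by sorting the same list of words under two different collations (a sketch; Traditional_Spanish_CI_AS is the Windows collation that treats "ch" as a single letter):

```sql
-- The same words sorted under a Traditional Spanish collation
-- and then under a Latin1 (English-style) collation.
DECLARE @words table (word varchar(20));
INSERT INTO @words VALUES ('cerveza'), ('churro'), ('cielo'), ('Czech');
SELECT word FROM @words ORDER BY word COLLATE Traditional_Spanish_CI_AS; -- churro sorts last
SELECT word FROM @words ORDER BY word COLLATE Latin1_General_CI_AS;      -- churro sorts between cielo and Czech
```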
We choose which non-Unicode
code page we wish to use, as well as how all character data should be
sorted and compared by selecting a collation. A collation combines the
code page selection and sorting preferences into a single setting. SQL
Server supports collations that are compatible with previous versions of
SQL Server as well as collations that are compatible with your Windows
operating system. Choosing between Windows and SQL Collations is based
on whether you are trying to maintain compatibility with a previous
version of SQL Server (SQL Collation) or with your Windows server and
client operating systems (Windows Collations).
Windows Collation names take the following form:
<Windows_collation_name> :: =
CollationDesignator_<ComparisonStyle>
<ComparisonStyle> :: =
{ CaseSensitivity_AccentSensitivity
[ _KanatypeSensitive ] [ _WidthSensitive ] }
| { _BIN | _BIN2 }
SQL collation names take the form:
<SQL_collation_name> :: =
SQL_SortRules[_Pref]_CPCodepage_<ComparisonStyle>
<ComparisonStyle> ::=
_CaseSensitivity_AccentSensitivity | _BIN
For a list of all
collations, both Windows and SQL, we can run the following query (the
SQL Collation names start with SQL; the rest are Windows Collation
names):
SELECT * FROM fn_helpcollations()
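Because the full list runs to several thousand rows, it is often useful to filter it. For example, to see only the SQL collations, or only case-sensitive Latin1 Windows collations:

```sql
-- SQL collations only (names start with SQL_).
SELECT name, description
FROM fn_helpcollations()
WHERE name LIKE 'SQL%';

-- Case-sensitive Latin1 Windows collations.
SELECT name, description
FROM fn_helpcollations()
WHERE name LIKE 'Latin1_General%CS%';
```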
We can specify a
different collation with every character value we store, but this
action is not recommended. Ideally, we would use the same collation for
every character value. For that reason, we should consider what our
collation should be before we install the instance of SQL Server. During
the installation process, we will select a default collation that all
databases, columns, and expressions will use. We can then explicitly
override the default when we have a specific reason to do so. We use the
COLLATE clause to specify which collation we wish to use. The COLLATE
clause can be used when creating a database, defining a column, or at
the end of a character expression. The following example creates a table
with two character columns, each based on a different collation. The
first column (col1) is based on a case-insensitive (CI) character set,
whereas the second column (col2) is based on a case-sensitive (CS)
collation:
USE AdventureWorks2008;
GO
CREATE TABLE CollateTest
(col1 nvarchar(10) COLLATE Latin1_General_100_CI_AI_KS_WS NOT NULL,
col2 nvarchar(10) COLLATE Latin1_General_100_CS_AI_KS_WS NOT NULL);
INSERT INTO CollateTest (col1,col2) VALUES ('Hello','World');
--The following query will return a single row ('Hello' and 'hello'
--match when case insensitive):
SELECT * FROM CollateTest WHERE col1 = 'hello';
--This query won't return any rows ('World' and 'world' do not match
--when case sensitive):
SELECT * FROM CollateTest WHERE col2 = 'world';
This example highlights
how confusing it would be if every character value used a different
collation. Not only would there be issues in sorting and matching
values, but when trying to assign values from one character-based column to
another, corruption could occur if the target char or varchar column did not use the same code page as the source and therefore could not map some of the characters.
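When we do end up comparing values across mismatched collations, the expression-level COLLATE clause provides a per-query escape hatch. A sketch against the CollateTest table created above, overriding the case-sensitive col2 with a case-insensitive collation for one query:

```sql
-- Override col2's case-sensitive collation for this comparison only;
-- unlike the earlier query, this one now returns the 'World' row.
SELECT * FROM CollateTest
WHERE col2 COLLATE Latin1_General_100_CI_AI_KS_WS = 'world';
```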
Tip
Don’t let all this discussion of collations confuse you. We want
to use the same collation in all places. They help us manage what kinds
of character data we can store, and how we can find and sort that data.
You are on the right track if you can pick an appropriate collation and
then stick to it.
Exercise. Testing Database Collations
Open a new query in SQL Server Management Studio and select the master database from the list of available databases. Enter and run the following code to create a new database with a case- and accent-sensitive collation:
CREATE DATABASE CollationTest COLLATE SQL_Latin1_General_Cp1_CS_AS;
GO
USE CollationTest;
GO
CREATE TABLE TestTable (Col1 nvarchar(20));
INSERT INTO TestTable (Col1) VALUES ('Some Data');
Now try each of the three following queries separately. Try to understand the result you get:
SELECT * FROM testtable;
SELECT * FROM TestTable WHERE Col1 = 'some data';
SELECT * FROM TestTable WHERE Col1 = 'Some Data';
The
preceding section covered the basics of working with collations. Now
that we understand how we can accept and store values from international
data sources, we will discuss returning error messages to clients based
on their connections' culture settings.
Defining Language-Specific Custom Errors
When clients from multiple
cultures connect to the same SQL Server instance, it would be
beneficial to return error information to those clients in a language
they can understand. When we create user-defined error messages in SQL
Server, we can provide multiple versions of the error text, with each
version targeting a different language.
SQL Server gives
developers the ability to create error messages specific to their
applications. We create the error message definitions using the
sp_addmessage system stored procedure, and we then raise them using the RAISERROR
statement. When we raise the errors in a client’s connection, SQL
Server can select an error message that matches the client session’s
culture if one is available. You can use the sp_helplanguage
system stored procedure for details on all possible languages. To
create localized versions of error messages, the following conditions
must also be met:
The U.S. English version of a custom error message must be defined before any localized versions.
The localized version must have the same defined severity as the U.S. English version.
If
the U.S. English version of the language defines parameters, the
localized versions then reference the parameters by their position in
the original error message. If the original message had two parameters,
the localized versions refer to the first as “%1!” (without the quotes),
and the second as “%2!”, and so on. Notice the exclamation mark at the
end of each parameter number.
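As mentioned above, sp_helplanguage lists every language SQL Server supports; the session's current language can also be checked with the @@LANGUAGE function. A quick sketch:

```sql
-- Inspect the current session language and the available languages.
SELECT @@LANGUAGE AS CurrentLanguage;
EXEC sp_helplanguage;                      -- all supported languages
EXEC sp_helplanguage @language = 'french'; -- details for one language
```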
The following example
creates a parameterized custom error message in U.S. English, as well as
in French. It then tests the error in both languages:
-- Create the original error message with the number
-- 50001 in US English
DECLARE @msg nvarchar(255);
SET @msg = 'This is a sample message with two parameters. ' +
'The parameters values are %s, and %s.';
EXEC sp_addmessage
@msgnum = 50001,
@severity = 16,
@lang='us_english',
@msgtext = @msg;
-- Next create a French language version of the same error.
-- Notice the %1! and %2! parameter place holder tokens.
-- Also notice that they have been used in a different order
-- than they were defined in the original error message. That
-- is perfectly valid.
SET @msg = 'C''est un message témoin avec deux paramètres. ' +
'Les valeurs des paramètres sont %2!, et %1!.';
EXEC sp_addmessage
@msgnum = 50001,
@severity = 16,
@lang='french',
@msgtext = @msg,
@replace='replace';
-- Test the US English version by ensuring the session is
-- set to use US_English as the language, and then raising
-- the error with the RAISERROR statement.
SET LANGUAGE us_english;
RAISERROR(50001,16,1,'Hello','World');
-- Now try the exact same RAISERROR statement, but with the
-- session set to use the French language.
SET LANGUAGE French;
RAISERROR(50001,16,1,'Hello','World');
By
combining the use of collations with localized versions of error
messages, we have a toolset for supporting a variety of data sources and
clients from around the world.