Supporting data from sources around the
world, as well as clients from various cultures, is a common
requirement these days. SQL Server has been designed to support these
international data sets and clients out of the box. The two main
features we will discuss here are collations, which govern how character
data is stored, sorted, and compared, and support for returning error
messages to clients in an appropriate language.
Understanding Collations
Collations are the
mechanism by which we tell SQL Server which characters (letters and
symbols) we want to be able to work with, and how it should sort and
compare values based on those characters. We have the ability to define
these collation
settings for each instance, database, column, and expression. To truly
understand collations, we first need to understand how SQL Server stores
and retrieves data using data types; next we need to understand the
difference between non-Unicode and Unicode data; and finally we need to
understand how SQL Server sorts and compares character values.
Computers, and thus SQL
Server, work with and store data as binary values. We humans don’t have
this capability (at least not most of us). When we ask SQL Server to
store data for us in a database, SQL is responsible for properly
transforming the values (words, letters, numbers, dates, etc.) that we
give it into binary sequences that can be written to disk. When we later
ask SQL Server to retrieve the values for us, it reverses the process
and transforms the stored binary values back into the words, letters,
numbers, or dates we originally gave it. SQL Server uses data types (int, datetime, char, nchar,
etc.) to perform and manage that transformation in a consistent and
formal manner. There is a set of data types specifically for storing
character data; they include char, varchar, nchar, and nvarchar.
The difference between char and nchar as well as varchar and nvarchar is how many bytes each stored character takes, and from that, the range of possible character symbols that can be stored. The char and varchar
types normally store characters from a limited set of 256 possible
characters (determined by the selected “character set” or “code page”),
with each character taking a single byte of storage. The nchar and nvarchar
data types store characters from the Unicode character set, which
contains more than 65,000 possible characters, with each character
stored using two bytes.
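The storage difference is easy to see with the DATALENGTH function, which returns the size of a value in bytes (a quick sketch; any database will do, and the variable names are illustrative):

```sql
-- char stores one byte per character; nchar stores two.
DECLARE @c  char(10)  = 'Hola';
DECLARE @nc nchar(10) = N'Hola';
SELECT DATALENGTH(@c)  AS CharBytes,   -- 10 bytes (fixed length, 1 byte per character)
       DATALENGTH(@nc) AS NCharBytes;  -- 20 bytes (fixed length, 2 bytes per character)
```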
SQL Server makes numerous character sets from the ANSI and ISO standards bodies available for the non-Unicode char and varchar
data types. Most of those character sets define a single byte per
character. Some Asian character sets require two bytes per character
because they have so many characters in their language. We call a
non-Unicode character set that requires two bytes a Double Byte
Character Set (or DBCS). With a single byte we can represent a total of
256 possible distinct values. A character set maps each of those possible
values to a given character. The most common ANSI code page used on
North American systems is code page 1252, also referred to as Latin1. Table 1
shows a sample of binary values, their decimal equivalents, and the
character that is mapped to it. Because different cultures and languages
have different symbols, there is a problem fitting all symbols into a
single set of 256. It is for that reason that we require multiple
character sets. We need to then instruct SQL Server which specific
character set we wish to use for each char and varchar value. Each char or varchar value can be a maximum of 8000 bytes. With each character taking one byte, that means they can store 8000 characters. The varchar(max) large object data type (the successor to the old text data type) can store up to 2GB (2^31-1 bytes) of data, or 2^31-1 (more than two billion) characters.
Table 1. Sample Code Page 1252 Character Set Values
Binary Value | Decimal Value | ASCII Character
---|---|---
01000001 | 65 | A
01000010 | 66 | B
01000011 | 67 | C
01100001 | 97 | a
01100010 | 98 | b
01100011 | 99 | c
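The mapping in Table 1 can be verified directly with the built-in ASCII and CHAR functions, which convert between a character and its decimal code-point value:

```sql
-- ASCII() returns the decimal value of a character;
-- CHAR() returns the character mapped to a decimal value.
SELECT ASCII('A') AS DecimalValue,  -- 65
       CHAR(65)   AS Character;    -- 'A'
```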
SQL Server also makes available the set of characters from the Unicode Consortium’s Unicode character set for the nchar and nvarchar
data types. Unicode stores each character as two bytes. With two bytes
available, the Unicode Character set can represent 65,536 possible
distinct values. With all those characters, we can represent all the
symbols from the world’s languages in a single character set—hence the
name Unicode. This makes it an obvious choice when supporting
international data sources and clients. Of course, the convenience comes
with a cost. Like the char and varchar data types, the nchar and nvarchar
data types can store values up to 8000 bytes. Each character, however,
takes two bytes of storage, meaning they can store a total of 4000
characters each (or 2^30-1 characters, just over one billion, for nvarchar(max)). It should be mentioned that the "n" in front of the nchar and nvarchar data types comes from the SQL standard's name for Unicode, which is "National Character."
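One practical consequence is worth noting here: Unicode string literals must be written with the N prefix. Without it, SQL Server interprets the literal as non-Unicode using the current code page before storing or comparing it, which can silently lose characters. A sketch (the exact substitute characters returned depend on the session's code page):

```sql
-- Without the N prefix, characters outside the current code page
-- are replaced (typically with '?'); with it, they survive intact.
SELECT 'こんにちは'  AS WithoutPrefix,
       N'こんにちは' AS WithPrefix;
```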
So to recap, we can represent
both non-Unicode and Unicode character data. The good news is that we
don’t have to decide to make exclusive use of non-Unicode or Unicode
data. Both are supported at all times. The char and varchar data types store non-Unicode data using a code page we specify. The nchar and nvarchar data types store Unicode data.
The next step is to
understand how SQL Server sorts and compares data. Specifically, how
does SQL Server sort or compare two “similar” values like an uppercase
letter “A” and a lowercase letter “a”? The answer is that it depends. It
depends on our preferences for case sensitivity (A vs. a), accent
sensitivity (à vs. å), Japanese character kana sensitivity (Hiragana vs.
Katakana), and width sensitivity (single-byte A vs. double-byte A). We
could also choose to have SQL Server just sort and compare based on the
binary values themselves rather than their character representations.
The case sensitivity
we choose determines whether SQL Server treats uppercase “A” and
lowercase “a” as equal or as distinct. The “case-insensitive” option
implies that they should be treated as equals, whereas “case sensitive”
implies that they should be treated as distinct. The case-insensitive
option actually costs a little more in terms of processing because SQL
Server has to treat two values that are not the same as though they were.
We can cut that cost somewhat by
choosing a case-sensitive sort order. We can cut costs even more by
choosing to sort simply by the binary values rather than by the
characters they represent. Each cost-cutting step, while improving
performance, adds difficulty for us as to how we specify character
literals. For example, if we had a case-sensitive or binary collation
selected, "This" would not equal "this" because they are not the exact
same characters. The accent, kana, and width sensitivity choices follow
the same pattern.
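The effect of these sensitivity choices can be demonstrated by applying the COLLATE clause directly to a comparison, overriding the session's default for just that expression (a minimal sketch using two standard Windows collations):

```sql
-- The same comparison evaluated under a case-sensitive (CS) and a
-- case-insensitive (CI) collation gives different answers.
SELECT CASE WHEN 'This' = 'this' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'distinct' END AS CaseSensitiveResult,   -- 'distinct'
       CASE WHEN 'This' = 'this' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'distinct' END AS CaseInsensitiveResult; -- 'equal'
```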
The sorting of
character values is also determined by the selected collation. Different
languages or cultures may sort certain values differently. For example,
some Spanish-speaking cultures may sort the "ch" digraph as a single letter that comes after cz rather than between ce and ci, as would be done in English. Under such a collation, churro would come after Czech rather than before it, as would be expected in U.S. English.
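We can see this directly by sorting the same list of words under two different collations (a sketch; Traditional_Spanish_CI_AS is the Windows collation that treats "ch" as a single letter):

```sql
-- The same words sorted under a Traditional Spanish collation
-- and then under a Latin1 (English-style) collation.
DECLARE @words table (word varchar(20));
INSERT INTO @words VALUES ('cerveza'), ('churro'), ('cielo'), ('Czech');
SELECT word FROM @words ORDER BY word COLLATE Traditional_Spanish_CI_AS; -- churro sorts last
SELECT word FROM @words ORDER BY word COLLATE Latin1_General_CI_AS;      -- churro sorts between cielo and Czech
```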
We choose which non-Unicode
code page we wish to use, as well as how all character data should be
sorted and compared by selecting a collation. A collation combines the
code page selection and sorting preferences into a single setting. SQL
Server supports collations that are compatible with previous versions of
SQL Server as well as collations that are compatible with your Windows
operating system. Choosing between Windows and SQL Collations is based
on whether you are trying to maintain compatibility with a previous
version of SQL Server (SQL Collation) or with your Windows server and
client operating systems (Windows Collations).
Windows Collation names take the following form:
<Windows_collation_name> :: =
CollationDesignator_<ComparisonStyle>
<ComparisonStyle> :: =
{ CaseSensitivity_AccentSensitivity
[ _KanatypeSensitive ] [ _WidthSensitive ] }
| { _BIN | _BIN2 }
SQL collation names take the form:
<SQL_collation_name> :: =
SQL_SortRules[_Pref]_CPCodepage_<ComparisonStyle>
<ComparisonStyle> ::=
_CaseSensitivity_AccentSensitivity | _BIN
For a list of all
collations, both Windows and SQL, we can run the following query (the
SQL Collation names start with SQL; the rest are Windows Collation
names):
SELECT * FROM fn_helpcollations()
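Because the full list runs to several thousand rows, it is often useful to filter it. For example, to see only the SQL collations, or only case-sensitive Latin1 Windows collations:

```sql
-- SQL collations only (names start with SQL_).
SELECT name, description
FROM fn_helpcollations()
WHERE name LIKE 'SQL%';

-- Case-sensitive Latin1 Windows collations.
SELECT name, description
FROM fn_helpcollations()
WHERE name LIKE 'Latin1_General%CS%';
```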
We can specify a
different collation with every character value we store, but this
action is not recommended. Ideally, we would use the same collation for
every character value. For that reason, we should consider what our
collation should be before we install the instance of SQL Server. During
the installation process, we will select a default collation that all
databases, columns, and expressions will use. We can then explicitly
override the default when we have a specific reason to do so. We use the
COLLATE clause to specify which collation we wish to use. The COLLATE
clause can be used when creating a database, defining a column, or at
the end of a character expression. The following example creates a table
with two character columns, each based on a different collation. The
first column (col1) is based on a case-insensitive (CI) character set,
whereas the second column (col2) is based on a case-sensitive (CS)
collation:
USE AdventureWorks2008;
GO
CREATE TABLE CollateTest
(col1 nvarchar(10) COLLATE Latin1_General_100_CI_AI_KS_WS NOT NULL,
col2 nvarchar(10) COLLATE Latin1_General_100_CS_AI_KS_WS NOT NULL);
INSERT INTO CollateTest (col1,col2) VALUES ('Hello','World');
--The following query will return a single row ('Hello' and 'hello'
--match when case insensitive):
SELECT * FROM CollateTest WHERE col1 = 'hello';
--This query won't return any rows ('World' and 'world' do not match
--when case sensitive):
SELECT * FROM CollateTest WHERE col2 = 'world';
This example highlights
how confusing it would be if every character value used a different
collation. Not only would there be issues in sorting and matching
values, but when trying to assign values from one character-based column to
another, corruption could occur if the target char or varchar column did not use the same code page as the source and therefore could not map some of the characters.
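When we do end up comparing values across mismatched collations, the expression-level COLLATE clause provides a per-query escape hatch. A sketch against the CollateTest table created above, overriding the case-sensitive col2 with a case-insensitive collation for one query:

```sql
-- Override col2's case-sensitive collation for this comparison only;
-- unlike the earlier query, this one now returns the 'World' row.
SELECT * FROM CollateTest
WHERE col2 COLLATE Latin1_General_100_CI_AI_KS_WS = 'world';
```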
Tip
Don’t let all this discussion of collations confuse you. We want
to use the same collation in all places. They help us manage what kinds
of character data we can store, and how we can find and sort that data.
You are on the right track if you can pick an appropriate collation and
then stick to it.
Exercise. Testing Database Collations
Open a new query in SQL Server Management Studio and select the master database from the list of available databases. Enter and run the following code to create a new database with a case- and accent-sensitive collation:
CREATE DATABASE CollationTest COLLATE SQL_Latin1_General_Cp1_CS_AS;
GO
USE CollationTest;
GO
CREATE TABLE TestTable (Col1 nvarchar(20));
INSERT INTO TestTable (Col1) VALUES ('Some Data');
Now try each of the three following queries separately. Try to understand the result you get:
SELECT * FROM testtable;
SELECT * FROM TestTable WHERE Col1 = 'some data';
SELECT * FROM TestTable WHERE Col1 = 'Some Data';
The
preceding section covered the basics of working with collations. Now
that we understand how we can accept and store values from international
data sources, we will discuss returning error messages to clients based
on their connections' culture settings.
Defining Language-Specific Custom Errors
When clients from multiple
cultures connect to the same SQL Server instance, it would be
beneficial to return error information to those clients in a language
they can understand. When we create user-defined error messages in SQL
Server, we can provide multiple versions of the error text, with each
version targeting a different language.
SQL Server gives
developers the ability to create error messages specific to their
applications. We create the error message definitions using the
sp_addmessage system stored procedure, and we then raise them using the RAISERROR
statement. When we raise the errors in a client’s connection, SQL
Server can select an error message that matches the client session’s
culture if one is available. You can use the sp_helplanguage
system stored procedure for details on all possible languages. To
create localized versions of error messages, the following conditions
must also be met:
The U.S. English version of a custom error message must be defined before any localized versions.
The localized version must have the same defined severity as the U.S. English version.
If
the U.S. English version of the language defines parameters, the
localized versions then reference the parameters by their position in
the original error message. If the original message had two parameters,
the localized versions refer to the first as “%1!” (without the quotes),
and the second as “%2!”, and so on. Notice the exclamation mark at the
end of each parameter number.
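As mentioned above, sp_helplanguage lists every language SQL Server supports; the session's current language can also be checked with the @@LANGUAGE function. A quick sketch:

```sql
-- Inspect the current session language and the available languages.
SELECT @@LANGUAGE AS CurrentLanguage;
EXEC sp_helplanguage;                      -- all supported languages
EXEC sp_helplanguage @language = 'french'; -- details for one language
```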
The following example
creates a parameterized custom error message in U.S. English, as well as
in French. It then tests the error in both languages:
-- Create the original error message with the number
-- 50001 in US English
DECLARE @msg nvarchar(255);
SET @msg = 'This is a sample message with two parameters. ' +
'The parameters values are %s, and %s.';
EXEC sp_addmessage
@msgnum = 50001,
@severity = 16,
@lang='us_english',
@msgtext = @msg;
-- Next create a French language version of the same error.
-- Notice the %1! and %2! parameter place holder tokens.
-- Also notice that they have been used in a different order
-- than they were defined in the original error message. That
-- is perfectly valid.
SET @msg = 'C''est un message témoin avec deux paramètres. ' +
'Les valeurs des paramètres sont %2!, et %1!.';
EXEC sp_addmessage
@msgnum = 50001,
@severity = 16,
@lang='french',
@msgtext = @msg,
@replace='replace';
-- Test the US English version by ensuring the session is
-- set to use US_English as the language, and then raising
-- the error with the RAISERROR statement.
SET LANGUAGE us_english;
RAISERROR(50001,16,1,'Hello','World');
-- Now try the exact same RAISERROR statement, but with the
-- session set to use the French language.
SET LANGUAGE French;
RAISERROR(50001,16,1,'Hello','World');
By
combining the use of collations with localized versions of error
messages, we have a toolset for supporting a variety of data sources and
clients from around the world.