collation


If you’ve ever encountered the words collation and conflict in the same sentence, or better, the same error message, then you’re going to love this one.  It’s a Microsoft Connect request that has existed since 2008, created by SQL Server MVP Erland Sommarskog, and it suggests that the SQL Server installer should not pre-select a server collation by default.

Oh yeah, it sure got my vote!  I don’t think I’ve ever come across a server that was initially installed with the correct collation, unless I installed it myself.  Not selecting a collation by default in the setup means that whoever is installing the server, usually a DBA, needs to think about (and more importantly: understand!) what collations are and how selecting the wrong one might impact the applications using the databases.  No more messed-up servers with half of the databases using Latin1_General_CI_AS and the other half using SQL_Latin1_General_CP1_CI_AS, with possibly a couple of French_CI_AS databases thrown in for fun.

Sound familiar?  Then vote this up!


Initially I was going to call this article “Struggling With Collation: The SeQueL”, but it just doesn’t have the same ring to it as “Fun With Strings”.  In that previous article I showed how you might get different results when loading data from a temporary table or table variable and I suggested that one way of solving this is by switching your data type to nvarchar.

Unicode Or Not?

Reason #1 For Not

Today I’m going to show you that nvarchar is not always what we want to use, especially if we don’t need to support Unicode strings.  Imagine a staging scenario when loading a data warehouse.  Often the Business Keys (BK) are strings, and depending on the source system, sometimes very long strings – I’ve seen situations with a combined business key of over 500 bytes!  (You can’t imagine what some data sources look like but that’s another story.)  Do we really want to convert these to Unicode, and thus double their size?  Furthermore, to improve lookups we put indexes on those BKs.  These indexes would double in size as well.  So no, we don’t really want to make these fields Unicode, and certainly not when we want our ETLs to perform as fast as possible.

Reason #2 For Not

That was reason number one why nvarchar is not always the solution.  And here comes reason number two.  In my scenario, the source tables are located in an Oracle database.  And guess what: by default Oracle’s ORDER BY behaves differently from SQL Server’s ORDER BY (when using the regular Latin1_General_CI_AS or SQL_Latin1_General_CP1_CI_AS collations)!  By default Oracle uses binary string comparison to sort its data, and the reason appears to be that it’s the only way to avoid a full table scan when sorting.  I’m no Oracle expert, but that’s what the documentation states.
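To see what binary comparison means in practice, here’s a minimal T-SQL sketch (my own illustration, not part of the original test).  Under a binary collation, strings are compared by code point, so uppercase ‘B’ (0x42) sorts before lowercase ‘a’ (0x61), whereas a case-insensitive dictionary collation would put ‘a’ first:

```sql
-- Binary comparison is by code point: 'a' is 0x61 and 'B' is 0x42,
-- so under Latin1_General_BIN the condition below is false.
select case when 'a' collate Latin1_General_BIN < 'B' collate Latin1_General_BIN
            then 'dictionary-like order'
            else 'binary order'
       end as ComparisonResult;
```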

Here’s a little demonstration.  The following script prepares a table variable and selects the data from it, sorted ascending.

declare @tbl table( col1 varchar(20));
insert into @tbl select 'AA';
insert into @tbl select 'A';
insert into @tbl select 'A-';
insert into @tbl select 'A A';
insert into @tbl select 'BA';
insert into @tbl select '0';
insert into @tbl select '1';
insert into @tbl select '-0';
insert into @tbl select '-1';
insert into @tbl select 'A0';
insert into @tbl select '0A';
insert into @tbl select '-A';
insert into @tbl select '-B';
insert into @tbl select 'a';
insert into @tbl select 'b';
insert into @tbl select ' ';

select * from @tbl
order by col1 asc;

I have executed it once just as stated above (while connected to a database that uses the SQL_Latin1_General_CP1_CI_AS collation) and once more while using nvarchar as data type for the column in the table variable.  The first execution will sort the data using a non-Unicode sorting algorithm, while the second execution will order the data according to the Unicode sorting method.  The results will be shown further below for easier comparison.
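For reference, the second run used the very same script with only the declaration changed to a Unicode type:

```sql
-- Unicode variant of the test: only the column's data type changes,
-- the inserts and the SELECT stay identical.
declare @tbl table( col1 nvarchar(20));
```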

On Oracle I performed a similar procedure, as shown in following script.

select cast('AA' as varchar(20)) as col1 from Dual union
select 'A' as col1 from Dual union
select 'A-' as col1 from Dual union
select 'A A' as col1 from Dual union
select 'BA' as col1 from Dual union
select '0' as col1 from Dual union
select '1' as col1 from Dual union
select '-0' as col1 from Dual union
select '-1' as col1 from Dual union
select 'A0' as col1 from Dual union
select '0A' as col1 from Dual union
select '-A' as col1 from Dual union
select '-B' as col1 from Dual union
select 'a' as col1 from Dual union
select 'b' as col1 from Dual union
select ' ' as col1 from Dual
order by col1 asc;

The Oracle script doesn’t use a table variable; it just builds a result set out of several SELECT statements combined with UNION.  For our test that doesn’t matter: the results produced this way are equally suitable.

In the table below you can see the result of the three executions.

SQL non-Unicode   SQL Unicode   Oracle Binary
(space)           (space)       (space)
-0                0             -0
-1                -0            -1
-A                0A            -A
-B                1             -B
0                 -1            0
0A                A             0A
1                 a             1
A                 -A            A
a                 A-            A A
A A               A A           A-
A-                A0            A0
A0                AA            AA
AA                b             BA
b                 -B            a
BA                BA            b

As you can see, they only agree on one thing: space really is the smallest character in my test set!  And that’s not what I want, I want all the data to be sorted consistently, no matter what the source is.

But Why Sorted?

You may wonder why I need the data sorted at all.  Well, some components in Integration Services expect their incoming data flows to be ordered.  One of the standard components that requires this is the Merge Transformation.  Another (custom!) component is Table Difference.  I could of course add a Sort Transformation to my Data Flow, but that would hurt performance.  I want the data to arrive from the database server already in the expected order.  So now I’ll show you how you can do that.

Taking Control!

SQL Server: ORDER BY … COLLATE …

On SQL Server this was fairly easy.  The ORDER BY clause has a COLLATE part where you can specify what collation should be used to order the data.  Because Oracle sorts its data using a binary algorithm, I’ll tell SQL Server to do that as well.  More precisely I’ll tell SQL Server to use the Latin1_General_BIN collation.  The updated SELECT statement from the T-SQL script above looks like this:

select * from @tbl
order by col1 collate Latin1_General_BIN asc;

Oracle: ORDER BY NLSSORT()

To ensure that results from Oracle are always returned using the same sorting algorithm, I will also tell the Oracle server to sort its data using the binary algorithm.

The first way I came up with was to change the NLS_SORT setting on the session.  That can be done by executing the following command before the SELECT statement:

ALTER SESSION SET NLS_SORT=BINARY;

This method is fine when you’re running the queries manually from a client such as Oracle SQL Developer.  However, in SSIS the OLE DB Source component will not accept anything else besides the SELECT statement.

Then I found another way.  There’s a function called NLSSORT() which you can apply to a column in the ORDER BY clause.  The following statement demonstrates how to use this function.  (I only show the ORDER BY clause as it can be applied to the Oracle script mentioned earlier.)

ORDER BY NLSSORT(col1, 'NLS_SORT=BINARY')
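For completeness, here’s how that looks applied to a plain source table (SourceTable and col1 are placeholder names, not from the actual staging scenario):

```sql
-- NLSSORT makes Oracle sort this query with the binary algorithm,
-- regardless of the session's NLS_SORT setting.
select col1
from SourceTable
order by NLSSORT(col1, 'NLS_SORT=BINARY');
```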

The following table shows the results from both binary sort queries:

SQL Binary   Oracle Binary
-0           -0
-1           -1
-A           -A
-B           -B
0            0
0A           0A
1            1
A            A
A A          A A
A-           A-
A0           A0
AA           AA
BA           BA
a            a
b            b

Finally I am able to get data from both Oracle and SQL Server using a consistent sort order.

But, How Big Is NULL?

However, even on this straightforward requirement, the two database servers don’t fully agree!  Here’s what they have to say about the topic:

“NULL is the smallest.”

“No, it’s the largest.”

“No, smallest!”

“Largest!!”

“Smallest.”

“Largest I tell you!!!”

“Bladiebla, not hearing you, anyway, it’s NOTHING!”

“No, it isn’t!”

*discussion goes on and on*

If I add NULL to my test data set, SQL Server will sort it first (thus NULL is the smallest value in my test set), while Oracle will put it last.  In my situation it wasn’t really an issue (the BKs are not supposed to be NULL), but it’s quite important to remember in cases where NULLs are actually possible.
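If NULLs can occur and you need both servers to agree, one option is to make the NULL position explicit on both sides.  Here’s a sketch (my own suggestion, not tested against every version): Oracle’s ORDER BY supports NULLS FIRST / NULLS LAST, while SQL Server doesn’t, but a CASE expression as leading sort key achieves the same effect.

```sql
-- Oracle: make NULLs sort first, matching SQL Server's default:
--   order by NLSSORT(col1, 'NLS_SORT=BINARY') nulls first;

-- SQL Server: or push NULLs last instead, matching Oracle's default:
select * from @tbl
order by case when col1 is null then 1 else 0 end,
         col1 collate Latin1_General_BIN;
```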

Conclusion

When working with strings, always keep collation in mind.  And even more so when dealing with several different source systems!

Additional reference material:

Database Journal: The Globalization of Language in Oracle – The NLS_COMP and NLS_SORT variables

BOL 2008: How to: Sort Data for the Merge and Merge Join Transformations


Recently I was investigating an issue related to ordering data.  As a test, I ran the following script:

-- Sorting data from a table variable
declare @tbl table ( ProductNumber varchar(25) );
insert into @tbl select 'BBBB';
insert into @tbl select 'AAAA';
insert into @tbl select 'A-B';
insert into @tbl select 'A123';
select * from @tbl order by ProductNumber asc;

-- Sorting data from a temporary table
create table #tbl ( ProductNumber varchar(25) );
insert into #tbl select 'BBBB';
insert into #tbl select 'AAAA';
insert into #tbl select 'A-B';
insert into #tbl select 'A123';
select * from #tbl order by ProductNumber asc;
drop table #tbl;

And it gave me this result:

Results from script

As you can see, the order of the data coming from the temporary table is different from that of the table variable, even though the same data type is used.  At first I thought: how on earth is this possible?  After spending some time pondering this problem (and after a colleague reported that when he executed the above script, the result was as expected), it came to me.  Don’t tell me it’s a collation problem?!  Well, it is.  (In case you’re not familiar with collations, in short: “Collations specify the rules for how strings of character data are sorted and compared”.  More info through that previous link and here.)

When I executed the script, my Management Studio session was connected to AdventureWorks2008, one of my test databases.  When I switched to tempdb, the result was normal – both queries returned the same result.  Then I had a look at the collations.  My tempdb is using Latin1_General_CI_AS while the AdventureWorks2008 database is apparently using SQL_Latin1_General_CP1_CI_AS.

A temporary table is created in tempdb and thus uses tempdb’s collation for its string columns, while a table variable uses the collation of the database to which the session is connected.  That’s quite logical: otherwise you would get collation conflicts whenever you combined the table variable with a table from the active database in the same query.  I will demonstrate that with the following script:
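One way to take this ambiguity out of the earlier sorting scripts (a sketch; pick whichever collation your environment standardizes on) is to declare the string columns with an explicit COLLATE clause, so neither tempdb nor the current database decides for you:

```sql
-- With an explicit collation, the column sorts the same way
-- regardless of which database it lives in.
declare @tbl table ( ProductNumber varchar(25) collate Latin1_General_CI_AS );
create table #tbl ( ProductNumber varchar(25) collate Latin1_General_CI_AS );
```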

declare @tbl table ( ProductNumber varchar(25) );
insert into @tbl select 'BBBB';
select * from AdventureWorks.Production.Product P
where P.ProductNumber in (select ProductNumber from @tbl);

On my (SQL Server 2008) server I also have the old SQL2005 AdventureWorks database up and running, which is using the Latin1_General_CI_AS collation.  Executing the above script (which references that AdventureWorks DB) while being connected to AdventureWorks2008 (or any other DB that uses a different collation from Latin1_General_CI_AS) results in the following error:

Msg 468, Level 16, State 9, Line 3
Cannot resolve the collation conflict between “SQL_Latin1_General_CP1_CI_AS” and “Latin1_General_CI_AS” in the equal to operation.

Changing the active database to any other DB that uses the Latin1_General_CI_AS collation and then executing the query results in a positive execution.
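If you do run into Msg 468, one common workaround (a sketch, not the only option) is to force a collation explicitly at the point of comparison, for instance with COLLATE DATABASE_DEFAULT:

```sql
-- Coerce both sides of the comparison to the current database's collation,
-- so the cross-database query no longer raises a collation conflict.
select * from AdventureWorks.Production.Product P
where P.ProductNumber collate DATABASE_DEFAULT
      in (select ProductNumber collate DATABASE_DEFAULT from @tbl);
```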

Now, to get back to the initial issue of sorting inconsistency: even though the collations are not exactly the same, they are both Latin1, case-insensitive and accent-sensitive, and yet they don’t sort the data in the same way.  For an explanation I found the following page on the Microsoft Support site: Comparing SQL collations to Windows collations.  In short: a Windows collation (such as Latin1_General_CI_AS) uses a different comparison algorithm than a SQL collation (such as SQL_Latin1_General_CP1_CI_AS).  A Windows collation uses the same algorithm for non-Unicode data as it does for Unicode data.

All that means is that in our test example, in order to get the data in the right order all the time, we could switch to Unicode fields:

-- Sorting data from a table variable
declare @tbl table ( ProductNumber nvarchar(25) );
insert into @tbl select 'BBBB';
insert into @tbl select 'AAAA';
insert into @tbl select 'A-B';
insert into @tbl select 'A123';
select * from @tbl order by ProductNumber asc;

-- Sorting data from a temporary table
create table #tbl ( ProductNumber nvarchar(25) );
insert into #tbl select 'BBBB';
insert into #tbl select 'AAAA';
insert into #tbl select 'A-B';
insert into #tbl select 'A123';
select * from #tbl order by ProductNumber asc;
drop table #tbl;

Executing that returns the following result, no matter what the active database is:

Script result when using Unicode fields

And it also means that you need to be careful when retrieving data from different sources (which is how I came across the issue in the first place).  If you need to have your data sorted in a particular, consistent way and you’re doing that by using an ORDER BY in a SELECT statement on the source system, double-check if all sources are using the same collation!

In case the above left you wondering about what collation to use: the SQL collations are there for backward compatibility.  For new developments use the Windows collations!

See my Fun With Strings article for another story related to collation.

Additional reference material:

SQL Server Central – Comparing Table Variables with Temporary Tables


© 2008-2017 BI: Beer Intelligence? All Rights Reserved