Initially I was going to call this article “Struggling With Collation: The SeQueL”, but it just doesn’t have the same ring to it as “Fun With Strings”. In that previous article I showed how you might get different results when loading data from a temporary table or table variable and I suggested that one way of solving this is by switching your data type to nvarchar.
Unicode Or Not?
Reason #1 For Not
Today I’m going to show you that nvarchar is not always what we want to use, especially if we don’t need to support Unicode strings. Imagine a staging scenario when loading a data warehouse. Often the Business Keys (BK) are strings, and depending on the source system, sometimes very long strings – I’ve seen situations with a combined business key of over 500 bytes! (You can’t imagine what some data sources look like but that’s another story.) Do we really want to convert these to Unicode, and thus double their size? Furthermore, to improve lookups we put indexes on those BKs. These indexes would double in size as well. So no, we don’t really want to make these fields Unicode, and certainly not when we want our ETLs to perform as fast as possible.
Reason #2 For Not
That was reason number one why nvarchar is not always the solution. And here comes reason number two. In my scenario, the source tables are located in an Oracle database. And guess what: by default Oracle’s ORDER BY behaves different than SQL Server’s ORDER BY (when using the regular Latin1_General_CI_AS or SQL_Latin1_General_CP1_CI_AS collations)! By default Oracle uses binary string comparison to sort its data and the reason for it appears to be that that’s the only way to prevent a full table scan. I’m no Oracle expert but that’s what the documentation states.
Here’s a little demonstration. The following script prepares a table variable and selects the data from it, sorted ascending.
declare @tbl table( col1 varchar(20));
insert into @tbl select ‘AA’;
insert into @tbl select ‘A’;
insert into @tbl select ‘A-’;
insert into @tbl select ‘A A’;
insert into @tbl select ‘BA’;
insert into @tbl select ’0′;
insert into @tbl select ’1′;
insert into @tbl select ‘-0′;
insert into @tbl select ‘-1′;
insert into @tbl select ‘A0′;
insert into @tbl select ’0A’;
insert into @tbl select ‘-A’;
insert into @tbl select ‘-B’;
insert into @tbl select ‘a’;
insert into @tbl select ‘b’;
insert into @tbl select ‘ ‘;
select * from @tbl
order by col1 asc;
I have executed it once just as stated above (while connected to a database that uses the SQL_Latin1_General_CP1_CI_AS collation) and once more while using nvarchar as data type for the column in the table variable. The first execution will sort the data using a non-Unicode sorting algorithm, while the second execution will order the data according to the Unicode sorting method. The results will be shown further below for easier comparison.
On Oracle I performed a similar procedure, as shown in following script.
select cast(‘AA’ as varchar(20)) as col1 from Dual union
select ‘A’ as col1 from Dual union
select ‘A-’ as col1 from Dual union
select ‘A A’ as col1 from Dual union
select ‘BA’ as col1 from Dual union
select ’0′ as col1 from Dual union
select ’1′ as col1 from Dual union
select ‘-0′ as col1 from Dual union
select ‘-1′ as col1 from Dual union
select ‘A0′ as col1 from Dual union
select ’0A’ as col1 from Dual union
select ‘-A’ as col1 from Dual union
select ‘-B’ as col1 from Dual union
select ‘a’ as col1 from Dual union
select ‘b’ as col1 from Dual union
select ‘ ‘ as col1 from Dual
order by col1 asc;
The Oracle script doesn’t use a table variable, it just creates a result set using several select statements with a union in between. But for our test that doesn’t matter, the results using this method are suitable.
In the table below you can see the result of the three executions.
| SQL non-Unicode |
SQL Unicode |
Oracle Binary |
|
| |
|
|
< space |
| -0 |
0 |
-0 |
|
| -1 |
-0 |
-1 |
|
| -A |
0A |
-A |
|
| -B |
1 |
-B |
|
| 0 |
-1 |
0 |
|
| 0A |
A |
0A |
|
| 1 |
a |
1 |
|
| A |
-A |
A |
|
| a |
A- |
A A |
|
| A A |
A A |
A- |
|
| A- |
A0 |
A0 |
|
| A0 |
AA |
AA |
|
| AA |
b |
BA |
|
| b |
-B |
a |
|
| BA |
BA |
b |
|
As you can see, they only agree on one thing: space really is the smallest character in my test set! And that’s not what I want, I want all the data to be sorted consistently, no matter what the source is.
But Why Sorted?
You may wonder why I need to sort the data. Well, some components in Integration Services expect the incoming data flows to be ordered. One of the standard components that requires this is the Merge Transformation. Another (custom!) component is Table Difference. I could of course add a Sort Transformation to my Data Flow, but that would not be interesting for performance. I want the data to come from the database server in the expected order. So now I’ll show you how you can do that.
Taking Control!
SQL Server: ORDER BY … COLLATE …
On SQL Server this was fairly easy. The ORDER BY clause has a COLLATE part where you can specify what collation should be used to order the data. Because Oracle sorts its data using a binary algorithm, I’ll tell SQL Server to do that as well. More precisely I’ll tell SQL Server to use the Latin1_General_BIN collation. The updated SELECT statement from the T-SQL script above looks like this:
select * from @tbl
order by col1 collate Latin1_General_BIN asc;
Oracle: ORDER BY NLSSORT()
To ensure that results from Oracle are always returned using the same sorting algorithm, I will also tell the Oracle server to sort it’s data using the binary algorithm.
The first way I came up with was to change the NLS_SORT setting on the session. That can be done by executing the following command before the SELECT statement:
ALTER SESSION SET NLS_SORT=BINARY;
This method is fine when you’re running the queries manually from a client such as Oracle SQL Developer. However, in SSIS the OLE DB Source component will not accept anything else besides the SELECT statement.
Then I found another way. There’s a function called NLSSORT() which you can apply to a column in the ORDER BY clause. The following statement demonstrates how to use this function. (I only show the ORDER BY clause as it can be applied to the Oracle script mentioned earlier.)
ORDER BY NLSSORT(col1, ‘NLS_SORT=BINARY’)
The following table shows the results from both binary sort queries:
| SQL Binary |
Oracle Binary |
| |
|
| -0 |
-0 |
| -1 |
-1 |
| -A |
-A |
| -B |
-B |
| 0 |
0 |
| 0A |
0A |
| 1 |
1 |
| A |
A |
| A A |
A A |
| A- |
A- |
| A0 |
A0 |
| AA |
AA |
| BA |
BA |
| a |
a |
| b |
b |
Finally I am able to get data from both Oracle and SQL Server using a consistent sort order.
But, How Big Is NULL?
However, even on this straightforward request, both database servers do not fully agree! Here’s what they have to say about the topic:
“NULL is the smallest.”
“No, it’s the largest.”
“No, smallest!”
“Largest!!”
“Smallest.”
“Largest I tell you!!!”
“Bladiebla, not hearing you, anyway, it’s NOTHING!”
“No, it isn’t!”
*discussion goes on and on*
If I add NULL to my test data set, SQL Server will sort it first (thus NULL is the smallest value in my test set), while Oracle will put it last. In my situation it wasn’t really an issue (the BKs are not supposed to be NULL), but it’s quite important to remember in cases where NULLs are actually possible.
Conclusion
When working with strings, always keep collation in mind. And even more so when dealing with several different source systems!
Additional reference material:
Database Journal: The Globalization of Language in Oracle – The NLS_COMP and NLS_SORT variables
BOL 2008: How to: Sort Data for the Merge and Merge Join Transformations