SSIS, Flat Files And Accents (é, è, …)
This is a follow-up to my article on Loading Complex XML Using SSIS and XSLT. In that article I demonstrated how you can convert complex XML into simple CSV using XSLT in SSIS.
The resulting DTSX package and input files can be downloaded from my SkyDrive through this link.
Dealing With Special Characters
If you’ve followed the instructions in my article mentioned above and you need to deal with special characters such as the é and è encountered in the French language, you probably noticed that it wouldn’t really work as expected. In fact, in your final result you may have ended up with the special characters being replaced with other, even more special, characters. Obviously not good.
Here’s an explanation on the reason why that happens, and also how to deal with it.
Setting The Scene
Imagine the following sample XML, representing a really huge book collection:
<books> <book> <title>The Hitchhiker's Guide to the Galaxy</title> <author>Douglas Adams</author> <language>EN</language> <description>The Hitchhiker's Guide to the Galaxy is a science fiction comedy series created by Douglas Adams.</description> </book> <book> <title>Le Trône de fer</title> <author>George R.R. Martin</author> <language>FR</language> <description>Le Trône de fer est une série de romans de fantasy de George R. R. Martin, dont l'écriture et la parution sont en cours. Martin a commencé à l'écrire en 1991 et le premier volume est paru en 1996. Prévue à l'origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus.</description> </book> </books>
As you can see, the second book in the list is the French version of the first book in the A Song of Ice and Fire series by George R.R. Martin and as it goes with French, there are some accents in the description of the book.
We’ll use the following XSLT to convert it to CSV:
<?xml version="1.0" encoding="utf-8"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs=http://www.w3.org/2001/XMLSchema xmlns:fn="http://www.w3.org/2005/xpath-functions"> <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/> <xsl:template match="/"> <xsl:text>BookTitle;Author;Language;Description</xsl:text> <xsl:text> </xsl:text> <xsl:for-each select="books/book"> <xsl:text>"</xsl:text> <xsl:value-of select="title"/> <xsl:text>";"</xsl:text> <xsl:value-of select="author"/> <xsl:text>";"</xsl:text> <xsl:value-of select="language"/> <xsl:text>";"</xsl:text> <xsl:value-of select="description"/> <xsl:text>"</xsl:text> <xsl:text> </xsl:text> </xsl:for-each> </xsl:template> </xsl:stylesheet>
Using an XML Task in the Control Flow, as explained in my article, we’d get the following output:
“The Hitchhiker’s Guide to the Galaxy”;”Douglas Adams”;”EN”;”The Hitchhiker’s Guide to the Galaxy is a science fiction comedy series created by Douglas Adams.”
“Le Trône de fer”;”George R.R. Martin”;”FR”;”Le Trône de fer (A Song of Ice and Fire) est une série de romans de fantasy de George R. R. Martin, dont l’écriture et la parution sont en cours. Martin a commencé à l’écrire en 1991 et le premier volume est paru en 1996. Prévue à l’origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus.”
So far so good, all accents are still present!
Then we’d import the file using a Flat File Source component in a Data Flow Task. Here’s what the General page of the Flat File Connection Manager would look like:
We’ve set double-quote as Text Qualifier and checked the Column names in the first data row textbox.
Switching to the Columns page we’d get the following:
Hang on, that’s not right! The Preview is not displaying our accents as expected! Oh my, what’s going on here? Let’s call the code page detectives!
A Mismatch Investigation
Take a good look at the XSLT which we’ve used to convert the XML into CSV, especially the xsl:output line:
<xsl:output method=“text” version=“1.0” encoding=“UTF-8” indent=“no“/>
That line specifies that the text output should be encoded using the UTF-8 code page.
Now take a good look at the General page in the screenshot earlier, more precisely this part:
Indeed, code page 1252 (ANSI – Latin I). While the input is UTF-8. Of course that results in a mismatch of certain characters, as demonstrated here. The fix is fairly easy, just change the Code page setting to 65001 (UTF-8).
If we now switch back to the Columns page we should come to the following result:
Ah, sure looks better doesn’t it? All accents are present as expected.
But in case you thought that’s it, I’d advise you to think again. Don’t worry, I’ll demonstrate what I mean. Let’s do that by setting up a simple Data Flow.
Setting Up The Data Flow
Throw in a Flat File Source and specify our Flat File Connection Manager. I also prefer to keep NULLs as they come in, using the Retain null values from the source as null values in the data flow checkbox.
If you click the Preview button you should get similar output as shown one screenshot earlier.
Now hook this up to an OLE DB Destination that writes the incoming data into a table in your favorite database:
As you can see, our destination is not entirely happy with all this. Here are the details of one of the error messages:
Validation error. Data Flow Task: Data Flow Task: The column “BookTitle” cannot be processed because more than one code page (65001 and 1252) are specified for it.
Looks like once more we’ve got a code page conflict. And we sure do. Clicking the Data Flow connector between the Flat File source and OLE DB destination shows us the following:
Each of our incoming string values is encoded using the 65001 (UTF-8) code page. But our database was created using the Latin1_General_CI_AS collation. So we’ve indeed got a code page conflict!
Fear not, that’s easily remedied. Add a Derived Column transformation in between the source and destination and convert each incoming string value using a cast expression such as this one:
(DT_STR, 50, 1252)BookTitle_IN
Note: whenever I need to manipulate incoming columns to create a second version of the same column, I rename the incoming column to TheColumn_IN. The new version will be called TheColumn and preferably TheColumn is the name of the field in the destination table. This makes it easy to distinguish all columns later down the flow.
Here’s what the final version of the Derived Column looks like:
Next we’ll need to open the Destination and change the mapped fields to the new ones. Because my new columns are called exactly the same as the fields in the destination table, I can do that easily. In the Mappings page, all I need to do is right-click the grey background in between the two tables and click Select All Mappings, hit the Delete button, right-click again and click Map Items By Matching Names:
With the data flow finished, let’s give our package a run!
Ouch, our source is not happy! A closer examination of the Output pane brings us to the following error:
Error: 0xC02020A1 at Data Flow Task, Flat File Source : Data conversion failed. The data conversion for column “Description” returned status value 4 and status text “Text was truncated or one or more characters had no match in the target code page.”.
Oh right, so far we haven’t bothered looking at the actual length of the data that we’re importing. Actually, what is the length of our data flow columns?? Well, if you’ve been paying close attention you should have noticed the number 50 several times in the screenshots and expressions above. That’s indeed the default length for text columns when importing a flat file.
And if you scroll back up to the sample XML, you’ll notice that the content for the description is longer than 50 characters, thus causing our error! Let’s find out how to get that solved!
Fixing The Field Length Issue
The first step in getting this fixed is opening up the Advanced page in the Flat File Connection Manager editor.
Then select the Description field and change its OutputColumnWidth property from 50 to 500.
That will cause the source to generate a warning. Remove this warning by opening and closing the source editor. Click the Yes button in the popup that appears.
The next step is changing the expression for the Description field in the Derived Column to this:
Indeed, the field length is one of the parameters in that cast. The other numeric parameter is obviously the code page.
Having done that you’ll notice that the destination will start complaining. Of course, you’ll need to adapt the destination table to reflect the field length increase as well. So change the table definition and open/close the destination editor to make it happy.
Alright, let’s run the package once more!
Finally the data flow is happy with it all and has inserted two records:
And what does our table contain? Let’s find out:
That’s looking good for sure!
In this follow-up article I have demonstrated what might go wrong when you need to deal with special characters while importing flat files, and how to solve your possible issues. In case you missed the original article, have a look through this link.
- T-SQL Tuesday 42: Life, Change, Don’t Panic!
- Local Install of Books Online 2012
- Hidden Collections in SSRS
- How To Tweet About SQL Server Blog Posts
- Filtering Data Without Changing Dataset [SSRS]
- SSRS Deployment: Generate The Batch Script Through SQL!
- Connecting Shapes In Word
- Automating SSRS Deployment: Download
- SQLUG: Automating SSRS Deployments
- Shall We Join Or Shall We Split Now?
- May 2013 (2)
- March 2013 (3)
- February 2013 (2)
- January 2013 (2)
- December 2012 (2)
- November 2012 (3)
- October 2012 (2)
- August 2012 (2)
- July 2012 (2)
- June 2012 (2)
- May 2012 (2)
- April 2012 (3)
- March 2012 (4)
- February 2012 (4)
- January 2012 (2)
- December 2011 (2)
- November 2011 (2)
- October 2011 (1)
- September 2011 (3)
- August 2011 (2)
- June 2011 (2)
- May 2011 (3)
- April 2011 (3)
- March 2011 (3)
- February 2011 (2)
- January 2011 (5)
- December 2010 (1)
- November 2010 (3)
- October 2010 (3)
- September 2010 (2)
- August 2010 (4)
- July 2010 (2)
- June 2010 (4)
- May 2010 (6)
- April 2010 (3)
- March 2010 (3)
- February 2010 (11)
- January 2010 (9)
- December 2009 (2)
- November 2009 (3)
- October 2009 (3)
- September 2009 (4)
- August 2009 (6)
- July 2009 (2)
- June 2009 (3)
- May 2009 (7)
- April 2009 (3)
- March 2009 (3)
- February 2009 (5)
- January 2009 (4)
- December 2008 (2)
- November 2008 (3)
- October 2008 (1)
- September 2008 (1)
- August 2008 (4)
- July 2008 (3)