Integration Services

Once again I’ve been wasting some time because of a silly bug.  This time it was due to the OLE DB Source component and the way it works with parameters.  If you are in a situation where you know your query is working fine and yet no records are going down the data flow, here’s a possible solution!

Disclaimer: this issue exists up until SQL Server 2008 R2.  Read on for details!

Update: after being advised to do so by several people, including Jamie Thomson, I’ve filed a bug at MS Connect: SSIS OLE DB Source incorrectly returns zero records in combination with parameter and comment

The Situation

I had a Data Flow with an OLE DB Source that uses one parameter, for instance:

select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
--some really smart comment goes here
where Color = ?

I knew the query was working fine because when executed through SSMS and with the question mark replaced with ‘blue’, it would return 28 rows:

28 records in Management Studio

But when executed in BIDS, through either Execute Package or Execute Task, it would return zero records:

Zero records, zilch, nada, niente, none at all!

So I thought that, somehow, something must be going wrong with the package variable that gets passed into the source parameter.  I’m not going into detail on everything I tried in my attempt to get this working, but I can tell you that I started to get really irritated.  My colleague Koen Verbeeck (b|t) can confirm this because I called him over to my desk to help me think! (thanks btw!)

After some further tinkering with the data flow, we had our smart moment of the day and decided to launch SQL Server Profiler to see what BIDS was sending to the server!  I’m not sure if you’re aware of this but BIDS is doing some metadata-related stuff when preparing queries.  As far as I can tell, it also tries to determine the parameter type by running the following query:

 set fmtonly on select Color from  dbo.DimProduct
--some really smart comment goes here where 1=2 set fmtonly off

When creating this statement, it seems to use the whole FROM clause of the original query, including any trailing comments.  It combines that with a SELECT statement that contains the field that gets filtered and it appends " where 1=2 set fmtonly off".

But alas, it’s apparently not aware that lines can be commented out with a double dash, so part of its generated statement ends up commented out.  What it should have done is insert some CRLFs, especially in front of the WHERE clause.  But it didn’t.

So, as a result of that, FMTONLY remains on while the SELECT statement gets executed, resulting in zero records!

For those unfamiliar with the FMTONLY setting:

Returns only metadata to the client. Can be used to test the format of the response without actually running the query.
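
If you want to see the effect for yourself, here’s a quick test you can run in SSMS (using the same AdventureWorks DW table as above).  With FMTONLY switched on, the query returns the column metadata but not a single row:

set fmtonly on;
select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
where Color = 'blue';
set fmtonly off;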

And I can actually confirm what I’m stating here by changing the query to the following:

set fmtonly off;
select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
--some really smart comment goes here
where Color = ?

28 records down the pipe!

We've got data!

But this hack is a little too dirty to put in production.  So what else can we do?  Well, use block-style comments instead and we won’t face the issue!

select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
/* some even smarter comment goes here */
where Color = ?
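
Assuming BIDS builds the metadata statement the same way as before (that’s my deduction from the Profiler trace above, I didn’t trace this variant separately), it would now come out looking something like this.  Because the block comment is properly closed, the appended WHERE clause and SET FMTONLY OFF are no longer swallowed:

set fmtonly on select Color from dbo.DimProduct
/* some even smarter comment goes here */ where 1=2 set fmtonly off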

So, as I mentioned at the start of the post, this behavior can be reproduced using SSIS versions prior to 2012.  What about 2012 then?  Here’s the result of the Data Flow using the first query mentioned above:

SSIS 2012: we've got data, even with the "faulty" query!

Alright, that works better!  Now let’s use Profiler to check what’s going on here.  This is the first statement that gets executed:

exec [sys].sp_describe_undeclared_parameters N'select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
--some really smart comment goes here
where Color = @P1'

Further down, I also see this one:

exec [sys].sp_describe_first_result_set N'select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
--some really smart comment goes here
where Color = @P1',N'@P1 nvarchar(15)',1

It is using an entirely different approach, no longer using the FMTONLY setting!  Hang on, this rings a bell!  Look what the BOL page for SET FMTONLY (2012 version) specifies:

Do not use this feature. This feature has been replaced by sp_describe_first_result_set (Transact-SQL), sp_describe_undeclared_parameters (Transact-SQL), sys.dm_exec_describe_first_result_set (Transact-SQL), and sys.dm_exec_describe_first_result_set_for_object (Transact-SQL).

Cool stuff!
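
By the way, nothing stops you from calling these procedures yourself from SSMS.  A quick sketch using the query from the start of this post; it returns one row describing @P1, including the data type SQL Server suggests for it, and the double-dash comment doesn’t bother it at all:

exec sys.sp_describe_undeclared_parameters N'select ProductAlternateKey, EnglishProductName
from dbo.DimProduct
--some really smart comment goes here
where Color = @P1';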

Conclusion

If you’re not on SQL Server 2012 yet, be careful with comments in OLE DB Sources in the SSIS Data Flow!  Oh, and get SQL Server Profiler off its dusty shelf now and then!

Have fun!

Valentino.

The other day I needed a counter in my SSIS Control Flow.  Using the Foreach Loop container I was looping over a bunch of files and needed to count the number of files that got imported successfully.  The counter had to comply with the following requirements:

  • Easy to implement
  • Fast

Let’s test some different methods of getting this implemented.

Counting Possibilities

Two of the methods are possible as of SSIS 2005 while the third one is new since SSIS 2012. So yeah, the screenshots are taken using SQL Server 2012 RTM.

Setting The Scene

We’ll start from a new package and create a package variable of type Int32.  This variable will keep track of the count.

To be able to performance test the different possibilities, I’m using a For Loop container.  This loop is using another package variable called loop, type Int32, and performs the loop 10,000 times.  In case you don’t know how to set up such a loop, check out this screenshot:

Using a For Loop container to test task performance
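
In case the screenshot isn’t clear, these are the expressions I’ve set on the For Loop container (property names as they appear in the editor, the exact values are just my choice):

  • InitExpression: @loop = 0
  • EvalExpression: @loop < 10000
  • AssignExpression: @loop = @loop + 1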

So my package contains the following two variables:

Two package variables of type Int32

The cnt variable is the one to actually hold the count.

To test performance I’ll be using the command-line dtexec utility so that the debugging overhead does not mess up the numbers.   I’ve also executed each method at least three times to ensure a certain result was not “by accident”.

Using a Script Task

The most logical component that came to mind was a Script Task.

Using a couple of lines of code, C# in this case, the counter value can be incremented by one:

// Read the current counter value from the package variable.
int cnt = (int)Dts.Variables["cnt"].Value;
// Increment it and write it back to the variable.
Dts.Variables["cnt"].Value = ++cnt;

The above code assumes that the cnt variable has been put in the ReadWriteVariables property:

The User::cnt variable is in the ReadWriteVariables for the script to use

Here’s what that looks like in the Control Flow:

For Loop container with a Script Task

So how fast does this execute?

Looping 10,000 times over the Script Task takes 8 seconds

About eight seconds, that’s acceptable.

However, what I don’t like about this method is how long it takes to implement.  I mean, it takes more than just a couple of seconds, right?  And then there’s the fact that you need custom .NET code to perform a task as trivial as adding one to a number.  Using .NET code is good for complex situations, when there’s no other option.  But the risk of a bug is always larger: imagine I wrote cnt++ instead of ++cnt, what do you think the result would be? (Hint: the counter would never get past zero, because the post-increment writes the old value back to the variable.)

On to another option then!

Using an Execute SQL Task

Instead of resorting to .NET code, increasing a number by one is easy to achieve using a simple T-SQL statement, right?  So I thought, let’s try an Execute SQL Task!

Here’s the query:

select ? + 1 as cnt

What does the task look like?

Using Execute SQL Task to increase a package variable

ResultSet has been set to Single row.

The Parameter Mapping page has got the User::cnt variable specified:

Specifying the User::cnt package variable in the Parameter Mapping page
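
(In case the screenshot isn’t readable: with an OLE DB connection the Parameter Name is the zero-based ordinal of the question mark, so 0 here, and I’ve used LONG as Data Type to match the Int32 variable.  Those are my settings, at least.)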

And the Result Set page has got the single value from the single row mapped to the User::cnt package variable:

Mapping the cnt result to the User::cnt variable

What do you say, easier to implement than the Script method?  I do think so!

This method has one limitation though: it needs a Connection Manager pointing to a SQL Server database.  In most ETL packages you’ll probably have one of those present already.  What I was a bit worried about is the overhead of connecting to the server: how much will it be?

Let’s find out!

The Execute SQL Task needs almost a minute to increase the counter

That’s right, using the Execute SQL Task to increment the variable 10,000 times takes about a minute.  On my laptop.  Connecting to the tempdb on my local instance.  When testing this over a network, it even resulted in timings over four minutes.  So this solution is really unacceptable in my opinion.

However, we can give it one more try.  A Connection Manager has got a property called RetainSameConnection.  By default this is set to False which means that our test above has opened and closed ten thousand connections to my local tempdb.  Oh my, of course that takes a while!

Setting it to True gives us the following result:

Setting RetainSameConnection to True speeds up the process by three

About twenty seconds, which is about one third of the previous result.  That surely is better.  And what’s perhaps even more interesting is that a similar result is achieved when connecting to a database over the network: from over four minutes down to twenty seconds.  So yeah, this would work for me.

Sidenote: for other interesting use of the RetainSameConnection property, check out this post regarding transactions by Matt Masson.

But we’re not stopping just yet.  As of SQL Server 2012, we’ve got a third possibility!

Using an Expression Task (SQL2012!)

Instead of resorting to custom .NET code or (ab)using the Execute SQL Task, in SSIS 2012 we’ve got a new task: the Expression Task.

The new Expression Task in a For Loop container

The Expression Task builds and evaluates SSIS expressions, nifty!

As you can read in the SSIS Toolbox, the Expression Task builds and evaluates SSIS Expressions that set variable values at runtime.  That surely sounds exactly like what we’d need, doesn’t it?

So how does that work?

Using the Expression Task to increase the counter

Really easy to set up: we just specify that the User::cnt variable should be set to itself plus one.  Put inside a For Loop container, we’ve got a counter!
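
For the record, the expression itself is nothing more than this (assuming your counter variable is called cnt, like mine):

@[User::cnt] = @[User::cnt] + 1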

But how does it perform?

The Expression Task is the easiest and the fastest method, mission accomplished!

About seven seconds, which is even slightly faster than the Script Task method!

With that we’ve found the method that complies with both requirements: easy to implement and performs fast!  Now how am I going to convince my clients to upgrade to 2012, hmm, need to think a bit…

Conclusion

We found out that the new Expression Task is a very useful task for its purpose.  In our case we used it to create a counter.

If you’re not on SQL Server 2012 yet, better stick to either Script Task or Execute SQL Task with RetainSameConnection set to True.

Have fun!

Valentino.

In case you’ve read my article on using SSIS and XSLT to get XML imported into the database, you know that I cheated a little by first manually removing the namespaces from the XML document.

Well, that obviously doesn’t work smoothly when the process needs to get automated.

So here’s a method to use XSLT to remove the namespaces for you.

Removing The Namespaces

Using an XML Task, as explained in my article, you can apply the namespace-removing XSLT in an additional step prior to the XML Task that performs the XSLT-to-CSV conversion.  As output destination, you could set up a package variable that accepts the “XML without namespaces”, or you can write to a file.  Up to you to decide.

Here’s the XSLT that will remove namespaces from the XML:

<!-- remove namespaces -->
<xsl:stylesheet xmlns:xsl ="http://www.w3.org/1999/XSL/Transform" version ="1.0" >
  <xsl:template match ="@*" >
    <xsl:attribute name ="{local-name()}" >
      <xsl:value-of select ="." />
    </xsl:attribute>
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match ="*" >
    <xsl:element name ="{local-name()}" >
      <xsl:apply-templates select ="@* | node()" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

(Ref. http://blogs.msdn.com/b/kaevans/archive/2003/06/13/8679.aspx)

Removing namespaces is one thing, but you’re losing some information.  What if you’d like to keep the namespaces as part of the node name?

Replacing The Namespaces

Well, that’s possible too!  Using the XSLT below, namespaces are kept but the colons separating the namespace prefixes from the node names are replaced with underscores.  The translate() function is used to achieve this:

<!-- replace namespaces -->
<xsl:stylesheet xmlns:xsl ="http://www.w3.org/1999/XSL/Transform" version ="1.0" >
  <xsl:template match ="@*" >
    <xsl:attribute name ="{local-name()}" >
      <xsl:value-of select ="." />
    </xsl:attribute>
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match ="*" >
    <!--keep namespace prefix as first part of node name (replaced colon with underscore) -->
    <xsl:element name ="{translate(name(), ':', '_')}" >
      <xsl:apply-templates select ="@* | node()" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
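
To illustrate the difference with a made-up node: given <ns0:title xmlns:ns0="http://example.org/books">Some Title</ns0:title> as input, the first stylesheet produces <title>Some Title</title> while this second one produces <ns0_title>Some Title</ns0_title>.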

Have fun!

Valentino.

This is a follow-up to my article on Loading Complex XML Using SSIS and XSLT.  In that article I demonstrated how you can convert complex XML into simple CSV using XSLT in SSIS.

The resulting DTSX package and input files can be downloaded from my SkyDrive through this link.

Dealing With Special Characters

If you’ve followed the instructions in my article mentioned above and you need to deal with special characters such as the é and è encountered in the French language, you probably noticed that it wouldn’t really work as expected.  In fact, in your final result you may have ended up with the special characters being replaced with other, even more special, characters.  Obviously not good.

Here’s an explanation on the reason why that happens, and also how to deal with it.

Setting The Scene

Imagine the following sample XML, representing a really huge book collection:

<books>
    <book>
        <title>The Hitchhiker's Guide to the Galaxy</title>
        <author>Douglas Adams</author>
        <language>EN</language>
        <description>The Hitchhiker's Guide to the Galaxy is a science fiction comedy series created by Douglas Adams.</description>
    </book>
    <book>
        <title>Le Trône de fer</title>
        <author>George R.R. Martin</author>
        <language>FR</language>
        <description>Le Trône de fer est une série de romans de fantasy de George R. R. Martin, dont l'écriture et la parution sont en cours. Martin a commencé à l'écrire en 1991 et le premier volume est paru en 1996. Prévue à l'origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus.</description>
    </book>
</books>

As you can see, the second book in the list is the French version of the first book in the A Song of Ice and Fire series by George R.R. Martin and as it goes with French, there are some accents in the description of the book.

We’ll use the following XSLT to convert it to CSV:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
  <xsl:template match="/">
    <xsl:text>BookTitle;Author;Language;Description</xsl:text>
    <xsl:text>&#13;&#10;</xsl:text>

    <xsl:for-each select="books/book">
      <xsl:text>"</xsl:text>
      <xsl:value-of select="title"/>
      <xsl:text>";"</xsl:text>
      <xsl:value-of select="author"/>
      <xsl:text>";"</xsl:text>
      <xsl:value-of select="language"/>
      <xsl:text>";"</xsl:text>
      <xsl:value-of select="description"/>
      <xsl:text>"</xsl:text>
      <xsl:text>&#13;&#10;</xsl:text>
    </xsl:for-each>

  </xsl:template>
</xsl:stylesheet>

Using an XML Task in the Control Flow, as explained in my article, we’d get the following output:

BookTitle;Author;Language;Description
"The Hitchhiker's Guide to the Galaxy";"Douglas Adams";"EN";"The Hitchhiker's Guide to the Galaxy is a science fiction comedy series created by Douglas Adams."
"Le Trône de fer";"George R.R. Martin";"FR";"Le Trône de fer est une série de romans de fantasy de George R. R. Martin, dont l'écriture et la parution sont en cours. Martin a commencé à l'écrire en 1991 et le premier volume est paru en 1996. Prévue à l'origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus."

So far so good, all accents are still present!

Then we’d import the file using a Flat File Source component in a Data Flow Task.  Here’s what the General page of the Flat File Connection Manager would look like:

Flat File Connection Manager: General

We’ve set double-quote as Text Qualifier and checked the Column names in the first data row checkbox.

Switching to the Columns page we’d get the following:

Flat File Connection Manager: Columns - the Preview has messed up the accents!

Hang on, that’s not right!  The Preview is not displaying our accents as expected!  Oh my, what’s going on here? Let’s call the code page detectives!

A Mismatch Investigation

Take a good look at the XSLT which we’ve used to convert the XML into CSV, especially the xsl:output line:

<xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>

That line specifies that the text output should be encoded using the UTF-8 code page.

Now take a good look at the General page in the screenshot earlier, more precisely this part:

Code page: 1252 (ANSI - Latin I) is not what we need right now!

Indeed, code page 1252 (ANSI – Latin I).  While the input is UTF-8.  Of course that results in a mismatch of certain characters, as demonstrated here.  The fix is fairly easy, just change the Code page setting to 65001 (UTF-8).

Code page: 65001 (UTF-8) - much better!

If we now switch back to the Columns page we should come to the following result:

Flat File Connection Manager: Columns page preview with accents!

Ah, sure looks better doesn’t it?  All accents are present as expected.

But in case you thought that’s it, I’d advise you to think again.  Don’t worry, I’ll demonstrate what I mean.  Let’s do that by setting up a simple Data Flow.

Setting Up The Data Flow

Throw in a Flat File Source and specify our Flat File Connection Manager.  I also prefer to keep NULLs as they come in, using the Retain null values from the source as null values in the data flow checkbox.

Flat File Source: Connection Manager

If you click the Preview button you should get similar output as shown one screenshot earlier.

Now hook this up to an OLE DB Destination that writes the incoming data into a table in your favorite database:

OLE DB Destination is not happy :(

As you can see, our destination is not entirely happy with all this.  Here are the details of one of the error messages:

Validation error. Data Flow Task: Data Flow Task: The column “BookTitle” cannot be processed because more than one code page (65001 and 1252) are specified for it.

Looks like once more we’ve got a code page conflict.  And we sure do. Clicking the Data Flow connector between the Flat File source and OLE DB destination shows us the following:

Data Flow Path Editor shows that our strings are encoded using the 65001 code page.

Each of our incoming string values is encoded using the 65001 (UTF-8) code page.  But our database was created using the Latin1_General_CI_AS collation.  So we’ve indeed got a code page conflict!
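
By the way, if you’re wondering which code page belongs to a particular collation, you can simply ask SQL Server.  A quick check for the collation mentioned above:

select collationproperty('Latin1_General_CI_AS', 'CodePage') as CodePage;
--returns 1252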

Fear not, that’s easily remedied.  Add a Derived Column transformation in between the source and destination and convert each incoming string value using a cast expression such as this one:

(DT_STR, 50, 1252)BookTitle_IN

Note: whenever I need to manipulate incoming columns to create a second version of the same column, I rename the incoming column to TheColumn_IN.  The new version will be called TheColumn and preferably TheColumn is the name of the field in the destination table.  This makes it easy to distinguish all columns later down the flow.

Here’s what the final version of the Derived Column looks like:

Using the Derived Column transformation to cast the incoming strings into the correct code page.

Next we’ll need to open the Destination and change the mapped fields to the new ones.  Because my new columns are called exactly the same as the fields in the destination table, I can do that easily.  In the Mappings page, all I need to do is right-click the grey background in between the two tables and click Select All Mappings, hit the Delete button, right-click again and click Map Items By Matching Names:

Using Map Items By Matching Names, easy!

With the data flow finished, let’s give our package a run!

Flat File Source has got a length issue!

Ouch, our source is not happy!  A closer examination of the Output pane brings us to the following error:

Error: 0xC02020A1 at Data Flow Task, Flat File Source [16]: Data conversion failed. The data conversion for column “Description” returned status value 4 and status text “Text was truncated or one or more characters had no match in the target code page.”.

Oh right, so far we haven’t bothered looking at the actual length of the data that we’re importing.  Actually, what is the length of our data flow columns??  Well, if you’ve been paying close attention you should have noticed the number 50 several times in the screenshots and expressions above.  That’s indeed the default length for text columns when importing a flat file.

And if you scroll back up to the sample XML, you’ll notice that the content for the description is longer than 50 characters, thus causing our error!  Let’s find out how to get that solved!

Fixing The Field Length Issue

The first step in getting this fixed is opening up the Advanced page in the Flat File Connection Manager editor.

Flat File Connection Manager: using the Advanced page to change field length.

Then select the Description field and change its OutputColumnWidth property from 50 to 500.

That will cause the source to generate a warning.  Remove this warning by opening and closing the source editor.  Click the Yes button in the popup that appears.

The next step is changing the expression for the Description field in the Derived Column to this:

(DT_STR,500,1252)Description_IN

Indeed, the field length is one of the parameters in that cast.  The other numeric parameter is obviously the code page.

Having done that you’ll notice that the destination will start complaining.  Of course, you’ll need to adapt the destination table to reflect the field length increase as well.  So change the table definition and open/close the destination editor to make it happy.
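
In T-SQL that change is a one-liner; assuming a hypothetical destination table called dbo.Book (yours will be named differently), something along these lines:

alter table dbo.Book alter column Description varchar(500);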

Alright, let’s run the package once more!

Finally the data flow is happy with it all and has inserted two records:

That's more like it: all components colored green!

And what does our table contain?  Let’s find out:

All accents have been imported!

That’s looking good for sure!

Conclusion

In this follow-up article I have demonstrated what might go wrong when you need to deal with special characters while importing flat files, and how to solve your possible issues.  In case you missed the original article, have a look through this link.

Have fun!

Valentino.

References

Wikipedia: UTF-8

Do you like the Custom Code functionality in SSRS?

And what would you think if SSIS offered the same possibility?  Imagine, being able to write a custom function in .NET and then use it in any expression in your package, how powerful that would be!

There’s already one function I would have written today: GetFilename(string path).
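
To give you an idea of why: extracting just the file name from a full path currently takes an expression along these lines (User::FilePath being a made-up string variable that holds the full path):

RIGHT(@[User::FilePath], FINDSTRING(REVERSE(@[User::FilePath]), "\\", 1) - 1)

A simple GetFilename() call would be so much friendlier.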

If you believe such functionality to be useful, please vote on the following Connect request: Add user defined function support to the SSIS expression language

Have fun!

Valentino.

