Integration Services


SQL Server Integration Services can read XML files, as every BI developer knows.  (If you didn’t, don’t worry, I’m aiming this article at newcomers as well.)

But how far can you go?  When does the XML Source component become unusable?  Let’s find out!

To create the examples I’m using the following SQL Server version:

Microsoft SQL Server 2008 R2 (RTM) – 10.50.1600.1 (X64)   Apr  2 2010 15:48:46   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

Basic Example

This first example is a really simple XML file containing a list of colors with their corresponding RGB code.

<colors>
  <color RGB="FF0000">Red</color>
  <color RGB="00FF00">Green</color>
  <color RGB="0000FF">Blue</color>
  <color RGB="FFFFFF">White</color>
  <color RGB="000000">Black</color>
</colors>

Let’s import this into a database.  Open up the BIDS, create an SSIS project and throw a Data Flow Task into the package and open it up.

The component that we’re now most interested in is the XML Source, one of the components in the Data Flow Sources category in the Toolbox.

The XML Source component

Add one of those to your Data Flow and double-click it to open up the XML Source Editor.

The Data Access Mode should be set to XML file location, which is the default setting.  The other options are XML file from variable – useful if you’ve got the file path and name of the XML file in a variable – and XML data from variable – interesting if your XML data is actually stored in a variable.

As XML Location, select the .xml file.  Our XML sample does not have an inline schema, so we can’t use that checkbox.  And we can’t click the OK button either, it’s grayed out.  The source component really expects a description of the XML structure before the editor can be closed.

The bottom of the screen even shows a warning with the following message:

XML Schema (XSD) is not specified. Select an existing XSD or click Generate XSD to create an XSD from the XML file.

So, what are you waiting for?  Click the Generate XSD button to let the XML Source Editor generate the XSD schema for us.  Real easy, right?

Remember where you save the file, and when it’s generated, select the .xsd file in the XSD location textbox.  As you can see, the OK button will become available.  But don’t click it just yet.

Here’s what the XML Source Editor now looks like:

XML Source Editor with an XML and XSD file specified

Let’s now move on to the second page of the XML Source Editor, called Columns.  When you open it, you’ll receive the following popup with a couple of warnings:

Warning gets displayed when opening the Columns page

The editor is letting us know that the columns that are being generated do not have a maximum length specified.  So it’s setting them to Unicode (DT_WSTR) with a length of 255.  Click the OK button to get rid of that message and to be able to see the generated columns.

Note: if your data elements or attributes may contain longer strings, you should have a look at modifying the length specification.  This can be done through the Advanced Editor, which is opened by right-clicking the XML Source and selecting Show Advanced Editor.  The Input and Output Properties page is the one you’re after.

The Columns page, showing the columns that the XML Source generated for us

As you can see, our only attribute – RGB – is nicely put in a column with the same name.  The value of each <color> node however is not put in a column called Color.  By default, this value is put into a column called “text”, which is a weird name for a column in an SSIS data flow if you ask me.  The good thing is that you can just rename it by changing the Output Column value.

Let’s test this out.  My favorite way is to add a Multicast component to the Data Flow, then add a Data Viewer on the connector (right-click the green arrow, select Data Viewers, click Add > OK > OK).  Now execute the package to get this result:

Testing the output of the XML Source through the Data Viewer

Mission accomplished, we’ve retrieved data from a very basic XML file!
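
As a little side note: if you ever only need to shred such a simple file and don’t want a data flow for it, the XML data type in T-SQL can do the same job.  This is purely a comparison sketch, not part of the SSIS package; I’ve simply pasted a shortened copy of the colors file into a variable:

--comparison only: shredding the same flat structure in pure T-SQL
declare @colors xml = N'<colors>
  <color RGB="FF0000">Red</color>
  <color RGB="00FF00">Green</color>
  <color RGB="0000FF">Blue</color>
</colors>';

select C.value('@RGB', 'char(6)') as RGB,
    C.value('.', 'nvarchar(255)') as Color
from @colors.nodes('/colors/color') as T(C);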

Adding Some Complexity

Let’s move on to the second example of this article.  The difference with the previous example is that now we’ve got multiple nested structures to deal with.

The example represents a list of book reviews, including some details on the books themselves.  A book can have multiple writers and obviously multiple reviews as well.

<books>
    <book pages="300">
        <title>Microsoft SQL Server 2008 R2 Master Data Services</title>
        <category>Information Technology</category>
        <authors>
            <author>Jeremy Kashel</author>
            <author>Tim Kent</author>
            <author>Martyn Bullerwell</author>
        </authors>
        <reviews>
            <review>If you're looking for an excellent book on the new Master Data Services component of SQL Server 2008 R2, definitely check this one out!  To be released in June 2011 by Packt Publishing!</review>
        </reviews>
    </book>
    <book pages="832">
        <title>Inside Microsoft SQL Server 2008: T-SQL Querying</title>
        <category>Information Technology</category>
        <authors>
      <author>Itzik Ben-gan</author>
      <author>Lubor Kollar</author>
      <author>Dejan Sarka</author>
      <author>Steve Kass</author>
        </authors>
        <reviews>
            <review>Every "Inside SQL Server" book can be recommended, especially when written by Itzik!</review>
        </reviews>
    </book>
    <book pages="1137">
        <title>The Lord of the Rings</title>
        <category>Fantasy</category>
        <authors>
            <author>J.R.R. Tolkien</author>
        </authors>
        <reviews>
            <review>Like fantasy?  What are you waiting for then?  It's a classic!</review>
            <review>If you liked the movie, you'll love the book.</review>
        </reviews>
    </book>
</books>

Configure an XML Source so that it uses the books.xml file, generate the XSD and specify its location.  I’m not going into details on that, the procedure is the same as in our first example above.

Now open up the Columns page to have a closer look at how the XML data is going to get imported.

The XML Source generates multiple outputs

So how does the XML Source component deal with the multiple nested structures?  It generates multiple outputs!  If you select another output from that dropdown, you get to see its fields.

To get a clear understanding of what exactly is going on, let’s connect each output with an OLE DB Destination component.  The target table can be generated based on the incoming fields by clicking the New button.  Replace the table name in the generated CREATE TABLE script with a clear one that fulfills your naming convention requirements – such as NO SPACES IN A TABLE NAME for instance – and hit the OK button.

Destination table can be generated by using the New button in the OLE DB Destination Editor

Now that the table is created, it will be automatically selected in the Name of the table or the view dropdown.  Don’t forget to visit the Mappings page so that the, well, mappings are created.  If no field names were modified in the CREATE TABLE script then all fields should be mapped automatically based on their names.
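
For reference, here’s roughly what my five cleaned-up tables look like.  Consider this a sketch: the exact column names and types depend on the XSD that was generated, the Id columns are added by the XML Source itself, and I’ve renamed the default “text” columns to author and review.

--sketch only: your generated XSD may produce slightly different columns/types
CREATE TABLE dbo.XML_book (book_Id int, pages int, title nvarchar(255), category nvarchar(255));
CREATE TABLE dbo.XML_authors (authors_Id int, book_Id int);
CREATE TABLE dbo.XML_author (author nvarchar(255), authors_Id int);
CREATE TABLE dbo.XML_reviews (reviews_Id int, book_Id int);
CREATE TABLE dbo.XML_review (review nvarchar(255), reviews_Id int);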

With all five destinations added, execute the package.

Each XML Source output is connected to an OLE DB Destination - executes fine

So now we’ve loaded the data from our XML file into a database, but the data is spread over five tables.  How do we retrieve that data?  Join them together!

select * from XML_book
inner join XML_authors on XML_authors.book_Id = XML_book.book_Id
inner join XML_author on XML_author.authors_Id = XML_authors.authors_Id
inner join XML_reviews on XML_reviews.book_Id = XML_book.book_Id
inner join XML_review on XML_review.reviews_Id = XML_reviews.reviews_Id;

And the result looks like this:

XML data imported into the SQL Server database

Conclusion

We have managed to flatten the data from an XML file containing multiple nested repeating nodes, nice huh?  But do you also feel the limitation of this method?  This was still a fairly simple XML file, and yet we already needed five tables to store the data.  Can you imagine what that would give with a really complex file?

Watch this blog for the follow-up article where I will try to import data from a really complex XML file!

Have fun!

Valentino.

References

XML Source component


With the holidays I haven’t been able to write much.  So I’ll make up for it with this 3000+ word article.  If you’re reading this early in the morning, you’d better get a double espresso first ;-)

In this article I will demonstrate a method that can be used to calculate aggregations over a certain period of time in the past, or LastXMonths aggregations as I’m calling them throughout the article.  I’ll be using T-SQL, SQL Server Integration Services and a relational database as source.  More specifically I will be using the Merge Join data transformation in SSIS, and Common Table Expressions in T-SQL.

Version-wise I’m using SQL Server 2008 R2, but this method should work as of SQL Server 2005.  Furthermore I’m using the Contoso DWH, available for download at the Microsoft Download Center.  (In case you’re wondering, it’s the .BAK file.)

You can download the finished SSIS package from my Skydrive.  (The file is called MergeJoin.dtsx.)

The Scenario

Let’s say we’ve got a relational database containing some sales figures.  Management has asked for sales-related data to be available somewhere for easy analysis.  Ideally a cube would be built for that purpose but as budgets are currently tight, a temporary solution needs to be provided meanwhile.  So it’s been decided that an additional table will be created, populated with the exact data as required by management.  This table should contain all details (number of items and amount of the sale) about products sold, grouped by the date of the sale, the zip code of the place where the sale occurred and the category of the product.

Furthermore, each record should contain the sum of all sales of the last month for the zip code and product category of each particular record.  Two additional aggregations should calculate the sales for the last three months and last six months.
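
To make that concrete, here’s a sketch of what the result table could look like.  The column names are my own assumption, based on the fields we’ll be producing further down; the actual table gets generated through the OLE DB Destination near the end of this article.

--hypothetical result table, the real one gets generated by the OLE DB Destination later on
CREATE TABLE dbo.SalesAggregated
(
    Datekey datetime,
    ZipCode nvarchar(255),
    ProductCategoryName nvarchar(255),
    SalesAmount_SUM money,
    SalesQuantity_SUM int,
    Last1MSalesAmount money,
    Last1MSalesQuantity int,
    Last3MSalesAmount money,
    Last3MSalesQuantity int,
    Last6MSalesAmount money,
    Last6MSalesQuantity int
);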

A Simple Example

To make sure we’re all on the same track on the requirements, here’s a small example to illustrate the expected outcome.

Small example displaying the expected outcome of the process

I’ve omitted the SalesAmount numbers for readability reasons.  The records are ordered chronologically, with the oldest first.  As you can see, the bottom record shows 16 as value for Last6MSalesQuantity.  This is the sum of the SalesQuantity of that record itself and the SalesQuantity of the previous record, which happens to fall within the six-month timespan counting back from the bottom record’s SaleDate.  The two other records do not fall within that six-month timespan and are thus not included in the Last6MSalesQuantity of the bottom record.

Fetching The Data Into A Table

Our scenario requires that the sales figures are calculated and put into a new table.  Let’s first start with creating the queries to fetch the data.

Step 1: The Daily Numbers

The easiest part is the daily sales numbers.  These can be retrieved fairly easily from the Contoso data warehouse, just by using a GROUP BY clause as shown in the following query.

--daily sales
select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
    SUM(FS.SalesAmount) SalesAmount_SUM,
    SUM(FS.SalesQuantity) SalesQuantity_SUM
from dbo.FactSales FS
    inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
    inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
    inner join dbo.DimProductSubcategory DPS
        on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
    inner join dbo.DimProductCategory DPC
        on DPC.ProductCategoryKey = DPS.ProductCategoryKey
    inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
order by DD.Datekey asc, DS.ZipCode asc, DPC.ProductCategoryName asc;

Part of the result of that query looks like this:

Result of the daily sales query

Nothing special to mention so far so let’s continue to the next step.

Step 2: The Monthly Numbers

In this step, we’ll use the query from step 1 as base for the full query.  I’ll first show you the query and then provide you with some explanation of what’s going on.

--LastMonth
declare @numberOfMonths tinyint = 1;
with DailySalesData as
(
    select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
        SUM(FS.SalesAmount) SalesAmount_SUM,
        SUM(FS.SalesQuantity) SalesQuantity_SUM
    from dbo.FactSales FS
        inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
        inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
        inner join dbo.DimProductSubcategory DPS
            on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
        inner join dbo.DimProductCategory DPC
            on DPC.ProductCategoryKey = DPS.ProductCategoryKey
        inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
    group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
),
UniqueRecordsPerDay as
(
    select Datekey, ZipCode, ProductCategoryName
    from DailySalesData
    group by Datekey, ZipCode, ProductCategoryName
)
select UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName,
    SUM(DSD.SalesAmount_SUM) SalesAmount_SUM,
    SUM(DSD.SalesQuantity_SUM) SalesQuantity_SUM
from DailySalesData DSD
    inner join UniqueRecordsPerDay UR
            on UR.ProductCategoryName = DSD.ProductCategoryName
        and UR.ZipCode = DSD.ZipCode
        and DSD.Datekey
            between DATEADD(month, -@numberOfMonths, UR.Datekey + 1)
            and UR.Datekey
group by UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName
order by UR.Datekey asc, DSD.ZipCode asc, DSD.ProductCategoryName asc;

The query uses a variable called @numberOfMonths.  This will allow us to use the same query for the totals of last month, as well as for the Last3M and the Last6M numbers.  All that’s needed is changing the variable to 3 or 6.

But how does the query get to the results?  To start, it uses two CTEs (Common Table Expressions).  The first one is called DailySalesData.  And the query for that CTE should look familiar to you by now: it’s the one from step 1, without the ORDER BY clause.

The second CTE is called UniqueRecordsPerDay and gives us one record for each unique date, zip code and product category as found in the Contoso data.  The DateKey, ZipCode and ProductCategoryName fields are our key grouping fields.  And this CTE is actually the key to calculating the monthly aggregated data, as I’ll explain next.

What the main query does is the following.  It selects the data from the DailySalesData CTE and joins that with the unique records per day recordset.  All grouping key fields need to be included in the join.  However, as you can see, to add the DateKey into the join I’m not just using the equals operator but the BETWEEN keyword instead.  I’ve also used the DATEADD function to subtract the number of months as specified through the @numberOfMonths variable.  That statement is saying: “give me all records starting from DateKey, going back @numberOfMonths”.  The query again groups by the key fields to be able to sum the records up.

This construction ensures that the SalesAmount_SUM and SalesQuantity_SUM fields represent the sum for the record’s zip code and product category and for the period as indicated by the @numberOfMonths variable.
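
To see exactly what that BETWEEN range covers, here’s a tiny standalone illustration (the date is just an arbitrary example):

--for DateKey 2009-06-30 and one month back, the range runs
--from 2009-06-01 up to and including 2009-06-30
declare @numberOfMonths tinyint = 1;
declare @DateKey datetime = '20090630';
select DATEADD(month, -@numberOfMonths, @DateKey + 1) as RangeStart,
    @DateKey as RangeEnd;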

Step 3: Merging It All Together Into One Table

Now that we know how to retrieve the data, we still need to get it into a table.  One option would be to use the INSERT statement on the daily records, followed by UPDATE statements to populate the monthly (1, 3, 6) aggregated columns.  However, I’m a BI guy so let’s use an SSIS package to get to the result (plus it allows me to illustrate the Merge Join data flow transformation :-) ).
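
For completeness, here’s a rough sketch of what that INSERT-then-UPDATE alternative could look like for the one-month columns, reusing the hypothetical dbo.SalesAggregated table sketched earlier (the 3- and 6-month columns would follow the same pattern).  It’s only there to show the idea; the rest of the article builds the SSIS version.

--sketch: after inserting the daily rows, populate the Last1M columns with an UPDATE
update SA
set SA.Last1MSalesQuantity = M.SalesQuantity_SUM,
    SA.Last1MSalesAmount = M.SalesAmount_SUM
from dbo.SalesAggregated SA
    inner join
    (
        select CUR.Datekey, CUR.ZipCode, CUR.ProductCategoryName,
            SUM(HIST.SalesQuantity_SUM) SalesQuantity_SUM,
            SUM(HIST.SalesAmount_SUM) SalesAmount_SUM
        from dbo.SalesAggregated CUR
            inner join dbo.SalesAggregated HIST
                on HIST.ZipCode = CUR.ZipCode
                and HIST.ProductCategoryName = CUR.ProductCategoryName
                and HIST.Datekey
                    between DATEADD(month, -1, CUR.Datekey + 1) and CUR.Datekey
        group by CUR.Datekey, CUR.ZipCode, CUR.ProductCategoryName
    ) M on M.Datekey = SA.Datekey
        and M.ZipCode = SA.ZipCode
        and M.ProductCategoryName = SA.ProductCategoryName;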

So open up the BIDS and create a new package.  Drop a Data Flow Task into the Control Flow and add a Connection Manager connecting to your Contoso DWH.  Then switch to the Data Flow page.

Nothing special so far I believe.  Next we need to set up four Data Flow Sources: one for the daily figures, one for the monthly, one for the 3M and one for the 6M data.

Setting Up The Data Sources

Throw in an OLE DB Source component, configure it to use your connection manager and copy/paste the first query above into the command textbox.  Again nothing special, right?

However, the Merge Join component expects its incoming data to be sorted.  That’s why I’ve included the ORDER BY clause in the queries above.  But that’s not all.  Connecting our data source to a Merge Join transformation without any additional change will result in an error such as the following:

Validation error. Data Flow Task Merge Join [457]: The input is not sorted. The “input “Merge Join Left Input” (458)” must be sorted.

To avoid this error, we need to explicitly inform our data flow that the data is actually ordered, and we need to give it all the details: on what fields has the data been ordered and in what order!  And that needs to be done through the Advanced Editor.

So, right-click the OLE DB Source and select Show Advanced Editor.

Right-click OLE DB Source to open up the Advanced Editor

In the Advanced Editor, navigate to the last tab called Input and Output Properties and select the “OLE DB Source Output” node in the tree structure on the left.  Doing that will show the properties for the selected output and one of those properties is called IsSorted.  By default it is set to False.  Set it to True.

Tip: double-clicking the label of the property will toggle its value.  This is mostly useful when you need to change several options, but even here it saves a couple of clicks.  It’s all about optimization. :-)

Advanced Editor on OLE DB Source: the IsSorted property

At this moment the component knows that the incoming data is sorted, but it still doesn’t know on what fields.  To specify that, open up the OLE DB Source Output node, followed by the Output Columns node.  You’ll now see the list of fields.  As specified in the query, the data is ordered firstly on DateKey, secondly on ZipCode and thirdly on ProductCategoryName.

Select DateKey to see its properties.

Advanced Editor of OLE DB Source showing the SortKeyPosition property

The property in which we’re interested here is called SortKeyPosition.  By default it is set to zero.  When the incoming data is sorted,  this property should reflect in what order the data is sorted, starting with one for the first field.  So in our case here the value should be set to 1.

Set the SortKeyPosition property for ZipCode to 2 and for ProductCategoryName to 3.

That’s one of the four OLE DB sources set up.  The other three will be easier as we can start from the first one.  So, copy and paste the source component, open it up by double-clicking it and replace the query with our second query from earlier, the one returning the monthly figures.  Oh, and give it a decent name, but I’m sure you knew that.

Create the third source component in the same way, but change the value for the @numberOfMonths variable to 3.  And again the same process for source number four, changing the variable’s value to 6.

Here’s what we have so far:

Four OLE DB sources set up - waiting to be merged

Merging The Sources Into One Flow

Next up is merging the incoming flows.  Drag a Merge Join data flow transformation under the Daily Sales source and connect the source to the Merge Join.  That will open the following Input Output Selection screen.

Input Output Selection window

A Merge Join expects two inputs: one is called the Left Input and the other is called the Right Input.  Select Merge Join Left Input as value for the Input dropdown.

Close the popup window and connect the second source (with the monthly data) as well to the Merge Join.  There’s only one input remaining so this one is automatically the right input – no popup window is shown.

Next we need to configure the Merge Join so that it merges the data as expected.  Open the Merge Join Transformation Editor by double-clicking the component.

Merge Join Transformation Editor

By default the Join type dropdown is set to Inner join.  In our situation that’s good enough: there’s always at least one record in each incoming flow to be combined with the other.  Even when only one record exists for a certain zip code and product category on a given day, the monthly data for that record is simply the sum of just that one record.

As you can see, because both incoming flows are ordered in the same way, it automatically knows on which fields to put the join.

By default, no output fields are created as the white bottom half of the screenshot indicates.

Now I’ll show you a screenshot of the expected setup:

Merge Join Transformation Editor set up as expected

There are several ways to specify the output fields.  The first method is by using the dropdown in the Input column.  Selecting a value there will populate a dropdown in the column called Input Column (djeez, that was one column too much).  Here’s what that method looks like:

Specifying the output fields by using the dropdowns

Selecting a value in the second column will then give you a default value for the Output Alias.  This default can be freely modified.  As you may have guessed, this is not my preferred method – way too many comboboxes.

Another method of specifying the output fields is by using the checkboxes in front of the fields in the top part of the window.  I believe the larger screenshot above says it all.  Just check the fields that you need and then change their default Output Alias to whatever suits you.  In my example here I only needed to modify the alias for the last two fields.

With our first Merge Join set up, only two are remaining.  So drag in a second Merge Join from the Toolbox, connect the output of the first join as Left Input on the second join and add the output of the third OLE DB source as Right Input.

Interesting to note here is that the output of the Merge Join is sorted in the same manner as its inputs.  One way of verifying this is by right-clicking the connector between the two joins and choosing Edit.

Right-click data flow connector and select Edit to open up Data Flow Path Editor

That opens up the Data Flow Path Editor.

Tip: double-clicking the connector will also open the editor!

Examine the Metadata of the Data Flow Path to verify the sort order

As you can see in the above screenshot, the metadata page shows a list of the available fields with some properties, such as the Sort Key Position.  Now doesn’t that look familiar?! :-)

So far, the second Merge Join has been added and connected but it hasn’t been configured yet.  The process is very similar to the way we’ve set up the first join.  Just select all fields from the left input by checking all the checkboxes and select the two SUM fields from the right input.  Don’t forget to give those SUM fields a clear name.

Two joins done, one remaining.  Just drag one in and connect it with the second join plus the last remaining OLE DB source.  I won’t go into further details here, it’s exactly the same as I just explained for the second join.

Here’s what the Data Flow should look like:

The Data Flow with all the Merge Joins connected

And here’s what the third Merge Join should look like:

The third Merge Join as set up for the example

An Error That You May Encounter

When using sorted data flows and the Merge Join component, you may encounter the following error message:

An error that you may encounter while using the Merge Join component

And now in words for the search engines:

The component has detected potential metadata corruption during validation.

Error at Data Flow Task [SSIS.Pipeline]: The IsSorted property of output “Merge Join Output” (91) is set to TRUE, but the absolute values of the non-zero output column SortKeyPositions do not form a monotonically increasing sequence, starting at one.

Yeah right, you had to read that twice, didn’t you?  And the best is yet to come:

Due to limitations of the Advanced Editor dialog box, this component cannot be edited using this dialog box.

So there’s a problem with your Merge Join but you cannot use the Advanced Editor to fix it, hmm, and you call that the ADVANCED editor?  Is there anything more advanced perhaps?  Well, actually, there is.  It’s called the Properties pane.  With the Merge Join selected, one of the properties there is called NumKeyColumns.  That property reflects the number of columns on which the incoming data is sorted.  And currently it contains the wrong value.  Changing its value to the correct number of columns will remove the error.

Properties pane displaying the Merge Join's properties, including NumKeyColumns

In case you’re wondering when you might encounter this particular problem, here’s how you can simulate it.  (Don’t forget to make a copy of the package before messing around with it.)

With the package as it currently is, remove the ZipCode field from the first two sources by unchecking it in the Columns page of the OLE DB Source Editor.

The sources are now complaining so open up their Advanced Editor and correct the SortKeyPosition of the ProductCategoryName field: it should become 2 instead of 3 because ZipCode was 2 and has been removed.

Now try to open the first Merge Join.  The first time it will complain about invalid references so delete those.  With the references deleted, if you now try to open the Merge Join editor, you’ll see the error we’re discussing here.  To fix it, change the NumKeyColumns property of the Merge Join to 2 instead of 3.

Adding The Destination Table

Now there’s only one step remaining: adding a destination for our merged data.  So, throw in an OLE DB Destination and connect it with the output of the last Merge Join:

An OLE DB Destination connected to the join that merges it all together

I’ll just use a quick and dirty way of creating a new table in the database.  Open up the OLE DB Destination Editor by double-clicking it and select a Connection Manager in the dropdown.  Now click the New button next to the Name of the table or the view dropdown.

That opens up the Create Table window, with a CREATE TABLE query pre-generated for you for free.  Isn’t that nice?  Change the name of the table to something nice (at least remove those spaces, yuk!!) and click OK.

The Create Table window

The new table is created at the moment that the OK button gets clicked.

Right, so are we there?  Well, almost.  As you can see now in the next screenshot, the BIDS does not want us to click the OK button just yet.

The OLE DB Destination Editor with the Mappings still missing

To resolve that warning, just open the Mappings page.  As the names of the input columns are matching exactly with the names of the fields in the destination table, everything will be automagically configured at this moment.  So now you can close the window with the OK button.

And that’s it!  Everything is set up to populate the new table with the aggregated figures, as requested by management.  To give it a run, right-click your package in the Solution Explorer and guess what… select Execute Package!  If everything has been configured as expected, you should get some green boxes soon.  And some data in the table, like this:

The final result: sales figures aggregated over different periods in time
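
If you prefer checking the numbers with a query instead of a data viewer, something like this does the job (again using the hypothetical table name from earlier; yours is whatever you typed in the Create Table window):

select top 10 *
from dbo.SalesAggregated
order by Datekey desc, ZipCode asc, ProductCategoryName asc;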

Conclusion

In this article I’ve demonstrated a way to aggregate data over different periods in time, using T-SQL and Integration Services.  Obviously this method does not replace the flexibility that one gets when analyzing data stored in an OLAP cube, but it can be a practical method when you quickly need to provide aggregated data for management.

Have fun!

Valentino.

References

Merge Join Data Flow Transformation

Common Table Expressions (CTEs)

DATEADD() function


A while ago I wrote an article about how you can use the SQL Server Import and Export wizard to import Excel data into a SQL Server database.  In this sequel I’m going to show you some problems you may encounter when using that wizard.  Just like any good old wizard, he’s only as good as his recipes.  If a recipe is missing an ingredient, the resulting potion will probably not behave as expected and before you know it, it explodes in your face.

I’ll be using the same Excel file as in my previous article.  In case you’re having some problems understanding the data or locating certain screens mentioned in this article, I recommend you to first read the prequel.

Some Common Wizard Pitfalls And Their Solution

Drop And Re-Create Destination Table

Imagine that for your import process you’re planning to import that Excel data regularly.  So the logical step is to make sure that each time the process runs, it starts from scratch.  Following that, what seems to be an interesting checkbox is located on the Column Mappings screen.  This checkbox is called Drop and re-create destination table.

Column Mappings: Drop and re-create destination table

So you activate that checkbox and execute the package.  However, it ends with an error in the Executing phase:

SQL Server Import and Export Wizard: error while Executing

Here’s what the error details say:


Error 0xc002f210: Drop table failed… Cannot drop the table dbo.ProductList$ because it does not exist or you do not have permission.

So why does this error occur?  Let’s examine the SSIS package that was generated.  This is what the Control Flow looks like:

A Control Flow that doesn't take a non-existing table into account

New here is that Drop table SQL Task.  Taking a closer look at the task, here’s the query:

drop table [dbo].[ProductList$]
GO

So the first task is to drop the table.  This flow does not take into account that the table may not yet exist, as is the case here, causing the error.

However, do you notice the blue arrow connecting the Drop table task with the next SQL Task?  That means the flow doesn’t require the execution of that task to end successfully: it will continue even when an error occurs (unlike when the green connector – meaning Success – is used).

Double-clicking the blue connector gives us the Precedence Constraint Editor window:

Precedence Constraint Editor

Indeed, the value for the constraint is set to Completion.

This also explains why the wizard continued executing all phases, even though an error occurred.  And the end result is that the table really does get created, containing the data as expected.

If you’d like to avoid the error and handle the non-existing table, you could replace the query in the Drop table SQL Task with the following one:

if exists
(
    select * from INFORMATION_SCHEMA.TABLES T
    where T.TABLE_NAME = 'ProductList$'
        and T.TABLE_SCHEMA = 'dbo'
)
    drop table [dbo].[ProductList$];

It first checks if the table exists, taking schema into account, and will only perform the DROP TABLE statement if the table actually exists.

Field Length Insufficient

On to the next possible issue.  When examining our Excel sheet we find out that the content of the Class column is either blank or just one character:

Possible values of the Class field

So we decide to change its type to char(1), again using the Column Mappings screen.

Column Mappings: change type to char(1)

However, clicking Finish at the end of the wizard gives us a fatal error:

SQL Server Import and Export Wizard: Operation stopped in Error

And these are the details of the error:

    • Error 0xc02020c5: Data Flow Task 1: Data conversion failed while converting column “Class” (63) to column “Class” (139).  The conversion returned status value 4 and status text “Text was truncated or one or more characters had no match in the target code page.”. (SQL Server Import and Export Wizard)

    • Error 0xc020902a: Data Flow Task 1: The “output column “Class” (139)” failed because truncation occurred, and the truncation row disposition on “output column “Class” (139)” specifies failure on truncation.  A truncation error occurred on the specified object of the specified component. (SQL Server Import and Export Wizard)

    • Error 0xc0047022: Data Flow Task 1: SSIS Error Code DTS_E_PROCESSINPUTFAILED.  The ProcessInput method on component “Data Conversion 0 - 0” (131) failed with error code 0xC020902A while processing input “Data Conversion Input” (132).  The identified component returned an error from the ProcessInput method.  The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.  There may be error messages posted before this with more information about the failure. (SQL Server Import and Export Wizard)

Wow, that’s quite a long list for such a simple change!  Apparently it wasn’t such a good idea to limit the length of this field.  But why does this error occur?  After all, the values in this field are just one character long.  Except, after taking a closer look at the Excel sheet it turns out that these values contain a trailing space!

One way of avoiding this issue is by specifying a higher length for the field.  In this particular case however, you may decide to modify your Data Flow so that it removes the trailing space.  Here’s what the Data Flow currently looks like:

Data Flow to import Excel data

The Data Conversion transformation takes care of converting the Class column into a DT_STR of length 1.

Data Conversion Transformation Editor

We’re going to replace that Data Conversion Transformation with another one, a Derived Column transformation.  This is one of the transformations which I’m using all the time.

Derived Column Transformation Editor

As you can see in the screenshot, I’ve set it up so that it uses the TRIM() function on the Class input column to remove any leading or trailing spaces.  Furthermore I’m using a cast to DT_STR of length 1 to ensure the correct field type.  The resulting column is called Class_STR, to clearly indicate that it has been converted to DT_STR.
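
In case you’d rather do this kind of cleanup in T-SQL instead, for instance on a staging table loaded with the raw nvarchar values, the equivalent would look something like this (the staging table name is made up):

--hypothetical staging table dbo.ProductList_Staging with Class still nvarchar(255)
select CAST(LTRIM(RTRIM(Class)) as char(1)) as Class_STR
from dbo.ProductList_Staging;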

This is the resulting Data Flow:

Data Flow: Destination component complains about lineage ID

Looks like we’ve got an issue with our Destination component now.  Which is quite logical: we still need to tell it that it should use the newly-created Class_STR column.  So double-click the component.

 Restore Invalid Column References Editor

Double-clicking the Destination component will open up the Restore Invalid Column References Editor.  In Available Columns, select the new Class_STR column to replace the Class column which was used previously.

That’s it, your SSIS package will now remove the trailing spaces from the Class column and store it in a column of char(1).

Excel Layout Change

Here’s another common issue when dealing with Excel SSIS imports.  If you’re not 100% in control of that Excel sheet, someone will someday make a structural change to it and it will cause your import process to fail.

In this really simple example I’ve opened up the Excel sheet and renamed the Color column to Colour.  Which is something that may happen in real life: a British person takes over the maintenance of that product list and sees that the Color column is spelled the wrong way, and corrects it without informing anyone.

What does that mean for our import?  Here’s the result when manually executing the package using DTExecUI.exe:

DTExecUI: Package Execution Progress

In this case execution will fail because the package cannot find the Color column in the Excel sheet.  More precisely it says VS_NEEDSNEWMETADATA.  This is a really common error when using Integration Services, but you need to manually update the package to handle such changes.

In other words: try to be in control of that sheet as much as you can, and if possible: set up another way to maintain such data.  For instance by using a Master Data Management system.  But that’s stuff for later on, in future articles.

Conclusion

As long as you’re aware of some of the common issues you may run into, I still think the Import and Export Wizard is an interesting option to start your first SSIS package.  When running into an issue, I recommend opening up the SSIS package and having a closer look through the Business Intelligence Development Studio.  Then make any changes there so you can handle the errors.

Happy importing!

Valentino.


In a previous article I’ve shown you how to import data from an Excel sheet using the OPENROWSET() function.  And I concluded by stating that it’s not the best option when automating your data import.

Today I’ll repeat the Excel data import process by using SQL Server Integration Services, also known as SSIS.

I’ll be using SQL Server 2008 R2, but I’m quite sure that the process is very similar to the first release of 2008, and even to 2005.  The Excel file that I will be importing is the one used in my previous article, and I’ll also refer to some parts of that article, so you may want to have a read over that one when something here isn’t clear.

Furthermore I’m using a Windows 7 64-bit machine, with the ACE 14 driver (beta) installed.  To avoid any discussion about versions and for my own (future) reference, here’s the result of a SELECT @@VERSION:

Microsoft SQL Server 2008 R2 (CTP) – 10.50.1352.12 (X64)   Oct 30 2009 18:06:48   Copyright (c) Microsoft Corporation  Enterprise Evaluation Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

Create SSIS Package To Import Excel Data

Usually you will start by creating a new package in an Integration Services project, add an Excel source to a new Data Flow, throw in some Data Flow Transformations and end your flow with an OLE DB Destination connecting to your SQL Server.

But that’s not the approach that I’ll take in this article.  I’ll make use of a shortcut (and meanwhile I’m showing you how well integrated some components really are).

First, I’m creating a new database called ExcelImport, using the Management Studio (aka SSMS).  Once the database is created, right-click on it and choose Tasks > Import Data.

Start the Import Data Wizard through Management Studio

This will open up the SQL Server Import And Export Data Wizard.  Like all good wizards, it starts with a Welcome screen containing an introductory text about its purpose – something about “create simple packages that import and export data between many popular data formats including databases, spreadsheets” – and so on.  It also includes a checkbox that says “Do not show this starting page again”.  That’s my favorite object on the page :-)   Okay, I agree, the page is useful for people who have never seen the wizard before and who may have opened it up by accident, but that’s about as far as its use goes methinks.

So, do whatever you like with the checkbox and click Next.  That opens up the Choose a Data Source screen.

SQL Server Import and Export Wizard - Choose a Data Source

In that screen you have several options in the Data Source dropdown.  The one we’re interested in is called Microsoft Excel.  Once that option is selected, the controls further down the screen change into what is shown in the screenshot.  Select your file and the right version, in my case I’ve got an Excel 2007 file.  If your sheet contains a header row, activate the First row has column names checkbox.

Clicking Next will open up the Choose a Destination screen.

SQL Server Import and Export Wizard - Choose a Destination

In that screen, select Microsoft OLE DB Provider for SQL Server as Destination.  Ensure that your SQL Server instance is the right one in the Server Name dropdown and the Authentication is filled out as expected.  The correct database should be selected by default because we did a right-click on it to start the wizard.

Another Next click opens up the Specify Table Copy or Query window.

SQL Server Import and Export Wizard - Specify Table Copy or Query

Here we can choose between either retrieving the full table – all rows, all columns – or writing a query ourselves.  I’ll go for the first option.

Click Next once more to open the Select Source Tables and Views screen.  That name seems a bit weird in the context of an Excel import but I guess that’s not really important here.  Just read it as “Select the sheet that you’d like to import”.

SQL Server Import and Export Wizard - Select Source Table and Views 

I’ll go for the ProductList$ Source.  The sheet in my Excel file is called ProductList.  Note that the Destination is editable – I’m changing the destination table to dbo.ProductList (with the dollar sign removed).

If you’d like to view your data right now you can hit the Preview button.  It opens up a window such as this one:

SQL Server Import and Export Wizard - Select Source Table and Views - Preview Data 

That should look familiar.

A more interesting button is the one called Edit Mappings.  Clicking that button opens a screen that lets you change the destination table’s schema.  By default, all columns that seem to be numeric are mapped to a type of float and all the others are mapped to nvarchar with a length of 255.  Depending on your situation you can either leave it as it is (in case you’re just loading a staging table and want to handle data conversions later on) or you should review each column (for instance when you’re erasing and loading the full table every night and this is the actual table being used by other processes).

Note: whenever a column contains a blank cell, its type will be nvarchar(255), even when all other values are numeric.  Also, if you don’t need Unicode support, don’t use nvarchar but change it to varchar instead.

Here’s the Column Mappings screen with some of the defaults changed:

The Column Mappings window

Everything that I changed has been indicated using the yellow marker.  I’ve changed some field types and lengths, made one field non-nullable and adapted the destination name.

Clicking OK followed by Next brings us to the following screen, Review Data Type Mapping.

SQL Server Import and Export Wizard - Review Data Type Mapping

This screen gives another overview of all the columns that we’re planning to import.  Note that each column of which we’ve changed the type has gotten a nice yellow warning sign.  This happens because a data conversion needs to occur, and there’s always something that can go wrong when converting data.  The On Error and On Truncation columns show the action that should happen when such an event occurs.  We’ll leave them all at Use Global.  The Global settings are located at the bottom of the screen and are both set to Fail.  That’s the best option at the moment (the only other one is Ignore but that means you won’t get any notification in case of an error or truncation problem).

After clicking Next we end up at the Save and Run Package window.

SQL Server Import and Export Wizard - Save and Run Package

I have activated the Save SSIS Package checkbox and chose the Do not save sensitive data option.  By default it is saved in SQL Server, which is actually the MSDB.  That’s good because we’re going to examine the contents of the package later on so we want to keep it.

The Run immediately checkbox was activated by default.  This will execute the package in the last step of the wizard.

Another Next click and we’re on the Save SSIS Package screen.

SQL Server Import and Export Wizard - Save SSIS Package

Here we can give our package a decent name, such as ExcelImport.  You can also see the server on which I’m saving the package.

The final Next click brings us to the Complete the Wizard screen, woohoo!

 SQL Server Import and Export Wizard - Complete the Wizard

We can see a short overview of what we’ve configured in the previous steps.  Click Finish to execute and save the package!

SQL Server Import and Export Wizard - The execution was successful

And we’ve successfully executed the package!

Taking A Closer Look At The SSIS Package

When connecting to the Integration Services server (through SSMS for instance), I now have a new package in the root of the MSDB folder.

The ExcelImport SSIS package, stored in MSDB

It’s located there because I chose to save the package to SQL Server.

Adding An Existing Package To An SSIS Project

We’re now going to open it in the Business Intelligence Development Studio (aka BIDS).  So, open the BIDS and create a new SSIS Project (or use an existing one, doesn’t really matter).

Once the project is open, right-click the SSIS Packages folder in the Solution Explorer and select Add Existing Package.

Right-click SSIS Packages folder in Solution Explorer to Add Existing Package

That opens up the Add Copy of Existing Package window.

Sidenote: do you see that SSIS Import and Export Wizard option in the previous screenshot?  That’s right, the wizard that we’ve used extensively in the earlier part of this article can be launched from here as well.

Add Copy of Existing Package

Select SSIS Package Store as location of the package and enter the name of your server in the second dropdown.  Once that is done you can click the button with the ellipsis and select your package under the MSDB node.

Clicking OK will add the package to your project in the Solution Explorer.  Double-click it to open it up.

The Control Flow

In the Control Flow you can see two components: a SQL Task that contains a CREATE TABLE statement and a Data Flow.

Here’s what the CREATE TABLE statement looks like:

CREATE TABLE statement in the Execute SQL Task

As you can see, the table is created using the column names and types just like we configured them through the wizard.

Important to note here is that the Control Flow does not take anything else into account.  For instance, what happens if we execute the package twice?  It will fail because the table already exists!

In case you don’t believe me, just try it out!

Execute SQL Statement failed: There is already an object named 'ProductList' in the database.
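
If you want the package to be rerunnable, you can guard that statement yourself, following the same if-exists pattern as the drop-table fix in my follow-up article on common pitfalls.  A sketch, assuming you kept dbo.ProductList as the destination name:

--add this in front of the generated CREATE TABLE statement (or in a separate
--Execute SQL Task) so a second run starts from scratch
if exists
(
    select * from INFORMATION_SCHEMA.TABLES T
    where T.TABLE_NAME = 'ProductList'
        and T.TABLE_SCHEMA = 'dbo'
)
    drop table dbo.ProductList;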

The Data Flow

Opening up the Data Flow we see that it contains three components: an Excel source component that uses an Excel Connection Manager to connect to our now well-known Excel sheet, a Data Conversion Transformation to take care of the conversions that we requested and an OLE DB Destination that uses an OLE DB Connection Manager to connect to our SQL Server.

The Data Flow to transfer Excel data into SQL Server

Important to note here is that whenever an issue occurs, such as a conversion problem, the flow will fail.

In production environments, certain events need to be taken into account.  The purpose of this article was just to show you how you can use a wizard to generate an SSIS package for you.  You can now use this package as the basis for a well-developed Excel Import template.

Conclusion

With this article I hope to have shown you how to use Integration Services to import Excel data, and also that the Management Studio knows how to use other SQL Server components, such as SSIS, quite well.

If you’re running into some issues while using the wizard, or you just like reading what I write, check out my follow-up article covering some common pitfalls.

Need to go to sleep now, long drive tomorrow, PASS European Conference in Germany!  I do hope that Adam Saxton will be there because I was planning to see his presentations all day long.  I already read that one of the speakers – Brent Ozar – won’t be able to make it.  Darned ash cloud…  Next time someone starts talking to me about Azure I’ll run away screaming.

Just kidding :-)

Have fun!

Valentino.

References

Microsoft Support: How to import data from Excel to SQL Server

How to: Run the SQL Server Import and Export Wizard


A while ago I posted a query to create a list of all the Integration Services packages deployed to the MSDB.  I am now using that query to take it a step further.

If you’ve been using SSIS for a while you’ve probably noticed that the Management Studio doesn’t like to delete Integration Services folders that are not empty.  In fact, it will politely ask you if you’re sure that you want to delete the folder on which you’ve just selected the “Delete” option through the right-click menu.

Right-click pop-up menu on SSIS folder

I am sure I want to delete this non-empty SSIS folder

So you click the Yes button.  But then it shows you the following message:

SSIS folder ‘FolderWithSubfolders’ contains packages and/or other folders. You must drop these first. (Microsoft SQL Server Native Client 10.0)

Graphically it looks like this:

Object Explorer pop-up: you can't delete SSIS folders that contain packages or other folders

And this message can be really annoying if you’ve got a main folder with, let’s say, five subfolders, and each subfolder contains about 20-30 packages.  If you want to delete this folder you first need to delete each package separately and then delete the five subfolders, and then you can finally delete the main folder.  And all that through the right-click pop-up menu because you can’t just select the object in the Object Explorer and hit the Delete button on the keyboard – it doesn’t have an action on SSIS objects…

So, I wasn’t planning on doing such a job manually and came up with the following stored procedure.

It’s probably a bit long but don’t run away just yet, I will explain what’s going on down below the code, and there are some comments in the code as well.

/*
DESCRIPTION: Deletes all folders and packages under, and including, specified folder.
WRITTEN BY:  Valentino Vranken
CREATED:     2010-02-28
VERSION:     1.0
USAGE:
  -- mind the forward slash
  EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders'
  -- to delete a subfolder
  EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders/ASubfolderWithPackages'

COPIED FROM: http://blog.hoegaerden.be

Note 1: folder names are not case-sensitive
Note 2: uses system tables and (undocumented) stored procedures located in MSDB.
Note 3: this code was written for SQL Server 2008. For 2005:
  o sysssispackagefolders -> sysdtspackagefolders90
  o sysssispackages -> sysdtspackages90
  o sp_ssis_deletefolder -> sp_dts_deletefolder
  o sp_ssis_deletepackage -> sp_dts_deletepackage
*/
CREATE PROCEDURE dbo.SSIS_RecursiveDeleteFolder
    @Folder varchar(2000)
AS
BEGIN
    set nocount on;

    declare @foldersToDelete table
    (
        folderid uniqueidentifier,
        Lvl int
    );

    declare @packagesToDelete table
    (
        PackageName sysname,
        folderid uniqueidentifier,
        Lvl int
    );

    --retrieve list of folders to be deleted
    with ChildFolders
    as
    (
        select PARENT.parentfolderid, PARENT.folderid, PARENT.foldername,
            cast('' as sysname) as RootFolder,
            cast(PARENT.foldername as varchar(max)) as FullPath,
            0 as Lvl
        from msdb.dbo.sysssispackagefolders PARENT
        where PARENT.parentfolderid is null
        UNION ALL
        select CHILD.parentfolderid, CHILD.folderid, CHILD.foldername,
            case ChildFolders.Lvl
                when 0 then CHILD.foldername
                else ChildFolders.RootFolder
            end as RootFolder,
            cast(ChildFolders.FullPath + '/' + CHILD.foldername as varchar(max))
                as FullPath,
            ChildFolders.Lvl + 1 as Lvl
        from msdb.dbo.sysssispackagefolders CHILD
            inner join ChildFolders on ChildFolders.folderid = CHILD.parentfolderid
    )
    insert into @foldersToDelete
    select F.folderid, F.Lvl
    from ChildFolders F
    where F.FullPath like @Folder + '%';

    --retrieve list of packages to be deleted
    with ChildFolders
    as
    (
        select PARENT.parentfolderid, PARENT.folderid, PARENT.foldername,
            cast('' as sysname) as RootFolder,
            cast(PARENT.foldername as varchar(max)) as FullPath,
            0 as Lvl
        from msdb.dbo.sysssispackagefolders PARENT
        where PARENT.parentfolderid is null
        UNION ALL
        select CHILD.parentfolderid, CHILD.folderid, CHILD.foldername,
            case ChildFolders.Lvl
                when 0 then CHILD.foldername
                else ChildFolders.RootFolder
            end as RootFolder,
            cast(ChildFolders.FullPath + '/' + CHILD.foldername as varchar(max))
                as FullPath,
            ChildFolders.Lvl + 1 as Lvl
        from msdb.dbo.sysssispackagefolders CHILD
            inner join ChildFolders on ChildFolders.folderid = CHILD.parentfolderid
    )
    insert into @packagesToDelete
    select P.name, F.folderid, F.Lvl
    from ChildFolders F
        inner join msdb.dbo.sysssispackages P on P.folderid = F.folderid
    where F.FullPath like @Folder + '%';

    --use cursor to loop over objects to be deleted
    declare objectsToDelete_cursor cursor
    for
        select P.folderid, P.Lvl, P.PackageName, 'P' as ObjectType
        from @packagesToDelete P
        UNION ALL
        select F.folderid, F.Lvl, null, 'F'
        from @foldersToDelete F
        order by Lvl desc, ObjectType desc;

    open objectsToDelete_cursor;

    declare @folderid uniqueidentifier;
    declare @lvl int;
    declare @packageName sysname;
    declare @objectType char;

    fetch next from objectsToDelete_cursor
    into @folderid, @lvl, @packageName, @objectType;

    while @@FETCH_STATUS = 0
    begin
        if @objectType = 'F'
        begin
            print 'exec msdb.dbo.sp_ssis_deletefolder '
                + cast(@folderid as varchar(max));
            exec msdb.dbo.sp_ssis_deletefolder @folderid;
        end
        else
        begin
            print 'exec msdb.dbo.sp_ssis_deletepackage '
                + @packageName + ', ' + cast(@folderid as varchar(max));
            exec msdb.dbo.sp_ssis_deletepackage @packageName, @folderid;
        end

        fetch next from objectsToDelete_cursor
        into @folderid, @lvl, @packageName, @objectType;
    end;

    close objectsToDelete_cursor;
    deallocate objectsToDelete_cursor;
END

Before trying to dismantle this stored procedure, I recommend reading my previous article on retrieving the list of packages.  That already explains half of the code, if not 75%.

Our mission is to find a way to recursively delete packages and folders contained in a specified folder.  To be able to loop over those objects in the correct order (from the deepest level up until the level of the folder specified), the SP creates two table variables: one to hold all folders under the specified folder (@foldersToDelete) and one to hold the packages under the specified folder, including all subfolders (@packagesToDelete).

Based on those two lists I create a cursor that combines them (with a UNION ALL), taking their level and object type into consideration.  That’s important because we first need to delete the packages in the lowest-level folder, followed by their containing folder, then move one level up and do the same.

We then use the cursor to loop over the packages and folders and use two undocumented system stored procedures – one for each object type – to delete the package or folder.  These system SPs are located in the MSDB.  Here’s how they are defined:

ALTER PROCEDURE [dbo].[sp_ssis_deletefolder]
  @folderid uniqueidentifier
AS

ALTER PROCEDURE [dbo].[sp_ssis_deletepackage]
  @name sysname,
  @folderid uniqueidentifier
AS

As you can see, the parameters for these procedures are not that complicated.  Both of them expect a uniqueidentifier as identification for the folder.  That’s okay, these IDs are stored in the msdb.dbo.sysssispackagefolders table and retrieved by our queries to create the list of to-be-deleted objects.

Furthermore, the sp_ssis_deletepackage SP expects the name of the package to be deleted.  Not a problem either, those names are obtained from the msdb.dbo.sysssispackages table.
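
If you want to check those values for yourself before letting the cursor loose, a quick manual lookup against the same system tables does the trick (using the folder name from my test setup below):

--list a folder's ID together with the packages it contains
select F.folderid, F.foldername, P.name as PackageName
from msdb.dbo.sysssispackagefolders F
    left outer join msdb.dbo.sysssispackages P on P.folderid = F.folderid
where F.foldername = 'FolderWithSubfolders';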

Note for SQL Server 2005 users: this code was written for SQL Server 2008.  The system stored procedures and system tables exist in 2005 as well, but they have different names.  See the comment header of my SP for more details.

So, let’s give it a little test.  The following screenshot shows the setup.  What I will do is use the stored procedure to delete the FolderWithSubfolders folder.  If you’ve been paying close attention, that is the same folder which I tried to delete manually through the Management Studio’s right-click menu (see the first screenshot above).

Overview of my deployed folders and packages

After creating the SP, I ran following command:

EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders'

And that gave me the following output in the Messages pane:

exec msdb.dbo.sp_ssis_deletepackage AnotherPackage, 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletepackage Package, 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletefolder 4102ED59-ED75-4D93-BBAE-0A162447BF02

exec msdb.dbo.sp_ssis_deletefolder 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletefolder C156B436-8C78-4BF9-99F9-5ABFAB10C405

I have deliberately put a couple of print commands in the stored procedure to dump the commands that are actually being executed.  This gives us a good idea of what’s going on.

That’s it for now folks.  Thank you for reading this, and if you found it useful or you’ve got some questions about it: post a comment!

Have fun!

Valentino.


