Loading XML Using SSIS

SQL Server Integration Services can read XML files, that’s known by every BI developer.  (If you didn’t, don’t worry, I’m aiming this article at newcomers as well.)

But how far can you go?  When does the XML Source component become unusable?  Let’s find out!

To create the examples I’m using the following SQL Server version:

Microsoft SQL Server 2008 R2 (RTM) – 10.50.1600.1 (X64)   Apr  2 2010 15:48:46   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

Basic Example

This first example is a really simple XML file containing a list of colors with their corresponding RGB code.

<colors>
  <color RGB="FF0000">Red</color>
  <color RGB="00FF00">Green</color>
  <color RGB="0000FF">Blue</color>
  <color RGB="FFFFFF">White</color>
  <color RGB="000000">Black</color>
</colors>

Let’s import this into a database.  Open up the BIDS, create an SSIS project and throw a Data Flow Task into the package and open it up.

The component that we’re now most interested in is the XML Source, one of the components in the Data Flow Sources category in the Toolbox.

The XML Source component

Add one of those to your Data Flow and double-click it to open up the XML Source Editor.

The Data Access Mode should be set to XML file location, which is the default setting.  The other options are XML file from variable – useful if you’ve got the file path and name of the XML file in a variable – and XML data from variable – interesting if your XML data is actually stored in a variable.

As XML Location, select the .xml file.  Our XML sample does not have an inline schema, so we can’t use that checkbox.  And we can’t click the OK button either, it’s grayed out.  The source component really expects a description of the XML structure before the editor can be closed.

The bottom of the screen even shows a warning with the following message:

XML Schema (XSD) is not specified. Select an existing XSD or click Generate XSD to create an XSD from the XML file.

So, what are you waiting for,  Click the Generate XSD button to let the XML Source Editor generate the XSD schema for us.  Real easy, right?

Remember where you save the file, and when it’s generated, select the .xsd file in the XSD location textbox.  As you can see, the OK button will become available.  But don’t click it just yet.

Here’s what the XML Source Editor now looks like:

XML Source Editor with an XML and XSD file specified

Let’s now move on to the second page of the XML Source Editor, called Columns.  When you open it, you’ll receive the following popup with a couple of warnings:

Warning gets displayed when opening the Columns page

The editor is letting us know that the columns that are being generated do not have a maximum length specified.  So it’s setting them to Unicode (DT_WSTR) with a length of 255.  Click the OK button to get rid of that message and to be able to see the generated columns.

Note: if your data elements or attributes may contain longer strings then you should have a look at modifying the length specification.  This can be done through the Advanced Editor, which is opened by right-clicking the XML Source.  The Input and Output Properties page is the one you’re after.

The Columns page, showing the columns that the XML Source generated for us

As you can see, our only attribute – RGB, is nicely put in a column with the same name.  The value of each <color> node however is not put in a column called Color.  By default, this value is put into a column called “text”.  Which is a weird name for a column in an SSIS data flow if you ask me.  The good thing is that you can just rename it by changing the Output Column value.

Let’s test this out.  My favorite way is to add a Multicast component to the Data Flow, then add a Data Viewer on the connector (right-click the green arrow, select Data Viewers, click Add > OK > OK).  Now execute the package to get this result:

Testing the output of the XML Source through the Data Viewer

Mission accomplished, we’ve retrieved data from a very basic XML file!

Adding Some Complexity

Let’s move on to the second example of this article.  The difference with the previous example is that now we’ve got multiple nested structures to deal with.

The example represents a list of book reviews, including some details on the books themselves.  A book can have multiple writers and obviously multiple reviews as well.

<books>
    <book pages="300">
        <title>Microsoft SQL Server 2008 R2 Master Data Services</title>
        <category>Information Technology</category>
        <authors>
            <author>Jeremy Kashel</author>
            <author>Tim Kent</author>
            <author>Martyn Bullerwell</author>
        </authors>
        <reviews>
            <review>If you're looking for an excellent book on the new Master Data Services component of SQL Server 2008 R2, definitely check this one out!  To be released in June 2011 by Packt Publishing!</review>
        </reviews>
    </book>
    <book pages="832">
        <title>Inside Microsoft SQL Server 2008: T-SQL Querying</title>
        <category>Information Technology</category>
        <authors>
      <author>Itzik Ben-gan</author>
      <author>Lubor Kollar</author>
      <author>Dejan Sarka</author>
      <author>Steve Kass</author>
        </authors>
        <reviews>
            <review>Every "Inside SQL Server" book can be recommended, especially when written by Itzik!</review>
        </reviews>
    </book>
    <book pages="1137">
        <title>The Lord of the Rings</title>
        <category>Fantasy</category>
        <authors>
            <author>J.R.R. Tolkien</author>
        </authors>
        <reviews>
            <review>Like fantasy?  What are you waiting for then?  It's a classic!</review>
            <review>If you liked the movie, you'll love the book.</review>
        </reviews>
    </book>
</books>

Configure an XML Source so that it uses the books.xml file, generate the XSD and specify its location.  I’m not going into details on that, the procedure is the same as in our first example above.

Now open up the Columns page to have a closer look at how the XML data is going to get imported.

XML Source generates multiple=

So how does the XML Source component deal with the multiple nested structures?  It generates multiple outputs!  If you select another output from that dropdown, you get to see its fields.

To get a clear understanding of what exactly is going on, let’s connect each output with an OLE DB Destination component.  The target table can be generated based on the incoming fields by clicking the New button.  Replace the table name in the generated CREATE TABLE script with a clear one that fulfills your naming convention requirements – such as NO SPACES IN A TABLE NAME for instance – and hit the OK button.

Destinatio table can be generated by using the New button in the OLE DB Destination Editor

Now that the table is created, it will be automatically selected in the Name of the table or the view dropdown.  Don’t forget to visit the Mappings page so that the, well, mappings are created.  If no field names were modified in the CREATE TABLE script then all fields should be mapped automatically based on their names.

With all five destinations added, execute the package.

Each XML Source output is connected to an OLE DB Destination - executes fine

So now we’ve loaded the data from our XML file into a database, but the data is spread over five tables.  How do we retrieve that data?  Join them together!

select * from XML_book
inner join XML_authors on XML_authors.book_Id = XML_book.book_Id
inner join XML_author on XML_author.authors_Id = XML_authors.authors_Id
inner join XML_reviews on XML_reviews.book_Id = XML_book.book_Id
inner join XML_review on XML_review.reviews_Id = XML_reviews.reviews_Id;

And the result looks like this:

XML data imported into the SQL Server database

Conclusion

We have managed to flatten the data from an XML file containing multiple nested repeating nodes, nice huh?  But do you also feel the limitation using this method?  In terms of modern XML, this was still a fairly easy XML file and yet we already needed five tables to store the data.  Can you imagine what this will give with a really complex file?

Watch this blog for the follow-up article where I will try to import data from a really complex XML file!

Have fun!

Valentino.

References

XML Source component

Share

Tags: , , ,

  1. Rahul’s avatar

    Hi,

    What if the XMl file is really complex..is there any specific approach in SSIS to read the XML file..Please get back to me..

    Reply

    1. Valentino Vranken’s avatar

      Hi Rahul,

      I’m currently working on the follow-up article that I mentioned in the Conclusion above. In that article I will demonstrate how to get data out of a really complex XML file while still using SSIS (but not the XML Source component though). Hope to publish by the end of this week, possibly early next week, so watch this space.

      Thanks for contacting me!
      Valentino.

      Reply

  2. raghav’s avatar

    Hi Valentino,

    As you mentioned earlier in the message, u will be publishing the some XML complex scenarios, could u please publish them here, which helps a lot.

    Thanks,
    Ragha.

    Reply

    1. Valentino Vranken’s avatar

      Hi Ragha,

      The article has been linked to in the “Watch this blog …” sentence. Here’s the direct link for your convenience: http://blog.hoegaerden.be/2011/04/20/loading-complex-xml-using-ssis/

      PS: I deleted those three duplicated comments.

      Regards,
      Valentino.

      Reply

  3. raghav’s avatar

    Hi Valentino,

    could please help me how to load the data from below xml format.

    I hav a file in the below xml format, so it has to loaded to the text file. but it not reading the data.

    FA
    441
    Error
    -
    -
    -
    -
    -
    - <![CDATA[
    The Trizetto Group, Inc cerbnav0
    Batch Router Module
    Version 4.41.00.01 Jan 29 2007 03:59:48
    Proprietary Property The Trizetto Group, Inc © 2004

    Application Server: FSBATCH01
    Process ID : 27892

    *-- Standard Error --*
    ================================
    Initialization Phase Started.

    Actual Run Start Date and Time [11/30/2010 18:03:49.725]

    Initialization Phase Complete.
    ================================
    ================================
    Execution Phase Started.

    The Trizetto Group, Inc CCSLUMI0
    Utilization Management Import
    Version 4.41.00.11 Apr 18 2007 09:46:13
    Proprietary Property The Trizetto Group, Inc © 2004

    Application Server: FSBATCH01
    Process ID : 27892

    *– Standard Error –*
    ================================
    Initialization Phase Started.

    Actual Run Start Date and Time [11/30/2010 18:03:49.725]

    Initialization Phase Complete.
    ================================
    ================================
    Execution Phase Started.

    Rectype: UMIN
    Key:UMUM_REF_ID=198656340
    UMIN ERROR :Allowed days sum on UMLS must equal UMIN_TOT_ALW_DAYS=1

    Rectype: UMIN
    Key:UMUM_REF_ID=199352790
    UMIN ERROR :Allowed days sum on UMLS must equal UMIN_TOT_ALW_DAYS=38

    Rectype: UMSV
    Key:UMUM_REF_ID=199434616,UMSV_SEQ_NO=2
    UMSV ERROR :P revious UMVT_STS equal Void (CL)

    Rectype: UMIN
    Key:UMUM_REF_ID=199716270
    UMIN ERROR :P revious UMIT_STS equal Void (CL)

    Note: Upto tag the data is reading, but below data which is not exactly in xml format not hav the tags is not reading,

    my intention is use to load only that particular data below the tag
    i.e;

    Rectype,Key,uminerror, UMIN_TOT_ALW_DAYS–> colums
    UMIN,UMUM_REF_ID=198656340,Allowed days sum on UMLS must equal UMIN_TOT_ALW_DAYS=1—->Rows.

    Please help, hw it can be resolved.

    Thanks in Advance,
    Raghava.

    Reply

  4. raghav’s avatar

    Hi Valentino,

    the file i am sending is coming to u without tags.

    Reply

    1. Valentino Vranken’s avatar

      Hi Ragha,

      That file doesn’t seem to be XML but some sort of logging file? You say “it’s coming to me without tags”, I assume you mean that the content that you copied into the comment box didn’t come through as expected?

      In any case, looks like you’re probably going to need some customized application to retrieve the data that you need out of that file. You can do that by implementing a command-line app, then call the app with the appropriate parameters (such as the location/name of your file) from SSIS using the Execute Process task.

      Hope that helps to get you started?

      Regards,
      Valentino.

      Reply

  5. raghav’s avatar

    Thanks Valentino.

    Reply

  6. raghav’s avatar

    Hi Valentino,

    I want to knw how to schedule the jobs using Autosys Tool in SSIS, and in that if there is cosalidation of packages, one or more jobs in one package consalidated with other other package.

    Suppose two packages r there. one package is child.dtsx and it has two jobs in it which r scheduled in Autosys.
    other is mainpack.dtsx in job is scheduled after the child job gets completed.

    This is the scenario, Can u please show me how we can schedule this using Autosys with Some print screens in Autosys.

    After Scheduling gets completed how to monitor these jobs in Autosys cn u show me with slides.

    u can explain with simple scenario but using Autosys Scheduling i want to knw and how to monitor and what kind of work we will do when a job fail, is there any specific procedure is there.

    I am into production support nw, my job is that i will recieve some tickets regarding the jobs whn it does nt work properly.

    I have to monitor in Autosys n i hav to resolve.
    Could u plz me knw with ur patience that helps me a lot, and i vl b very thankful to u valentino.

    Thanks in Advance,
    Raghava.

    Reply

    1. Valentino Vranken’s avatar

      Sorry, don’t know Autosys. I use SQL Server’s Agent…

      Reply

  7. raghav’s avatar

    hi,

    thing is that yesterdays task was resolved using XML source by converting it into NTEXT.

    Reply

  8. MA Belleau’s avatar

    Is there any way to have Authors with unique IDs?

    Reply

    1. Valentino Vranken’s avatar

      Do you mean in your database or in the data flow? If database, you could create a PK field of type int and activate the identity increment. To ensure that you’re not creating duplicate records, you could use the Lookup component to check if the author’s name (by lack of a better BK – business key) already exists. Watch out for possible duplicate entries in the same flow.

      It all depends a bit on what you’re trying to achieve though…

      Reply

  9. Murli’s avatar

    Hi
    Is there a way to get SQL Server to generate the XML (either an entire table output or a contents of Select query) in a format that conforms to your XSD automatically. Or do you know of any other resources.

    I brainstormed some ideas but on thinking further none of them seemed practical. I am hoping someone in SSIS community is able to help. I have scoured google for answers but so far I haven’t found anyone having a solution for this problem.

    Thanks

    Reply

  10. Greg’s avatar

    In the second example, where does the Book_ID come from? I don’t see it in the XML.

    I tried a test with my XML/XSD. It also generated a mystery ID but I cannot see where this is coming from or how it is being generated. Frankly, I don’t want it either but I want to understand.

    Thanks.

    Reply

    1. Valentino Vranken’s avatar

      Hi Greg,

      That Book_ID is the “mystery ID” to which you’re refering. Each output gets an auto-generated ID which can then be used to join them together. See the SELECT statement at the end of the article. Outside that context the ID doesn’t have much meaning.

      Regards,
      Valentino.

      Reply

  11. ravi’s avatar

    Hello Valentino,

    I have seen your post on Experts-Exchange website. I have a complete opposite scenario. I have data in 6 tables. I want to generate an XML file. I have the XSD genereated using your method.

    Can you please help me with this? I have posted this question in Experts-Exchange.com (here is the link: http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL_Server_2008/Q_28061325.html )

    Reply

  12. LOKESH’s avatar

    How is ID Field generated from the XML?

    Reply

  13. Duncan’s avatar

    Hello Valentino,
    I messed around unsuccessfully for hours with bulkimport, sp_xml_preparedocument openxml in sql, then I stumbled across your blog. This really made my day!
    Following your instructions for the BIDS setup, I was able to import the data I require from an xml into the database relatively quickly!
    This was the first time I had used BIDS.
    How can I call up this packet in a job in sql-management-studio. I assume it is an SQL Server Integration Services Packet, but I cannot find the saved packet. Would be grateful for any tips how to “find” it in SMS
    thanks
    groet
    Duncan

    Reply

© 2008-2014 BI: Beer Intelligence? All Rights Reserved