SSIS

You are currently browsing articles tagged SSIS.

A while ago I posted a query to create a list of all the Integration Services packages deployed to the MSDB.  I am now using that query to take it a step further.

If you’ve been using SSIS for a while you’ve probably noticed that the Management Studio doesn’t like to delete Integration Services folders that are not empty.  In fact, it will politely ask you if you’re sure that you want to delete the folder on which you’ve just selected the “Delete” option through the right-click menu.

Right-click pop-up menu on SSIS folder

I am sure I want to delete this non-empty SSIS folder

So you click the Yes button.  But then it shows you the following message:

SSIS folder ‘FolderWithSubfolders’ contains packages and/or other folders. You must drop these first. (Microsoft SQL Server Native Client 10.0)

Graphically it looks like this:

Object Explorer pop-up: you can't delete SSIS folders that contain packages or other folders

And this message can be really annoying if you’ve got a main folder with, let’s say, five subfolders, and each subfolder contains about 20-30 packages.  If you want to delete this folder you first need to delete each package separately and then delete the five subfolders, and then you can finally delete the main folder.  And all that through the right-click pop-up menu because you can’t just select the object in the Object Explorer and hit the Delete button on the keyboard – it doesn’t have an action on SSIS objects…

So, I wasn’t planning on doing such a job manually and came up with the following stored procedure.

It’s probably a bit long but don’t run away just yet, I will explain what’s going on down below the code, and there are some comments in the code as well.

/*
DESCRIPTION: Deletes all folders and packages under, and including, specified folder.
WRITTEN BY:  Valentino Vranken
CREATED:     2010-02-28
VERSION:     1.0
USAGE:
  -- mind the forward slash
  EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders'
  -- to delete a subfolder
  EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders/ASubfolderWithPackages'

COPIED FROM: http://blog.hoegaerden.be

Note 1: folder names are not case-sensitive
Note 2: uses system tables and (undocumented) stored procedures located in MSDB.
Note 3: this code was written for SQL Server 2008. For 2005:
  o sysssispackagefolders -> sysdtspackagefolders90
  o sysssispackages -> sysdtspackages90
  o sp_ssis_deletefolder -> sp_dts_deletefolder
  o sp_ssis_deletepackage -> sp_dts_deletepackage
*/
CREATE PROCEDURE dbo.SSIS_RecursiveDeleteFolder
    @Folder varchar(2000)
AS
BEGIN
    set nocount on;

    declare @foldersToDelete table
    (
        folderid uniqueidentifier,
        Lvl int
    );

    declare @packagesToDelete table
    (
        PackageName sysname,
        folderid uniqueidentifier,
        Lvl int
    );

    --retrieve list of folders to be deleted
    with ChildFolders
    as
    (
        select PARENT.parentfolderid, PARENT.folderid, PARENT.foldername,
            cast('' as sysname) as RootFolder,
            cast(PARENT.foldername as varchar(max)) as FullPath,
            0 as Lvl
        from msdb.dbo.sysssispackagefolders PARENT
        where PARENT.parentfolderid is null
        UNION ALL
        select CHILD.parentfolderid, CHILD.folderid, CHILD.foldername,
            case ChildFolders.Lvl
                when 0 then CHILD.foldername
                else ChildFolders.RootFolder
            end as RootFolder,
            cast(ChildFolders.FullPath + '/' + CHILD.foldername as varchar(max))
                as FullPath,
            ChildFolders.Lvl + 1 as Lvl
        from msdb.dbo.sysssispackagefolders CHILD
            inner join ChildFolders on ChildFolders.folderid = CHILD.parentfolderid
    )
    insert into @foldersToDelete
    select F.folderid, F.Lvl
    from ChildFolders F
    where F.FullPath like @Folder + '%';

    --retrieve list of packages to be deleted
    with ChildFolders
    as
    (
        select PARENT.parentfolderid, PARENT.folderid, PARENT.foldername,
            cast('' as sysname) as RootFolder,
            cast(PARENT.foldername as varchar(max)) as FullPath,
            0 as Lvl
        from msdb.dbo.sysssispackagefolders PARENT
        where PARENT.parentfolderid is null
        UNION ALL
        select CHILD.parentfolderid, CHILD.folderid, CHILD.foldername,
            case ChildFolders.Lvl
                when 0 then CHILD.foldername
                else ChildFolders.RootFolder
            end as RootFolder,
            cast(ChildFolders.FullPath + '/' + CHILD.foldername as varchar(max))
                as FullPath,
            ChildFolders.Lvl + 1 as Lvl
        from msdb.dbo.sysssispackagefolders CHILD
            inner join ChildFolders on ChildFolders.folderid = CHILD.parentfolderid
    )
    insert into @packagesToDelete
    select P.name, F.folderid, F.Lvl
    from ChildFolders F
        inner join msdb.dbo.sysssispackages P on P.folderid = F.folderid
    where F.FullPath like @Folder + '%';

    --use cursor to loop over objects to be deleted
    declare objectsToDelete_cursor cursor
    for
        select P.folderid, P.Lvl, P.PackageName, 'P' as ObjectType
        from @packagesToDelete P
        UNION ALL
        select F.folderid, F.Lvl, null, 'F'
        from @foldersToDelete F
        order by Lvl desc, ObjectType desc;

    open objectsToDelete_cursor;

    declare @folderid uniqueidentifier;
    declare @lvl int;
    declare @packageName sysname;
    declare @objectType char;

    fetch next from objectsToDelete_cursor
    into @folderid, @lvl, @packageName, @objectType;

    while @@FETCH_STATUS = 0
    begin
        if @objectType = 'F'
        begin
            print 'exec msdb.dbo.sp_ssis_deletefolder '
                + cast(@folderid as varchar(max));
            exec msdb.dbo.sp_ssis_deletefolder @folderid;
        end
        else
        begin
            print 'exec msdb.dbo.sp_ssis_deletepackage '
                + @packageName + ', ' + cast(@folderid as varchar(max));
            exec msdb.dbo.sp_ssis_deletepackage @packageName, @folderid;
        end

        fetch next from objectsToDelete_cursor
        into @folderid, @lvl, @packageName, @objectType;
    end;

    close objectsToDelete_cursor;
    deallocate objectsToDelete_cursor;
END

Before trying to dismantle this stored procedure, I recommend you to read my previous article on retrieving the list of packages.  That already explains half of the code, if not 75%.

Our mission is to find a way to recursively delete packages and folders contained in a specified folder.  To be able to loop over those objects in the correct order (from the deepest level up until the level of the folder specified), the SP creates two table variables: one to hold all folders under the specified folder (@foldersToDelete) and one to hold the packages under the specified folder, including all subfolders (@packagesToDelete).

Based on those two lists I create a cursor that joins these two together, taking their level and object type into consideration.  That’s important because we first need to delete the packages in the lowest level folder, followed by their containing folder, then move one level up and do the same.

We then use the cursor to loop over the packages and folders and use two undocumented system stored procedures – one for each object type- to delete the package or folder.  These system SPs are located in the MSDB.  Here’s how they are defined:

ALTER PROCEDURE [dbo].[sp_ssis_deletefolder]
  @folderid uniqueidentifier
AS

ALTER PROCEDURE [dbo].[sp_ssis_deletepackage]
  @name sysname,
  @folderid uniqueidentifier
AS

As you can see, the parameters for these procedures are not that complicated.  Both of them expect a uniqueidentifier as identification for the folder.  That’s okay, these IDs are stored in the msdb.dbo.sysssispackagefolders table and retrieved by our queries to create the list of to-be-deleted objects.

Furthermore, the sp_ssis_deletepackage SP expects the name of the package to be deleted.  Not a problem either, those names are obtained from the msdb.dbo.sysssispackages table.

Note for SQL Server 2005 users: this code was written for SQL Server 2008.  The system stored procedures and system tables exist in 2005 as well, but they have different names.  See the comment header of my SP for more details.

So, let’s give it a little test.  Following screenshot shows the setup.  What I will do is use the stored procedure to delete the FolderWithSubfolders folder.  If you’ve been paying close attention, that is the same folder which I tried to delete manually through the Management Studio’s right-click menu (see first screenshot above).

Overview of my deployed folders and packages

After creating the SP, I ran following command:

EXEC dbo.SSIS_RecursiveDeleteFolder '/FolderWithSubfolders'

And that gave me the following output in the Messages pane:

exec msdb.dbo.sp_ssis_deletepackage AnotherPackage, 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletepackage Package, 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletefolder 4102ED59-ED75-4D93-BBAE-0A162447BF02

exec msdb.dbo.sp_ssis_deletefolder 7F38288D-4370-40A8-80E3-E92283033E4C

exec msdb.dbo.sp_ssis_deletefolder C156B436-8C78-4BF9-99F9-5ABFAB10C405

I have deliberately put a couple of print commands in the stored procedure to dump the commands that are actually being executed.  This gives us a good idea of what’s going on.

That’s it for now folks.  Thank you for reading this, and if you found it useful or you’ve got some questions about it: post a comment!

Have fun!

Valentino.

  • Share/Bookmark

Tags: , , , ,

Sometimes my posts are over 20 pages long when pasted into a Word document.  That’s when I call them article, or tutorial.  Other times I post real quickies about little things that have annoyed me in the past, because I had to spend too much time looking for a solution to a certain issue, or just because they are not very obvious and I can image fellow developers doing a search on the internet on that specific subject.

This post is one of the latter.

In a SQL Server Integration Services project, have you ever wondered how on earth you can get files into that Miscellaneous folder?  When you right-click on the folder in the Solution Explorer, nothing happens, no pop-up menu.

Well, the answer is simple, once you know it.

All you need to do is go one level up in the tree and right-click the Project node in the Solution Explorer.  In the menu that appears, select Add > Existing Item….

How to add a file to the SSIS Miscellaneous folder

In the pop-up window, navigate to any file that you’d like to add to the project.  Files of a different type than the usual SSIS files such as .dtsx and .ds are automatically added under the Miscellaneous folder.  When adding files, they will be automatically copied to the SSIS project folder, no matter where they were stored originally.  See, not that complicated… once you know it!

In the following screenshot I’ve added three different file types to the folder, just to prove that it’s working.

Miscellaneous folder with some files added to it

I like using this folder to store files that belong with that particular project.  Examples of those are configuration files, XSLT files for complex XML conversions, and also the SQL scripts that create my databases.  Each time when I make schema changes, I update the scripts.  And as I’m using TFS integration, I can rest assured that I always have a backup of my files.  (At least, assuming the TFS team is doing their job – this is usually beyond my responsibility :-) )

Speaking about TFS, watch out if you use Business Intelligence Development Studio 2008 to connect to Team Foundation Server 2005!  There’s an interesting setting in the Options screen (through the Tools menu), located in the Source Control > Visual Studio Team Foundation Server page.  This setting is called Get latest version of file on check out.

Get latest version of item on check out

However, there’s one caveat!  On TFS2005 it doesn’t do anything!  If you’re not aware of that, you may get an annoying surprise when you’re trying to check in your changes because you may have been working on an outdated version of your package!  And as you probably already know: merging two versions of an SSIS package is, well, what shall I call it, a challenge?

Another setting that helps you to avoid the issue described above is located under Source Control > Environment and is called Get everything when a solution or project is opened.  Activate this setting and each time when you open your project, you’ll get a popup window which allows you to Get the latest version of the files.

Get evrything when a solution or project is opened

That leaves one more possible conflict situation.  If someone changes a package on the same day as you, the second person will need to explicitly do a Get Latest Version or he/she will be working on an outdated version.  So, communicate with your team mates so that you know if someone has gotten an assignment that collides with yours.  Of course, this last problem is just a theoretical possibility.  In teams, work is usually divided so that developers do not need to work with more than one person on the same piece of code.  The same logic applies to SSIS packages.

And remember, have fun!

Valentino.

  • Share/Bookmark

Tags: , , ,

If you’ve got some experience building Data Flows in your Integration Services packages, you probably already know that SSIS has its own representation of the different data types.  The names of these data types start with DT_, followed by the actual type such as BOOL for boolean and STR for string.

Some data types you use every day and others only once in a while.  So some you already know by heart and others you don’t.  A nice example of one that I certainly don’t need to look up anymore is the following cast as used in a Derived Column Transformation: (DT_STR, 100, 1252)YourField.  It converts YourField to a string field of length 100 using the ANSI – Latin 1 code page.

But the reason for this post are the other types, the ones not used daily.  Because each time that I need info on one of those, I find myself ploughing through several BOL pages before finding that page of which I know it exists but where, oh where?!

So, on the following Books Online page you can find a list of all SSIS data types and their definition: Integration Services Data Types

And at the bottom of that same page there’s an interesting table showing the mapping between the SSIS types and those of several RDBMS, including SQL Server of course.

Another interesting page is this one: Working with Data Types in the Data Flow

The bottom half of this page contains a mapping table between the SSIS data types and their corresponding managed type in .NET.  This is especially useful when you’re doing some custom development like a custom data flow component.  But you don’t need to take it that far, even when you’re just using the built-in Script Component task for some complex data conversions, you’ll find this useful.  After all, you’re also programming in a managed .NET language in that component.

Okay, that’s it for now, hopefully this will save you some time searching for those reference pages.  I already know that I’ll be coming back to this post once in a while :-)

  • Share/Bookmark

Tags: ,

When deploying packages to SQL Server Integration Services, it’s advisable to set up a folder structure so that you can easily distinguish packages belonging to different projects.  Furthermore it may be interesting to create subfolders under the main project folder to separate packages according to the different phases in your ETL (Extract, Transform, Load) process.  When loading a data warehouse, interesting folder names are Dimensions for your dimension ETLs and Facts for the packages that load the fact tables.

After a while you end up with lots of packages spread over lots of folders.  To get a good view of what is deployed on your server, it may be interesting to find a way to list all the packages.  And that’s exactly the reason why I’m writing this article.

The query further down generates a list of all packages deployed in the MSDB database on your SQL Server.  What you get is the name of the packages, their location and version-related information.  I’ve also created a  RootFolder column so that it’s easy to filter on project.  (See now why it’s interesting to create separate folders per project?)

It’s important to note that packages deployed to the File System will not be shown in the list.  After all, they are not stored in the MSDB database but in a folder somewhere on the server’s hard drive, more precisely in a subfolder of where you’ve installed your SQL Server.  In case you’ve forgotten where that was, here’s a small tip.  On your server, open up the list of Windows Services (Start > Run > type "services.msc" > enter) and locate the service called SQL Server Integration Services 10.0.  Open the properties of that service and have a look at the Path to executable value in the General tab.  Take the path, drop the \Binn part and add \Packages instead.  That is where, by default, the packages are deployed.  (If you’re running SQL Server 2005, apply the same procedure but look for a service called SQL Server Integration Services.)

On my system, this is where the packages are located: D:\Program Files\Microsoft SQL Server\100\DTS\Packages.  I will also prove it with following screenshot:

Packages deployed to the file system on SSIS 2008

Okay, time for the real stuff, the query:

/*
    DESCRIPTION: Lists all SSIS packages deployed to the MSDB database.
    WRITTEN BY: Valentino Vranken
    VERSION: 1.1
    COPIED FROM: http://blog.hoegaerden.be

    Note: this query was written for SQL Server 2008. For SQL2005:
        o sysssispackagefolders => sysdtspackagefolders90
        o sysssispackages => sysdtspackages90
*/
with ChildFolders
as
(
    select PARENT.parentfolderid, PARENT.folderid, PARENT.foldername,
        cast('' as sysname) as RootFolder,
        cast(PARENT.foldername as varchar(max)) as FullPath,
        0 as Lvl
    from msdb.dbo.sysssispackagefolders PARENT
    where PARENT.parentfolderid is null
    UNION ALL
    select CHILD.parentfolderid, CHILD.folderid, CHILD.foldername,
        case ChildFolders.Lvl
            when 0 then CHILD.foldername
            else ChildFolders.RootFolder
        end as RootFolder,
        cast(ChildFolders.FullPath + '/' + CHILD.foldername as varchar(max))
            as FullPath,
        ChildFolders.Lvl + 1 as Lvl
    from msdb.dbo.sysssispackagefolders CHILD
        inner join ChildFolders on ChildFolders.folderid = CHILD.parentfolderid
)
select F.RootFolder, F.FullPath, P.name as PackageName,
    P.description as PackageDescription, P.packageformat, P.packagetype,
    P.vermajor, P.verminor, P.verbuild, P.vercomments,
    cast(cast(P.packagedata as varbinary(max)) as xml) as PackageData
from ChildFolders F
    inner join msdb.dbo.sysssispackages P on P.folderid = F.folderid
order by F.FullPath asc, P.name asc;

The query uses a recursive CTE (Common Table Expression) to get data out of a system table called sysssispackagefolders, located in the MSDB system database.  The CTE gives us a list of all folders stored in the database and at the same time uses the hierarchical structure of the table to build the FullPath and the Lvl columns.

Note: the CAST() calls are needed because the data type of the foldername column is sysname.  And sysname does not implicitly convert to varchar, which is needed for the concatenation building the FullPath column.

The CTE is joined with another system table called sysssispackages, also located in MSDB.  Not all columns are being retrieved from that table but I believe I’ve selected the most important ones.  Have a look in the Books Online for more info on the columns available.

There’s one column however on which I’d like to add some additional info myself.  That column is called packagedata and it contains the actual SSIS package.  The data type of this column is image, not sure why because after all, an SSIS package (or .dtsx file for that matter) is pure XML.  So why isn’t it stored as XML?  If anyone knows the reason: post a comment!

Anyway, as you can see in the query, to get it converted from image to XML you need to go through varbinary.  The image datatype cannot convert directly to XML.  See the Books Online here on what casts are allowed: http://msdn.microsoft.com/en-us/library/ms187928.aspx

Note for SQL Server 2005 users: as mentioned in the query’s comments, these tables don’t exist in SQL Server 2005.  Well, actually they do, they just have different names.  See the comment in the code for their equivalent.

To finish off I’ll show you what the results look like when executing the query on my test system.  But first, following screenshot shows all deployed packages as reported by the Management Studio.  As you can see, two packages are deployed to the File System.  These two packages were shown earlier in the first screenshot.  Some other packages have been deployed to the MSDB database.

Object Explorer showing all deployed SSIS packages

And here are the results of the query:

A list of all SSIS packages deployed to my SQL Server 2008 MSDB database

To be honest, I added a little filter to keep the results clean.  The Data Collector, a new feature of SQL Server 2008, also uses some packages so I’ve filtered those out by adding a WHERE clause to the SELECT statement at the bottom of the full query:

select F.RootFolder, F.FullPath, P.name as PackageName,
    P.description as PackageDescription, P.packageformat, P.packagetype,
    P.vermajor, P.verminor, P.verbuild, P.vercomments,
    cast(cast(P.packagedata as varbinary(max)) as xml) as PackageData
from ChildFolders F
    inner join msdb.dbo.sysssispackages P on P.folderid = F.folderid
where F.RootFolder <> 'Data Collector'
order by F.FullPath asc, P.name asc;

If you’ve been paying attention, you’ve noticed that the two packages deployed to the File System are not mentioned in the output of the query, as expected.

That’s all for now folks, have fun!

References

BOL 2008: Tutorial: Creating a Simple ETL Package

BOL 2008: sysssispackagefolders (Transact-SQL)

BOL 2008: sysssispackages (Transact-SQL)

BOL 2008: Recursive Queries Using Common Table Expressions

  • Share/Bookmark

Tags: , , , , ,

Ever since I upgraded to SQL Server 2008 Service Pack 1 I noticed that the Management Studio was reporting incorrect version numbers when connected to Integration or Reporting Services.  This incorrect version number is located to the right of the server instance in the Object Explorer.

As usual, a picture says so much more than … :

Object Explorer showing wrong version numbers

As I have posted earlier, 10.0.2531 is the version number for SP1, while 10.0.1600 is the original RTM version number.

I never really spent time looking for an answer to this.  It was obviously a bug but I could live with it and someone else would probably already have filed it as being a bug.  So recently I came across a post by Phil Brammer that mentioned this issue.  This post got a comment from Matt Masson, a developer on the SSIS team.  Have a look at the comment but in short: the version numbers that are being shown in the Object Explorer are actually the version numbers of the service’s .exe file!  And SSMS is now showing the wrong number because these files didn’t get an update in SP1.

After a little search I found the bug report on Microsoft Connect, reported on March 11, 2009, by Dan English.  Its status is Fixed but it seems that it isn’t.  At least, looking at the comments, CU5 (Cumulative Update) for SQL Server 2008 SP1 is still showing the problem.  So I guess you could go over to the Connect page and click on that Yes button if you’re interested in seeing this fixed.  After all, it could be quite misleading to novice DB guys and gals…

On this same subject, there’s another interesting post by Adam W. Saxton, a member of the Microsoft SQL Server Escalation Services Team.  In this post he takes a closer look at the SQL Server 2008 Reporting Services version number after having installed CU2.

Conclusion: if you need to find out what version your server is running, do not rely on the version numbers that you see in the Object Explorer.  As Adam explained, one way is to look at the version numbers of the files that were included in the upgrade.  But that may a bit of an overkill.  My favorite way, assuming that all components of the SQL Server installation have been upgraded to the same version, is to use the following query:

SELECT @@VERSION;

 

On my machine that comes back with the following result:

Microsoft SQL Server 2008 (SP1) – 10.0.2531.0 (Intel X86)   Mar 29 2009 10:27:29   Copyright (c) 1988-2008 Microsoft Corporation  Developer Edition on Windows NT 5.1 <X86> (Build 2600: Service Pack 3)

And remember, have fun!

  • Share/Bookmark

Tags: , , , , , , , ,

Initially I was going to call this article “Struggling With Collation: The SeQueL”, but it just doesn’t have the same ring to it as “Fun With Strings”.  In that previous article I showed how you might get different results when loading data from a temporary table or table variable and I suggested that one way of solving this is by switching your data type to nvarchar.

Unicode Or Not?

Reason #1 For Not

Today I’m going to show you that nvarchar is not always what we want to use, especially if we don’t need to support Unicode strings.  Imagine a staging scenario when loading a data warehouse.  Often the Business Keys (BK) are strings, and depending on the source system, sometimes very long strings – I’ve seen situations with a combined business key of over 500 bytes!  (You can’t imagine what some data sources look like but that’s another story.)  Do we really want to convert these to Unicode, and thus double their size?  Furthermore, to improve lookups we put indexes on those BKs.  These indexes would double in size as well.  So no, we don’t really want to make these fields Unicode, and certainly not when we want our ETLs to perform as fast as possible.

Reason #2 For Not

That was reason number one why nvarchar is not always the solution.  And here comes reason number two.  In my scenario, the source tables are located in an Oracle database.  And guess what: by default Oracle’s ORDER BY behaves different than SQL Server’s ORDER BY (when using the regular Latin1_General_CI_AS or SQL_Latin1_General_CP1_CI_AS collations)!  By default Oracle uses binary string comparison to sort its data and the reason for it appears to be that that’s the only way to prevent a full table scan.  I’m no Oracle expert but that’s what the documentation states.

Here’s a little demonstration.  The following script prepares a table variable and selects the data from it, sorted ascending.

declare @tbl table( col1 varchar(20));
insert into @tbl select ‘AA’;
insert into @tbl select ‘A’;
insert into @tbl select ‘A-’;
insert into @tbl select ‘A A’;
insert into @tbl select ‘BA’;
insert into @tbl select ‘0′;
insert into @tbl select ‘1′;
insert into @tbl select ‘-0′;
insert into @tbl select ‘-1′;
insert into @tbl select ‘A0′;
insert into @tbl select ‘0A’;
insert into @tbl select ‘-A’;
insert into @tbl select ‘-B’;
insert into @tbl select ‘a’;
insert into @tbl select ‘b’;
insert into @tbl select ‘ ‘;

select * from @tbl
order by col1 asc;

I have executed it once just as stated above (while connected to a database that uses the SQL_Latin1_General_CP1_CI_AS collation) and once more while using nvarchar as data type for the column in the table variable.  The first execution will sort the data using a non-Unicode sorting algorithm, while the second execution will order the data according to the Unicode sorting method.  The results will be shown further below for easier comparison.

On Oracle I performed a similar procedure, as shown in following script.

select cast(‘AA’ as varchar(20)) as col1 from Dual union
select ‘A’ as col1 from Dual union
select ‘A-’ as col1 from Dual union
select ‘A A’ as col1 from Dual union
select ‘BA’ as col1 from Dual union
select ‘0′ as col1 from Dual union
select ‘1′ as col1 from Dual union
select ‘-0′ as col1 from Dual union
select ‘-1′ as col1 from Dual union
select ‘A0′ as col1 from Dual union
select ‘0A’ as col1 from Dual union
select ‘-A’ as col1 from Dual union
select ‘-B’ as col1 from Dual union
select ‘a’ as col1 from Dual union
select ‘b’ as col1 from Dual union
select ‘ ‘ as col1 from Dual
order by col1 asc;

The Oracle script doesn’t use a table variable, it just creates a result set using several select statements with a union in between.  But for our test that doesn’t matter, the results using this method are suitable.

In the table below you can see the result of the three executions.

SQL non-Unicode SQL Unicode Oracle Binary
< space
-0 0 -0
-1 -0 -1
-A 0A -A
-B 1 -B
0 -1 0
0A A 0A
1 a 1
A -A A
a A- A A
A A A A A-
A- A0 A0
A0 AA AA
AA b BA
b -B a
BA BA b

As you can see, they only agree on one thing: space really is the smallest character in my test set!  And that’s not what I want, I want all the data to be sorted consistently, no matter what the source is.

But Why Sorted?

You may wonder why I need to sort the data.  Well, some components in Integration Services expect the incoming data flows to be ordered.  One of the standard components that requires this is the Merge Transformation.  Another (custom!) component is Table Difference.  I could of course add a Sort Transformation to my Data Flow, but that would not be interesting for performance.  I want the data to come from the database server in the expected order.  So now I’ll show you how you can do that.

Taking Control!

SQL Server: ORDER BY … COLLATE …

On SQL Server this was fairly easy.  The ORDER BY clause has a COLLATE part where you can specify what collation should be used to order the data.  Because Oracle sorts its data using a binary algorithm, I’ll tell SQL Server to do that as well.  More precisely I’ll tell SQL Server to use the Latin1_General_BIN collation.  The updated SELECT statement from the T-SQL script above looks like this:

select * from @tbl
order by col1 collate Latin1_General_BIN asc;

Oracle: ORDER BY NLSSORT()

To ensure that results from Oracle are always returned using the same sorting algorithm, I will also tell the Oracle server to sort it’s data using the binary algorithm.

The first way I came up with was to change the NLS_SORT setting on the session.  That can be done by executing the following command before the SELECT statement:

ALTER SESSION SET NLS_SORT=BINARY;

This method is fine when you’re running the queries manually from a client such as Oracle SQL Developer.  However, in SSIS the OLE DB Source component will not accept anything else besides the SELECT statement.

Then I found another way.  There’s a function called NLSSORT() which you can apply to a column in the ORDER BY clause.  The following statement demonstrates how to use this function.  (I only show the ORDER BY clause as it can be applied to the Oracle script mentioned earlier.)

ORDER BY NLSSORT(col1, ‘NLS_SORT=BINARY’)

The following table shows the results from both binary sort queries:

SQL Binary Oracle Binary
-0 -0
-1 -1
-A -A
-B -B
0 0
0A 0A
1 1
A A
A A A A
A- A-
A0 A0
AA AA
BA BA
a a
b b

Finally I am able to get data from both Oracle and SQL Server using a consistent sort order.

But, How Big Is NULL?

However, even on this straightforward request, both database servers do not fully agree!  Here’s what they have to say about the topic:

“NULL is the smallest.”

“No, it’s the largest.”

“No, smallest!”

“Largest!!”

“Smallest.”

“Largest I tell you!!!”

“Bladiebla, not hearing you, anyway, it’s NOTHING!”

“No, it isn’t!”

*discussion goes on and on*

If I add NULL to my test data set, SQL Server will sort it first (thus NULL is the smallest value in my test set), while Oracle will put it last.  In my situation it wasn’t really an issue (the BKs are not supposed to be NULL), but it’s quite important to remember in cases where NULLs are actually possible.

Conclusion

When working with strings, always keep collation in mind.  And even more so when dealing with several different source systems!

Additional reference material:

Database Journal: The Globalization of Language in Oracle – The NLS_COMP and NLS_SORT variables

BOL 2008: How to: Sort Data for the Merge and Merge Join Transformations

  • Share/Bookmark

Tags: , , , , , ,

If you’ve ever needed to load a wide table, for instance one that consists of about 250 columns, you may have come to the conclusion that some manipulations in your SSIS package take forever.  Here’s a quick tip how to speed up your package development.

This tip is actually quite generic and applicable to daily Windows use.  It’s just that sometimes you forget about the most simple things because you’re used to a different way of working.  Why make it easy when difficult works fine, right?  Well, maybe not.

Okay, here it goes: don’t forget that you’ve got a keyboard, do not always use the mouse! The Business Intelligence Development Studio offers a very graphical way of designing your data flow (or control flow for that matter) but sometimes it’s really faster to use the keyboard.

An example.  Imagine the Derived Column Transformation.  To avoid typographical errors you want to select a column name from the tree view on the left in the Derived Column Transformation Editor.  And then you discover that the list of column names is not ordered alphabetically, and on top of that, there’s not way to rearrange them!  Good luck finding your column.

Derived Column Transformation Editor

And then you remember that you’ve got a keyboard.  By typing the first letters of the column name, the selected item will jump down to the first match.  Isn’t that a great invention?!  In the screenshot above I typed “pu” and it selected PurchaseOrderNumber, which is about 32 columns down from the top.

This tip not only works in our Derived Column Transformation but in any screen where a similar tree view appears, such as in the “Inputs and Outputs Properties” tab of the Advanced Editor of an OLE DB Source:

Advanced Editor of OLE DB Source

This can be useful if you need to tell the component that you’ve ordered the data on certain columns.  And why on earth would you want to do that, you may ask yourself.  Well, there are certain components that expect the data to be ordered, such as the Merge item.

If you apply this tip whenever you need to locate something in a tree view, you’ll find that your live has become just a bit easier.

And if you’re not satisfied with the way of working as described above, there’s always the XML!  That’s right, just right-click on your package in the Solution Explorer and select View Code.

Context menu on package in Solution Explorer

Then you’ll discover that a .dtsx file is made up completely of XML.  If you spend some time examining the XML code, you’ll probably find what you’re looking for.

As an example let’s say we want to locate our PurchaseOrderNumber from earlier and change its SortKeyPosition property to 1 (which means the data coming in is ordered by PurchaseOrderNumber).  So I open the XML and scroll down a bit.  I recognize a component which I’ve named “OLE_SRC Wide Table”:

<component id=”1″ name=”OLE_SRC Wide Table”

I also recognize my SELECT statement.  Scrolling down a bit further I see the

<outputs>

followed by a bunch of outputColumn items.  I put the cursor at the <outputs> line and hit CTRL-F.  Then I type “pur” and hit ENTER.

Find and Replace window in dtsx XML

The selection jumps to the column that we’re after.  I close the Find and Replace window and scroll to the right to find the sortKeyPosition property and I update its value to 1.

To confirm that this is working I open up the Advanced Editor again and locate our column.  SortKeyPosition is set to 1!

SortKeyPosition has been updated through XML

I hope that with this I showed you that sometimes it’s interesting to think about what you’re actually doing and try to find a way to do it better, and faster.

  • Share/Bookmark

Tags: , , , ,

Imagine an Integration Services package where you need to load data from an Oracle database.  Typical for ETL you try to use a timestamp field in the source table to determine if there were any changes since last load.  Let’s say that field is called LAST_MODIFICATION.  The query that you’re using to load the data is parameterized and the value passed in is the highest timestamp of the previous Load (or should I say Extract?).  So your query may look something like this:

SELECT * FROM THE_SCHEMA.THE_TABLE
WHERE LAST_MODIFICATION > to_date(?, ‘yyyy/mm/dd hh:mi:ss’)

In this query the question mark gets replaced with a datetime value, for instance ‘2009/08/27 21:26:32’.  (Yeah, I know, you shouldn’t select star but that’s not the point of this post.)

When running your package you suddenly get the following error:

ORA-01849: hour must be between 1 and 12

After some cursing and trial and error, you decide to have a closer look at Oracle’s to_date() function.  And there is your answer: HH returns the same as HH12, an hour between 1 and 12.  Which is not what we want here, we want hours starting at 1 and ending at 23.  So we should use HH24 instead.  A correct query would be:

SELECT * FROM THE_SCHEMA.THE_TABLE
WHERE LAST_MODIFICATION > to_date(?, ‘yyyy/mm/dd hh24:mi:ss’)

Another lesson learned… (and noted in case I need it again :-) )

  • Share/Bookmark

Tags: , , ,

If you’re using Integration Services 2008 and the Foreach Loop Container in the Control Flow, you’ll very likely encounter this bug.

The Foreach Loop has several enumerators available.  By default it selects the Foreach File Enumerator.  However, as the screenshot below shows, there’s no way to configure it – the Enumerator configuration group just shows blank space.

Foreach Loop Editor with empty Enumerator configuration

This phenomenon is caused by a bug which has been reported on Microsoft Connect.  The workaround, if you really need the File Enumerator, is to select another enumerator first and then switch back to the File Enumerator.  You’ll notice that the regular controls show up and on you go, define that folder name!

  • Share/Bookmark

Tags: , , , ,

Here’s a quick tip on the usage of the Merge, one of the Data Flow Transformation components in Integration Services.  When merging data from two different sources one of the issues that you’ll need to cope with are the data types.  The Merge component may start complaining about certain columns when the metadata has been changed higher in the flow.  The error would look like this:

Validation error. Data Flow Task: Data Flow Task: The metadata for “input column “YourColumn” (590)” does not match the metadata for the associated output column.

A common issue is the length of a string field.  My favorite way to solve these types of issue is by opening up the Advanced Editor, locating the particular field in the Output Columns and increasing its Length property to match the length of the longer field.

But unfortunately the Merge component does not have an Advanced Editor option.  The easiest way to solve it in this case is by first deleting the column in the Merge Transformation Editor and then adding it again.  Luckily the dropdown only shows the input fields that haven’t been matched yet. :-)

  • Share/Bookmark

Tags: , ,

« Older entries

© 2008-2010 A Developer's Blog All Rights Reserved