You are currently browsing articles tagged T-SQL.

Source: http://commons.wikimedia.org/wiki/File:Estrangela.jpgThis post is a follow-up to my presentation on Automating SSRS Deployments Using Built-in SQL Server Tools.

During that presentation, I demonstrated that you can use the Reporting Services web service in a custom application to get a list of your deployed objects from the server.  Using that functionality you’d be able to generate the batch script containing the RS commands which in their turn use the RSS scripts which I’ve demonstrated.

For those who missed the presentation but ended up here nevertheless, you can download the slides and demo code through my other follow-up post.

However, that’s not the only option!

In this post I’ll be using a technique which is one of my favorite methods of saving work.  Instead of typing a lot of code yourself, type just a bit of code to generate the code that you actually want.  Or, in other words and more concrete: today I’ll be using T-SQL to generate the RS commands for the batch file!

I will be querying some SSRS system tables, so take into account that the code may need changes for future versions of SQL Server, but I can confirm that it should work with both SQL Server 2008 and 2012.  SQL Server 2005 probably too but I don’t have that running anymore so can’t test it out.  Don’t hesitate to leave a comment if you can confirm that this will run on 2005.

The database on which you need to run these queries is called ReportServer, or ReportServer$YourInstance if you didn’t install it with the default instance name.

Figuring It All Out

One of the tougher parts of writing these queries was figuring out how the system tables can be linked with each other.  I studied some of the stored procedures and combined that knowledge with what I actually saw as content on my test system.  I’ll first share a couple of generic queries which may prove useful in case you just want to query the tables.  Then I’ll share the queries that will produce the RS statements.

Usually you’re only interested in the objects that belong to your project.  So each of the following queries has a filter to only show the content of my project folder, called Awesome Reports.

List Of Objects Dependent On Data Source

The following query shows a list of all objects that have a data source dependency.

--list of objects dependent on data sources
    case C.[Type]
        when 1 then 'Folder'
        when 2 then 'Report'
        when 3 then 'Resource'
        when 4 then 'Linked Report'
        when 5 then 'Data Source'
        when 6 then 'Report Model'
        when 7 then '???' --post comment if you know what 7 means
        when 8 then 'Shared Dataset'
        when 9 then 'Report Part'
    end ObjectType
    , C.Name
    , DS.Name DataSourceName
    , DSET_SRC.Name DatasetExternalDataSourceName
    , PARENT.Path
from dbo.Catalog C
inner join dbo.Catalog PARENT on PARENT.ItemID = C.ParentID
inner join dbo.DataSource DS on C.ItemID = DS.ItemID
left outer join dbo.DataSets DSET on DSET.LinkID = C.ItemID
left outer join dbo.DataSource DSET_SRC on DSET_SRC.ItemID = DSET.ItemID
where C.Type <> 5 --no data sources
    and C.Path like '/Awesome Reports/%'
order by c.Type, C.Name

So what does this return on my test system?

List of objects with data source dependency

In the case of a shared dataset, the DataSetDataSource name is what the data source is called inside the shared dataset.  To get to the real name of the data source, we need to join with the dbo.DataSource table through the dbo.DataSets table while using the right key fields.

List Of Datasets With Linked Data Source

The query below can be interesting to generate a list of your datasets with their data source dependency, including their location.

--list of datasets with linked data source
select DSET.Name DatasetName
    , DSET_PARENT.Path DatasetLocation
    , DS.Name DataSourceName
    , DS_PARENT.Path DataSourceLocation
from dbo.DataSets DSET
inner join dbo.Catalog DSET_C on DSET_C.ItemID = DSET.LinkID
inner join dbo.Catalog DSET_PARENT on DSET_PARENT.ItemID = DSET_C.ParentID
inner join dbo.DataSource DS on DS.ItemID = DSET.ItemID
inner join dbo.Catalog DS_C on DS_C.ItemID = DS.Link
inner join dbo.Catalog DS_PARENT on DS_PARENT.ItemID = DS_C.ParentID
where DSET_C.Path like '/Awesome Reports/%'

On my system I get the following result:

List of datasets with their data source dependency

Generating The RS Commands

Using the knowledge we got by examining above queries, we can now build some queries that generate the RS commands.  And here they are:

--generate DeployDataset commands
select '%RS% -i "%SCRIPTLOCATION%\DeployDataset.rss" -s %REPORTSERVER% -l %TIMEOUT% -e %ENDPOINT% -v dataset="'
     + DSET_C.Name + '" -v datasetLocation="' + DSET_PARENT.Path
     + '" -v dataSourceName="' + DS.Name
     + '" -v dataSourceLocation="' + DS_PARENT.Path + '"'
from dbo.DataSets DSET
inner join dbo.Catalog DSET_C on DSET_C.ItemID = DSET.LinkID
inner join dbo.Catalog DSET_PARENT on DSET_PARENT.ItemID = DSET_C.ParentID
inner join dbo.DataSource DS on DS.ItemID = DSET.ItemID
inner join dbo.Catalog DS_C on DS_C.ItemID = DS.Link
inner join dbo.Catalog DS_PARENT on DS_PARENT.ItemID = DS_C.ParentID
where DSET_C.Path like '/Awesome Reports/%'

--generate DeployReport commands
select '%RS% -i "%SCRIPTLOCATION%\DeployReport.rss" -s %REPORTSERVER% -l %TIMEOUT% -e %ENDPOINT% -v name="'
    + C.Name + '" -v folder="' + PARENT.Path + '"'
FROM dbo.Catalog AS C
inner join dbo.Catalog PARENT on PARENT.ItemID = C.ParentID
where C.Type = 2 --report
    and C.Path like '/Awesome Reports/%'

--generate LinkReportToDataSource commands
select '%RS% -i "%SCRIPTLOCATION%\LinkReportToDataSource.rss" -s %REPORTSERVER% -l %TIMEOUT% -e %ENDPOINT% -v reportName="'
    + REP.Name + '" -v parent="' + REP_PARENT.Path + '" -v dataSource="'
    + DS.Name + '" -v dataSourceLocation="' + DS_PARENT.Path + '"'
from dbo.Catalog AS REP
inner join dbo.Catalog REP_PARENT on REP_PARENT.ItemID = REP.ParentID
inner join dbo.DataSource DS on REP.ItemID = DS.ItemID
inner join dbo.Catalog DS_C on DS_C.ItemID = DS.Link
inner join dbo.Catalog DS_PARENT on DS_PARENT.ItemID = DS_C.ParentID
where REP.Type = 2 --report
    and REP.Path like '/Awesome Reports/%'
order by REP.Name asc, DS.Name asc

--generate LinkReportToDataset commands
SELECT '%RS% -i "%SCRIPTLOCATION%\LinkReportToDataset.rss" -s %REPORTSERVER% -l %TIMEOUT% -e %ENDPOINT% -v reportName="'
    + REP.Name + '" -v reportLocation="' + REP_PARENT.Path + '" -v datasetName="'
    + DSET.Name + '" -v datasetLocation="' + DSET_PARENT.Path + '"'
from dbo.Catalog AS REP
inner join dbo.Catalog REP_PARENT on REP_PARENT.ItemID = REP.ParentID
inner join dbo.DataSets AS DSET ON REP.ItemID = DSET.ItemID
inner join dbo.Catalog DSET_C on DSET_C.ItemID = DSET.LinkID
inner join dbo.Catalog DSET_PARENT on DSET_PARENT.ItemID = DSET_C.ParentID
where REP.Path like '/Awesome Reports/%'

And here’s the output:

Generating the RS commands


It may not be a recommended practice to use the Reporting Services system tables, but as long as you’re only reading them and you keep in mind that your code may break in a next version of SQL Server, they can surely be useful!

Have fun!



Tags: , , , ,

I came across an interesting little enhancement which I’d like to share with you.  I’m sure you’re all familiar with the following error message:

Msg 2601, Level 14, State 1, Line 4
Cannot insert duplicate key row in object ‘dbo.#t’ with unique index ‘PK_Unique’.

That’s right, the “hey, you’re inserting junk, stop that right now!” message.

However, earlier this week I encountered a variation of that error in our ETL logs:

Msg 2601, Level 14, State 1, Line 3
Cannot insert duplicate key row in object ‘dbo.#t’ with unique index ‘PK_Unique’. The duplicate key value is (1, 2).

I’m sure I don’t need to explain how useful that extra sentence can be, right?  So, then I started digging because obviously I want that error to always include the offending values.

Recently our database back-end servers have been replaced with new machines running SQL Server 2008 R2 SP1.  And indeed, as of R2 SP1, this error message has gotten an upgrade!

If you want to check for yourself, the following code snippet can be used to generate the error:

create table #t (PK1 int, PK2 int);
create unique index PK_Unique on #t(PK1, PK2);
insert into #t values (1, 2),(1, 2);

Now, message 2601 is not the only one complaining about duplicate keys.  Another example can be generated using the following query:

create table #t (PK1 int unique, PK2 int unique);
insert into #t values (1, 2),(1, 2);

Executing that snippet on SQL Server 2008 R2 SP1 results in this error:

Msg 2627, Level 14, State 1, Line 1

Violation of UNIQUE KEY constraint ‘UQ__#t________C5776555123EB7A3′. Cannot insert duplicate key in object ‘dbo.#t’. The duplicate key value is (2).

Do you notice the difference?  The message did indeed get an upgrade, but is not able to support a simultaneous violation on multiple columns, while msg 2601 can.  Hopefully they can get that fixed by SQL Server 2012 RTM!

Out of curiosity I decided to get a closer look at other possible variations on this error and dug into the sys.messages table using this query:

select * from sys.messages
where text like ('%duplicate key%');

Results of that query:

All error messages containing "duplicate key"

So, there’s one more to investigate: 1505.  To get that error generated is not that complicated.  We actually already have all the statements, just need to use them in the right order, like so:

create table #t (PK1 int, PK2 int);
insert into #t values (1, 2),(1, 2);
create unique index PK_Unique on #t(PK1, PK2);

And what do we get?

Msg 1505, Level 16, State 1, Line 1

The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name ‘dbo.#t__________________________________________________________________________________________________________________00000000001D’ and the index name ‘PK_Unique’. The duplicate key value is (1, 2).

Nice, message 1505 also supports multiple columns, that makes two out of three!

Ow, don’t forget to clean up (yes, I even do that for temporary tables).

drop table #t;

Additional Info

Some further digging around on the internet brought me to the following suggestion on Microsoft Connect: Duplicate Key Values by Anton Plotnikov.  Looks like our enhanced statements are the result of that request.

There’s another suggestion as well: Tweak To Duplicate Key Message by Louis Davidson.  Louis suggests to have all offending values added to the message as well, but in a different situation than when dealing with multiple unique columns.  You can create one insert statement that inserts more than one record with offending values, and that’s the situation he’s referring to.  If you feel that’s important, get over to Connect and cast vote!  I do think that the number of values would need to get limited in that case, we might end up with really long error messages otherwise.

One last link, also by Louis Davidson.  He also wrote a blog post about the error message when he found out about it in Denali.

Have fun!



Tags: ,

In this post I’m actually just going to repeat what I answered to a question on Experts Exchange.

The OP was having an issue getting a record count of all tables in his database.  It was working through the Management Studio, using the undocumented but well-known sp_MSforeachtable stored procedure.

exec sp_MSforeachtable 'select ''?'' TableName, count(*) Cnt from ?';

However, the OP needed the list of record counts inside an Integration Services package.  And this didn’t work out as expected, because sp_MSforeachtable uses a temporary table, something that SSIS does not like.

So I proposed to use a script similar to the following, possibly in a stored procedure.

DESCRIPTION: Returns a list of record counts for each table in the database
AUTHOR:         Valentino Vranken
CREATED:     2011/05/21
VERSION:     1.0
COPIED FROM: http://blog.hoegaerden.be
declare @tempTable table
    TableSchema nvarchar(256),
    TableName nvarchar(256),
    Cnt bigint

declare @sql nvarchar(4000);
declare @tableSchema nvarchar(256);
declare @tableName nvarchar(256);
declare @columnName sysname;
declare @cnt bigint;

declare tableCursor cursor for
    where TABLE_TYPE = 'BASE TABLE';

open tableCursor;

fetch next from tableCursor into @tableSchema, @tableName;

while @@FETCH_STATUS = 0
    set @sql = 'select @cnt = COUNT(*) from [' + @tableSchema + '].[' + @tableName + ']';

    exec sp_executesql @sql, N'@cnt bigint output', @cnt = @cnt output;

    insert into @tempTable select @tableSchema, @tableName, @cnt;

    fetch next from tableCursor into @tableSchema, @tableName;

close tableCursor;
deallocate tableCursor;

select * from @tempTable;

Yeah, I know, cursors are bad… But hey, at least it works!  Just don’t schedule it for execution every five seconds.

I seem to be repeating myself.  I wrote something similar when I posted the script about finding empty columns many months ago.

The most important part of the script is the following:


This is using one of the information schema views to retrieve a list of tables in the database.  The INFORMATION_SCHEMA.TABLES view contains a list of tables and views, so the where clause is needed to filter out the views.  INFORMATION_SCHEMA.TABLES is just one of many interesting views available in SQL Server, and should always be used in favor of the sys tables.

What does the output look like?  Here’s part of the output when executed on the AdventureWorks2008R2 database:

Record count of all tables in the AdventureWorks2008R2 database (although not all are shown, you get the idea)

Have fun!



Tags: , ,

Have you ever needed to write a script that would store the Windows login of the user that executes the script in a table, without hard-coding it?  And you couldn’t find your way through the user jungle of T-SQL?  Then read on!

A normal approach would be to rely on the IntelliSense feature of SSMS and start typing something like:

select use

Which gives you a dropdown like:


Cool you think, there are four system functions to return something user-related, one of them would be the Windows user, right?  Well, wrong.

Let’s find out what these functions actually return, and locate a method that does return what we require.

The Different User-Related System Functions


According to the BOL:

Allows a system-supplied value for the database user name of the current user to be inserted into a table when no default value is specified.

So, how many times did you read that?  And do you now know what you’ll get?  Me neither.  As far as I understand it, you’ll get the database user name of the connection context.  Not what we need.


According to BOL:

Returns the identification number for a database user.

The function accepts one argument that accepts a string representing a username.  When no argument is provided, it will return the user ID of the execution context.

But that’s not important because the BOL also mentions that it will be removed in a future version and that you thus shouldn’t use it anymore.  No problem if you need it though, its replacement is called DATABASE_PRINCIPAL_ID.


According to BOL:

Returns a database user name from a specified identification number.

This is the inverse of USER_ID.  It accepts one argument representing the user ID.  When omitted, the user of the current execution context will be returned.  Which is actually the equivalent of the USER function.

Still no Windows user though, so let’s move on to the next one.


Not documented in the BOL.

It seems to represent the sid column found through following query:

select * from sys.sysusers

And the sys.sysusers view is documented.  Apparently “sid” stands for “security identifier” and it’s a varbinary.  But not only is the view documented, it is also deprecated.  You should use the following view instead:

select * from sys.database_principals

That query returns a list of principals in the current database, and one of the fields is indeed our dear sid.  However, we can now conclude that sid is definitely not what we’re looking for in this little quest, it doesn’t even closely resemble a Windows user name.  So let’s move on again!


Hmm, “current user”, will this finally be the Windows user with which I’m running my queries?

According to BOL:

Returns the name of the current user. This function is equivalent to USER_NAME().

Equivalent to USER_NAME, so still not what we’re looking for.


According to BOL:

SESSION_USER returns the user name of the current context in the current database.

You can probably guess by now that again this is not what we’re searching for.  Next.


According to BOL:

Allows a system-supplied value for the current login to be inserted into a table when no default value is specified.

Ah, finally a description that contains the word “login”!  And indeed, here’s what it returns:


That is indeed my Windows user.  Mission accomplished!

Please note: if you’re running this through a window connected via a SQL Server login, you will get that login as opposed to your Windows login.  And if you get ‘sa’ while you’re connected to a production server you should look into making some security changes. :-)


Two more similar functions exist: SUSER_NAME and SUSER_SNAME.

Both return a login name as well, and both accept a parameter.  When executed without parameter, they return the login name of the current user.  Their only difference is the parameter that they accept.  SUSER_NAME accepts the “login identification number of the user”, which is an int, and SUSER_SNAME accepts the “login security identification number”, a varbinary(85) (remember sid from above?).

Quick Query

The following query shows how to use all the functions, with the default values for any parameters.


Have fun!


Tags: , ,

With the holidays I haven’t been able to write much.  So I’ll make up for it with this +3000 words article.  If you’re reading this early in the morning, you’d better get a double espresso first ;-)

In this article I will demonstrate a method that can be used to calculate aggregations over a certain period of time in the past, or LastXMonths aggregations as I’m calling them throughout the article.  I’ll be using T-SQL, SQL Server Integration Services and a relational database as source.  More specifically I will be using the Merge Join data transformation in SSIS, and Common Table Expressions in T-SQL.

Version-wise I’m using SQL Server 2008 R2, but this method should work as of SQL Server 2005.  Furthermore I’m using the Contoso DWH, available for download at the Microsoft Download Center.  (In case you’re wondering, it’s the .BAK file.)

You can download the finished SSIS package from my Skydrive.  (The file is called MergeJoin.dtsx.)

The Scenario

Let’s say we’ve got a relational database containing some sales figures.  Management has asked for sales-related data to be available somewhere for easy analysis.  Ideally a cube would be built for that purpose but as budgets are currently tight, a temporary solution needs to be provided meanwhile.  So it’s been decided that an additional table will be created, populated with the exact data as required by management.  This table should contain all details (number of items and amount of the sale) about products sold, grouped by the date of the sale, the zip code of the place where the sale occurred and the category of the product.

Furthermore, each record should contain the sum of all sales of the last month for the zip code and product category of each particular record.  Two additional aggregations should calculate the sales for the last three months and last six months.

A Simple Example

To make sure we’re all on the same track on the requirements, here’s a small example to illustrate the expected outcome.

Small example displaying the expected outcome of the process

I’ve omitted the SalesAmount numbers for readability reasons.  The records are ordered chronologically, with the oldest first.  As you can see, the bottom record shows 16 as value for Last6MSalesQuantity.  This is the result of the SalesQuantity of the current record and the SalesQuantity of the previous record, which happens to fall within the timespan of the lowest record’s SaleDate going back six months.  The two other records do not fall within the six months timespan and are thus not included in the sum for the Last6MSalesQuantity of that bottom record.

Fetching The Data Into A Table

Our scenario requires that the sales figures are calculated and put into a new table.  Let’s first start with creating the queries to fetch the data.

Step 1: The Daily Numbers

The easiest part are the daily sales numbers.  These can be retrieved fairly easy from the Contoso data warehouse, just by using a GROUP BY clause as shown in the following query.

--daily sales
select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
    SUM(FS.SalesAmount) SalesAmount_SUM,
    SUM(FS.SalesQuantity) SalesQuantity_SUM
from dbo.FactSales FS
    inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
    inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
    inner join dbo.DimProductSubcategory DPS
        on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
    inner join dbo.DimProductCategory DPC
        on DPC.ProductCategoryKey = DPS.ProductSubcategoryKey
    inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
order by DD.Datekey asc, DS.ZipCode asc, DPC.ProductCategoryName asc;

Part of the result of that query looks like this:

Result of the daily sales query

Nothing special to mention so far so let’s continue to the next step.

Step 2: The Monthly Numbers

In this step, we’ll use the query from step 1 as base for the full query.  I’ll first show you the query and then provide you with some explanation of what’s going on.

declare @numberOfMonths tinyint = 1;
with DailySalesData as
    select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
        SUM(FS.SalesAmount) SalesAmount_SUM,
        SUM(FS.SalesQuantity) SalesQuantity_SUM
    from dbo.FactSales FS
        inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
        inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
        inner join dbo.DimProductSubcategory DPS
            on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
        inner join dbo.DimProductCategory DPC
            on DPC.ProductCategoryKey = DPS.ProductSubcategoryKey
        inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
    group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
UniqueRecordsPerDay as
    select Datekey, ZipCode, ProductCategoryName
    from DailySalesData
    group by Datekey, ZipCode, ProductCategoryName
select UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName,
    SUM(DSD.SalesAmount_SUM) SalesAmount_SUM,
    SUM(DSD.SalesQuantity_SUM) SalesQuantity_SUM
from DailySalesData DSD
    inner join UniqueRecordsPerDay UR
            on UR.ProductCategoryName = DSD.ProductCategoryName
        and UR.ZipCode = DSD.ZipCode
        and DSD.Datekey
            between DATEADD(month, -@numberOfMonths, UR.Datekey + 1)
            and UR.Datekey
group by UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName
order by UR.Datekey asc, DSD.ZipCode asc, DSD.ProductCategoryName asc;

The query uses a variable called @numberOfMonths.  This will allow us to use the same query for the totals of last month, as well as for the Last3M and the Last6M numbers.  All that’s needed is changing the variable to 3 or 6.

But how does the query get to the results?  To start, it uses two CTEs (Common Table Expressions).  The first one is called DailySalesData.  And the query for that CTE should look familiar to you by now: it’s the one from step 1, without the ORDER BY clause.

The second CTE is called UniqueRecordsPerDay and gives us one record for each unique date, zip code and product category as found in the Contoso data.  The DateKey, ZipCode and ProductCategoryName fields are our key grouping fields.  And this CTE is actually the key to calculating the monthly aggregated data, as I’ll explain next.

What the main query does is the following.  It selects the data from the DailySalesData CTE and joins that with the unique records per day recordset.  All grouping key fields need to be included in the join.  However, as you can see, to add the DateKey into the join I’m not just using the equals operator but the BETWEEN keyword instead.  I’ve also used the DATEADD function to subtract the number of months as specified through the @numberOfMonths variable.  That statement is saying: “give me all records starting from DateKey, going back @numberOfMonths”.  The query again groups by the key fields to be able to sum the records up.

This construction ensures that the SalesAmount_SUM and SalesQuantity_SUM fields represent the sum for the record’s zip code and product category and for the period as indicated by the @numberOfMonths variable.

Step 3: Merging It All Together Into One Table

Now that we know how to retrieve the data, we still need to get it into a table.  One option would be to use the INSERT statement on the daily records, followed by UPDATE statements to populate the monthly (1, 3, 6) aggregated columns.  However, I’m a BI guy so let’s use an SSIS package to get to the result (plus it allows me to illustrate the Merge Join data flow transformation :-) ).

So open up the BIDS and create a new package.  Drop a Data Flow Task into the Control Flow and add a Connection Manager connecting to your Contoso DWH.  Then switch to the Data Flow page.

Nothing special so far I believe.  Next we need to set up four Data Flow Sources: one for the daily figures, one for the monthly, one for the 3M and one for the 6M data.

Setting Up The Data Sources

Throw in an OLE DB Source component, configure it to use your connection manager and copy/paste the first query above into the command textbox.  Again nothing special, right?

However, the Merge Join component expects its incoming data to be sorted.  That’s why I’ve included the ORDER BY clause in the queries above.  But that’s not all.  Connecting our data source to a Merge Join transformation without any additional change will result in an error such as the following:

Validation error. Data Flow Task Merge Join [457]: The input is not sorted. The “input “Merge Join Left Input” (458)” must be sorted.

To avoid this error, we need to explicitly inform our data flow that the data is actually ordered, and we need to give it all the details: on what fields has the data been ordered and in what order!  And that needs to be done through the Advanced Editor.

So, right-click the OLE DB Source and select Show Advanced Editor.

Right-click OLE DB Source to open up the Advanced Editor

In the Advanced Editor, navigate to the last tab called Input and Output Properties and select the “OLE DB Source Output” node in the tree structure on the left.  Doing that will show the properties for the selected output and one of those properties is called IsSorted.  By default it is set to False.  Set it to True.

Tip: double-clicking the label of the property will swap its value to the other value.  This can be useful in cases when you need to change several options but even here is saves a couple of clicks.  It’s all about optimization. :-)

Advanced Editor on OLE DB Source: the IsSorted property

At this moment the component knows that the incoming data is sorted, but it still doesn’t know on what fields.  To specify that, open up the OLE DB Source Output node, followed by the Output Columns node.  You’ll now see the list of fields.  As specified in the query, the data is ordered firstly on DateKey, secondly on ZipCode and thirdly on ProductCategoryName.

Select DateKey to see its properties.

Advanced Editor of OLE DB Source showing the SortKeyPosition property

The property in which we’re interested here is called SortKeyPosition.  By default it is set to zero.  When the incoming data is sorted,  this property should reflect in what order the data is sorted, starting with one for the first field.  So in our case here the value should be set to 1.

Set the SortKeyPosition property for ZipCode to 2 and for ProductCategoryName to 3.

That’s one of the four OLE DB sources set up.  The other three will be easier as we can start from the first one.  So, copy and paste the source component, open it up by double-clicking it and replace the query with our second query from earlier, the one returning the monthly figures.  Ow, and give it a decent name but I’m sure you knew that.

Create the third source component in the same way, but change the value for the @numberOfMonths variable to 3.  And again the same process for source number four, changing the variable’s value to 6.

Here’s what we have so far:

Four OLE DB sources set up - waiting to be merged

Merging The Sources Into One Flow

Next up is merging the incoming flows.  Drag a Merge Join data flow transformation under the Daily Sales source and connect the source to the Merge Join.  That will open the following Input Output Selection screen.

Input Output Selection window

A Merge Join expects two inputs: one is called the Left Input and the other is called the Right Input.  Select Merge Join Left Input as value for the Input dropdown.

Close the popup window and connect the second source (with the monthly data) as well to the Merge Join.  There’s only one input remaining so this one is automatically the right input – no popup window is shown.

Next we need to configure the Merge Join so that it merges the data as expected.  Open the Merge Join Transformation Editor by double-clicking the component.

Merge Join Transformation Editor

By default the Join type dropdown is set to Inner join.  In our situation that’s good enough.  In the case that only one record exists for a certain zip code and product category on a given day, the monthly data for this record will be the sum of just that one record but in any case: there’s always at least one record for each incoming flow to be combined with each other.

As you can see, because both incoming flows are ordered in the same way, it automatically knows on which fields to put the join.

By default, no output fields are created as the white bottom half of the screenshot indicates.

Now I’ll show you a screenshot of the expected setup:

Merge Join Transformation Editor set up as expected

There are several ways to specify the output fields.  The first method is by using the dropdown in the Input column.  Selecting a value there will populate a dropdown in the column called Input Column (djeez, that was one column too much).  Here’s what that method looks like:

Specifying the output fields by using the dropdowns

Selecting a value in the second column will then give you a default value for the Output Alias.  This default can be freely modified.  As you may have guessed, this is not my preferred method – way too many comboboxes.

Another method of specifying the output fields is by using the checkboxes in front of the fields in the top part of the window.  I believe the larger screenshot above says it all.  Just check the fields that you need and then change their default Output Alias to whatever suits you.   In my example here I only needed to modify the alias for the last two records.

With our first Merge Join set up, only two are remaining.  So drag in a second Merge Join from the Toolbox, connect the output of the first join as Left Input on the second join and add the output of the third OLE DB source as Right Input.

Interesting to note here is that the output of the Merge Join is sorted in the same manner as its inputs.  One way of verifying this is by right-clicking the connector between the two joins and choosing Edit.

Right-click data flow connector and select Edit to open up Data Flow Path Editor

That opens up the Data Flow Path Editor.

Tip: double-clicking the connector will also open the editor!

Examine the Metadata of the Data Flow Path to verify the sort order

As you can see in the above screenshot, the metadata page shows a list of the available fields with some properties, such as the Sort Key Position.  Now if that doesn’t look familiar?! :-)

So far, the second Merge Join has been added and connected but it hasn’t been configured yet.  The process is very similar to the way we’ve set up the first join.  Just select all fields from the left input by checking all the checkboxes and select the two SUM fields from the right input.  Don’t forget to give those SUM fields a clear name.

Two joins done, one remaining.  Just drag one in and connect it with the second join plus the last remaining OLE DB source.  I won’t go into further details here, it’s exactly the same as I just explained for the second join.

Here’s what the Data Flow should look like:

The Data Flow with all the Merge Joins connected

And here’s what the third Merge Join should look like:The third Merge Join as set up for the example

An Error That You May Encounter

When using sorted data flows and the Merge Join component, you may encounter the following error message:

An error that you may encounter while using the Merge Join component

And now in words for the search engines:

The component has detected potential metadata corruption during validation.

Error at Data Flow Task [SSIS.Pipeline]: The IsSorted property of output “Merge Join Output” (91) is set to TRUE, but the absolute values of the non-zero output column SortKeyPositions do not form a monotonically increasing sequence, starting at one.

Yeah right, you had to read that twice, didn’t you?  And the best is yet to come:

Due to limitations of the Advanced Editor dialog box, this component cannot be edited using this dialog box.

So there’s a problem with your Merge Join but you cannot use the Advanced Editor to fix it, hmm, and you call that the ADVANCED editor?  Is there anything more advanced perhaps?  Well, actually, there is.  It’s called the Properties pane.  With the Merge Join selected, one of the properties there is called NumKeyColumns.  That property reflects on how many columns the incoming data is sorted.  And currently it contains the wrong value.  Changing its value to the correct number of columns will remove the error.

Properties pane displaying the Merge Join's properties, including NumKeyColumns

In case you’re wondering when you might encounter this particular problem, here’s how you can simulate it.  (Don’t forget to make a copy of the package before messing around with it.)

With the package as it currently is, remove the ZipCode field from the first two sources by unchecking it in the Columns page of the OLE DB Source Editor.

The sources are now complaining so open up their Advanced Editor and correct the SortKeyPosition of the ProductCategoryName field: it should become 2 instead of 3 because ZipCode was 2 and has been removed.

Now try to open the first Merge Join.  The first time it will complain about invalid references so delete those.  With the references deleted, if you now try to open the Merge Join editor, you’ll see the error we’re discussing here.  To fix it, change the NumKeyColumns property of the Merge Join to 2 instead of 3.

Adding The Destination Table

Now there’s only one step remaining: adding a destination for our merged data.  So, throw in an OLE DB Destination and connect it with the output of the last Merge Join:

An OLE DB Destination connected to the join that merges it all together

I’ll just use a quick and dirty way of creating a new table in the database.  Open up the OLE DB Destination Editor by double-clicking it and select a Connection Manager in the dropdown.  Now click the New button next to the Name of the table or the view dropdown.

That opens up the Create Table window, with a CREATE TABLE query pre-generated for you for free.  Isn’t that nice?  Change the name of the table to something nice (at least remove those spaces, yuk!!) and click OK.

The Create Table window

The new table is created at the moment that the OK button gets clicked.

Right, so are we there?  Well, almost.  As you can see now in the next screenshot, the BIDS does not want us to click the OK button just yet.

The OLE DB Destination Editor with the Mappings still missing

To resolve that warning, just open the Mappings page.  As the names of the input columns are matching exactly with the names of the fields in the destination table, everything will be automagically configured at this moment.  So now you can close the window with the OK button.

And that’s it!  Everything is set up to populate the new table with the aggregated figures, as requested by management.  To give it a run, right-click your package in the Solution Explorer and guess what… select Execute Package!  If everything has been configured as expected, you should get some green boxes soon.  And some data in the table, like this:

The final result: sales figures aggregated over different periods in time


In this article I’ve demonstrated a way to aggregate data over different periods in time, using T-SQL and Integration Services.  Obviously this method does not replace the flexibility that one gets when analyzing data stored in an OLAP cube, but it can be a practical method when you quickly need to provide aggregated data for management.

Have fun!



Merge Join Data Flow Transformation

Common Table Expressions (CTEs)

DATEADD() function


Tags: , , ,

« Older entries

© 2008-2019 BI: Beer Intelligence? All Rights Reserved