SQL Server

You are currently browsing articles tagged SQL Server.

As a BI developer I regularly use Excel to analyze my data on correctness.  In Excel it’s easy to filter out unrelated records, as you’ll probably already know.

How do I get my data in Excel?

With a simple copy/paste.  Since SQL Server 2008, you can just right-click on the results in the Management Studio and select Copy with Headers.  This will put the selected data into memory, including the column headers.

SSMS 2008: right-click on results to Copy with Headers

You can even make a selection in the Results pane and only copy that, as shown in the screenshot above.  Here’s what it looks like when pasted into Excel:

Data with headers pasted into Excel

How do I make the columns wider?

Excel will not expand the column widths by default, shown in that last screenshot.  You could manually enlarge them one by one, but that’s not convenient when you last filled column is called “DI”.  Instead, I use the following trick.  First I select the whole sheet by clicking on the grey square in the top-left corner:

Select the whole sheet using the grey square

Then I double-click on the vertical divider in between columns A and B.  This will adjust all column widths so that the data fits nicely.

Double-click the divider in between two columns to adjust width automatically

Does Excel know SELECT DISTINCT?

The other day I needed to filter out all duplicate records out of a really large dataset.  I first tried by activating the regular filter.  This is an Excel feature that I use a lot.  Just clicking the Filter button on the Data ribbon will add the small dropdowns next to each column’s header:

Excel's Filter functionality

But in those dropdowns I couldn’t find an option which would get me to the required result.

(Note that I added a couple of duplicate records in my dataset.)

After a little search, I found the magic checkbox!  I didn’t know that it could be so easy.

To find it, click the Advanced button in the Data ribbon.

The Advanced button on the Data ribbon

In the Advanced Filter window that appears you see a checkbox called Unique records only.

Advanced Filter: Unique records only checkbox

There’s your SELECT DISTINCT!

And here’s the sheet with the duplicates filtered out:

Excel sheet after filtering out duplicate records with the "Unique records only" option

Do you also see that the filter dropdowns are gone, although I did not remove them?  The Advanced Filter cannot be used in combination with the regular filter.  Once you activate the regular filter by clicking the Filter button, your duplicates will re-appear.

If you want to further filter your data after filtering out all duplicate records, you should select the Copy to another location option in the Advanced Filter window.  Copy it below your data on the same sheet and apply the regular filter on that new set of data.

Filtered set of data below full dataset on same sheet

Copying to another sheet is not supported.  Apparently you can only copy to the active sheet.  Should you try it out anyway, you’ll get this message:

Warning: You can only copy filtered data to the active sheet

Happy filtering!

Valentino.

  • Share/Bookmark

Tags: ,

Lately I’ve been using Visual Studio Team System 2008 Database Edition with the GDR R2 add-on, also known as “The Data Dude”, to compare databases with each other.

And during a schema compare I noticed that the dude can be quite stubborn now and then.

Here’s an example and an explanation of why, as far as I can tell, the dude is behaving that way.

The Scenario

At work we use the data dude’s schema compare to prepare our deployment scripts.  The most fun can be had when preparing a script to deploy the latest finalized version, version x, from the Development (DEV) to the Acceptance (UAT) environment.

In our DEV database there have already been changes for the next version, version x+1, so these changes need to be skipped.

Now imagine the following situation:

  • a table called TableA exists
  • a new field called NewFieldA was added to TableA
  • a view called ViewA exists and was modified – it uses TableA but not NewFieldA

The changes to ViewA belong with version x, while the changes to TableA are part of version x+1.  So the script that I’m going to generate should contain an ALTER VIEW ViewA … but not an ALTER TABLE TableA ….

The Behavior

Using the dude, I open up the dropdown on the line that states Different definition on TableA and I select Skip.  The value gets changed from Update to Skip.  However, when the focus changes to another line in the grid the value changes back to Update!  Furthermore, it’s now grayed out, impossible to change!

The first time that you see this happening you can’t believe your eyes and try it once more, first closing everything and then re-doing your actions.  Just to see if you didn’t select the wrong option somewhere.  But alas, the dude remains stubborn!

Tip: if you have just executed the compare and you find that some update actions cannot be modified, try to do a refresh (right-click on the background of the comparison window and click Refresh).  I noticed that after the refresh you will get the dropdown for these lines and thus will be able to change the Update Action.

The Explanation

It took me a while to realize, but here’s what seems to be happening.  Like with all things stubborn, it helps if you know what the thing in question is actually being stubborn for.

The dude knows that ViewA depends on TableA and it also knows that both ViewA and TableA where changed.  However, the dude is a little short-sighted, it doesn’t know that NewFieldA is not used in ViewA.  It doesn’t know that the change to ViewA was not related to the change on TableA.  So it automatically assumes that you’re making a mistake and prevents you from messing up.  It thinks, “You stupid user, I know it better, I am smart software!  And that’s not all, I AM THE DATA DUDE! And if there’s one thing you don’t want, that’s to mess with the Data Dude!”.

Here’s a screenshot of the dude’s stubbornness:

Grayed-out Update Action: the result of a stubborn data dude

In this particular case it refused to skip certain new tables, even though there were no dependent objects that needed those tables!  The first two lines show the grayed-out items.  These tables were located under a new schema and I had to put the schema’s Update Action to Skip to get the tables be skipped as well.  The Update Action for the tables was changed automatically when I modified the action for the schema.

But, if you ask me, I don’t see why I shouldn’t be able to Skip a new table even when I am going to create the new schema to which it belongs?  It’s the table that depends on the schema, not the other way around.

The Feature Request

It would have been great if I could manually override the dude’s decision.  As that was not the case, I have to manually change the generated script so that it only contains the changes that I want.  As long as it’s technically feasible, I think manual changes to the Update Action should be allowed.

One More Tip

Take a good look at the following screenshot.

Beware the miniature exclamation mark in the icon's corner!

By default, the node that reads Different dependencies was collapsed.  So you think, good, the line says Skip so nothing will happen.  Except, did you notice that minuscule exclamation mark in the bottom-left corner of the icon?  That means that other actions may be happening further down the dependency tree, as illustrated in the screenshot.

So don’t let the dude mislead you into thinking nothing will happen when it says Skip.  Sometimes the actual actions are hidden further down and you need to open up every node with the exclamation mark on the icon to verify if that’s what you want.

Have fun training the dude,

Valentino.

  • Share/Bookmark

Tags: , ,

In this article I will show you a couple of different T-SQL queries to fetch aggregated data.  The main purpose is to illustrate how the OVER clause can be used to aggregate data.

For the examples I will use data from the AdventureWorks2008R2 database, available at CodePlex.

The Data

The AdventureWorks 2008 R2 database contains a view called Sales.vSalesPerson.  This is the data with which I’ll be working in the examples below.  Here’s what it looks like:

My Working Data

I’ve hidden some fields so that all the relevant ones are in view.

The Scenario

Your manager has asked you to create one query, to be executed on the Sales.vSalesPerson table, that returns a list of:

  • all employees (FirstName, LastName, JobTitle, CountryRegionName, StateProvinceName, City),
  • their sales of last year (SalesLastYear),
  • the sum of the sales of last year for their country,
  • the average of the sales of last year compared to all employees with the same type of phone (PhoneNumberType)
  • the overall average and sum of the sales of last year.

Using Derived Tables

No problem you say, coming right up.  So you start building your query, retrieving all fields as requested.

After quite some typing, here’s what your query looks like:

select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName,
    S.StateProvinceName, S.City, S.SalesLastYear,
    GeographicSales.SalesLastYearGeographic_SUM,
    SalesByPhoneType.SalesLastYearByPhoneNumberType_AVG,
    SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM
from Sales.vSalesPerson S
--Derived Table 1: the overall aggregates
cross join (
    select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG
    from Sales.vSalesPerson
) SalesSUM
--Derived Table 2: the aggregate on Country level
inner join (
    select CountryRegionName, SUM(SalesLastYear) SalesLastYearGeographic_SUM
    from Sales.vSalesPerson
    group by CountryRegionName
) GeographicSales on GeographicSales.CountryRegionName = S.CountryRegionName
--Derived Table 3: the aggregate on phone type
inner join (
    select PhoneNumberType, AVG(SalesLastYear) SalesLastYearByPhoneNumberType_AVG
    from Sales.vSalesPerson
    group by PhoneNumberType
) SalesByPhoneType on SalesByPhoneType.PhoneNumberType= S.PhoneNumberType;

The main query is retrieving all fields as requested.  Further down there are three derived table queries, each one retrieving aggregates on a different level.

The first derived table is retrieving the overall aggregates.  These are cross-joined with every record in our main query so for each record the totals will be the same, which is what we want.

The second derived table retrieves the aggregates on Country level, including the CountryRegionName.  This is done using the conventional GROUP BY method.  The CountryRegionName is the key on which the derived table is joined to the main table.

The third derived table uses a similar system, this time for the aggregate on phone type.

And here’s the query’s output:

Output of the query using subqueries

Happy with this result, you go up to the cafeteria to finally have lunch with your colleagues (who left 15 minutes earlier but you wanted to get your query finished first).

Using The OVER Clause

During lunch you explain to your peers what kind of funny request you got from management and told them how you solved it.

Then one of them speaks up and says: “Want to know how you can avoid all that typing?  Use the OVER clause!  I’ll show you when we are back at our desks.”

After lunch, here’s what your colleague helps to produce:

select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName,
    S.StateProvinceName, S.City, S.SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM
from Sales.vSalesPerson S
--Derived Table 1: the overall aggregates
cross join (
    select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG
    from Sales.vSalesPerson
) SalesSUM;

As you can see, derived tables 2 and 3 are gone.  They have been replaced with the OVER clause, in combination with PARTITION BY.  What you say with the OVER clause is: “partition the dataset by the fields specified in the PARTITION BY and apply the aggregation on those partitions”.  Another word for this is aggregate window function.

As you like the approach, you ask your co-worker how you can get rid of that cross join.  He doesn’t really know but then another colleague who overheard your conversation says: “On this blog the other day I read that you can use the OVER clause and partition by anything you want.  As long as it’s a constant, it will work!”.

So you give that a try and you end up with the following final query:

select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName,
    StateProvinceName, City, SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    AVG(SalesLastYear) OVER (PARTITION BY 'duh') SalesLastYear_AVG,
    SUM(SalesLastYear) OVER (PARTITION BY 1) SalesLastYear_SUM
from Sales.vSalesPerson;

As illustrated in the example, you can use any constant value to calculate overall aggregates over the whole dataset using the OVER clause.

You happily thank your colleagues and tell them that next time you’ll be able to join them for lunch on time.

About a week later you’re explaining to one of your friends how you’ve gotten to know the OVER clause. After hearing how you use it to aggregate over the whole dataset, he smiles and says: “I know how you can simplify it even more! Don’t partition at all!”.

Taking a closer look it turns out that the PARTITION BY is actually optional:

Ranking Window Functions
< OVER_CLAUSE > :: =     OVER ( [ PARTITION BY value_expression, ... [ n ] ]            <ORDER BY_Clause> )Aggregate Window Functions
< OVER_CLAUSE > :: =     OVER ( [ PARTITION BY value_expression, ... [ n ] ] )

See those square brackets? Means it’s optional.

So here is the real final query:

select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName,
    StateProvinceName, City, SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    AVG(SalesLastYear) OVER () SalesLastYear_AVG,
    SUM(SalesLastYear) OVER () SalesLastYear_SUM
from Sales.vSalesPerson;

Conclusion

When you compare the final query with the first one, tell me, which one would you prefer to maintain?  Do you prefer to have lunch with your peers or to arrive late and miss all the fun?

Have fun!

Valentino.

References

OVER Clause (Transact-SQL)

Join Fundamentals

SELECT (Transact-SQL)

  • Share/Bookmark

Tags: , , ,

The purpose of this article is to demonstrate how to retrieve data from an Excel sheet and put it in a table in a SQL Server database.

Introduction

Anyone who’s ever used a computer for a significant amount of time has probably come into contact with Excel, the spreadsheet application part of the Microsoft Office suite. Its main purposes are to perform calculations and create charts and pivot tables for analysis.

But people have great imagination and invent new uses for it every day.  I’ve even seen it used as a picture album.  (Sorry dad, but I know you won’t be reading this anyway. :-) )  Ever since he had this specific YACI, or “Yet Another Computer Issue”, because his PC wasn’t powerful enough to open his 45 MB Excel file, uh, “picture collection”, he took some evening classes.  He’s now putting his Photoshopped pictures in PowerPoint…  Anyway, let’s get back on track now.

Another use, and the one that’s the subject of this article, is when Excel has been used as a database.  Come on, you know what I’m talking about, with the first row containing the column headers followed by possibly thousands of data rows.  The following screenshot contains an example, and is also the file that I will be using in this article.  I took all records from the Production.Product table in the AdventureWorks 2008R2 database and dumped them in Excel.

An Excel sheet used as a data store

At some point people will realize, either because someone told them or because they lost some data due to inattentiveness, that it wasn’t a really good idea to keep all that data in an Excel sheet.  And they’ll ask you to put it in a real database such as SQL Server.

That’s what I’m going to show you in the next paragraphs: how to import data from Excel into SQL Server.

Using OPENROWSET() To Query Excel Files

There are actually several different ways to achieve this.  In this article I will use the OPENROWSET() function.  This is a T-SQL function that can be used to access any OLE DB data source.  All you need is the right OLE DB driver.  The oldest version which I could confirm that contains this function is SQL Server 7.0, good enough to say that any version supports it.

My sample Excel files are located in C:\temp\.  This folder contains two files: Products.xls and Products.xlsx.  The first file is saved in the old format, Excel 97-2003, while the second file was saved from Excel 2010.  Both files contain the same data.  The sheet containing the list of products is called ProductList.

And here are the queries:

--Excel 2007-2010
SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0 Xml;HDR=YES;Database=C:\temp\Products.xlsx',
    'SELECT * FROM [ProductList$]');

--Excel 97-2003
SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Excel 8.0;HDR=YES;Database=C:\temp\Products.xls',
    'select * from [ProductList$]');

These queries are just returning the data from the Excel file into the Results window, when executed using the Management Studio.  To insert the data into a table, uncomment the INTO clause.  When uncommented, the statement retrieves the data from the Excel sheet and puts it into a newly-created local temporary table called #productlist.

Furthermore, the query assumes that the first row contains the header.  If that’s not the case, replace HDR=YES with HDR=NO.

Note: if you get an error message when running the query, look further down in this article.  I’ve covered a couple of them.

With the INTO clause uncommented and the query executed, the temporary table can now be queried just like any other table:

SELECT * FROM #productlist

What Type Is Your Data?

Let’s have a look if this method of using a SELECT INTO in combination with OPENROWSET and a temporary table is smart enough to interpret the correct data types of the data coming in.  Use the following command to describe the metadata of the temporary table:

USE tempdb;
GO
sp_help '#productlist';

Because a temporary table is stored in the tempdb, the sp_help command should be issued against that database.

Here’s the part of the output in which we’re interested:

The data types used when combining OPENROWSET with SELECT INTO

As you can see, anything that looks like text will be put in a field of type nvarchar(510) and anything that looks like a number (integers, floating-point numbers, datetime values, …) is put into a float(53).  Not a lot of intelligence there.  This is the result when no formatting was put on the cells in Excel.

As an experiment I’ve changed the format of some fields in the Excel file and then retried the SELECT INTO statement.  What did I change?  I identified ProductID as being a number without any decimals, changed StandardCost and ListPrice to a currency with four decimal digits and I changed SellStartDate and SellEndDate to a custom date/time format showing both date and time.

The effect on the table creation was not completely as I would have expected:

SELECT INTO with some field types changed

ProductID is still being stored into a float field, even though in Excel it’s defined as having no decimals.  And the datetime values are not recognized either.  Okay, I used a custom format there, so maybe it’s due to that.

It’s up to you of course how you use this method of importing the data.  You can put your records into a temporary table to process further, or you can create a table with the expected data types upfront and import the data directly into that one.

Some Possible Issues

Let’s cover some issues related to this method.

Enable ‘AD Hoc Distributed Queries’

The OPENROWSET() function expects that the ‘Ad Hoc Distributed Queries’ option is enabled on the server.  When that’s not the case you’ll see the following message:

Msg 15281, Level 16, State 1, Line 1

SQL Server blocked access to STATEMENT ‘OpenRowset/OpenDatasource’ of component ‘Ad Hoc Distributed Queries’ because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of ‘Ad Hoc Distributed Queries’ by using sp_configure. For more information about enabling ‘Ad Hoc Distributed Queries’, see "Surface Area Configuration" in SQL Server Books Online.

This is one of the advanced options.  To enable it you can use the following command:

sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO

sp_configure 'Ad Hoc Distributed Queries', 1;
GO
RECONFIGURE;
GO

To get a good look at all the different settings, just run the sp_configure procedure without any parameters.

Note: if you’re not the administrator of the server, you should talk to the DBA who’s responsible before attempting this.

The File Needs To Be Closed

When the Excel file is not closed, you’ll end up with the following error:

Msg 7399, Level 16, State 1, Line 1

The OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)" reported an error. The provider did not give any information about the error.

Msg 7303, Level 16, State 1, Line 1

Cannot initialize the data source object of OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)".

So close the file and try the query again.

OLE DB Driver Not Installed

The OPENROWSET() function uses OLE DB, so it needs a driver for your data source, in this case for Excel.  If the right driver is not installed, you’ll see the following error (or similar, depends on the version used).

Msg 7302, Level 16, State 1, Line 1

Cannot create an instance of OLE DB provider "Microsoft.ACE.OLEDB.12.0" for linked server "(null)".

To solve the issue, install the right driver and try again.

How can you tell what drivers are installed?  Open up the ODBC Data Source Administrator window (Start > Run > type ODBCAD32.EXE and enter) and have a look in the Drivers tab.  The following screenshot (taken on a Dutch Windows XP) shows both the JET 4.0 driver for Excel 97-2003 and the fairly-new ACE driver for Excel 2007.

odbcad32.exe - ODBC Data Source Administrator

The drivers can be downloaded from the following pages on the Microsoft site:

Excel 97-2003 Jet 4.0 driver

Excel 2007 ACE driver – 12.00.6423.1000

Excel 2010 ACE driver (beta) – 14.00.4732.1000

Sidenote: the Excel 2010 driver is not supported on Windows XP, but I was able to query the 2010 Excel sheet using the 2007 driver.  I guess that this is the result of the Office Open XML standard which was introduced in Office 2007.

Driver backward-compatibility

The ACE drivers are backwards-compatible.  So the following queries are working perfectly:

--old Excel with new ACE driver - working query 1
SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Excel 8.0;HDR=YES;Database=C:\temp\Products.xls',
    'SELECT * FROM [ProductList$]');

--old Excel with new ACE driver - working query 2
SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0;HDR=YES;Database=C:\temp\Products.xls',
    'SELECT * FROM [ProductList$]');

In other words, you won’t be needing that first link for the Jet driver.  For the full story have a look at this blog post by Adam Saxton of the CSS SQL Server Escalation Services team.

The 64-bit Story

So, what if you’re running a 64-bit OS?  I’ll start by saying that I had quite some issues getting OPENROWSET to work, but finally I managed it.  Following is a list of my attempts, each time with the resulting message.  And finally I’ll show you how I got it to work.  The problem was something really unexpected…

ACE 14 64-bit through SSMS

My main laptop is running Windows 7 64-bit, Office 2010 64-bit and SQL Server 2008 R2 64-bit.  So I installed the 64-bit version of the ACE 14 driver, which happens to be the first OLE DB driver for Excel that ships in 64-bit.  But when I execute my query I’m getting the following message:

Msg 7403, Level 16, State 1, Line 1

The OLE DB provider "Microsoft.ACE.OLEDB.14.0" has not been registered.

Is this because SSMS ships only in 32-bit?  Maybe, but I’m not able to install the 32-bit driver.  It doesn’t allow me to because I’ve got Office in 64-bit installed.  The installer throws me the following error:

Microsoft Access database engine 2010 (beta) - You cannot install the 32-bit version of Access Database engine for Microsoft Office 2010 because you currently have 64-bit Office products installed...

ACE 12 32-bit on a 64-bit machine

When I check the installed drivers using the 32-bit version of the ODBC Data Source Administrator (located in C:\Windows\SysWOW64), I notice that the ACE 12 driver is installed.  However, trying to use that one from the Management Studio gives me this:

Msg 7399, Level 16, State 1, Line 1

The OLE DB provider "Microsoft.ACE.OLEDB.12.0" for linked server "(null)" reported an error. The provider did not give any information about the error.

Msg 7330, Level 16, State 2, Line 1

Cannot fetch a row from OLE DB provider "Microsoft.ACE.OLEDB.12.0" for linked server "(null)".

The Results pane shows all the columns with the right column names, retrieved from Excel.  But the driver seems to have a problem retrieving the actual data.

This issue with error 7330 is mentioned in the following thread on the SQL Server MSDN forum, but unfortunately the proposed solution didn’t solve the problem in my case.

64-bit SQLCMD using ACE 14 driver

I also tried using the 64-bit version of sqlcmd.exe, but strangely enough that throws the same error.

Using sqlcmd 64-bit to query Excel

I actually expected this last method to work, after all, everything is now running in 64-bit.  But alas, it didn’t…

One more go…

After some more trial and error, I have actually found a way to get the query to work.  I don’t have a logical explanation on why it’s behaving the way it is, but, well, it is working…

This query is running fine:

SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0 Xml;HDR=YES;Database=C:\temp\Products.xlsx',
    'SELECT * FROM [ProductList$]');

But this one isn’t:

--Excel 2007-2010
SELECT * --INTO #productlist
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0 Xml;HDR=YES;Database=C:\temp\Products.xlsx',
    'SELECT * FROM [ProductList$]');

It’s exactly the same query, only difference is the comment line at the start.  And even weirder, if I add a space after the double-dash, the query works fine as well!

Then I decided to remove the commented INTO clause.  This made the weird behavior disappear.  So for some reason SQL Server doesn’t like the OPENROWSET function combined with comments inside the query.  The strange behavior also disappears when a space is added between the double-dash and the INTO keyword.

Uh, computers can be so much fun, right? :-)

If anyone has got an explanation on this strange behavior: please do post a comment!  For now my conclusion is: don’t use comments when creating an OPENROWSET query.

 

IMPORTANT UPDATE (April 11, 2010): it seems that the current installer for the ACE 14 driver contains a bug and registers it as being “Microsoft.ACE.OLEDB.12.0” instead of “Microsoft.ACE.OLEDB.14.0” .  This explains some of the issues shown above.  Some evidence on the issue:

Microsoft Connect: Access Database Engine 2010 installation issue to use with ADO access technology to access data from Jet database (.mdb files)

The ‘Microsoft.ACE.OLEDB.14.0′ provider is not registered ….. (see last comment)

Excel Services, ODC and Microsoft.ACE.OLEDB.14.0

 

Conclusion

The above has shown that OPENROWSET() can be a useful function, given the right circumstances.  But in the wrong setting it can be quite cumbersome to get to work.

I would recommend this method only for one-off quick imports, such as when you as a developer are given a bunch of data in a spreadsheet and need to get it into the database, one way or another.  I would not use it for an automated import process.  For that we’ve got a more interesting alternative which I’ll cover in an upcoming article.

Have fun!

Valentino.

References

BOL 2008: Special Table Types (incl. temporary tables)

BOL 2008: OPENROWSET() function

BOL 2008: the INTO clause

CSS SQL Server Engineers: How to get a x64 version of Jet?

  • Share/Bookmark

Tags: , , ,

If you’re interested in seeing the next Service Pack for SQL Server 2008 released, vote for it at this Microsoft Connect page!  Now that the release date for R2 has been announced, hopefully the next one is for SQL 2008 Service Pack 2.

If you’re still on 2005 and thus more interested in an SP4 for that version, it has also been posted at Connect.

How did I find out?  Through the following blog posts at SSQA.NET:

SQL Server 2008 SP2

SQL Server 2005 SP4

Update (27 Feb 2010):

SQL Server 2008 SP2 is scheduled for Q3 2010.

SQL Server 2005 SP4 is scheduled for Q4 2010.

See this post on the Microsoft SQL Server Release Services blog for details.

  • Share/Bookmark

Tags: , ,

image

For the second year in a row, the Belgian SQL Server User Group organizes the SQL Server Day.

Like last year, the event will take place at Utopolis Mechelen.

If you’re ready for a day filled with learning and networking opportunities and you can make yourself available at December 3, register at www.sqlserverday.be!

There will be some interesting speakers such as:

  • Chris Webb – SSAS and Gemini
  • Henk van der Valk – SSIS world record
  • Dirk Gubbels – Change tracking
  • my colleagues Luc Lemaire and Mario Van Hissenhoven – Report Builder 3.0

If you’re attending as well and would like to meet up for a chat, don’t hesitate to post a comment here!

See you there!

  • Share/Bookmark

Tags: , ,

Initially I was going to call this article “Struggling With Collation: The SeQueL”, but it just doesn’t have the same ring to it as “Fun With Strings”.  In that previous article I showed how you might get different results when loading data from a temporary table or table variable and I suggested that one way of solving this is by switching your data type to nvarchar.

Unicode Or Not?

Reason #1 For Not

Today I’m going to show you that nvarchar is not always what we want to use, especially if we don’t need to support Unicode strings.  Imagine a staging scenario when loading a data warehouse.  Often the Business Keys (BK) are strings, and depending on the source system, sometimes very long strings – I’ve seen situations with a combined business key of over 500 bytes!  (You can’t imagine what some data sources look like but that’s another story.)  Do we really want to convert these to Unicode, and thus double their size?  Furthermore, to improve lookups we put indexes on those BKs.  These indexes would double in size as well.  So no, we don’t really want to make these fields Unicode, and certainly not when we want our ETLs to perform as fast as possible.

Reason #2 For Not

That was reason number one why nvarchar is not always the solution.  And here comes reason number two.  In my scenario, the source tables are located in an Oracle database.  And guess what: by default Oracle’s ORDER BY behaves different than SQL Server’s ORDER BY (when using the regular Latin1_General_CI_AS or SQL_Latin1_General_CP1_CI_AS collations)!  By default Oracle uses binary string comparison to sort its data and the reason for it appears to be that that’s the only way to prevent a full table scan.  I’m no Oracle expert but that’s what the documentation states.

Here’s a little demonstration.  The following script prepares a table variable and selects the data from it, sorted ascending.

declare @tbl table( col1 varchar(20));
insert into @tbl select ‘AA’;
insert into @tbl select ‘A’;
insert into @tbl select ‘A-’;
insert into @tbl select ‘A A’;
insert into @tbl select ‘BA’;
insert into @tbl select ‘0′;
insert into @tbl select ‘1′;
insert into @tbl select ‘-0′;
insert into @tbl select ‘-1′;
insert into @tbl select ‘A0′;
insert into @tbl select ‘0A’;
insert into @tbl select ‘-A’;
insert into @tbl select ‘-B’;
insert into @tbl select ‘a’;
insert into @tbl select ‘b’;
insert into @tbl select ‘ ‘;

select * from @tbl
order by col1 asc;

I have executed it once just as stated above (while connected to a database that uses the SQL_Latin1_General_CP1_CI_AS collation) and once more while using nvarchar as data type for the column in the table variable.  The first execution will sort the data using a non-Unicode sorting algorithm, while the second execution will order the data according to the Unicode sorting method.  The results will be shown further below for easier comparison.

On Oracle I performed a similar procedure, as shown in following script.

select cast(‘AA’ as varchar(20)) as col1 from Dual union
select ‘A’ as col1 from Dual union
select ‘A-’ as col1 from Dual union
select ‘A A’ as col1 from Dual union
select ‘BA’ as col1 from Dual union
select ‘0′ as col1 from Dual union
select ‘1′ as col1 from Dual union
select ‘-0′ as col1 from Dual union
select ‘-1′ as col1 from Dual union
select ‘A0′ as col1 from Dual union
select ‘0A’ as col1 from Dual union
select ‘-A’ as col1 from Dual union
select ‘-B’ as col1 from Dual union
select ‘a’ as col1 from Dual union
select ‘b’ as col1 from Dual union
select ‘ ‘ as col1 from Dual
order by col1 asc;

The Oracle script doesn’t use a table variable, it just creates a result set using several select statements with a union in between.  But for our test that doesn’t matter, the results using this method are suitable.

In the table below you can see the result of the three executions.

SQL non-Unicode SQL Unicode Oracle Binary
< space
-0 0 -0
-1 -0 -1
-A 0A -A
-B 1 -B
0 -1 0
0A A 0A
1 a 1
A -A A
a A- A A
A A A A A-
A- A0 A0
A0 AA AA
AA b BA
b -B a
BA BA b

As you can see, they only agree on one thing: space really is the smallest character in my test set!  And that’s not what I want, I want all the data to be sorted consistently, no matter what the source is.

But Why Sorted?

You may wonder why I need to sort the data.  Well, some components in Integration Services expect the incoming data flows to be ordered.  One of the standard components that requires this is the Merge Transformation.  Another (custom!) component is Table Difference.  I could of course add a Sort Transformation to my Data Flow, but that would not be interesting for performance.  I want the data to come from the database server in the expected order.  So now I’ll show you how you can do that.

Taking Control!

SQL Server: ORDER BY … COLLATE …

On SQL Server this was fairly easy.  The ORDER BY clause has a COLLATE part where you can specify what collation should be used to order the data.  Because Oracle sorts its data using a binary algorithm, I’ll tell SQL Server to do that as well.  More precisely I’ll tell SQL Server to use the Latin1_General_BIN collation.  The updated SELECT statement from the T-SQL script above looks like this:

select * from @tbl
order by col1 collate Latin1_General_BIN asc;

Oracle: ORDER BY NLSSORT()

To ensure that results from Oracle are always returned using the same sorting algorithm, I will also tell the Oracle server to sort it’s data using the binary algorithm.

The first way I came up with was to change the NLS_SORT setting on the session.  That can be done by executing the following command before the SELECT statement:

ALTER SESSION SET NLS_SORT=BINARY;

This method is fine when you’re running the queries manually from a client such as Oracle SQL Developer.  However, in SSIS the OLE DB Source component will not accept anything else besides the SELECT statement.

Then I found another way.  There’s a function called NLSSORT() which you can apply to a column in the ORDER BY clause.  The following statement demonstrates how to use this function.  (I only show the ORDER BY clause as it can be applied to the Oracle script mentioned earlier.)

ORDER BY NLSSORT(col1, ‘NLS_SORT=BINARY’)

The following table shows the results from both binary sort queries:

SQL Binary Oracle Binary
-0 -0
-1 -1
-A -A
-B -B
0 0
0A 0A
1 1
A A
A A A A
A- A-
A0 A0
AA AA
BA BA
a a
b b

Finally I am able to get data from both Oracle and SQL Server using a consistent sort order.

But, How Big Is NULL?

However, even on this straightforward request, both database servers do not fully agree!  Here’s what they have to say about the topic:

“NULL is the smallest.”

“No, it’s the largest.”

“No, smallest!”

“Largest!!”

“Smallest.”

“Largest I tell you!!!”

“Bladiebla, not hearing you, anyway, it’s NOTHING!”

“No, it isn’t!”

*discussion goes on and on*

If I add NULL to my test data set, SQL Server will sort it first (thus NULL is the smallest value in my test set), while Oracle will put it last.  In my situation it wasn’t really an issue (the BKs are not supposed to be NULL), but it’s quite important to remember in cases where NULLs are actually possible.

Conclusion

When working with strings, always keep collation in mind.  And even more so when dealing with several different source systems!

Additional reference material:

Database Journal: The Globalization of Language in Oracle – The NLS_COMP and NLS_SORT variables

BOL 2008: How to: Sort Data for the Merge and Merge Join Transformations

  • Share/Bookmark

Tags: , , , , , ,

Recently I was investigating an issue related to ordering data.  As a test, I ran the following script:

– Sorting data from a temporary table 
declare @tbl table ( ProductNumber varchar(25) ); 
insert into @tbl select ‘BBBB’; 
insert into @tbl select ‘AAAA’; 
insert into @tbl select ‘A-B’; 
insert into @tbl select ‘A123′; 
select * from @tbl order by ProductNumber asc;

– Sorting data from a table variable 
create table #tbl ( ProductNumber varchar(25) ); 
insert into #tbl select ‘BBBB’; 
insert into #tbl select ‘AAAA’; 
insert into #tbl select ‘A-B’; 
insert into #tbl select ‘A123′; 
select * from #tbl order by ProductNumber asc; 
drop table #tbl;

 

And it gave me this result:

Results from script

As you can see, the order of the data coming from a temporary table is different than the data from a table variable, even though the same data type is used.  At first I thought, how on earth is this possible?  After spending some time pondering about this problem (and after a colleague reported that when he executed the above script, the result was as expected), it came to me.  Don’t tell me it’s a collation problem?!  Well, it is.  (In case you don’t know collation, in short “Collations specify the rules for how strings of character data are sorted and compared”.  More info through that previous link and here.)

When I executed the script, my Management Studio session was connected to AdventureWorks2008, one of my test databases.  When I switched to tempdb, the result was normal – both queries returned the same result.  Then I had a look at the collations.  My tempdb is using Latin1_General_CI_AS while the AdventureWorks2008 database is apparently using SQL_Latin1_General_CP1_CI_AS.

A temporary table is created in the tempdb and thus uses the collation of the tempdb for its string columns.  And a table variable uses the same collation as the database to which the session is connected.  Which is quite logical because otherwise you would get collation conflicts when using the table variable in combination with a table from the active database in the same query.  I will demonstrate that with the following script:

declare @tbl table ( ProductNumber varchar(25) );
insert into @tbl select ‘BBBB’;
select * from AdventureWorks.Production.Product P
where P.ProductNumber in (select ProductNumber from @tbl);

 

On my (SQL Server 2008) server I also have the old SQL2005 AdventureWorks database up and running, which is using the Latin1_General_CI_AS collation.  Executing the above script (which references that AdventureWorks DB) while being connected to AdventureWorks2008 (or any other DB that uses a different collation from Latin1_General_CI_AS) results in the following error:

Msg 468, Level 16, State 9, Line 3
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Latin1_General_CI_AS" in the equal to operation.

Changing the active database to any other DB that uses the Latin1_General_CI_AS collation and then executing the query results in a positive execution.

Now, to get back to the initial issue of sorting inconsistency, even though the collations are not exactly the same, they are both Latin1, Case Insensitive, Accent Sensitive and still they don’t sort the data in the same way??  For an explanation on that I found the following page on the Microsoft Support site: Comparing SQL collations to Windows collations.  In short: a Windows collation (such as Latin1_General_CI_AS) uses a different comparison algorithm than a SQL collation (SQL_Latin1_General_CP1_CI_AS).  A Windows collation uses the same algorithm as for Unicode data, even when the data is non-Unicode.

All that means is that in our test example, in order to get the data in the right order all the time, we could switch to Unicode fields:

– Sorting data from a temporary table 
declare @tbl table ( ProductNumber nvarchar(25) ); 
insert into @tbl select ‘BBBB’; 
insert into @tbl select ‘AAAA’; 
insert into @tbl select ‘A-B’; 
insert into @tbl select ‘A123′; 
select * from @tbl order by ProductNumber asc;

– Sorting data from a table variable 
create table #tbl ( ProductNumber nvarchar(25) ); 
insert into #tbl select ‘BBBB’; 
insert into #tbl select ‘AAAA’; 
insert into #tbl select ‘A-B’; 
insert into #tbl select ‘A123′; 
select * from #tbl order by ProductNumber asc; 
drop table #tbl;

 

Executing that returns the following result, no matter what the active database is:

Script result when using Unicode fields

And it also means that you need to be careful when retrieving data from different sources (which is how I came across the issue in the first place).  If you need to have your data sorted in a particular, consistent way and you’re doing that by using an ORDER BY in a SELECT statement on the source system, double-check if all sources are using the same collation!

In case the above left you wondering about what collation to use: the SQL collations are there for backward compatibility.  For new developments use the Windows collations!

 

See my Fun With Strings article for another story related to collation.

 

Additional reference material:

SQL Server Central – Comparing Table Variables with Temporary Tables

  • Share/Bookmark

Tags: , ,

If you don’t have anything planned on next September 2 and you’re interested in some free SQL Server-related learning: it’s the 24 Hours of PASS!

image

You can even stay in your lazy chair at home because it’s an online event, no worrying about bus/train/plane/hotel/…  Just install the software (or browser plug-in, I actually don’t know because I haven’t performed the preparation procedure yet) and off you go.

I have registered for the following 5 sessions myself:

  • Session 10 (Dev) – Working with Spatial Data in SQL Server 2008 (Greg Low)
  • Session 11 (DBA) – Effective Indexing (Gail Shaw)
  • Session 12 (BI) – Reporting Services Inside Out the Things You Should Know (Simon Sabin)
  • Session 13 (Dev) – Query Performance Tuning 101 (Grant Fritchey)
  • Session 16 (DBA) – Database Compatibility Settings: What They Really Do .. and Don’t Do (Don Vilen)

Yep, it will be a busy holiday.  That same day they’ll be delivering our new combi oven, ideally that would be right after session 13 ends.  Fingers crossed.

Anyway, I’ll be seeing you September 2?  Or well, maybe not as it’s an online event…

Happy learning!

  • Share/Bookmark

Tags: , , ,

For this example I’ll be using the Adventure Works cube running on SQL Server Analysis Services 2008 and Reporting Services 2008.

In case you don’t have the AdventureWorks databases and cubes yet, they’re available at CodePlex.

Scenario

The sales department has asked for a report that displays the number of product items sold during a selected period.  As the company is active in two different markets, both the internet and reseller numbers should be shown.  The figures need to be grouped by product category, with drilldown to product level through subcategory.

Besides the period filter, it should be possible to filter on product category to limit the number of items shown.

Also, the background of the numeric cells should get a color depending on the value in the cell.  Colors range from red for low sales figures to green for high sale volumes.  The ranges are variable and should thus be configurable using 3 threshold parameters.  Following table shows the ranges as the department has requested them:

Value X Background Color
X < LowThreshold Red
LowThreshold <= MiddleThreshold Orange
MiddleThreshold <= X < HighThreshold Yellow
HighThreshold < X Green

 

Selecting The Data

At first sight this seems like a fairly simple report.  So you start building your MDX query using the Query Designer:

MDX Query Designer

Two filters have been specified: one of them is a date range and the other is based on product category.

Visualizing The Data

Then you drag a tablix onto the report body and play around with it until you get to the following:

Tablix in design mode

This is what it looks like when rendered, all seems to work fine:

Rendered report

To get the textbox background coloured based on the thresholds, you’ve produced an expression.  This expression is specified in Textbox Properties > Fill > Fill Color and looks like the following:

=Switch
(
    Fields!Internet_Order_Quantity.Value < Parameters!LowThreshold.Value, "#ff0e0e",
    Fields!Internet_Order_Quantity.Value >= Parameters!LowThreshold.Value
        and Fields!Internet_Order_Quantity.Value < Parameters!MiddleThreshold.Value, "#ff922d",
    Fields!Internet_Order_Quantity.Value >= Parameters!MiddleThreshold.Value
        and Fields!Internet_Order_Quantity.Value < Parameters!HighThreshold.Value, "#fff70f",
    Fields!Internet_Order_Quantity.Value >= Parameters!HighThreshold.Value, "#5cff21"
)

It’s a simple Switch statement using the threshold parameters.

A Missing Field Issue

So you deploy your report to the server for the users to test.  All is quiet, until someone starts complaining that the colouring doesn’t always work, for instance when filtering on Components.  Of course, you don’t always believe what the user says and try it out for yourself:

Rendered report with missing field issue

Indeed, the background is no longer coloured for the internet sales.  On top of that, the BIDS shows a couple of warnings in its output window:

[rsMissingFieldInDataSet] The dataset ‘ProductSales’ contains a definition for the Field ‘Internet_Sales_Amount’. This field is missing from the returned result set from the data source.

[rsErrorReadingDataSetField] The dataset ‘ProductSales’ contains a definition for the Field ‘Internet_Sales_Amount’. The data extension returned an error during reading the field. There is no data for the field at position 4.

Hang on, but I am selecting the field in my dataset, how can it be missing?  Except, this is MDX and OLAP, not SQL and OLTP.  By default, the MDX Query Designer uses NON EMPTY in the SELECT statement.  This means that the rows where there are no values for the selected measures will not be contained in the result set.  It also means that the complete measure will be omitted in the case that there are no values for it in any of the rows, which is the reason for our problem.

You could choose to not use NON EMPTY in the query.  To achieve this using the designer, right-click in the results pane and click the Include Empty Cells item.

MDX Query Designer result pane popup menu

Keep in mind that this will result in more rows in your result set because you’re now selecting all the empty measure cells as well.  Depending on your report requirements this may not be the desired effect.  On the other hand, it could be exactly what you want.  If our sales department had asked that the report should always show all products, even when there are no sales for the period, then we’d need to query the cube in this way.

For the sake of the example (and to save some trees in case the sales department is going to print the report :-) ) we will not choose this option.

Attempt to fix #1

As the field does not always exist, you decide that it’s a good idea to test for its existence.  A field in a resultset has an IsMissing property which serves that purpose.  So you adapt your expression to the following:

=IIF(Fields!Internet_Order_Quantity.IsMissing, Nothing,
    Switch
    (
        Fields!Internet_Order_Quantity.Value < Parameters!LowThreshold.Value, "#ff0e0e",
        Fields!Internet_Order_Quantity.Value >= Parameters!LowThreshold.Value
            and Fields!Internet_Order_Quantity.Value < Parameters!MiddleThreshold.Value, "#ff922d",
        Fields!Internet_Order_Quantity.Value >= Parameters!MiddleThreshold.Value
            and Fields!Internet_Order_Quantity.Value < Parameters!HighThreshold.Value, "#fff70f",
        Fields!Internet_Order_Quantity.Value >= Parameters!HighThreshold.Value, "#5cff21"
    )
)

However, when filtering on Components the same problem still occurs.  How can this be?  Expressions in SSRS are built using Visual Basic where expressions are evaluated completely.  In our case both the True and the False part of the IIF function are evaluated even when it will always be true.

On to another attempt to get this working.

(Attempt to) fix #2

The previous fix attempt has shown that it’s not possible to use an expression for the field validity test.  At least, not in the way we’ve tried until now.  Let’s try using custom code.

Custom code can be added to a report through the Code page in the Report Properties dialog box (accessible through the menu Report > Report Properties… or by right-clicking the report’s yellow background).

Report Properties > Code dialog box

Let’s start with a small extra requirement.  When a measure is not present in a row, such as the Internet Order Quantity for the products in the Components category, the report should display a zero instead of blank space.  To get this done we again need to test on whether or not the field exists in the result set.

The following Visual Basic function accepts a Field object and returns the value of the field when the field exists or zero when the field does not exist.

'returns the field's value or zero if the field does not exist
Public Function GetValue(field as Field) as Long
  If (field.IsMissing) Then
    Return 0
  ElseIf (IsNothing(field.Value)) Then
    Return 0
  Else
    Return field.Value
  End If
End Function

This function can now be used in an expression anywhere in the report.  Here’s what the expression looks like for the Value of the textbox that shows the Internet Order Quantity:

=Code.GetValue(Fields!Internet_Order_Quantity)

The same expression is used for the textboxes that display the sum values:

=Sum(Code.GetValue(Fields!Internet_Order_Quantity))

Attention: the function calls above are passing the actual Field object, not the Value property of the field, so not Fields!Internet_Order_Quantity.Value.

So, on to getting our coloring working as required.  For this we need a function that returns the right color for the given amount.  Something like this:

Public Const ColorLow As String = "#ff0e0e"      'red
Public Const ColorLowMid As String = "#ff922d"   'orange
Public Const ColorMidHigh As String = "#fff70f"  'yellow
Public Const ColorHigh As String = "#5cff21"     'green

Public Function GetColor(field as Field, low as Integer, mid as Integer, high as Integer) as String
  If (field.IsMissing) Then
    Return ColorLow
  ElseIf (IsNothing(field.Value)) Then
    Return ColorLow
  Else
    Select Case field.Value
      Case Is < low
        Return ColorLow
      Case Is < mid
        Return ColorLowMid
      Case Is < high
        Return ColorMidHigh
      Case Is >= high
        Return ColorHigh
    End Select
  End If
End Function

This function accepts a field plus the three threshold values.  Depending on the value of the field and the thresholds, the expected color string is returned.  The red color is returned as well when the field does not exist.

As a good coding practice I’ve created constants for the color strings.  This method allows you to define constants that are available in the whole report – could be interesting if the same colors are used in different parts of a report for instance.

This is the expression used for the BackgroundColor property of the TextBox:

=Code.GetColor(Fields!Internet_Order_Quantity,
    Parameters!LowThreshold.Value,
    Parameters!MiddleThreshold.Value,
    Parameters!HighThreshold.Value)

Again the actual Field object gets passed as first parameter, not just the value.

If we now run the report with a filter on Components, the warnings will still appear in the Output window, but the report will function as expected as the following screenshot shows.  (No Photoshop was used in the making of this screenshot.)  Instead of empty cells the report shows zeroes and the background is coloured even when there are no sales.

Rendered report - fully functional

Extra info can be found in the MSDN page about Using Dataset Field Collection References in Expressions.

  • Share/Bookmark

Tags: , , , , , , ,

« Older entries

© 2008-2010 A Developer's Blog All Rights Reserved