June 2010

You are currently browsing the monthly archive for June 2010.

As a BI developer I regularly use Excel to analyze my data on correctness.  In Excel it’s easy to filter out unrelated records, as you’ll probably already know.

How do I get my data in Excel?

With a simple copy/paste.  Since SQL Server 2008, you can just right-click on the results in the Management Studio and select Copy with Headers.  This will put the selected data into memory, including the column headers.

SSMS 2008: right-click on results to Copy with Headers

You can even make a selection in the Results pane and only copy that, as shown in the screenshot above.  Here’s what it looks like when pasted into Excel:

Data with headers pasted into Excel

How do I make the columns wider?

Excel will not expand the column widths by default, shown in that last screenshot.  You could manually enlarge them one by one, but that’s not convenient when you last filled column is called “DI”.  Instead, I use the following trick.  First I select the whole sheet by clicking on the grey square in the top-left corner:

Select the whole sheet using the grey square

Then I double-click on the vertical divider in between columns A and B.  This will adjust all column widths so that the data fits nicely.

Double-click the divider in between two columns to adjust width automatically

Does Excel know SELECT DISTINCT?

The other day I needed to filter out all duplicate records out of a really large dataset.  I first tried by activating the regular filter.  This is an Excel feature that I use a lot.  Just clicking the Filter button on the Data ribbon will add the small dropdowns next to each column’s header:

Excel's Filter functionality

But in those dropdowns I couldn’t find an option which would get me to the required result.

(Note that I added a couple of duplicate records in my dataset.)

After a little search, I found the magic checkbox!  I didn’t know that it could be so easy.

To find it, click the Advanced button in the Data ribbon.

The Advanced button on the Data ribbon

In the Advanced Filter window that appears you see a checkbox called Unique records only.

Advanced Filter: Unique records only checkbox

There’s your SELECT DISTINCT!

And here’s the sheet with the duplicates filtered out:

Excel sheet after filtering out duplicate records with the "Unique records only" option

Do you also see that the filter dropdowns are gone, although I did not remove them?  The Advanced Filter cannot be used in combination with the regular filter.  Once you activate the regular filter by clicking the Filter button, your duplicates will re-appear.

If you want to further filter your data after filtering out all duplicate records, you should select the Copy to another location option in the Advanced Filter window.  Copy it below your data on the same sheet and apply the regular filter on that new set of data.

Filtered set of data below full dataset on same sheet

Copying to another sheet is not supported.  Apparently you can only copy to the active sheet.  Should you try it out anyway, you’ll get this message:

Warning: You can only copy filtered data to the active sheet

Happy filtering!

Valentino.

Share

Tags: ,

Lately I’ve been using Visual Studio Team System 2008 Database Edition with the GDR R2 add-on, also known as “The Data Dude”, to compare databases with each other.

And during a schema compare I noticed that the dude can be quite stubborn now and then.

Here’s an example and an explanation of why, as far as I can tell, the dude is behaving that way.

The Scenario

At work we use the data dude’s schema compare to prepare our deployment scripts.  The most fun can be had when preparing a script to deploy the latest finalized version, version x, from the Development (DEV) to the Acceptance (UAT) environment.

In our DEV database there have already been changes for the next version, version x+1, so these changes need to be skipped.

Now imagine the following situation:

  • a table called TableA exists
  • a new field called NewFieldA was added to TableA
  • a view called ViewA exists and was modified – it uses TableA but not NewFieldA

The changes to ViewA belong with version x, while the changes to TableA are part of version x+1.  So the script that I’m going to generate should contain an ALTER VIEW ViewA … but not an ALTER TABLE TableA ….

The Behavior

Using the dude, I open up the dropdown on the line that states Different definition on TableA and I select Skip.  The value gets changed from Update to Skip.  However, when the focus changes to another line in the grid the value changes back to Update!  Furthermore, it’s now grayed out, impossible to change!

The first time that you see this happening you can’t believe your eyes and try it once more, first closing everything and then re-doing your actions.  Just to see if you didn’t select the wrong option somewhere.  But alas, the dude remains stubborn!

Tip: if you have just executed the compare and you find that some update actions cannot be modified, try to do a refresh (right-click on the background of the comparison window and click Refresh).  I noticed that after the refresh you will get the dropdown for these lines and thus will be able to change the Update Action.

The Explanation

It took me a while to realize, but here’s what seems to be happening.  Like with all things stubborn, it helps if you know what the thing in question is actually being stubborn for.

The dude knows that ViewA depends on TableA and it also knows that both ViewA and TableA where changed.  However, the dude is a little short-sighted, it doesn’t know that NewFieldA is not used in ViewA.  It doesn’t know that the change to ViewA was not related to the change on TableA.  So it automatically assumes that you’re making a mistake and prevents you from messing up.  It thinks, “You stupid user, I know it better, I am smart software!  And that’s not all, I AM THE DATA DUDE! And if there’s one thing you don’t want, that’s to mess with the Data Dude!”.

Here’s a screenshot of the dude’s stubbornness:

Grayed-out Update Action: the result of a stubborn data dude

In this particular case it refused to skip certain new tables, even though there were no dependent objects that needed those tables!  The first two lines show the grayed-out items.  These tables were located under a new schema and I had to put the schema’s Update Action to Skip to get the tables be skipped as well.  The Update Action for the tables was changed automatically when I modified the action for the schema.

But, if you ask me, I don’t see why I shouldn’t be able to Skip a new table even when I am going to create the new schema to which it belongs?  It’s the table that depends on the schema, not the other way around.

The Feature Request

It would have been great if I could manually override the dude’s decision.  As that was not the case, I have to manually change the generated script so that it only contains the changes that I want.  As long as it’s technically feasible, I think manual changes to the Update Action should be allowed.

One More Tip

Take a good look at the following screenshot.

Beware the miniature exclamation mark in the icon's corner!

By default, the node that reads Different dependencies was collapsed.  So you think, good, the line says Skip so nothing will happen.  Except, did you notice that minuscule exclamation mark in the bottom-left corner of the icon?  That means that other actions may be happening further down the dependency tree, as illustrated in the screenshot.

So don’t let the dude mislead you into thinking nothing will happen when it says Skip.  Sometimes the actual actions are hidden further down and you need to open up every node with the exclamation mark on the icon to verify if that’s what you want.

Have fun training the dude,

Valentino.

Share

Tags: , ,

In this post I’ll be comparing the different places where I write blog articles.

Here’s the location list:

  • home office
  • garden
  • train
  • work

At Home

My Office

The pink animal clock of Windows 7My home office is the place where I’ve written most of my posts so far.  It’s a quiet place, even when the kids are at home, because there’s a hallway in between that room and the living room.  So it’s a good place to write, easy to concentrate.

Our youngest daughter though, three years old, may occasionally take a peek through the glass door.  In fact, sometimes she even comes and sits on my lap to watch what I’m doing.  Her favorite clock type in Windows 7 is the pink animal, not sure what type of animal it is.

But that’s okay, as I’m using my free time to write these articles I occasionally need life to interfere and remind me about that fact.  That’s usually the time to stop writing and spend some time with the family.

Pros

  • Fast internet connection
  • Quiet, good for concentration

Cons

  • No fresh air

My Garden

Weather forecast for today, June 4th 2010: 23°C!Now that the weather has started to prepare Belgium for Summer, I prefer to spend my spare time outside sitting in our garden.  And sometimes, usually during the nice warm weekends, I want to use some of that time to produce an article.  With my wireless network, parasol, table and chair, I’m all set up for it.  While sitting outside I can even keep an eye on the kids, who are also outside playing with their slide, trampoline, swing, …

Pros

  • Can get something done while not missing out on the nice weather
  • Internet access (provided you’ve got a wireless network)

Cons

  • Laptop seems to run a bit warmer, probably because it’s warmer outside than in my office
  • Need an electricity plug for longer sessions

On The Road

In The Train

My new Swiss Gear Ibex 17-Inch Notebook Backpack Recently I’ve bought myself this backpack and it was the best thing I could buy.  My current client are the Belgian Railways.  Their offices are of course located close to a train station, so most of the time I’m commuting to/from work by train.  A one-way trip takes about 45 minutes and my main occupation so far has been reading or sitting with my eyes closed listening to music.

Until now.  With this new backpack I can easily take my laptop with me and use it to write, or to play with SQL Server, prepare a demo, ….  In fact, right now while I’m typing this I’m sitting on the train.  Without the backpack it was just too painful.  I need to walk about 8 minutes to get from the train station to the offices of my client and with the standard Dell bag the weight of the laptop is just too much on my shoulder.

In the mornings I still close my eyes though, that’s just too early to get something useful done anyway.

Pros

  • No time wasted doing nothing
  • More free time at home to spend with wife and daughters

Cons

  • No internet – no quick lookups
  • No internet – publishing will be delayed
  • No internet – links will need to be added later from another location in my list above
  • May get noisy on the crowded trains – difficult to concentrate
  • No mouse connected due to not enough space – I so miss it!
  • No electricity (but my battery lasts long enough so not an issue)

At Work

Well, actually, I’m just kidding here.  You thought it was for real?  Did you jump down straight from the top to get to this section?  I think I may have scared a couple of people, such as my employer.  Which is actually the intention of this paragraph, ha! (If you’re one of those, post a comment!) :-)

I’m not one of those guys who has the possibility to blog and get paid for it.

Conclusion

Even though I’ve listed quite some cons on writing while sitting in a train, I still like it.  Just because it’s much more productive than anything else that can be done at that time, and the time spent on the train is not something that’s optional or depends on a decision.  At home I can decide to make some time free for writing, whether if it’s in my office or in the garden.  But the train is a place where I will certainly spend some time each working day.  I can’t decide to not spend time on the train.  Except when I commute by car – which is something I do when I’ve got a meeting or info session in the evening – but that’s another story.

Real conclusion: sometimes buying a backpack can save you time.

Time for me to start the weekend and get off the train!

Have fun!

Valentino.

Share

Tags:

In this article I will show you a couple of different T-SQL queries to fetch aggregated data.  The main purpose is to illustrate how the OVER clause can be used to aggregate data.

For the examples I will use data from the AdventureWorks2008R2 database, available at CodePlex.

The Data

The AdventureWorks 2008 R2 database contains a view called Sales.vSalesPerson.  This is the data with which I’ll be working in the examples below.  Here’s what it looks like:

My Working Data

I’ve hidden some fields so that all the relevant ones are in view.

The Scenario

Your manager has asked you to create one query, to be executed on the Sales.vSalesPerson table, that returns a list of:

  • all employees (FirstName, LastName, JobTitle, CountryRegionName, StateProvinceName, City),
  • their sales of last year (SalesLastYear),
  • the sum of the sales of last year for their country,
  • the average of the sales of last year compared to all employees with the same type of phone (PhoneNumberType)
  • the overall average and sum of the sales of last year.

Using Derived Tables

No problem you say, coming right up.  So you start building your query, retrieving all fields as requested.

After quite some typing, here’s what your query looks like:

select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName,
    S.StateProvinceName, S.City, S.SalesLastYear,
    GeographicSales.SalesLastYearGeographic_SUM,
    SalesByPhoneType.SalesLastYearByPhoneNumberType_AVG,
    SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM
from Sales.vSalesPerson S
--Derived Table 1: the overall aggregates
cross join (
    select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG
    from Sales.vSalesPerson
) SalesSUM
--Derived Table 2: the aggregate on Country level
inner join (
    select CountryRegionName, SUM(SalesLastYear) SalesLastYearGeographic_SUM
    from Sales.vSalesPerson
    group by CountryRegionName
) GeographicSales on GeographicSales.CountryRegionName = S.CountryRegionName
--Derived Table 3: the aggregate on phone type
inner join (
    select PhoneNumberType, AVG(SalesLastYear) SalesLastYearByPhoneNumberType_AVG
    from Sales.vSalesPerson
    group by PhoneNumberType
) SalesByPhoneType on SalesByPhoneType.PhoneNumberType= S.PhoneNumberType;

The main query is retrieving all fields as requested.  Further down there are three derived table queries, each one retrieving aggregates on a different level.

The first derived table is retrieving the overall aggregates.  These are cross-joined with every record in our main query so for each record the totals will be the same, which is what we want.

The second derived table retrieves the aggregates on Country level, including the CountryRegionName.  This is done using the conventional GROUP BY method.  The CountryRegionName is the key on which the derived table is joined to the main table.

The third derived table uses a similar system, this time for the aggregate on phone type.

And here’s the query’s output:

Output of the query using subqueries

Happy with this result, you go up to the cafeteria to finally have lunch with your colleagues (who left 15 minutes earlier but you wanted to get your query finished first).

Using The OVER Clause

During lunch you explain to your peers what kind of funny request you got from management and told them how you solved it.

Then one of them speaks up and says: “Want to know how you can avoid all that typing?  Use the OVER clause!  I’ll show you when we are back at our desks.”

After lunch, here’s what your colleague helps to produce:

select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName,
    S.StateProvinceName, S.City, S.SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM
from Sales.vSalesPerson S
--Derived Table 1: the overall aggregates
cross join (
    select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG
    from Sales.vSalesPerson
) SalesSUM;

As you can see, derived tables 2 and 3 are gone.  They have been replaced with the OVER clause, in combination with PARTITION BY.  What you say with the OVER clause is: “partition the dataset by the fields specified in the PARTITION BY and apply the aggregation on those partitions”.  Another word for this is aggregate window function.

As you like the approach, you ask your co-worker how you can get rid of that cross join.  He doesn’t really know but then another colleague who overheard your conversation says: “On this blog the other day I read that you can use the OVER clause and partition by anything you want.  As long as it’s a constant, it will work!”.

So you give that a try and you end up with the following final query:

select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName,
    StateProvinceName, City, SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    AVG(SalesLastYear) OVER (PARTITION BY 'duh') SalesLastYear_AVG,
    SUM(SalesLastYear) OVER (PARTITION BY 1) SalesLastYear_SUM
from Sales.vSalesPerson;

As illustrated in the example, you can use any constant value to calculate overall aggregates over the whole dataset using the OVER clause.

You happily thank your colleagues and tell them that next time you’ll be able to join them for lunch on time.

About a week later you’re explaining to one of your friends how you’ve gotten to know the OVER clause. After hearing how you use it to aggregate over the whole dataset, he smiles and says: “I know how you can simplify it even more! Don’t partition at all!”.

Taking a closer look it turns out that the PARTITION BY is actually optional:

Ranking Window Functions
< OVER_CLAUSE > :: =     OVER ( [ PARTITION BY value_expression, ... [ n ] ]            <ORDER BY_Clause> )Aggregate Window Functions
< OVER_CLAUSE > :: =     OVER ( [ PARTITION BY value_expression, ... [ n ] ] )

See those square brackets? Means it’s optional.

So here is the real final query:

select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName,
    StateProvinceName, City, SalesLastYear,
    SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName)
        SalesLastYearGeographic_SUM,
    AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType)
        SalesLastYearByPhoneNumberType_AVG,
    AVG(SalesLastYear) OVER () SalesLastYear_AVG,
    SUM(SalesLastYear) OVER () SalesLastYear_SUM
from Sales.vSalesPerson;

Conclusion

When you compare the final query with the first one, tell me, which one would you prefer to maintain?  Do you prefer to have lunch with your peers or to arrive late and miss all the fun?

Have fun!

Valentino.

References

OVER Clause (Transact-SQL)

Join Fundamentals

SELECT (Transact-SQL)

Share

Tags: , , ,

© 2008-2017 BI: Beer Intelligence? All Rights Reserved