In this article I will show you a couple of different T-SQL queries to fetch aggregated data. The main purpose is to illustrate how the OVER clause can be used to aggregate data.
For the examples I will use data from the AdventureWorks2008R2 database, available at CodePlex.
The Data
The AdventureWorks 2008 R2 database contains a view called Sales.vSalesPerson. This is the data with which I’ll be working in the examples below. Here’s what it looks like:
I’ve hidden some fields so that all the relevant ones are in view.
The Scenario
Your manager has asked you to create one query, to be executed on the Sales.vSalesPerson table, that returns a list of:
- all employees (FirstName, LastName, JobTitle, CountryRegionName, StateProvinceName, City),
- their sales of last year (SalesLastYear),
- the sum of the sales of last year for their country,
- the average of the sales of last year compared to all employees with the same type of phone (PhoneNumberType)
- the overall average and sum of the sales of last year.
Using Derived Tables
No problem you say, coming right up. So you start building your query, retrieving all fields as requested.
After quite some typing, here’s what your query looks like:
select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName, S.StateProvinceName, S.City, S.SalesLastYear, GeographicSales.SalesLastYearGeographic_SUM, SalesByPhoneType.SalesLastYearByPhoneNumberType_AVG, SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM from Sales.vSalesPerson S --Derived Table 1: the overall aggregates cross join ( select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG from Sales.vSalesPerson ) SalesSUM --Derived Table 2: the aggregate on Country level inner join ( select CountryRegionName, SUM(SalesLastYear) SalesLastYearGeographic_SUM from Sales.vSalesPerson group by CountryRegionName ) GeographicSales on GeographicSales.CountryRegionName = S.CountryRegionName --Derived Table 3: the aggregate on phone type inner join ( select PhoneNumberType, AVG(SalesLastYear) SalesLastYearByPhoneNumberType_AVG from Sales.vSalesPerson group by PhoneNumberType ) SalesByPhoneType on SalesByPhoneType.PhoneNumberType= S.PhoneNumberType;
The main query is retrieving all fields as requested. Further down there are three derived table queries, each one retrieving aggregates on a different level.
The first derived table is retrieving the overall aggregates. These are cross-joined with every record in our main query so for each record the totals will be the same, which is what we want.
The second derived table retrieves the aggregates on Country level, including the CountryRegionName. This is done using the conventional GROUP BY method. The CountryRegionName is the key on which the derived table is joined to the main table.
The third derived table uses a similar system, this time for the aggregate on phone type.
And here’s the query’s output:
Happy with this result, you go up to the cafeteria to finally have lunch with your colleagues (who left 15 minutes earlier but you wanted to get your query finished first).
Using The OVER Clause
During lunch you explain to your peers what kind of funny request you got from management and told them how you solved it.
Then one of them speaks up and says: “Want to know how you can avoid all that typing? Use the OVER clause! I’ll show you when we are back at our desks.”
After lunch, here’s what your colleague helps to produce:
select S.FirstName, S.LastName, S.JobTitle, S.PhoneNumberType, S.CountryRegionName, S.StateProvinceName, S.City, S.SalesLastYear, SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName) SalesLastYearGeographic_SUM, AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType) SalesLastYearByPhoneNumberType_AVG, SalesSUM.SalesLastYear_AVG, SalesSUM.SalesLastYear_SUM from Sales.vSalesPerson S --Derived Table 1: the overall aggregates cross join ( select SUM(SalesLastYear) SalesLastYear_SUM, AVG(SalesLastYear) SalesLastYear_AVG from Sales.vSalesPerson ) SalesSUM;
As you can see, derived tables 2 and 3 are gone. They have been replaced with the OVER clause, in combination with PARTITION BY. What you say with the OVER clause is: “partition the dataset by the fields specified in the PARTITION BY and apply the aggregation on those partitions”. Another word for this is aggregate window function.
As you like the approach, you ask your co-worker how you can get rid of that cross join. He doesn’t really know but then another colleague who overheard your conversation says: “On this blog the other day I read that you can use the OVER clause and partition by anything you want. As long as it’s a constant, it will work!”.
So you give that a try and you end up with the following final query:
select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName, StateProvinceName, City, SalesLastYear, SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName) SalesLastYearGeographic_SUM, AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType) SalesLastYearByPhoneNumberType_AVG, AVG(SalesLastYear) OVER (PARTITION BY 'duh') SalesLastYear_AVG, SUM(SalesLastYear) OVER (PARTITION BY 1) SalesLastYear_SUM from Sales.vSalesPerson;
As illustrated in the example, you can use any constant value to calculate overall aggregates over the whole dataset using the OVER clause.
You happily thank your colleagues and tell them that next time you’ll be able to join them for lunch on time.
About a week later you’re explaining to one of your friends how you’ve gotten to know the OVER clause. After hearing how you use it to aggregate over the whole dataset, he smiles and says: “I know how you can simplify it even more! Don’t partition at all!”.
Taking a closer look it turns out that the PARTITION BY is actually optional:
Ranking Window Functions < OVER_CLAUSE > :: = OVER ( [ PARTITION BY value_expression, ... [ n ] ] <ORDER BY_Clause> )Aggregate Window Functions < OVER_CLAUSE > :: = OVER ( [ PARTITION BY value_expression, ... [ n ] ] )
See those square brackets? Means it’s optional.
So here is the real final query:
select FirstName, LastName, JobTitle, PhoneNumberType, CountryRegionName, StateProvinceName, City, SalesLastYear, SUM(SalesLastYear) OVER (PARTITION BY CountryRegionName) SalesLastYearGeographic_SUM, AVG(SalesLastYear) OVER (PARTITION BY PhoneNumberType) SalesLastYearByPhoneNumberType_AVG, AVG(SalesLastYear) OVER () SalesLastYear_AVG, SUM(SalesLastYear) OVER () SalesLastYear_SUM from Sales.vSalesPerson;
Conclusion
When you compare the final query with the first one, tell me, which one would you prefer to maintain? Do you prefer to have lunch with your peers or to arrive late and miss all the fun?
Have fun!
Valentino.
References
Tags: data, SQL Server, T-SQL, Tutorial
-
Great! Thank you for posting.
One question – can you just omit the over / partition by when using a constant to partition? In your last example, would
AVG(SalesLasYear) SalesLastYear_AVG,
SUM(SalesLastYear) SalesLastYear_SUMwork just as well?






5 comments
Comments feed for this article
Trackback link: http://blog.hoegaerden.be/2010/06/01/aggregating-data-with-the-over-clause/trackback/