You are currently browsing articles tagged MDS.

Microsoft SQL Server 2008 R2 Master Data ServicesI’ll admit it, I’ve been involved in the process of book writing!  I wasn’t the writer, no, I left that task to Jeremy Kashel, Martyn Bullerwell and Tim Kent, three guys working for a UK-based company called Adatis.

I cooperated as one of the Technical Reviewers, which means I’ve read the whole book before it was published.  Obviously, just reading was not all.  I tried out all exercises to ensure that the code did not contain any mistakes and commented on possible errors, or difficult-to-understand paragraphs due to lack of details or sentences that were phrased in a weird way.

And I have to say, I was surprised by the quality of the text before it was even read by a Technical Reviewer.  It’s not easy to write several chapters, building further on the result of the previous one, without making any mistakes.  But the guys managed to do that just fine!

So, if you’re looking for a good book on Master Data Services, have a look at the Microsoft SQL Server 2008 R2 Master Data Services book, published by Packt Publishing.

The book is aimed at everyone new to MDS, no prior knowledge is required.  You will need to know SQL Server though, but as you’re reading my blog I’m sure that shouldn’t be a problem.

This doesn’t mean that it won’t go into detail though, don’t misunderstand me here.  If you’ve read the book from start to finish, you will have thorough knowledge of Master Data Services!

Happy reading,



Tags: , ,

This post is inspired by a presentation that’s available on the Microsoft TechEd Online website.  It’s called Master Data Management – Merging from Multiple Sources, and is presented by Dejan Sarka, one of the Solid Quality Mentors and writer of several SQL Server-related books.

Even if you’re not interested in Master Data Services (MDS), the following will be good to know if you need to compare strings with each other for similarity and find the string that’s the closest match to your input string.

Compare Strings Why?

You may be wondering in what scenarios you’d be required to compare strings for similarity.  To clarify, I’ll give you an example.  Imagine you’re building a data warehouse (DWH).  This DWH receives data from several different source systems.  In two of those systems, you’ve got a list of customers.  To be able to populate your DimCustomer table and avoid duplicate customers, you need to implement some logic to detect that customer Smith in System A is the same customer Smith in System B.

That’s when string similarity or fuzzy-logic functions come in handy.

Compare Strings How?

When strings are 100% equal, it’s obviously not difficult to find matching strings.  Just use the equals (=) operator in your query and you’ve matched them.  However, when strings are not 100% equal, due to typing errors or whatever cause, things get a little more complicated.  Perhaps customer Smith in System A is called Smyth, Smiht or even Smiths in System B while they are actually one and the same person.  That’s when we need to use additional logic, which we – not being rock star mathematicians ourselves – can hopefully find in built-in system functions.

The Built-in Soundex() And Difference() Functions

The standard functionality available in SQL Server to compare strings is fairly limited.  You’ve probably heard of the SOUNDEX() function, maybe even used in somehow already.  This function receives a string as parameter and calculates a four-digit code out of it.  When used on two similar strings, the two strings will produce the same code.  When they are not similar, you get two different codes.  And that’s it.

Here’s an example:

select SOUNDEX('Smith'), SOUNDEX('Smiht'), SOUNDEX('Washington')

The result of that query is:

Result of SOUNDEX query

Is it perfect?  No, it’s not.  If you’d give it a value of ‘Smiths’, it would return S532, which is different from S530 even though there’s only one letter of difference between the two strings.

Next to SOUNDEX() we’ve got the DIFFERENCE() function.  This function accepts two parameters and returns an integer between 0 and 4.  What this function does is it calculates the soundex value for both strings and returns the number of characters of the code that are matching.  In the case of a comparison of ‘Smith’ with ‘Smiths’, it would return 3 because three characters are matching (‘S53’).

Let’s move on to an alternative solution.

The Similarity Function In Master Data Services

The MDS installation procedure goes through several steps to get all the required functionality installed.  One of those steps is the creation of a database.  What’s interesting about this database, even if you’re not interested in MDS or Master Data Management, are the custom functions that it contains.

One of those functions is called Similarity, located in the mdq schema.  This function allows you to compare two strings with each other through a specified match algorithm.  What’s interesting here is that you can choose between those four different algorithms depending on your data.  In some cases a certain algorithm will be more interesting while in other cases the best algorithm will be another one.

The value returned by the Similarity function is a float between zero and one, which makes it more precise than the soundex option.

So how do you use the function?  Let’s have a look at its definition:

ALTER FUNCTION [mdq].[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000),

  @method [tinyint], @containmentBias [float], @minScoreHint [float])
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].


As you can see, this is not a standard T-SQL function but it’s been implemented in .NET through CLR Integration.  The function expects five parameters.  The first two parameters, @input1 and @input2, are the two strings that need to get compared.  The third parameter specifies the match algorithm that should be used.

Here are the four algorithms as supported by the Similarity function:

Value for @method Algorithm
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm

The fourth parameter, @containmentBias, specifies how exact the fuzzy index should be when comparing strings of different lengths.  Values go from 0.0 to 1.0 with the lower number being the more precise one.  This only applies to the Jaccard and longest common subsequence algorithms.  The default is 0.85.

The fifth parameter, @minScoreHint, influences the calculated scores returned by the Similarity function.  Valid values go from 0.0 to 1.0.  When a value greater than 0 is passed, any calculated score under that value will result in zero.

Note: according to the Books Online, this fifth parameter is optional.  But it’s not.

Note: also according to the Books Online, a value of 4 for @method would also be accepted.   In that case the function will use a date comparison algorithm, thus the two first parameters should be either DateTime values or valid dates that are strings specified in the format yyyy-mm-dd.  However, when testing this out I noticed that this is not working.  Further exploration of the MDS database led me to another function called SimilarityDate, also located in the mdq schema:

ALTER FUNCTION [mdq].[SimilarityDate](@date1 [datetime], @date2 [datetime])
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].


From the looks of it, this function implements the functionality as explained in the BOL for @method = 4.  And this one is actually working!

So how do you find out which of the four algorithms is the most interesting one in your situation?  You’ll have to try it out.  Take a sample data set and run the four algorithms on the data.  As I don’t have any real-world data to use here (it wouldn’t be legal anyway), I’ll demonstrate this using some data from the ContosoDW database.

The following query uses the Customer dimension and combines that with a very small set of “sample” data that imitate some real-world problems like typos.

with DistinctLastname as
    select distinct LastName
    from DimCustomer
NewData as
    select 'Wahsington' as LastName2 --typo
    union select 'Wqshington' --QWERTY/AZERTY mixup
    union select 'Zqtson' --QWERTY/AZERTY mixup x2
select DistinctLastname.LastName, NewData.LastName2
into #SampleData
from DistinctLastname
cross join NewData
where LastName is not null;

I’m only using the LastName column here.  In a real situation you’d probably want to combine that with FirstName and also some address-related data such as city and street.

Up next is letting the algorithms loose on the data set:

select LastName, LastName2,
    MDS.mdq.Similarity(LastName, LastName2, 0, 0.85, 0) as Levenshtein,
    MDS.mdq.Similarity(LastName, LastName2, 1, 0.85, 0) as Jaccard,
    MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) as JaroWinkler,
    MDS.mdq.Similarity(LastName, LastName2, 3, 0.85, 0) as LongestCommonSubsequence
from #SampleData;

Copy the results of that query to Excel for further, and easier, analysis.  You can easily sort your data in Excel, so that the highest calculated scores of a certain algorithm are located on top.

Let’s first start by sorting on the Levenshtein results:

Results ordered on Levenshtein value

As you can see, the three values that we were looking for are located on top.  That’s a good sign!  Furthermore, the Washington values are quite high.  So based on my sample data this is possibly a good algorithm.

How about the Jaccard results?

Results for the Jaccard algorithm

You can clearly see that the maximum values for the Jaccard algorithm are significantly lower than those of the other algorithms.  Furthermore, the correct value for Watson is scoring lower than Son.  Assuming our logic would select the best-scoring values when searching for “Watson”, it would select the incorrect value of Son.

All this indicates that the Jaccard algorithm is not the best-suited one for our situation.

So, what about Jaro-Winkler?

Results for the Jaro-Winkler algorithm

In this case we’ve got even higher maximum values compared to Levenshtein and the two values for Washington are located on top.  So far so good.  The correct value for Watson is located at position 7.  But as you can see, this is the first match for Zqtson, which means that the correct value would get selected by our matching logic.  Based on these numbers I would say that so far this is the best algorithm for the situation.

One more to go: Longest Common Subsequence.

Results for the Longest Common Subsequence algorithm

Again the three correct values are located on top, just like the Levenshtein algorithm.  In fact, the calculated scores are very similar to the Levenshtein algorithm.  Quite logical: both calculations are using similar algorithms.


Based on the results above and using this very limited sample data set, I would select the Jaro-Winkler algorithm as being the most suitable for our situation.  But I do have to mention that you should really use larger data sets to be sure.

Also, even though we can rely on a fuzzy-logic algorithm to find the correct match, the selected matches should be verified and approved manually.  Of course, all that can be part of a Master Data Management process.

Note that for this post I only looked at fuzzy matching possibilities using just T-SQL.  In Integration Services there are a couple of components, such as the Fuzzy Lookup data flow component, that offer similar functionality.  If you’re dealing with ETL flows in SSIS, be sure to check that one out as well!

Have fun!



Microsoft Contoso BI Demo Dataset for Retail Industry

Fuzzy String Searching

Soundex algorithm

Levenshtein Distance

Jaccard Index

Jaro-Winkler Distance


Tags: , ,

Owkay, I’ve got a good one for you this time.

And all it takes is checking a checkbox, just one checkbox!

That’s the solution to a problem with which I’ve been struggling yesterday.  What problem?  Let me tell you.

I’ve been installing Master Data Services (MDS) on my SQL Server 2008 R2.  This feature will not only create a database but also a website which you can use to manage MDS.  And it’s the website part that I had a problem with.  Everything installed just fine but when I tried to load the site in Internet Explorer 8 on my Windows 7 64-bit machine, I got the following error:

HTTP Error 401.2 – Unauthorized

You are not authorized to view this page due to invalid authentication headers.

Great, an authentication problem – aren’t those our favorite ones?  And like any decent developer, I didn’t waste my time reading all the text on the error page and started investigating the issue.  I had a look at the settings of the Application Pool and those of the Default Web Site.  I also tried changing the security settings in IE.  At the end I think I have tried every possible setting in IIS 7.5 (which is the version that ships with Windows 7), but I kept getting that same error.

I let some time pass (not on purpose but because we were going to visit my parents-in-law) and in the evening I decided to have another look.  This time I took the effort of reading everything mentioned in the error:

HTTP Error 401.2 - Unauthorized

Do you see that blue link down at the bottom, which I’ve marked with a red rectangle?  That’s where I got the answer from!  Clicking it opened up the following Microsoft Support page: Error message when you try to visit a Web page that is hosted on IIS 7.0: "HTTP Error 401.2 – Unauthorized".

Sounds familiar doesn’t it?  Okay, it’s meant for IIS 7.0 but works for 7.5 as well.

I solved my problem by applying Resolution 1.  In short: IIS was running without the Windows Authentication module installed!  Apparently that is not installed by default when you activate Internet Information Services through the “Turn Windows features on or off” window.  Here’s a screenshot showing what needs to be added:

Activating Windows Authentication for IIS 7.5 in Windows 7

Once that was set up I was able to load the MDS application:

Master Data Services Home Page

If you’re looking for instructions on how to install and configure Master Data Services, have a look at this article at the Master Data Services Team blog.

(Do I need to mention that this was the first-ever web application that I tried running on Windows 7? :-) )

Have fun!



Tags: , , , , ,

© 2008-2019 BI: Beer Intelligence? All Rights Reserved