Finding Similar Strings With Fuzzy Logic Functions Built Into MDS
This post is inspired by a presentation that’s available on the Microsoft TechEd Online website. It’s called Master Data Management – Merging from Multiple Sources, and is presented by Dejan Sarka, one of the Solid Quality Mentors and writer of several SQL Server-related books.
Even if you’re not interested in Master Data Services (MDS), the following will be good to know if you need to compare strings with each other for similarity and find the string that’s the closest match to your input string.
Compare Strings Why?
You may be wondering in what scenarios you’d be required to compare strings for similarity. To clarify, I’ll give you an example. Imagine you’re building a data warehouse (DWH). This DWH receives data from several different source systems. In two of those systems, you’ve got a list of customers. To be able to populate your DimCustomer table and avoid duplicate customers, you need to implement some logic to detect that customer Smith in System A is the same customer Smith in System B.
That’s when string similarity or fuzzy-logic functions come in handy.
Compare Strings How?
When strings are 100% equal, it’s obviously not difficult to find matching strings. Just use the equals (=) operator in your query and you’ve matched them. However, when strings are not 100% equal, due to typing errors or whatever cause, things get a little more complicated. Perhaps customer Smith in System A is called Smyth, Smiht or even Smiths in System B while they are actually one and the same person. That’s when we need to use additional logic, which we – not being rock star mathematicians ourselves – can hopefully find in built-in system functions.
The Built-in Soundex() And Difference() Functions
The standard functionality available in SQL Server to compare strings is fairly limited. You’ve probably heard of the SOUNDEX() function, maybe even used in somehow already. This function receives a string as parameter and calculates a four-digit code out of it. When used on two similar strings, the two strings will produce the same code. When they are not similar, you get two different codes. And that’s it.
Here’s an example:
select SOUNDEX('Smith'), SOUNDEX('Smiht'), SOUNDEX('Washington')
Is it perfect? No, it’s not. If you’d give it a value of ‘Smiths’, it would return S532, which is different from S530 even though there’s only one letter of difference between the two strings.
Next to SOUNDEX() we’ve got the DIFFERENCE() function. This function accepts two parameters and returns an integer between 0 and 4. What this function does is it calculates the soundex value for both strings and returns the number of characters of the code that are matching. In the case of a comparison of ‘Smith’ with ‘Smiths’, it would return 3 because three characters are matching (‘S53’).
Let’s move on to an alternative solution.
The Similarity Function In Master Data Services
The MDS installation procedure goes through several steps to get all the required functionality installed. One of those steps is the creation of a database. What’s interesting about this database, even if you’re not interested in MDS or Master Data Management, are the custom functions that it contains.
One of those functions is called Similarity, located in the mdq schema. This function allows you to compare two strings with each other through a specified match algorithm. What’s interesting here is that you can choose between those four different algorithms depending on your data. In some cases a certain algorithm will be more interesting while in other cases the best algorithm will be another one.
The value returned by the Similarity function is a float between zero and one, which makes it more precise than the soundex option.
So how do you use the function? Let’s have a look at its definition:
ALTER FUNCTION [mdq].[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000), @method [tinyint], @containmentBias [float], @minScoreHint [float]) RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT AS EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality]. [Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]
As you can see, this is not a standard T-SQL function but it’s been implemented in .NET through CLR Integration. The function expects five parameters. The first two parameters, @input1 and @input2, are the two strings that need to get compared. The third parameter specifies the match algorithm that should be used.
Here are the four algorithms as supported by the Similarity function:
|Value for @method||Algorithm|
|0||The Levenshtein edit distance algorithm|
|1||The Jaccard similarity coefficient algorithm|
|2||A form of the Jaro-Winkler distance algorithm|
|3||Longest common subsequence algorithm|
The fourth parameter, @containmentBias, specifies how exact the fuzzy index should be when comparing strings of different lengths. Values go from 0.0 to 1.0 with the lower number being the more precise one. This only applies to the Jaccard and longest common subsequence algorithms. The default is 0.85.
The fifth parameter, @minScoreHint, influences the calculated scores returned by the Similarity function. Valid values go from 0.0 to 1.0. When a value greater than 0 is passed, any calculated score under that value will result in zero.
Note: according to the Books Online, this fifth parameter is optional. But it’s not.
Note: also according to the Books Online, a value of 4 for @method would also be accepted. In that case the function will use a date comparison algorithm, thus the two first parameters should be either DateTime values or valid dates that are strings specified in the format yyyy-mm-dd. However, when testing this out I noticed that this is not working. Further exploration of the MDS database led me to another function called SimilarityDate, also located in the mdq schema:
ALTER FUNCTION [mdq].[SimilarityDate](@date1 [datetime], @date2 [datetime]) RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT AS EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality]. [Microsoft.MasterDataServices.DataQuality.SqlClr].[SimilarityDate]
From the looks of it, this function implements the functionality as explained in the BOL for @method = 4. And this one is actually working!
So how do you find out which of the four algorithms is the most interesting one in your situation? You’ll have to try it out. Take a sample data set and run the four algorithms on the data. As I don’t have any real-world data to use here (it wouldn’t be legal anyway), I’ll demonstrate this using some data from the ContosoDW database.
The following query uses the Customer dimension and combines that with a very small set of “sample” data that imitate some real-world problems like typos.
with DistinctLastname as ( select distinct LastName from DimCustomer ), NewData as ( select 'Wahsington' as LastName2 --typo union select 'Wqshington' --QWERTY/AZERTY mixup union select 'Zqtson' --QWERTY/AZERTY mixup x2 ) select DistinctLastname.LastName, NewData.LastName2 into #SampleData from DistinctLastname cross join NewData where LastName is not null;
I’m only using the LastName column here. In a real situation you’d probably want to combine that with FirstName and also some address-related data such as city and street.
Up next is letting the algorithms loose on the data set:
select LastName, LastName2, MDS.mdq.Similarity(LastName, LastName2, 0, 0.85, 0) as Levenshtein, MDS.mdq.Similarity(LastName, LastName2, 1, 0.85, 0) as Jaccard, MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) as JaroWinkler, MDS.mdq.Similarity(LastName, LastName2, 3, 0.85, 0) as LongestCommonSubsequence from #SampleData;
Copy the results of that query to Excel for further, and easier, analysis. You can easily sort your data in Excel, so that the highest calculated scores of a certain algorithm are located on top.
Let’s first start by sorting on the Levenshtein results:
As you can see, the three values that we were looking for are located on top. That’s a good sign! Furthermore, the Washington values are quite high. So based on my sample data this is possibly a good algorithm.
How about the Jaccard results?
You can clearly see that the maximum values for the Jaccard algorithm are significantly lower than those of the other algorithms. Furthermore, the correct value for Watson is scoring lower than Son. Assuming our logic would select the best-scoring values when searching for “Watson”, it would select the incorrect value of Son.
All this indicates that the Jaccard algorithm is not the best-suited one for our situation.
So, what about Jaro-Winkler?
In this case we’ve got even higher maximum values compared to Levenshtein and the two values for Washington are located on top. So far so good. The correct value for Watson is located at position 7. But as you can see, this is the first match for Zqtson, which means that the correct value would get selected by our matching logic. Based on these numbers I would say that so far this is the best algorithm for the situation.
One more to go: Longest Common Subsequence.
Again the three correct values are located on top, just like the Levenshtein algorithm. In fact, the calculated scores are very similar to the Levenshtein algorithm. Quite logical: both calculations are using similar algorithms.
Based on the results above and using this very limited sample data set, I would select the Jaro-Winkler algorithm as being the most suitable for our situation. But I do have to mention that you should really use larger data sets to be sure.
Also, even though we can rely on a fuzzy-logic algorithm to find the correct match, the selected matches should be verified and approved manually. Of course, all that can be part of a Master Data Management process.
Note that for this post I only looked at fuzzy matching possibilities using just T-SQL. In Integration Services there are a couple of components, such as the Fuzzy Lookup data flow component, that offer similar functionality. If you’re dealing with ETL flows in SSIS, be sure to check that one out as well!
- Book: SQL Server 2012 Reporting Services Blueprints
- SQL Server Days 2013 – Data Cleansing: Download
- SQL Server Days 2013
- Exploring the System.Object Package Variable [SSIS]
- T-SQL Tuesday 46: Contraptions!
- Creating Multi-Column Reports: The Top-Down Version [SSRS]
- Formatting Dates [SSRS]
- Community Day – Data Visualization Tips & Tricks: Download
- Formatting Numbers [SSRS]
- Community Day 2013: Data Visualization Tips & Tricks
- November 2013 (2)
- October 2013 (1)
- September 2013 (2)
- July 2013 (2)
- June 2013 (2)
- May 2013 (3)
- March 2013 (3)
- February 2013 (2)
- January 2013 (2)
- December 2012 (2)
- November 2012 (3)
- October 2012 (2)
- August 2012 (2)
- July 2012 (2)
- June 2012 (2)
- May 2012 (2)
- April 2012 (3)
- March 2012 (4)
- February 2012 (4)
- January 2012 (2)
- December 2011 (2)
- November 2011 (2)
- October 2011 (1)
- September 2011 (3)
- August 2011 (2)
- June 2011 (2)
- May 2011 (3)
- April 2011 (3)
- March 2011 (3)
- February 2011 (2)
- January 2011 (5)
- December 2010 (1)
- November 2010 (3)
- October 2010 (3)
- September 2010 (2)
- August 2010 (4)
- July 2010 (2)
- June 2010 (4)
- May 2010 (6)
- April 2010 (3)
- March 2010 (3)
- February 2010 (11)
- January 2010 (9)
- December 2009 (2)
- November 2009 (3)
- October 2009 (3)
- September 2009 (4)
- August 2009 (6)
- July 2009 (2)
- June 2009 (3)
- May 2009 (7)
- April 2009 (3)
- March 2009 (3)
- February 2009 (5)
- January 2009 (4)
- December 2008 (2)
- November 2008 (3)
- October 2008 (1)
- September 2008 (1)
- August 2008 (4)
- July 2008 (3)