Finding Similar Strings With Fuzzy Logic Functions Built Into MDS

This post is inspired by a presentation that’s available on the Microsoft TechEd Online website.  It’s called Master Data Management – Merging from Multiple Sources, and is presented by Dejan Sarka, one of the Solid Quality Mentors and author of several SQL Server-related books.

Even if you’re not interested in Master Data Services (MDS), the following will be good to know if you need to compare strings with each other for similarity and find the string that’s the closest match to your input string.

Compare Strings Why?

You may be wondering in what scenarios you’d be required to compare strings for similarity.  To clarify, I’ll give you an example.  Imagine you’re building a data warehouse (DWH).  This DWH receives data from several different source systems.  In two of those systems, you’ve got a list of customers.  To be able to populate your DimCustomer table and avoid duplicate customers, you need to implement some logic to detect that customer Smith in System A is the same customer Smith in System B.

That’s when string similarity or fuzzy-logic functions come in handy.

Compare Strings How?

When strings are 100% equal, it’s obviously not difficult to find matching strings.  Just use the equals (=) operator in your query and you’ve matched them.  However, when strings are not 100% equal, due to typing errors or whatever cause, things get a little more complicated.  Perhaps customer Smith in System A is called Smyth, Smiht or even Smiths in System B while they are actually one and the same person.  That’s when we need to use additional logic, which we – not being rock star mathematicians ourselves – can hopefully find in built-in system functions.

The Built-in Soundex() And Difference() Functions

The standard functionality available in SQL Server to compare strings is fairly limited.  You’ve probably heard of the SOUNDEX() function, maybe even used it somewhere already.  This function receives a string as parameter and calculates a four-character code out of it.  When used on two similar-sounding strings, both strings will produce the same code.  When they are not similar, you get two different codes.  And that’s it.

Here’s an example:

select SOUNDEX('Smith'), SOUNDEX('Smiht'), SOUNDEX('Washington')

The result of that query is:

Result of SOUNDEX query

Is it perfect?  No, it’s not.  If you’d give it a value of ‘Smiths’, it would return S532, which is different from S530 even though there’s only one letter of difference between the two strings.

Next to SOUNDEX() we’ve got the DIFFERENCE() function.  This function accepts two parameters and returns an integer between 0 and 4.  It calculates the SOUNDEX value for both strings and returns the number of matching characters in the two codes.  In the case of a comparison of ‘Smith’ with ‘Smiths’, it would return 3 because three characters match (‘S53’).
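To see both functions in action on the ‘Smiths’ example, here’s a quick query (the codes in the comments follow from the SOUNDEX results discussed above):

```sql
-- SOUNDEX('Smiths') yields S532, one character off from SOUNDEX('Smith') = S530,
-- so DIFFERENCE reports 3 matching characters out of 4.
SELECT SOUNDEX('Smith')  AS SoundexSmith,       -- S530
       SOUNDEX('Smiths') AS SoundexSmiths,      -- S532
       DIFFERENCE('Smith', 'Smiths') AS Diff;   -- 3
```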

Let’s move on to an alternative solution.

The Similarity Function In Master Data Services

The MDS installation procedure goes through several steps to get all the required functionality installed.  One of those steps is the creation of a database.  What’s interesting about this database, even if you’re not interested in MDS or Master Data Management, are the custom functions that it contains.

One of those functions is called Similarity, located in the mdq schema.  This function allows you to compare two strings with each other through a specified match algorithm.  What’s interesting here is that you can choose between four different algorithms, depending on your data.  In some cases one algorithm will give better results, while in other cases another one will.

The value returned by the Similarity function is a float between zero and one, with one meaning an exact match, which makes it far more granular than the SOUNDEX option.

So how do you use the function?  Let’s have a look at its definition:

ALTER FUNCTION [mdq].[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000),
    @method [tinyint], @containmentBias [float], @minScoreHint [float])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]

As you can see, this is not a standard T-SQL function but it’s been implemented in .NET through CLR Integration.  The function expects five parameters.  The first two parameters, @input1 and @input2, are the two strings that need to get compared.  The third parameter specifies the match algorithm that should be used.

Here are the four algorithms as supported by the Similarity function:

@method = 0: the Levenshtein edit distance algorithm
@method = 1: the Jaccard similarity coefficient algorithm
@method = 2: a form of the Jaro-Winkler distance algorithm
@method = 3: the longest common subsequence algorithm

The fourth parameter, @containmentBias, specifies how exact the fuzzy index should be when comparing strings of different lengths.  Values go from 0.0 to 1.0 with the lower number being the more precise one.  This only applies to the Jaccard and longest common subsequence algorithms.  The default is 0.85.

The fifth parameter, @minScoreHint, influences the calculated scores returned by the Similarity function.  Valid values go from 0.0 to 1.0.  When a value greater than 0 is passed, any calculated score under that value will result in zero.

Note: according to the Books Online, this fifth parameter is optional.  But it’s not: all five parameters need to be passed.
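A call of the function thus looks as follows.  I’m passing 0.85 (the documented default) for @containmentBias and 0 for @minScoreHint; the exact score returned depends on the algorithm chosen, so I won’t guess at it here:

```sql
-- Compare two strings using the Jaro-Winkler algorithm (@method = 2).
-- The closer the returned float gets to 1, the more similar the strings are.
SELECT MDS.mdq.Similarity(N'Smith', N'Smyth', 2, 0.85, 0) AS JaroWinklerScore;
```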

Note: also according to the Books Online, a value of 4 for @method would be accepted as well.   In that case the function would use a date comparison algorithm, so the first two parameters should be either DateTime values or valid date strings in the format yyyy-mm-dd.  However, when testing this out I noticed that it doesn’t work.  Further exploration of the MDS database led me to another function called SimilarityDate, also located in the mdq schema:

ALTER FUNCTION [mdq].[SimilarityDate](@date1 [datetime], @date2 [datetime])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS
EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[SimilarityDate]

From the looks of it, this function implements the functionality as explained in the BOL for @method = 4.  And this one is actually working!
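A quick test of that function looks like this.  The dates below are just made-up sample values; the returned score reflects how close the two dates are:

```sql
-- Compare two dates; like Similarity, the result is a float between 0 and 1.
SELECT MDS.mdq.SimilarityDate('2014-01-01', '2014-01-15') AS DateScore;
```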

So how do you find out which of the four algorithms is best suited to your situation?  You’ll have to try them out.  Take a sample data set and run the four algorithms on the data.  As I don’t have any real-world data to use here (it wouldn’t be legal anyway), I’ll demonstrate this using some data from the ContosoDW database.

The following query uses the Customer dimension and combines it with a very small set of “sample” data that imitates some real-world problems such as typos.

with DistinctLastname as
(
    select distinct LastName
    from DimCustomer
),
NewData as
(
    select 'Wahsington' as LastName2 --typo
    union select 'Wqshington' --QWERTY/AZERTY mixup
    union select 'Zqtson' --QWERTY/AZERTY mixup x2
)
select DistinctLastname.LastName, NewData.LastName2
into #SampleData
from DistinctLastname
cross join NewData
where LastName is not null;

I’m only using the LastName column here.  In a real situation you’d probably want to combine that with FirstName and also some address-related data such as city and street.

Up next is letting the algorithms loose on the data set:

select LastName, LastName2,
    MDS.mdq.Similarity(LastName, LastName2, 0, 0.85, 0) as Levenshtein,
    MDS.mdq.Similarity(LastName, LastName2, 1, 0.85, 0) as Jaccard,
    MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) as JaroWinkler,
    MDS.mdq.Similarity(LastName, LastName2, 3, 0.85, 0) as LongestCommonSubsequence
from #SampleData;

Copy the results of that query to Excel for further, and easier, analysis.  In Excel you can easily sort the data so that the highest calculated scores of a certain algorithm are located on top.
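If you’d rather stay in T-SQL for the final match selection, a minimal sketch could look like this.  It assumes the #SampleData table from above and picks the Jaro-Winkler method (@method = 2), but remember that the choice of method is exactly what we’re still evaluating at this point:

```sql
-- For each misspelled value, keep only the existing last name with the
-- highest Jaro-Winkler score.  Ties are broken arbitrarily.
WITH Scored AS
(
    SELECT LastName, LastName2,
        MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) AS Score,
        ROW_NUMBER() OVER (PARTITION BY LastName2
            ORDER BY MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) DESC) AS rn
    FROM #SampleData
)
SELECT LastName2, LastName, Score
FROM Scored
WHERE rn = 1;
```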

Let’s first start by sorting on the Levenshtein results:

Results ordered on Levenshtein value

As you can see, the three values that we were looking for are located on top.  That’s a good sign!  Furthermore, the Washington values are quite high.  So based on my sample data this is possibly a good algorithm.

How about the Jaccard results?

Results for the Jaccard algorithm

You can clearly see that the maximum values for the Jaccard algorithm are significantly lower than those of the other algorithms.  Furthermore, the correct value for Watson is scoring lower than Son.  Assuming our logic would select the best-scoring values when searching for “Watson”, it would select the incorrect value of Son.

All this indicates that the Jaccard algorithm is not the best-suited one for our situation.

So, what about Jaro-Winkler?

Results for the Jaro-Winkler algorithm

In this case we’ve got even higher maximum values compared to Levenshtein, and the two values for Washington are located on top.  So far so good.  The correct value for Watson is located at position 7.  But as you can see, it is the first match for Zqtson, which means that the correct value would get selected by our matching logic.  Based on these numbers I would say that so far this is the best algorithm for the situation.

One more to go: Longest Common Subsequence.

Results for the Longest Common Subsequence algorithm

Again the three correct values are located on top, just like with the Levenshtein algorithm.  In fact, the calculated scores are very similar to those of the Levenshtein algorithm.  Quite logical: both are closely related, edit-based algorithms.

Conclusion

Based on the results above and using this very limited sample data set, I would select the Jaro-Winkler algorithm as being the most suitable for our situation.  But I do have to mention that you should really use larger data sets to be sure.

Also, even though we can rely on a fuzzy-logic algorithm to find the correct match, the selected matches should be verified and approved manually.  Of course, all that can be part of a Master Data Management process.

Note that for this post I only looked at fuzzy matching possibilities using just T-SQL.  In Integration Services there are a couple of components, such as the Fuzzy Lookup data flow component, that offer similar functionality.  If you’re dealing with ETL flows in SSIS, be sure to check that one out as well!

Have fun!

Valentino.

References

Microsoft Contoso BI Demo Dataset for Retail Industry

Fuzzy String Searching

Soundex algorithm

Levenshtein Distance

Jaccard Index

Jaro-Winkler Distance


  1. Steve

    Very useful – I was looking to do something along these lines and this has pointed me in the right direction.

  2. Valentino Vranken

    Thanks, glad I could help :-)

  3. Ralph

    Nice comparison. Another candidate for fuzzy string matching is the Dice Coefficient.

  4. Val

    Thx u very match!

  5. Andreas

    Hi Valentino, thanks for a great post :) .

    I am currently learning about this MDS function and actually I have a question regarding multiple fields data matching.
    What’s the practice for matching record based on multiple fields? (e.g.: score each field separately and average the score, or we concat all those field and make score, etc), thanks before :D .

© 2008-2014 BI: Beer Intelligence? All Rights Reserved