cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
jhamil560
New Member

Fuzzy Match calculation explained

So i thought I knew how Power Query is using the Jaccard similarity to generate a fuzzy match. However I am not getting expeced matches in some instances unless I set the threshold lower to .5

 

Can anyone explain the exact calculation steps power BI uses to come up with the similarity value? I'm doing a fuzzy merge on colums 1 and 2. See two sample tables below (2 columns separated by ;). I would expect these to match using the default threshold as they only have 2 differences, but they don't match  unless i set to .5.

 

PPID01234 ; john-smith 2A 3H

 

PPID01234; john-smith 2B 4H

1 ACCEPTED SOLUTION
v-yingjl
Community Support
Community Support

Hi @jhamil560 ,

About similarity threshold in fuzzy match, here is the explaination in Power Query editor in Power BI Desktop:

Similarity Threshold:

Indicates how similar two values need to be in order to match. The minimum value of 0.00 causes all values to match each other. The maximum value of 1.00 only allows exact matches. The default value is 0.80.

 

To check the calculation of this threshold, you can create a table in Power Query online(dataflow) to use cluster values feature, please refer this document which introduces it in details:

How fuzzy matching works in Power Query? 

 

Note:

Currently only the Cluster values feature in Power Query Online will provide a new column with the similarity score.

 

Best Regards,
Community Support Team _ Yingjie Li
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

7 REPLIES 7
lasersharks
Super User
Super User

The documentation quoted here states: The minimum value of 0.00 causes all values to match each other.

 

This doesn't actually happen though. Can someone clarify what is meant by this statement or correct the documentation?

 

v-yingjl
Community Support
Community Support

Hi @jhamil560 ,

About similarity threshold in fuzzy match, here is the explaination in Power Query editor in Power BI Desktop:

Similarity Threshold:

Indicates how similar two values need to be in order to match. The minimum value of 0.00 causes all values to match each other. The maximum value of 1.00 only allows exact matches. The default value is 0.80.

 

To check the calculation of this threshold, you can create a table in Power Query online(dataflow) to use cluster values feature, please refer this document which introduces it in details:

How fuzzy matching works in Power Query? 

 

Note:

Currently only the Cluster values feature in Power Query Online will provide a new column with the similarity score.

 

Best Regards,
Community Support Team _ Yingjie Li
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

watkinnc
Super User
Super User

I don't know, maybe it looks for length (different), characters in same position (none), who can tell what the computer sees--  but I bet that if you simply add a column that removes spaces to both queries--maybe only to just one of them, and they'll probably fuzzy match. Also, make sure you have the Match by combining text parts checked.

 

--Nate


I’m usually answering from my phone, which means the results are visualized only in my mind. You’ll need to use my answer to know that it works—but it will work!!
Greg_Deckler
Super User
Super User

@jhamil560 Have you read through this article? How fuzzy matching works in Power Query? - Power Query | Microsoft Docs

 

If you want to "roll your own", you can use this: https://community.powerbi.com/t5/Quick-Measures-Gallery/Fuzzy/m-p/1352914#M608

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

I did read that yes, but it doesn't explain how they are coming up with the similarity values. That is what I am curious about. I would like to understand the logic being used. When i researched it was shown to be the Jaccard similarity algorithm, which is the length of intersection divided by length of union. But is power query looking at unique words as '1' element in a string? Is it just looking at overall unique list of letters/symbols? I would like to see a step by step example of what Power BI is doing, just to understand it better.

@jhamil560 Try creating a concatenated value column in each table and then doing a Fuzzy match based on those single columns. Perhaps using multiple columns is causing slightly different behavior and it is matching only when one column is equal to another and hence the .5. Half of the columns match?


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

Thanks, I'll look into it and see what that does. What is interesting is I was mostly getting good results. So if there was an issue using two columns I would have expected it to be an issue on the entire data set. But it's only an issue in a small number of rows not matching. I've found other rows that did get a match that had the exact same difference as my example, but maybe the names were longer/shorter. That's why I was trying to understand the calculation being done by the software.

Helpful resources

Announcements
Power BI Show Episode 10 Recap

The Power BI Community Show

Watch the playback when Amit Chandak, a Power BI Super User, demos how to use Field Parameters to make reports more dynamic.

Power BI Dev Camp Session 26

New Date - Check it Out!

Mark your calendars and join us on Thursday, October 6 at 11a PDT for a great session with Ted Pattison!

Health and Life Sciences Power BI User Group

Health and Life Sciences Power BI User Group

Power BI specialists at Microsoft have created a community user group where customers in the provider, payor, pharma, health solutions, and life science industries can collaborate.

Ignite 2022

What's Next at Microsoft Ignite 2022

Explore the latest innovations, learn from product experts and partners, level up your skillset, and create connections from around the world.

Top Kudoed Authors