Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
BrandyBurk
Frequent Visitor

Fuzzy Match Score Wrong

I'm seeing something wrong with the fuzzy match scoring on some of the matches and hoping someone can help me understand why this is occuring.  I have two tables.  My first table called "AS Employee ID" has around 500 rows with two columns called "Request ID" and "AssetSteward Leader Name".  The asset steward name column is text and is a free form field that holds a person's name.  Unfortunately, the user can enter the name any number of ways...first name last name or last name comma first name, etc...

 

My 2nd table named "Employee" has over 35,000 records and each row is an employee record.  There is a text column called "EmpName".  All names are entered Last Name then a comma with a space followed by the first name.  Sometimes there may be another space and a middle initial or middle name on the end.  Not always though.

 

I've merged the tables using a simple outer fuzzy join which works great on most matches.  However, there are some that just don't make any sense.  An example is the match for an asset steward named "Sayers, John".  The merge details and results are shown below.  I don't understand why it is scoring what should be the closest match (row 7) with a score of 0.83.  Row 7 is almost an identical match minus the space and an "O" MI on the end.  What is going on with this?  Ultimately, I want to limit the matches to one but that gives me "Waters, John" instead of "Sayers, John O" which obviously isn't correct.  The score just doesn't seem to make any sense based on what I've read.  It seems to be only scoring the first name instead of the entire string and disregarding the last name completely.  Thank you for the help!  This is going to drive me to drink.

 

= Table.FuzzyNestedJoin(#"Expanded Employee", {"AssetSteward Leader Name"}, Employee, {"EmpName"}, "Employee", JoinKind.LeftOuter, [IgnoreCase=false, IgnoreSpace=false, NumberOfMatches=20, Threshold=.65, SimilarityColumnName="Score"])

BrandyBurk_1-1651864251917.png

 

 

BrandyBurk_0-1651864103379.png

 

2 ACCEPTED SOLUTIONS
Vijay_A_Verma
Super User
Super User

Actually documentation is clear that "The best scenario for applying the fuzzy match algorithm is when all text strings in a column contain only the strings that need to be compared and no extra components. For example, comparing Apples against 4ppl3s yields higher similarity scores than comparing Apples to My favorite fruit, by far, is Apples. I simply love them!. Because the word Apples in the second string is only a small part of the whole text string, that comparison yields a lower similarity score."

https://docs.microsoft.com/en-us/power-query/fuzzy-matching

Hence, to get the right match, you need to have a single word. Hence, you need to match SayersJohn not Sayers, John.

Just have one intermediate step. In first table i.e. where you want match insert one more column which should have following formula (please replace Name with your column name)

= Text.Replace([Name],", ","")

 Now, you should perform fuzzy matching on this column to get right result. (You need not make this change in lookup table).

After fuzzy matching, you can delete this column from your result table.

View solution in original post

Problem is that you applied the formula on both the columns. I specifically wrote this that you need to apply this only in source column. You are supposed to apply this formula only in AssetSteward Leader Name column. Don't apply this formula on Emp Name column.

Then you would get right result.

I have demonstrated this in this file where I have shown merge as you were originally doing, then merge where I applied the formula on both source and target columns and merge where I applied the formula on source column only - https://1drv.ms/u/s!Akd5y6ruJhvhuW0f_3h5TxwnKmX7?e=JzLdWS 

View solution in original post

4 REPLIES 4
Vijay_A_Verma
Super User
Super User

Actually documentation is clear that "The best scenario for applying the fuzzy match algorithm is when all text strings in a column contain only the strings that need to be compared and no extra components. For example, comparing Apples against 4ppl3s yields higher similarity scores than comparing Apples to My favorite fruit, by far, is Apples. I simply love them!. Because the word Apples in the second string is only a small part of the whole text string, that comparison yields a lower similarity score."

https://docs.microsoft.com/en-us/power-query/fuzzy-matching

Hence, to get the right match, you need to have a single word. Hence, you need to match SayersJohn not Sayers, John.

Just have one intermediate step. In first table i.e. where you want match insert one more column which should have following formula (please replace Name with your column name)

= Text.Replace([Name],", ","")

 Now, you should perform fuzzy matching on this column to get right result. (You need not make this change in lookup table).

After fuzzy matching, you can delete this column from your result table.

Thanks for the guidance.  I actually added the space in one of my attempts to get more accurate matching.  I did as you instructed, removed it and it still gives me the same results.  It really looks like it is only matching the string after the comma and giving no consideration to the string before the comma.  Forgive me but I don't see ANY scenario where "Hayes, John" should be scored higher than "Sayers, John" when comparing both to "Sayers, John O".  With or without a space.  "Sayers, John" is an EXACT match minus two characters (a space and an "O") on the end.  Why in the world would "Hayes, John" ever be considered a closer match? 

 

Any other ideas to try?

 

Here is what I get when removing the space after the comma.  

BrandyBurk_0-1651953187590.png

 

 

Problem is that you applied the formula on both the columns. I specifically wrote this that you need to apply this only in source column. You are supposed to apply this formula only in AssetSteward Leader Name column. Don't apply this formula on Emp Name column.

Then you would get right result.

I have demonstrated this in this file where I have shown merge as you were originally doing, then merge where I applied the formula on both source and target columns and merge where I applied the formula on source column only - https://1drv.ms/u/s!Akd5y6ruJhvhuW0f_3h5TxwnKmX7?e=JzLdWS 

OK.  I went back and removed the comma from the Asset Steward Name coulmn only and performed the fuzzy merge and it worked! I don't understand why but I am just going to take it and move on.   Thanks for the help!  

 

 

 

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.