cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
GreggPowell
Frequent Visitor

Fuzzy name matching to standardize names in a single column

Have data sets where there are names, in the first column, client names in the second, and Client start date in the third. 

There are thousands of these records with thousands of names/clients/client start dates. The name is entered each time the person begins with a new client such that each person has many entries in the name column. Often the names were not entered in a consistent way. With and without middle initial, middle name, or various abbreviations such as ",RN" at the end of the name.

Is there a way to do fuzzy matching so that the names in name column get replaced with a "standardized" format - where some type of machine learning can pick the most common spelling of each repeat name and replace the different variations with the common spelling?

I included an example below. First table includes the names with the various spellings. Second table depicts what I hope to achieve.

Again - this is on a large scale - there are something like 10,000 records with names that need to be standardized.

NameClientClient Start Date
John GoodClient 11/1/2020
Joe JacksonClient 26/1/2020
Bob A. BarkerClient 38/1/2020
John B. GoodClient 410/1/2020
Joe J. JacksonClient 512/1/2020
Bob Allen BarkerCleint 61/1/2021
John GoodClient 75/1/2021
Joe Jack JacksonClient 88/1/2021
Bob BarkerClient 912/1/2021
   
NameClientClient Start Date
John GoodClient 11/1/2020
Joe J. JacksonClient 26/1/2020
Bob A. BarkerClient 38/1/2020
John GoodClient 410/1/2020
Joe J. JacksonClient 512/1/2020
Bob A. BarkerCleint 61/1/2021
John GoodClient 75/1/2021
Joe J. JacksonClient 88/1/2021
Bob A. BarkerClient 912/1/2021




THANKS!

Gregg Powell

Arizona, USA

 

1 ACCEPTED SOLUTION
v-kkf-msft
Community Support
Community Support

Hi @GreggPowell ,

 

You can use the fuzzy merge or Cluster in Power Query.

 

1. Fuzzy merge: https://docs.microsoft.com/power-query/merge-queries-fuzzy-match 

Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances. Setting the similarity threshold to 0.4 will work well with your example data. Here are the steps of my test.

 

  • Open a new blank query and paste the code.
let
    tab = Table.Distinct(Table.SelectColumns(Table, "Name")),
    Source = Table.FuzzyNestedJoin(tab, {"Name"}, tab, {"Name"}, "JoinName", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.4]),
    #"Expanded JoinName" = Table.ExpandTableColumn(Source, "JoinName", {"Name"}, {"JoinName.Name"}),
    #"Grouped Rows" = Table.Group(#"Expanded JoinName", {"Name"}, {{"RankTab", each _, type table [Name=nullable text, JoinName.Name=nullable text]}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddRankColumn([RankTab],"NameRank",
    {"JoinName.Name", Order.Ascending})),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Name", "JoinName.Name", "NameRank"}, {"Name", "JoinName.Name", "NameRank"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each ([NameRank] = 1))
in
    #"Filtered Rows"

vkkfmsft_0-1655790691747.png

 

  • Then merge Table and Query1.

vkkfmsft_1-1655790742869.png

 

  • Get the result.

vkkfmsft_2-1655790783721.png

 

2. Cluster: https://docs.microsoft.com/power-query/cluster-values 

Cluster values automatically create groups with similar values using a fuzzy matching algorithm, and then maps each column's value to the best-matched group. But the Cluster values feature is available only for Power Query Online.

In the Cluster values dialog box, confirm the column that you want to use to create the clusters from, and enter the new name of the column. The result of that operation yields the result shown in the image.

 

vkkfmsft_3-1655791113985.png

 

If the problem is still not resolved, please provide detailed error information or the expected result you expect. Let me know immediately, looking forward to your reply.
Best Regards,
Winniz
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

9 REPLIES 9
v-kkf-msft
Community Support
Community Support

Hi @GreggPowell ,

 

You can use the fuzzy merge or Cluster in Power Query.

 

1. Fuzzy merge: https://docs.microsoft.com/power-query/merge-queries-fuzzy-match 

Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances. Setting the similarity threshold to 0.4 will work well with your example data. Here are the steps of my test.

 

  • Open a new blank query and paste the code.
let
    tab = Table.Distinct(Table.SelectColumns(Table, "Name")),
    Source = Table.FuzzyNestedJoin(tab, {"Name"}, tab, {"Name"}, "JoinName", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.4]),
    #"Expanded JoinName" = Table.ExpandTableColumn(Source, "JoinName", {"Name"}, {"JoinName.Name"}),
    #"Grouped Rows" = Table.Group(#"Expanded JoinName", {"Name"}, {{"RankTab", each _, type table [Name=nullable text, JoinName.Name=nullable text]}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddRankColumn([RankTab],"NameRank",
    {"JoinName.Name", Order.Ascending})),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Name", "JoinName.Name", "NameRank"}, {"Name", "JoinName.Name", "NameRank"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each ([NameRank] = 1))
in
    #"Filtered Rows"

vkkfmsft_0-1655790691747.png

 

  • Then merge Table and Query1.

vkkfmsft_1-1655790742869.png

 

  • Get the result.

vkkfmsft_2-1655790783721.png

 

2. Cluster: https://docs.microsoft.com/power-query/cluster-values 

Cluster values automatically create groups with similar values using a fuzzy matching algorithm, and then maps each column's value to the best-matched group. But the Cluster values feature is available only for Power Query Online.

In the Cluster values dialog box, confirm the column that you want to use to create the clusters from, and enter the new name of the column. The result of that operation yields the result shown in the image.

 

vkkfmsft_3-1655791113985.png

 

If the problem is still not resolved, please provide detailed error information or the expected result you expect. Let me know immediately, looking forward to your reply.
Best Regards,
Winniz
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

This looks hopeful as a solution...

A Few questions:

You stated "Open a new blank query and paste the code."  So, I'm not 100% following this - when I am in the power query editor, and I have already added a web element as a data source that pulls a table from a csv file in sharepoint -

GreggPowell_0-1656693893917.png

 

at this point, I try to "Open a new blank query" - 

GreggPowell_1-1656694024144.png

and I past in the code provided above....

GreggPowell_2-1656694104489.png

But - My query that I just created is in a different format (assuming a text format)

GreggPowell_3-1656694171007.png

 

Compared to what you showed in your example above... Yours is in a table format...

GreggPowell_4-1656694222314.png


So - where did I go wrong?




Hi @GreggPowell ,

 

After opening a blank query, we need to paste the code in the Advanced Editor. 

 

vkkfmsft_0-1657094590088.png

vkkfmsft_1-1657094620342.png

 

If we paste directly in the formula bar, we are actually pasting the code into the double quotes in the image below, i.e., the output is text.

 

vkkfmsft_2-1657094804599.png

 

Best Regards,
Winniz

 

smpa01
Super User
Super User

@GreggPowell  do you use datamart?





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!






New Animated Dashboard: Sales Calendar


I don't know what datamart is. Never heard of it.

r/

Gregg

@GreggPowell  just so you know, if you wan to achieve scale and have a performant query over a large data source and if you have premium and datamart, you can easily write a SQL query to achieve what you need (in fraction of the speed the same task performed by PQ). 

 

The query may be optimized and I was just trying to prove the point. If you have premium

->datamart, ingest the data, get the server string and transform the data on the AZURE SQL server and query the server to get the data. With the task in hand

 

declare @t1 as table(name varchar(max),id int)
insert into @t1
select* from
(values('John Good',1), ('Joe Jackson',2),('Bob A. Barker',3),('John B. Good',4),('Joe J. Jackson',5),('Bob Allen Barker',6),('John Good',7),('Joe Jack Jackson',8),('Bob Barker',9)) t(a,b)

;with base as (select a.name,a.id, value from @t1 a
CROSS APPLY 
string_split(a.name,' '))
, t1 as (select * 
,row_number() over ( partition by id order by name) as row#
, count(value) over ( partition by id ) as [mx]
, 1 as [mn] from base )
,t2 as (select * from t1 where row#=mx or row#=mn)
,t3 as(select id, STRING_AGG(value,' ') SID  from t2 group by id)
,t4 as (select t1.*,t3.SID from t1 
left outer join t3 on t1.id=t3.id)
,t05 as (select  ID,SID,STRING_AGG(value,' ') as PAT  from t4 group by ID,SID)
,t06 as(
select 
SID,PAT,COUNT(PAT)as PATCOUNT
from t05 
group by SID,PAT)
,t07 as (
select a.SID,b.PAT
from
(select SID from t06 
group by SID
HAVING max(PATCOUNT)=1) a
inner join 
( select * from t06 b where PAT like '%.%' ) b
on a.SID=b.SID
union all
select SID,PAT from t06 where PATCOUNT>1)
select a.id, a.name, b.PAT as MLName
from t4 a 
inner join t07 b on a.SID=b.SID
group by a.id, a.name, a.SID,b.PAT 

 

 

 

smpa01_0-1655313685959.png

 

 





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!






New Animated Dashboard: Sales Calendar


So, we do not use datamart. The data actually exists in a 3rd party tool (Monday.com) We pull the data daily using APIs - and put it into simple csv file tables in sharepoint. From the csv file tables in sharepoint, we build power BI reports. Attempting to use datamart as a respository is not something we are considering. The names need to be standardized actually so we can then go back and more easily assign ID numbers to them. 

 

@GreggPowell then you need to translate back this SQL to PQ and it will spin out the output but much slower. Additionally, I don't know if a ML exists for such a task and if yes, and if you have AZURE subscription, you need to utilize (probably AZURE Data Factory in conjunction with Azure ML), and AZURE ML and Power BI (assuming that ML exits in Azure ML). If you don't have any of the above, you are limited to PQ. It will do the job as long as you ask it how to peel the layers of this onion.





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!






New Animated Dashboard: Sales Calendar


@GreggPowell  if you happen to operate in a premium workspace, you can utilize the newly introduced 

datamart as it enables you to write fully qualified SQL to transform data to achieve scale.





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!






New Animated Dashboard: Sales Calendar


Helpful resources

Announcements
August 1 episode 9_no_dates 768x460.jpg

The Power BI Community Show

Watch the playback when Priya Sathy and Charles Webb discuss Datamarts! Kelly also shares Power BI Community updates.

Power BI Dev Camp Session 24 without aka link and time 768x460.jpg

Ted's Dev Camp - July 28, 2022

Watch Session 24 of Ted's Dev Camp along with past sessions!

Power Platform Conf 2022 768x460.jpg

Join us for Microsoft Power Platform Conference

The first Microsoft-sponsored Power Platform Conference is coming in September. 100+ speakers, 150+ sessions, and what's new and next for Power Platform.