Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!
Have data sets where there are names, in the first column, client names in the second, and Client start date in the third.
There are thousands of these records with thousands of names/clients/client start dates. The name is entered each time the person begins with a new client such that each person has many entries in the name column. Often the names were not entered in a consistent way. With and without middle initial, middle name, or various abbreviations such as ",RN" at the end of the name.
Is there a way to do fuzzy matching so that the names in name column get replaced with a "standardized" format - where some type of machine learning can pick the most common spelling of each repeat name and replace the different variations with the common spelling?
I included an example below. First table includes the names with the various spellings. Second table depicts what I hope to achieve.
Again - this is on a large scale - there are something like 10,000 records with names that need to be standardized.
Name | Client | Client Start Date |
John Good | Client 1 | 1/1/2020 |
Joe Jackson | Client 2 | 6/1/2020 |
Bob A. Barker | Client 3 | 8/1/2020 |
John B. Good | Client 4 | 10/1/2020 |
Joe J. Jackson | Client 5 | 12/1/2020 |
Bob Allen Barker | Cleint 6 | 1/1/2021 |
John Good | Client 7 | 5/1/2021 |
Joe Jack Jackson | Client 8 | 8/1/2021 |
Bob Barker | Client 9 | 12/1/2021 |
Name | Client | Client Start Date |
John Good | Client 1 | 1/1/2020 |
Joe J. Jackson | Client 2 | 6/1/2020 |
Bob A. Barker | Client 3 | 8/1/2020 |
John Good | Client 4 | 10/1/2020 |
Joe J. Jackson | Client 5 | 12/1/2020 |
Bob A. Barker | Cleint 6 | 1/1/2021 |
John Good | Client 7 | 5/1/2021 |
Joe J. Jackson | Client 8 | 8/1/2021 |
Bob A. Barker | Client 9 | 12/1/2021 |
THANKS!
Gregg Powell
Arizona, USA
Solved! Go to Solution.
Hi @GreggPowell ,
You can use the fuzzy merge or Cluster in Power Query.
1. Fuzzy merge: https://docs.microsoft.com/power-query/merge-queries-fuzzy-match
Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances. Setting the similarity threshold to 0.4 will work well with your example data. Here are the steps of my test.
let
tab = Table.Distinct(Table.SelectColumns(Table, "Name")),
Source = Table.FuzzyNestedJoin(tab, {"Name"}, tab, {"Name"}, "JoinName", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.4]),
#"Expanded JoinName" = Table.ExpandTableColumn(Source, "JoinName", {"Name"}, {"JoinName.Name"}),
#"Grouped Rows" = Table.Group(#"Expanded JoinName", {"Name"}, {{"RankTab", each _, type table [Name=nullable text, JoinName.Name=nullable text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddRankColumn([RankTab],"NameRank",
{"JoinName.Name", Order.Ascending})),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}),
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Name", "JoinName.Name", "NameRank"}, {"Name", "JoinName.Name", "NameRank"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each ([NameRank] = 1))
in
#"Filtered Rows"
2. Cluster: https://docs.microsoft.com/power-query/cluster-values
Cluster values automatically create groups with similar values using a fuzzy matching algorithm, and then maps each column's value to the best-matched group. But the Cluster values feature is available only for Power Query Online.
In the Cluster values dialog box, confirm the column that you want to use to create the clusters from, and enter the new name of the column. The result of that operation yields the result shown in the image.
If the problem is still not resolved, please provide detailed error information or the expected result you expect. Let me know immediately, looking forward to your reply.
Best Regards,
Winniz
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @GreggPowell ,
You can use the fuzzy merge or Cluster in Power Query.
1. Fuzzy merge: https://docs.microsoft.com/power-query/merge-queries-fuzzy-match
Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances. Setting the similarity threshold to 0.4 will work well with your example data. Here are the steps of my test.
let
tab = Table.Distinct(Table.SelectColumns(Table, "Name")),
Source = Table.FuzzyNestedJoin(tab, {"Name"}, tab, {"Name"}, "JoinName", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.4]),
#"Expanded JoinName" = Table.ExpandTableColumn(Source, "JoinName", {"Name"}, {"JoinName.Name"}),
#"Grouped Rows" = Table.Group(#"Expanded JoinName", {"Name"}, {{"RankTab", each _, type table [Name=nullable text, JoinName.Name=nullable text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddRankColumn([RankTab],"NameRank",
{"JoinName.Name", Order.Ascending})),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}),
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Name", "JoinName.Name", "NameRank"}, {"Name", "JoinName.Name", "NameRank"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each ([NameRank] = 1))
in
#"Filtered Rows"
2. Cluster: https://docs.microsoft.com/power-query/cluster-values
Cluster values automatically create groups with similar values using a fuzzy matching algorithm, and then maps each column's value to the best-matched group. But the Cluster values feature is available only for Power Query Online.
In the Cluster values dialog box, confirm the column that you want to use to create the clusters from, and enter the new name of the column. The result of that operation yields the result shown in the image.
If the problem is still not resolved, please provide detailed error information or the expected result you expect. Let me know immediately, looking forward to your reply.
Best Regards,
Winniz
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
This looks hopeful as a solution...
A Few questions:
You stated "Open a new blank query and paste the code." So, I'm not 100% following this - when I am in the power query editor, and I have already added a web element as a data source that pulls a table from a csv file in sharepoint -
at this point, I try to "Open a new blank query" -
and I past in the code provided above....
But - My query that I just created is in a different format (assuming a text format)
Compared to what you showed in your example above... Yours is in a table format...
So - where did I go wrong?
Hi @GreggPowell ,
After opening a blank query, we need to paste the code in the Advanced Editor.
If we paste directly in the formula bar, we are actually pasting the code into the double quotes in the image below, i.e., the output is text.
Best Regards,
Winniz
@GreggPowell do you use datamart?
I don't know what datamart is. Never heard of it.
r/
Gregg
@GreggPowell just so you know, if you wan to achieve scale and have a performant query over a large data source and if you have premium and datamart, you can easily write a SQL query to achieve what you need (in fraction of the speed the same task performed by PQ).
The query may be optimized and I was just trying to prove the point. If you have premium
->datamart, ingest the data, get the server string and transform the data on the AZURE SQL server and query the server to get the data. With the task in hand
declare @t1 as table(name varchar(max),id int)
insert into @t1
select* from
(values('John Good',1), ('Joe Jackson',2),('Bob A. Barker',3),('John B. Good',4),('Joe J. Jackson',5),('Bob Allen Barker',6),('John Good',7),('Joe Jack Jackson',8),('Bob Barker',9)) t(a,b)
;with base as (select a.name,a.id, value from @t1 a
CROSS APPLY
string_split(a.name,' '))
, t1 as (select *
,row_number() over ( partition by id order by name) as row#
, count(value) over ( partition by id ) as [mx]
, 1 as [mn] from base )
,t2 as (select * from t1 where row#=mx or row#=mn)
,t3 as(select id, STRING_AGG(value,' ') SID from t2 group by id)
,t4 as (select t1.*,t3.SID from t1
left outer join t3 on t1.id=t3.id)
,t05 as (select ID,SID,STRING_AGG(value,' ') as PAT from t4 group by ID,SID)
,t06 as(
select
SID,PAT,COUNT(PAT)as PATCOUNT
from t05
group by SID,PAT)
,t07 as (
select a.SID,b.PAT
from
(select SID from t06
group by SID
HAVING max(PATCOUNT)=1) a
inner join
( select * from t06 b where PAT like '%.%' ) b
on a.SID=b.SID
union all
select SID,PAT from t06 where PATCOUNT>1)
select a.id, a.name, b.PAT as MLName
from t4 a
inner join t07 b on a.SID=b.SID
group by a.id, a.name, a.SID,b.PAT
So, we do not use datamart. The data actually exists in a 3rd party tool (Monday.com) We pull the data daily using APIs - and put it into simple csv file tables in sharepoint. From the csv file tables in sharepoint, we build power BI reports. Attempting to use datamart as a respository is not something we are considering. The names need to be standardized actually so we can then go back and more easily assign ID numbers to them.
@GreggPowell then you need to translate back this SQL to PQ and it will spin out the output but much slower. Additionally, I don't know if a ML exists for such a task and if yes, and if you have AZURE subscription, you need to utilize (probably AZURE Data Factory in conjunction with Azure ML), and AZURE ML and Power BI (assuming that ML exits in Azure ML). If you don't have any of the above, you are limited to PQ. It will do the job as long as you ask it how to peel the layers of this onion.
@GreggPowell if you happen to operate in a premium workspace, you can utilize the newly introduced
datamart as it enables you to write fully qualified SQL to transform data to achieve scale.