Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
zahlenschubser
Helper III
Helper III

LIKE operator & customer name analysis

I have a rather big table with customer first and last names and need to check if they are similar, so I can then check if they are actually different people or just entered with typos.

 

If I do left() and right() with low numbers on the first and last name columns - as there is no LIKE operator in PBI and so I can hopefully catch most of the entries I want to filter for - the query takes way too long to complete because it has to go through so many entries and variations to evaluate.

 

Any ideas?

14 REPLIES 14
zahlenschubser
Helper III
Helper III

Ok, I think I found a way to work around this, in a way, with the levenshtein likeness calculation.

I managed to implement it as a custom function in the query editor as
---
= (s1 as nullable text, s2 as nullable text) =>
if s1 = null or s2 = null then "" else
let
l1 = Text.Length(s1),
l2 = Text.Length(s2),
d = List.Generate(
() => [i=0, j=0, cost=0],
each [i] <= l1,
each if [j] = l2 then [i]+1 else null,
each
if [i] = 0 then
[i=1, j=1, cost=if s1{0} = s2{0} then 0 else 1]
else if s1{[i]-1} = s2{[j]-1} then
[i=[i]+1, j=[j]+1, cost=[cost]]
else
let
ins = [i=[i], j=[j]+1, cost=[cost]+1],
del = [i=[i]+1, j=[j], cost=[cost]+1],
sub = [i=[i]+1, j=[j]+1, cost=[cost]+1]
in
List.Min({ins, del, sub}, each [i]<=l1 and [j]<=l2)
)
in
List.Last(d)[cost]
---

but I am unsure how to actually use it to add a new calculated column, as I need to compare the value in [name] to the earlier entry(entries) in the same column, filtered by a shared value in a different column.

I tried sorting the whole thing by the filtering column and then use
= Table.AddColumn(#"Hinzugefügter Index", "Name_likeness", each LSTEIN([name], List.Range([name],[Index]-1,1)))
but that got me an error message.

halfglassdarkly
Resolver IV
Resolver IV

You could also try out using Fuzzy Merge or Cluster values in Power Query, see https://learn.microsoft.com/en-us/power-query/fuzzy-matching 

 

I've had mixed success for that myself, but if using Excel isn't too far out of scope for you I always get reliable fuzzy matching results using the free Fuzzy Lookup add in for Excel https://www.microsoft.com/en-nz/download/details.aspx?id=15011 

I don't see the cluster option in my menu.

zahlenschubser_0-1679988594047.png

 

Oh sorry, I haven't used cluster values myself, just saw the cluster values feature is available only for Power Query Online.

 

The fuzzy merge option is available in vanilla Power Query when  merging queries though. I haven't found it very reliable compared to the excel add in I mentioned, but your mileage may vary.

 

The Power Query fuzzy merge is probably ok for names matching, but isn't very good at identifying similarity for e.g where one string is a keyword and the other string is a longer text to search. The excel add in seems to perform well for both scenarios.

https://numidiabi.wordpress.com/2020/08/24/power-query-table-addfuzzyclustercolumn/

gave me some ideas, but without a "correct" column to match the names against this doesn't work I think.

You can use the standard merge data interface and select the same table for both sides of the join using the fuzzy match option.

Good idea, but the resulting merge query (I have 717k lines in the table I need to check) wasn't done even after letting it run for ten minutes.

Ouch, yep, that will take a while.

One suggestion, if your customer names are in a fact table, try duplicating the query, remove columns other than the customer name (and any other customer ID if you have one e.g email) and deduplicate. I don't know how many times a customer is repeated throughout your data, but that could considerably reduce the amount of records power query has to compare.

The de-duplicate did reduce the lines a lot, but still far from a manageable amount sadly.

I'm trying to manually add in the column - as that seems to work according to another post - according to https://learn.microsoft.com/en-us/powerquery-m/table-addfuzzyclustercolumn , but I didn't have much success yet.

 

My data source is mysql that gets directly imported into Power BI, so Excel isn't really an option sadly.

v-shex-msft
Community Support
Community Support

Hi @zahlenschubser,

You can take a look at the following blog about string comparison from SQL to Dax if help with you scenarios:

From SQL to DAX: String Comparison 

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accept as solution to help other members find it more quickly.

I tried to use the ideas from that one, but the necessary granularity makes the resulting query take way too long sadly.

rajulshah
Super User
Super User

@zahlenschubser Can you please provide any sample data?

Like this for example. The first and last two are pretty sure the same person, but last and first name(s) are written differently. It's also often a combination of typos, missing letters, or swapped letters, like v <> w.

zahlenschubser_1-1679565553371.png

 

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.