LIKE operator & customer name analysis

zahlenschubser · ‎03-23-2023

I have a rather big table with customer first and last names and need to check if they are similar, so I can then check if they are actually different people or just entered with typos.

If I do left() and right() with low numbers on the first and last name columns - as there is no LIKE operator in PBI and so I can hopefully catch most of the entries I want to filter for - the query takes way too long to complete because it has to go through so many entries and variations to evaluate.

Any ideas?

zahlenschubser · ‎03-31-2023

Ok, I think I found a way to work around this, in a way, with the levenshtein likeness calculation.

I managed to implement it as a custom function in the query editor as
---
= (s1 as nullable text, s2 as nullable text) =>
if s1 = null or s2 = null then "" else
let
l1 = Text.Length(s1),
l2 = Text.Length(s2),
d = List.Generate(
() => [i=0, j=0, cost=0],
each [i] <= l1,
each if [j] = l2 then [i]+1 else null,
each
if [i] = 0 then
[i=1, j=1, cost=if s1{0} = s2{0} then 0 else 1]
else if s1{[i]-1} = s2{[j]-1} then
[i=[i]+1, j=[j]+1, cost=[cost]]
else
let
ins = [i=[i], j=[j]+1, cost=[cost]+1],
del = [i=[i]+1, j=[j], cost=[cost]+1],
sub = [i=[i]+1, j=[j]+1, cost=[cost]+1]
in
List.Min({ins, del, sub}, each [i]<=l1 and [j]<=l2)
)
in
List.Last(d)[cost]
---

but I am unsure how to actually use it to add a new calculated column, as I need to compare the value in [name] to the earlier entry(entries) in the same column, filtered by a shared value in a different column.

I tried sorting the whole thing by the filtering column and then use
= Table.AddColumn(#"Hinzugefügter Index", "Name_likeness", each LSTEIN([name], List.Range([name],[Index]-1,1)))
but that got me an error message.

halfglassdarkly · ‎03-27-2023

You could also try out using Fuzzy Merge or Cluster values in Power Query, see https://learn.microsoft.com/en-us/power-query/fuzzy-matching

I've had mixed success for that myself, but if using Excel isn't too far out of scope for you I always get reliable fuzzy matching results using the free Fuzzy Lookup add in for Excel https://www.microsoft.com/en-nz/download/details.aspx?id=15011

zahlenschubser · ‎03-28-2023

I don't see the cluster option in my menu.

halfglassdarkly · ‎03-28-2023

Oh sorry, I haven't used cluster values myself, just saw the cluster values feature is available only for Power Query Online.

The fuzzy merge option is available in vanilla Power Query when merging queries though. I haven't found it very reliable compared to the excel add in I mentioned, but your mileage may vary.

The Power Query fuzzy merge is probably ok for names matching, but isn't very good at identifying similarity for e.g where one string is a keyword and the other string is a longer text to search. The excel add in seems to perform well for both scenarios.

zahlenschubser · ‎03-28-2023

https://numidiabi.wordpress.com/2020/08/24/power-query-table-addfuzzyclustercolumn/

gave me some ideas, but without a "correct" column to match the names against this doesn't work I think.

halfglassdarkly · ‎03-28-2023

You can use the standard merge data interface and select the same table for both sides of the join using the fuzzy match option.

zahlenschubser · ‎03-28-2023

Good idea, but the resulting merge query (I have 717k lines in the table I need to check) wasn't done even after letting it run for ten minutes.

halfglassdarkly · ‎03-28-2023

Ouch, yep, that will take a while.

One suggestion, if your customer names are in a fact table, try duplicating the query, remove columns other than the customer name (and any other customer ID if you have one e.g email) and deduplicate. I don't know how many times a customer is repeated throughout your data, but that could considerably reduce the amount of records power query has to compare.

zahlenschubser · ‎03-29-2023

The de-duplicate did reduce the lines a lot, but still far from a manageable amount sadly.

zahlenschubser · ‎03-28-2023

I'm trying to manually add in the column - as that seems to work according to another post - according to https://learn.microsoft.com/en-us/powerquery-m/table-addfuzzyclustercolumn , but I didn't have much success yet.

My data source is mysql that gets directly imported into Power BI, so Excel isn't really an option sadly.

v-shex-msft · ‎03-27-2023

Hi @zahlenschubser,

You can take a look at the following blog about string comparison from SQL to Dax if help with you scenarios:

From SQL to DAX: String Comparison

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accept as solution to help other members find it more quickly.

zahlenschubser · ‎03-28-2023

I tried to use the ideas from that one, but the necessary granularity makes the resulting query take way too long sadly.

rajulshah · ‎03-23-2023

@zahlenschubser Can you please provide any sample data?

zahlenschubser · ‎03-23-2023

Like this for example. The first and last two are pretty sure the same person, but last and first name(s) are written differently. It's also often a combination of typos, missing letters, or swapped letters, like v <> w.

LIKE operator & customer name analysis

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly