09-07-2020 17:40 PM - last edited 07-28-2021 09:53 AM
Well, at the start of this past Labor Day weekend, I can safely say that the LAST thing that I thought that I would be doing is coding fuzzy text matching logic in DAX. Well, here we are. This came about because of this thread here posted by @Anonymous. It's quite the long thread with a number of flawed attempts, lying (I don't need fuzzy matching) and a lot of me complaining that it was pretty much impossible. Turns out, it isn't and "fuzzy" matching was required. Anyway, I should know better by now that there really isn't too much that is impossible if you twist your brain just right. Anyway the goal was to match the client name up with the closest matching project name only the data is pretty much trash.
This is a slightly improved version from the thread. Included a matching threshold that can help speed it up as well as serve as a threshold for matching (so you don't match just the letter "M" for example).
Fuzzy = VAR __MatchWord = MAX([Client Name]) VAR __CleanMatchThreshold = 4 VAR __KillThreshold = 3 VAR __FuzzyThreshold1 = .4 VAR __FuzzyThreshold2 = .8 VAR __WordSearchTable = GENERATE( 'Ongoing Projects', VAR __Word = MAXX(FILTER('Ongoing Projects',[Index]=EARLIER('Ongoing Projects'[Index])),[Ongoing Projects]) RETURN ADDCOLUMNS(GENERATESERIES(3,LEN(__Word),1),"Search",LEFT(__Word,[Value]),"Original",__Word) ) VAR __Table = FILTER( ADDCOLUMNS( __WordSearchTable, "Match",SEARCH([Search],__MatchWord,,BLANK()) ), NOT(ISBLANK([Match])) ) VAR __Max = MAXX(__Table,[Value]) VAR __Match = MAXX(FILTER(__Table,[Value]=__Max),[Search]) VAR __Proposed = IF( LEN(__Match)<=__CleanMatchThreshold, SWITCH(TRUE(), //__Clean2 = "ABB",__Clean2, COUNTROWS(FILTER(__Table,[Value]=__Max))>1,BLANK(), LEN(__Match) <= __KillThreshold,BLANK(), LEN(__Match) = LEN(__MatchWord),__Match, LEN(__Match)/LEN(__MatchWord)>__FuzzyThreshold1 && SEARCH(__Match,__MatchWord,,0)=1,__Match, LEN(__Match)/LEN(__MatchWord)>__FuzzyThreshold2,__Match, BLANK() ), SWITCH(TRUE(), __Match = "Blue Cross" || __Match = "Blue Cross ",__Match, LEN(__Match)/LEN(__MatchWord)<__FuzzyThreshold2 && SEARCH(__Match,__MatchWord,,0)<>1,BLANK(), __Match ) ) VAR __Clean1 = IF(RIGHT(__Proposed,1)="(",LEFT(__Proposed,LEN(__Proposed)-1),__Proposed) VAR __Clean2 = IF(RIGHT(__Clean1,1)=" ",LEFT(__Clean1,LEN(__Clean1)-1),__Clean1) RETURN __Clean2
Couple of usage notes. __CleanMatchThreshold is basically the number of characters by which if it is less than or equal to this number, more intense scrutiny is done. Above this number, we do a percentage check of matching characters versus the total Client Name just to verify a match. When falling below __CleanMatchThreshold, we check to see if there were multiple matches of the same length and if so, we throw them out. We also check if it is below or equal to the __KillThreshold and if so, throw it out. More checking ensues. Note that in both sets of checking (above and below __CleanMatchThreshold) we can put in as many exceptions as we want to ensure that specific matches fall through to the end. Something like this is almost inevitable with fuzzy matching. It is "fuzzy" after all!!
Is it true fuzzy matching logic. Eh. A fuzzy matching data scientist might object. So too a mathematician; you know, because they're a mathematician and all. But, in the end is it more complex and accurate than just using SEARCH. Yes. Is it scalable and performant? Probably not! Did you miss the part about using DAX to do fuzzy text matching? Hey Greg you idiot, why didn't you just use Power Query's fuzzy merge logic? We tried that; didn't work, go away. Enjoy! 🙂
The attached PBIX has some earlier attempts and such in it. You can see how it evolved.