cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
Helper IV
Helper IV

How to extract continuous string set from a column with multiple criteria

Hi,

I have a Column A with client names.
Table 1                                                                      Table 2
Client Name                                                              Ongoing Projects
Microsoft Corporation Ltd                                        Microsoft
BARCALAYS (LLP)                                                      Barcalays
Caseys General                                                          Casey's
LLP Ernst and Young                                                 Ernst and Young
ASTRAZENCALimited                                                Astrazanca

I want to search and extract the client names from Table 1 as below:
Result Column
Microsoft
Barcalays
Casey
Ernst and Young
Astrazenca

Basically the logic I am looking for is as below. The function/formula should check for continuous sequence of characters from Table B in Table A and return that entire sequence. Also this search should be case insensitive (For ex: BARCALAYS and Barcalays) and also the search text can be in between the within text column anywhere and not necessarily at the beginning (Ex: LLP Ernst and Youngand Ernst and Young).
One important criteria is it should return matches for continuous string greater than 2 (i.e; 3 and above).

I tried various formulae like Search, Find, Minx, Left but I am not getting the desired result with the above set of criteria met. Please help me with the logical formula for this.

Happy to give any additional info pertaining to this.

52 REPLIES 52

@Greg_DecklerThe latest file still have some flaws. Fuzzy Column: 'Tri', "Ste', 'Unit' these three have no ongoing projects, They need to be blank.

If we implement the logic that I proposed, then these would return blank. Since no projects are there with the string 'Tri P' , Search for characters in Project Name 'Sterling' will return only Sterling and 'Ste' will not get mapped to Steelwave, 'Unit' will not get mapped to 'United' -> similar to Sterling.
Explanation on Sterling:
While searching for the characters in cell 'Sterling' , up until 'Ste' the retun values will be Sterling and Steelwave. Now add the next character and search again. 'Ster' will return only 'Sterling' and eliminate Steelwave leaving it as a blank.
From what I understand it is a teeny bit of reverse mapping from that what we are following now.
Instead of searching for the whole string of Project Name, and returning all matches with the matching strings, why dont we set a criteria to search only until first character after first space in the Project Name, return all matching client names to the string set increment. 

Basically as the increment of the character for the cell by cell Project Name happens, the number of matching clients to this string will reduce leaving just 1 or 2 matching client names. 

Basically consider this, A is Project name.

A = "Barclays L"  --> taking only the characters until the first one after first space. Increment (i++) should stop after the first character after first space if found or end of string without any spaces.

for (A[0]='B', search for matches in client name,i++)

B=All client names matching A[0].
Now for (A[0][1]='Ba', search for matches in client name, i++)

B=All client names matching A[0][1]
.

.

.

Now for (A[0][1]...[8],search for matches,i++)

B=Barclays Corporation -> Client name matching to the searched project Name (retuning Client Name instead of matching strings)

The loop ends at [8] because A0]...[9] = 'Barclays L' is not present in the list of client names. Hence matching Barclays is mapped until A[8] and the mapped client name is Barclays Corporation is returned.

I hope I was able to convey the logic. My apologies for the above way of mentioning the logic.

And Dont worry about Wolters. It is the client name error. There has to be a space after Wolter.

@deepakramamurth Except it doesn't work that way. To implement the original algorithm in DAX I had to do this:

Let's take just the first 2 client names

1 Aaron's
2 ABB (CA VDA)

In order to emulate looping, which is impossible in DAX, this becomes:

1 1 A
1 2 Aa
1 3 Aar
1 4 Aaro
1 5 Aaron
1 6 Aaron'
1 7 Aaron's
2 1 A
2 2 AB
2 3 ABB

and so on, the entire list of projects gets blown out into a big table like this. Then, I go about adding a column to this table that says whether it matches the client or not, basically a 1 or a 0. Then, I filter out the zeros. Then I grab the MAX of the second column, that is my longest match.

This is looping in DAX.

So, explain again how your logic fits into this?

Steelwave is a Client Name and it matches Sterling up until 3 characters. The only match returned is Sterling, not Sterling and Steelwave because Steelwave is not an Ongoing Project name. So, you are left with Steelwave matching Sterling up until Ste. You can lower the thresholds but then you start running into issues with Dell, etc.


---------------------------------------

@ me in replies or I'll lose your thread!!!

I have a NEW book! 
DAX Cookbook from Packt
Over 120 DAX Recipes!




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!




@deepakramamurth - OK, this might actually be the one. I added a KillThreshold that was possible because I moved the cleaning to the end to maximum the length of the match. I got rid of the "Unit" match also. So, basically once I adjusted the cleaning of the special characters and spaces at the end, I was able to increase Threshold1 so that ABB is included but Unit is excluded. The only special exception is for Blue Cross and I'm not sure how else to handle that one quite honestly. Let me know if you find any obvious issues. Updated PBIX attached.

 

Fuzzy Column = 
    VAR __MatchWord = [Client Name]
    VAR __CleanMatchThreshold = 4
    VAR __KillThreshold = 3
    VAR __FuzzyThreshold1 = .4
    VAR __FuzzyThreshold2 = .8
    VAR __WordSearchTable = 
        GENERATE(
            'Ongoing Projects',
                VAR __Word = MAXX(FILTER('Ongoing Projects',[Index]=EARLIER('Ongoing Projects'[Index])),[Ongoing Projects])
            RETURN ADDCOLUMNS(GENERATESERIES(3,LEN(__Word),1),"Search",LEFT(__Word,[Value]),"Original",__Word)
        )
    VAR __Table =
        FILTER(
            ADDCOLUMNS(
                __WordSearchTable,
                "Match",SEARCH([Search],__MatchWord,,BLANK())
            ),
            NOT(ISBLANK([Match]))
        )
    VAR __Max = MAXX(__Table,[Value])
    VAR __Match = MAXX(FILTER(__Table,[Value]=__Max),[Search])
    VAR __Proposed =
        IF(
            LEN(__Match)<=__CleanMatchThreshold,
            SWITCH(TRUE(),
                //__Clean2 = "ABB",__Clean2,
                COUNTROWS(FILTER(__Table,[Value]=__Max))>1,BLANK(),
                LEN(__Match) <= __KillThreshold,BLANK(),
                LEN(__Match) = LEN(__MatchWord),__Match,
                LEN(__Match)/LEN(__MatchWord)>__FuzzyThreshold1 && SEARCH(__Match,__MatchWord,,0)=1,__Match,
                LEN(__Match)/LEN(__MatchWord)>__FuzzyThreshold2,__Match,
                BLANK()
            ),
            SWITCH(TRUE(),
                __Match = "Blue Cross" || __Match = "Blue Cross ",__Match,
                LEN(__Match)/LEN(__MatchWord)<__FuzzyThreshold2 && SEARCH(__Match,__MatchWord,,0)<>1,BLANK(),
                __Match
            )
        )
        VAR __Clean1 = IF(RIGHT(__Proposed,1)="(",LEFT(__Proposed,LEN(__Proposed)-1),__Proposed)
        VAR __Clean2 = IF(RIGHT(__Clean1,1)=" ",LEFT(__Clean1,LEN(__Clean1)-1),__Clean1)
RETURN
    __Clean2

 

 


---------------------------------------

@ me in replies or I'll lose your thread!!!

I have a NEW book! 
DAX Cookbook from Packt
Over 120 DAX Recipes!




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!




@Greg_DecklerYou are right. My logic has some flaws in case where the Project name does not appear in the client list

@Greg_DecklerThank you Greg. You are truly a genius. I will go ahead and use this for my presentation today. This is really helpful. Thank you for all your time.
However I do not understand just one thing. Why was the exception given to Blue Cross?

Even National Grid has two matches with client names similar to Blue cross. While this works for National Grid, why is it not working for Blue Cross?

(See -Ernst and YoungNATIONAL Grid     and Ernst and YoungBLUE Cross   under client names for which blue cross exception was added). While National grip is getting mapped to YoungNATIONAL Grid, Blue Cross should get mapped to YoungBLUE Cross since the client names are of similar nature or structure. Any thoughts on why this is happening? Then may be we can figure out to eliminate giving this exception as well and have a perfectly working formula without any exceptions 🙂
Let me know you thoughts. I will go ahead and accept this as a solution post your confirmation on this piece.

@Greg_Deckler  Presented today with the exception but obviously I did not mentioned about the exception in the code. However I was thinking, instead of retunring only matched characters and then clean it, can we have the formula return the entire cell value from the Project list?

For example, instead of 'Casey' which has been matched, since 'Casey' is matched, the return value will be the entire cell value of 'Casey' i.e; Casey's General Store (CARES 2019) --> From the project list.

 

This should be similar for other matches.

 

Now I am aware of multiple names in project list (For example: Barclays has 3 variations in the project list). In such cases let the retun value be the first occurance od this match term. i.e; Barclays(VDA).
Sorry for trying to improvise on the initial requirement but could you suggest me a formula for the above?

We can mark the response as a solution since I am satisfied with the results. (Except for my initial concern to avoid giving exception to 'Blue Cross'). You could also clarify why Blue Cross is behaving differently than 'National Grid'. 

 

Thanl you so much for all the time @Greg_Deckler  I really appreciate. You are the saviour 🙂 

@deepakramamurth - So, yes, we can return the entire matching project, that we can do. I will see what I can do with that. The problem with Blue Cross is that the client names are so long that it ends up not meeting the threshold requirements for a match. I may be able to adjust those. So, for example 

 

Blue Cross and Blue Shield of Arizona I

 

Matches 11 out of like 40 characters or .275% This makes it fall below thresholds set to filter out other stuff. The Ernst & Young entry for Blue Cross is even worse.


---------------------------------------

@ me in replies or I'll lose your thread!!!

I have a NEW book! 
DAX Cookbook from Packt
Over 120 DAX Recipes!




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!




@Greg_Deckler  For some wierd reason, I added data to my datadump today and after refresh I dont see Blue Cross and Blue Shield of Arizona I as well as The Ernst & Young entry for Blue Cross . I only see BLUE CROSS OF IDAHO CARE PLUS, INC in the data dump (excel). However when I filter in the data view of the BI desktop under list of projects, there are two entries for Blue cross. Idaho care plus and the Ernst and Young Entry. I am not sure what could be a reason for this. I dont see Blue Shield entry in my entire data set nor in the data view. Have you come across such odd things?

On the return value, we could either return the Project Name for the match or the Client Name for the Match. (either works - Better to know when to change what for convenience).

I am still trying to understand some part of the code. May be I can play around with the threshold values and get a match for Blue cross also for all kind of entries currently present. That is if you dont come up with one earlier than me 🙂

Please share the latest improvements you could make on the existing one to have better results. Hopefully the next post from you could be our very much needed near to perfect solution or even better - A perfect Solution 😉

@Greg_DecklerI have attached the current list of client names for your reference. You could use this list in your working file if it is convenient.

 

Drive Link 

@deepakramamurth - I will take a look.


---------------------------------------

@ me in replies or I'll lose your thread!!!

I have a NEW book! 
DAX Cookbook from Packt
Over 120 DAX Recipes!




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!




Helpful resources

Announcements
November Update

Check it Out!

Click here to read more about the November 2020 Updates!

Community Conference

Power Platform Community Conference

Check out the on demand sessions that are available now!

secondImage

Power Platform October Community Highlights

Check out the top community contributors across all of the communities

secondImage

Create an end-to-end data and analytics solution

Learn how Power BI works with the latest Azure data and analytics innovations at the digital event with Microsoft CEO Satya Nadella.

Top Solution Authors
Top Kudoed Authors