Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
Anonymous
Not applicable

How to extract continuous string set from a column with multiple criteria

Hi,

I have a Column A with client names.
Table 1                                                                      Table 2
Client Name                                                              Ongoing Projects
Microsoft Corporation Ltd                                        Microsoft
BARCALAYS (LLP)                                                      Barcalays
Caseys General                                                          Casey's
LLP Ernst and Young                                                 Ernst and Young
ASTRAZENCALimited                                                Astrazanca

I want to search and extract the client names from Table 1 as below:
Result Column
Microsoft
Barcalays
Casey
Ernst and Young
Astrazenca

Basically the logic I am looking for is as below. The function/formula should check for continuous sequence of characters from Table B in Table A and return that entire sequence. Also this search should be case insensitive (For ex: BARCALAYS and Barcalays) and also the search text can be in between the within text column anywhere and not necessarily at the beginning (Ex: LLP Ernst and Youngand Ernst and Young).
One important criteria is it should return matches for continuous string greater than 2 (i.e; 3 and above).

I tried various formulae like Search, Find, Minx, Left but I am not getting the desired result with the above set of criteria met. Please help me with the logical formula for this.

Happy to give any additional info pertaining to this.

1 ACCEPTED SOLUTION

@Anonymous - OK, I updated this with a Client Name output version and Project output version in 2 additional columns. I checked the matches, don't see anything obviously amiss. Solved?

 

PBIX is attached. 

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

View solution in original post

53 REPLIES 53
Greg_Deckler
Super User
Super User

@Anonymous - All kinds of confused on this one. You reference Table 1, Table 2, Table A, Table B. Is it supposed to do fuzzy matching with Casey's matching Caseys? Where are you creating this new column? Is it a new column? Is it a table? For case insensitive searches, use SEARCH. You can probably get there using Text to Table but I would need some clarification from you to be more specific than that.

https://community.powerbi.com/t5/Quick-Measures-Gallery/Text-to-Table/m-p/1312929#M594

Might be a solution using just SEARCH as well where you get all of the rows from Table 2 (Table B??) and you ADDCOLUMNS with a SEARCH like this:

 

Column in Table 1 (Table A?) =
  VAR __ClientName = [Client Name]
  VAR __Table =
    ADDCOLUMNS(
      ADDCOLUMNS(
        'Table 2/B?',
        "Search",SEARCH([Ongoing Projects],__ClientName,,0)
      ),
      "Result",IF([Search]>0,MID(__ClientName,[Search],LEN([Ongoing Projects])),BLANK())
    )
RETURN
  MAXX(__Table,[Result])

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_Deckler  Thanks for your reply. Lets say table 1 and 2 are Table A and B respectively. I do not need to do a fuzzy search. Casey's shourd return just Casey since the first 5 characters are matching with the search text. 

I need to create a new column (Result) and not a table which retrievs the searched value. Lets say this new column is to be created in Table A beside the 'Client Name' column. So it picks up the search terms from Table B, searches for a continuous string match with the 'Client Name' column and fills in the 'Result' column.

Not a fuzzy search but a continuous string search (Exactly just like how a filter search on a column in excel works).

@Anonymous Check out my second reply, I proved the technique will work, there is an attached PBIX file.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_Deckler  Thank you so much for the solution. It almost fizes my issue. But I just need a small extension to this. I apologise for not mentioning it earlier.

 

Table A                                                        Table B

BARCALAYS Corporation Ltd                       Barcalays (LLP)

CaSey's                                                         CaseyLLP

 

I need the result to be retrieved in the result columns as "Barcalays" -> case insensitive and just like how the search field works in an excel filter. For Caseys the result should be "Casey".

 

@Anonymous - I believe, see updated PBIX attached.

 

Column = 
  VAR __ClientName = [Client Name]
  VAR __Table =
    ADDCOLUMNS(
      ADDCOLUMNS(
        'Table (24a)',
        "Search",SEARCH([Ongoing Projects],__ClientName,,0)
      ),
      "Result",IF([Search]>0,[Ongoing Projects],BLANK())
    )
RETURN
  MAXX(__Table,[Result])

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_Deckler  Hi Greg, I dont see any changes in the latest PBIX. It is same as the earlier one. The parent tables are also the same as the earlier one with no changes for Barcalays (LLP) and CaSey. Probably you missed on attaching the updated PBIX.

@Anonymous - Not sure, I thought I attached the one attached here. Here is the screen shot:

Greg_Deckler_0-1599419133947.png

I just saved and attached again.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

 Ge

@Greg_DecklerI dont see any changes in this screenshot as well (even checked the file).

Here is the exact problem statement.

Replace Table 24(a) Barclays with Barclays (ABC)

Replace Table 24[Client Name] Casey's General with CaSey General

Replace Table24[Client Name] ASTRAZENCALimited with 2020 ASTRAZENCALimited AMT

 

Now I want the result to show as Barclays, Casey, Astrazenca

The result should be the complete set of matching characters including space and special characters if any and not the value from the Table 24(a). (For ex: Casey)


I am not really sure if we need to implement fuzzy search. I have not explored much of it.

@Anonymous - OK, I guess I was confused because you have the PBIX and can make those changes on your own?

Greg_Deckler_0-1599420562711.png

The only change that I could not make was to have Barclays (ABC) match BARCLAYS (LLP) because those two things do not match. It would have to match on Barclays<=>BARCLAYS which it does and returns Barclays. There is no fuzzy matching with FIND. It is only case-insensitive.

 

Not the formula. It is returning in the column whatever search word (Project) matches the current row. It is not doing anything with regard to manipulating the text in Client Name, it simply returns the matching Ongoing Project value. So, you can put whatever you want in Table24 as far as extra characters, etc. it doesn't matter one tiny bit, as long as a match is found in Table (24a) then that match is returned as the value. I made no changes to the code, just the changes to the data tables, which you have the PBIX, you can do that.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_DecklerYou see the problem? I have the project name in 24(a) as Casey's (note the apostrophe) which is same as Barclays (ABC).
If I replace 24(a) with Casey's then the result column will not return Casey because the apostrophe is missing in CaSeY General (Table 24). Same goes with Barclays since (ABC) is missing from Table 24.

My data is such that the project list is not the exact match with the Client Name. Some projects are of the above nature. Hence I need to match the continuous set of strings and then return that set as the result.

So result for Casey's and Barclays (ABC) would be Casey and Barclay respectively considering these set of continuous strings are present in Table 24.

 

I hope I was able to explain my problem. I am not sure if there is a code but I wonder what logic does the filter search (ctrl+shift+L) follow to be able to achieve this.

@Anonymous - What you are asking for is not likely very possible or you are not explaining it very well. What is the "threshold" for a match? I asked originally if you wanted fuzzy matching and you said "no". Well, what you are asking for is called fuzzy matching. Fuzzy matching is when you match based upon a threshold value. If you want fuzzy matching, you are going to need to use Power Query's built in fuzzy matching in Merge query. I could probably come up with some convoluted DAX way of doing it. It would probabl involve Text to Table (https://community.powerbi.com/t5/Quick-Measures-Gallery/Text-to-Table/m-p/1312929#M594). I feel like you intentionally wasted my time on this and on a holiday weekend no less because I asked if you wanted fuzzy matching and you said no and then what you want turns out to be fuzzy matching. But I would have NEVER gone down this path had you been clear that you want fuzzy matching. It honestly makes no sense to write a fuzzy matching DAX solution and believe me I am the king of writing ridiculously impractical DAX but that is probably even a bridge too far for me.

 

In conclusion. Use what you want is called fuzzy matching. Use the built-in fuzzy matching technology in Power Query Merge (Table.Join). The end.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_DecklerThank you for your time Greg. I really appreciate it. My apologies if you felt we wasted our time. Let me try the fuzzy matching feature and see if that is of any help in solving my requirement. 

@Anonymous - You understand the problems with trying to reinvent this wheel in DAX, right? You can't just match up until a certain number of letters from the start because you have things like LLP Ernst and Young and Ernst and Young. You can't match a certain number of characters from the right because of things like BARCLAYS (LLP) versus Barclays (ABC). You can't use CONTAINSTRING or FIND or SEARCH because we are talking about partial (fuzzy) matching. You would literally have to tear apart each string into it's own table using Text to Table (word version) where every character gets a row in a table. No problem for the Client Name but then you are trying to compare it against every single other word in the other table. Except DAX has no looping. So, you could brute force it if you have a set amount of words in your second table, but something tells me the real world implementation of this is not going to have 5 rows in the second table but more like a few hundred or thousand. So, you would have to do text to table on a hard coded table several hundred or thousand times and it would be hard coded. You *might* attempt to do something tricky by using an Index and trying to use GENERATE against the Index and somehow transforming each of search words into a table of one column and somehow tie it all together with a UNION but I haven't really ever figured that how to do that dynamically without hard coding everything. Then, after you have thus far figured out how to do 2 or 3 already impossible things, then and only then you are ready to start trying to match sequences of letters. Here you would have the enviable task of iterating through your text to table of all your search words and checking each row individually whether it matched the a character in the text to table of your Client Name. OK, get rid of the characters (rows) that didn't match. Now, iterate over this table again looking for consequetive rows. Iterate again to find more than some threshold of consequetive rows. Somehow magically extract those rows from the table. Return them.

 

Other than somehow overcoming probably 2 or 3 impossible things, and I don't use the word impossible lightly when it comes to DAX, sure, the method above will work perfectly. Oh, except it would grind the CPU of your computer into an absolute dust pile of silicon for more than maybe a dozen words and search words. Other than that, first rate solution. The above thought process is what went through my head when I first saw this problem which was the reason for the question about fuzzy matching. Because I already knew that fuzzy matching was not in the cards. Definition of fuzzy matching: Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments text.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_DecklerThanks for your explanation. The Fuzzy match aint working as expected. For example: Client_name is "BARCLAYS CAPITAL". This client name is getting matched with Project_name - "TriState Capital" instead of Barclays (ABC) with 0.24 threshold (the least I could set for a maximum output).

With threshold of 1 , BARCLAYS CAPITAL has no matches and returns NULL. 

Due to the above, I really doubt if the results with Fuzzy matching are accurate for my datasets. I do not want a wrong project to be mapped to a client name.

 

Let me figure out something, if nothing works out then I should probably write to the management to tweak the data source to match with the Client Name without the (ABC) or apostrophe  🙂

 

Thanks again for your time.

@Anonymous - OK, so after all my complaining about how impossible this was in DAX, I couldn't let it go because I am just a glutton for punishment when it comes to these things. So, here is the first ever (that I am aware of) fuzzy text matching formula in DAX. It's fairly rubbish but it does prove that the concept is actually possible. I have some thoughts for improving its accuracy but my brain is kind of fried after this. Updated PBIX is attached. As you can see, it is not 100% accurate and does some weird things but that's just sort of what DAX does (long explanation over the whole casing of letters).

Column 3 = 
    VAR __ClientTable = ADDCOLUMNS(GENERATESERIES(1,LEN([Client Name]),1),"Letter",MID([Client Name],[Value],1))
    VAR __WordSearchTable = 
        GENERATE(
            DISTINCT('Table (24b)'[Index]),
                VAR __Word = MAXX(FILTER('Table (24b)',[Index]=EARLIER([Index])),[Ongoing Projects])
            RETURN ADDCOLUMNS(GENERATESERIES(1,LEN(__Word),1),"Letter",MID(__Word,[Value],1))
        )
    VAR __FilterTable = 
        INTERSECT(
            SELECTCOLUMNS(__WordSearchTable,"Letter",[Letter]),
            SELECTCOLUMNS(__ClientTable,"Letter",[Letter])
        )
    VAR __Table1 = FILTER(__WordSearchTable,[Letter] IN __FilterTable)
    VAR __Table2 = 
        ADDCOLUMNS(
            FILTER(__WordSearchTable,[Letter] IN __FilterTable),
            "Consq",[Value] - MAXX(FILTER(__Table1,[Index]=EARLIER([Index])&&[Value]<EARLIER([Value])),[Value])
        )
    VAR __Table3 = FILTER(__Table2,[Consq]=1)
    VAR __Table = GROUPBY(__Table3,[Index],"Rows",COUNTX(CURRENTGROUP(),[Letter]))
    VAR __Max = MAXX(__Table,[Rows])
    VAR __WordIndex = MAXX(FILTER(__Table,[Rows]=__Max),[Index])
RETURN
    CONCATENATEX(FILTER(__WordSearchTable,[Index]=__WordIndex),[Letter],,[Value])

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_DecklerThanks for the file. I went throught the implementation and I somehow feel the logic is taking a different approach. Let me jot down the logic that I feel is a right approach. I am not sure if this is implementable with the current formula and functions in PBI.

 

Logic:

1. Pick value of first cell in Project table(Value to be found reference column). Start with searching for the first 3 characters of this cell in the Client Name (Find Within column). Return all the cells having these 3 characters in the Client Name appearing continuously and save it in a variable. Now add the 4th character and search for the continuous appearances in Client Name (Return values will decrease because of adding the 4th continuous character). Repeat adding characters until no matches are found.

Final value to be returned is the last variable value with the maximum continuous matches.

2. Now repeat the entire thing for value in the second cell of the Project Table.

 

This logic seems flawless and satisfies all criteria required. It also returns "Barcalays" or "Barcalays ".

Also doesnt match Tristate capital and Barcalays capital together because the search for character is always beginning from the first charater of the string which is Barcalays and not capital.

 

And yes the entire thing should be case insensitive.

 

Hope I was able to explain in an understandable way. The logic SEEMS easily implementable but the actual work is finding the right set of combinations of DAX to do so. 

@Anonymous The logic seems implementable because PHP and other programming languages have these things called loops. DAX does not have loops so breezing away the "oh just repeat this process" because extremely problematic. So, I will have to sit down with this and figure out once again how to emulate/simulate loops in DAX. It's not that I haven't done it but it takes a fair amount of effort and focus to get it right.

 

Also, the logic you describe would return "Barcalays ("


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

@Anonymous - OK, in order to implement this I kind of had to twist your logic in order to make it compatible with DAX but I *believe* it accomplishes the result you want. And, as I mentioned, it returns "Barclays (" for that one entry. Not entirely sure how to fix that unless you have an idea. I can't just wipe out spaces and there is no "trim" function in DAX to trim out special characters. I could potentially create a table of characters to remove at the end of words and SUBSTITUTE them out somehow if they are at  the end of a word but not sure of the ramifications of that because you might have a client whose name you want to have a special character at the end of their name. Could also potentially wipe out an ending space " " but, again, not sure if perhaps a situation might arise where you would want to wipe out two trailing spaces.

Anyway, I never ask for Kudos but I expect Kudos, it's Labor Day after all and I'm here coding fuzzy text matching logic in DAX afterall.

Updated PBIX is attached, Table (24). Here is the DAX:

Column 4 = 
    VAR __WordSearchTable = 
        GENERATE(
            'Table (24b)',
                VAR __Word = MAXX(FILTER('Table (24b)',[Index]=EARLIER([Index])),[Ongoing Projects])
            RETURN ADDCOLUMNS(GENERATESERIES(1,LEN(__Word),1),"Search",LEFT(__Word,[Value]))
        )
    VAR __Table =
        FILTER(
            ADDCOLUMNS(
                __WordSearchTable,
                "Match",SEARCH([Search],[Client Name],,BLANK())
            ),
            NOT(ISBLANK([Match]))
        )
    VAR __Max = MAXX(__Table,[Value])
    VAR __Match = MAXX(FILTER(__Table,[Value]=__Max),[Search])
RETURN
    __Match

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_DecklerThanks for this. The logic seems to be perfectly working but however the search is returning all values from Client Name containing the first matching set of character series from the Projects list. (I have attached the files in the drive link provided below and also attached the entire list of client names and the Ongoing Project Names for your reference). For now I can have a work around soultion by extending the search to a new column and returning the search results from the project column based on your formulae return values. Let me know if it is possible to eliminate this workaround and get the solution by tweaing your formula. We can mark it as a solved solution if we can get around with this missing piece 🙂 

 

Drive Link 

@Anonymous - Sorry, what? I'm not understand the issue, can you explain again with examples?


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.