Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
Zelbinian
Frequent Visitor

Create new column based on text from a messy text blob

So, I have a column with data that looks like this:

#Research; #ourResearch-Cat1-Subcat1; #Benchmark; 2018; Benchmark
#Research; #OurResearch-Cat1-Subcat2; #Benchmark; Benchmark
#OURRESEARCH-NewCat; #Research
#Research; #ourresearch-Category3-Subcat 3;
#Research; #ourResearch-NewCat2

 

It's a text blob from a database and it's a little messy, as user-entered data often is. I've definitely not captured every variation, but you get the gist. Here's the result I want:

Category SubCategory

Cat1Subcat1
Cat1Subcat2
NewCat 
Category3Subcat 3
NewCat2 

 

I've been Googling around for solutions for hours and the issue is everything I've seen really depends on knowing the exact positioning of the text you're trying to grab. I do know, to an extent. Like, if I could write a regex this problem would have been solved in about 5 minutes. But I've yet to figure out a combination of replacer and splitter functions that can do the same job.

Edit: I've added some complexity to my example. The things in the sample data are just stand-ins for the real thing, categories and subcategories can be any alphanumeric string of any length.

2 ACCEPTED SOLUTIONS

Hi @Zelbinian

 

Please try these two calculated columns as an alternative.  These are in DAX and again, I have attached a PBIX file that you can download to see/tweak etc.

 

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        LEFT(RestOfText2,StartingPointDash-1),
        RestOfText2
        )

and

 

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        MID(RestOfText2,StartingPointDash+1,99)
        )

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

View solution in original post

Hi @Zelbinian,

 

The @Phil_Seamark's solution is wonderful. Here is a solution of M based on your sample data. The file is attached. 

Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")
Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Create-new-column-based-on-text-from-a-messy-text-blob

 

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

10 REPLIES 10
Phil_Seamark
Employee
Employee

Hi @Zelbinian

 

Here is one approach.  Not sure if it works with your wider dataset, but hopefully you can see how to tweak it.  I have attached a PBIX file

 

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointCat = FIND("-Cat",'Table1'[Column1],1,0)+1
VAR RestOfText =  MID('Table1'[Column1],StartingPointCat,LengthOfText-StartingPointCat+1)
VAR FinishPointCAT = {
                        {FIND("-",RestOfText,1,LengthOfText)},
                        {FIND(";",RestOfText,1,LengthOfText)}
                        }
RETURN 
    MID(
        'Table1'[Column1],
        StartingPointCat,MINX(FinishPointCAT,[Value])-1
        )

and 

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointCat = FIND("-Subcat",'Table1'[Column1],1,0)+1
VAR RestOfText =  MID('Table1'[Column1],StartingPointCat,LengthOfText-StartingPointCat+1)
VAR FinishPointCAT = {
                        {FIND("-",RestOfText,1,LengthOfText)},
                        {FIND(";",RestOfText,1,LengthOfText)}
                        }
RETURN 
    IF(StartingPointCat>1,
        MID(
        'Table1'[Column1],
        StartingPointCat,MINX(FinishPointCAT,[Value])-1
        ))

image.png


To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

The problem with this solution - and other solutions I've seen for similar problems - is that it assumes I can know what the data is going to be in advance. I use "Cat" and "Subcat" as standins for Category and Subcategory data but I can't actually know what those will be in advance. They can be of arbitrary length and composition. I can know that they will follow "#ourresearch-" (in various forms of capitalization), but that's it.

(Also, it would have been helpful for you to mention that this was DAX code. I assumed it was Power Query for about an hour lol)

HI @Zelbinian

 

My solution is based on the small dataset you provided, hence the approach.  If you have some definite "rules" then these can probably be used to help tweak the approach.  


To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

I updated the question, if that helps.

Hi @Zelbinian

 

Please try these two calculated columns as an alternative.  These are in DAX and again, I have attached a PBIX file that you can download to see/tweak etc.

 

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        LEFT(RestOfText2,StartingPointDash-1),
        RestOfText2
        )

and

 

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        MID(RestOfText2,StartingPointDash+1,99)
        )

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

I appreciate you giving me the fish, but, unfortunately, I do not understand your code (I've never written in DAX) nor do I have an explanation about the thinking that produced it. Can you start by explaining why you chose a calculated column / DAX vs doing Power Query / M in the query step and creating a new real column?

Hi @Zelbinian,

 

The @Phil_Seamark's solution is wonderful. Here is a solution of M based on your sample data. The file is attached. 

Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")
Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Create-new-column-based-on-text-from-a-messy-text-blob

 

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

In case anybody wants to see the full text of @v-jiascu-msft's code without having to download the file, here it is:

 #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")),
 #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-"))


The 'each' in each statement seems like a particularly important detail there.

Also, that last example does not actually have a semicolon for the Text.BetweenDelimiters to key off of. Yet it seems to work. The the documentation of the function does not describe how the function behaves if the second delimeter doesn't exist, how did you figure out that solution?

Hi @Zelbinian,

 

I'm glad it can work. Firstly, we need to find the pattern that the computer can recognize. Secondly, find a function here that can do the job. 

The functions will iterate every character of the parameter till the second delimiter or the end. So it won't be a problem if the second delimiter doesn't exist.

 

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Hi @Zelbinian

 

For something like this, you can use either M or DAX.  I chose DAX because I'm more familiar with it, but it probably doesn't matter and there won't be a big performance difference.  Personally, I find DAX code easier to read than M, but I love anything you can do in M just from the UI.

 

If I was to do this in M, the principals would be the same.  Look for the guaranteed patters (#ourresearch) etc and then start breaking up the text using text functions.


To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.