Solved: Re: Create new column based on text from a messy t...

Zelbinian · ‎11-05-2018

So, I have a column with data that looks like this:

#Research; #ourResearch-Cat1-Subcat1; #Benchmark; 2018; Benchmark

#Research; #OurResearch-Cat1-Subcat2; #Benchmark; Benchmark

#OURRESEARCH-NewCat; #Research

#Research; #ourresearch-Category3-Subcat 3;

#Research; #ourResearch-NewCat2

It's a text blob from a database and it's a little messy, as user-entered data often is. I've definitely not captured every variation, but you get the gist. Here's the result I want:

Category SubCategory

Cat1	Subcat1
Cat1	Subcat2
NewCat
Category3	Subcat 3
NewCat2

I've been Googling around for solutions for hours and the issue is everything I've seen really depends on knowing the exact positioning of the text you're trying to grab. I do know, to an extent. Like, if I could write a regex this problem would have been solved in about 5 minutes. But I've yet to figure out a combination of replacer and splitter functions that can do the same job.

Edit: I've added some complexity to my example. The things in the sample data are just stand-ins for the real thing, categories and subcategories can be any alphanumeric string of any length.

Phil_Seamark · ‎11-06-2018

Hi @Zelbinian

Please try these two calculated columns as an alternative. These are in DAX and again, I have attached a PBIX file that you can download to see/tweak etc.

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        LEFT(RestOfText2,StartingPointDash-1),
        RestOfText2
        )

and

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        MID(RestOfText2,StartingPointDash+1,99)
        )

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

View solution in original post

v-jiascu-msft · ‎11-06-2018

Hi @Zelbinian,

The @Phil_Seamark's solution is wonderful. Here is a solution of M based on your sample data. The file is attached.

Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

Phil_Seamark · ‎11-05-2018

Hi @Zelbinian

Here is one approach. Not sure if it works with your wider dataset, but hopefully you can see how to tweak it. I have attached a PBIX file

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointCat = FIND("-Cat",'Table1'[Column1],1,0)+1
VAR RestOfText =  MID('Table1'[Column1],StartingPointCat,LengthOfText-StartingPointCat+1)
VAR FinishPointCAT = {
                        {FIND("-",RestOfText,1,LengthOfText)},
                        {FIND(";",RestOfText,1,LengthOfText)}
                        }
RETURN 
    MID(
        'Table1'[Column1],
        StartingPointCat,MINX(FinishPointCAT,[Value])-1
        )

and

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointCat = FIND("-Subcat",'Table1'[Column1],1,0)+1
VAR RestOfText =  MID('Table1'[Column1],StartingPointCat,LengthOfText-StartingPointCat+1)
VAR FinishPointCAT = {
                        {FIND("-",RestOfText,1,LengthOfText)},
                        {FIND(";",RestOfText,1,LengthOfText)}
                        }
RETURN 
    IF(StartingPointCat>1,
        MID(
        'Table1'[Column1],
        StartingPointCat,MINX(FinishPointCAT,[Value])-1
        ))

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

Zelbinian · ‎11-06-2018

The problem with this solution - and other solutions I've seen for similar problems - is that it assumes I can know what the data is going to be in advance. I use "Cat" and "Subcat" as standins for Category and Subcategory data but I can't actually know what those will be in advance. They can be of arbitrary length and composition. I can know that they will follow "#ourresearch-" (in various forms of capitalization), but that's it.

(Also, it would have been helpful for you to mention that this was DAX code. I assumed it was Power Query for about an hour lol)

Phil_Seamark · ‎11-06-2018

HI @Zelbinian

My solution is based on the small dataset you provided, hence the approach. If you have some definite "rules" then these can probably be used to help tweak the approach.

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

Zelbinian · ‎11-06-2018

I updated the question, if that helps.

Phil_Seamark · ‎11-06-2018

Hi @Zelbinian

Please try these two calculated columns as an alternative. These are in DAX and again, I have attached a PBIX file that you can download to see/tweak etc.

Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        LEFT(RestOfText2,StartingPointDash-1),
        RestOfText2
        )

and

Sub Category = 
VAR LengthOfText = LEN('Table1'[Column1])
VAR StartingPointOR = SEARCH("#OurResearch",'Table1'[Column1],1,0)+1
VAR RestOfText1 =  MID('Table1'[Column1],StartingPointOR+12,LengthOfText-StartingPointOR+1) & ";"
VAR StartingPointSC = SEARCH(";",RestOfText1,1,0)-1
VAR RestOfText2 = LEFT(RestOfText1,StartingPointSC)
VAR StartingPointDash = SEARCH("-",RestOfText2,1,0)

RETURN 
    IF(
        StartingPointDash>0,
        MID(RestOfText2,StartingPointDash+1,99)
        )

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

Zelbinian · ‎11-06-2018

I appreciate you giving me the fish, but, unfortunately, I do not understand your code (I've never written in DAX) nor do I have an explanation about the thinking that produced it. Can you start by explaining why you chose a calculated column / DAX vs doing Power Query / M in the query step and creating a new real column?

v-jiascu-msft · ‎11-06-2018

Hi @Zelbinian,

The @Phil_Seamark's solution is wonderful. Here is a solution of M based on your sample data. The file is attached.

Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Zelbinian · ‎11-07-2018

In case anybody wants to see the full text of @v-jiascu-msft's code without having to download the file, here it is:

 #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each Text.BeforeDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-")),
 #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each Text.AfterDelimiter(Text.BetweenDelimiters([Column1], "-", ";"), "-"))

The 'each' in each statement seems like a particularly important detail there.

Also, that last example does not actually have a semicolon for the Text.BetweenDelimiters to key off of. Yet it seems to work. The the documentation of the function does not describe how the function behaves if the second delimeter doesn't exist, how did you figure out that solution?

v-jiascu-msft · ‎11-07-2018

Hi @Zelbinian,

I'm glad it can work. Firstly, we need to find the pattern that the computer can recognize. Secondly, find a function here that can do the job.

The functions will iterate every character of the parameter till the second delimiter or the end. So it won't be a problem if the second delimiter doesn't exist.

Best Regards,
Dale

Community Support Team _ Dale
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Phil_Seamark · ‎11-06-2018

Hi @Zelbinian

For something like this, you can use either M or DAX. I chose DAX because I'm more familiar with it, but it probably doesn't matter and there won't be a big performance difference. Personally, I find DAX code easier to read than M, but I love anything you can do in M just from the UI.

If I was to do this in M, the principals would be the same. Look for the guaranteed patters (#ourresearch) etc and then start breaking up the text using text functions.

To learn more about DAX visit : aka.ms/practicalDAX

Proud to be a Datanaut!

Create new column based on text from a messy text blob

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly