Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
syed_sn
Regular Visitor

Looking for solution to extract text of a specific format to a new column

I've been working on a text extraction exercise from some manually entered data to automate some reports.

 

Requirement is to extract a location identifier from the string text into a new column. Conditions are listed below:

1. Text is always going to be 6 characters

2. First two characters are going to be Alphabets in Capitals

3. Last 4 characters are always going to be integers

4. Required output can lie anywhere in the string, start, end, middle, between brackets etc.

5. If the required output is not available, text value null/NA/0 to be mentioned to avoid error

 

Would be really grateful as I've tried delimeters but they didnt work. Also, I tried, Text.PositionOfAny but that doesn't accept a list of prefixes for the location identifier in text format. 

 

Looking foward.

 

StringRequired Output
Power restart  on NE0624NE0624
Power issue on location (WU2860)SU2360
Access problem between KA4429-45KR4529
Power problem in node (PR2661)QA5261
SU0139 down due to power issueSU2360
Power issue at 7 sitesNull
2 Sites down due to bad weather.Null
2 ACCEPTED SOLUTIONS
edhans
Super User
Super User

Your best bet I think will be to work with a list. For example, try this:

  • Remove the opening and closing parens first as a delimiter you don't care about.
  • Add a new column called and use this function: Text.Split([String], " ")

It will return this:

edhans_0-1661861574387.png

Now, add another new column called List 2 or something, and use this formula:

List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))

Now the KA4429-45 will just be KA4429

 

Now add another column with this formula:

List.Select([List 2], each Text.Length(_) = 6)

 

This will now only keep things 6 chars long. But words like Powers would be kept as they are 6 chars long too.

The last formula is the longest.

try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null

That will only keep things that start with upper case A-Z and end with 0-9. If there isn't one, then it returns null. If there are two, this will only return the first one.

edhans_1-1661862239405.png

It may not be perfect, expecially if you have to get rid of other delimiters that pop into the dataset, but this will get you started. Full code to play with:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("Vc/BCoJAFAXQX3m4KrDQcZxy6aKVEKZIC3Ex6oMEm5GZEX+nb+nLUsnS3eVxuNyX51YsB1SgUBuuzPsFUsD14jBCLdv6hsJeWKN1jxNpZcVNM4bdPSNn5uxHnWbEY86sw6pCraFTsmzxCSWaAVFAFFJKggP1Rx0l1CfBqnvBjQAha4RdnBDG3Kn5FvqEubNNM8f1AqjlIKAetxgJ3X/adsV6MzdwAt0Y1NNffdvOgkA6nTZ1Ja9hQG4eqI4/W3wA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t, #"Required Output" = _t]),
    #"Replaced Value" = Table.ReplaceValue(Source,"(","",Replacer.ReplaceText,{"String"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",")","",Replacer.ReplaceText,{"String"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Split Data", each Text.Split([String], " ")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "List 2", each List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "List 3", each List.Select([List 2], each Text.Length(_) = 6)),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "List 4", each try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null)
in
    #"Added Custom3"

How to use M code provided in a blank query:
1) In Power Query, select New Source, then Blank Query
2) On the Home ribbon, select "Advanced Editor" button
3) Remove everything you see, then paste the M code I've given you in that box.
4) Press Done
5) See this article if you need help using this M code in your model.

 



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

View solution in original post

@syed_sn You can also combine filter conditions into a single step like this:

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Custom" = Table.AddColumn(Source, "Required Output", each
        List.First(
            List.Select(
                Text.SplitAny([String], " -()"),
                each Text.Length(_) = 6
                    /*First two characters are capital letters*/
                    and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                    /*Last four characters are digits*/
                    and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
            )
        ),
    type text)
in
    #"Added Custom"

 

View solution in original post

8 REPLIES 8
wdx223_Daniel
Super User
Super User

= let fx=(x,y)=>List.RemoveItems(Splitter.SplitTextByAnyDelimiter(x)(y),{""}) in Table.AddColumn(PreviousStepName,"Custom",each List.Select(fx(fx({"A".."Z","0".."9"},[String]),[String]),each Text.Remove(Text.Start(_,2),{"A".."Z"})="" and Text.Remove(Text.End(_,4),{"0".."9"})=""){0}?)

edhans
Super User
Super User

Your best bet I think will be to work with a list. For example, try this:

  • Remove the opening and closing parens first as a delimiter you don't care about.
  • Add a new column called and use this function: Text.Split([String], " ")

It will return this:

edhans_0-1661861574387.png

Now, add another new column called List 2 or something, and use this formula:

List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))

Now the KA4429-45 will just be KA4429

 

Now add another column with this formula:

List.Select([List 2], each Text.Length(_) = 6)

 

This will now only keep things 6 chars long. But words like Powers would be kept as they are 6 chars long too.

The last formula is the longest.

try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null

That will only keep things that start with upper case A-Z and end with 0-9. If there isn't one, then it returns null. If there are two, this will only return the first one.

edhans_1-1661862239405.png

It may not be perfect, expecially if you have to get rid of other delimiters that pop into the dataset, but this will get you started. Full code to play with:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("Vc/BCoJAFAXQX3m4KrDQcZxy6aKVEKZIC3Ex6oMEm5GZEX+nb+nLUsnS3eVxuNyX51YsB1SgUBuuzPsFUsD14jBCLdv6hsJeWKN1jxNpZcVNM4bdPSNn5uxHnWbEY86sw6pCraFTsmzxCSWaAVFAFFJKggP1Rx0l1CfBqnvBjQAha4RdnBDG3Kn5FvqEubNNM8f1AqjlIKAetxgJ3X/adsV6MzdwAt0Y1NNffdvOgkA6nTZ1Ja9hQG4eqI4/W3wA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t, #"Required Output" = _t]),
    #"Replaced Value" = Table.ReplaceValue(Source,"(","",Replacer.ReplaceText,{"String"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",")","",Replacer.ReplaceText,{"String"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Split Data", each Text.Split([String], " ")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "List 2", each List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "List 3", each List.Select([List 2], each Text.Length(_) = 6)),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "List 4", each try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null)
in
    #"Added Custom3"

How to use M code provided in a blank query:
1) In Power Query, select New Source, then Blank Query
2) On the Home ribbon, select "Advanced Editor" button
3) Remove everything you see, then paste the M code I've given you in that box.
4) Press Done
5) See this article if you need help using this M code in your model.

 



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

Dear Edhans,

 

Thanks for the detailed solution and step by step guide. I ran the code on current data set with over 22K+ records and the results were 97% correct. Most of the errors occured for the listing delimter function at the start as the data set specifications vary. However, the 92% is more than enough for me. I've added few more steps as part of the power query and made it work like a charm on top of your steps. 

Additionally, I think, the Text.PositionOfAny really lacks the possibility for providing a range of substring to make it easy for finding out the position of the prefix and then exclude 6 characters from there. From the overall inventory set, I might be able to get the first two characters and can do validation of the last 4 integrers to be part of the substring.

 

Overall, your query has solved my problem and highlighted what needs to be fixed if more different complicated scenarios occur. I have this data set changing on every days and for now, the logic is fixed by me and covers 99.8% of the data set.

 

Kudos.

Fantastic @syed_sn

 

glad I was able to help provide a usable framework for you. 



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

@syed_sn You can also combine filter conditions into a single step like this:

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Custom" = Table.AddColumn(Source, "Required Output", each
        List.First(
            List.Select(
                Text.SplitAny([String], " -()"),
                each Text.Length(_) = 6
                    /*First two characters are capital letters*/
                    and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                    /*Last four characters are digits*/
                    and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
            )
        ),
    type text)
in
    #"Added Custom"

 

Dear Alexis,

 

Thanks for the provided solution. The point in place is perfectly summed up. I haven't tried this solution but I believe it will give the same result as the one shared by Edhans as the logic is pretty much the same; identifying the first two leading character and then 4 integers. 

 

Kudos.

Yes, I could have combined mine into one function, but decided to walk through the steps and logic, and give the person the ability to see what it was doing was more instructive on the logic of how I arrived at the answer. The other piece is @syed_sn only gave us a very very limited set of data, and other combinations would be much easier to insert into the individual steps. Your single formula will only work with the limited 6 record dataset. Any other anomalies will break. 

@syed_sn if that is your exact and total dataset then what Alexis did will work in one step. If there are other oddities in the data, you can use my step by step approach until you have the full range handled, and then begin the process of combining it into a single formula. Kind of like Excel. Generally I create a massively complex formula one column/step at a time. When I get it right, I start assembling the final single cell formula, knowing full well I will likely have trouble safely editing it in the future. 😂

 



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

@edhans The robustness between the two methods is similar as far as I can tell, though I've clearly assumed stronger conditions on the parts to be extracted (which may or may not be correct for additional records but attempts to use the narrowest simple conditions that fit all of the given records). I'm curious if you have any anomalies in mind that will break mine but not yours (other than the obvious distinction that I'm checking the first 2 and last 4 characters rather than the first 1 and last 1 characters).

 

There is certainly pedagogical and developmental utility in splitting it into more steps. In this case, the steps (1) splitting the text into a list, (2) filtering that list according to specific rules, and then (3) grabbing the first remaining element seem like reasonable logical separations to me. I combined the three since (1) and (3) were particularly short expressions. Keeping these separate looks like this:

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Text2List" = Table.AddColumn(Source, "Text2List", each Text.SplitAny([String], " -()"), type list),
    #"Added FilteredList" = Table.AddColumn(#"Added Text2List", "FilteredList", each
        List.Select(
            [Text2List],
            each Text.Length(_) = 6
                /*First two characters are capital letters*/
                and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                /*Last four characters are digits*/
                and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
        ),
        type list
    ),
    #"Added DesiredOutput" = Table.AddColumn(#"Added FilteredList", "Desired Output", each List.First([FilteredList]), type text)
in
    #"Added DesiredOutput"

 

Personally, in this specific case, it takes me longer to read and understand the logic split into three steps compared to a single step since steps (1) and (3) are sufficiently simple, but I do concede that the split logic version is easier to follow and contains more explanatory detail embedded in my naming choices.

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors