Solved: Looking for solution to extract text of a specific...

syed_sn · ‎08-30-2022

I've been working on a text extraction exercise from some manually entered data to automate some reports.

Requirement is to extract a location identifier from the string text into a new column. Conditions are listed below:

1. Text is always going to be 6 characters

2. First two characters are going to be Alphabets in Capitals

3. Last 4 characters are always going to be integers

4. Required output can lie anywhere in the string, start, end, middle, between brackets etc.

5. If the required output is not available, text value null/NA/0 to be mentioned to avoid error

Would be really grateful as I've tried delimeters but they didnt work. Also, I tried, Text.PositionOfAny but that doesn't accept a list of prefixes for the location identifier in text format.

Looking foward.

String	Required Output
Power restart on NE0624	NE0624
Power issue on location (WU2860)	SU2360
Access problem between KA4429-45	KR4529
Power problem in node (PR2661)	QA5261
SU0139 down due to power issue	SU2360
Power issue at 7 sites	Null
2 Sites down due to bad weather.	Null

edhans · ‎08-30-2022

Your best bet I think will be to work with a list. For example, try this:

Remove the opening and closing parens first as a delimiter you don't care about.
Add a new column called and use this function: Text.Split([String], " ")

It will return this:

Now, add another new column called List 2 or something, and use this formula:

List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))

Now the KA4429-45 will just be KA4429

Now add another column with this formula:

List.Select([List 2], each Text.Length(_) = 6)

This will now only keep things 6 chars long. But words like Powers would be kept as they are 6 chars long too.

The last formula is the longest.

try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null

That will only keep things that start with upper case A-Z and end with 0-9. If there isn't one, then it returns null. If there are two, this will only return the first one.

It may not be perfect, expecially if you have to get rid of other delimiters that pop into the dataset, but this will get you started. Full code to play with:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("Vc/BCoJAFAXQX3m4KrDQcZxy6aKVEKZIC3Ex6oMEm5GZEX+nb+nLUsnS3eVxuNyX51YsB1SgUBuuzPsFUsD14jBCLdv6hsJeWKN1jxNpZcVNM4bdPSNn5uxHnWbEY86sw6pCraFTsmzxCSWaAVFAFFJKggP1Rx0l1CfBqnvBjQAha4RdnBDG3Kn5FvqEubNNM8f1AqjlIKAetxgJ3X/adsV6MzdwAt0Y1NNffdvOgkA6nTZ1Ja9hQG4eqI4/W3wA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t, #"Required Output" = _t]),
    #"Replaced Value" = Table.ReplaceValue(Source,"(","",Replacer.ReplaceText,{"String"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",")","",Replacer.ReplaceText,{"String"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Split Data", each Text.Split([String], " ")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "List 2", each List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "List 3", each List.Select([List 2], each Text.Length(_) = 6)),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "List 4", each try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null)
in
    #"Added Custom3"

How to use M code provided in a blank query:
1) In Power Query, select New Source, then Blank Query
2) On the Home ribbon, select "Advanced Editor" button
3) Remove everything you see, then paste the M code I've given you in that box.
4) Press Done
5) See this article if you need help using this M code in your model.

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

View solution in original post

AlexisOlson · ‎08-30-2022

@syed_sn You can also combine filter conditions into a single step like this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Custom" = Table.AddColumn(Source, "Required Output", each
        List.First(
            List.Select(
                Text.SplitAny([String], " -()"),
                each Text.Length(_) = 6
                    /*First two characters are capital letters*/
                    and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                    /*Last four characters are digits*/
                    and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
            )
        ),
    type text)
in
    #"Added Custom"

View solution in original post

wdx223_Daniel · ‎08-30-2022

= let fx=(x,y)=>List.RemoveItems(Splitter.SplitTextByAnyDelimiter(x)(y),{""}) in Table.AddColumn(PreviousStepName,"Custom",each List.Select(fx(fx({"A".."Z","0".."9"},[String]),[String]),each Text.Remove(Text.Start(_,2),{"A".."Z"})="" and Text.Remove(Text.End(_,4),{"0".."9"})=""){0}?)

edhans · ‎08-30-2022

Your best bet I think will be to work with a list. For example, try this:

Remove the opening and closing parens first as a delimiter you don't care about.
Add a new column called and use this function: Text.Split([String], " ")

It will return this:

Now, add another new column called List 2 or something, and use this formula:

List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))

Now the KA4429-45 will just be KA4429

Now add another column with this formula:

List.Select([List 2], each Text.Length(_) = 6)

This will now only keep things 6 chars long. But words like Powers would be kept as they are 6 chars long too.

The last formula is the longest.

try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null

That will only keep things that start with upper case A-Z and end with 0-9. If there isn't one, then it returns null. If there are two, this will only return the first one.

It may not be perfect, expecially if you have to get rid of other delimiters that pop into the dataset, but this will get you started. Full code to play with:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("Vc/BCoJAFAXQX3m4KrDQcZxy6aKVEKZIC3Ex6oMEm5GZEX+nb+nLUsnS3eVxuNyX51YsB1SgUBuuzPsFUsD14jBCLdv6hsJeWKN1jxNpZcVNM4bdPSNn5uxHnWbEY86sw6pCraFTsmzxCSWaAVFAFFJKggP1Rx0l1CfBqnvBjQAha4RdnBDG3Kn5FvqEubNNM8f1AqjlIKAetxgJ3X/adsV6MzdwAt0Y1NNffdvOgkA6nTZ1Ja9hQG4eqI4/W3wA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t, #"Required Output" = _t]),
    #"Replaced Value" = Table.ReplaceValue(Source,"(","",Replacer.ReplaceText,{"String"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value",")","",Replacer.ReplaceText,{"String"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value1", "Split Data", each Text.Split([String], " ")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "List 2", each List.Transform([Split Data], each Text.BeforeDelimiter(_, "-"))),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "List 3", each List.Select([List 2], each Text.Length(_) = 6)),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "List 4", each try
List.Select([List 3], each List.Contains({"A".."Z"},Text.Start(_,1)) and List.Contains({"0".."9"},Text.End(_,1))){0}
otherwise null)
in
    #"Added Custom3"

How to use M code provided in a blank query:
1) In Power Query, select New Source, then Blank Query
2) On the Home ribbon, select "Advanced Editor" button
3) Remove everything you see, then paste the M code I've given you in that box.
4) Press Done
5) See this article if you need help using this M code in your model.

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

syed_sn · ‎08-30-2022

Dear Edhans,

Thanks for the detailed solution and step by step guide. I ran the code on current data set with over 22K+ records and the results were 97% correct. Most of the errors occured for the listing delimter function at the start as the data set specifications vary. However, the 92% is more than enough for me. I've added few more steps as part of the power query and made it work like a charm on top of your steps.

Additionally, I think, the Text.PositionOfAny really lacks the possibility for providing a range of substring to make it easy for finding out the position of the prefix and then exclude 6 characters from there. From the overall inventory set, I might be able to get the first two characters and can do validation of the last 4 integrers to be part of the substring.

Overall, your query has solved my problem and highlighted what needs to be fixed if more different complicated scenarios occur. I have this data set changing on every days and for now, the logic is fixed by me and covers 99.8% of the data set.

Kudos.

edhans · ‎08-31-2022

Fantastic @syed_sn

glad I was able to help provide a usable framework for you.

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

AlexisOlson · ‎08-30-2022

@syed_sn You can also combine filter conditions into a single step like this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Custom" = Table.AddColumn(Source, "Required Output", each
        List.First(
            List.Select(
                Text.SplitAny([String], " -()"),
                each Text.Length(_) = 6
                    /*First two characters are capital letters*/
                    and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                    /*Last four characters are digits*/
                    and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
            )
        ),
    type text)
in
    #"Added Custom"

syed_sn · ‎08-31-2022

Dear Alexis,

Thanks for the provided solution. The point in place is perfectly summed up. I haven't tried this solution but I believe it will give the same result as the one shared by Edhans as the logic is pretty much the same; identifying the first two leading character and then 4 integers.

Kudos.

edhans · ‎08-30-2022

Yes, I could have combined mine into one function, but decided to walk through the steps and logic, and give the person the ability to see what it was doing was more instructive on the logic of how I arrived at the answer. The other piece is @syed_sn only gave us a very very limited set of data, and other combinations would be much easier to insert into the individual steps. Your single formula will only work with the limited 6 record dataset. Any other anomalies will break.

@syed_sn if that is your exact and total dataset then what Alexis did will work in one step. If there are other oddities in the data, you can use my step by step approach until you have the full range handled, and then begin the process of combining it into a single formula. Kind of like Excel. Generally I create a massively complex formula one column/step at a time. When I get it right, I start assembling the final single cell formula, knowing full well I will likely have trouble safely editing it in the future. 😂

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

AlexisOlson · ‎08-30-2022

@edhans The robustness between the two methods is similar as far as I can tell, though I've clearly assumed stronger conditions on the parts to be extracted (which may or may not be correct for additional records but attempts to use the narrowest simple conditions that fit all of the given records). I'm curious if you have any anomalies in mind that will break mine but not yours (other than the obvious distinction that I'm checking the first 2 and last 4 characters rather than the first 1 and last 1 characters).

There is certainly pedagogical and developmental utility in splitting it into more steps. In this case, the steps (1) splitting the text into a list, (2) filtering that list according to specific rules, and then (3) grabbing the first remaining element seem like reasonable logical separations to me. I combined the three since (1) and (3) were particularly short expressions. Keeping these separate looks like this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("ZY7NCoJAFIVf5eLKoEKnyXLpolUQkkgLcTE6FxqwGZm5Ma/Ts/RkqSEV7Q5856+qgtx4tGDRkbD0fIDRcDpECeNBvZypcu6OI+lMK0gNIryUbJ9Ei8mUtS06B701TYc3aJA8ooZjxjlLV3z71TR7lAZtJEKYn1mSxO+eooziTQrSeA1yGCQD/Wf/748g2IFThG5CDIpR/8QbIcGjoCvadVDXLw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [String = _t]),
    #"Added Text2List" = Table.AddColumn(Source, "Text2List", each Text.SplitAny([String], " -()"), type list),
    #"Added FilteredList" = Table.AddColumn(#"Added Text2List", "FilteredList", each
        List.Select(
            [Text2List],
            each Text.Length(_) = 6
                /*First two characters are capital letters*/
                and Text.Length(Text.Select(Text.Start(_, 2), {"A".."Z"})) = 2
                /*Last four characters are digits*/
                and Text.Length(Text.Select(Text.End(_, 4), {"0".."9"})) = 4
        ),
        type list
    ),
    #"Added DesiredOutput" = Table.AddColumn(#"Added FilteredList", "Desired Output", each List.First([FilteredList]), type text)
in
    #"Added DesiredOutput"

Personally, in this specific case, it takes me longer to read and understand the logic split into three steps compared to a single step since steps (1) and (3) are sufficiently simple, but I do concede that the split logic version is easier to follow and contains more explanatory detail embedded in my naming choices.

Looking for solution to extract text of a specific format to a new column

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024