Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
monojchakrab
Resolver III
Resolver III

Extract text (sub) strings from a text string using a pre-specified list

Hiya good people of PQ,

 

I have a text column as containing various texts,e.g. xxyyzz, xxxabc, acdfgh.

 

I want to extract the the texts "x", "c" and "h" for example, using a pre-specified list {"x","c","h"}, if there is a match.

 

I am currently using if text.contains code, but its starting to get very complicated with every new data refresh adding more complexity to the text column.

 

If there was a simpler solution with List.accumulate, e..g, then I could just update the list.

 

Thanks and best regds.,

4 ACCEPTED SOLUTIONS

Table.AddColumn(Custom2, "Ingredient", each
            List.Skip(
                      {{{"50%","sugarlo"},"SugarLo"},
                       {{"Stevia","Green"},"Stevia"},
                       {{"classic","cook","zero calorie"},"Sucralose"},
                       {{"original","Low Calories"},"Aspartame"},
                       {{"xylitol"},"Xylitol"}
                      },
                      (x)=>not List.Contains(x{0},[Title],(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?
               )

please  change List.ContainsAny to List.Contains

View solution in original post

1 if use flat list, you can not get the replacement when more than one substrings be assigned a replacement. the list can be re-write as this

{"50%","SugarLo"},
{"sugarlo","SugarLo"},
{"Stevia","Stevia"},
{"Green","Stevia"},
{"classic","Sucralose"},
{"cook","Sucralose"},
{"zero calorie""Sucralose"},
{"original","Aspartame"},
{"Low Calories","Aspartame"},
{"xylitol","Xylitol"}

 and re-write the second arguement as 

(x)=>Text.Contains([Title],x,Comparer.OrdinalIgnoreCase)

I just intergerate the same replacement into one item.

2 (ItemFromSupportingList)=>not List.Contains(ItemFromSupportingList{0},[Title],(1stArguement,2ndArgument)=>Text.Contains(2ndArgument,1stArguement,Comparer.OrdinalIgnoreCase))
){0}?{1}?

{0} is select the first value of the list after skiping all the items not match the condition. this value will be in the structure as {list,text}, then {1} to get the "text" as the replacement.

View solution in original post

wdx223_Daniel_0-1698967838387.png

so you get a blank list {}, so {}{0} is error, then {}{0}{1} is error.

can add a ? to tolerate the error, when there is no item on the index you want, it will give a null

 

View solution in original post

26 REPLIES 26
wdx223_Daniel
Super User
Super User

each item should like {{"50%","sugarlo"},"SugarLo"}

when all the value in first item's list can not be found in the Title column, then this item {{"50%","sugarlo"},"SugarLo"} will be skipped.

monojchakrab
Resolver III
Resolver III

Hey @Daryl-Lynch-Bzy ,

 

Below is a snapshot of the text column :

 

monojchakrab_0-1698379984092.png

And here is the code I am using for extracting the key words like Stevia or Aspartame and I use similar code to extract the pack size of type (like sachets or tablets) :

 

Table.AddColumn(Custom2, "Ingredient", each if 
Text.Contains([Title],"50%",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"sugarlo",Comparer.OrdinalIgnoreCase)then "SugarLo" else 
if Text.Contains([Title],"Stevia",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"Green",Comparer.OrdinalIgnoreCase)
then "Stevia" else 
        if Text.Contains([Title],"classic",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"cook",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"zero calorie",Comparer.OrdinalIgnoreCase) then "Sucralose" else 
        if Text.Contains([Title],"original" ,Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"Low Calories", Comparer.OrdinalIgnoreCase) 
        then "Aspartame" else 
        if Text.Contains([Title],"xylitol",Comparer.OrdinalIgnoreCase) then "Xylitol" else null)

 

This code keeps getting longer and more complex with every refresh as the texts start getting new additions everytime so I  have to mody this code post every refresh. 

But what remains invariant is the list of actives or the list of pack sizes or the type etc - hence if I can work off a list, it will be a simple task just to update the relevant list

Thanks and appreciate the leg-up

Table.AddColumn(Custom2, "Ingredient", each
            List.Skip(
                      {{{"50%","sugarlo"},"SugarLo"},
                       {{"Stevia","Green"},"Stevia"},
                       {{"classic","cook","zero calorie"},"Sucralose"},
                       {{"original","Low Calories"},"Aspartame"},
                       {{"xylitol"},"Xylitol"}
                      },
                      (x)=>not List.ContainsAny(x{0},[Title],(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?
               )

Hey @wdx223_Daniel - this is the error I am getting :

 

monojchakrab_0-1698404876045.png

 

I typed in your code verbatim.

Table.AddColumn(Custom2, "Ingredient", each
            List.Skip(
                      {{{"50%","sugarlo"},"SugarLo"},
                       {{"Stevia","Green"},"Stevia"},
                       {{"classic","cook","zero calorie"},"Sucralose"},
                       {{"original","Low Calories"},"Aspartame"},
                       {{"xylitol"},"Xylitol"}
                      },
                      (x)=>not List.Contains(x{0},[Title],(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?
               )

please  change List.ContainsAny to List.Contains

@wdx223_Daniel ,

 

Sorry to be a pest here but I am havig trouble working thru' the logic of the code. Appreciate your being patient with my dumb questions :

1. Why are we using a nested list - why cannot we use a flat list, e.g. {"50%", "sugarlo", "low calories","Aspartame"} and so on? The reaso I am asking is this is that the [Title] column also has some other information like forms, pack size etc for which we may not need a nested list as in this code. 

2. I am also not clear as to how do you skip a nested list

3. In the last part of the code, 

(x)=>not List.Contains(x{0},[Title],(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?

I can see you are using the logic for skipping the list, but I did not understand the following :

a. where is variable (x) getting its value from? and why are using the 1st positing by using x{0}?

b. I am not able to understand the 3rd parameter within the list.contains

c. where is variable (y) getting its value from

d. Then in the last part why are we using {0} and then {1} positions and of which list?

Possibly all dumb questions but if I get this logis straight, I can reconstruct the code for extracting the other text strings :

a. Form - "tablets","jars","pouch","sachets"

b. Pack size - "100","300","500","50","100","150" etc

c. Multi pack - "1","2","3","4","5","6"

Appreciate the leg-up and apologies for the long-winded question!

 

1 if use flat list, you can not get the replacement when more than one substrings be assigned a replacement. the list can be re-write as this

{"50%","SugarLo"},
{"sugarlo","SugarLo"},
{"Stevia","Stevia"},
{"Green","Stevia"},
{"classic","Sucralose"},
{"cook","Sucralose"},
{"zero calorie""Sucralose"},
{"original","Aspartame"},
{"Low Calories","Aspartame"},
{"xylitol","Xylitol"}

 and re-write the second arguement as 

(x)=>Text.Contains([Title],x,Comparer.OrdinalIgnoreCase)

I just intergerate the same replacement into one item.

2 (ItemFromSupportingList)=>not List.Contains(ItemFromSupportingList{0},[Title],(1stArguement,2ndArgument)=>Text.Contains(2ndArgument,1stArguement,Comparer.OrdinalIgnoreCase))
){0}?{1}?

{0} is select the first value of the list after skiping all the items not match the condition. this value will be in the structure as {list,text}, then {1} to get the "text" as the replacement.

Thanks a lot @wdx223_Daniel  for the clear explanation. 

Suppose the text string also contains sub-strings like 50,500 and 5 - is there a way we can extrat the "5" from the single "5", but not from "50 and "500"? That along with a remodification of this code can help extract the pack sizes and selling units from the [Title] string?

could you provide some sample data along with the desire output?

The source data remains same...

The code I am using to extract the form is as below :

Table.AddColumn(Ingredients, "Form", each if Text.Contains([Title],"tablets",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"tablet",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"500 pcs",Comparer.OrdinalIgnoreCase) 
        or Text.Contains([Title],"330",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"440",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"550",Comparer.OrdinalIgnoreCase)
        then "Tablets" else 
        if Text.Contains([Title],"sachet",Comparer.OrdinalIgnoreCase) or Text.Contains([Title],"sachets",Comparer.OrdinalIgnoreCase) then "Sachets" else 
        if Text.Contains([Title],"500 g",Comparer.OrdinalIgnoreCase) or  Text.Contains([Title],"500g",Comparer.OrdinalIgnoreCase)
            or Text.Contains([Title],"xylitol",Comparer.OrdinalIgnoreCase) then "Pouch" else 
        if Text.Contains([Title],"80",Comparer.OrdinalIgnoreCase)  or 
        Text.Contains([Title],"150g",Comparer.OrdinalIgnoreCase) or 
        Text.Contains([Title],"150 g",Comparer.OrdinalIgnoreCase) or 
        Text.Contains([Title],"150gm",Comparer.OrdinalIgnoreCase) or 
        Text.Contains([Title],"150 GRAM",Comparer.OrdinalIgnoreCase)
        
        then "Jar" else "Tablets"
                
        )

The code I am using to extract the sizes is as below :

a. First I remove all text from the [Title] string :

Table.AddColumn(MPU, "Remove Text", each Text.Remove([Title],{"A".."Z","a".."z"}))

b. This returns a column as below :

monojchakrab_0-1698824935381.png

c. from this column now, I feed a pre-prepared list :

monojchakrab_1-1698825030161.png

into the following code to extract the matching strings :

 Table.AddColumn(RemoveTextFromTitle, "Pack", each List.Select(PackList,
            (x)=>
            Text.Contains([Remove Text],x,Comparer.OrdinalIgnoreCase)
            ){0}?)

d. This returns a column as below :

monojchakrab_2-1698825292242.png

e. I now clean up the above the column, by feeding another list with corrected packlist :

monojchakrab_3-1698825429803.png

by using the following code :

List.Accumulate({0..Table.RowCount(PackSize)-1},
                                PackPosition,
                                        (x,y)=>
                                            Table.ReplaceValue(x, 
                                                PackSize[OldValue]{y}, PackSize[NewValue]{y},
                                                Replacer.ReplaceValue, {"Pack"})
                                
    )

Little long-winded, but I could not find a simpler way to get to this in fewer steps.

Any help appreciated.

it looks like more complicated.

i can not get the whole story depending on what you showed

@wdx223_Daniel ,

 

Sorry I confused you unintentionally. let me try and simplify this for you :

 

1. We are starting with the same column as above :

 

monojchakrab_0-1698830479484.png

2. This text string, contains terms like sachets, tablets and jars - which are the forms. I want to extract these sub-strings from this column and the desired output should be to add a column headed "Form" with the strings as below :

 

monojchakrab_1-1698830626655.png

3. Next if you notice, the text string column also includes number strings as - 100,300, 500, 80,50 etc. These are pack sizes. The desired output would be to extract these strings into a separate column headed "pack Size" :

monojchakrab_2-1698830865514.png

Not sure if that has simplified things a bit and helps you better with the solution I am seeking.

 

Table.SplitColumn(Custom2, "Title", each
            let 
                  fx=(x,y)=>List.RemoveItems(Splitter.SplitTextByAnyDelimiter(x)(y),{"",null}),
                  Ingredient=List.Skip(
                      {{{"50%","sugarlo"},"SugarLo"},
                       {{"Stevia","Green"},"Stevia"},
                       {{"classic","cook","zero calorie"},"Sucralose"},
                       {{"original","Low Calories"},"Aspartame"},
                       {{"xylitol"},"Xylitol"}
                      },
                      (x)=>not List.Contains(x{0},_,(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?,
                  Form=fx(fx(YourFormList,_),_){0}?,
                  PackSize=fx(fx(YourPackSizeList,_),_){0}?,
                  MultiPack=fx(fx(YourMulitPackList,_),){0}?
                in {Ingredient,Form,PackSize,MultiPack}
               {"Ingredient","Form","PackSize","MultiPack"})

@wdx223_Daniel , I tried the code above only for ingredients :

Table.SplitColumn(ActiveNewWay.1, "Title", each
            let 
                  fx=(x,y)=>List.RemoveItems(Splitter.SplitTextByAnyDelimiter(x)(y),{"",null}),
                  Ingredient=List.Skip(
                      {{{"50%","sugarlo"},"SugarLo"},
                       {{"Stevia","Green"},"Stevia"},
                       {{"classic","cook","zero calorie"},"Sucralose"},
                       {{"original","Low Calories"},"Aspartame"},
                       {{"xylitol"},"Xylitol"}
                      },
                      (x)=>not List.Contains(x{0},_,(x,y)=>Text.Contains(y,x,Comparer.OrdinalIgnoreCase))
                      ){0}?{1}?
                //   Form=fx(fx(YourFormList,_),_){0}?,
                //   PackSize=fx(fx(YourPackSizeList,_),_){0}?,
                //   MultiPack=fx(fx(YourMulitPackList,_),){0}?
                in {Ingredient}
               {"Ingredient"})

And this is returning the following error :

monojchakrab_0-1698911243822.png

Am I missing something here?

regds.,

wdx223_Daniel_1-1698913925924.png

 

Thanks @wdx223_Daniel - the missing underscore probably is not causing the error since that is commented out. Will try the comma after {Ingredient} and check the result. 

One question though :

when you are feeding the variable (x){0} into the list.contains function, will it not always check against the 1st item in the list (which in this case is "SugarLo"? I am slightly confused here. Should not the list.contains check each item in the list within the [Title] column? but (x){0} will always check for the 1st item and not list.contains will always return false, right? I am not able to figure out that if its not a recursive function, then how for each row in column [Title], each item of the list will be checked for true or false?

this function will be ran in each row of your table.

it firstly get the value of [Title] in this row

then, test if it contains any value in the first list-item of each row in your pre-prepared list

it will skip all the row of which the first list-item did not be found in Title,

then get the first row of the remain pre-prepared list, and get the 2nd item of the row as the replacement.   

@wdx223_Daniel - I was trying out the code in a smaller bit as follows :

1. I have prepared a list called [strings} as follows :

let
    Source = {
{{"SugarLo","50%"},"SugarLo"},
{{"Original", "Low Calories"},"Aspartame"},
{{"Classic","cook","zero calorie"},"Sucralose"},
{{"Stevia","green"},"Stevia"},
{{"Xylitol"},"Xylitol"}
}
in
    Source

2. Then I am trying out your code on this list as follows :

 List.Skip(Strings, (x)=>not List.Contains(x{0},"SugarLo"))

Since the condition is false it returns all the 5 lists and if we use {0}{1} on this list it correctly returns "SugarLo"

3. the problem starts when I change "SugarLo" to "Aspartame" e.g. - 

List.Skip(Strings, (x)=>not List.Contains(x{0},"Aspartame"))

which returns an empty list and hence the {0}{1} does not work and returns an error.

Am I missing something in the logic here? I am really sorry but I am not able to work out how will (x){0} work on the list recursively?

can you help?

thanks and really appreciate and apologies for the bother!

 

wdx223_Daniel_0-1698967838387.png

so you get a blank list {}, so {}{0} is error, then {}{0}{1} is error.

can add a ? to tolerate the error, when there is no item on the index you want, it will give a null

 

Thanks @wdx223_Daniel - its a bit clearer now.

Helpful resources

Announcements
RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.

Top Solution Authors
Top Kudoed Authors