Solved: Re: getting words that contains diacritics from te...

prorokrak · ‎03-15-2022

Hi, I have a column with names. Names sometimes consist of several words and I need to extract only the words that have diacritics. For example, from the 1st row "Bühnen/Buehnen/Buhnen", I need to get the word "Bühnen" only. The diacritics characters are specified in an separate list.

I wanted to use List.Intersect function with custom comparer. I found several articles that made me think this should be doable:

https://blog.crossjoin.co.uk/2017/01/22/the-list-m-functions-and-the-equationcriteria-argument/

https://gorilla.bi/power-query/complete-guide-to-lists/#remove-values-from-lists

I also tried to adapt a solution provided in attached pbix file from @ImkeF in this post, but I was unsuccessful.

When I try to use custom comparer I get an error: "A custom comparer cannot be used in this context."

I am not sure whether this is a limitation of this List.Intersect function or whether I got the syntax wrong. I thought that by using (x,y) as function signature, x would stand for an item from "Name Parts" list, and y would stand for an item from DiacriticsList, and that List.Intersect will make the comparison recursively for all items in the list.

Here is my code:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("RU/LCsIwEPyVpSelFr/BClasFlG8WHowcduEpinkcdAPyym3/Jg1ip6G2Zndma3rJA+eSZTL3OIX35A0izo5UxacJJaySK/pBmbXUQhUkMJGBbcVnLJ5FFfKWAWFCI72kyHO1ukRDsERFPqhDQ64PIx/Ei1F8NWTWNVNN88GeWeQMsl7KAbySd0Ff+pQQmG1gdJKbbQZ2zaLYjnlrXsZl29MZHCR9yhUwe0FwlGNLYdcWc1+H/nT1K55AQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Name = _t]),
    FullNames = Table.TransformColumnTypes(Source,{{"Name", type text}}),
    DiacriticsList = List.Buffer({"ö", "ü"}),
    SeparatorsList = List.Buffer({",","-"," ","&",".","/","(",")"}),

    DuplicatedFullNames = Table.DuplicateColumn(FullNames, "Name", "Name Copy"),
    ConvertedNamePartsToList = Table.SplitColumn(DuplicatedFullNames, "Name Copy", Splitter.SplitTextByAnyDelimiter(SeparatorsList, QuoteStyle.None), {"Name Parts"}, null, ExtraValues.List),
    NamePartsWDiacriticsOnly = Table.AddColumn(ConvertedNamePartsToList, "Name Parts with Diacritics", each List.Intersect({[Name Parts], DiacriticsList}, (x, y) => Text.Contains(x, y, Comparer.OrdinalIgnoreCase)), List.Type)
in
    NamePartsWDiacriticsOnly

Any help would be really appreciated.

Vijay_A_Verma · ‎03-15-2022

Just use Text.Combine additionally.

NamePartsWDiacriticsOnly = Table.AddColumn(ConvertedNamePartsToList, "Name Parts with Diacritics", each try Text.Combine(List.Select([Name Parts], (x)=>List.ContainsAny(Text.ToList(x),DiacriticsList, Comparer.OrdinalIgnoreCase)),", ") otherwise null)

Regarding words part, I will come back later as I am trying to have a one line solution for words matching rather than character matching.

View solution in original post

Vijay_A_Verma · ‎03-15-2022

Replace the statement before in with this

NamePartsWDiacriticsOnly = Table.AddColumn(ConvertedNamePartsToList, "Name Parts with Diacritics", each try List.Select([Name Parts], (x)=>List.ContainsAny(Text.ToList(x),DiacriticsList, Comparer.OrdinalIgnoreCase)){0} otherwise null)

prorokrak · ‎03-15-2022

Hi @Vijay_A_Verma, there is one thing I realized as I used this solution with my full data set which was not clear from the example data set. Sometimes there is more than one word with diacritics in the text. However, this solution keeps just the first word. I would actually need the list of all words with diacritics.

For example, if you have "Z+F (Züller + FröHlich)" instead of "Z+F (Zoller + FröHlich)", only word "Züller" would be kept. Second word will be omitted even though I need it as well.

Any idea how could I get this?

Vijay_A_Verma · ‎03-15-2022

Just use Text.Combine additionally.

NamePartsWDiacriticsOnly = Table.AddColumn(ConvertedNamePartsToList, "Name Parts with Diacritics", each try Text.Combine(List.Select([Name Parts], (x)=>List.ContainsAny(Text.ToList(x),DiacriticsList, Comparer.OrdinalIgnoreCase)),", ") otherwise null)

Regarding words part, I will come back later as I am trying to have a one line solution for words matching rather than character matching.

prorokrak · ‎03-15-2022

Actually, I realized that I can just remove {0} part. And also, I do not even need the error checking because for items that are not found it returns empty list. And in the next step I can expand the list (which returns null if there is no match and, of course, creates duplicate rows in case there are several words with diacritics but this is OK with me).

So this is what I have done:

AddedNamePartsWDiacriticsOnly = Table.AddColumn(#"Split Column by Character Transition", "Name Parts with Diacritics", each List.Select([Name Parts], (x)=>List.ContainsAny(Text.ToList(x),List.Buffer(DiacriticsList), Comparer.OrdinalIgnoreCase)), List.Type),
ExpandedNamePartswithDiacritics = Table.ExpandListColumn(AddedNamePartsWDiacriticsOnly, "Name Parts with Diacritics"),
FilteredOutNonDiacritics = Table.SelectRows(ExpandedNamePartswithDiacritics, each ([Name Parts with Diacritics] <> null))

It should be obvious to me from the beginning. Thank you for your patience and persistence. And also thank you for looking into the substring-search solution.

Vijay_A_Verma · ‎03-15-2022

Below is a solution based on function (Edit - Couldn't make oneliner. Will try sometime in future to make one liner). Replace Source and DiacriticsList appropriately. In the end, you can expand tables.

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    FullNames = Table.TransformColumnTypes(Source,{{"Name", type text}}),
    DiacriticsList = List.Buffer({"ö", "ü", "ó", "ã", "í","č","è", "Sa", "vad", "r" }),
    SeparatorsList = List.Buffer({",","-"," ","&",".","/","(",")"}),

    DuplicatedFullNames = Table.DuplicateColumn(FullNames, "Name", "Name Copy"),
    ConvertedNamePartsToList = Table.SplitColumn(DuplicatedFullNames, "Name Copy", Splitter.SplitTextByAnyDelimiter(SeparatorsList, QuoteStyle.None), {"Name Parts"}, null, ExtraValues.List),
    //Start of function
    fxConvertfunc=(InputList,DiacriticsList)=>
let
    #"Converted to Table" = Table.FromList(InputList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", (x)=> List.AnyTrue(List.Transform(DiacriticsList, each Text.Contains(x[Column1],_)))),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = true)),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Custom"}),
    #"Transposed Table" = Table.Transpose(#"Removed Columns")
in
    #"Transposed Table",
    //End of function
    #"Added Custom" = Table.AddColumn(ConvertedNamePartsToList, "Custom", each fxConvertfunc([Name Parts],DiacriticsList))
in
    #"Added Custom"

prorokrak · ‎03-16-2022

Hi @Vijay_A_Verma ,

thank you for looking further to another possible solution. It was really an interesting take on how to tackle the problem I needed to solve. I will add it to my toolbox of useful ideas.

In the end I have simplified your function like this:

    fxConvertfunc=(InputList,DiacriticsList)=>
let
    #"Converted to Table" = Table.FromList(InputList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", (x)=> List.AnyTrue(List.Transform(DiacriticsList, each Text.Contains(x[Column1],_)))),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = true)),
    #"Removed Columns" = #"Filtered Rows"[Column1]
in
    #"Removed Columns"

Having that transposed table as a result was actually a bit tricky. I had a lot of rows containing tables with different number of columns. PowerQuery was unable to determine correctly how many columns I actually have in those tables. If I wanted to expand them, it would require me to know ahead what is the maximum number of columns in the tables. Since I wanted to have a list of results in the individual rows, my adjustment was one of the ways how I could achieve this.

I really appreaciate all the effort you put into helping me.

prorokrak · ‎03-15-2022

Thank you very much @Vijay_A_Verma,

it really solved the problem at hand. Interesting workaround - split the text into individual characters and then check whether any of the diacritics characters are there!

Still, I am curious how could this be handled if I needed to look for substrings, not just for single characters.

Let's say I would like to search also for words where the diacritics were transcribed into multiple ANSI characters. For example, if the DiacriticsList would look like this:

DiacriticsList = List.Buffer({"ö", "ü", "oe", "ue"})

Any ideas how this could be handled?

getting words that contains diacritics from text

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024