Solved: Robust function to remove HTML tags

unicorn · ‎05-07-2017

Hi, I have a column of HTML data that I need to display as text. I need to get rid of all HTML tags and substitute reserved HTML characters. I was trying to recursively remove all HTML tags first. Then I plan to create a table of the most common reserved characters, the values to replace them with, and generate a list to call replace for each row in that table.

I am still new to M but I came across this to replace HTML tags. Except it only removes a single tag, which is not helpful.

https://social.technet.microsoft.com/Forums/en-US/7ec64d6d-c3fc-4110-94c7-2e0087171475/how-to-remove...)

But when I try to expand that to a recursive function I'm getting a stack overflow error. I plan to start again to try to learn M, but hoping somebody can help me in the meantime with a quick fix to figure out what I did wrong. Below is my function. Thanks.

let
    removeOne = (input) =>
        let
            text = Text.From(input),
            length = Text.Length(text),
            position = Text.PositionOf(text, "<"),
            positionEnd = Text.PositionOf(text, ">"),
            range = positionEnd-position+1,
            result = if position >= 0 then Text.ReplaceRange(text, position, range, "") else input
        in
            result,

    removeAll = (input) =>
        let
            rmvOne = removeOne(input),
            rmvAll = if Text.PositionOf(rmvOne, "l") >= 0 then @removeAll(rmvOne) else rmvOne
        in
            rmvAll,

// Source = get from database,
Source = "<p>hello I am text to replace, lol</p>",

// return = removeAll([columnToPassToFunction])
    return = removeAll(Source)
        in
            return

MarcelBeug · ‎05-07-2017

Just change the "|" into "<"

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)

View solution in original post

pstaggers · ‎05-11-2022

Hi, I am using this custom function to remove all HTML tags:

(TableToSearch as table, ColumnToSearch as text) =>
Table.TransformColumns(TableToSearch, {ColumnToSearch, each Table.FirstValue(Html.Table(_ ?? "", {{"text",":root"}}))})

The initial idea came from Chris Webb's blog:

https://blog.crossjoin.co.uk/2019/06/09/removing-html-tags-from-text-in-power-query-power-bi/

Coalescing Operator (??) to prevent errors on nulls was an addition courtesy of Ben Gribaudo.

stejin · ‎07-21-2020

Here is a different approach that uses the built-in Web.Page function.

Below custom function replace special characters such as   or $amp; and additionally parses and combines text in HTML tags.

(Text as any) => let
        Source = Text,
        Html = Web.Page(Source),
        resolve = (t as table) =>
            let
                Result = List.Accumulate(Table.ToRecords(t), {}, (state, current) => List.Combine({state, if current[Text] <> null then {current[Text]} else if current[Children] <> null then @resolve(current[Children]) else {null} }))
            in
                Result,
        Tables = Html[Data]{0},
        Text1 = resolve(Tables),
        Text2 = List.Select(Text1, each Text.Length(Text.Trim(_)) > 0),
        Text3 = Text.Combine(Text2, Character.FromNumber(10))
    in
        Text3

vsslasd1 · ‎06-05-2020

I've read through this, and still a bit confused.

I just want to add a new text column based upon an HTML-based column and remove the HTML tags as described. I'm not certain how to add a custom function.

Can this all be done in one step, within the creation of the new column ?

My new column name is: NoteText which is based upon the column NoteHTML

Thank you

Anonymous · ‎06-05-2020

Add a function just like you would a standard query.

1. Create a new blank query. Name it "HTML Cleaner".

2. In the Advanced Editor, paste the following:

(HTML as text) =>
let
Source = Text.From(HTML),
SplitAny = Text.SplitAny(Source,"<>"),
ListAlternate = List.Alternate(SplitAny,1,1,1),
ListSelect = List.Select(ListAlternate, each _<>" "),
TextCombine = Text.Combine(ListSelect, " ")
in
TextCombine

3. Close the Editor.

4. In your primary query, open the Advanced Editor.

5. Add the following step:

#"HTML Cleanup" = Table.AddColumn(#"Previous Step", "NoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML])),

Good luck.

vsslasd1 · ‎06-08-2020

Thank you, but unfortunately, no luck. I get over 4000 errors.

This is the syntax I have:

let
Source = OData.Feed("https://xyzasdf1111.api.crm.dynamics.com/api/data/v9.1", null, [Implementation="2.0"]),
new_arcollectioncalls_table = Source{[Name="new_arcollectioncalls",Signature="table"]}[Data],
#"Expanded new_currentowner" = Table.ExpandRecordColumn(new_arcollectioncalls_table, "new_currentowner", {"fullname"}, {"new_currentowner.fullname"}),
#"Expanded new_BillingMgr" = Table.ExpandRecordColumn(#"Expanded new_currentowner", "new_BillingMgr", {"fullname"}, {"new_BillingMgr.fullname"}),
#"Expanded new_arcollectioncall_Annotations" = Table.ExpandTableColumn(#"Expanded new_BillingMgr", "new_arcollectioncall_Annotations", {"createdon", "_ownerid_value", "notetext"}, {"new_arcollectioncall_Annotations.createdon", "new_arcollectioncall_Annotations._ownerid_value", "new_arcollectioncall_Annotations.notetext"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded new_arcollectioncall_Annotations",{"new_arcollectioncall_Annotations._ownerid_value"}),
#"Split Column by Delimiter" = Table.SplitColumn(Table.TransformColumnTypes(#"Removed Columns", {{"new_arcollectioncall_Annotations.createdon", type text}}, "en-US"), "new_arcollectioncall_Annotations.createdon", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"new_arcollectioncall_Annotations.createdon.1", type date}, {"new_arcollectioncall_Annotations.createdon.2", type time}}),
#"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns1",{{"new_arcollectioncall_Annotations.createdon.1", type date}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type1", each ([new_status] = true)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"new_legal", "_new_currentowner_value", "_ownerid_value", "new_collectionnotes", "new_collections", "_createdonbehalfby_value", "new_customer_number", "createdon", "owninguser", "owningteam", "ownerid", "new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.notetext", "new_BillingMgr.fullname", "new_currentowner.fullname"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns",{{"new_arcollectioncall_Annotations.notetext", "NoteHTML"}}),

#"HTML Cleanup" = Table.AddColumn(#"Renamed Columns", "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]))
in
#"HTML Cleanup"

And the "Blank Query" Definition:

"(HTML as text) =>#(cr)#(lf)let#(cr)#(lf)Source = Text.From(HTML),#(cr)#(lf)SplitAny = Text.SplitAny(Source,""<>""),#(cr)#(lf)ListAlternate = List.Alternate(SplitAny,1,1,1),#(cr)#(lf)ListSelect = List.Select(ListAlternate, each _<>"" ""),#(cr)#(lf)TextCombine = Text.Combine(ListSelect, "" "")#(cr)#(lf)in#(cr)#(lf)TextCombine"

Anonymous · ‎06-18-2020

Unfortunately, without seeing the errors, and the data that generates them, I won't be able to help much further. Perhaps you could create a 'dummy' table in the query, to include data rows that generate errors, and rows that don't. Use the following code:

let

Table = #table( type table
[
#"NoteHTML" = text
],
{
{"Data that throws an error"},
{"Data that throws an error"},
{null},
{"Data that doesn't"}
}
),

#"HTML Cleanup" = Table.AddColumn(Table, "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]), type text)

in
#"HTML Cleanup"

TerrificPoet · ‎05-15-2018

Hello,

I have the same problem as you but i'm not sure where should i copy and paste that function. I have a table with multiple columns and i just need to clean the data of HTML TAGS in one of them could you please explain in more details how you did it

Kind regards,

Nicolas

MarcelBeug · ‎05-07-2017

Just change the "|" into "<"

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)

unicorn · ‎05-08-2017

Shoot, sorry thanks. I was doing a lot of testing and started with a recursive function for removing 'L'. Will give the full code a shot now. Also, good point about tag pairs. If you don't mind, would love to look through your code to improve this.

MarcelBeug · ‎05-08-2017

Here is my code. It should be called with StartPosition 0. On each iteration, the code examines the string from that startposition for code tag pairs, removes them if found and call the next iteration with a new StartPosition, so the code is also able to handle the situation that the string includes an end tag without corresponding start tag, or </ without corresponding >.

Notice that I also included a mechanism to stop after a maximum number of iterations, which is especiallly useful during development, to prevent endless iterations (until stack overflow).

        fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text =>
        let
            StringFromStartposition = Text.RemoveRange(String, 0, StartPosition),
            StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"),
            PositionsEndTag = if StartPositionEndTag = -1 
                                then -1 
                                else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"),
            StartTag = if PositionsEndTag = -1 
                       then null 
                       else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1),
            StartPositionStartTag = if PositionsEndTag = -1 
                                    then -1 
                                    else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last),
            NewString = if StartPositionStartTag = -1 
                        then String 
                        else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag),
            NextStartPosition = if PositionsEndTag = -1
                                then -1
                                else if StartPositionStartTag = -1
                                     then StartPosition + StartPositionEndTag + 1
                                     else StartPosition + StartPositionEndTag - PositionsEndTag,
            Result = if NextStartPosition = -1
                     then NewString
                     else if Iteration = null 
                          then @fnRHTMLT(NewString, NextStartPosition)
                          else if Iteration = MaxIterations
                               then NewString
                               else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations)

        in
            Result

Specializing in Power Query Formula Language (M)

mmayer · ‎04-17-2019

I tried incorporating your code to clean up the HTML in only one of my columns. But I get the following error: "Expresion.Error: 1 arguments were passed to function which expects between 2 and 4.

Details:

Pattern=

Arguments=List"

This is the HTML I am trying to remove from the [Notes] column:

Before:

<div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">Request includes: Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">TSR has been submitted.  Waiting on TEC.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">SWS clean-up and data update in progress in preparation of system modifications.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">System cleaned up and ready to work with TEC on revamp.<br></div>

After:

Request includes: Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items.TSR has been submitted.  Waiting on TEC. SWS clean-up and data update in progress in preparation of system modifications. System cleaned up and ready to work with TEC on revamp.

Here is the code:

let
  Source = SharePoint.Tables("https://orgname.sharepoint.com/sites/department/", [ApiVersion = 15]),
    #"2be78719-1e12-4827-8f88-d9edd1a7781f" = Source{[Id="2be78719-1e12-4827-8f88-d9edd1a7781f"]}[Items],
    #"Renamed Columns" = Table.RenameColumns(#"2be78719-1e12-4827-8f88-d9edd1a7781f",{{"ID", "ID.1"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"End Date", type date}, {"Start Date", type date}, {"Today", type date}, {"End Date (Actu", type date}, {"Created", type date}, {"Modified", type date}, {"EndDateT", type text}, {"Completion", Int64.Type}, {"Id", Int64.Type}, {"ID.1", Int64.Type}, {"EditorId", Int64.Type}}),
      fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text =>
        let
            StringFromStartposition = Text.RemoveRange(String, 0, StartPosition),
            StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"),
            PositionsEndTag = if StartPositionEndTag = -1 
                                then -1 
                                else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"),
            StartTag = if PositionsEndTag = -1 
                       then null 
                       else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1),
            StartPositionStartTag = if PositionsEndTag = -1 
                                    then -1 
                                    else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last),
            NewString = if StartPositionStartTag = -1 
                        then String 
                        else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag),
            NextStartPosition = if PositionsEndTag = -1
                                then -1
                                else if StartPositionStartTag = -1
                                     then StartPosition + StartPositionEndTag + 1
                                     else StartPosition + StartPositionEndTag - PositionsEndTag,
            Result = if NextStartPosition = -1
                     then NewString
                     else if Iteration = null 
                          then @fnRHTMLT(NewString, NextStartPosition)
                          else if Iteration = MaxIterations
                               then NewString
                               else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations)

        in
            Result,      
    #"RemoveHTML" = Table.TransformColumns(#"Changed Type",{{"Notes", fnRHTMLT, type text}}),

in
    RemoveHTML

Grumelo · ‎07-24-2018

This solution is capable of removing tags like

<p> </p>

but not something like

<div class="..."> </div>

An example of function call would be welcome. I don't understand what the meaning of "Iteration".

Is it the amount of tags you want to remove?

Grumelo · ‎07-24-2018

My solution to clean HTML tags

let
   TextFromHtml = (HTML as any) =>
let
    Source = if HTML = null then
        ""
    else
        Text.From(HTML),
    SplitAny = Text.SplitAny(Source,"<>"),
    ListAlternate = List.Alternate(SplitAny,1,1,1),
    ListSelect = List.Select(ListAlternate, each _<>""),
    TextCombine = Text.Combine(ListSelect, "")
in
    TextCombine
in
    TextFromHtml

For beginners: create a blank query an copy paste the code above; rename the query as "TextFromHtml"

Example of calling this function on a column called Comment:
= Table.TransformColumns(#"Previous Step",{{"Comment", TextFromHtml, type text}})

Reference: https://social.technet.microsoft.com/Forums/office/en-US/b080d122-14f0-43bc-8a86-f50cdf1c32d8/removi...

sizi · ‎03-25-2024

Hello. I tried your query, but its giving me error as below:

Expression.Error: We cannot convert the value "<div>The business ca..." to type Table.
Details:
Value=<div>The business case was delayed, as the proposals were not submitted by the vendors ontime</div>
Type=[Type].

This is the custom function : NoteText= Table.AddColumn(#"Changed Type4", "NoteText", each Table.TransformColumns([#"Notes/Comments"],{{[#"Notes/Comments - Copy"], TextFromHtml, type text}})) . Kindly help as my data is as below which varies with every comment :

<div class="ExternalClassAEC9E21CD9CB4D5099FAB2EBE2BC4A87"><div style="font-family:Calibri, Arial, Helvetica, sans-serif;font-size:11pt;color:rgb(0, 0, 0);">Final sign off confirmation received today from BRM. Attached the confirmation email.</div></div>

<div class="ExternalClass82C655AC166D45B68A0237698583C179"><div style="font-family:Calibri, Arial, Helvetica, sans-serif;font-size:11pt;color:rgb(0, 0, 0);">PQD provided the focal point list yesterday. Meeting to be scheduled as per the Business availability to revalidate the BRD.  </div></div>

Comments are as above for reference. I need the actual text from these comments.

Thanks,

ScottTynan · ‎09-18-2022

Thanks Grumelo, your response solved my issue!

Anonymous · ‎10-24-2017

Just tried your code, but I got the error "The name 'Iteration' wasn't recognized. Make sure it's spelled correctly." Any suggestions?

MarcelBeug · ‎10-24-2017

Probably you didn't copy my code correctly.

Otherwise: the code must be preceded by an additional "let", and "in fnRHTMLT" must be added at the end:

let
    

    <code block: see above>


in
    fnRHTMLT

Specializing in Power Query Formula Language (M)

Anonymous · ‎10-25-2017

OK, thanks, that worked. But it doesn't strip all HTML. It cleaned <b>, </b>, <ol>, </ol>, and a few other one- and two-character strings, but it overlooked <br>, </a>, </span>, and long <span...........>, <table...........> and <img........> strings.

I don't see anything in the code that would filter some one- or two-character strings but not others. How can I adapt the code to find strings of any length?

MarcelBeug · ‎10-25-2017

The code doesn't look at the lengths, in other words, lengths don't matter.

The code searches for pairs of tags like <p> and </p>

In general: a start tag without slash and an end tag with slash /.

Any tags that don't come in such pairs are not removed.

You might consider a solution where anything between "<" and ">" is removed (including these delimiters), but the risk is that such characters are removed that are not part of a HTML tag.

You can find an example on Technet.

Specializing in Power Query Formula Language (M)

Max01 · ‎11-02-2017

Hello,

Very new to Power BI and trying to use the solution posted above but I'm not sure which bits to replace and exactly what to replace them with? I have pasted the query into the advanced editor and have tried replacing 'string as text' with my column name but it's not working.

Can you please help by explaining?

Thank you,

Max

Robust function to remove HTML tags

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly