cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
Regular Visitor

Robust function to remove HTML tags

Hi, I have a column of HTML data that I need to display as text. I need to get rid of all HTML tags and substitute reserved HTML characters. I was trying to recursively remove all HTML tags first. Then I plan to create a table of the most common reserved characters, the values to replace them with, and generate a list to call replace for each row in that table.

 

I am still new to M but I came across this to replace HTML tags. Except it only removes a single tag, which is not helpful.

https://social.technet.microsoft.com/Forums/en-US/7ec64d6d-c3fc-4110-94c7-2e0087171475/how-to-remove...)

 

But when I try to expand that to a recursive function I'm getting a stack overflow error. I plan to start again to try to learn M, but hoping somebody can help me in the meantime with a quick fix to figure out what I did wrong. Below is my function. Thanks.

 

let
    removeOne = (input) =>
        let
            text = Text.From(input),
            length = Text.Length(text),
            position = Text.PositionOf(text, "<"),
            positionEnd = Text.PositionOf(text, ">"),
            range = positionEnd-position+1,
            result = if position >= 0 then Text.ReplaceRange(text, position, range, "") else input
        in
            result,

    removeAll = (input) =>
        let
            rmvOne = removeOne(input),
            rmvAll = if Text.PositionOf(rmvOne, "l") >= 0 then @removeAll(rmvOne) else rmvOne
        in
            rmvAll,

 

// Source = get from database,
    Source = "<p>hello I am text to replace, lol</p>",

// return = removeAll([columnToPassToFunction])
    return = removeAll(Source)
        in
            return

1 ACCEPTED SOLUTION
Community Champion
Community Champion

Just change the "|" into "<"

 

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

 

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

 

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)

View solution in original post

18 REPLIES 18
New Member

Here is a different approach that uses the built-in Web.Page function.

 

Below custom function replace special characters such as &nbsp; or $amp; and additionally parses and combines text in HTML tags.

 

(Text as any) => let
        Source = Text,
        Html = Web.Page(Source),
        resolve = (t as table) =>
            let
                Result = List.Accumulate(Table.ToRecords(t), {}, (state, current) => List.Combine({state, if current[Text] <> null then {current[Text]} else if current[Children] <> null then @resolve(current[Children]) else {null} }))
            in
                Result,
        Tables = Html[Data]{0},
        Text1 = resolve(Tables),
        Text2 = List.Select(Text1, each Text.Length(Text.Trim(_)) > 0),
        Text3 = Text.Combine(Text2, Character.FromNumber(10))
    in
        Text3

 

Helper I
Helper I

I've read through this, and still a bit confused. 

I just want to add a new text column based upon an HTML-based column and remove the HTML tags as described. I'm not certain how to add a custom function. 

Can this all be done in one step, within the creation of the new column ? 

My new column name is: NoteText which is based upon the column NoteHTML

 

Thank you

Add a function just like you would a standard query.

 

1.  Create a new blank query.  Name it "HTML Cleaner".

2.  In the Advanced Editor, paste the following:

 

(HTML as text) =>
let
Source = Text.From(HTML),
SplitAny = Text.SplitAny(Source,"<>"),
ListAlternate = List.Alternate(SplitAny,1,1,1),
ListSelect = List.Select(ListAlternate, each _<>" "),
TextCombine = Text.Combine(ListSelect, " ")
in
TextCombine

 

3.  Close the Editor.

4.  In your primary query, open the Advanced Editor.

5.  Add the following step:

 

#"HTML Cleanup" = Table.AddColumn(#"Previous Step", "NoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML])),

 

Good luck.

Thank you, but unfortunately, no luck. I get over 4000 errors. 

This is the syntax I have:

 

let
Source = OData.Feed("https://xyzasdf1111.api.crm.dynamics.com/api/data/v9.1", null, [Implementation="2.0"]),
new_arcollectioncalls_table = Source{[Name="new_arcollectioncalls",Signature="table"]}[Data],
#"Expanded new_currentowner" = Table.ExpandRecordColumn(new_arcollectioncalls_table, "new_currentowner", {"fullname"}, {"new_currentowner.fullname"}),
#"Expanded new_BillingMgr" = Table.ExpandRecordColumn(#"Expanded new_currentowner", "new_BillingMgr", {"fullname"}, {"new_BillingMgr.fullname"}),
#"Expanded new_arcollectioncall_Annotations" = Table.ExpandTableColumn(#"Expanded new_BillingMgr", "new_arcollectioncall_Annotations", {"createdon", "_ownerid_value", "notetext"}, {"new_arcollectioncall_Annotations.createdon", "new_arcollectioncall_Annotations._ownerid_value", "new_arcollectioncall_Annotations.notetext"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded new_arcollectioncall_Annotations",{"new_arcollectioncall_Annotations._ownerid_value"}),
#"Split Column by Delimiter" = Table.SplitColumn(Table.TransformColumnTypes(#"Removed Columns", {{"new_arcollectioncall_Annotations.createdon", type text}}, "en-US"), "new_arcollectioncall_Annotations.createdon", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"new_arcollectioncall_Annotations.createdon.1", type date}, {"new_arcollectioncall_Annotations.createdon.2", type time}}),
#"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns1",{{"new_arcollectioncall_Annotations.createdon.1", type date}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type1", each ([new_status] = true)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"new_legal", "_new_currentowner_value", "_ownerid_value", "new_collectionnotes", "new_collections", "_createdonbehalfby_value", "new_customer_number", "createdon", "owninguser", "owningteam", "ownerid", "new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.notetext", "new_BillingMgr.fullname", "new_currentowner.fullname"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns",{{"new_arcollectioncall_Annotations.notetext", "NoteHTML"}}),

#"HTML Cleanup" = Table.AddColumn(#"Renamed Columns", "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]))
in
#"HTML Cleanup"

 

 

And the "Blank Query" Definition: 

"(HTML as text) =>#(cr)#(lf)let#(cr)#(lf)Source = Text.From(HTML),#(cr)#(lf)SplitAny = Text.SplitAny(Source,""<>""),#(cr)#(lf)ListAlternate = List.Alternate(SplitAny,1,1,1),#(cr)#(lf)ListSelect = List.Select(ListAlternate, each _<>"" ""),#(cr)#(lf)TextCombine = Text.Combine(ListSelect, "" "")#(cr)#(lf)in#(cr)#(lf)TextCombine"

 

 

 

 

pbi.png

Unfortunately, without seeing the errors, and the data that generates them, I won't be able to help much further.  Perhaps you could create a 'dummy' table in the query, to include data rows that generate errors, and rows that don't.  Use the following code:

 

let

Table = #table( type table
[
#"NoteHTML" = text
],
{
{"Data that throws an error"},
{"Data that throws an error"},
{null},
{"Data that doesn't"}
}
),

#"HTML Cleanup" = Table.AddColumn(Table, "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]), type text)

in
#"HTML Cleanup"

Helper I
Helper I

Hello,

I have the same problem as you but i'm not sure where should i copy and paste that function. I have a table with multiple columns and i just need to clean the data of HTML TAGS in one of them could you please explain in more details how you did it 

Kind regards,

 

Nicolas

Community Champion
Community Champion

Just change the "|" into "<"

 

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

 

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

 

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)

View solution in original post

Shoot, sorry thanks. I was doing a lot of testing and started with a recursive function for removing 'L'. Will give the full code a shot now. Also, good point about tag pairs. If you don't mind, would love to look through your code to improve this.

Here is my code. It should be called with StartPosition 0. On each iteration, the code examines the string from that startposition for code tag pairs, removes them if found and call the next iteration with a new StartPosition, so the code is also able to handle the situation that the string includes an end tag without corresponding start tag, or </ without corresponding >.

 

Notice that I also included a mechanism to stop after a maximum number of iterations, which is especiallly useful during development, to prevent endless iterations (until stack overflow).

 

 

        fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text =>
        let
            StringFromStartposition = Text.RemoveRange(String, 0, StartPosition),
            StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"),
            PositionsEndTag = if StartPositionEndTag = -1 
                                then -1 
                                else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"),
            StartTag = if PositionsEndTag = -1 
                       then null 
                       else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1),
            StartPositionStartTag = if PositionsEndTag = -1 
                                    then -1 
                                    else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last),
            NewString = if StartPositionStartTag = -1 
                        then String 
                        else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag),
            NextStartPosition = if PositionsEndTag = -1
                                then -1
                                else if StartPositionStartTag = -1
                                     then StartPosition + StartPositionEndTag + 1
                                     else StartPosition + StartPositionEndTag - PositionsEndTag,
            Result = if NextStartPosition = -1
                     then NewString
                     else if Iteration = null 
                          then @fnRHTMLT(NewString, NextStartPosition)
                          else if Iteration = MaxIterations
                               then NewString
                               else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations)

        in
            Result

 

Specializing in Power Query Formula Language (M)

I tried incorporating your code to clean up the HTML in only one of my columns. But I get the following error: "Expresion.Error: 1 arguments were passed to function which expects between 2 and 4.

Details:

Pattern=

Arguments=List"

 

This is the HTML I am trying to remove from the [Notes] column:

 

Before:

<div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">Request includes&#58; Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items​.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">TSR has been submitted.&#160; Waiting on TEC.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">SWS clean-up and data update in progress in preparation of system modifications.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">System cleaned up and ready to work with TEC on revamp.<br></div>

 

After:

Request includes&#58; Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items​.TSR has been submitted.&#160; Waiting on TEC. SWS clean-up and data update in progress in preparation of system modifications. System cleaned up and ready to work with TEC on revamp.

 

 

Here is the code:

 

let
  Source = SharePoint.Tables("https://orgname.sharepoint.com/sites/department/", [ApiVersion = 15]),
    #"2be78719-1e12-4827-8f88-d9edd1a7781f" = Source{[Id="2be78719-1e12-4827-8f88-d9edd1a7781f"]}[Items],
    #"Renamed Columns" = Table.RenameColumns(#"2be78719-1e12-4827-8f88-d9edd1a7781f",{{"ID", "ID.1"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"End Date", type date}, {"Start Date", type date}, {"Today", type date}, {"End Date (Actu", type date}, {"Created", type date}, {"Modified", type date}, {"EndDateT", type text}, {"Completion", Int64.Type}, {"Id", Int64.Type}, {"ID.1", Int64.Type}, {"EditorId", Int64.Type}}),
      fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text =>
        let
            StringFromStartposition = Text.RemoveRange(String, 0, StartPosition),
            StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"),
            PositionsEndTag = if StartPositionEndTag = -1 
                                then -1 
                                else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"),
            StartTag = if PositionsEndTag = -1 
                       then null 
                       else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1),
            StartPositionStartTag = if PositionsEndTag = -1 
                                    then -1 
                                    else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last),
            NewString = if StartPositionStartTag = -1 
                        then String 
                        else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag),
            NextStartPosition = if PositionsEndTag = -1
                                then -1
                                else if StartPositionStartTag = -1
                                     then StartPosition + StartPositionEndTag + 1
                                     else StartPosition + StartPositionEndTag - PositionsEndTag,
            Result = if NextStartPosition = -1
                     then NewString
                     else if Iteration = null 
                          then @fnRHTMLT(NewString, NextStartPosition)
                          else if Iteration = MaxIterations
                               then NewString
                               else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations)

        in
            Result,      
    #"RemoveHTML" = Table.TransformColumns(#"Changed Type",{{"Notes", fnRHTMLT, type text}}),

in
    RemoveHTML

 

 

This solution is capable of removing tags like

<p> </p>

but not something like

<div class="..."> </div>

 

 

An example of function call would be welcome. I don't understand what the meaning of "Iteration".

Is it the amount of tags you want to remove?

My solution to clean HTML tags

 

let
   TextFromHtml = (HTML as any) =>
let
    Source = if HTML = null then
        ""
    else
        Text.From(HTML),
    SplitAny = Text.SplitAny(Source,"<>"),
    ListAlternate = List.Alternate(SplitAny,1,1,1),
    ListSelect = List.Select(ListAlternate, each _<>""),
    TextCombine = Text.Combine(ListSelect, "")
in
    TextCombine
in
    TextFromHtml

For beginners: create a blank query an copy paste the code above; rename the query as "TextFromHtml"

 

Example of calling this function on a column called Comment:
= Table.TransformColumns(#"Previous Step",{{"Comment", TextFromHtml, type text}})

 

Reference: https://social.technet.microsoft.com/Forums/office/en-US/b080d122-14f0-43bc-8a86-f50cdf1c32d8/removi...

Just tried your code, but I got the error "The name 'Iteration' wasn't recognized. Make sure it's spelled correctly." Any suggestions?

Probably you didn't copy my code correctly.

 

Otherwise: the code must be preceded by an additional "let", and "in fnRHTMLT" must be added at the end:

 

let
    

    <code block: see above>


in
    fnRHTMLT
Specializing in Power Query Formula Language (M)

OK, thanks, that worked.  But it doesn't strip all HTML.  It cleaned <b>, </b>, <ol>, </ol>, and a few other one- and two-character strings, but it overlooked <br>, </a>, </span>, and long <span...........>, <table...........> and <img........> strings. 

 

I don't see anything in the code that would filter some one- or two-character strings but not others.  How can I adapt the code to find strings of any length?

The code doesn't look at the lengths, in other words, lengths don't matter.

 

The code searches for pairs of tags like <p> and </p>

In general: a start tag without slash and an end tag with slash /.

 

Any tags that don't come in such pairs are not removed.

 

You might consider a solution where anything between "<" and ">" is removed (including these delimiters), but the risk is that such characters are removed that are not part of a HTML tag.

 

You can find an example on Technet.

Specializing in Power Query Formula Language (M)

Hello,

 

Very new to Power BI and trying to use the solution posted above but I'm not sure which bits to replace and exactly what to replace them with? I have pasted the query into the advanced editor and have tried replacing 'string as text' with my column name but it's not working.

 

Can you please help by explaining?

 

Thank you,

 

Max

Just tried your code, but I got the error "The name 'Iteration' wasn't recognized.  Make sure it's spelled correctly."  Any suggestions?

Helpful resources

Announcements
secondImage

Experience what’s next for Power BI

Join us for an in-depth look at the new Power BI features and capabilities at the free Microsoft Business Applications Launch Event.

secondImage

Power BI Women

Join our monthly meetings and learning sessions.

secondImage

Congratulations!

We are excited to announce the Power BI Super Users!

secondImage

The largest Power BI virtual conference

100+ sessions, 100+ speakers, Product managers, MVPs, and experts. All about Power BI. Attend online or watch the recordings.

Top Solution Authors
Top Kudoed Authors