cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
unicorn Frequent Visitor
Frequent Visitor

Robust function to remove HTML tags

Hi, I have a column of HTML data that I need to display as text. I need to get rid of all HTML tags and substitute reserved HTML characters. I was trying to recursively remove all HTML tags first. Then I plan to create a table of the most common reserved characters, the values to replace them with, and generate a list to call replace for each row in that table.

 

I am still new to M but I came across this to replace HTML tags. Except it only removes a single tag, which is not helpful.

https://social.technet.microsoft.com/Forums/en-US/7ec64d6d-c3fc-4110-94c7-2e0087171475/how-to-remove...)

 

But when I try to expand that to a recursive function I'm getting a stack overflow error. I plan to start again to try to learn M, but hoping somebody can help me in the meantime with a quick fix to figure out what I did wrong. Below is my function. Thanks.

 

let
    removeOne = (input) =>
        let
            text = Text.From(input),
            length = Text.Length(text),
            position = Text.PositionOf(text, "<"),
            positionEnd = Text.PositionOf(text, ">"),
            range = positionEnd-position+1,
            result = if position >= 0 then Text.ReplaceRange(text, position, range, "") else input
        in
            result,

    removeAll = (input) =>
        let
            rmvOne = removeOne(input),
            rmvAll = if Text.PositionOf(rmvOne, "l") >= 0 then @removeAll(rmvOne) else rmvOne
        in
            rmvAll,

 

// Source = get from database,
    Source = "<p>hello I am text to replace, lol</p>",

// return = removeAll([columnToPassToFunction])
    return = removeAll(Source)
        in
            return

1 ACCEPTED SOLUTION

Accepted Solutions
Super User
Super User

Re: Robust function to remove HTML tags

Just change the "|" into "<"

 

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

 

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

 

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)
13 REPLIES 13
Super User
Super User

Re: Robust function to remove HTML tags

Just change the "|" into "<"

 

rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne

 

Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.

 

If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>

Specializing in Power Query Formula Language (M)
unicorn Frequent Visitor
Frequent Visitor

Re: Robust function to remove HTML tags

Shoot, sorry thanks. I was doing a lot of testing and started with a recursive function for removing 'L'. Will give the full code a shot now. Also, good point about tag pairs. If you don't mind, would love to look through your code to improve this.

Super User
Super User

Re: Robust function to remove HTML tags

Here is my code. It should be called with StartPosition 0. On each iteration, the code examines the string from that startposition for code tag pairs, removes them if found and call the next iteration with a new StartPosition, so the code is also able to handle the situation that the string includes an end tag without corresponding start tag, or </ without corresponding >.

 

Notice that I also included a mechanism to stop after a maximum number of iterations, which is especiallly useful during development, to prevent endless iterations (until stack overflow).

 

 

        fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text =>
        let
            StringFromStartposition = Text.RemoveRange(String, 0, StartPosition),
            StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"),
            PositionsEndTag = if StartPositionEndTag = -1 
                                then -1 
                                else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"),
            StartTag = if PositionsEndTag = -1 
                       then null 
                       else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1),
            StartPositionStartTag = if PositionsEndTag = -1 
                                    then -1 
                                    else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last),
            NewString = if StartPositionStartTag = -1 
                        then String 
                        else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag),
            NextStartPosition = if PositionsEndTag = -1
                                then -1
                                else if StartPositionStartTag = -1
                                     then StartPosition + StartPositionEndTag + 1
                                     else StartPosition + StartPositionEndTag - PositionsEndTag,
            Result = if NextStartPosition = -1
                     then NewString
                     else if Iteration = null 
                          then @fnRHTMLT(NewString, NextStartPosition)
                          else if Iteration = MaxIterations
                               then NewString
                               else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations)

        in
            Result

 

Specializing in Power Query Formula Language (M)
omtinole Frequent Visitor
Frequent Visitor

Re: Robust function to remove HTML tags

Just tried your code, but I got the error "The name 'Iteration' wasn't recognized.  Make sure it's spelled correctly."  Any suggestions?

omtinole Frequent Visitor
Frequent Visitor

Re: Robust function to remove HTML tags

Just tried your code, but I got the error "The name 'Iteration' wasn't recognized. Make sure it's spelled correctly." Any suggestions?

Super User
Super User

Re: Robust function to remove HTML tags

Probably you didn't copy my code correctly.

 

Otherwise: the code must be preceded by an additional "let", and "in fnRHTMLT" must be added at the end:

 

let
    

    <code block: see above>


in
    fnRHTMLT
Specializing in Power Query Formula Language (M)
omtinole Frequent Visitor
Frequent Visitor

Re: Robust function to remove HTML tags

OK, thanks, that worked.  But it doesn't strip all HTML.  It cleaned <b>, </b>, <ol>, </ol>, and a few other one- and two-character strings, but it overlooked <br>, </a>, </span>, and long <span...........>, <table...........> and <img........> strings. 

 

I don't see anything in the code that would filter some one- or two-character strings but not others.  How can I adapt the code to find strings of any length?

Super User
Super User

Re: Robust function to remove HTML tags

The code doesn't look at the lengths, in other words, lengths don't matter.

 

The code searches for pairs of tags like <p> and </p>

In general: a start tag without slash and an end tag with slash /.

 

Any tags that don't come in such pairs are not removed.

 

You might consider a solution where anything between "<" and ">" is removed (including these delimiters), but the risk is that such characters are removed that are not part of a HTML tag.

 

You can find an example on Technet.

Specializing in Power Query Formula Language (M)
Max01 Regular Visitor
Regular Visitor

Re: Robust function to remove HTML tags

Hello,

 

Very new to Power BI and trying to use the solution posted above but I'm not sure which bits to replace and exactly what to replace them with? I have pasted the query into the advanced editor and have tried replacing 'string as text' with my column name but it's not working.

 

Can you please help by explaining?

 

Thank you,

 

Max