Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
ianbruckner
Frequent Visitor

Remove text between html tags

I'm looking for help to remove html tags from a string.

 

Example input: <div class="ExternalClass742C332E0D0340C598BC9A78413A04DE">Staff going to storage training</div>

Desired output: Staff going to storage training

 

I found @MarcelBeug 's post in  Robust-function-to-remove-HTML-tags very helpful, but can't seem to nail the magic combo of also including other tags in the <> brackets.... like class=somethingIdon'tcaretosee

1 ACCEPTED SOLUTION
jsh121988
Employee
Employee

You could try using PowerQuery Text Between function and use '>' and '</' as the delimiters.

 

You can also do this in DAX using PATHITEM and SUBSTITUTE.

InnerHTML = 
PATHITEM( // Splits the string using delimiter "|", and takes the 2nd item that is a type of Text
SUBSTITUTE( // Output <div class="Whatever"|Staff going to storage training|div>
SUBSTITUTE([Html], ">", "|"), // Output <div class="Whatever"|Staff going to storage training</div>
"</","|"
),
2,
TEXT
)

View solution in original post

7 REPLIES 7
JoelDoesPBI
New Member

I know I am a little late to the game but after trying extract between '>' and '</' and having it not work for me I broke it down into 2 steps.
1. Extract before '</'

2. Extract after '<' (using the Scan for the delimiter from the end of the input' in advanced options)

This worked for me as my source text has an inconsistant number of HTML Tags but always more than 3. This meant Extract between was giving weird results.  

    #"Extracted Text Before Delimiter" = Table.TransformColumns(#"Changed Type", {{"Subject Request", each Text.BeforeDelimiter(_, "</"), type text}}),
    #"Extracted Text After Delimiter" = Table.TransformColumns(#"Extracted Text Before Delimiter", {{"Subject Request", each Text.AfterDelimiter(_, ">", {0, RelativePosition.FromEnd}), type text}})

 

Pascal_KTeam
Resolver I
Resolver I

Removing HTML tasks can be a daunting task, I had to do this as well and it always needed some maintenance as new cases were coming in. As an alternative, why don't you just use a custom visual that can actually deal with HTML?

Take a look at the HTML Text Styler visual which you can get from the AppSource: https://appsource.microsoft.com/en-us/product/power-bi-visuals/wa200002071?tab=overview

v-frfei-msft
Community Support
Community Support

Hi @ianbruckner ,

 

We can create a calculated column using DAX as well.

Column = 
VAR len =
    SEARCH ( ">", Table1[Column1],, BLANK () )
VAR len2 =
    SEARCH ( "</div>", Table1[Column1],, BLANK () )
RETURN
    MID ( Table1[Column1], len + 1, len2 - len - 1 )

Capture.PNG

 

Regards,

Frank

 

Community Support Team _ Frank
If this post helps, then please consider Accept it as the solution to help the others find it more quickly.

Hello. I tried your query it worked partially and i still have below html tags in the content which needs to be removed. 

 

<div style="font-family&#58;Calibri, Arial, Helvetica, sans-serif;font-size&#58;11pt;color&#58;rgb(0, 0, 0);">communication started with recruitment team section head in order to assign contact person,<br>i asked the BRM to follow up on that as&#160;<span style="font-family&#58;Calibri, Arial, Helvetica, sans-serif;background-color&#58;rgb(255, 255, 255);display&#58;inline !important;">recruitment team section head didn't reply to the email yet</span>

 

The highlighted in green should be the outcome. Kindly help,

jsh121988
Employee
Employee

You could try using PowerQuery Text Between function and use '>' and '</' as the delimiters.

 

You can also do this in DAX using PATHITEM and SUBSTITUTE.

InnerHTML = 
PATHITEM( // Splits the string using delimiter "|", and takes the 2nd item that is a type of Text
SUBSTITUTE( // Output <div class="Whatever"|Staff going to storage training|div>
SUBSTITUTE([Html], ">", "|"), // Output <div class="Whatever"|Staff going to storage training</div>
"</","|"
),
2,
TEXT
)

I ended up using Text.BetweenDelimiters, and it certainly stripped out the first tag. Then I found enough other tags hidden inside the input that 1, makes it impractical to state them all, and then 2, scared enough I'd lose data if I went to the lowest common denominator of > < as the delimiters alone... so I gave up. Instead, I'll probably use the HTML visualization that's filtered by a user selecting the row that contains the rest of the data. I really wish I could embed that in the table - oh well for now.

 

Thanks for the pointers!

Can you give a sample input and output where multiple tags are present? Maybe use 3 layers of nested tags if you have that example?

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.