Helper V

Web scraping with Power BI: getting the image URL from data-original or data-srcset instead of img src

Hello everyone,

 

(I'm trying to write this post once again.)

I want to extract an image from a website. However, the img src attribute does not contain a usable image URL, so I have to use something else. I was thinking of retrieving the value of data-original, but I have no clue how to do it. Maybe data-srcset is another option, or perhaps there is a way to extract the entire HTML content of the img tag. Do you have a solution?

 

<div class="article-image-container">
<div class="content">
<img
src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
data-original="https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square210.jpg"
data-srcset="https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square420.jpg 2x"
alt="" class="js-lazy">
</div>
</div>

 Thanks in advance

Resident Rockstar

Hi @raymond 

If you open the web page as text, you will see all the HTML, 1 line per row, and can then filter for what you want.

In this example I've filtered for the line containing data-original.

You can then extract the image URL using Transform -> Extract -> Text Between Delimiters.

You can use this query to test on a file I've placed on Amazon S3 that contains the HTML you posted.

 

let
    // Load the page as text, one HTML line per row
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://d2cgdza3nuf1jv.cloudfront.net/img.htm"))}),
    // Keep only the lines that contain the data-original attribute
    #"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], "data-original=")),
    // Take the text between the first pair of double quotes on each remaining line
    #"Extracted Text Between Delimiters" = Table.TransformColumns(#"Filtered Rows", {{"Column1", each Text.BetweenDelimiters(_, """", """"), type text}})
in
    #"Extracted Text Between Delimiters"
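
If the attributes all sit on one line, taking the text between the first pair of quotes would return the src value rather than the image URL. A minimal sketch of one way around that, assuming the attribute name itself is used as the start delimiter; the sample line and the placeholder.gif value are made up for illustration, and only the two image URLs come from the question:

```powerquery
let
    // Hypothetical sample line with every attribute on the same line
    Line = "<img src=""placeholder.gif"" data-original=""https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square210.jpg"" data-srcset=""https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square420.jpg 2x"">",
    // Start at the attribute name and stop at the closing quote of its value
    Original = Text.BetweenDelimiters(Line, "data-original=""", """"),
    SrcSet = Text.BetweenDelimiters(Line, "data-srcset=""", """"),
    // data-srcset carries a density descriptor (" 2x") after the URL; keep only the URL
    SrcSetUrl = Text.BeforeDelimiter(SrcSet, " ")
in
    [Original = Original, SrcSetUrl = SrcSetUrl]
```

The same two delimiters could be passed to Text.BetweenDelimiters inside Table.TransformColumns in the query above.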

 

 Phil



Helper V

@Admin: I can't read my post. Did it get translated to Spanish?

I'm going nuts here, everything is in Spanish 😄

@PhilipTreacy: thanks for your support. With that approach, however, I cannot match an image to any other piece of information. I can download all the images now, but since I don't want just the images but also the names, prices and so forth, I would need a reference that ties them together.

The actual markup looks more like this and, to be frank, is even more complicated. From it I would like to extract the name, link, image and price.

<div class="article-item-container">
      <a href="/rolex/rolex-datejust-turn-o-graph--id16906358.htm">

<div class="article-image-container">
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-original="https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square210.jpg" data-srcset="https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square420.jpg 2x">
</div>

<div class="article-title">
        Rolex Datejust Turn-O-Graph
</div>

<div class="article-price-container">
         9.500
</div>

</div>
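
One possible way to keep the name, link, image and price together is Html.Table with a row selector, so that each article-item-container becomes one row. This is only a sketch under some assumptions: it needs a Power BI Desktop version that includes Html.Table, the sample markup is a shortened copy of the snippet above (placeholder.gif stands in for the base64 src, and a closing </a> has been added), and the CSS selectors may need adjusting for the real page.

```powerquery
let
    // Sample markup based on the snippet above (src shortened, </a> added).
    // For a live page, something like Text.FromBinary(Web.Contents(url)) would take its place.
    SampleHtml = Text.Combine({
        "<div class=""article-item-container"">",
        "<a href=""/rolex/rolex-datejust-turn-o-graph--id16906358.htm"">",
        "<div class=""article-image-container"">",
        "<img src=""placeholder.gif"" data-original=""https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square210.jpg"">",
        "</div>",
        "<div class=""article-title""> Rolex Datejust Turn-O-Graph </div>",
        "<div class=""article-price-container""> 9.500 </div>",
        "</a>",
        "</div>"
    }),
    // One output row per item container; each column selector is evaluated within that row
    Items = Html.Table(
        SampleHtml,
        {
            {"Name", ".article-title"},
            {"Link", "a", each [Attributes][href]},
            {"Image", "img", each [Attributes][#"data-original"]},
            {"Price", ".article-price-container"}
        },
        [RowSelector = ".article-item-container"]
    ),
    // Strip the whitespace that surrounds the text values in the markup
    Trimmed = Table.TransformColumns(Items, {{"Name", Text.Trim, type text}, {"Price", Text.Trim, type text}})
in
    Trimmed
```

Because every column is read from the same container element, the image URL stays on the same row as the matching name, link and price. The price is left as text here; converting "9.500" to a number depends on the locale of the site.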
