Re: Webscraping with powerbi get img URL without u...

raymond · ‎10-29-2020

Hello everyone,

(I'm trying to write this post once again.)

I want to extract an image from a website. However, the img src attribute does not contain a usable image URL, so I would have to use something else. I was thinking of retrieving the value of data-original, but I have no clue how to do it. Maybe data-srcset is another option, or perhaps there is a way to extract the entire HTML content of the img tag. Do you have a solution?

<div class="article-image-container">
<div class="content">
<img 
   src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" 
   data-original="https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square210.jpg" 
   data-srcset="https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square420.jpg 2x" 
   alt="" class="js-lazy">
</div>
</div>

Thanks in advance

raymond · ‎10-30-2020

@Admin: I can't read my post. Did it get translated to Spanish?

raymond · ‎10-30-2020

I am going nuts here, everything is in Spanish 😄

@PhilipTreacy: thanks for your support. But with that approach I cannot match an image to the other pieces of information. I can download all the images now, but since I want not only the images but also the names, prices, and so forth, I would need a reference that links them.

The actual code looks more like this, and to be frank it is even more complicated. From here I would like to extract the name, link, image, and price.

<div class="article-item-container">
      <a href="/rolex/rolex-datejust-turn-o-graph--id16906358.htm">

<div class="article-image-container">
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-original="https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square210.jpg" data-srcset="https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square420.jpg 2x">
</div>

<div class="article-title">
        Rolex Datejust Turn-O-Graph
</div>

<div class="article-price-container">
         9.500
</div>

</div>
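
One possible way to keep the name, link, image, and price of each item together is to split the page text at every item container and read all four values from the same chunk. The query below is only a sketch and has not been tested against the live site. It runs on an inline copy of the snippet above, and it assumes the real page uses the same class names and that each title and price is closed by the first `</div>` that follows it. `Between` and `ToRecord` are hypothetical helper names introduced for illustration.

```powerquery
let
    // Inline copy of the item markup from the post above, standing in for the
    // downloaded page so the sketch runs without a web request.
    Html = Text.Combine({
        "<div class=""article-item-container"">",
        "<a href=""/rolex/rolex-datejust-turn-o-graph--id16906358.htm"">",
        "<div class=""article-image-container"">",
        "<img src=""data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"" data-original=""https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square210.jpg"" data-srcset=""https://cdn2.chrono24.com/images/uhren/15429755-jawuuoydt9e7qbg2x62c3m8u-Square420.jpg 2x"">",
        "</div>",
        "<div class=""article-title"">",
        "        Rolex Datejust Turn-O-Graph",
        "</div>",
        "<div class=""article-price-container"">",
        "         9.500",
        "</div>",
        "</div>"
    }, "#(lf)"),
    ItemStart = "<div class=""article-item-container"">",
    // Everything before the first container is not an item, so drop that piece
    Chunks = List.Skip(Text.Split(Html, ItemStart), 1),
    Whitespace = {" ", "#(tab)", "#(cr)", "#(lf)"},
    // Hypothetical helper: text between two delimiters, with surrounding whitespace removed
    Between = (input as text, startDelimiter as text, endDelimiter as text) as text =>
        Text.Trim(Text.BetweenDelimiters(input, startDelimiter, endDelimiter), Whitespace),
    // Hypothetical helper: one record per item, so all four values come from the same chunk
    ToRecord = (chunk as text) as record => [
        Name = Between(chunk, "<div class=""article-title"">", "</div>"),
        Link = Between(chunk, "<a href=""", """"),
        Image = Between(chunk, "data-original=""", """"),
        Price = Between(chunk, "<div class=""article-price-container"">", "</div>")
    ],
    Items = Table.FromRecords(List.Transform(Chunks, ToRecord))
in
    Items
```

For a real page, `Html` would presumably be replaced by something like `Text.FromBinary(Web.Contents(...))` with the page address. The price is left as text here because its number format depends on the site's locale.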

PhilipTreacy · ‎10-29-2020

Hi @raymond

If you open the web page as text, you will see all the HTML, 1 line per row, and can then filter for what you want.

In this example I've filtered for the line containing data-original.

You can then extract the image URL using Transform -> Extract -> Text Between Delimiters.

You can use this query to test on a file I've placed on Amazon S3 that contains the HTML you posted.

let
    // Load the page as text, one HTML line per row
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://d2cgdza3nuf1jv.cloudfront.net/img.htm"))}),
    // Keep only the row holding the data-original attribute (exact match on that line of the test file)
    #"Filtered Rows" = Table.SelectRows(Source, each ([Column1] = "   data-original=""https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square210.jpg"" ")),
    // Take the text between the first pair of double quotes, which is the image URL
    #"Extracted Text Between Delimiters" = Table.TransformColumns(#"Filtered Rows", {{"Column1", each Text.BetweenDelimiters(_, """", """"), type text}})
in
    #"Extracted Text Between Delimiters"
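
The filter in that query matches one exact line, so it only finds this particular image. A more general variant, sketched here on the assumption that every image line contains `data-original=`, keeps any row that mentions the attribute and anchors the start delimiter on the attribute name. The inline `HtmlLines` list is a stand-in for the downloaded page, so the sketch runs without a web request.

```powerquery
let
    // Inline copy of the <img> tag from the question, one HTML line per list item
    HtmlLines = {
        "<img ",
        "   src=""data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"" ",
        "   data-original=""https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square210.jpg"" ",
        "   data-srcset=""https://cdn2.chrono24.com/images/uhren/15832324-tbzie1krd5li6y9gt3q55f6r-Square420.jpg 2x"" ",
        "   alt="""" class=""js-lazy"">"
    },
    Source = Table.FromColumns({HtmlLines}),
    // Keep every row that mentions the attribute, whatever URL it holds
    #"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], "data-original=")),
    // Start at the attribute name so other quoted attributes on the same line are ignored
    #"Extracted URL" = Table.TransformColumns(#"Filtered Rows", {{"Column1", each Text.BetweenDelimiters(_, "data-original=""", """"), type text}})
in
    #"Extracted URL"
```

The same pattern should work for `data-srcset` by filtering on `data-srcset=` and using a space as the end delimiter, which stops before the trailing `2x`.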

Phil

If I answered your question please mark my post as the solution.
If my answer helped solve your problem, give it a kudos by clicking on the Thumbs Up.

Webscraping with powerbi get img URL without using img src and using srcset or data-original instead
