Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
emilmaican
Frequent Visitor

Scrap a web page - trying to get all the data

Hi,

 

I'm trying to get 13 years of financial data for Apple from this page https://www.macrotrends.net/stocks/charts/AAPL/apple/income-statement .

When I'm accessing the web page in browser I'm able to see almost all data:

emilmaican_0-1653747508525.png

 

But when I'm trying to scrap de data using PowerBI I only get the first 6 years of data.

Although in browser I see details from 2021 back to 2009, in PowerBI I see only 2021 to 2016

emilmaican_1-1653747674810.png

 

 

 

Is there a solution to be able to take at least 10 years of data?

1 ACCEPTED SOLUTION
AlexisOlson
Super User
Super User

If you look in the HTML code, there's a section that looks like

var originalData = [{"field_name": ...bunch of data... }];

 

This contains all of the data in the table and is in a JSON format, so we can write the full query along these lines:

let
    #"HTML Code" = Web.BrowserContents("https://www.macrotrends.net/stocks/charts/AAPL/apple/income-statement"),
    #"Split Text" = "[" & Text.BetweenDelimiters(List.Single(List.Select(Text.Split(#"HTML Code", "var "), each Text.StartsWith(_, "originalData = "))),"[","]") & "]",
    #"Parsed JSON" = Json.Document(#"Split Text"),
    #"Converted to Table" = Table.FromList(#"Parsed JSON", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"field_name", "2021-09-30", "2020-09-30", "2019-09-30", "2018-09-30", "2017-09-30", "2016-09-30", "2015-09-30", "2014-09-30", "2013-09-30", "2012-09-30", "2011-09-30", "2010-09-30", "2009-09-30"}, {"field_name", "2021-09-30", "2020-09-30", "2019-09-30", "2018-09-30", "2017-09-30", "2016-09-30", "2015-09-30", "2014-09-30", "2013-09-30", "2012-09-30", "2011-09-30", "2010-09-30", "2009-09-30"}),
    #"Extracted Text Between Delimiters" = Table.TransformColumns(#"Expanded Column1", {{"field_name", each Text.BetweenDelimiters(_, ">", "<"), type text}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Extracted Text Between Delimiters", {"field_name"}, "Date", "Value"),
    #"Changed Type" = Table.TransformColumnTypes(#"Unpivoted Other Columns",{{"Date", type date}, {"Value", Currency.Type}})
in
    #"Changed Type"

This results in a nice clean unpivoted tabular format.

 

AlexisOlson_0-1653770651379.png

View solution in original post

2 REPLIES 2
AlexisOlson
Super User
Super User

If you look in the HTML code, there's a section that looks like

var originalData = [{"field_name": ...bunch of data... }];

 

This contains all of the data in the table and is in a JSON format, so we can write the full query along these lines:

let
    #"HTML Code" = Web.BrowserContents("https://www.macrotrends.net/stocks/charts/AAPL/apple/income-statement"),
    #"Split Text" = "[" & Text.BetweenDelimiters(List.Single(List.Select(Text.Split(#"HTML Code", "var "), each Text.StartsWith(_, "originalData = "))),"[","]") & "]",
    #"Parsed JSON" = Json.Document(#"Split Text"),
    #"Converted to Table" = Table.FromList(#"Parsed JSON", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"field_name", "2021-09-30", "2020-09-30", "2019-09-30", "2018-09-30", "2017-09-30", "2016-09-30", "2015-09-30", "2014-09-30", "2013-09-30", "2012-09-30", "2011-09-30", "2010-09-30", "2009-09-30"}, {"field_name", "2021-09-30", "2020-09-30", "2019-09-30", "2018-09-30", "2017-09-30", "2016-09-30", "2015-09-30", "2014-09-30", "2013-09-30", "2012-09-30", "2011-09-30", "2010-09-30", "2009-09-30"}),
    #"Extracted Text Between Delimiters" = Table.TransformColumns(#"Expanded Column1", {{"field_name", each Text.BetweenDelimiters(_, ">", "<"), type text}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Extracted Text Between Delimiters", {"field_name"}, "Date", "Value"),
    #"Changed Type" = Table.TransformColumnTypes(#"Unpivoted Other Columns",{{"Date", type date}, {"Value", Currency.Type}})
in
    #"Changed Type"

This results in a nice clean unpivoted tabular format.

 

AlexisOlson_0-1653770651379.png

Vijay_A_Verma
Super User
Super User

When you need to scrap data from tables which are constructed from html pages using <div> tags, then situation becomes challenging. Html.Table function is used by Power BI to extract data from children and it can go to a depth of 7 children only. That is the limit. Hence, only 7 columns are shown and you will not get columns beyond first 7 columns. 

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors