Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Anonymous
Not applicable

extracting data from a websoure

Hi,

 

I was wondering if people could have other ideas on how to improve extracting data from this website (public) https://www.ukri.org/opportunity/ 

I have used the PQ option "Add Table Using Examples" for Page 1 or specific links to calls, but after creating several functions and made this working on Power BI Service (for refreshing every day) I found out that this "Add Table Using Examples" didn't work as expected as I get values mixed between columns. I don't know Python, but maybe there are other tools I could use better than PQ.

 

Here the link to the file https://drive.google.com/file/d/1DtXManvMeNLvB1s19JxGIKSgV_AYO2xT/view?usp=sharing

In the properties of each query you have some instructions.

 

Best.

 

1 ACCEPTED SOLUTION
lbendlin
Super User
Super User

Now with proper pagination.

 

let
    Source = Web.BrowserContents("https://www.ukri.org/opportunity"),
    Results = Html.Table(Source, {{"Result", ".opportunities-results"}}),
    #"Added Custom" = Table.AddColumn(Results, "Records", each Int16.From(Text.Start([Result],Text.PositionOf([Result]," ")))),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"Records", Int64.Type}}),
    URLs = {1..Int16.From(#"Changed Type"[Records]{0}/10+0.5)},
    #"Converted to Table" = Table.FromList(URLs, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Changed Type1" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type1", "Contents", each FetchPage([Column1])),
    #"Expanded Contents" = Table.ExpandTableColumn(#"Invoked Custom Function", "Contents", {"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Funding type", "Maximum award", "Publication date", "Opening date", "Closing date", "Co-funders", "Award range", "Total fund"}, {"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Funding type", "Maximum award", "Publication date", "Opening date", "Closing date", "Co-funders", "Award range", "Total fund"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Contents",{{"Column1", "Page"}})
in
    #"Renamed Columns"

 

You also need the (now outsourced) parsing code as a function called FetchPage

 

(page) => let
        Source = Web.BrowserContents("https://www.ukri.org/opportunity/page/" & page),
       
        GetPage = Html.Table(Source, {{"Opportunity Title", ".ukri-funding-opp__link"}
    ,{"Opportunity Link", ".ukri-funding-opp__link",each [Attributes][href]}
    , {"Details", ".entry-content"}
    , {"Opportunity Status", ".govuk-table__row:nth-child(1) .govuk-table__cell"}
    , {"Funders", ".govuk-table__row:nth-child(2) .govuk-table__cell"}
    , {"Funders Link", ".govuk-table__row:nth-child(2) .govuk-table__cell a",each [Attributes][href]}
    , {"A3", ".govuk-table__row:nth-child(3) .govuk-table__header"}
    , {"V3", ".govuk-table__row:nth-child(3) .govuk-table__cell"}
    , {"A4", ".govuk-table__row:nth-child(4) .govuk-table__header"}
    , {"V4", ".govuk-table__row:nth-child(4) .govuk-table__cell"}
    , {"A5", ".govuk-table__row:nth-child(5) .govuk-table__header"}
    , {"V5", ".govuk-table__row:nth-child(5) .govuk-table__cell"}
    , {"A6", ".govuk-table__row:nth-child(6) .govuk-table__header"}
    , {"V6", ".govuk-table__row:nth-child(6) .govuk-table__cell"}
    , {"A7", ".govuk-table__row:nth-child(7) .govuk-table__header"}
    , {"V7", ".govuk-table__row:nth-child(7) .govuk-table__cell"}
    , {"A8", ".govuk-table__row:nth-child(8) .govuk-table__header"}
    , {"V8", ".govuk-table__row:nth-child(8) .govuk-table__cell"}
    , {"A9", ".govuk-table__row:nth-child(9) .govuk-table__header"}
    , {"V9", ".govuk-table__row:nth-child(9) .govuk-table__cell"}
    }, [RowSelector=".opportunity"]),
        #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(GetPage, {"Funders Link", "Funders", "Opportunity Status", "Details", "Opportunity Link", "Opportunity Title"}, "Attribute", "Value"),
        #"Added Index" = Table.AddIndexColumn(#"Unpivoted Other Columns", "Index", 0, 1, Int64.Type),
        #"Replaced Value" = Table.ReplaceValue(#"Added Index",each [Attribute],each if Text.Start([Attribute],1) = "A" then "" else Text.Remove(#"Added Index"{[Index]-1}[Value],":"),Replacer.ReplaceText,{"Attribute"}),
        #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each ([Attribute] <> "")),
        #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Attribute", "Value"}),
        #"Pivoted Column" = Table.Pivot(#"Removed Other Columns", List.Distinct(#"Removed Other Columns"[Attribute]), "Attribute", "Value")
    in
        #"Pivoted Column"

 

Note:  This code will fetch page 1 twice but it doesn't really matter as that data is cached in the web browser engine.

 

View solution in original post

6 REPLIES 6
lbendlin
Super User
Super User

Now with proper pagination.

 

let
    Source = Web.BrowserContents("https://www.ukri.org/opportunity"),
    Results = Html.Table(Source, {{"Result", ".opportunities-results"}}),
    #"Added Custom" = Table.AddColumn(Results, "Records", each Int16.From(Text.Start([Result],Text.PositionOf([Result]," ")))),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"Records", Int64.Type}}),
    URLs = {1..Int16.From(#"Changed Type"[Records]{0}/10+0.5)},
    #"Converted to Table" = Table.FromList(URLs, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Changed Type1" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Changed Type1", "Contents", each FetchPage([Column1])),
    #"Expanded Contents" = Table.ExpandTableColumn(#"Invoked Custom Function", "Contents", {"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Funding type", "Maximum award", "Publication date", "Opening date", "Closing date", "Co-funders", "Award range", "Total fund"}, {"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Funding type", "Maximum award", "Publication date", "Opening date", "Closing date", "Co-funders", "Award range", "Total fund"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Contents",{{"Column1", "Page"}})
in
    #"Renamed Columns"

 

You also need the (now outsourced) parsing code as a function called FetchPage

 

(page) => let
        Source = Web.BrowserContents("https://www.ukri.org/opportunity/page/" & page),
       
        GetPage = Html.Table(Source, {{"Opportunity Title", ".ukri-funding-opp__link"}
    ,{"Opportunity Link", ".ukri-funding-opp__link",each [Attributes][href]}
    , {"Details", ".entry-content"}
    , {"Opportunity Status", ".govuk-table__row:nth-child(1) .govuk-table__cell"}
    , {"Funders", ".govuk-table__row:nth-child(2) .govuk-table__cell"}
    , {"Funders Link", ".govuk-table__row:nth-child(2) .govuk-table__cell a",each [Attributes][href]}
    , {"A3", ".govuk-table__row:nth-child(3) .govuk-table__header"}
    , {"V3", ".govuk-table__row:nth-child(3) .govuk-table__cell"}
    , {"A4", ".govuk-table__row:nth-child(4) .govuk-table__header"}
    , {"V4", ".govuk-table__row:nth-child(4) .govuk-table__cell"}
    , {"A5", ".govuk-table__row:nth-child(5) .govuk-table__header"}
    , {"V5", ".govuk-table__row:nth-child(5) .govuk-table__cell"}
    , {"A6", ".govuk-table__row:nth-child(6) .govuk-table__header"}
    , {"V6", ".govuk-table__row:nth-child(6) .govuk-table__cell"}
    , {"A7", ".govuk-table__row:nth-child(7) .govuk-table__header"}
    , {"V7", ".govuk-table__row:nth-child(7) .govuk-table__cell"}
    , {"A8", ".govuk-table__row:nth-child(8) .govuk-table__header"}
    , {"V8", ".govuk-table__row:nth-child(8) .govuk-table__cell"}
    , {"A9", ".govuk-table__row:nth-child(9) .govuk-table__header"}
    , {"V9", ".govuk-table__row:nth-child(9) .govuk-table__cell"}
    }, [RowSelector=".opportunity"]),
        #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(GetPage, {"Funders Link", "Funders", "Opportunity Status", "Details", "Opportunity Link", "Opportunity Title"}, "Attribute", "Value"),
        #"Added Index" = Table.AddIndexColumn(#"Unpivoted Other Columns", "Index", 0, 1, Int64.Type),
        #"Replaced Value" = Table.ReplaceValue(#"Added Index",each [Attribute],each if Text.Start([Attribute],1) = "A" then "" else Text.Remove(#"Added Index"{[Index]-1}[Value],":"),Replacer.ReplaceText,{"Attribute"}),
        #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each ([Attribute] <> "")),
        #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Attribute", "Value"}),
        #"Pivoted Column" = Table.Pivot(#"Removed Other Columns", List.Distinct(#"Removed Other Columns"[Attribute]), "Attribute", "Value")
    in
        #"Pivoted Column"

 

Note:  This code will fetch page 1 twice but it doesn't really matter as that data is cached in the web browser engine.

 

Anonymous
Not applicable

Thanks so much!

Works perfectly! I have to dig into it yet to understand how you did it! Thanks again.

One last question. I need to refresh this daily on PBI Service. In order to refresh I had to use Web.Contents and RelativePath. 

I did change the function to 

(page as number) => let
Source = Web.Contents("https://www.ukri.org/opportunity/page/", [RelativePath=Number.ToText(page)]),

 

And, I change the type of the column Page of the main query to Page, so the function can work.

 

This refreshes fine on PBI desktop, but when on PBI Service asks for a Gateway to be installed. I can't see any code in your two queries pointing to anything local. I've used this Web.Contents + RelativePath with other webpages and it didn't ask for a Gateway.

 

Thanks!

This page holds the magic nuggets CSS Selectors Reference (w3schools.com) I recommend you bookmark it.

Anonymous
Not applicable

Oh! Just found the error. I forgot to change the main query with "Web.Contents".

 

Thanks again!!!!!!

lbendlin
Super User
Super User

@Anonymous this is a really tough web source to scrape, the individual records are of different length and have varying field names . Sometimes "Co-funders" is present, sometimes not. Some have "Maximum award", some have "Award range" etc.

 

What are you eventually expecting to get out of this data?

 

This is what I have so far but it is incorrect.

let
    Source = Web.BrowserContents("https://www.ukri.org/opportunity/"),
    #"Extracted Table From Html" = Html.Table(Source, {{"Opportunity Title", ".ukri-funding-opp__link"}
, {"comment", ".entry-content"}
, {"Opportunity Status", ".govuk-table__row:nth-child(1) .govuk-table__cell"}
, {"Funders", ".govuk-table__row:nth-child(2) .govuk-table__cell"}
, {"Funding Type", ".govuk-table__row:nth-child(3) .govuk-table__cell"}
, {"Maximum ward", ".govuk-table__row:nth-child(4) .govuk-table__cell"}
, {"Publication date", ".govuk-table__row:nth-child(5) .govuk-table__cell"}
, {"Opening date", ".govuk-table__row:nth-child(6) .govuk-table__cell"}
, {"Closing date", ".govuk-table__row:nth-child(7) .govuk-table__cell"}}, [RowSelector=".opportunity"])
in
    #"Extracted Table From Html"

How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".

 

  

Here's a slightly better version. It still only reads the first page, but does so reliably. It also fetches the total number of results to prepare for the pagination.

 

let
    Source = Web.BrowserContents("https://www.ukri.org/opportunity/page/1"),
    Results = Html.Table(Source, {{"Result", ".opportunities-results"}}),
    #"Added Custom" = Table.AddColumn(Results, "Records", each Int16.From(Text.Start([Result],Text.PositionOf([Result]," ")))),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"Records", Int64.Type}}),
    #"First Page" = Html.Table(Source, {{"Opportunity Title", ".ukri-funding-opp__link"}
,{"Opportunity Link", ".ukri-funding-opp__link",each [Attributes][href]}
, {"Details", ".entry-content"}
, {"Opportunity Status", ".govuk-table__row:nth-child(1) .govuk-table__cell"}
, {"Funders", ".govuk-table__row:nth-child(2) .govuk-table__cell"}
, {"Funders Link", ".govuk-table__row:nth-child(2) .govuk-table__cell a",each [Attributes][href]}
, {"A3", ".govuk-table__row:nth-child(3) .govuk-table__header"}
, {"V3", ".govuk-table__row:nth-child(3) .govuk-table__cell"}
, {"A4", ".govuk-table__row:nth-child(4) .govuk-table__header"}
, {"V4", ".govuk-table__row:nth-child(4) .govuk-table__cell"}
, {"A5", ".govuk-table__row:nth-child(5) .govuk-table__header"}
, {"V5", ".govuk-table__row:nth-child(5) .govuk-table__cell"}
, {"A6", ".govuk-table__row:nth-child(6) .govuk-table__header"}
, {"V6", ".govuk-table__row:nth-child(6) .govuk-table__cell"}
, {"A7", ".govuk-table__row:nth-child(7) .govuk-table__header"}
, {"V7", ".govuk-table__row:nth-child(7) .govuk-table__cell"}
, {"A8", ".govuk-table__row:nth-child(8) .govuk-table__header"}
, {"V8", ".govuk-table__row:nth-child(8) .govuk-table__cell"}
, {"A9", ".govuk-table__row:nth-child(9) .govuk-table__header"}
, {"V9", ".govuk-table__row:nth-child(9) .govuk-table__cell"}
}, [RowSelector=".opportunity"]),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"First Page", {"Funders Link", "Funders", "Opportunity Status", "Details", "Opportunity Link", "Opportunity Title"}, "Attribute", "Value"),
    #"Added Index" = Table.AddIndexColumn(#"Unpivoted Other Columns", "Index", 0, 1, Int64.Type),
    #"Replaced Value" = Table.ReplaceValue(#"Added Index",each [Attribute],each if Text.Start([Attribute],1) = "A" then "" else Text.Remove(#"Added Index"{[Index]-1}[Value],":"),Replacer.ReplaceText,{"Attribute"}),
    #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each ([Attribute] <> "")),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Opportunity Title", "Opportunity Link", "Details", "Opportunity Status", "Funders", "Funders Link", "Attribute", "Value"}),
    #"Pivoted Column" = Table.Pivot(#"Removed Other Columns", List.Distinct(#"Removed Other Columns"[Attribute]), "Attribute", "Value")
in
    #"Pivoted Column"

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors