Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
KM007
Helper II
Helper II

Web scraping - website changed

Hi PBi Users,

 

I, on a daily basis, scrape a web site for fund prices but the website has been updated over the weekend and my query (below) and table below, no longer works. Please may you help me out.

 

Much appreciate any help you can provide.

Thank you!

 

the new website is ... https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results?sectorid=121&start...

 

let

// Function for each Sector-Information
WebCall = (SectorID as text) =>
let
Source = Web.Page(Web.Contents("https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results?sectorid="&SectorI..., 0, 0, 60)])),
Data = Source{[ClassName="sortableTable"]}[Data],
ListOfAllPages = if List.Count(List.Skip(Text.Split(Data{0}[Name], "#(lf)")))=0 then {"1"} else List.Skip(Text.Split(Data{0}[Name], "#(lf)")),
#"Converted to Table" = Table.FromList(ListOfAllPages, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
RearragnedURL = Table.AddColumn(#"Converted to Table", "Custom", each Web.Page(Web.Contents("https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results?tab=prices&filters...]),
#"Expanded Custom" = Table.ExpandTableColumn(RearragnedURL, "Custom", {"Name", "Current price Sell (p)", "Current price Change (p)", "More information"}, {"Name", "Sell (p)", "Change (p)", "More information"}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Expanded Custom", "More information", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.Csv), {"More information.1", "More information.2", "More information.3", "More information.4", "More information.5", "More information.6"}),
#"Renamed Columns" = Table.RenameColumns(#"Split Column by Delimiter",{{"More information.2", "Unit type"}, {"More information.4", "Sector"}, {"More information.6", "Valuation date"}}),
#"Removed Columns" = Table.RemoveColumns(#"Renamed Columns",{"More information.1", "More information.3", "More information.5", "Column1"}),
#"Added Custom" = Table.AddColumn(#"Removed Columns", "Name + type", each [Name]&" - "&[Unit type]),
#"Reordered Columns" = Table.ReorderColumns(#"Added Custom",{"Name", "Name + type", "Sell (p)", "Change (p)", "Unit type", "Sector", "Valuation date"}),
#"Filtered Rows" = Table.SelectRows(#"Reordered Columns", each not Text.StartsWith([Name], "Page:")),
#"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Name", "Unit type"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Columns1",{{"Name + type", type text}, {"Sell (p)", type number}, {"Change (p)", type number}, {"Sector", type text}, {"Valuation date", type date}})
in
#"Changed Type",


// Start of main query: Dig into Html to retrieve a table with all Sectors
Source = #"SectorIDs - 1",
// Call function and cleanup
CallEachFunction = Table.AddColumn(Source, "WebCall", each WebCall([SectorID])),
FilterOutEmpties = Table.SelectRows(CallEachFunction, each [WebCall] <> null),
#"Expanded WebCall" = Table.ExpandTableColumn(FilterOutEmpties, "WebCall", {"Name + type", "Sell (p)", "Change (p)", "Sector", "Valuation date"}, {"Name + type", "Sell (p)", "Change (p)", "Sector", "Valuation date"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded WebCall",{"SectorID", "SectorName"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Columns",{{"Name + type", type text}, {"Sell (p)", type number}, {"Change (p)", type number}, {"Sector", type text}, {"Valuation date", type text}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each (not Number.IsNaN([#"Sell (p)"]))),
#"Changed Type1" = Table.TransformColumnTypes(#"Filtered Rows",{{"Change (p)", Currency.Type}})
in
#"Changed Type1"

 

Table "SectorIDs - 1" is below

 

 

121Asia Pacific Ex Japan
114Asia Pacific Inc Japan
155China/ Greater China
111Europe Excluding UK
102Europe Including UK
117European Smaller Companies
115Flexible Investment
150GBP Corporate Bond
151GBP High Yield
152GBP Strategic Bond
109Global
106Global Bonds
112Global Emerging Markets
156Global Emerging Markets Bond
139Global Equity Income
113Japan
132Japanese Smaller Companies
133Latin America
137Mixed Investment 0-35% Shares
131Mixed Investment 20-60% Shares
107Mixed Investment 40-85% Shares
123Money Market
999N/A
110North America
130North American Smaller Cos
71Offshore
400Pension Trusts
108Property
138Short Term Money Market
129Specialist
136Targeted Absolute Return
134Technology & Telecoms
101UK All Companies
104UK Equity & Bond Income
103UK Equity Income
127UK Gilt
135UK Index Linked Gilt
118UK Smaller Companies
48Unclassified
157Volatility Managed
3 REPLIES 3
HotChilli
Super User
Super User

I think it might be an anti-spam thing on this forum but they shorten any URLs in posted stuff so we don't know what the 2 lines with web links are. 

I had a brief try with the Query.  The table that gets returned with fund names is results__table not sortableTable.

The second web call (RearragnedURL ) presumably constructs each url with positions in the list e.g. You can see that the &start= parameter controls which fund is displayed on the page

https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results?sectorid=121&start=0&rpp=20&lo=0%2C1&sort=fd.full_description&sort_dir=asc

https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results?sectorid=121&start=20&rpp=20&lo=0%2C1&sort=fd.full_description&sort_dir=asc

 

Overall, I think the work involved is too complex for a forum issue.  It needs time and effort.

I appreciate you having a go @HotChilli 

 

Sadly, yes it is quite complex. I have spent most of the day manually scraping the website .... eeeek! 

 

Yes, the table name has changed, apologies for not stating that in my original post. There are 39 "sectorid" and each sectorid, could possibly, have more than 1 page and with the change in the website you can't put page numbers in the URL. That's just another headache I have come across.

 

Anyway I appreciate you taking the time to look. I'll keep scratching my head and hopefully reach a solution

Thanks again

Another headache I faced was not able to navigate to the tab "Prices & Yields". This doesn't appear in the URL so I don't know how to access this tab via the query.

 

Many thanks

Helpful resources

Announcements
RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

MayPowerBICarousel

Fabric Monthly Update - May 2024

Check out the May 2024 Fabric update to learn about new features.

LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.