Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
Anonymous
Not applicable

Query freezes while web scraping

Hi Guys,

 

I am trying to retrive a hierarchy beween P/N from a Website by scraping data each individual P/N html page.

 

I have a list of all the P/N that are active with their name and their unique #id on an Excel spreadsheet like the table below :

P/N - Nameid
Cylinder head12
Rod (bearing)35
liner type B27

 

The html pages of each P/N are build like this : https://<my_domain>/<random_name>/PN=id 

and on each page I can easily retrive the P/N that are directly linked to this P/N (see function Get_hierarchy)

On the table below, I would have scrapped the P/N Actuator (in green) which would then have given me the 3 sub components (in blue).

levelP/N - Name
0Actuator
1Cylinder
1Rod
1Sealing system

 

I would then need to retrive the ID of each sub component to issue the same request for each of those P/N and get the list of P/N that are directly linked to each each sub component. The end table will look a bit like this :

levelP/N - Nameid
0Actuator3
1Cylinder46
2Cylinder body78
2Cylinder head12
1Rod33
2Rod (bar)35
2Rod (bearing)40
1Sealing system112

 

Everything is working fine with the code below except when I have more that 8 row to scrap on the same level where the Query freezes and I can't get a decent result. Issue being, I have at least 100.000 P/N to scrap on 15 hierarchy levels so I can't do it manualy.

 

Code on the Hierachy_full table (for the 1st 2 levels but the remaining lignes are similar) : 

 

let
Source = <Top P/N with its ID>
#"Fonction personnalisée appelée" = Table.AddColumn(Source, "N-1", each Get_hierarchy([N.ID])),
#"N-1 développé" = Table.ExpandTableColumn(#"Fonction personnalisée appelée", "N-1", {"P/N"}, {"N-1.P/N"}),
#"Requêtes fusionnées" = Table.NestedJoin(#"N-1 développé", {"N-1.P/N"}, #"P/N actifs", {"P/N"}, "P/N actifs", JoinKind.LeftOuter),
#"P/N actifs développé" = Table.ExpandTableColumn(#"Requêtes fusionnées", "P/N actifs", {"P/N"}, {"N-1.ID"}),
#"Fonction personnalisée appelée1" = Table.AddColumn(#"P/N actifs développé", "N-2", each #"Get_hierarchy"([#"N-1.ID"])),
#"N-2 développé" = Table.ExpandTableColumn(#"Fonction personnalisée appelée1", "N-2", {"P/N"}, {"N-2.P/N"}),
in
#"N-2 développé"

 

And the code of the function Get_hierarchy :

 

let
Source = (#"ID" as any) => let
Source = Web.Page(Web.Contents("https://<my_domain>/<random_name>/PN= " & #"id")),
Data = Source{2}[Data],
#"Type modifié" = Table.TransformColumnTypes(Data,{{"Column1", type text}, {"Column2", type text}}),
#"Diviser la colonne selon les transitions de caractères" = Table.SplitColumn(#"Type modifié", "Column2", Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}), {"Column2.1", "Column2.2"}),
#"Colonnes renommées" = Table.RenameColumns(#"Diviser la colonne selon les transitions de caractères",{{"Column2.1", "P/N"}}),
#"Autres colonnes supprimées" = Table.SelectColumns(#"Colonnes renommées",{"P/N"})
in
#"Autres colonnes supprimées"
in
Source

 

Does any of you have an idea of what I did wrong or where does the problem could came from ?

And how to fix it of course.

 

Many thanks,

Romuald.

1 REPLY 1
lbendlin
Super User
Super User

Is it possible that the web server is blocking your scraping/slowing you down because you cause too much traffic for them? Maybe you should ask them if they would be willing to give you a data extract.

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors