Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!
Hi Guys,
I am trying to retrive a hierarchy beween P/N from a Website by scraping data each individual P/N html page.
I have a list of all the P/N that are active with their name and their unique #id on an Excel spreadsheet like the table below :
P/N - Name | id |
Cylinder head | 12 |
Rod (bearing) | 35 |
liner type B | 27 |
The html pages of each P/N are build like this : https://<my_domain>/<random_name>/PN=id
and on each page I can easily retrive the P/N that are directly linked to this P/N (see function Get_hierarchy)
On the table below, I would have scrapped the P/N Actuator (in green) which would then have given me the 3 sub components (in blue).
level | P/N - Name |
0 | Actuator |
1 | Cylinder |
1 | Rod |
1 | Sealing system |
I would then need to retrive the ID of each sub component to issue the same request for each of those P/N and get the list of P/N that are directly linked to each each sub component. The end table will look a bit like this :
level | P/N - Name | id |
0 | Actuator | 3 |
1 | Cylinder | 46 |
2 | Cylinder body | 78 |
2 | Cylinder head | 12 |
1 | Rod | 33 |
2 | Rod (bar) | 35 |
2 | Rod (bearing) | 40 |
1 | Sealing system | 112 |
Everything is working fine with the code below except when I have more that 8 row to scrap on the same level where the Query freezes and I can't get a decent result. Issue being, I have at least 100.000 P/N to scrap on 15 hierarchy levels so I can't do it manualy.
Code on the Hierachy_full table (for the 1st 2 levels but the remaining lignes are similar) :
let
Source = <Top P/N with its ID>
#"Fonction personnalisée appelée" = Table.AddColumn(Source, "N-1", each Get_hierarchy([N.ID])),
#"N-1 développé" = Table.ExpandTableColumn(#"Fonction personnalisée appelée", "N-1", {"P/N"}, {"N-1.P/N"}),
#"Requêtes fusionnées" = Table.NestedJoin(#"N-1 développé", {"N-1.P/N"}, #"P/N actifs", {"P/N"}, "P/N actifs", JoinKind.LeftOuter),
#"P/N actifs développé" = Table.ExpandTableColumn(#"Requêtes fusionnées", "P/N actifs", {"P/N"}, {"N-1.ID"}),
#"Fonction personnalisée appelée1" = Table.AddColumn(#"P/N actifs développé", "N-2", each #"Get_hierarchy"([#"N-1.ID"])),
#"N-2 développé" = Table.ExpandTableColumn(#"Fonction personnalisée appelée1", "N-2", {"P/N"}, {"N-2.P/N"}),
in
#"N-2 développé"
And the code of the function Get_hierarchy :
let
Source = (#"ID" as any) => let
Source = Web.Page(Web.Contents("https://<my_domain>/<random_name>/PN= " & #"id")),
Data = Source{2}[Data],
#"Type modifié" = Table.TransformColumnTypes(Data,{{"Column1", type text}, {"Column2", type text}}),
#"Diviser la colonne selon les transitions de caractères" = Table.SplitColumn(#"Type modifié", "Column2", Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}), {"Column2.1", "Column2.2"}),
#"Colonnes renommées" = Table.RenameColumns(#"Diviser la colonne selon les transitions de caractères",{{"Column2.1", "P/N"}}),
#"Autres colonnes supprimées" = Table.SelectColumns(#"Colonnes renommées",{"P/N"})
in
#"Autres colonnes supprimées"
in
Source
Does any of you have an idea of what I did wrong or where does the problem could came from ?
And how to fix it of course.
Many thanks,
Romuald.
Is it possible that the web server is blocking your scraping/slowing you down because you cause too much traffic for them? Maybe you should ask them if they would be willing to give you a data extract.