Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
I am trying to scarpe some data using MS PowerBI from https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville
The problem is i cannot see tables when i'm trying to connect to webpage.
Can anyone help please.
Solved! Go to Solution.
Well, I'm quite familiar with Power Query, but not so much with getting data from the web.
Anyhow, from this topic I learned that you can import a web page as text, and that helped me to come up with a solution.
Please bear with me, it may well be that the selections, filterings and extractions are not full proof, but in this case it all seems to work.
let // // Import webpage as text Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville"))}), // Get the row(s) containing "<table": these include the href's to the linked pages #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")), // In the next lines, the web addresses are extracted SplittedOnhref = Table.TransformColumns(#"FilteredOn<table", {{"Column1", each Text.Split(_, "href"), type text}}), Expanded = Table.ExpandListColumn(SplittedOnhref, "Column1"), #"FilteredOnNotStart<table" = Table.SelectRows(Expanded, each not Text.StartsWith([Column1], "<table")), ExtractedTextAfterFirstQuotes = Table.TransformColumns(#"FilteredOnNotStart<table", {{"Column1", each Text.Middle(_, 1+Text.PositionOf(_,"""")), type text}}), ExtractedTextUntilFirstQuotes = Table.TransformColumns(ExtractedTextAfterFirstQuotes, {{"Column1", each Text.Start(_, Text.PositionOf(_,"""")), type text}}), // Now get the table from each WebPage GetDetailsFromWebPageTable = Table.AddColumn(ExtractedTextUntilFirstQuotes, "Details", each Web.Page(Web.Contents("https://activecaptain.com/quickLists/" &[Column1]))[Data]{0}), // Get the friendly marina name AddedMarina = Table.AddColumn(GetDetailsFromWebPageTable, "Marina", each [Details][Column2]{0}, type text), // Finishing touches SelectedColumns = Table.SelectColumns(AddedMarina,{"Marina", "Details"}), ExpandedDetails = Table.ExpandTableColumn(SelectedColumns, "Details", {"Column1", "Column2"}, {"Attribute", "Value"}), // Add index to sort the data in the data model AddedIndex = Table.AddIndexColumn(ExpandedDetails, "Index", 1, 1) in AddedIndex
Websites are very specific, so you can't expect a solution for 1 website to work for any website.
Even worse, you can't expect a solution working today will still be working tomorrow.
My solution was based on the required end result (the URL's for the various marinas), so starting with the end result, you can work your way backwards to the beginning. In the original solution I found that the required URL's could be found at the row containing the string "<table", and I proceeded from there.
So for any website you need to sort how (and if) you can get the required data.
You can find an example in this post for a multi-page website.
Hi @funlpro,
After a few try, I find the data table we need must be gotten from another URL( e.g. https://activecaptain.com/quickLists/marina.php?name=Choto_Marina_Knoxville_TN&i=728090921) which is stored in href attributes on the current URL(https://activecaptain.com/quickLists/marinaIndexUS
Regards
Well, I'm quite familiar with Power Query, but not so much with getting data from the web.
Anyhow, from this topic I learned that you can import a web page as text, and that helped me to come up with a solution.
Please bear with me, it may well be that the selections, filterings and extractions are not full proof, but in this case it all seems to work.
let // // Import webpage as text Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville"))}), // Get the row(s) containing "<table": these include the href's to the linked pages #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")), // In the next lines, the web addresses are extracted SplittedOnhref = Table.TransformColumns(#"FilteredOn<table", {{"Column1", each Text.Split(_, "href"), type text}}), Expanded = Table.ExpandListColumn(SplittedOnhref, "Column1"), #"FilteredOnNotStart<table" = Table.SelectRows(Expanded, each not Text.StartsWith([Column1], "<table")), ExtractedTextAfterFirstQuotes = Table.TransformColumns(#"FilteredOnNotStart<table", {{"Column1", each Text.Middle(_, 1+Text.PositionOf(_,"""")), type text}}), ExtractedTextUntilFirstQuotes = Table.TransformColumns(ExtractedTextAfterFirstQuotes, {{"Column1", each Text.Start(_, Text.PositionOf(_,"""")), type text}}), // Now get the table from each WebPage GetDetailsFromWebPageTable = Table.AddColumn(ExtractedTextUntilFirstQuotes, "Details", each Web.Page(Web.Contents("https://activecaptain.com/quickLists/" &[Column1]))[Data]{0}), // Get the friendly marina name AddedMarina = Table.AddColumn(GetDetailsFromWebPageTable, "Marina", each [Details][Column2]{0}, type text), // Finishing touches SelectedColumns = Table.SelectColumns(AddedMarina,{"Marina", "Details"}), ExpandedDetails = Table.ExpandTableColumn(SelectedColumns, "Details", {"Column1", "Column2"}, {"Attribute", "Value"}), // Add index to sort the data in the data model AddedIndex = Table.AddIndexColumn(ExpandedDetails, "Index", 1, 1) in AddedIndex
@MarcelBeugI'm trying to follow the same steps to scrape data from another webpage (http://www.cspdailynews.com/industry-news-analysis/top-convenience-stores-2017?page=1) "to make sure that I understood all the steps" but the second step doesn’t give correct results. Would you please clarify why it doesn’t?
let // // Import webpage as text Source = Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.cspdailynews.com/industry-news-analysis/top-convenience-stores-2017?page=1"))}), // Get the row(s) containing "<table": these include the href's to the linked pages #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")),
Websites are very specific, so you can't expect a solution for 1 website to work for any website.
Even worse, you can't expect a solution working today will still be working tomorrow.
My solution was based on the required end result (the URL's for the various marinas), so starting with the end result, you can work your way backwards to the beginning. In the original solution I found that the required URL's could be found at the row containing the string "<table", and I proceeded from there.
So for any website you need to sort how (and if) you can get the required data.
You can find an example in this post for a multi-page website.
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.
User | Count |
---|---|
96 | |
93 | |
83 | |
70 | |
65 |
User | Count |
---|---|
118 | |
106 | |
93 | |
79 | |
72 |