Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
funlpro
Frequent Visitor

Power BI - Import data from web

I am trying to scarpe some data using MS PowerBI from https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville 

The problem is i cannot see tables when i'm trying to connect to webpage.
Can anyone help please.

2 ACCEPTED SOLUTIONS

Well, I'm quite familiar with Power Query, but not so much with getting data from the web.

 

Anyhow, from this topic I learned that you can import a web page as text, and that helped me to come up with a solution.

Please bear with me, it may well be that the selections, filterings and extractions are not full proof, but in this case it all seems to work.

 

let
    // // Import webpage as text
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville"))}),

    // Get the row(s) containing "<table": these include the href's to the linked pages
    #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")),

    // In the next lines, the web addresses are extracted
    SplittedOnhref = Table.TransformColumns(#"FilteredOn<table", {{"Column1", each Text.Split(_, "href"), type text}}),
    Expanded = Table.ExpandListColumn(SplittedOnhref, "Column1"),
    #"FilteredOnNotStart<table" = Table.SelectRows(Expanded, each not Text.StartsWith([Column1], "<table")),
    ExtractedTextAfterFirstQuotes = Table.TransformColumns(#"FilteredOnNotStart<table", {{"Column1", each Text.Middle(_, 1+Text.PositionOf(_,"""")), type text}}),
    ExtractedTextUntilFirstQuotes = Table.TransformColumns(ExtractedTextAfterFirstQuotes, {{"Column1", each Text.Start(_, Text.PositionOf(_,"""")), type text}}),

    // Now get the table from each WebPage
    GetDetailsFromWebPageTable = Table.AddColumn(ExtractedTextUntilFirstQuotes, "Details", each Web.Page(Web.Contents("https://activecaptain.com/quickLists/" &[Column1]))[Data]{0}),

    // Get the friendly marina name
    AddedMarina = Table.AddColumn(GetDetailsFromWebPageTable, "Marina", each [Details][Column2]{0}, type text),

    // Finishing touches
    SelectedColumns = Table.SelectColumns(AddedMarina,{"Marina", "Details"}),
    ExpandedDetails = Table.ExpandTableColumn(SelectedColumns, "Details", {"Column1", "Column2"}, {"Attribute", "Value"}),

    // Add index to sort the data in the data model
    AddedIndex = Table.AddIndexColumn(ExpandedDetails, "Index", 1, 1)
in
    AddedIndex

 

Specializing in Power Query Formula Language (M)

View solution in original post

Websites are very specific, so you can't expect a solution for 1 website to work for any website.

Even worse, you can't expect a solution working today will still be working tomorrow.

 

My solution was based on the required end result (the URL's for the various marinas), so starting with the end result, you can work your way backwards to the beginning. In the original solution I found that the required URL's could be found at the row containing the string "<table", and I proceeded from there.

 

So for any website you need to sort how (and if) you can get the required data.

 

You can find an example in this post for a multi-page website.

Specializing in Power Query Formula Language (M)

View solution in original post

5 REPLIES 5
v-ljerr-msft
Employee
Employee

Hi @funlpro,

 

After a few try, I find the data table we need must be gotten from another URL( e.g. https://activecaptain.com/quickLists/marina.php?name=Choto_Marina_Knoxville_TN&i=728090921) which is stored in href attributes on the current URL(https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville ) page. And getting attributes out of a tag is quite a bit more difficult. Hope someone who is more familiar with Power Query( @ImkeF, @MarcelBeug) can help on this issue.Smiley Happy

 

Regards

Well, I'm quite familiar with Power Query, but not so much with getting data from the web.

 

Anyhow, from this topic I learned that you can import a web page as text, and that helped me to come up with a solution.

Please bear with me, it may well be that the selections, filterings and extractions are not full proof, but in this case it all seems to work.

 

let
    // // Import webpage as text
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://activecaptain.com/quickLists/marinaIndexUSState.php?st=TN&city=Knoxville"))}),

    // Get the row(s) containing "<table": these include the href's to the linked pages
    #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")),

    // In the next lines, the web addresses are extracted
    SplittedOnhref = Table.TransformColumns(#"FilteredOn<table", {{"Column1", each Text.Split(_, "href"), type text}}),
    Expanded = Table.ExpandListColumn(SplittedOnhref, "Column1"),
    #"FilteredOnNotStart<table" = Table.SelectRows(Expanded, each not Text.StartsWith([Column1], "<table")),
    ExtractedTextAfterFirstQuotes = Table.TransformColumns(#"FilteredOnNotStart<table", {{"Column1", each Text.Middle(_, 1+Text.PositionOf(_,"""")), type text}}),
    ExtractedTextUntilFirstQuotes = Table.TransformColumns(ExtractedTextAfterFirstQuotes, {{"Column1", each Text.Start(_, Text.PositionOf(_,"""")), type text}}),

    // Now get the table from each WebPage
    GetDetailsFromWebPageTable = Table.AddColumn(ExtractedTextUntilFirstQuotes, "Details", each Web.Page(Web.Contents("https://activecaptain.com/quickLists/" &[Column1]))[Data]{0}),

    // Get the friendly marina name
    AddedMarina = Table.AddColumn(GetDetailsFromWebPageTable, "Marina", each [Details][Column2]{0}, type text),

    // Finishing touches
    SelectedColumns = Table.SelectColumns(AddedMarina,{"Marina", "Details"}),
    ExpandedDetails = Table.ExpandTableColumn(SelectedColumns, "Details", {"Column1", "Column2"}, {"Attribute", "Value"}),

    // Add index to sort the data in the data model
    AddedIndex = Table.AddIndexColumn(ExpandedDetails, "Index", 1, 1)
in
    AddedIndex

 

Specializing in Power Query Formula Language (M)

Thank you @MarcelBeug, your solution worked well.

@MarcelBeugI'm trying to follow the same steps to scrape data from another webpage (http://www.cspdailynews.com/industry-news-analysis/top-convenience-stores-2017?page=1) "to make sure that I understood all the steps" but the second step doesn’t give correct results. Would you please clarify why it doesn’t?

 

let
    // // Import webpage as text
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.cspdailynews.com/industry-news-analysis/top-convenience-stores-2017?page=1"))}),
 
    // Get the row(s) containing "<table": these include the href's to the linked pages
    #"FilteredOn<table" = Table.SelectRows(Source, each Text.Contains([Column1], "<table")),
 

 

Websites are very specific, so you can't expect a solution for 1 website to work for any website.

Even worse, you can't expect a solution working today will still be working tomorrow.

 

My solution was based on the required end result (the URL's for the various marinas), so starting with the end result, you can work your way backwards to the beginning. In the original solution I found that the required URL's could be found at the row containing the string "<table", and I proceeded from there.

 

So for any website you need to sort how (and if) you can get the required data.

 

You can find an example in this post for a multi-page website.

Specializing in Power Query Formula Language (M)

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors