cancel
Showing results for 
Search instead for 
Did you mean: 

How-To: Union Paginated Web Data Source

Hello, #pbination!

What does a usual web data source look like?

You are lucky if you have one structured table on one page, like the sports team stats sheet from the http://stats.nba.com resource below.

1.png

But, much more often, for a better user experience, webmasters create an index page with links to other equally structured pages with the same data nature.

I’ll give you an example. Let’s try to get a full list of English Premier League players (it is football, or soccer, if you still didn’t figure it out). Do not try to find any sense in this task. I just want to discover the average age, or height/weight, or nationalities diversity, or anything. Share your ideas in comments, please.

Of course, you can find one big list of all players (I bet it will be a difficult task) or get access to any SQL DB with the list. Or if you even have enough time to enter data manually (great Power BI option).

But, more often, you will find a resource with about 20 equally structured pages for each team. And, after that, you can create 20 data sources for each team, then append it, and then create a beautiful dashboard.

Stop, just imagine. You’ve just created all 20 duplicated data sources. And then you found out that you need to edit one Power Query transformation step in each of them. And later one more... and so on.

What if one team will be replaced by another? You will need another data source. In other words, you will need to continuously manage changes in the resources list.

Good sense and beginner-programmer’s logic are telling me (and you) that there should be some construction-like loop or repeated function or anything.

Let’s try to get it. There are going to be a few stages in my technique:

Find the source. Choose template page

  1. Develop M-code from the template page
  2. Find the source. Find an index page and parse it to get all sources list
  3. Union all useful pages

 

Find the source

OK, I’ll be honest. I’m cheating a little bit. I already have a quality source. I will use http://www.footballsquads.co.uk/eng/2019-2020/engprem.htm

 

Choose the template page

Then we will need some practice. Take one example page, get a set of data that we will use for the report, produce all transformation routine which will be evaluated to each data set. As it is my case, I will choose my favorite Manchester United page as a template http://www.footballsquads.co.uk/eng/2019-2020/engprem/manutd.htm (sorry, Liverpool fans).

 

Develop M-code from template page

Let’s do the whole typical routine for getting data from a webpage. Go to Get data -> Web, copy-paste the URL of the template page into the URL field and load the table with the native Navigator dialog box.

 

2.png

 

After making a typical routine to clean the data like remove empty rows, autodetect data types, format values to my locale, etc. I’ve got a new template data source. Copy and save this code to future operations

 

let
Source = Web.BrowserContents("http://www.footballsquads.co.uk/eng/2019-2020/engprem/manutd.htm"),
#"Extracted Table From Html" = Html.Table(Source, {{"Column1", "TABLE > * > TR >  ....  TR"]),
#"Changed Type" = Table.TransformColumnTypes(#"Extracted Table From Html",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}, {"Column6", type text}, {"Column7", type text}, {"Column8", type text}, {"Column9", type text}}),
#"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
#"Changed Type1" = Table.TransformColumnTypes(#"Promoted Headers",{{"Number", type text}, {"Name", type text}, {"Nat", type text}, {"Pos", type text}, {"Height", type text}, {"Weight", type text}, {"Date of Birth", type text}, {"Birth Place", type text}, {"Previous Club", type text}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type1", each ([Name] <> "" and [Name] <> "Name" and [Name] <> "Players no longer at this club") and ([Height] <> "")),
#"Replaced Value" = Table.ReplaceValue(#"Filtered Rows",".",",",Replacer.ReplaceText,{"Height"}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","-",".",Replacer.ReplaceText,{"Date of Birth"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Replaced Value1",{{"Date of Birth", type date}, {"Height", type number}, {"Weight", Int64.Type}})
in
#"Changed Type2"

 

 

Find an index page and parse it to get all sources list

Not a big problem for us, there is a full list of links to each EPL roster on the same resource – http://www.footballsquads.co.uk/eng/2019-2020/engprem.htm

Our task here is to get the list and we will have to do some magic with this web data source. After some manipulation with the Navigator option “Add table using examples”, we will have a new index-page data source:

3.png

 

 

let
Source = Web.BrowserContents("http://www.footballsquads.co.uk/eng/2019-2020/engprem.htm"),
#"Extracted Table From Html" = Html.Table(Source, {{"Team", "H5"}, {"Link", "H5 > A:nth-child(1):nth-last-child(1)", each [Attributes][href]?}}, [RowSelector="H5"]),
#"Changed Type" = Table.TransformColumnTypes(#"Extracted Table From Html",{{"Team", type text}, {"Link", type text}})
in
#"Changed Type"

 

Copy and save this code.

 

Union all useful pages

Finally, our preparation is finished and now we are ready to union all team roster pages from index links into the single data source. At this stage, we will have to do a few hard-coding manipulations

Go to New source (or Get data if you are in Report mode occasionally) -> Blank query, press Advanced Editor and copy-paste here our code from previous step – Index page source code.

Then, just before the second row, which looks like this

 

Source = Web.BrowserContents("http://www.footballsquads.co.uk/eng/2019-2020/engprem.htm"),

 

insert template page code and do not forget to delimit it via comma symbol.

Ok, stop, take a breath and have a glance at our current code

 

let
let
Source = Web.BrowserContents("http://www.footballsquads.co.uk/eng/2019-2020/engprem/manutd.htm"),
#"Extracted Table From Html" = Html.Table(Source, {{"Column1", "TABLE > * > TR > ... TR"]),
#"Changed Type" = Table.TransformColumnTypes(#"Extracted Table From Html",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}, {"Column6", type text}, {"Column7", type text}, {"Column8", type text}, {"Column9", type text}}),
#"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
#"Changed Type1" = Table.TransformColumnTypes(#"Promoted Headers",{{"Number", type text}, {"Name", type text}, {"Nat", type text}, {"Pos", type text}, {"Height", type text}, {"Weight", type text}, {"Date of Birth", type text}, {"Birth Place", type text}, {"Previous Club", type text}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type1", each ([Name] <> "" and [Name] <> "Name" and [Name] <> "Players no longer at this club") and ([Height] <> "")),
#"Replaced Value" = Table.ReplaceValue(#"Filtered Rows",".",",",Replacer.ReplaceText,{"Height"}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","-",".",Replacer.ReplaceText,{"Date of Birth"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Replaced Value1",{{"Date of Birth", type date}, {"Height", type number}, {"Weight", Int64.Type}})
in
#"Changed Type2",
Source = Web.BrowserContents("http://www.footballsquads.co.uk/eng/2019-2020/engprem.htm"), #"Extracted Table From Html" = Html.Table(Source, {{"Team", "H5"}, {"Link", "H5 > A:nth-child(1):nth-last-child(1)", each [Attributes][href]?}}, [RowSelector="H5"]), #"Changed Type" = Table.TransformColumnTypes(#"Extracted Table From Html",{{"Team", type text}, {"Link", type text}})
in
#"Changed Type"

 

Pay attention to 2 important things:

  1. We have got not unique variable names: Source and “Changed Type”. Let’s rename Source from template page to Team (and do not forget to replace “Source” to “Team” in the next row as well) and last “Changed Type” rename to “Changed Type3” (and again do not forget to replace it in the next row as well).
  2. We see two “let” sentences in row. Replace second to 

 

TeamRoster = (Link) => let

 

where (link) will be our variable which defines data source for each team. It is the one of the most important steps here.

So, we have almost achieved the goal. Let’s add a step after “Changed Type3”:

 

, InsertedCustom = Table.AddColumn(#"Changed Type3" , "Custom", each Team.Roster([Link]))

 

Again, do not forget to replace “Changed Type3” to “InsertedCustom” in the last row and press OK to exit from the Advanced editor. Now you should see something like this:

4.png

 Expand the Custom field and voila – our data is ready for report building!

5.png

 

Conclusion

In this technique, to get one complete data source from list of similar data sources placed on Index page, your code should have the next structure:

 

let
TemplateResourceVariable = (link) => let
...
template page code here
...
in
lastStepTemplateCode,
...
index page code here
...
,
InsertedCustomStep = Table.AddColumn(#"Changed Type3" , "Custom", each TemplateResourceVariable([Link])),
#"Expanded Custom" = Table.ExpandTableColumn(... expandTable code here ...)
in
#"Expanded Custom"

 

 

P.S. Result

At the end of the day, I’ve got a report like below (see my attached pbix file).

6.png