In this article, we are going to go through an example of performing ETL (Extraction, Transformation, Load) function by looping through multiple pages on a website.
For this, we are going to use a website called planecrashinfo.com, which has historical data of all the plane crashes from 1920 onwards.
Now if you click on Database Page, you will find that there are links to many HTML pages, starting from 1920 till 2020, and we would like to extract data for each of the HTML pages here.
To start, we are going to open one of the pages randomly and extract the data of that web page in Power BI. In this example, I select 1922. Below is how 1922 looks like.
To extract this data in Power BI, we would simply open Power BI desktop and select the Get Data option -> Web. To follow the steps easily, I have embedded a GIF file here.
Get Data in Power BI Desktop
Once you have selected the table, hit Transform Data and go to Power Query Editor.
[Edit] - One thing I noticed in the latest #powerbi version is that when I import a web page with Table, it calls Web[.]BrowserContents(URL) method. While this method is doing the intended, this eliminates any line breaks characters your text might have. (A must for text transformation). One quick hack I found is replacing it with Web[.]Page(Web[.]Contents(URL)). You can also refer to my YouTube Video for a step-by-step approach.
In this step, we are going to transform this data a bit,
Split the Location/Operator by line breaks [#(cr)] to make two columns, Location, and Operator.
Split the Aircraft Type/Registration by line breaks [#(cr)] to make two columns, Aircraft Type, and Registration.
Split the fatalities column by / and ( into three columns i.e. Air Fatalities, Aboard, Ground Fatalities.
The next step is to define a Parameter, We can define a parameter using the Manage Parameters option in the Query editor, however, here we would use Advanced Editor and use M Code to define our parameter.
Using Advanced Editor to pass on a Parameter
To test this, we can trigger the AircraftEvents function and enter some years manually, and this will create a table in the Query Editor with the data from that specific year.
Now, since we want data for all the Years from this website, we would import the list of years from this website, remove the unwanted column and unpivot multiple columns into one column. Rename it as “year".
Extracting year list from the website
Now, since we have two queries, the first one being AircraftEvents, which is a functional query, and the second one a list of all the years from the website, we need to find a way to pass this list of years to the AircraftEvents function.
To do this, we will add a custom column in the Year_List query and invoke the function AircraftEvents with the argument [Year], where [Year] is the name of the column in the Year_List table.
This will give us a custom column with many tables, one table for each year. Once we expand the table, we get a consolidated data table with Plane crash data of all the years.
Calling AircraftEvents table in Year_List table
Your dataset is ready for you to do cool visualizations.