I love using PowerBI for analysis! Recently PowerBI added support for AutoML (Supervised Automated machine Learning) and I wanted to demonstrate the new ML capabilities with an end to end example. In this article, I use AutoML to create and apply a regression model. Regression is a machine learning technique used for understanding the relationship between input and output variables (input is what is known in your data and output is generally the effect of input data and is sometimes unknown). Regression technique gives us more insight into data by making it easier to see relation between dependent and independent variables. These are the five steps I followed in this article to create the Regression model:
Refer this tutorial for a detailed walk through of each of these steps, and this tutorial for ones specific to regression models.
Creating a dataflow with the input data
I created a dataflow using the House Sales prediction dataset from Kaggle. It is also available, at the following link: house sales prediction for purposes of this article. It contains sale prices, number of bedrooms, square footage, number of floors, latitude, longitude etc. of houses in King County (which includes Seattle) sold between May 2014 and May 2015. The Price attribute indicating the price of the house is the numeric field used as label for the Regression model. The dataset was split into two entities HousePrices2014 and HousePrices2015 based on the year attribute.
Note that though most of the attribute data types in this dataset are numeric, these may be interpreted as strings by Power Query. So, most of these fields, including the label, were converted to numeric.
Training a machine learning model
I chose the HousePrices2014 as the entity to apply ML models. Power BI analyzed the Price field and suggested Regression as the type of machine learning model that can be created to predict that field. It took around an impressive 7 minutes to train 12 algorithms on 14 thousand data points.
Model validation report
After the training refresh completed, I checked out the training report. Key Influencers shows that Latitude, Sqft_living, Grade are the top predictors for the model.
To verify this, I plotted a ft_living (square footage of the home) and price. The plot does indeed show a relationship indicating that that higher square feet living indicates higher prices. As shown in the sqft_living breakdown above, 2.7k to 2.9k sqfeet houses has the highest share of range 1 priced houses (72k to 1M).
Likewise, a scatterplot between grade (overall grade given to the housing unit, based on King County grading system) and price shows houses with higher grades does have higher prices.
The model is 88% performant, which is a pretty great value for this dataset. As you can see in the chart below, most of the test samples lie in the diagonal showing the low error in prediction.
The Training Details page shows the performance of the 25 different models, how features were extracted from the inputs, and the hyperparameters for the final model used. AutoML chose an ensemble, Pre fitted Soft Voting Regressor as the best model for this dataset.
Applying the model
As I was satisfied with the training results, I proceeded to apply the model to the HousePrices2015 entity. A HousePrices2015 enriched RegressionModel was created, which includes the predicted output from the model. Overall, it took only around 2 minutes to score nearly 7 thousand rows!
Applying a Regression model added two columns with the Regression outcome, and the top record-specific influencers for each prediction. The HousePrices2015 enriched Regression entity with explanation, predicted prices is shown below along with the actual prices in PowerBI Desktop. The local explanations contains break down of how much each feature moved the specific instance away from the average price of houses. For categorical features, it shows how much price would have changed for different values of the categories. In the highlighted example, you can see that the house price was predicted to be $554174. The explanation says that, "condition", caused the price to increase by 44k from the average price of houses, and that the "latitude" caused the price to fall down by 26k.
I established a connection to the dataflow from PBI Desktop, to incorporate these predictions from the model in PBI reports so that I can visualize these results. I plotted a heat map of predicted prices using latitude, longitude. Prices were represented by a red (100% gradient stop) through green (50% gradient stop), to yellow(0% gradient stop) gradient.
We can see that, for example, Downtown Seattle and Downtown Bellevue have red regions as expected, owing to high prices.
I tried a scikit-learn Random Forest on this same dataset. It took 1 min using 3-fold cross validation, searching across 30 different combinations, using all available cores to give an R2 score of 0.85.
Even though AutoML took longer, I am impressed to have gotten a better R2 score of 0.88 by trying not just Random Forest but 25 different models, in addition to getting instance-level explanations. Moreover, AutoML automatically does preprocessing like dropping features like ID with no useful information and generating additional features for datetime like month, year etc. On the other hand, in scikit-learn one must encode string or date-time like features or drop it.
Overall, I’m very excited about AutoML, because I did not have to learn the intricacies of the ML models nor worry about the implementation and still get the same results. I have done these in Python and R and it took me considerably longer to code up and setup the hyperparameters, whereas in AutoML I am just hitting a bunch of buttons and voila I am done! Given how much easier it is to get started with lesser effort, I encourage business analysts and even data scientists/analysts to try out AutoML!
Power BI's slicers provide a powerful way to hide rows in a table - but no built-in feature allows columns to be shuffled and sliced in a similar way. That doesn't mean it can't be implemented, however, and in a way that is close to seamless for an end user. Using a layer of measures, we can abstract the columns of raw data from the displayed columns, and allow columns to be dynamically displayed in any position.
Today companies store huge amounts of data related to their various business processes. This data can help discover, monitor and improve your actual business process. The process of extracting process knowledge from data is called Process Mining. Process Mining can help gain better visibility, improve KPIs and eliminate bottlenecks.
One of the popular open source packages to help with process mining is bupaR. It is an open-source, integrated suite of R packages for the handling and analysis of business process data. It was developed by the Business Informatics research group at Hasselt University, Belgium. It currently consists of many packages which can help in calculating descriptives, process monitoring and process visualization.
The bupaR is the core package of the framework. It includes basic functionality for creating event log objects in R. It contains several functions to get information about an event log and provides specific event log versions of generic R functions. Together with the related packages, each of which has its own specific purpose, bupaR aims at supporting each step in the analysis of event data in R, from data import to online process monitoring.
The good news is that now PowerBI service supports bupaR visuals. Let’s explore what we can do! Our attempt here is to just quickly show a few possibilities with bupaR and PowerBI. You can read more about bupaR in some of the links below. For more information on how to create R visuals in the Power BI service, please see Creating R visuals in the Power BI service and Create Power BI visuals using R.
Let’s consider the scenario of patients arriving in an emergency department of a hospital. The event data in this example comes from "patients" dataset from eventdataR package. I made the sample data as a .csv file, then imported the data into PowerBI desktop and next I will show you how to use bupaR to create event logs and plot visuals from PowerBI. The data looks like below picture in PowerBI desktop.
If you are interested to see the process map for the "completed" patients event log, which starts with "Registration" and ends with "Check-out", you can create the R visual in the Power BI Desktop with the following R script:
Once it gets published to Power BI service, we can see it renders as the following image.
If you want to see frequency in the process map, it can be created explicitly using the frequency function. The colors can be modified through the color_scale argument.
Different activity sequences in the event log can be visualized with trace_explorer. It can be used to explore frequent as well as infrequent traces. The coverage argument specifies how much of the log you want to explore. Below example shows the most frequent traces covering 98.5% of the event log.
Automated ML integration with Power BI dataflows allows training and applying Binary Prediction, General Classification and Regression models. The ML models are internally represented as specially marked dataflow entities.I’ll describehow the ML related entities are defined in Mlanguageand how theycan be edited using the Power Query editor.