ancyphilip Occasional Visitor

I love using Power BI for analysis! Recently, Power BI added support for AutoML (supervised automated machine learning), and I wanted to demonstrate the new ML capabilities with an end-to-end example. In this article, I use AutoML to create and apply a regression model. Regression is a machine learning technique for understanding the relationship between input and output variables: the inputs are the known values in your data, and the output is generally an effect of the inputs and is sometimes unknown. Regression gives us more insight into the data by making it easier to see the relationship between dependent and independent variables. These are the five steps I followed in this article to create the regression model:

2.png

Refer to this tutorial for a detailed walkthrough of each of these steps, and to this tutorial for the steps specific to regression models.

 

Creating a dataflow with the input data

I created a dataflow using the House Sales prediction dataset from Kaggle, which is also available for the purposes of this article at the following link: house sales prediction. It contains the sale price, number of bedrooms, square footage, number of floors, latitude, longitude, and other attributes of houses in King County (which includes Seattle) sold between May 2014 and May 2015. The Price attribute, indicating the price of the house, is the numeric field used as the label for the regression model. The dataset was split into two entities, HousePrices2014 and HousePrices2015, based on the year attribute.

3.png

Note that although most of the attribute data types in this dataset are numeric, they may be interpreted as strings by Power Query. So most of these fields, including the label, were converted to numeric.
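In Power Query's M language, these conversions correspond to a single Table.TransformColumnTypes step. A minimal sketch follows; the column names are taken from the Kaggle dataset, but verify them against your own entity, and the previous step name is assumed:

```m
// Convert columns that Power Query read as text into numeric types,
// including the price label used by the regression model.
#"Changed column type" = Table.TransformColumnTypes(
    #"Promoted headers",              // assumed previous step in the query
    {
        {"price", type number},       // the label for the regression model
        {"bedrooms", Int64.Type},
        {"sqft_living", Int64.Type},
        {"floors", type number},
        {"lat", type number},
        {"long", type number}
    }
)
```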

Training a machine learning model

I chose HousePrices2014 as the entity to apply ML models to. Power BI analyzed the Price field and suggested Regression as the type of machine learning model to create to predict that field. Impressively, it took only around 7 minutes to train 12 algorithms on 14 thousand data points.

 

Model validation report

After the training refresh completed, I checked out the training report. Key Influencers shows that Latitude, Sqft_living, and Grade are the top predictors for the model.

 

1.png

 

To verify this, I plotted sqft_living (square footage of the home) against price. The plot does indeed show a relationship: larger living area corresponds to higher prices. As shown in the sqft_living breakdown above, houses with 2.7k to 2.9k square feet have the highest share of houses in the first price range (72k to 1M).

 

4.png

 

Likewise, a scatterplot of grade (the overall grade given to the housing unit, based on the King County grading system) against price shows that houses with higher grades do have higher prices.

 

5.png

 

The model achieved a performance score of 88%, which is quite good for this dataset. As you can see in the chart below, most of the test samples lie along the diagonal, indicating low prediction error.

6.png

7.png
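The 88% figure is the model's R2 score (the coefficient of determination), as the comparison with scikit-learn later in this article confirms. For actual values \(y_i\), predictions \(\hat{y}_i\), and mean actual value \(\bar{y}\), it is defined as:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean price.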

 

The Training Details page shows the performance of the 25 different models, how features were extracted from the inputs, and the hyperparameters of the final model. AutoML chose an ensemble, a Prefitted Soft Voting Regressor, as the best model for this dataset.

 

8.png

9.png

 

Applying the model

As I was satisfied with the training results, I proceeded to apply the model to the HousePrices2015 entity. A "HousePrices2015 enriched RegressionModel" entity was created, which includes the predicted output from the model. Overall, it took only around 2 minutes to score nearly 7 thousand rows!

 

Applying a regression model added two columns: the regression outcome and the top record-specific influencers for each prediction. The HousePrices2015 enriched Regression entity, with explanations and predicted prices, is shown below alongside the actual prices in Power BI Desktop. The local explanations contain a breakdown of how much each feature moved the specific instance away from the average price of houses. For categorical features, they show how much the price would have changed for different values of the categories. In the highlighted example, you can see that the house price was predicted to be $554,174. The explanation says that condition pushed the price up by 44k from the average price of houses, while latitude pulled it down by 26k.

 

10.png

 

I established a connection to the dataflow from Power BI Desktop to incorporate the model's predictions into reports and visualize the results. I plotted a heat map of predicted prices using latitude and longitude, with prices represented by a gradient from red (100% gradient stop) through green (50% gradient stop) to yellow (0% gradient stop).

 

12.PNG

We can see that, for example, Downtown Seattle and Downtown Bellevue have red regions as expected, owing to high prices.

 

13.png

 

I tried a scikit-learn Random Forest on this same dataset. It took 1 minute, using 3-fold cross-validation and searching across 30 different hyperparameter combinations on all available cores, to reach an R2 score of 0.85.

 

14.png

Summary

Even though AutoML took longer, I am impressed to have gotten a better R2 score of 0.88 by trying not just a Random Forest but 25 different models, in addition to getting instance-level explanations. Moreover, AutoML automatically does preprocessing, such as dropping features like ID that carry no useful information and generating additional features, like month and year, from datetime fields. In scikit-learn, on the other hand, one must manually encode string or datetime features, or drop them.

Overall, I’m very excited about AutoML, because I did not have to learn the intricacies of the ML models or worry about the implementation, and I still got comparable results. I have done this in Python and R, and it took me considerably longer to code up the models and set up the hyperparameters, whereas in AutoML I just click a few buttons and, voila, I am done! Given how much easier it is to get started with less effort, I encourage business analysts and even data scientists/analysts to try out AutoML!

ghamers Occasional Visitor

Data preparation can go a long way in improving the results of machine learning models. Before getting started with AutoML in Power BI, take a moment to explore some data cleaning techniques with your data. All of the necessary tools you’ll need already exist in the Power BI ecosystem. 

 

Standardization (Relabeling) 

 

Imagine you have a text column describing college degrees, e.g. “Master’s Degree”, “Bachelor’s Degree”, etc. Depending on how data entry was done, you might end up with values such as “M.A.”, “Masters”, and “Master’s Degree”, all meaning the same thing. By default, a machine learning model makes no assumptions about these fields being synonyms and will treat them as unique entries. In most cases, it would be ideal if the model analyzed these varying entries the same way.

 

Once your data is available as an entity in Power Query Online, you can remedy this discrepancy using the “Replace values” feature. To find this functionality, simply right-click your desired column header and select “Replace values”. Use the “Advanced” mode to match the entire contents of your column’s cells rather than the default partial matching.

 

 

Standardization.PNG
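Behind the UI, each whole-cell replacement becomes a Table.ReplaceValue step in M. A sketch for the degree example above follows; the Education column name and the previous step name are hypothetical:

```m
// Collapse synonymous labels into one canonical value.
// Replacer.ReplaceValue matches the entire cell contents,
// which is what the "Advanced" whole-value option produces
// (Replacer.ReplaceText would do partial matching instead).
#"Replaced M.A." = Table.ReplaceValue(
    #"Changed column type",       // assumed previous step
    "M.A.",                       // value to match
    "Master's Degree",            // canonical replacement
    Replacer.ReplaceValue,
    {"Education"}
),
#"Replaced Masters" = Table.ReplaceValue(
    #"Replaced M.A.", "Masters", "Master's Degree",
    Replacer.ReplaceValue, {"Education"}
)
```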


Discretizing and Binning (Bucketing) 

 

With AutoML in Power BI, it is possible to create a powerful prediction model if your entity has a True/False column, that is, a column containing two distinct values indicating separate states. A numeric column is the easiest type of column to convert to a True/False column. In Power Query Online, this conversion can be achieved by using a Conditional Column. In this example, imagine you have a numeric column of scores ranging from 0 to 100. We can narrow this column to a True/False column by separating values above sixty from those below. After adding the conditional column, remember to set its type to True/False by clicking the type icon next to the column header.

 

 

Bucketing.PNG
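The generated Conditional Column step looks roughly like this in M; the Score and Passed column names are illustrative, as is the previous step name:

```m
// Bucket a 0-100 score into a True/False column for AutoML.
#"Added conditional column" = Table.AddColumn(
    #"Changed column type",                      // assumed previous step
    "Passed",
    each if [Score] > 60 then true else false,
    type logical                                 // True/False column type
)
```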

 

 

Removing Outliers 

 

There are times when a numeric column may have entries that differ greatly from the rest of the values in the column. In most cases, the presence of these outlier values provides little benefit for a machine learning model. Let’s say we define an outlier for a numeric column as a value falling more than two standard deviations above or below the median value of the column. In this section we build upon the concept of conditional columns and enhance it with a little extra Power Query magic. For your table entity, open the advanced editor: 
 

 

 

Outliers.PNG

 

 

Next, locate the column for which you wish to remove outliers. In this case the column header is “Fare”. We will create two new variables to store twice the column’s standard deviation as well as its median value. 
 

 

#"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2, 
#"Medium Value" = List.Median(#"Changed column type"[Fare]), 

 

 

Now we will use a conditional column to identify an outlier by comparing the Fare value to median value plus or minus two times the standard deviation.  If the value falls outside of that range, then we set the value to the overall median value. 

 

 

#"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 

 

 

When using this code snippet, simply replace “Fare” with the header name of your desired column. An example section of Power Query code to perform the outlier replacement follows: 

 

 

let 
    Source = Csv.Document(Web.Contents("<YOUR_CSV_SOURCE>"), [Delimiter = ",", Columns = 12, QuoteStyle = QuoteStyle.None]),  
    #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),  
    #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"Fare", type number}}),  
    #"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2,  
    #"Medium Value" = List.Median(#"Changed column type"[Fare]),  
    #"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 
in 
    #"Outliers Replaced" 

 

 

With only these simple techniques at your disposal, you can start to build powerful, focused machine learning models in Power BI. Keep an eye out for more advanced techniques that will be covered in future posts. In the meantime, get started building models in Power BI today and get a feel for how a bit of data preparation can have a positive impact on the resulting model reports. 

 

 

Yasin Shtiui | Software Engineer II at Microsoft Power BI (Artificial Intelligence) team

Garrett Hamers | Software Engineer at Microsoft Power BI (Artificial Intelligence) team

Yuchen Visitor

Power BI, in the latest release, added support for supervised automated machine learning. This means that Power BI can help predict ‘unknowns’ once it learns from the ‘known’ values.

Read more...

Christian_Berg Visitor

Estimating the actual impact of an investment is an essential part of management. An investment, in this context, can be any change in resource allocation but is most often a project or marketing initiative. By quantifying the major benefits realized, we obtain an objective measure of the outcome. This is useful when discussing lessons learnt and is often invaluable input the next time a similar investment is considered. From a business intelligence perspective too, there is great value in comparing the actual results with the initial projection/forecast to understand the limitations of existing models, and to prioritize opportunities for improvement. In addition, rigorous analytics promotes a learning culture with end-to-end accountability. It is an essential tool for continuous improvement.

Read more...
