Showing results for 
Search instead for 
Did you mean: 

Find articles, guides, information and community news

Most Recent
catdor Frequent Visitor
Frequent Visitor

Recently we announced the capability of creating Machine Learning models directly in PowerBI. This is powered by automated machine learning from Azure. 


You can create an ML model based on your data with a few simple steps using this new capability, so this post aims to provide some light on the next steps, model performance and application, and provide a few tips about the reports. 



ancyphilip Visitor

I love using PowerBI for analysis! Recently PowerBI added support for AutoML (Supervised Automated machine Learning) and I wanted to demonstrate the new ML capabilities with an end to end example. In this article, I use AutoML to create and apply a regression model. Regression is a machine learning technique used for understanding the relationship between input and output variables (input is what is known in your data and output is generally the effect of input data and is sometimes unknown). Regression technique gives us more insight into data by making it easier to see relation between dependent and independent variables. These are the five steps I followed in this article to create the Regression model:


Refer this tutorial  for a detailed walk through of each of  these steps, and this tutorial for ones specific to regression models.


Creating a dataflow with the input data

I created a dataflow using the House Sales prediction dataset from Kaggle. It is also available, at the following link: house sales prediction for purposes of this article. It contains sale prices, number of bedrooms, square footage, number of floors, latitude, longitude etc. of houses in King County (which includes Seattle) sold between May 2014 and May 2015. The Price attribute indicating the price of the house is the numeric field used as label for the Regression model. The dataset was split into two entities HousePrices2014 and HousePrices2015 based on the year attribute.


Note that though most of the attribute data types in this dataset are numeric, these may be interpreted as strings by Power Query. So, most of these fields, including the label, were converted to numeric.

Training a machine learning model

I chose the HousePrices2014 as the entity to apply ML models. Power BI analyzed the Price field and suggested Regression as the type of machine learning model that can be created to predict that field. It took around an impressive 7 minutes to train 12 algorithms on 14 thousand data points.


Model validation report

After the training refresh completed, I checked out the training report. Key Influencers shows that Latitude, Sqft_living, Grade are the top predictors for the model.




To verify this, I plotted a ft_living (square footage of the home) and price. The plot does indeed show a relationship indicating that that higher square feet living indicates higher prices. As shown in the sqft_living breakdown above, 2.7k to 2.9k sqfeet houses has the highest share of range 1 priced houses (72k to 1M).




Likewise, a scatterplot between grade (overall grade given to the housing unit, based on King County grading system) and price shows houses with higher grades does have higher prices.




The model is 88% performant, which is a pretty great value for this dataset. As you can see in the chart below, most of the test samples lie in the diagonal showing the low error in prediction.




The Training Details page shows the performance of the 25 different models, how features were extracted from the inputs, and the hyperparameters for the final model used. AutoML chose an ensemble, Pre fitted Soft Voting Regressor as the best model for this dataset.





Applying the model

As I was satisfied with the training results, I proceeded to apply the model to the HousePrices2015 entity. A HousePrices2015 enriched RegressionModel was created, which includes the predicted output from the model. Overall, it took only around 2 minutes to score nearly 7 thousand rows!


Applying a Regression model added two columns with the Regression outcome, and the top record-specific influencers for each prediction. The HousePrices2015 enriched Regression entity with explanation, predicted prices is shown below along with the actual prices in PowerBI Desktop. The local explanations contains break down of how much each feature moved the specific instance away from the average price of houses. For categorical features, it shows how much price would have changed for different values of the categories. In the highlighted example, you can see that the house price was predicted to be $554174. The explanation says that, "condition", caused the price to increase by 44k from the average price of houses, and that the "latitude" caused the price to fall down by 26k.




I established a connection to the dataflow from PBI Desktop, to incorporate these predictions from the model in PBI reports so that I can visualize these results. I plotted a heat map of predicted prices using latitude, longitude. Prices were represented by a red (100% gradient stop) through green (50% gradient stop), to yellow(0% gradient stop) gradient.



We can see that, for example, Downtown Seattle and Downtown Bellevue have red regions as expected, owing to high prices.




I tried a scikit-learn Random Forest on this same dataset. It took 1 min using 3-fold cross validation, searching across 30 different combinations, using all available cores to give an R2 score of 0.85.




Even though AutoML took longer, I am impressed to have gotten a better R2 score of 0.88 by trying not just Random Forest but 25 different models, in addition to getting instance-level explanations. Moreover, AutoML automatically does preprocessing like dropping features like ID with no useful information and generating additional features for datetime like month, year etc. On the other hand, in scikit-learn one must encode string or date-time like features or drop it.

Overall, I’m very excited about AutoML, because I did not have to learn the intricacies of the ML models nor worry about the implementation and still get the same results. I have done these in Python and R and it took me considerably longer to code up and setup the hyperparameters, whereas in AutoML I am just hitting a bunch of buttons and voila I am done! Given how much easier it is to get started with lesser effort, I encourage business analysts and even data scientists/analysts to try out AutoML!

marekr New Member
New Member

Automated ML integration with Power BI dataflows allows training and applying Binary Prediction, General Classification and Regression models. The ML models are internally represented as specially marked dataflow entities. I’ll describe how the ML related entities are defined in M language and how they can be edited using the Power Query editor. 


ghamers Occasional Visitor
Occasional Visitor

Data preparation can go a long way in improving the results of machine learning models. Before getting started with AutoML in Power BI, take a moment to explore some data cleaning techniques with your data. All of the necessary tools you’ll need already exist in the Power BI ecosystem. 


Standardization (Relabeling) 


Imagine you have a text column describing college degrees, i.e. “Master’s Degree”, “Bachelor’s Degree”, etc. Depending on how the data entry was done, you might end up with values such as “M.A.”, “Masters”, and “Master’s Degree”, all meaning the same thing. By default, a machine learning model will make no assumptions about these fields being synonyms and end up treating them as unique entries. In most cases, it would be ideal if the model analyzed these varying entries the same way. 


Once your data is available as an entity in Power Query Online, you can remedy this discrepancy using the “Replace values” feature. To discover this functionality, simply right click your desired column header and select “Replace values”. Use the “Advanced” mode to match the entire contents of your column’s cells rather than the default partial matching. 




Discretizing and Binning (Bucketing) 


With AutoML in PowerBI, it is possible to create a powerful prediction model if your entity has a True/False column. This is a column containing two distinct values indicating separate states. numeric column is the easiest type of column to convert to a True/False column. In Power Query Online, this conversion can be achieved by using a Conditional ColumnIn this example, imagine you have a numeric column of scores ranging from 0 to 100We can narrow this column to a True/False column by separating values above sixty and those below. After adding the conditional column, remember to set the column type of the new column to True/False by clicking the type icon next to the column header. 






Removing Outliers 


There are times when a numeric column may have entries that are largely different from the rest of the values in a column. In most cases, the presence of these outlier values provides little benefit for a machine learning model. Let’s say we define an outlier for a numeric column as a value falling outside two standard deviations above or below the median value of a columnIn this section we build upon the concept of using conditional columns and enhance it with a little extra Power Query magic. For your table entity, open the advanced editor: 






Next locate the column for which you wish to remove outliers. In this case the column header is “Fare”. We will create two new variables to store the values of the column’s standard deviation as well as the median. 


#"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2, 
#"Medium Value" = List.Median(#"Changed column type"[Fare]), 



Now we will use a conditional column to identify an outlier by comparing the Fare value to median value plus or minus two times the standard deviation.  If the value falls outside of that range, then we set the value to the overall median value. 



#"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 



When using this code snippet, simply replace “Fare” with the header name of your desired column. An example section of Power Query code to perform the outlier replacement follows: 



    Source = Csv.Document(Web.Contents("<YOUR_CSV_SOURCE"), [Delimiter = ",", Columns = 12, QuoteStyle = QuoteStyle.None]),  
    #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),  
    #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"Fare", type number}}),  
    #"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2,  
    #"Medium Value" = List.Median(#"Changed column type"[Fare]),  
    #"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 
    #"Outliers Replaced" 



With only these simple techniques at your disposal, you can start to build powerful, narrowed down machine learning models in Power BIKeep a watch out for more advanced techniques that will be covered in future posts. In the meantime, get started building models in Power BI today and get a feel for how a bit of data preparation can have a positive impact on the resulting model reports. 



Yasin Shtiui | Software Engineer II at Microsoft Power BI (Artificial Intelligence) team

Garrett Hamers | Software Engineer at Microsoft Power BI (Artificial Intelligence) team

Yuchen Visitor

Power BI, in the latest release, added support for supervised automated machine learning.This means that Power BI can help predict ‘unknowns’ once it learns from the ‘known’ values.


Helpful resources

Join Blog
Interested in blogging for the community? Let us know.