cancel
Showing results for 
Search instead for 
Did you mean: 

Three ways to prepare data for machine learning in Power BI

Data preparation can go a long way in improving the results of machine learning models. Before getting started with AutoML in Power BI, take a moment to explore some data cleaning techniques with your data. All of the necessary tools you’ll need already exist in the Power BI ecosystem. 

 

Standardization (Relabeling) 

 

Imagine you have a text column describing college degrees, i.e. “Master’s Degree”, “Bachelor’s Degree”, etc. Depending on how the data entry was done, you might end up with values such as “M.A.”, “Masters”, and “Master’s Degree”, all meaning the same thing. By default, a machine learning model will make no assumptions about these fields being synonyms and end up treating them as unique entries. In most cases, it would be ideal if the model analyzed these varying entries the same way. 

 

Once your data is available as an entity in Power Query Online, you can remedy this discrepancy using the “Replace values” feature. To discover this functionality, simply right click your desired column header and select “Replace values”. Use the “Advanced” mode to match the entire contents of your column’s cells rather than the default partial matching. 

 

 

Standardization.PNG


Discretizing and Binning (Bucketing) 

 

With AutoML in PowerBI, it is possible to create a powerful prediction model if your entity has a True/False column. This is a column containing two distinct values indicating separate states. numeric column is the easiest type of column to convert to a True/False column. In Power Query Online, this conversion can be achieved by using a Conditional ColumnIn this example, imagine you have a numeric column of scores ranging from 0 to 100We can narrow this column to a True/False column by separating values above sixty and those below. After adding the conditional column, remember to set the column type of the new column to True/False by clicking the type icon next to the column header. 

 

 

Bucketing.PNG

 

 

Removing Outliers 

 

There are times when a numeric column may have entries that are largely different from the rest of the values in a column. In most cases, the presence of these outlier values provides little benefit for a machine learning model. Let’s say we define an outlier for a numeric column as a value falling outside two standard deviations above or below the median value of a columnIn this section we build upon the concept of using conditional columns and enhance it with a little extra Power Query magic. For your table entity, open the advanced editor: 
 

 

 

Outliers.PNG

 

 

Next locate the column for which you wish to remove outliers. In this case the column header is “Fare”. We will create two new variables to store the values of the column’s standard deviation as well as the median. 
 

 

#"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2, 
#"Medium Value" = List.Median(#"Changed column type"[Fare]), 

 

 

Now we will use a conditional column to identify an outlier by comparing the Fare value to median value plus or minus two times the standard deviation.  If the value falls outside of that range, then we set the value to the overall median value. 

 

 

#"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 

 

 

When using this code snippet, simply replace “Fare” with the header name of your desired column. An example section of Power Query code to perform the outlier replacement follows: 

 

 

let 
    Source = Csv.Document(Web.Contents("<YOUR_CSV_SOURCE"), [Delimiter = ",", Columns = 12, QuoteStyle = QuoteStyle.None]),  
    #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),  
    #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"Fare", type number}}),  
    #"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2,  
    #"Medium Value" = List.Median(#"Changed column type"[Fare]),  
    #"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 
in 
    #"Outliers Replaced" 

 

 

With only these simple techniques at your disposal, you can start to build powerful, narrowed down machine learning models in Power BIKeep a watch out for more advanced techniques that will be covered in future posts. In the meantime, get started building models in Power BI today and get a feel for how a bit of data preparation can have a positive impact on the resulting model reports. 

 

 

Yasin Shtiui | Software Engineer II at Microsoft Power BI (Artificial Intelligence) team

Garrett Hamers | Software Engineer at Microsoft Power BI (Artificial Intelligence) team