Three ways to prepare data for machine learning in...

ghamers · ‎03-29-2019

Data preparation can go a long way in improving the results of machine learning models. Before getting started with AutoML in Power BI, take a moment to explore some data cleaning techniques with your data. All of the necessary tools you’ll need already exist in the Power BI ecosystem.

Standardization (Relabeling)

Imagine you have a text column describing college degrees, i.e. “Master’s Degree”, “Bachelor’s Degree”, etc. Depending on how the data entry was done, you might end up with values such as “M.A.”, “Masters”, and “Master’s Degree”, all meaning the same thing. By default, a machine learning model will make no assumptions about these fields being synonyms and end up treating them as unique entries. In most cases, it would be ideal if the model analyzed these varying entries the same way.

Once your data is available as an entity in Power Query Online, you can remedy this discrepancy using the “Replace values” feature. To discover this functionality, simply right click your desired column header and select “Replace values”. Use the “Advanced” mode to match the entire contents of your column’s cells rather than the default partial matching.

Discretizing and Binning (Bucketing)

With AutoML in PowerBI, it is possible to create a powerful prediction model if your entity has a True/False column. This is a column containing two distinct values indicating separate states. A numeric column is the easiest type of column to convert to a True/False column. In Power Query Online, this conversion can be achieved by using a Conditional Column. In this example, imagine you have a numeric column of scores ranging from 0 to 100. We can narrow this column to a True/False column by separating values above sixty and those below. After adding the conditional column, remember to set the column type of the new column to True/False by clicking the type icon next to the column header.

Removing Outliers

There are times when a numeric column may have entries that are largely different from the rest of the values in a column. In most cases, the presence of these outlier values provides little benefit for a machine learning model. Let’s say we define an outlier for a numeric column as a value falling outside two standard deviations above or below the median value of a column. In this section we build upon the concept of using conditional columns and enhance it with a little extra Power Query magic. For your table entity, open the advanced editor:

Next locate the column for which you wish to remove outliers. In this case the column header is “Fare”. We will create two new variables to store the values of the column’s standard deviation as well as the median.

#"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2, 
#"Medium Value" = List.Median(#"Changed column type"[Fare]),

Now we will use a conditional column to identify an outlier by comparing the Fare value to median value plus or minus two times the standard deviation. If the value falls outside of that range, then we set the value to the overall median value.

#"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare])

When using this code snippet, simply replace “Fare” with the header name of your desired column. An example section of Power Query code to perform the outlier replacement follows:

let 
    Source = Csv.Document(Web.Contents("<YOUR_CSV_SOURCE"), [Delimiter = ",", Columns = 12, QuoteStyle = QuoteStyle.None]),  
    #"Promoted headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),  
    #"Changed column type" = Table.TransformColumnTypes(#"Promoted headers", {{"Fare", type number}}),  
    #"Two Standard Deviations" = List.StandardDeviation(#"Changed column type"[Fare]) * 2,  
    #"Medium Value" = List.Median(#"Changed column type"[Fare]),  
    #"Outliers Replaced" = Table.AddColumn(#"Changed column type", "New column", each if [Fare] < #"Medium Value" - #"Two Standard Deviations" then #"Medium Value" else if [Fare] > #"Medium Value" + #"Two Standard Deviations" then #"Medium Value" else [Fare]) 
in 
    #"Outliers Replaced"

With only these simple techniques at your disposal, you can start to build powerful, narrowed down machine learning models in Power BI. Keep a watch out for more advanced techniques that will be covered in future posts. In the meantime, get started building models in Power BI today and get a feel for how a bit of data preparation can have a positive impact on the resulting model reports.

Yasin Shtiui | Software Engineer II at Microsoft Power BI (Artificial Intelligence) team

Garrett Hamers | Software Engineer at Microsoft Power BI (Artificial Intelligence) team

Three ways to prepare data for machine learning in Power BI

Unveiling the Power of Lakehouse's SQL Analytics E...

How to provide sample data in the Power BI Forum

Re: Power BI Tip: How To Sort A Column With Duplic...

Important Links every Power BI enthusiastic shoul...

Enhanced Data Profiling in Power Query: GUI and Ta...