Managing Complexity

by Christian_Berg on 12-05-2016 10:28 AM

Missed the previous parts of this series? See Become your organization’s strategic advisor by using Machine Learning and Power BI

 

In part one, we considered how correlation plots can easily be integrated into Power BI reports to show how various factors relate to one another. That has been a useful starting point for me when looking at new data for the first time. It can also be useful for challenging old wisdom or for performing an initial ranking of potential investments. A correlation plot, however, only compares two columns at a time, e.g. how correlated revenue and inventory levels are.

 

Unfortunately, the world is often more complicated than that. For instance, revenue and inventory may be positively correlated during the high season but negatively correlated during the other months of the year (if a lack of stock hurts sales). A simple correlation plot would not be able to show that. In this post, we will therefore look at a simple approach to detect simultaneous interactions between multiple variables.

 

Used correctly, this type of analysis can be a great tool for understanding a business and its underlying sensitivities. This can be incredibly valuable when deciding which investments to prioritize. There are multiple possible approaches to this type of multivariate analysis. A simple and effective one is a binary tree. It tries to explain one thing by simultaneously using all other available information. Say for instance that we are trying to explain (or predict) variations in store revenue using the month of the year, marketing investments and product promotions. The tree algorithm would attempt to group the rows into multiple subsets with the goal of minimizing the range of revenue in each subset. In other words, it tries to find subsets that have as similar values for the target column as possible.

 

For this example, the rows with the highest revenue may all have coincided with significant marketing activities and product promotions, and fall in the month of December. The tree would depict this with three or more splits on these variables. The sequence of the splits is determined by how much of the variation in revenue they explain. If December has 300% higher revenue than any other month, while marketing on average gives a 50% boost to sales vs. 30% for product promotions, then Month = December would be the first split. The second split, however, would not necessarily be marketing activities, since the algorithm now only considers the data rows for December. This is the important difference between a correlation plot, which always compares all rows, and a tree, which can discern strong correlations within a subset. In this example, management may for instance realize that product promotions have only a negligible return in December but are a good investment in all other months of the year.
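The idea can be sketched in a few lines of R with the rpart package (the column names and numbers below are hypothetical, simulated stand-ins, not the post's actual data): we make December's effect on revenue much larger than that of marketing or promotions and check which variable the tree splits on first.

```r
# Simulated data (hypothetical): December adds far more revenue than
# marketing or promotions, so the tree should split on month first.
library(rpart)

set.seed(1)
df <- data.frame(
  month     = factor(rep(month.name, each = 12)),
  marketing = runif(144),
  promotion = runif(144)
)
df$revenue <- 100 + 50 * df$marketing + 30 * df$promotion +
  ifelse(df$month == "December", 300, 0) + rnorm(144, sd = 5)

fit <- rpart(revenue ~ month + marketing + promotion, data = df)
print(fit)  # the root split isolates month = December
```

Within the December branch the algorithm then re-evaluates the remaining variables on December's rows only, which is exactly the behavior described above.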

 

Below is another simple example, built using the following table, of how trees can complement correlation plots.

 

cb1.png

 

Here two factors (A and B) are used to try and predict the third column (Outcome). Individually both factors seem to have some positive impact on the outcome. However, when both factors have the value “2” the outcome is much higher. Of course, this is not enough data for statistical significance but it illustrates the point.

 

Using a correlation plot, we can see an average 12% correlation between the outcome and Factor A, vs. 32% for Factor B.

cb2.png
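For reference, the numbers behind a plot like this are plain pairwise correlations, which `cor()` computes directly. A minimal sketch with made-up stand-in values (not the table above):

```r
# Stand-in data frame mimicking the shape Power BI passes to an R visual.
dataset <- data.frame(
  FactorA = c(1, 1, 2, 2, 3, 3),
  FactorB = c(1, 2, 2, 3, 1, 3),
  Outcome = c(2, 3, 12, 5, 3, 6)
)

round(cor(dataset), 2)                 # full pairwise correlation matrix
cor(dataset$FactorA, dataset$Outcome)  # one pair, as read off the plot
```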

If we instead use a tree we see a much richer explanation of the interactions. To get to the highest outcome (bottom right with an average value of 12) we read the tree from the top:

  1. “Factor B < 1.5” should be “no” (i.e. B should be greater than 1.5),
  2. “Factor A < 1.5” should also be “no” (we again take the branch to the right),
  3. “Factor B >= 2.5” should be “no” (again right which is always “no”) and finally,
  4. “Factor A >= 2.5” should be “no”.

cb3.png

In other words, the tree tells us that the highest value for the column “Outcome”, our target, is achieved when Factors A and B are both at least 1.5 and less than 2.5. Since we know from our data that both factors are whole numbers, we can simplify this further: we observe the highest values for “Outcome” when the two factors both have the value 2.

 

In addition to being very valuable tools for analysis, trees are a great introduction to more complex Machine Learning scenarios, which we will cover in subsequent posts.

 

cb4.png

 

Here we see the same data as in the two previous posts. The bar chart to the left shows the all-up revenue by Product ID and is mainly used to select one product at a time (here, product 6 is selected). The tree plot in the top right shows the same data as the correlation plot below it. Its first input column is revenue, which is the number it is trying to forecast by splitting the dataset at different thresholds of the other columns. For product 6, the best split is on seasonality (see post 2), followed by promotion. The highest revenue bucket for product 6, with an average of $35k per month, is expected when seasonality is greater than or equal to 0.95 and there is an active promotion for that product:

 

cc1.png

 

 The second highest revenue bucket with an average of $33k per month is expected when seasonality is greater than or equal to 1.2, regardless of whether there is an active promotion:

 

Based on that, one conclusion could be that the return on promotions is lower during the top season.

 

The importance of seasonality in explaining revenue could, in this case, have been guessed from the correlation plot, but the insight about diminishing returns on promotions as the annual cycle picks up would have been missed:

 

cc2.PNG

The tree and correlation plots were both made using R visuals with the same columns as input.

 

Using the report

 

For this report to work in Power BI Desktop you need the packages rpart, rpart.plot, RColorBrewer, RGtk2 and rattle installed. If you have not previously used these packages on your computer, install them by going to your R console in RStudio or another R GUI and running install.packages(c("rattle", "rpart", "rpart.plot", "RColorBrewer", "RGtk2")).
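A small convenience sketch for that setup step: it checks which of the required packages are missing and prints the matching install command. Run it in your R console, not in the Power BI script editor.

```r
# Check for the packages this report needs and report any that are missing.
pkgs <- c("rattle", "rpart", "rpart.plot", "RColorBrewer", "RGtk2")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  cat("Install the missing packages with:\n",
      sprintf("install.packages(c(%s))\n",
              paste(sprintf('"%s"', missing), collapse = ", ")))
} else {
  cat("All required packages are installed.\n")
}
```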

 

By clicking on the different products, we can examine which combination of factors is most important in determining a given product’s revenue. Using Power BI’s standard filters we can also drill in to specific subsets of the data. If we for instance use product 6 again, we see that the first split the model makes is on the column “Seasonality”.

cb8.png

 

To validate this, we can add Seasonality to the Page level filters:

 

cb9.png

 

 This leaves all but the leftmost branch and reduces the all-up correlation between Product 6’s revenue and Seasonality.

 

c10.png

 

 

cb11.png

 

In a second step, we can add Active promotion to the Page level filters pane and set it to “is greater than 0.5”. That will show NA values in the correlation plot for Active promotion, since that column now only has one value, namely “1”. It might therefore be helpful to convert that visualization to a scatter plot, e.g. SandDance, to explore the remaining rows of data. For small datasets, a table might be the best option. Here I converted the correlation plot to a standard table and added Revenue again, this time aggregating it as Standard deviation (using the quick calculation option under “Values” in the right-hand menu, visible after clicking the down-arrow button). I also changed the aggregation for some of the other columns below to help me understand this subset of the data.

 

cb12.png

 

As you can see, the average revenue is $35k, identical to the leaf in the tree that we filtered down to. However, from the values in that column and the standard deviation of $16.8k we can see that filtering on Seasonality and Active promotion still leaves us with a lot of variance in revenue, and with only 14 data points we need to be cautious about reading too much into these results.

 

Recreating the report

 

R visuals will automatically work in Power BI when the report is published to the service, where the most common packages have already been installed, but to get them to work in Power BI Desktop you need a local R installation (please see my previous post for installation instructions).

 

To create the binary tree R visual:

 

      1. Open a pbix file with the relevant data, e.g. the example file attached at the bottom of this blog post, and click on the + sign in the bottom row to create a new page.
      2. Insert a simple standard visual, like a table, and add the columns that you want to analyze. Because an R visual will not display anything until you have added a script, this is a great starting point; it also helps you spot problems with your data that would cause the R script to fail, e.g. empty rows or text values mixed with numbers.

        cc3.png

      3. Define granularity. The tree’s splits are calculated to minimize the variance in values by row, for the first column in the dataset. It is therefore important to get the granularity right. In this case I want to understand whether an investment in a promotion impacts the revenue for that month, using price and seasonality as additional explanatory factors. To make this level of detail explicit I added the column [Month_ID] to the table and verified that it had the aggregation option “Don’t summarize”.

        cc4.png

      4. Scroll through a representative amount of the dataset to ensure that you don’t have any blank or NA values that are likely to cause a runtime error when the script runs. Once I’m happy with the data input (this will be the raw data that R receives), I convert the visual to R by selecting it in the canvas and then clicking on the R visualization icon.

        cc5.png

      5. Add the script using the R script* editor (visible when you select the R visual directly in the canvas). Note, you may have to expand it by clicking on the far-right arrow in the R script editor menu below the canvas.

        Once expanded you should see the below, with space for the R script.

        cc7.png

        Copy/paste the script below into the script editor (any line that starts with # or ## is a comment only and is not executed):

        ## Load the required libraries
        ## If you have not installed these packages you will get an error message saying that the libraries were not found. In that case install them by copy/pasting: install.packages(c("rattle", "rpart", "rpart.plot", "RColorBrewer", "RGtk2")) directly in your R Console. The function c() combines values into lists in R.
        library(rpart)
        library(rattle)
        ## rpart, like our previous examples is sensitive to "NAs" and missing values. na.omit() deletes those rows.
        ## Alternatively, you can use the Power BI filter functions to filter out rows with missing values
        treenona <- na.omit(dataset)
        ## In this example I used month id to ensure I aggregated by month. However, I don't want to include it in the calculation so I delete the last column. 
        treenona <- treenona[, 1:(ncol(dataset)-1)]
        ## I'm trying to predict Revenue but here I'll rename it to Sales which I will later reference
        names(treenona)[1]<-"Sales"
        
        ## Calculate the splits to explain/predict column "Sales"
        ## with a minimum of 7 (21/3) observations per bucket 
        tree.1 <- rpart(Sales ~ .,data=treenona,control=rpart.control(minsplit=21,cp=0.015)) 
        ## Note on the above: the value for "cp" determines if a split is worthwhile; a lower value leads to more splits/granularity
        ## Plot the tree with text size 110% (1.1)
        fancyRpartPlot(tree.1, cex = 1.1, sub = NULL)
      6. Use the first arrow in the R script editor to manually execute the script (any change/interaction with the report will also cause the script to be rerun).

        cc8.png

To recreate this visual for your own data the steps will be almost identical. It is important, however, to remember to get the granularity right.
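One way to sanity-check granularity before pointing the tree at your data is to roll it up in R first. A sketch with hypothetical daily rows (made-up column names and values) aggregated to the monthly level the tree should model:

```r
# Hypothetical daily data rolled up to one row per month; the tree then
# explains monthly revenue rather than day-to-day noise.
daily <- data.frame(
  Month_ID  = rep(1:3, each = 10),
  Revenue   = c(rnorm(10, 100), rnorm(10, 150), rnorm(10, 120)),
  Promotion = rep(c(0, 1, 0), each = 10)
)
monthly <- aggregate(cbind(Revenue, Promotion) ~ Month_ID,
                     data = daily, FUN = mean)
print(monthly)  # one row per Month_ID
```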

 

Links and downloads

 

Decision trees in the R showcase on Power BI

 

* Third-party programs. This software enables you to obtain software applications from other sources. Those applications are offered and distributed by third parties under their own license terms. Microsoft is not developing, distributing or licensing those applications to you, but instead, as a convenience, enables you to use this software to obtain those applications directly from the application providers.

 

By using the software, you acknowledge and agree that you are obtaining the applications directly from the third-party providers and under separate license terms, and that it is your responsibility to locate, understand and comply with those license terms. Microsoft grants you no license rights for third-party software or applications that is obtained using this software.

 

Attachment
Comments
by tringuyenminh92 on 12-06-2016 12:11 AM

 


Hi @Christian_Berg,

 

I'm new to this domain, so could you please explain how we get 1.5 or 2.5? I don't understand where these numbers come from. Also, in the next picture, n=14 - what does n mean?

 

Best regards

Tri Nguyen

by Christian_Berg on 12-06-2016 06:21 PM

Hi Tri

 

Thank you for your question. I agree, the decimals are confusing since the underlying data, in that case, consists only of whole numbers. The algorithm tries to find ways to split the data so that the rows in each resulting group are as similar as possible. In this case the rows with value 2 have a very different (much higher) outcome than the other rows. It correctly identifies that rows where the factor is larger than 1.5 yet smaller than 2.5 should be in a group of their own. Since the data contains only whole numbers, the only factor value that is both larger than 1.5 and smaller than 2.5 is 2. Does that make sense?

 

"n" refers to the sample size at that point in the tree. The first split is done on all of the input data which in that case was 14 rows of data, i.e. n=14.

 

Thank you and please let me know if you have any follow up questions when you try to apply it to your own business scenarios

 

 

 

Christian