Missed the introduction to this series? See Become your organization’s strategic advisor by using Machine Learning and Power BI
Having interviewed and worked with hundreds of business analysts and P&L owners, I am impressed by the capacity of human intuition. People who have been in role for more than a couple of years can often balance multiple complex considerations and quickly make a good decision. This frees up time for the employee. Unfortunately, this spare time is more often used to increase the frequency at which they look at the standard reports, rather than deep analysis to try and challenge their already impressive understanding of the business. A common problem with this is confirmation bias and existing misunderstandings of true drivers can persist or a change in underlying fundamental is overlooked. In the next couple of posts, I will share techniques that I have seen successfully employed to counter common heuristics and to increase the chance of continuous improvement.
Below is an example of a dataset similar to that of a retail chain which a regional store manager shared with me. The initial goal of that project was to change the main business report to update hourly instead of weekly. The store manager had made revenue growth and inventory management the top priorities a couple of years earlier; and believed it had been very successful. Hourly reports on what the stores were ordering was expected to lead to more accurate revenue predictions. The regional manager also wanted to be able to intervene in time if the implicit forecast was too low to meet all-up targets.
An initial look at the business revealed healthy revenue growth and low variations in weeks on hand. A basic analysis however also showed that discounts had increased in size, eroding profits. For descriptive analytics like this, I find it helpful to create a couple of views with one KPI at a time, e.g. marginal profit or revenue and display it for every relevant business hierarchy. For instance, product hierarchy, geography and time. While that is an effective way to show how the business is doing it does little to explain why. For that it tends to be more effective to show correlated KPIs for the same dimension, e.g. revenue vs. profit by store over time.
As a second step, I therefore frequently add a correlation plot. In addition to being simple to understand correlation plots are quick and easy to do. High correlation between two variables means that they move in the same direction, at the same rate. Because it is a relative measure there is rarely a need for normalization. Below is an R correlation plot in Power BI, showing a couple of months’ worth of data from the stores.
The blue to red column to the far right is the legend indicating whether two variables are strongly correlated (dark blue), have no correlation (white), or are inversely correlated (dark red). Large circles in a dark color are thus strongly correlated while small, lightly colored circles indicate weak correlation. The circles appear in the upper-right triangle in the matrix, and their corresponding correlation coefficient appear in the lower-left. For instance, price and discount have a perfect negative correlation.
The visual representation of this can be found on the row “Price” (second row) in the “Discount” column (fourth column). The numeric correlation coefficient is in the “Price” column (2nd) on the “Discount” row (4th).
When we analyze the data using this visual a couple of things pop-out. Revenue and price have low negative correlation, i.e. a decrease in price will typically coincide with an increase in revenue. Revenue and inventory are strongly correlated so days that the stores carry a lot of inventory tend to coincide with high revenue. The key insight for the regional manager however was the -0.51 correlation between price and inventory. Before seeing this correlation plot the regional manager was unaware that the stores had, in response to stronger focus on inventory management, begun managing inventory primarily by changing the price. This was an unintended consequence of the directive which had been issued to improve store managers focus on accurate forecasting. Instead, the store managers had found an inventive way to achieve the same goal using price, thus hurting contribution margins and decreasing customer satisfaction. When the store managers found themselves carrying too much inventory they would simply run a large discount. Conversely when they were running low they would increase prices. By only comparing revenue and inventory this behavior had gone unnoticed by the leadership team.
Using correlation plots for business analytics
You have probably heard that correlation does not equal causation. That is misleading since two things happening at the same time often have a cause and effect relationship but it is important to remember alternative explanations. Let us say that for your business daily revenue for product A and B shows high correlation, e.g. 0.75. So 75% of the changes in revenue for A is captured in the changes for B. That does not necessarily mean that people who buy A has more value from B. Alternative explanations could be:
- Buyers of B have more value from A,
- Products A and B have similar seasonality (maybe they are both more popular on warm days),
- Promotions for A has by chance coincided with promotions for B, etc.
In other words, high correlation might be a coincidence (unlikely with a large enough set of data), might be because one causes the other, or because there is something else which affects both, e.g. product C. Typically the direction of the interaction is known or someone familiar with the business has a strong hypothesis. If not, a common second step is to investigate a difference in timing. If for instance marketing investments happen on average 3 days before you observe an increase in revenue you can be sure that revenue does not drive marketing, but that it is the other way around.
So, while correlation does not in itself prove causation, running a correlation analysis helps you validate your understanding of a business and tells you where to focus next. I typically let the correlations that might be interesting determine which different KPIs I show together for only a few business dimensions to let the data tell the story. In this case inventory by store and product was paired with average price to help the regional manager drive a change in behavior.
How to recreate the report
R visuals will automatically work in Power BI when the report is published to the service but to get them to work in Power BI Desktop you need to have R installed on your computer. R, a third-party program, can be downloaded from CRAN. Once installed, in Power BI Desktop go to File -> Options and settings -> Options -> R Scripting and validate that the R home directory path is the same as what you used to install the base. Please use this link for detailed instructions. You may also want an R dedicated script editor, e.g. RStudio (the Windows installation file is the first link under “Installers for Supported Platforms”.
Once you have R installed you can use R visuals just as you would use any other native visual in Power BI with the addition of an R script that you paste in the editor. An example file is attached to this post.
To create the correlation plot R visual:
- Open a pbix file with the relevant data, e.g. my attached example file and click on the + sign in the bottom row to create a new page.
- Because an R visual will not display anything until you have added a script, I start by adding a simple standard visual like a table that I add the columns that I want to analyze to. This is achieved by clicking on the desired visual under “Visualizations” and then selecting the columns of interest from the far-right menu.
- Correlations are calculated row by row so getting the granularity right is important. In this case I want to understand if a change in inventory for a particular day, product and store is correlated with the average price. To make this level of detail explicit I first added the three columns [Day], [Product] and [StoreId] to the table and for each verified that the aggregation option “Don’t summarize” was used.
- Similarly, I selected “Sum” for all the other columns except Price where I used “Average”.
- Once I’m happy with the data input (this will be the raw data that R receives) I convert the visual to an R visual by selecting the visual in the canvas and then clicking on the R visualization icon
- Add the script using the R script* editor (visible when you select the R visual directly in the canvas). Note, you may have to expand it by clicking on the far-right arrow in the R script editor menu below the canvas.
- Once expanded you should see the below, with space for the R script.
Copy/paste the script below into the script editor (any line that starts with # or ## is a comment only and is not executed):
## Use the corrplot library for the cor() function which calculates correlations and has the corrplot() function to visualize the output library(corrplot) ## Get rid of the first three columns in dataset since they, in this example, contain categorical values dataset <- dataset[,-1:-3] ## Calculate correlations between the remaining columns m <- cor(dataset) ## Plot the result with a light gray background color corrplot.mixed(m, bg="light gray")
- Use the first arrow in the R script editor to manually execute the script (any change/interaction with the report will also cause the script to be rerun.
To recreate this visual for your own data the steps will be almost identical. It is important however to remember to get the granularity right. If you don’t need to do any aggregation you would skip in this case the first three columns [Day], [Product] and [StoreId], set all of the other fields to “Don’t summarize” and delete the line of code that excludes the first three columns from the calculation, namely “dataset <- dataset[,-1:-3]” and its preceding comment, i.e. so that the script looks like:
## Use the corrplot library for the cor() function which calculates correlations and has the corrplot() function to visualize the output library(corrplot) ## Calculate correlations between the remaining columns m <- cor(dataset) ## Plot the result with a light gray background color corrplot.mixed(m, bg="light gray")
Links and downloads
Understanding the correlation coefficient: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Example of correlation plot in the R showcase: http://community.powerbi.com/t5/R-Script-Showcase/Correlation-Plot/m-p/58462
Additional R installation instructions for Power BI: http://powerbi.tips/2016/09/using-r-visuals-in-power-bi/
* Third-party programs. This software enables you to obtain software applications from other sources. Those applications are offered and distributed by third parties under their own license terms. Microsoft is not developing, distributing or licensing those applications to you, but instead, as a convenience, enables you to use this software to obtain those applications directly from the application providers.
By using the software, you acknowledge and agree that you are obtaining the applications directly from the third-party providers and under separate license terms, and that it is your responsibility to locate, understand and comply with those license terms. Microsoft grants you no license rights for third-party software or applications that is obtained using this software.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.