Pss! There is a Power BI dashboard at the end of the blog post, don’t miss it!
The Bill Gates Effect Explained
A business owner is trying to understand the average income of the clients in the bar next door so he hires a student to interview the customers.
When the student had finished the interviews and the analysis, the person with the highest income leaves the bar and Bill Gates comes in. Frustrated, he re-runs the analysis this time with Bill Gates data instead.
The student comes back to the business owner and reports that the average customer in the next door bar is a billionaire!!
Average income= $ 7 900 042 500
The business owner ask the student to show the calculations and after reviewing the data understands the ”mistake”.
Income average Before Bill enters the bar:
Income average After Bill enters the bar:
Most people would agree that $47,500 (the median) is a better reflection of the income amounts for the people sitting in the bar and that the amount of $8 billion has nothing to do with the reality of the 9 people in the bar who are not Bill Gates.
Side Note: I was refreshing my statistical knowledge for my Lean Six Sigma certification when I found that example, and since then I have never had to search for those terms again: average, mean and median. This clearly demonstrated for me how powerful Story Telling can be.
But, how does the “Bill Gates Effect” relates to Google Analytics?
You might already see where this is going, but let me give you an example using Site speed metrics given by Google Analytics. In Excel, I have the downloaded the following metrics for two days: 8th and 7th of February 2016:
The Calc. Load time (sec) is a manual calculation of Avg. Page Load Time according to Google Analytics formula in Power BI:
Sum(Page Load Time (ms))/1000/sum(Page Load Sample),
and of course, it yields the same results as if you would take Avg. Page Load Time (sec) directly from Google Analytics.
Let’s calculate the “averages” and Median for both days:
Let me explain the metrics:
1. Avg. Page Load Time (sec) is the metric you will get from Google Analytics 2. Avg Load time is the average the column Avg. Page Load Time (sec) 3. Median is the Median of the Avg. Page Load Time (sec)
Here is an example with Google Analytics Average Page Time and our calculated Power BI Median segmented by country:
There is a huge difference between the metrics when you have outliers in your data, right?
What metrics should you use?
Now that you understand the problems using average, you will be horrified by all the averages Google Analytics provides without any consideration to outliers.
Between the average and the mean, here is the Rule of thumb of which metric to use:
• For data sets with outliers (remember, Bill Gates in the club), it is better to use Median • For symmetric data sets, use the mean.
It is a bit more complex than that, refer to Introductory Statistics if you want to dig deeper, but this is good enough for our analysis.
Do I have outliers?
I can almost guarantee you that you do, but, let's found out.
Using Power BI (POwer Query), we categorize the time in bins:
Bins = if [#"Avg. Page Load Time (sec)"]<=1 then "a) 0-1" else if [#"Avg. Page Load Time (sec)"]<=3 then "b) 1-3" else if [#"Avg. Page Load Time (sec)"]<=7 then "c) 3-7" else if [#"Avg. Page Load Time (sec)"]<=13 then "d) 7-13" else if [#"Avg. Page Load Time (sec)"]<=21 then "e) 13-21" else if [#"Avg. Page Load Time (sec)"]<=25 then "f) 21-25" else if [#"Avg. Page Load Time (sec)"]<=35 then "g) 25-35" else if [#"Avg. Page Load Time (sec)"]<=60 then "g) 55-60" else if [#"Avg. Page Load Time (sec)"]<=180 then "h) 60-180" else "i) +180"
And visualize them in a bar chart:
Or you can use one of the newest custom visualizations, the box plot: (the outliers are the red dots)
Can I kick Bill out from my club?
You cannot remove your outliers, but to understand why, we need to understand how site speed is measured. Here is a timeline for how Google Analytics measures site speed and page load:
Simplifying things a lot, you have to take into account:
• People’s internet speed, that varies by location • your webserver and domain name service, • how well your website is built, • and which browser is being used.
Looking back at our example above, 28 seconds might not be an outlier, but a normal performance at some locations and or performance with some browsers.
What should you do then?
Segmentation, segmentation, segmentation
What I normally do is to segment my data by those factors and look for patterns there.
Here is a great example, using our bar chart and filtering them by continent, you can see that the majority of the data in Africa falls in the 7-13 secs bin, while in America falls into the 3-7 secs category.
Makes sense, right? America has probably better internet conenctions than Africa.
So, an outlier in America is not an outlier in Africa.
Does this mean that I don't have outliers? Looking at the tail in the bar chart above, we still have "true" outliers, and those, you can remove.
Let’s dig deeper to the Top Offenders to uncover our outliers:
One person from Russia, had a Page Load Time of 551 Secs!! And the median for Europe is 5,93, so that is definitely an outlier.
In the meantime, Mean instead of averages will give you a better picture of reality. Remember our chart Page Loads by Location?
When we talk about Page Loads, seconds are as important as milliseconds for a 100 meters runner, so it is important to get as close to the truth as possible.
Are you still with me? Great, because here is where the magic happens.
This weeks dashboard will help you monitor your website speed.
I am not going to go into the ongoing discussion about how site speed, is or is not, important in Google search ranking, there are tons of sources for and against it and none I have found that are conclusive.
What I worry about is my visitors and not Google. If they are happy, and there are ways to measure that, Google will be happy too. I obsess about right speed at the visitor’s location, and the time it takes before my visitors can interact with the content, instead of how fast the entire page loads.
Use your data to find out: segment by important locations for your business and monitor them closely. If you are having issues there, then go to browser, device, connection times, etc, to improve those interaction load times.
What insights can we get with this dashboard?
As usual, we start with the insights from the dashboard:
• In this dashboard you will be able to monitor trends on your site speed performance • You will be able to see if important locations are performing well • If your site loads as fast on all device types (mobile, tablet, desktop) • And if your most used browsers are performing well.
This is a monitoring dashboard, but with Power BI’s cross filtering capabilities, you can gain insights as to where you have performance issues pretty quickly.
Don't want to create the Power BI file? If you are a member on our site (it's free), you can download it by following this link. You will also get there an in-depth explanation on how to use/interpreted the dashboard.
If you are new to Power BI, this guide will take some time to complete, otherwise, it is just a few formulas and you are good to go. Skip to the formulas if that is all you need.
Creating the dashboard: Get Data
Before you can start creating this dashboard you need to have Power BI Desktop installed (it’s FREE) and connect to Google Analytics. Once you have done that, we can start importing the data in Power BI.
Let’s do it, open Power Bi Desktop and click on “Get Data”:
Select Google Analytics and click “Connect”:
Select your account, property, and view:
and now we will export a table with the following parameters:
Sort Year Month = Number.ToText(Date.Year([Date]))&Text.PadStart([Month],2,"0")
And click on “Close and Apply”:
Building the Dashboard
Now that we have all the data we need, let’s start building the dashboard.
We need to create some measures first:
Power BI Measures
Google Analytics Average Page Load Time (sec) GA Page Load Time (sec) = DIVIDE(DIVIDE(SUM('Site speed'[Page Load Time (ms)]),1000),SUM('Site speed'[Page Load Sample]))
Median Page Load Time (Sec) Page Load Time-Median (sec) = MEDIAN('Site Speed'[Avg. Page Load Time (sec)])
Median Document Interactive Time (sec) Median Document Interactive Time (sec) = MEDIAN('Site Speed'[Avg. Document Interactive Time (sec)])
Median Content Loaded Time (ms) Median Document Content Loaded Time (sec)=MEDIAN('Site Speed'[Avg. Document Content Loaded Time (sec)])
Median Domain Lookup Time (sec) Median Domain Lookup Time (sec) = MEDIAN('Site Speed'[Avg. Domain Lookup Time (sec)])
Median Page Download Time (sec) Median Page Download Time (sec) = MEDIAN('Site Speed'[Avg. Page Download Time (sec)])
Median Redirection Time (sec) Median Redirection Time (sec) = MEDIAN('Site Speed'[Avg. Redirection Time (sec)])
Median Server Connection Time (sec) Median Server Connection Time (sec) = MEDIAN('Site Speed'[Avg. Server Connection Time (sec)])
Median Server Response Time (sec) Median Server Response Time (sec) = MEDIAN('Site Speed'[Avg. Server Response Time (sec)])
Calculate the number of pages used to determine site speed Page Sampled = SUM('Site Speed'[Page Load Sample])
Create the visualizations
The visualizations for the dashboard are so easy, that I don’t think I need to go through how to create them, this blog post is long already as is…. Let me know if you have questions on how to create them in the comments and I will help you out.