The ”Bill Gates Effect” in Google Analytics and how to get rid of it with Power BI

by ruthpozuelo on ‎03-03-2016 11:43 PM

 

Did you know that when Bill Gates enters a bar, the average customer becomes a billionaire? That is what I call the Bill Gates Effect. (Many thanks to Introductory Statistics)


Pss! There is a Power BI dashboard at the end of the blog post, don’t miss it!

 

The Bill Gates Effect Explained


A business owner is trying to understand the average income of the clients in the bar next door so he hires a student to interview the customers.


When the student had finished the interviews and the analysis, the person with the highest income leaves the bar and Bill Gates comes in. Frustrated, he re-runs the analysis this time with Bill Gates data instead.


The student comes back to the business owner and reports that the average customer in the next door bar is a billionaire!!


Average income= $ 7 900 042 500


The business owner ask the student to show the calculations and after reviewing the data understands the ”mistake”.


Income average Before Bill enters the bar:

 

 bill gates before.png

 

Income average After Bill enters the bar:

 

 

 bill gates after.png

 

Most people would agree that $47,500 (the median) is a better reflection of the income amounts for the people sitting in the bar and that the amount of $8 billion has nothing to do with the reality of the 9 people in the bar who are not Bill Gates.


Side Note: I was refreshing my statistical knowledge for my Lean Six Sigma certification when I found that example, and since then I have never had to search for those terms again: average, mean and median. This clearly demonstrated for me how powerful Story Telling can be. Smiley Happy


But, how does the “Bill Gates Effect” relates to Google Analytics?


You might already see where this is going, but let me give you an example using Site speed metrics given by Google Analytics.
In Excel, I have the downloaded the following metrics for two days: 8th and 7th of February 2016:

 

The Calc. Load time (sec) is a manual calculation of Avg. Page Load Time according to Google Analytics formula in Power BI:

 

Sum(Page Load Time (ms))/1000/sum(Page Load Sample),


and of course, it yields the same results as if you would take Avg. Page Load Time (sec) directly from Google Analytics.

 

 

 ga site speed with outliers.png

Let’s calculate the “averages” and Median for both days:

 

site speed averages.png

 

Let me explain the metrics:


1. Avg. Page Load Time (sec) is the metric you will get from Google Analytics
2. Avg Load time is the average the column Avg. Page Load Time (sec)
3. Median is the Median of the Avg. Page Load Time (sec)


Here is an example with Google Analytics Average Page Time and our calculated Power BI Median segmented by country:

 

image005.png

 

There is a huge difference between the metrics when you have outliers in your data, right?

 
What metrics should you use?


Now that you understand the problems using average, you will be horrified by all the averages Google Analytics provides without any consideration to outliers.


Between the average and the mean, here is the Rule of thumb of which metric to use:


• For data sets with outliers (remember, Bill Gates in the club), it is better to use Median
• For symmetric data sets, use the mean.

 

It is a bit more complex than that, refer to Introductory Statistics if you want to dig deeper, but this is good enough for our analysis.

 

Do I have outliers?


I can almost guarantee you that you do, but, let's found out.

 

Using Power BI (POwer Query), we categorize the time in bins:

 

Bins = if [#"Avg. Page Load Time (sec)"]<=1 then "a) 0-1" else if [#"Avg. Page Load Time (sec)"]<=3 then "b) 1-3" else if [#"Avg. Page Load Time (sec)"]<=7 then "c) 3-7" else if [#"Avg. Page Load Time (sec)"]<=13 then "d) 7-13" else if [#"Avg. Page Load Time (sec)"]<=21 then "e) 13-21" else if [#"Avg. Page Load Time (sec)"]<=25 then "f) 21-25" else if [#"Avg. Page Load Time (sec)"]<=35 then "g) 25-35" else if
[#"Avg. Page Load Time (sec)"]<=60 then "g) 55-60" else if [#"Avg. Page Load Time (sec)"]<=180 then "h) 60-180" else "i) +180"

 

And visualize them in a bar chart:

 

histogram power bi.png

Or you can use one of the newest custom visualizations, the box plot: (the outliers are the red dots)

 

box plot power bi.png

 Can I kick Bill out from my club?


Unfortunately, no.


You cannot remove your outliers, but to understand why, we need to understand how site speed is measured. Here is a timeline for how Google Analytics measures site speed and page load:

 

image006.png

 

Simplifying things a lot, you have to take into account:


• People’s internet speed, that varies by location
• your webserver and domain name service,
• how well your website is built,
• and which browser is being used.


Looking back at our example above, 28 seconds might not be an outlier, but a normal performance at some locations and or performance with some browsers.

 

 ga site speed with outliers.png

 

 

 What should you do then?


Segmentation, segmentation, segmentation


What I normally do is to segment my data by those factors and look for patterns there.

 

Here is a great example, using our bar chart and filtering them by continent, you can see that the majority of the data in Africa falls in the 7-13 secs bin, while in America falls into the 3-7 secs category. 

 

Makes sense, right? America has probably better internet conenctions than Africa.

 

 

 

  site speed by continent.png

 

So, an outlier in America is not an outlier in Africa.

 

Does this mean that I don't have outliers? Looking at the tail in the bar chart  above, we still have "true" outliers, and those, you can remove.

 

Let’s dig deeper to the Top Offenders to uncover our outliers:

 

image008.png

 

One person from Russia, had a Page Load Time of 551 Secs!! And the median for Europe is 5,93, so that is definitely an outlier.


It would be great if we could eliminate those outliers easily in Power BI, and of course, somebody thought about it already, so Vote it up, but if you are in a hurry, DataSavvy solved this and posted the solution in his blog, thanks!.


In the meantime, Mean instead of averages will give you a better picture of reality. Remember our chart Page Loads by Location?


When we talk about Page Loads, seconds are as important as milliseconds for a 100 meters runner, so it is important to get as close to the truth as possible.

 

image005.png

 

Are you still with me? Great, because here is where the magic happens.

 

This weeks dashboard will help you monitor your website speed.

 

I am not going to go into the ongoing discussion about how site speed, is or is not, important in Google search ranking, there are tons of sources for and against it and none I have found that are conclusive.

 

What I worry about is my visitors and not Google. If they are happy, and there are ways to measure that, Google will be happy too. I obsess about right speed at the visitor’s location, and the time it takes before my visitors can interact with the content, instead of how fast the entire page loads.

 

Google Analytics has a metric for that: Document Interactive Time and you should monitor that one.

 

What is the right speed at the right location?

 

Use your data to find out: segment by important locations for your business and monitor them closely. If you are having issues there, then go to browser, device, connection times, etc, to improve those interaction load times.

 

What insights can we get with this dashboard?

 

image009.png

 

 

As usual, we start with the insights from the dashboard:


• In this dashboard you will be able to monitor trends on your site speed performance
• You will be able to see if important locations are performing well
• If your site loads as fast on all device types (mobile, tablet, desktop)
• And if your most used browsers are performing well.


This is a monitoring dashboard, but with Power BI’s cross filtering capabilities, you can gain insights as to where you have performance issues pretty quickly.


Watch it on image010.png


If you would like to see a live version of the dashboard, you can watch it in our YouTube Channel, but if you prefer text, continue reading.

 

Download the Power BI file & Dashboard guide


Don't want to create the Power BI file? If you are a member on our site (it's free), you can download it by following this link. You will also get there an in-depth explanation on how to use/interpreted the dashboard.


You will need also to change the data source from my account to your Google account.


If you are new to Power BI, this guide will take some time to complete, otherwise, it is just a few formulas and you are good to go. Skip to the formulas if that is all you need.

 

Creating the dashboard: Get Data


Before you can start creating this dashboard you need to have Power BI Desktop installed (it’s FREE) and connect to Google Analytics. Once you have done that, we can start importing the data in Power BI.


Let’s do it, open Power Bi Desktop and click on “Get Data”:

 

image011.png

 

Select Google Analytics and click “Connect”:

 

image012.png

Select your account, property, and view:

 

image013.png

 

and now we will export a table with the following parameters:


Download the following dimensions and metrics:


Avg Document Content Loaded Time (sec), Avg Document Interactive Time (sec), Avg Page Download Time (sec), Avg Page Load Time (sec), Avg Redirection Time (sec), Avg Server Connection Time (sec), Avg Server Response Time (sec), Avg. Domain Lookup Time (sec), Page Load Time (ms), Page Load Sample


Browser, Device Type, City, Country, Continent and Date


Rename the table “Site Speed”


Creating additional columns in the Query Editor


Before we load the data in the query editor, we need to add the follwing columns: Month, YearMonth, SortYearMonth
In the Query Editor, select Add Column and Add Custom Column:

 

 image014.png

And write the following formula:


Month = Number.ToText(Date.Month([Date]))

 

image015.png

 

Repeat the process for the remaining columns:


Year-Month = Number.ToText(Date.Year([Date]))&"-"&Text.PadStart([Month],2,"0")


Sort Year Month = Number.ToText(Date.Year([Date]))&Text.PadStart([Month],2,"0")


And click on “Close and Apply”:

 

image016.png

 

Building the Dashboard


Now that we have all the data we need, let’s start building the dashboard.


We need to create some measures first:

 

 

image017.png

Power BI Measures

 

Google Analytics Average Page Load Time (sec)
GA Page Load Time (sec) = DIVIDE(DIVIDE(SUM('Site speed'[Page Load Time (ms)]),1000),SUM('Site speed'[Page Load Sample]))

 

Median Page Load Time (Sec)
Page Load Time-Median (sec) = MEDIAN('Site Speed'[Avg. Page Load Time (sec)])

 

Median Document Interactive Time (sec)
Median Document Interactive Time (sec) = MEDIAN('Site Speed'[Avg. Document Interactive Time (sec)])

 

Median Content Loaded Time (ms)
Median Document Content Loaded Time (sec)=MEDIAN('Site Speed'[Avg. Document Content Loaded Time (sec)])

 

Median Domain Lookup Time (sec)
Median Domain Lookup Time (sec) = MEDIAN('Site Speed'[Avg. Domain Lookup Time (sec)])

 

Median Page Download Time (sec)
Median Page Download Time (sec) = MEDIAN('Site Speed'[Avg. Page Download Time (sec)])

 

Median Redirection Time (sec)
Median Redirection Time (sec) = MEDIAN('Site Speed'[Avg. Redirection Time (sec)])

 

Median Server Connection Time (sec)
Median Server Connection Time (sec) = MEDIAN('Site Speed'[Avg. Server Connection Time (sec)])

 

Median Server Response Time (sec)
Median Server Response Time (sec) = MEDIAN('Site Speed'[Avg. Server Response Time (sec)])

 

Calculate the number of pages used to determine site speed
Page Sampled = SUM('Site Speed'[Page Load Sample])

 

Create the visualizations


The visualizations for the dashboard are so easy, that I don’t think I need to go through how to create them, this blog post is long already as is….
Let me know if you have questions on how to create them in the comments and I will help you out.

Comments
by NV Frequent Visitor
on ‎03-07-2016 03:59 AM

Thanks for the post! A quick and easy way into the distribution of the page loads could also be done with eg.:

PERCENTILE.INC(<column>, <k>)

 

For example, if you want to see how fast the page load is for the majority of your visitors the 95th or 90th percentile would be a pretty good metric.  Together with the median, they offer a quick way to exclude unfrequent outliers and give you a straight answer on whats the page load in general and what is the longest page load for the majority of my visitors.   

 

 

by ruthpozuelo
on ‎03-07-2016 12:09 PM

Hi NV; 

Great idea, thanks!!! 

On my way to update the file Smiley Happy

/Ruth

by Administrator chass
on ‎03-28-2016 10:01 PM

Great Post!