BellaBeat Dashboard
Bellabeat Case Study Documentation
Disclosure
The purpose of this documentation is to present the case study and the steps taken to tackle the issue at hand. The approach used to complete this case study is the one thought by the Google Data Analytics Certificate course available on Coursera which involves the following 6 stages taken sequentially: Ask, Prepare, Process, Analyse, Share and Act.
Introduction
Bellabeat is a fictional high-tech manufacturer of health-focused products for women. Their products include smart watches, bracelets, necklaces, clips and water bottles. They also possess an app and different memberships for their users. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Their team believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. As a result, they have asked the following:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
Ask Phase
Business task: Analyze non-Bellabeat smart device usage data to identify user habits and trends that correlate to Bellabeat’s customers in order to guide marketing strategy.
Prepare Phase
Data Set: A public data set called the FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius) was used for this case study. This data set can be found on Kaggle in the following link https://www.kaggle.com/datasets/arashnic/fitbit. Its content was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
Storage: The data set was downloaded from Kaggle and locally stored in the computer for further use.
Bias and credibility:
- The dataset only contains data collected in between 03.12.2016 - 05.12.2016 meaning, this data may not be relevant anymore due to its age.
- The data was collected from 30 users only, whereas according to a quick google search the current estimate of active Fitbit users is of 31million people indicating that the sample only accounts for 0.000096% of the total population. For a confidence level of 95% and a margin of error of +/- 5% the size of the sample for the given population should be of 385, i.e an additional 355 user data is needed.
- According to the case study information, BellaBeat operates worldwide, however this dataset does not include where the users are from (their location could affect their behaviour hence have an effect on the data).
- Overall the data seems to be reliable in the sense that is truthful, original and comprehensive, however it has serious issues with currency, relevance and accuracy in depicting the population as a whole. My recommendation would be to either use a different dataset with more data, or take the time to undergo another survey to collect more data.
Process Phase
For the data cleaning process, Excel and SQLite were used. The following checklist was used to ensure the cleanliness and integrity of the data:
- Check for typos;
- Check for extra spaces within cells;
- Check for duplicates;
- Check for nulls and empty cells;
- Check data type consistency across columns;
- Check data range consistency.
The following data cleaning process was used:
File: dailyActivity_merged [Excel]
- Check length of all IDs in the ID column to ensure range consistency;
- Change date format from MDY to DMY;
- Set all numeric values to have two (2) decimal places only;
- Create column UnaccountedForMinutes and set it to =1440-(SUM(K2;L2;M2;N2;)).
File: hourlyCalories_merged [Excel]
- Check Length of all IDs in the ID column to ensure consistency;
- Insert column next to column B and name it ActivityTime;
- Rename Column B to ActivityDate;
- Separate date from time in column B using Text To Column function and insert the time in column C;
- Change date format from MDY to DMY;
- Format Column B to be Short Date and column C to be Time.
File: hourlyIntensities_merged [Excel]
- Check Length of all IDs in the ID column to ensure consistency;
- Insert column next to column B and name it ActivityTime;
- Rename Column B to ActivityDate;
- Separate date from time in column B using Text To Column function and insert the time in column C;
- Change date format from MDY to DMY;
- Format Column B to be Short Date and column C to be Time.
File: hourlySteps_merged [Excel]
- Check Length of all IDs in the ID column to ensure consistency;
- Insert column next to column B and name it ActivityTime;
- Rename Column B to ActivityDate;
- Separate date from time in column B using Text To Column function and insert the time in column C;
- Change date format from MDY to DMY;
- Format Column B to be Short Date and column C to be Time.
File: hourlyActivity_merged [Excel]
- Create new excel file named hourlyActivity_merged;
- Import data into different excel pages from files hourlySteps_merged, hourlyIntensities_merged and hourlyCalories_merged using the Get Data function under Data tab;
- Merge data from all 3 pages into 1 to make a single table;
- Create new page named hourlyActivity_merged
- Copy data from hourlySteps_merged and paste it to page hourlyActivity_merged in column A1.
- Copy data from hourlySteps_merged and paste it to page hourlyIntensities_merged in column F1.
- Copy data from hourlySteps_merged and paste it to page hourlyCalories_merged in column F1.
- Create 3 new columns named IDCheck, DateCheck, TimeCheck in columns Q, R and S respectively.
- Set rows in column Q to =IF(AND(A2=F2;F2=L2);1;0) to ensure all ID rows match in all 3 tables.
- Set rows in column R to =IF(AND(B2=G2;G2=M2);1;0) to ensure all date rows match in all 3 tables
- Set rows in column S to =IF(AND(C2=H2;H2=N2);1;0) to ensure all time rows match in all 3 tables.
- If all values return 1 (i.e Match) delete columns E,F,G,H,K,L,M,N,Q,R,S to maintain only non-repetitive and relevant columns.
- Save file.
File: heartrate_seconds_merged [SQL]
SELECT * FROM heartrate LIMIT 10;
/*This first query separates the date and time into two different colums as a test*/
SELECT id,
TRIM(SUBSTR(Date, 1, instr(Date,' '))) AS Date,
TRIM(SUBSTR(Date, instr(Date,' '))) AS Time,
Value
FROM heartrate;
/*The following query creates a temp table called heartrate_edit that contains the id, date, time and value column from the original heartrate table.
In this table the date and time are in different columns.
This new table also exclude all null values*/
CREATE TABLE heartrate_cleaned AS
WITH heartrate_temp AS(
SELECT id,
TRIM(SUBSTR(Date, 1, instr(Date,' '))) AS Date,
TRIM(SUBSTR(Date, instr(Date,' '))) AS Time,
Value
FROM heartrate
WHERE id IS NOT NULL)
/*The following query creates a table called "heartrate_cleaned" from the temp table with the date column in the correct format (i.e d/mm/yyyy)
*/
SELECT id,
TRIM(SUBSTR(Date, 3, instr(Date, "/")), "/") /*/day*/ || "/" || SUBSTR(Date, 1,1) /*month*/ || "/2016" /*year*/ AS Date,
Time,
/* The following CASE statement converts the time format from 12h to 24h.
ir first turns 12AM to 00, then makes sure 12pm remains as 12 and then grabs all values that are PM and adds 12h to the time.
*/
CASE
WHEN (TRIM(SUBSTR(Time, 1, 2), ":") || ":00:00 " || TRIM(SUBSTR(Time, instr(Time, ' '), 3)) = '12:00:00 AM')
THEN '00:00:00'
WHEN (TRIM(SUBSTR(Time, 1, 2), ":") || ":00:00 " || TRIM(SUBSTR(Time, instr(Time, ' '), 3)) = '12:00:00 PM')
THEN '12:00:00'
WHEN (TRIM(SUBSTR(Time, instr(Time, ' '), 3)) = 'PM')
THEN ((TRIM(SUBSTR(Time, 1, 2), ":") +12) || ":00:00 ")
ELSE TRIM(SUBSTR(Time, 1, 2), ":") || ":00:00 "
END AS Time24h,
Value
FROM heartrate_temp;
/* The following query returns the average heartrate per hour for each id in each date*/
SELECT
Id,
Date,
Time24h AS Time,
ROUND(AVG(Value), 1) AS AVGHeartRatePerHour
FROM heartrate_cleaned
GROUP BY ID, Date, Time24h
ORDER BY ID, Date, Time24h
- Save query results as .Csv file “Hourlyheartrate_generated (cleaned)”.
File: Sleepday_merged [Excel, SQL]
- Check Length of all IDs in the ID column to ensure consistency;
- Separate date from time in column B using Text To Column function and only use the date;
- Change date format from MDY to DMY;
- Format Column B to be Short Date;
- Insert column next to column B and name it Weekday and set its rows to =TEXT(B2;”dddd”);
- Create PercentageOfTimeAsleep column in column G and set rows to =E2/F2;
- Create PercentageTimeInBed column in column H and set all rows to 1;
- Format columns G and H to percentage;
- Import sleepday_merged data into SQL;
/*
This query returns the number of inputs from each user, the average time asleep, the minimum, maximum and average percentage of time asleep with respect to the time spent in bed
*/
SELECT id,
COUNT(id),
ROUND((AVG(TotalMinuteAsleep))/60) AS AverageHoursAsleep,
MIN(PercentageOfSleep) AS MinPercentage,
MAX(PercentageOfSleep) AS MaxPercentage,
ROUND(AVG(PercentageOfSleep)) AS AveragePercentage
FROM sleep_day
GROUP BY id
ORDER BY AverageHoursAsleep DESC
Analyse Phase
The organization of the data for analysis was split into three groups, daily activity data, hourly activity data and sleep. The reason is that while hourly activity provides an in-depth insight into how users are using their smart devices, daily activity provides a broader scope to look at. When combined, it is possible to verify trends on a daily basis and how such trends are spread throughout the day.
Key findings:
- More steps are not necessarily equivalent to more calories burned. Though most users spend less than 15 minutes in a very active manner, those who are less sedentary and more active tend to burn more calories.
- There are three weekdays where user activity peaks and they are Monday, Tuesday and Saturday, at the other end of the spectrum, Sundays seem to be the days user activity drops the most and the amount of hours in bed and asleep peaks.
- The heart rate and step count indicate that users are overall active from 6AM to 9PM, with peak activity at lunch time and from late noon to early evening.
- The number of steps and intensity of activity across users on an hourly basis have a positive relationship, however the relationship between these two and the amount of calories burned seems to be somewhat random. It may be speculated that the reason behind this randomness lies within the health conditions of each user. A healthier/fit user who exercises regularly would burn less calories even after taking more steps than a user who is out of shape.
- Users spend over 90% of their time in bed asleep (6 to 8 hours) with a few exceptions that could be justified by overspending their time in bed and perhaps insomnia or distractions.
Share Phase
All visualizations for this case study were made using Power Bi and are available at https://community.powerbi.com/t5/Data-Stories-Gallery/Coursera-Google-Data-Analytics-Capstone-Projec....
Act Phase
Recommendations:
- According to the data, some users are asleep less than 90% of the time they are in bed which could be an indication that some users have trouble sleeping. BellaBeat’s app could take advantage of that information to promote meditation and relaxation techniques to benefit those users and perhaps enter a new market that focuses on user sleep wellbeing.
- Rather than simply providing users with health related data, the BellaBeat App could make use of said data to set automatic alarms for users to be very active based on their daily activity hence taking advantage of the fact that users spend on average less than 15 minutes in a very active manner. It could also combine the separate datasets to point out when is the best time to be very active.
- Combine the datasets to develop schedules for when it is ideal for users to be very active on a weekly and daily basis to better suit their schedules. This could be based on data collected throughout the day that lead to weekday patterns. Since most users are very active for less than 15 minutes and on Monday, Tuesday and Saturday, this data could be used to generate an activity schedule.
Future work:
- Collect and analyse more recent data from more users so the data can provide an accurate representation of the population as a whole.
- Include data regarding the location of users (country and city) as well as age ranges to better pinpoint the desired audience for marketing.
- Research marketing strategies from the big players in the market to see what they have done in the past that worked and why.
eyJrIjoiZWNmOTk3MzMtZjZiMS00MDVkLTk5NzEtMzZhOWQ4YWMzNjI2IiwidCI6IjhiYjg2YzEyLTczNmUtNDZiYy05ODk0LTJlZjhiY2U3OTk1ZSJ9