Google’s Cyclistic Capstone Project

The Google Data Analytics Course offered on Coursera is a great way for beginners to start their journey into the world of data analytics. It is a beginner course that introduces learners to many of the tools and skills used by data analysts to collect, transform and organize data. One benefit of the Google Data Analytics Course is that it gives beginners the opportunity to complete a capstone project that requires you to use datasets from a real-life case study to ask questions, prepare, process, analyse, visualize and act on the data from the scenario.

While I never started this Coursera course, I decided to use my skills as a data analyst to provide my solution and add the case study as a project in my portfolio, to showcase my skills as a data analyst. I will be using mostly SQL and Tableau for this project.

Situation Analysis

Divvy is a bike-sharing system in Chicago that provides a convenient, fun and affordable transportation option for commuting within the city of Chicago, for work, getting around the city or simply exploring. Currently there are 2 broad categories of Divvy users:

Casual Riders – These riders use the bikes for short trips by buying a single-ride pass or for an entire day through a full-day pass.
Member Riders – These members become annual members which allows them gain unlimited daily access to Divvy bikes for an entire year.

Research has shown that member riders are more profitable to the business than causal riders. Therefore, the plan is to maximize the number of member riders by attempting to convince casual riders into converting to members.

Goals and Expectations

Purpose/Benefit	End Result	Stakeholders	Success Criteria
Unlock insights needed to design marketing strategies that help to convert casual riders into members	Case study profiling the behaviour of the different rider profiles	Lindsay Silk-Kremenak, Director of Marketing	Visualizations that profile the behaviours of casual and member riders
A report that details the behaviour and analysis of consumers and rides	A presentation showcasing expertly crafted visualizations	Divvy’s Executive team	Sophisticated, clear and polished visualizations that highlight the differences between rider profiles
Recommendations and findings that will be used to help guide Divvy Marketing and other initiatives	Suggestions on how to persuade casual riders to try membership	Divvy Marketing Analyst team	Data-rooted answers to suggestions on the question of how to persuade casual riders into members

Identify the Business Task

We have been tasked with providing guidance to the marketing team which will help them with coming up with a strategy with the objective of converting casual riders to member riders. To achieve this, they require us to analyse divvy-trip-data and to answer questions about the Divvy audience.

In particular, Lindsay would like recommendations that will help answer 3 particular questions:

In what ways do members and casual riders use Divvy bikes differently?
Why would Casual Riders want to use Divvy more?
How can Divvy influence Casual Riders to become Members?

Data Collection

We will be using the Divvy monthly datasets from April 20020 to April 2022, a total of 25 datasets, with each entry representing every bike ride within that date range. The datasets will be imported into our analytics tool (in this case, Microsoft SQL Server). The first look into one of the sheets shows a total of 13 columns, each with different attributes. These 25 sheets will be imported into Microsoft SQL Server and a UNION ALL command will be used to combine all the sheets into one spreadsheet.

CREATE VIEW divvy_trip_data AS

WITH divvy_union_table AS (
	  SELECT *
	  FROM [202004-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202005-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202006-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202007-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202008-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202009-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202010-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202011-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202012-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202101-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202102-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202103-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202104-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202105-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202106-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202107-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202108-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202109-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202110-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202111-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202112-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202201-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202202-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202203-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202204-divvy-tripdata])
SELECT *
FROM
	divvy_union_table

The sheet contains the following columns.

Column	Description
ride_id	Unique id tag of each trip taken
rideable_type	Type of bike used on bike trip
started_at	Date and time bike trip started
ended_at	Date and time bike trip ended
start_station_name	Name of station the bike trip started
start_station_id	ID tag of station the bike trip started
end_station_name	Name of station the bike trip ended
end_station_id	ID tag of station the bike trip ended
start_lat	latitude coordinates of start station
start_lng	longitude coordinate of start station
end_lat	latitude coordinates of end station
end_lng	longitude coordinates of end station
member_casual	Rider profile (member or casual)

While importing the 25 sheets into one spreadsheet, some null values jump out in the start_station_name, start_station_id, end_station_name and end_station_id columns. For now, we’ll accept these null values and figure out what to do with them during exploration and cleaning.

Looking at a sample dataset we will be working with, a couple of questions jump up concerning the bike-sharing industry and what kind of insights we can draw from exploring the trips taken.

Initial Hypothesis & Posing Questions

Can we build a consumer psychographic profile for each consumer type?
Will the time spent on each trip by each rider tell us anything about the different rider profiles?
Can we figure out the region where most of the bikes are picked?
Which types of bikes are the most popular?
What is the average time spent on each bike and by each rider type?
Can average time spent on each bike be used to convert casual riders to member riders?
Can we calculate the distance covered on each trip based on the latitude and longitude coordinates provided?
What are the most frequent weekdays for trips for each rider type?
Are there any trends noticed when looking at the frequency of trips taken?

Data Exploration & Cleaning

Looking at the union spreadsheet containing all 25 sheets, the aggregate count command on all the records returns a total of 9,584,529 entries.

SELECT 
    COUNT(*)
FROM
	divvy_union_table

That’s a sizeable amount of data to work with. To avoid allocating a large amount of memory towards maintaining such a large number of records, using a view would be a more efficient approach, rather than saving this spreadsheet into a new table (I’ve named this view divvy_trip_data).

Taking a look at the column names provided, I intend to edit some of the names to provide better clarity on the type of data in the column. I’ll rename start_lng and end_lng to start_long and end_long respectively. I shall also rename the member_casual column name to rider_type. During importation, I also had to change the unique_id type from int to nvarchar(250) to accommodate unique ids containing alphanumeric characters.

Looking at the dates in the started_at column may help uncover some insights into which days in the week riders prefer to take trips. I’ll use the DATENAME command on the started_at column to parse out which day of the week the trip was started.

With the timestamps provided in the started_at and ended_at columns, we can use these two columns to create a new calculated field called trip_duration, which calculates the difference in time (in minutes) between when a bike trip starts and when it ends.

How do we clean the null values in the start_station_name and end_station_name columns? One way to do this would be to create a separate table of start_station_name and start_station_id to pair unique values of station names to station ids; then use this table to populate any missing station names. However, it appears that one station name can have multiple ids. There also appear to be a lot of inconsistencies with how ids and station names are recorded. This will make cleaning these null values tricky.

Analysing the bike stations and routes taken might provide us with some insights into the bike-sharing industry or help up understand the behaviour of bike riders. However, some records are missing values in the start_station_name and end_station_name columns, hence the presence of null values. To clean any null values in the start_station_name and end_station_name columns, I’ll substitute the nulls with ‘Not Available‘, rather than delete those records.

With the null values in the start_station_name and end_station_name cleaned, we can use these two columns to create a new column called bike_route, which represents the route taken from start to finish on every trip. To do this, we’ll concatenate the start_station_name value with the end_station_name value for each record, along with a ‘ — ‘ inserted between them. This gives us a list of all the bike routes taken per rider.

CREATE VIEW divvy_trip_data AS

WITH divvy_union_table AS (
	  SELECT *
	  FROM [202004-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202005-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202006-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202007-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202008-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202009-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202010-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202011-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202012-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202101-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202102-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202103-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202104-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202105-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202106-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202107-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202108-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202109-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202110-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202111-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202112-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202201-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202202-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202203-divvy-tripdata]
	  UNION ALL
	  SELECT *
	  FROM [202204-divvy-tripdata])
SELECT
	ride_id,
	rideable_type,
	started_at,
	DATENAME(DW, started_at) AS start_day,  -- function for extracting day from started_at column
	ended_at,
	DATEDIFF(MINUTE, started_at, ended_at) AS trip_duration,  -- function to calculate time difference
	ISNULL(start_station_name, 'Not Available') AS start_station_name,  -- function to clean Null values in start station name
	start_station_id,
	ISNULL(end_station_name, 'Not Available') AS end_station_name,  -- function to clean null vlues in end station name
	end_station_id,
	CONCAT(ISNULL(start_station_name, 'Not Available'), ' -- ', ISNULL(end_station_name, 'Not Available')) AS bike_route,  -- concatenating start station and end station to create bike route per trip
	start_lat,
	start_lng AS start_long,
	end_lat,
	end_lng AS end_long,
	member_casual AS rider_type
FROM
	divvy_union_table

Continuing with our exploratory data analysis( EDA), more observations start to jump out.

Some of the values in our new trip_duration column are less than 1 minute. This may have occurred because the rider changed their mind and cancelled the trip; or the data was incorrectly entered. As I cannot think of any insight we can gain from trips that last for less than a minute, these records will be deleted.

Still looking at the time spent per trip, the MAX(trip_duration) command returns a value of 58,720 minutes, well over 24 hours. Considering the unlikelihood of a rider cycling for 2 hours, this suggests that there may be outliers in our dataset which would affect the quality of our analysis. It will help to look at the distribution of trip_duration to clean any outliers.

This raises the question of what we do with other null values. As we have cleaned and formatted the start_station_name and end_station_name columns, we have no need for the station ids. The station_station_id and end_station_id columns can be dropped.

With the longitude and latitude coordinates provided in the spreadsheet, we can look at the distribution of stations geographically. Familiarity with the landscape and socio-economic layout of the city of Chicago will help a great deal to analyse if there are any factors which drive the popularity of certain bike stations. I hypothesize that the start location for a trip are more significant for analysis than the end location, and are likelier to tell us about a rider’s profile.

Analysis & Visualizations

We are ready to import our dataset into Tableau to identify any common themes ansd patterns that help answer questions about Divvy and their rider behaviours. Visualizing the data will also help help identify any outlier records in our data, which may still need to be cleaned. Thankfully, with over 9 milion records, we have the luxury of removing a couple of records as outliers, without losing too much insight from the data.

Tableau will also be used to create a calculated field which measures the distance between two points, by using the longitude and latitude coordinates of the station and end station of each trip. In the event of trips where the distance covered is 0 miles, these trips will be left alone because it is possible for a rider to start and end his bike ride at the same location. However, it will be worthy to consider to cleaning rides with a distance of over a certain distance (for this analysis, we shall set that to 20 miles). A histogram is a great visual aid to help represent the distribution of values; we’ll use a distance distribution histogram to repesent the range of distance covered in the rides and identify where outlier distance values fall.

SQL Tableau