Chicago Bike-Share Analysis Part 1: Cleaning
TL;DR: I cleaned a dataset of more than 5 million records using SQL and Google Apps Script.
I love riding bikes. When I'm exploring a new city, I often try its public bike system. I think it's the most convenient way to discover a new city and find its hidden gems.
I've never been to Chicago, but I found out that Divvy, Chicago's bike-sharing system, releases historical monthly trip data under this license. The data provided are quite complete, which makes them interesting to explore.
But before moving further, let's get some context.
First things first: the bikes. Riders can choose between 2 kinds of bikes:
- Classic bikes. They must be docked at a station to end a ride.
- Electric bikes. They can be locked to a public bike rack within the service area or docked at a station.
Divvy offers 3 plans to use its service:
- Single ride passes for non-members. It costs $1 to unlock a bike, then $0.17 a minute.
- Day passes for non-members. It costs $16.50 a day and allows for unlimited 3-hour rides for 24 hours.
- Annual memberships, which give users unlimited 45-minute rides.
So what's the goal of this data analysis? Simple: We want to uncover how annual members and non-members use Divvy bikes differently.
Let's dive in.
Data Preparation
Lyft, which operates Divvy, releases trip history data every month. We will focus this analysis on the year 2022, which means we'll have 12 files to work with.
There are 13 fields in each CSV file, and each record represents 1 ride. Metadata are not provided, but the majority of variables are self-explanatory:
Field name | Data type | Description |
---|---|---|
ride_id | String | Unique ID for each ride |
rideable_type | String | Bike types |
started_at | Timestamp | Trip start day and time |
ended_at | Timestamp | Trip end day and time |
start_station_name | String | Trip start station |
start_station_id | String | ID of the start station |
end_station_name | String | Trip end station |
end_station_id | String | ID of the end station |
start_lat | Float | Latitude of the start station |
start_lng | Float | Longitude of the start station |
end_lat | Float | Latitude of the end station |
end_lng | Float | Longitude of the end station |
member_casual | String | Rider type |
2 fields need some clarification:
- `rideable_type` has 3 possible values: `classic_bike`, `docked_bike` and `electric_bike`.
- `member_casual` has 2 possible values: `casual` for users who have bought either a single ride or a day pass, and `member` for users who have subscribed to an annual membership.
Data bias
Each trip has been anonymized, so there's no personally identifiable information (PII) or credit card information. This makes it impossible to connect past purchases to users to determine, for example, how many trips a casual rider takes before upgrading to an annual membership.
Another limitation is that Divvy does not provide unique IDs for bikes, which makes it impossible to analyze bike-specific data like failure rates.
According to Divvy, the data has been processed to remove trips from staff, as well as any trips below 60 seconds, which could be attributed to false starts or riders re-docking a bike to ensure it was locked.
Spoiler alert: nope.
Now that we have everything, let's start processing and cleaning the data.
Data cleaning
I first had a quick once-over of the 12 CSV files to make sure everything was in order. I removed the headers from the February to December datasets and noticed that TIMESTAMP records were stored as strings in the September to December datasets.
After removing the quotation marks from the TIMESTAMP records with the search and replace functionality, I merged all 12 files together using the command prompt.
The resulting CSV file came in at a hefty 1 GB with over 5.6 million records. That's a ton of rides to sift through! I then stored it in Google Cloud Storage and created a table named `2022_tripdata` in BigQuery, within a dataset called `divvy_dataset`.
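As an aside, if the 12 monthly files already sit in a Cloud Storage bucket, BigQuery can ingest them all at once with a wildcard, skipping the manual merge. A sketch, assuming a hypothetical bucket path:
LOAD DATA INTO `divvy_dataset.2022_tripdata`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1, -- each monthly file ships with a header row
  uris = ['gs://your-bucket/2022*-divvy-tripdata.csv'] -- hypothetical bucket and file pattern
);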
Cleaning strategy
We will start by identifying in the `2022_tripdata` table what needs to be cleaned to achieve our business goal: understanding how annual members and non-members use Divvy bikes differently.
In the second part, we will create a copy of the `2022_tripdata` table, name it `2022_tripdata_clean`, and use it to perform the cleaning based on the points defined in the first step.
Duplicates and `NULL` values
Let's start by checking that there are no duplicate `ride_id` values and that the `rideable_type`, `started_at`, `ended_at` and `member_casual` fields contain no `NULL` values.
SELECT
COUNT(ride_id) AS number_of_records,
COUNT(DISTINCT ride_id) AS unique_ride,
COUNTIF(rideable_type IS NOT NULL) AS rideable_records,
COUNTIF(started_at IS NOT NULL) AS started_at_records,
COUNTIF(ended_at IS NOT NULL) AS ended_at_records,
COUNTIF(member_casual IS NOT NULL) AS member_records
FROM `divvy_dataset.2022_tripdata`
This statement returns the following results:
number_of_records | unique_ride | rideable_records | started_at_records | ended_at_records | member_records |
---|---|---|---|---|---|
5667717 | 5667717 | 5667717 | 5667717 | 5667717 | 5667717 |
Every record is a unique ride and all of the important fields have non-null values. We're off to a good start!
Station names and station IDs
Now let's check with the following query if there are records with missing station data:
SELECT
COUNTIF(start_station_name IS NULL) AS missing_start_station_name,
COUNTIF(start_station_id IS NULL) AS missing_start_station_id,
COUNTIF(end_station_name IS NULL) AS missing_end_station_name,
COUNTIF(end_station_id IS NULL) AS missing_end_station_id
FROM `divvy_dataset.2022_tripdata`;
Yes, indeed! That's a whole lot of `NULL` data:
missing_start_station_name | missing_start_station_id | missing_end_station_name | missing_end_station_id |
---|---|---|---|
833064 | 833064 | 892742 | 892742 |
It also shows that when station data is missing, both the station name and station ID fields are `NULL`. Moving forward, we will use the station name fields instead of the station ID fields.
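Equal counts alone don't prove the name and ID are missing on the same rows, so here's a quick sanity check (a sketch; all four counts should come back as 0):
SELECT
  COUNTIF(start_station_name IS NULL AND start_station_id IS NOT NULL) AS start_name_only_null,
  COUNTIF(start_station_name IS NOT NULL AND start_station_id IS NULL) AS start_id_only_null,
  COUNTIF(end_station_name IS NULL AND end_station_id IS NOT NULL) AS end_name_only_null,
  COUNTIF(end_station_name IS NOT NULL AND end_station_id IS NULL) AS end_id_only_null
FROM `divvy_dataset.2022_tripdata`;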
Let's then group the results by bike type to further investigate it:
SELECT
rideable_type,
COUNTIF(start_station_name IS NULL) AS missing_start_data,
COUNTIF(end_station_name IS NULL) AS missing_end_data
FROM `divvy_dataset.2022_tripdata`
GROUP BY rideable_type;
We can see that nearly all trips where station data are missing are coming from e-bikes:
rideable_type | missing_start_data | missing_end_data |
---|---|---|
electric_bike | 833064 | 886338 |
classic_bike | 0 | 3788 |
docked_bike | 0 | 2616 |
As a reminder, when riding an e-bike, users can lock the bike to a public rack instead of docking it at a station. It also means that users can unlock an e-bike from a public rack and not necessarily from a Divvy station. So these records don't need to be cleaned.
However, some trips taken on a `classic_bike` or a `docked_bike` have missing end station data. These might be due to lost or stolen bikes. We will have to remove these records from the dataset.
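To size this cleanup before touching anything, a quick count of the affected rows (a sketch; it should match the 3,788 + 2,616 rides from the table above):
SELECT COUNT(*) AS rides_to_remove
FROM `divvy_dataset.2022_tripdata`
WHERE end_station_name IS NULL
  AND rideable_type IN ('classic_bike', 'docked_bike');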
Bike-type distribution
According to Divvy's website, there are only 2 types of bikes: classic and electric. However, the `rideable_type` field has 3 possible values: `classic_bike`, `docked_bike` and `electric_bike`.
Let's check the distribution by bike type with the following query:
WITH
bike_type AS (
SELECT
rideable_type,
COUNT(ride_id) AS number_of_rides,
FROM `divvy_dataset.2022_tripdata`
GROUP BY rideable_type
)
SELECT
rideable_type,
number_of_rides,
ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) as share_per_bike_type
FROM bike_type
ORDER BY number_of_rides DESC;
Turns out that only 3% of the records have `docked_bike` as the `rideable_type`:
rideable_type | number_of_rides | share_per_bike_type |
---|---|---|
electric_bike | 2889029 | 0.5097 |
classic_bike | 2601214 | 0.459 |
docked_bike | 177474 | 0.0313 |
I reached out to Lyft's data service and they told me that changes were made to the Divvy platform in April 2020. Before that, a classic bike was identified as a `docked_bike`; since April 2020, these bikes are tracked as `classic_bike`.
Despite the change, there are still instances of the `docked_bike` value in the 2022 dataset. I guess the implementation is not perfect yet. Anyway, we will have to replace all of these instances with `classic_bike`.
Trip length
According to Divvy's website, all trips from staff and those below 60 seconds have been removed from their datasets.
Let's add a new field called `ride_length` to verify this, and populate it with the time difference in seconds between the `ended_at` and `started_at` fields using the following query:
ALTER TABLE `divvy_dataset.2022_tripdata`
ADD COLUMN ride_length INT64;
UPDATE `divvy_dataset.2022_tripdata`
SET ride_length = TIMESTAMP_DIFF(ended_at, started_at, SECOND)
WHERE TRUE;
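A quick spot check (just a sketch) to confirm the new column matches the raw timestamps:
SELECT
  started_at,
  ended_at,
  ride_length,
  TIMESTAMP_DIFF(ended_at, started_at, SECOND) AS recomputed_length -- should equal ride_length on every row
FROM `divvy_dataset.2022_tripdata`
LIMIT 5;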
Next, let's run the following query to get a quick glance at the distribution:
SELECT
COUNT(ride_id) AS total_rides,
COUNTIF(ride_length < 60) AS rides_below_60secs,
COUNTIF(ride_length > 10800 ) AS rides_above_3hrs
FROM `divvy_dataset.2022_tripdata`;
The results clearly show that trips below 60 seconds have not been properly removed:
total_rides | rides_below_60secs | rides_above_3hrs |
---|---|---|
5667717 | 121089 | 19592 |
There are also some trips above 3 hours, which may indicate lost or stolen bikes.
Let's dig further and see how the distribution of ride durations looks, starting with members:
WITH
trip_length AS (
SELECT
CASE
WHEN ride_length < 60 THEN 'Less than 1 min'
WHEN ride_length BETWEEN 60 AND 3600 THEN '1min to 1 hr'
WHEN ride_length BETWEEN 3601 AND 7200 THEN '1hr to 2hrs'
WHEN ride_length BETWEEN 7201 AND 10800 THEN '2hrs to 3hrs'
WHEN ride_length BETWEEN 10801 AND 14400 THEN '3 hrs to 4hrs'
ELSE 'More than 4 hrs'
END AS ride_length,
COUNT(ride_id) AS number_of_rides,
CAST(ROUND(AVG(TIMESTAMP_DIFF(ended_at, started_at, SECOND))) AS INT64) AS avg_ride_length -- CAST to INTEGER to construct an INTERVAL object
FROM `divvy_dataset.2022_tripdata`
WHERE member_casual = 'member'
GROUP BY ride_length
)
SELECT
ride_length,
number_of_rides,
ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) AS share_ride_length,
MAKE_INTERVAL(second => avg_ride_length) AS avg_ride_length_interval
FROM trip_length
ORDER BY avg_ride_length;
More than 97% of members' trips lasted between 1 minute and 1 hour, with an average duration of 11 minutes and 49 seconds:
ride_length | number_of_rides | share_ride_length | avg_ride_length_interval |
---|---|---|---|
Less than 1 min | 73199 | 0.0219 | 0-0 0 0:0:15 |
1min to 1 hr | 3248750 | 0.971 | 0-0 0 0:11:49 |
1hr to 2hrs | 17106 | 0.0051 | 0-0 0 1:17:41 |
2hrs to 3hrs | 2619 | 0.0008 | 0-0 0 2:25:32 |
3 hrs to 4hrs | 1178 | 0.0004 | 0-0 0 3:26:37 |
More than 4 hrs | 2833 | 0.0008 | 0-0 0 12:50:22 |
As a reminder, members can use a bike for up to 45 minutes. They can keep the bike out for longer, but they'll have to pay an extra $0.17 per minute.
Based on the extra usage charge and the average ride length of 1 hour and 17 minutes in the `1hr to 2hrs` cohort, we will give members an additional 30-minute window and remove all records from members where the trip duration exceeds 1 hour and 15 minutes (4,500 seconds).
Regarding casual riders, it seems that they ride for longer periods than members. That's something we could expect, since members probably use Divvy to commute while casual riders may use it for leisure or sightseeing:
ride_length | number_of_rides | share_ride_length | avg_ride_length_interval |
---|---|---|---|
Less than 1 min | 47890 | 0.0206 | 0-0 0 0:0:23 |
1min to 1 hr | 2127510 | 0.9162 | 0-0 0 0:16:7 |
1hr to 2hrs | 108251 | 0.0466 | 0-0 0 1:21:13 |
2hrs to 3hrs | 22800 | 0.0098 | 0-0 0 2:23:54 |
3 hrs to 4hrs | 4995 | 0.0022 | 0-0 0 3:22:54 |
More than 4 hrs | 10586 | 0.0046 | 0-0 0 31:56:16 |
As a reminder, casual riders with a day pass can use a classic bike for up to 3 hours. As with members, we'll give them an extra 30-minute window and remove every record from casual riders where the trip duration was above 3 hours and 30 minutes (12,600 seconds).
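Before deleting anything, here's a sketch that previews how many rows each cutoff would catch:
SELECT
  COUNTIF(member_casual = 'member' AND ride_length > 4500) AS member_rides_over_1h15,
  COUNTIF(member_casual = 'casual' AND ride_length > 12600) AS casual_rides_over_3h30
FROM `divvy_dataset.2022_tripdata`;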
Same-station trips
It is common to begin and end a trip at the same station when using a bike-sharing service. There are plenty of reasons to do so. It could be because of a false start, a faulty bike, or even just a quick errand.
Let's get an idea of how many trips have the same start and end station with the following query:
SELECT
  COUNT(ride_id) AS total_rides,
  COUNTIF(start_station_name = end_station_name) AS same_station_rides -- COUNTIF instead of a subquery; same result
FROM `divvy_dataset.2022_tripdata`;
Turns out that more than 293,000 records have the same start and end station. That's 5.18% of the total rides:
total_rides | same_station_rides |
---|---|
5667717 | 293619 |
Let's get more insight into the distribution of these same-station trips by rider type:
WITH same_stations_trips AS (
SELECT
DISTINCT member_casual, -- DISTINCT since we can't use PERCENTILE_CONT with GROUP BY
COUNT(ride_id) OVER(PARTITION BY member_casual) AS number_of_rides, -- Use of PARTITION BY since we can't use GROUP BY
CAST(AVG(ride_length) OVER(PARTITION BY member_casual) AS INT64) AS avg_ride_length,
CAST(PERCENTILE_CONT(ride_length, 0.5) OVER(PARTITION BY member_casual) AS INT64) AS median_ride_length -- CAST to INTEGER to construct an INTERVAL object
FROM `divvy_dataset.2022_tripdata`
WHERE start_station_name = end_station_name
)
SELECT
member_casual,
number_of_rides,
ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) AS share,
MAKE_INTERVAL(second => avg_ride_length) AS avg_ride_length_interval,
MAKE_INTERVAL(second => median_ride_length) AS median_ride_length_interval
FROM same_stations_trips
ORDER BY number_of_rides DESC;
Nearly 60% of these same-station trips were taken by casual riders:
member_casual | number_of_rides | share | avg_ride_length_interval | median_ride_length_interval |
---|---|---|---|---|
casual | 175571 | 0.598 | 0-0 0 0:35:45 | 0-0 0 0:19:47 |
member | 118048 | 0.402 | 0-0 0 0:12:58 | 0-0 0 0:2:11 |
This could indicate that casual riders might use the service for leisurely sightseeing around a particular park or neighborhood.
But the most interesting bit is with members. The median is actually quite low: 2 minutes and 11 seconds. This suggests that they may have unlocked a bike and then returned it due to a defect.
So let's use this result as a baseline and remove all same-station trips under 2 minutes.
Time to move to coordinates.
Spoiler alert: it's going to be fun.
Coordinates data
There are 1,709 stations in total. So there should be, in total, 1,709 different coordinate pairs, right? Right. Well, let's check that with the following query:
WITH coordinates_stations AS (
SELECT
start_station_name AS station_name,
CONCAT(start_lat, ' ', start_lng) AS coordinates
FROM `divvy_dataset.2022_tripdata`
WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
UNION ALL -- UNION ALL to get data from start and end stations under the same field
SELECT
end_station_name AS station_name,
CONCAT(end_lat, ' ', end_lng) AS coordinates
FROM `divvy_dataset.2022_tripdata`
WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
COUNT(DISTINCT station_name) AS unique_stations,
COUNT(DISTINCT coordinates) AS unique_coordinates,
ROUND(COUNT(DISTINCT coordinates) / COUNT(DISTINCT station_name)) AS avg_coordinates_per_station
FROM coordinates_stations;
There are 1,952,581 unique coordinates:
unique_stations | unique_coordinates | avg_coordinates_per_station |
---|---|---|
1709 | 1952581 | 1143.0 |
Yup, that's right. One million nine hundred fifty-two thousand five hundred eighty-one. Unique. Coordinates. That's an average of 1,143 coordinate pairs per station. Definitely not good.
Let's find out which station has the most coordinates with the following query:
WITH coordinates_stations AS (
SELECT
start_station_name AS station_name,
CONCAT(start_lat, ' ', start_lng) AS coordinates
FROM `divvy_dataset.2022_tripdata`
WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
UNION ALL -- UNION ALL to get data from start and end stations under the same field
SELECT
end_station_name AS station_name,
CONCAT(end_lat, ' ', end_lng) AS coordinates
FROM `divvy_dataset.2022_tripdata`
WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
station_name,
COUNT(DISTINCT coordinates) AS unique_coordinates
FROM coordinates_stations
GROUP BY station_name
ORDER BY unique_coordinates DESC
LIMIT 3;
Annnd, drum roll please, the winner is... the `Streeter Dr & Grand Ave` station, with an astonishing 22,366 different coordinate pairs!
station_name | unique_coordinates |
---|---|
Streeter Dr & Grand Ave | 22366 |
Wells St & Concord Ln | 15746 |
Kingsbury St & Kinzie St | 13857 |
Let's plot these 22,366 coordinate points on a map using Looker:
While most of them are clustered around the same area, some are scattered all around town. This will definitely make any visualization of these coordinates uninsightful once we start analyzing the dataset.
In short, we can't trust these coordinates. Ideally, each station should have 1 coordinate pair. No more.
Time to do some cleaning.
Cleaning the coordinates data
First, let's download the Divvy Bicycle Stations dataset from the City of Chicago Data Portal and import it into a Google Sheet.
This dataset has the following fields:
Field | Data type | Description |
---|---|---|
Station Name | String | Name of the station |
ID | Integer | ID of the station |
Total Docks | Integer | Number of docks in the station |
Docks in service | Integer | Docks in service |
Status | String | In Service or Not In Service |
Latitude | Float | Latitude of the station |
Longitude | Float | Longitude of the station |
Location | String | Coordinates of the station in parentheses |
Quick note on this dataset: the station ID field is not consistent with the Divvy trip history dataset. For instance, the `Monticello Ave & Chicago Ave` station has its ID set to `301` in the Divvy dataset, while it is set to `1575949489503728316` in the City of Chicago dataset. Moving forward, we will use the station name fields to avoid any confusion.
Let's start by exporting the list of all stations from the Divvy dataset with the following query:
WITH station_data AS (
  SELECT DISTINCT start_station_name AS station_name
  FROM `divvy_dataset.2022_tripdata`
  WHERE start_station_name IS NOT NULL
  UNION ALL -- UNION ALL to get data from start and end stations under the same field
  SELECT DISTINCT end_station_name AS station_name
  FROM `divvy_dataset.2022_tripdata`
  WHERE end_station_name IS NOT NULL
)
SELECT DISTINCT station_name -- outer DISTINCT to dedupe names present in both branches
FROM station_data;
After loading the data into our Google Sheet, we can now look up the latitude and longitude from the City of Chicago dataset for the list of stations we just exported. Should be easy, right?
Turns out the names of some stations differ between the 2 datasets. For example, `Fullerton Ave & Narragansett Ave` became `Public Rack - Fullerton Ave & Narragansett Ave` in the City of Chicago dataset.
We can nest 2 `VLOOKUP()` calls inside an `IFERROR()` function to retry the lookup with the `Public Rack - ` prefix when the exact station name fails:
=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,6,FALSE),(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,6,FALSE)))
A bit better. But there are still some errors. As a great astrobotanist once said:
In the face of overwhelming odds, we're left with only 1 option: we're going to have to code the sh*t out of this.
Or something like that.
So let's create 2 custom formulas using Google Apps Script and its Geocoder class to get the latitude and longitude of an address according to Google Maps.
First, let's create a `GETLAT()` formula to get the latitude:
const GETLAT = (address) => { // =GETLAT("Streeter Dr & Grand Ave")
const { results: [data = null] = [] } = Maps.newGeocoder().geocode(address); // address to lookup
if (data === null) {
throw new Error('Address not found!');
}
const { geometry: { location: { lat } } = {} } = data; // get the latitude
Utilities.sleep(500); // brief pause to stay under the geocoder's rate limit
return `${lat}`; // return the latitude
};
And then a `GETLONG()` formula to get the longitude:
const GETLONG = (address) => { // =GETLONG("Streeter Dr & Grand Ave")
const { results: [data = null] = [] } = Maps.newGeocoder().geocode(address); // address to lookup
if (data === null) {
throw new Error('Address not found!');
}
const { geometry: { location: { lng } } = {} } = data; // get the longitude
Utilities.sleep(500); // brief pause to stay under the geocoder's rate limit
return `${lng}`; // return the longitude
};
These 2 formulas use the `Maps.newGeocoder().geocode()` method, which takes an address string; we then pull either the latitude or the longitude out of the response object.
We can now update the previous formula to fall back on them when both `VLOOKUP()` calls return an error:
=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,6,FALSE),IFERROR(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,6,FALSE),GETLAT("Chicago Divvy "&A2)))
=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,7,FALSE),IFERROR(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,7,FALSE),GETLONG("Chicago Divvy "&A2)))
Before moving further, let's use conditional formatting with a `COUNTIF()` formula to check for duplicate coordinate data.
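For instance, a rule like the following (a sketch, assuming the concatenated coordinates live in column B) highlights every value that appears more than once:
=COUNTIF($B$2:$B,B2)>1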
Annnnnnd turns out there are some:
station_name | coordinate |
---|---|
63rd & Western Ave - north corner | 41.77936 -87.68377 |
63rd & Western Ave - south corner | 41.77936 -87.68377 |
Broadway & Wilson - Truman College Vaccination Site | 41.965221 -87.658139 |
Broadway & Wilson Ave | 41.965221 -87.658139 |
Elizabeth (May) St & Fulton St | 41.88661649 -87.65803993 |
Elizabeth St & Fulton St | 41.88661649 -87.65803993 |
Laflin St & 51st St | 41.801354 -87.662076 |
Laflin St &51st ST | 41.801354 -87.662076 |
Woodlawn Ave & 63rd St - NE | 41.7805135 -87.5962081 |
Woodlawn Ave & 63rd St - SE | 41.7805135 -87.5962081 |
This is because some station names changed throughout the year in the Divvy dataset. To consolidate our data and avoid any duplicates, we'll have to update the names of the following stations:
Current station name | New station name |
---|---|
63rd & Western Ave - north corner | 63rd & Western Ave N |
63rd & Western Ave - south corner | 63rd & Western Ave S |
Broadway & Wilson - Truman College Vaccination Site | Broadway & Wilson Ave |
Elizabeth (May) St & Fulton St | Elizabeth St & Fulton St |
Laflin St &51st ST | Laflin St & 51st St |
Woodlawn Ave & 63rd St - NE | Woodlawn Ave & 63rd St N |
Woodlawn Ave & 63rd St - SE | Woodlawn Ave & 63rd St S |
While we're still working in Google Sheets, why not add the zip code for each station? This will allow for some neat visualizations by neighborhood once we analyze the dataset.
Again, by using the Geocoder class, we can create a custom `GETZIP()` formula to retrieve the ZIP code based on a pair of coordinates:
const GETZIP = (latitude, longitude) => { // =GETZIP(42.23144, 0.873234)
const { results: [data = null] = [] } = Maps.newGeocoder().reverseGeocode(latitude, longitude);
if (data === null) {
throw new Error('Coordinates not found!');
}
Utilities.sleep(500); // brief pause to stay under the geocoder's rate limit
return data.formatted_address.split(',')[2].trim().split(' ')[1]; // e.g. "..., Chicago, IL 60611, USA" -> "60611"
};
Tada! The CSV file is now ready to be uploaded to BigQuery with the following fields:
Field | Data type | Description |
---|---|---|
station_name | String | Name of the station |
lat | Float | Latitude of the station |
long | Float | Longitude of the station |
zip_code | String | Zip code of the station |
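For reference, a sketch of the matching table definition in BigQuery (in practice, the schema can also be auto-detected when uploading the CSV):
CREATE TABLE `divvy_dataset.stations_data` (
  station_name STRING, -- lookup key for the joins later on
  lat FLOAT64,
  long FLOAT64,
  zip_code STRING
);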
Before uploading the table, I took note of the following stations that should be removed:
WEST CHI-WATSON
Pawel Bialowas - Test- PBSC charging station
Divvy Valet - Oakwood Beach
DIVVY CASSETTE REPAIR MOBILE STATION
Base - 2132 W Hubbard
Base - 2132 W Hubbard Warehouse
NewHastings
WestChi
Hastings WH 2
As well as the following strings of characters that should be stripped from station names:
Public Rack -
Pubic Rack -
City Rack -
*
- Charging
- midblock
- midblock south
(Temp)
Time to (finally) start to clean the dataset.
Clean the data
Finally, here we are. Let's start by copying the `2022_tripdata` table so we have a safe backup. The table we'll work on is now named `2022_tripdata_clean`.
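One way to make that copy in BigQuery, as a minimal sketch:
CREATE TABLE `divvy_dataset.2022_tripdata_clean` AS
SELECT *
FROM `divvy_dataset.2022_tripdata`;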
First, let's replace all `docked_bike` values in the `rideable_type` field with `classic_bike`:
UPDATE `divvy_dataset.2022_tripdata_clean`
SET rideable_type = 'classic_bike'
WHERE rideable_type = 'docked_bike';
SELECT DISTINCT rideable_type
FROM `divvy_dataset.2022_tripdata_clean`;
This statement modified 177,474 rows in the dataset.
Then, let's remove all records with missing end station data where the `rideable_type` field is equal to `classic_bike`:
SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IS NULL
AND rideable_type = 'classic_bike';
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IS NULL
AND rideable_type = 'classic_bike';
This statement removed 6,404 rows from the dataset.
Now we will remove records where the trip length is below 60 seconds:
SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE ride_length < 60
LIMIT 100;
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE ride_length < 60;
This statement removed 121,086 rows from the dataset.
Then, let's remove all rides from members where the trip length is above 1 hour and 15 minutes:
SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "member"
AND ride_length > 4500 -- ride_length above 1:15 hours
ORDER BY ride_length
LIMIT 10;
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "member"
AND ride_length > 4500; -- ride_length above 1:15 hours
This statement removed 13,461 rows from the dataset.
And then let's delete all rides from casual riders where the trip length is above 3 hours and 30 minutes:
SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "casual"
AND ride_length > 12600 -- ride_length above 3:30 hours
ORDER BY ride_length
LIMIT 10;
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "casual"
AND ride_length > 12600; -- ride_length above 3:30 hours
This statement removed 7,417 rows from the dataset.
Now, let's remove all same-station trips under 2 minutes:
SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name = end_station_name
AND ride_length < 120
ORDER BY ride_length DESC;
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name = end_station_name
AND ride_length < 120;
This statement removed 23,578 rows from the dataset.
Time to remove all trips starting or ending at the following stations:
WEST CHI-WATSON
Pawel Bialowas - Test- PBSC charging station
Divvy Valet - Oakwood Beach
DIVVY CASSETTE REPAIR MOBILE STATION
Base - 2132 W Hubbard
Base - 2132 W Hubbard Warehouse
NewHastings
WestChi
Hastings WH 2
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name IN (
'WEST CHI-WATSON',
'Pawel Bialowas - Test- PBSC charging station',
'Divvy Valet - Oakwood Beach',
'DIVVY CASSETTE REPAIR MOBILE STATION',
'Base - 2132 W Hubbard',
'Base - 2132 W Hubbard Warehouse',
'NewHastings',
'WestChi',
'Hastings WH 2'
);
DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IN (
'WEST CHI-WATSON',
'Pawel Bialowas - Test- PBSC charging station',
'Divvy Valet - Oakwood Beach',
'DIVVY CASSETTE REPAIR MOBILE STATION',
'Base - 2132 W Hubbard',
'Base - 2132 W Hubbard Warehouse',
'NewHastings',
'WestChi',
'Hastings WH 2'
);
The first statement removed 1,951 rows from the dataset, while the second removed 231 rows.
The dataset now has 5,493,589 records, down from 5,667,717. That's a total of 174,128 records removed.
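A quick count against the cleaned table confirms the math:
SELECT COUNT(*) AS remaining_records -- should return 5,493,589
FROM `divvy_dataset.2022_tripdata_clean`;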
Now let's rename the following stations:
Current station name | New station name |
---|---|
63rd & Western Ave - north corner | 63rd & Western Ave N |
63rd & Western Ave - south corner | 63rd & Western Ave S |
Broadway & Wilson - Truman College Vaccination Site | Broadway & Wilson Ave |
Elizabeth (May) St & Fulton St | Elizabeth St & Fulton St |
Laflin St &51st ST | Laflin St & 51st St |
Woodlawn Ave & 63rd St - NE | Woodlawn Ave & 63rd St N |
Woodlawn Ave & 63rd St - SE | Woodlawn Ave & 63rd St S |
We can achieve that with the following query. Note that BigQuery doesn't support Common Table Expressions in UPDATE statements, so the query may seem a bit long and harder to read:
UPDATE `divvy_dataset.2022_tripdata_clean`
SET
start_station_name =
CASE
WHEN start_station_name = '63rd & Western Ave - north corner' THEN '63rd & Western Ave N'
WHEN start_station_name = '63rd & Western Ave - south corner' THEN '63rd & Western Ave S'
WHEN start_station_name = 'Broadway & Wilson - Truman College Vaccination Site' THEN 'Broadway & Wilson Ave'
WHEN start_station_name = 'Elizabeth (May) St & Fulton St' THEN 'Elizabeth St & Fulton St'
WHEN start_station_name = 'Laflin St &51st ST' THEN 'Laflin St & 51st St'
WHEN start_station_name = 'Woodlawn Ave & 63rd St - NE' THEN 'Woodlawn Ave & 63rd St N'
WHEN start_station_name = 'Woodlawn Ave & 63rd St - SE' THEN 'Woodlawn Ave & 63rd St S'
ELSE start_station_name
END,
end_station_name =
CASE
WHEN end_station_name = '63rd & Western Ave - north corner' THEN '63rd & Western Ave N'
WHEN end_station_name = '63rd & Western Ave - south corner' THEN '63rd & Western Ave S'
WHEN end_station_name = 'Broadway & Wilson - Truman College Vaccination Site' THEN 'Broadway & Wilson Ave'
WHEN end_station_name = 'Elizabeth (May) St & Fulton St' THEN 'Elizabeth St & Fulton St'
WHEN end_station_name = 'Laflin St &51st ST' THEN 'Laflin St & 51st St'
WHEN end_station_name = 'Woodlawn Ave & 63rd St - NE' THEN 'Woodlawn Ave & 63rd St N'
WHEN end_station_name = 'Woodlawn Ave & 63rd St - SE' THEN 'Woodlawn Ave & 63rd St S'
ELSE end_station_name
END
WHERE
start_station_name IN (
'63rd & Western Ave - north corner',
'63rd & Western Ave - south corner',
'Broadway & Wilson - Truman College Vaccination Site',
'Elizabeth (May) St & Fulton St',
'Laflin St &51st ST',
'Woodlawn Ave & 63rd St - NE',
'Woodlawn Ave & 63rd St - SE'
)
OR
end_station_name IN (
'63rd & Western Ave - north corner',
'63rd & Western Ave - south corner',
'Broadway & Wilson - Truman College Vaccination Site',
'Elizabeth (May) St & Fulton St',
'Laflin St &51st ST',
'Woodlawn Ave & 63rd St - NE',
'Woodlawn Ave & 63rd St - SE'
)
This statement modified 28,008 rows in the dataset.
Now, let's remove the following substrings from the `start_station_name` and `end_station_name` fields:
Public Rack -
Pubic Rack -
City Rack -
*
- Charging
- midblock
- midblock south
(Temp)
We'll do that using regex—thank you ChatGPT for this one:
SELECT
DISTINCT start_station_name,
REGEXP_REPLACE(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '') AS cleaned_station_name
FROM `divvy_dataset.2022_tripdata_clean`
WHERE REGEXP_CONTAINS(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');
UPDATE `divvy_dataset.2022_tripdata_clean`
SET start_station_name = REGEXP_REPLACE(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '')
WHERE REGEXP_CONTAINS(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');
SELECT
DISTINCT end_station_name,
REGEXP_REPLACE(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '') AS cleaned_station_name
FROM `divvy_dataset.2022_tripdata_clean`
WHERE REGEXP_CONTAINS(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');
UPDATE `divvy_dataset.2022_tripdata_clean`
SET end_station_name = REGEXP_REPLACE(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '')
WHERE REGEXP_CONTAINS(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');
The first `UPDATE` statement modified 84,146 rows in the dataset, while the second one modified 87,316 rows.
The dataset now has 1,379 stations, down from 1,709.
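To double-check that figure, a quick recount of distinct station names after the renames and substring cleanup (a sketch):
WITH station_data AS (
  SELECT start_station_name AS station_name
  FROM `divvy_dataset.2022_tripdata_clean`
  WHERE start_station_name IS NOT NULL
  UNION DISTINCT -- deduplicates across both branches
  SELECT end_station_name
  FROM `divvy_dataset.2022_tripdata_clean`
  WHERE end_station_name IS NOT NULL
)
SELECT COUNT(*) AS unique_stations -- should return 1,379
FROM station_data;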
Now, let's finish things up by cleaning the coordinate data and then adding the zip code for every station.
First, we will update the latitude and longitude fields with the following query:
UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.start_lat = stations_data.lat, tripdata.start_lng = stations_data.long
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.start_station_name = stations_data.station_name;
UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.end_lat = stations_data.lat, tripdata.end_lng = stations_data.long
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.end_station_name = stations_data.station_name;
The first `UPDATE` statement modified 4,694,387 rows in the dataset, while the second one modified 4,653,520 rows.
Let's quickly check that every station now has exactly 1 coordinate pair:
WITH coordinates_stations AS (
SELECT
start_station_name AS station_name,
CONCAT(start_lat, ' ', start_lng) AS coordinate
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
UNION DISTINCT -- UNION DISTINCT to keep unique station/coordinate pairs
SELECT
end_station_name AS station_name,
CONCAT(end_lat, ' ', end_lng) AS coordinate
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
COUNT(DISTINCT station_name) AS stations_count,
COUNT(DISTINCT coordinate) AS coordinate_count
FROM coordinates_stations;
Yay! There are now 1,379 unique coordinates, down from 1,952,581 coordinates:
stations_count | coordinate_count |
---|---|
1379 | 1379 |
And now, let's add `start_station_zip` and `end_station_zip` fields, and populate them by joining the zip code data from the `stations_data` table:
ALTER TABLE `divvy_dataset.2022_tripdata_clean`
ADD COLUMN start_station_zip STRING,
ADD COLUMN end_station_zip STRING;
UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.start_station_zip = stations_data.zip_code
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.start_station_name = stations_data.station_name;
UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.end_station_zip = stations_data.zip_code
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.end_station_name = stations_data.station_name;
The first `UPDATE` statement updated 4,694,387 rows and the second one 4,653,520.
Alright, we're finally done with the cleaning step and the dataset is now ready to be analyzed. That was honestly quite a lot of work, but now we can confidently use data that we know is trustworthy. Plus, we know this dataset inside out, which will make the analysis process a breeze!
See you in part 2!