Chicago Bike-Share Analysis Part 1: Cleaning

TL;DR: I cleaned a dataset of more than 5 million records using SQL and Google Apps Script.

I love riding bikes. When I'm exploring a new city, I often try its public bike system. I think it's the most convenient way to discover a new city and find its hidden gems.

I've never been to Chicago, but I found out that Divvy, Chicago's bike-sharing system, releases historical monthly trip data under this license. The data provided is quite complete, which makes it interesting to explore.

But before moving further, let's get some context.

First things first: the bikes. Riders can choose between 2 kinds of bikes:

  • Classic bikes. They must be docked at a station to end a ride.
  • Electric bikes. They can be locked to a public bike rack within the service area or docked at a station.

Divvy offers 3 plans to use its service:

  • Single ride passes for non-members. It costs $1 to unlock a bike, then $0.17 per minute.
  • Day passes for non-members. It costs $16.50 a day and allows unlimited 3-hour rides for 24 hours.
  • Annual memberships, which give users unlimited 45-minute rides.
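To make the pricing concrete, here's a quick sketch of what a trip costs under each plan (my own illustration of the rules above, not Divvy's actual billing code; the $0.17/minute overage rate for members past 45 minutes is the one Divvy advertises):

```python
def single_ride_cost(minutes: int) -> float:
    """Single ride pass: $1 to unlock the bike, then $0.17 per minute."""
    return round(1.00 + 0.17 * minutes, 2)

def member_ride_cost(minutes: int) -> float:
    """Annual membership: rides up to 45 minutes are included,
    then $0.17 per extra minute."""
    return round(0.17 * max(0, minutes - 45), 2)
```

So a 20-minute single ride comes out to $4.40, while the same trip costs a member nothing.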

So what's the goal of this data analysis? Simple: We want to uncover how annual members and non-members use Divvy bikes differently.

Let's dive in.

Data Preparation

Lyft, which operates Divvy, releases trip history data every month. We will focus this analysis on the year 2022, which means we'll have 12 files to work with.

There are 13 fields in each CSV file, and each record represents 1 ride. Metadata are not provided, but the majority of variables are self-explanatory:

| Field name | Data type | Description |
| --- | --- | --- |
| ride_id | String | Unique ID for each ride |
| rideable_type | String | Bike type |
| started_at | Timestamp | Trip start day and time |
| ended_at | Timestamp | Trip end day and time |
| start_station_name | String | Trip start station |
| start_station_id | String | ID of the start station |
| end_station_name | String | Trip end station |
| end_station_id | String | ID of the end station |
| start_lat | Float | Latitude of the start station |
| start_lng | Float | Longitude of the start station |
| end_lat | Float | Latitude of the end station |
| end_lng | Float | Longitude of the end station |
| member_casual | String | Rider type |

2 fields need some clarification:

  • rideable_type has 3 possible values: classic_bike, docked_bike and electric_bike.
  • member_casual has 2 possible values: casual for users who have bought either a single ride or a day pass, and member for users who have subscribed to an annual membership.

Data bias

Each trip has been anonymized, so there's no personally identifiable information (PII) or credit card information. This makes it impossible to connect past purchases to users to determine, for example, how many trips a casual rider takes before upgrading to an annual membership.

Another limitation is that Divvy does not provide unique IDs for bikes, which makes it impossible to analyze bike-specific data like failure rates.

According to Divvy, the data has been processed to remove trips taken by staff, as well as any trips below 60 seconds, which could be attributed to false starts or riders re-docking a bike to ensure it was locked.

Spoiler alert: nope.

Now that we have everything, let's start processing and cleaning the data.

Data cleaning

I first gave the 12 CSV files a quick once-over to make sure everything was in order. I removed the headers from the February to December datasets and noticed that TIMESTAMP records were stored as strings in the September to December datasets.

After removing the quotation marks from the TIMESTAMP records with search and replace, I merged all 12 files using the command prompt.
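For reference, the header-stripping and merging steps could also be scripted like this (a sketch assuming all monthly files sit in one folder; the file-name pattern is hypothetical):

```python
import csv
import glob

def merge_monthly_files(pattern: str, out_path: str) -> None:
    """Merge monthly trip CSVs into one file, keeping a single header row.
    The csv module also strips the quotation marks around values."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # skip the header on all but the first file
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
```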

The resulting CSV file came in at a hefty 1 GB with over 5.6 million records. That's a ton of rides to sift through! I then stored it in Google Cloud Storage and created a table named 2022_tripdata in a divvy_dataset dataset on BigQuery.

Cleaning strategy

We will start by identifying in the 2022_tripdata table what needs to be cleaned to achieve our business goal: uncovering how annual members and non-members use Divvy bikes differently.

In the second part, we will create a copy of the 2022_tripdata table, name it 2022_tripdata_clean and use it to perform the cleaning, based on the points defined in the previous step.

Duplicates and NULL values

Let's start by checking that there are no duplicate ride_id values and that the rideable_type, started_at, ended_at and member_casual fields contain no NULL values.

SELECT 
  COUNT(ride_id) AS number_of_records,
  COUNT(DISTINCT ride_id) AS unique_ride,
  COUNTIF(rideable_type IS NOT NULL) AS rideable_records,
  COUNTIF(started_at IS NOT NULL) AS started_at_records,
  COUNTIF(ended_at IS NOT NULL) AS ended_at_records,
  COUNTIF(member_casual IS NOT NULL) AS member_records
FROM `divvy_dataset.2022_tripdata`

This statement returns the following results:

| number_of_records | unique_ride | rideable_records | started_at_records | ended_at_records | member_records |
| --- | --- | --- | --- | --- | --- |
| 5667717 | 5667717 | 5667717 | 5667717 | 5667717 | 5667717 |

Every record is a unique ride and all of the important fields have non-null values. Good, we're off to a good start!

Station names and station IDs

Now let's check with the following query if there are records with missing station data:

SELECT
  COUNTIF(start_station_name IS NULL) AS missing_start_station_name,
  COUNTIF(start_station_id IS NULL) AS missing_start_station_id,
  COUNTIF(end_station_name IS NULL) AS missing_end_station_name,
  COUNTIF(end_station_id IS NULL) AS missing_end_station_id
FROM `divvy_dataset.2022_tripdata`;

Yes, indeed! That's a whole lot of NULL data:

| missing_start_station_name | missing_start_station_id | missing_end_station_name | missing_end_station_id |
| --- | --- | --- | --- |
| 833064 | 833064 | 892742 | 892742 |

It also shows that when station data is missing, both the station name and station ID fields are NULL. Moving forward, we will use the station name fields instead of the station ID fields.

Let's then group the results by bike type to further investigate it:

SELECT
  rideable_type,
  COUNTIF(start_station_name IS NULL) AS missing_start_data,
  COUNTIF(end_station_name IS NULL) AS missing_end_data
FROM `divvy_dataset.2022_tripdata`
GROUP BY rideable_type;

We can see that nearly all trips with missing station data come from e-bikes:

| rideable_type | missing_start_data | missing_end_data |
| --- | --- | --- |
| electric_bike | 833064 | 886338 |
| classic_bike | 0 | 3788 |
| docked_bike | 0 | 2616 |

As a reminder, when riding an e-bike, users can lock the bike to a public rack without having to dock it at a station. It also means that users can unlock an e-bike from a public rack rather than from a Divvy station. So these records don't have to be cleaned.

However, some trips where the bike used was a classic_bike or a docked_bike have missing end station data. These might be due to lost or stolen bikes. We will have to remove these records from the dataset.

Bike-type distribution

According to Divvy's website, there are only 2 types of bikes: classic and electric. However, the rideable_type field has 3 possible values: classic_bike, docked_bike and electric_bike.

Let's check the distribution by user type with the following query:

WITH
  bike_type AS (
    SELECT
      rideable_type,
      COUNT(ride_id) AS number_of_rides
    FROM `divvy_dataset.2022_tripdata`
    GROUP BY rideable_type
  )
SELECT
  rideable_type,
  number_of_rides,
  ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) as share_per_bike_type
FROM bike_type
ORDER BY number_of_rides DESC;

Turns out that only 3% of the records have docked_bike as the rideable_type:

| rideable_type | number_of_rides | share_per_bike_type |
| --- | --- | --- |
| electric_bike | 2889029 | 0.5097 |
| classic_bike | 2601214 | 0.459 |
| docked_bike | 177474 | 0.0313 |

I reached out to Lyft's data service and they told me changes were made to the Divvy platform in April 2020. Before that, a classic bike was identified as docked_bike; since April 2020, these bikes are tracked as classic_bike.

Despite the change, there are still instances of the docked_bike value in the 2022 dataset. I guess the implementation is not perfect yet. Anyway, we will have to replace all of these instances with classic_bike.

Trip length

According to Divvy's website, all trips from staff and those below 60 seconds have been removed from their datasets.

Let's add a new field called ride_length to verify this and populate it with the time difference between the ended_at and started_at fields using the following query:

ALTER TABLE `divvy_dataset.2022_tripdata`
ADD COLUMN ride_length INT64;

UPDATE `divvy_dataset.2022_tripdata`
SET ride_length = TIMESTAMP_DIFF(ended_at, started_at, SECOND)
WHERE TRUE;
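For intuition, TIMESTAMP_DIFF(ended_at, started_at, SECOND) is just the elapsed time between the two timestamps in whole seconds. A quick Python equivalent (my own illustration, not part of the pipeline) would be:

```python
from datetime import datetime

def ride_length_seconds(started_at: str, ended_at: str) -> int:
    """Mirror BigQuery's TIMESTAMP_DIFF(ended_at, started_at, SECOND)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(started_at, fmt)
    end = datetime.strptime(ended_at, fmt)
    return int((end - start).total_seconds())
```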

Next, let's run the following query to get a quick glance at the distribution:

SELECT
  COUNT(ride_id) AS total_rides,
  COUNTIF(ride_length < 60) AS rides_below_60secs,
  COUNTIF(ride_length > 10800) AS rides_above_3hrs
FROM `divvy_dataset.2022_tripdata`;

The results clearly show that trips below 60 seconds have not been properly removed:

| total_rides | rides_below_60secs | rides_above_3hrs |
| --- | --- | --- |
| 5667717 | 121089 | 19592 |

There are also some trips above 3 hours, which may indicate lost or stolen bikes.

Let's dig further and see how the distribution of ride durations looks, starting with members:

WITH
  trip_length AS (
    SELECT
      CASE
        WHEN ride_length < 60 THEN 'Less than 1 min'
        WHEN ride_length BETWEEN 60 AND 3600 THEN '1min to 1 hr'
        WHEN ride_length BETWEEN 3601 AND 7200 THEN '1hr to 2hrs'
        WHEN ride_length BETWEEN 7201 AND 10800 THEN '2hrs to 3hrs'
        WHEN ride_length BETWEEN 10801 AND 14400 THEN '3 hrs to 4hrs'
        ELSE 'More than 4 hrs'
      END AS ride_length,
      COUNT(ride_id) AS number_of_rides,
      CAST(ROUND(AVG(TIMESTAMP_DIFF(ended_at, started_at, SECOND))) AS INT64) AS avg_ride_length -- CAST to INTEGER to construct an INTERVAL object
    FROM `divvy_dataset.2022_tripdata`
    WHERE member_casual = 'member'
    GROUP BY ride_length       
  )
SELECT
  ride_length,
  number_of_rides,
  ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) AS share_ride_length,
  MAKE_INTERVAL(second => avg_ride_length) AS avg_ride_length_interval
FROM trip_length
ORDER BY avg_ride_length;

More than 99% of bike trips from members were below 1 hour, and the 1-minute-to-1-hour cohort had an average trip duration of 11 minutes and 49 seconds:

| ride_length | number_of_rides | share_ride_length | avg_ride_length_interval |
| --- | --- | --- | --- |
| Less than 1 min | 73199 | 0.0219 | 0-0 0 0:0:15 |
| 1min to 1 hr | 3248750 | 0.971 | 0-0 0 0:11:49 |
| 1hr to 2hrs | 17106 | 0.0051 | 0-0 0 1:17:41 |
| 2hrs to 3hrs | 2619 | 0.0008 | 0-0 0 2:25:32 |
| 3 hrs to 4hrs | 1178 | 0.0004 | 0-0 0 3:26:37 |
| More than 4 hrs | 2833 | 0.0008 | 0-0 0 12:50:22 |

As a reminder, members can use a bike for up to 45 minutes. They can keep the bike out for longer, but they'll have to pay an extra $0.17 per minute.

Based on the extra per-minute charge and the average ride length of 1 hour and 17 minutes in the 1-hour-to-2-hours cohort, we will give members an additional 30-minute window and remove all records from members where the trip duration exceeded 1 hour and 15 minutes.

Regarding casual riders, it seems that they ride for a longer period than members. That's something we could expect since members probably use Divvy to commute while casual riders may use it for leisure or sightseeing:

| ride_length | number_of_rides | share_ride_length | avg_ride_length_interval |
| --- | --- | --- | --- |
| Less than 1 min | 47890 | 0.0206 | 0-0 0 0:0:23 |
| 1min to 1 hr | 2127510 | 0.9162 | 0-0 0 0:16:7 |
| 1hr to 2hrs | 108251 | 0.0466 | 0-0 0 1:21:13 |
| 2hrs to 3hrs | 22800 | 0.0098 | 0-0 0 2:23:54 |
| 3 hrs to 4hrs | 4995 | 0.0022 | 0-0 0 3:22:54 |
| More than 4 hrs | 10586 | 0.0046 | 0-0 0 31:56:16 |

As a reminder, casual riders with a day pass can use a classic bike for up to 3 hours. Like members, we'll give them an extra 30-minute window and remove every record from casual riders where the trip duration was above 3 hours and 30 minutes.
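Putting the cutoffs together, the trip-length removal rule we'll apply later can be sketched like this (a sketch of the logic only, not the actual BigQuery DELETE statements):

```python
MEMBER_MAX_SECONDS = 4500    # 1 hour 15 minutes
CASUAL_MAX_SECONDS = 12600   # 3 hours 30 minutes

def should_remove(member_casual: str, ride_length: int) -> bool:
    """Flag rides to drop: too short, or too long for the rider's plan."""
    if ride_length < 60:  # below 60 seconds: false starts, re-docking
        return True
    if member_casual == "member":
        return ride_length > MEMBER_MAX_SECONDS
    return ride_length > CASUAL_MAX_SECONDS
```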

Same-station trips

It is common to begin and end a trip at the same station when using a bike-sharing service. There are plenty of reasons to do so. It could be because of a false start, a faulty bike, or even just a quick errand.

Let's get an idea of how many trips have the same start and end station with the following query:

SELECT 
    COUNT(ride_id) AS total_rides,
    (SELECT
        COUNT(ride_id)
        FROM `divvy_dataset.2022_tripdata`
        WHERE start_station_name = end_station_name
    ) AS same_station_rides
FROM `divvy_dataset.2022_tripdata`

Turns out that more than 293,000 records have the same start and end station. That's 5.18% of the total rides:

| total_rides | same_station_rides |
| --- | --- |
| 5667717 | 293619 |

Let's get more insight into the distribution of these same-station trips by rider type:

WITH same_stations_trips AS (
  SELECT
    DISTINCT member_casual, -- DISTINCT since we can't use PERCENTILE_CONT with GROUP BY
    COUNT(ride_id) OVER(PARTITION BY member_casual) AS number_of_rides, -- Use of PARTITION BY since we can't use GROUP BY
    CAST(AVG(ride_length) OVER(PARTITION BY member_casual) AS INT64) AS avg_ride_length,
    CAST(PERCENTILE_CONT(ride_length, 0.5) OVER(PARTITION BY member_casual) AS INT64) AS median_ride_length -- CAST to INTEGER to construct an INTERVAL object
  FROM `divvy_dataset.2022_tripdata`
  WHERE start_station_name = end_station_name
)

SELECT
  member_casual,
  number_of_rides,
  ROUND(SAFE_DIVIDE(number_of_rides, SUM(number_of_rides) OVER()), 4) AS share,
  MAKE_INTERVAL(second => avg_ride_length) AS avg_ride_length_interval,
  MAKE_INTERVAL(second => median_ride_length) AS median_ride_length_interval
FROM same_stations_trips
ORDER BY number_of_rides DESC;

Nearly 60% of these same-station trips were taken by casual riders:

| member_casual | number_of_rides | share | avg_ride_length_interval | median_ride_length_interval |
| --- | --- | --- | --- | --- |
| casual | 175571 | 0.598 | 0-0 0 0:35:45 | 0-0 0 0:19:47 |
| member | 118048 | 0.402 | 0-0 0 0:12:58 | 0-0 0 0:2:11 |

This could indicate that casual riders might use the service for leisurely sightseeing around a particular park or neighborhood.

But the most interesting bit is with members. The median is actually quite low: 2 minutes and 11 seconds. This suggests that they may have unlocked a bike and then returned it due to a defect.

So let's use this result as a baseline and remove all same-station trips shorter than 2 minutes.

Time to move to coordinates.

Spoiler alert: it's going to be fun.

Coordinates data

There are 1,709 stations in total. So there should be, in total, 1,709 distinct sets of coordinates, right? Right. Well, let's check that with the following query:

WITH coordinates_stations AS (
  SELECT
    start_station_name AS station_name,
    CONCAT(start_lat, ' ', start_lng) AS coordinates
  FROM `divvy_dataset.2022_tripdata`
  WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
  UNION ALL -- UNION ALL to get data from start and end stations under the same field
  SELECT
    end_station_name AS station_name,
    CONCAT(end_lat, ' ', end_lng) AS coordinates
  FROM `divvy_dataset.2022_tripdata`
  WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
  COUNT(DISTINCT station_name) AS unique_stations,
  COUNT(DISTINCT coordinates) AS unique_coordinates,
  ROUND(COUNT(DISTINCT coordinates) / COUNT(DISTINCT station_name)) AS avg_coordinates_per_station
FROM coordinates_stations;

There are 1,952,581 unique coordinates:

| unique_stations | unique_coordinates | avg_coordinates_per_station |
| --- | --- | --- |
| 1709 | 1952581 | 1143.0 |

Yup, that's right. One million, nine hundred fifty-two thousand, five hundred eighty-one. Unique. Coordinates. That's an average of 1,143 coordinates per station. Definitely not good.

Let's find out which station has the most coordinates with the following query:

WITH coordinates_stations AS (
  SELECT
    start_station_name AS station_name,
    CONCAT(start_lat, ' ', start_lng) AS coordinates
  FROM `divvy_dataset.2022_tripdata`
  WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
  UNION ALL -- UNION ALL to get data from start and end stations under the same field
  SELECT
    end_station_name AS station_name,
    CONCAT(end_lat, ' ', end_lng) AS coordinates
  FROM `divvy_dataset.2022_tripdata`
  WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
  DISTINCT station_name,
  COUNT(DISTINCT coordinates) AS unique_coordinates
FROM coordinates_stations
GROUP BY station_name
ORDER BY unique_coordinates DESC
LIMIT 3;

Annnd, drum roll please, the winner is... Streeter Dr & Grand Ave station, with an astonishing 22,366 distinct sets of coordinates!

| station_name | unique_coordinates |
| --- | --- |
| Streeter Dr & Grand Ave | 22366 |
| Wells St & Concord Ln | 15746 |
| Kingsbury St & Kinzie St | 13857 |

Let's plot these 22,366 coordinate points on a map using Looker:

While most of them are clustered around the same area, some are scattered around town. This will make any visualization based on these coordinates far from insightful once we start analyzing the dataset.

In short, we can't trust these coordinates. Ideally, each station should have 1 set of coordinates. No more.

Time to do some cleaning.

Cleaning the coordinates data

First, let's download the Divvy Bicycle Stations dataset from the City of Chicago Data Portal and import it into a Google Sheet.

This dataset has the following fields:

| Field | Data type | Description |
| --- | --- | --- |
| Station Name | String | Name of the station |
| ID | Integer | ID of the station |
| Total Docks | Integer | Number of docks in the station |
| Docks in service | Integer | Docks in service |
| Status | String | In Service or Not In Service |
| Latitude | Float | Latitude of the station |
| Longitude | Float | Longitude of the station |
| Location | String | Coordinates of the station between parentheses |

Quick note on this dataset: the station ID field is not consistent with the Divvy trip history dataset.

For instance, the Monticello Ave & Chicago Ave station has the ID set to 301 in the Divvy dataset, while it has the ID set to 1575949489503728316 in the City of Chicago dataset.

Moving forward we will use the station name fields to avoid any confusion.

Let's start by exporting the list of all stations from the Divvy dataset with the following query:

WITH station_data AS (
  SELECT
    DISTINCT start_station_name AS station_name
  FROM `divvy_dataset.2022_tripdata`
  WHERE start_station_name IS NOT NULL
  UNION ALL -- UNION ALL to get data from start and end stations under the same field
  SELECT
    DISTINCT end_station_name AS station_name
  FROM `divvy_dataset.2022_tripdata`
  WHERE end_station_name IS NOT NULL
)
SELECT
  DISTINCT station_name
FROM station_data;

After loading the data into our Google Sheet, we can now look up the latitude and longitude from the City of Chicago dataset for the list of stations we just exported. Should be easy, right?

Turns out the names of some stations changed between the 2 datasets. For example, Fullerton Ave & Narragansett Ave became Public Rack - Fullerton Ave & Narragansett Ave in the City of Chicago dataset.

We can nest 2 VLOOKUP() calls inside an IFERROR() function to retry the lookup with the Public Rack - prefix when the exact name isn't found:

=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,6,FALSE),(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,6,FALSE)))

A bit better. But there are still some errors. As a great astrobotanist once said:

In the face of overwhelming odds, we're left with only 1 option: we're going to have to code the sh*t out of this.

Or something like that.

So let's create 2 custom formulas using Google Apps Script and its Geocoder class to get the latitude and longitude of an address according to Google Maps.

First, let's create a GETLAT() formula to get the latitude:

const GETLAT = (address) => { // =GETLAT("Streeter Dr & Grand Ave")
  const { results: [data = null] = [] } = Maps.newGeocoder().geocode(address); // address to lookup
  if (data === null) {
    throw new Error('Address not found!');
  }
  const { geometry: { location: { lat } } = {} } = data; // get the latitude
  Utilities.sleep(500); // pause the loop for 500 milliseconds
  return `${lat}`; // return the latitude
};

And then a GETLONG() formula to get the longitude:

const GETLONG = (address) => { // =GETLONG("Streeter Dr & Grand Ave")
  const { results: [data = null] = [] } = Maps.newGeocoder().geocode(address); // address to lookup
  if (data === null) {
    throw new Error('Address not found!');
  }
  const { geometry: { location: { lng } } = {} } = data; // get the longitude
  Utilities.sleep(500); // pause the loop for 500 milliseconds
  return `${lng}`; // return the longitude
};

These 2 formulas use the Maps.newGeocoder().geocode() method, which takes an address string; each formula then extracts the latitude or the longitude from the response object.

We can now update the previous formula to get the latitude and longitude if the first 2 VLOOKUP() calls return an error:

=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,6,FALSE),IFERROR(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,6,FALSE),GETLAT("Chicago Divvy "&A2)))
=IFERROR(VLOOKUP(A2,Divvy_Bicycle_Stations!A:G,7,FALSE),IFERROR(VLOOKUP("Public Rack - "&A2,Divvy_Bicycle_Stations!A:G,7,FALSE),GETLONG("Chicago Divvy "&A2)))
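The fallback chain in those formulas (exact name first, then the Public Rack - prefixed name, then geocoding) can be sketched like this; the station dictionary and geocode function below are made-up stand-ins, not the real City of Chicago data or the Maps API:

```python
def lookup_coordinates(name, stations, geocode):
    """Try the exact station name, then the 'Public Rack - ' variant,
    then fall back to geocoding the name as a Chicago address."""
    if name in stations:
        return stations[name]
    prefixed = "Public Rack - " + name
    if prefixed in stations:
        return stations[prefixed]
    return geocode("Chicago Divvy " + name)
```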

Before moving further, let's use conditional formatting with a COUNTIF() formula to check for duplicate coordinate data.

Annnnnnd turns out there are some:

| station_name | coordinate |
| --- | --- |
| 63rd & Western Ave - north corner | 41.77936 -87.68377 |
| 63rd & Western Ave - south corner | 41.77936 -87.68377 |
| Broadway & Wilson - Truman College Vaccination Site | 41.965221 -87.658139 |
| Broadway & Wilson Ave | 41.965221 -87.658139 |
| Elizabeth (May) St & Fulton St | 41.88661649 -87.65803993 |
| Elizabeth St & Fulton St | 41.88661649 -87.65803993 |
| Laflin St & 51st St | 41.801354 -87.662076 |
| Laflin St &51st ST | 41.801354 -87.662076 |
| Woodlawn Ave & 63rd St - NE | 41.7805135 -87.5962081 |
| Woodlawn Ave & 63rd St - SE | 41.7805135 -87.5962081 |

This is because some station names changed throughout the year in the Divvy dataset. To consolidate our data and avoid duplicates, we'll have to update the names of the following stations:

| Current station name | New station name |
| --- | --- |
| 63rd & Western Ave - north corner | 63rd & Western Ave N |
| 63rd & Western Ave - south corner | 63rd & Western Ave S |
| Broadway & Wilson - Truman College Vaccination Site | Broadway & Wilson Ave |
| Elizabeth (May) St & Fulton St | Elizabeth St & Fulton St |
| Laflin St &51st ST | Laflin St & 51st St |
| Woodlawn Ave & 63rd St - NE | Woodlawn Ave & 63rd St N |
| Woodlawn Ave & 63rd St - SE | Woodlawn Ave & 63rd St S |

While we're still working in Google Sheets, why not add the zip code for each station? This will allow for some neat visualizations by neighborhood once we analyze the dataset.

Again, using the Geocoder class from the Google Maps API, we can create a custom GETZIP() formula to retrieve the zip code from coordinate data:

const GETZIP = (latitude, longitude) => { // =GETZIP(42.23144, 0.873234)
  const { results: [data = {}] = [] } = Maps.newGeocoder().reverseGeocode(latitude, longitude);
  Utilities.sleep(500); // pause the loop for 500 milliseconds
  return data.formatted_address.split(',')[2].trim().split(' ')[1]; // extract the zip code from the formatted address
};
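The one-liner at the end of GETZIP leans on the shape of Google's formatted addresses ("street, city, STATE ZIP, country"). Here's the same parsing in Python; the sample address in the test is my own example of that shape, not an actual API response:

```python
def extract_zip(formatted_address: str) -> str:
    """Take the third comma-separated part (e.g. 'IL 60611')
    and return its second token, the zip code."""
    return formatted_address.split(",")[2].strip().split(" ")[1]
```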

Tada! The CSV file is now ready to be uploaded to BigQuery with the following fields:

| Field | Data type | Description |
| --- | --- | --- |
| station_name | String | Name of the station |
| lat | Float | Latitude of the station |
| long | Float | Longitude of the station |
| zip_code | String | Zip code of the station |

Before uploading the table, I took note of the following stations that should be removed:

WEST CHI-WATSON
Pawel Bialowas - Test- PBSC charging station
Divvy Valet - Oakwood Beach
DIVVY CASSETTE REPAIR MOBILE STATION
Base - 2132 W Hubbard
Base - 2132 W Hubbard Warehouse
NewHastings
WestChi
Hastings WH 2

As well as the following strings of characters that should be removed:

Public Rack -
Public  Rack -
Pubic Rack -
City Rack -
*
 - Charging
 - midblock
 - midblock south
 (Temp)

Time to (finally) start cleaning the dataset.

Clean the data

Finally, here we are. Let's start by copying the 2022_tripdata table so we keep a safe backup. The table we'll work on is named 2022_tripdata_clean.

First, let's replace all docked_bike values in the rideable_type field with classic_bike:

UPDATE `divvy_dataset.2022_tripdata_clean`
SET rideable_type = 'classic_bike'
WHERE rideable_type = 'docked_bike';

SELECT DISTINCT rideable_type
FROM `divvy_dataset.2022_tripdata_clean`;

This statement modified 177,474 rows in the dataset.

Then, let's remove all records with missing end station data where the rideable_type field is equal to classic_bike:

SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IS NULL
AND rideable_type = 'classic_bike';

DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IS NULL
AND rideable_type = 'classic_bike';

This statement removed 6,404 rows from the dataset.

Now we will remove records where the trip length is below 60 seconds:

SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE ride_length < 60
LIMIT 100;

DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE ride_length < 60;

This statement removed 121,086 rows from the dataset.

Then all rides from members where the trip length is above 1 hour and 15 minutes:

SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "member"
AND ride_length > 4500 -- ride_length above 1:15 hours
ORDER BY ride_length
LIMIT 10;

DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "member"
AND ride_length > 4500; -- ride_length above 1:15 hours

This statement removed 13,461 rows from the dataset.

And then let's delete all rides from casual riders where the trip length is above 3 hours and 30 minutes:

SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "casual"
AND ride_length > 12600 -- ride_length above 3:30 hours
ORDER BY ride_length
LIMIT 10;

DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE member_casual = "casual"
AND ride_length > 12600; -- ride_length above 3:30 hours

This statement removed 7,417 rows from the dataset.

Now, let's remove all same-station trips under 2 minutes:

SELECT *
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name = end_station_name
AND ride_length < 120
ORDER BY ride_length DESC;

DELETE
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name = end_station_name
AND ride_length < 120;

This statement removed 23,578 rows from the dataset.

Time to remove all trips starting or ending from the following stations:

DELETE 
FROM `divvy_dataset.2022_tripdata_clean`
WHERE start_station_name IN (
  'WEST CHI-WATSON',
  'Pawel Bialowas - Test- PBSC charging station',
  'Divvy Valet - Oakwood Beach',
  'DIVVY CASSETTE REPAIR MOBILE STATION',
  'Base - 2132 W Hubbard',
  'Base - 2132 W Hubbard Warehouse',
  'NewHastings',
  'WestChi',
  'Hastings WH 2'
);

DELETE 
FROM `divvy_dataset.2022_tripdata_clean`
WHERE end_station_name IN (
  'WEST CHI-WATSON',
  'Pawel Bialowas - Test- PBSC charging station',
  'Divvy Valet - Oakwood Beach',
  'DIVVY CASSETTE REPAIR MOBILE STATION',
  'Base - 2132 W Hubbard',
  'Base - 2132 W Hubbard Warehouse',
  'NewHastings',
  'WestChi',
  'Hastings WH 2'
);

The first statement removed 1,951 rows from the dataset, while the second removed 231 rows.

The dataset now has 5,493,589 records, down from 5,667,717. That's a total of 174,128 records removed.

Now let's rename the following stations:

| Current station name | New station name |
| --- | --- |
| 63rd & Western Ave - north corner | 63rd & Western Ave N |
| 63rd & Western Ave - south corner | 63rd & Western Ave S |
| Broadway & Wilson - Truman College Vaccination Site | Broadway & Wilson Ave |
| Elizabeth (May) St & Fulton St | Elizabeth St & Fulton St |
| Laflin St &51st ST | Laflin St & 51st St |
| Woodlawn Ave & 63rd St - NE | Woodlawn Ave & 63rd St N |
| Woodlawn Ave & 63rd St - SE | Woodlawn Ave & 63rd St S |

We can achieve that with the following query. Note that BigQuery doesn't support Common Table Expressions in UPDATE statements to return only the data we need, so the query may seem a bit complex and harder to read:

UPDATE `divvy_dataset.2022_tripdata_clean`
SET
  start_station_name =
    CASE
      WHEN start_station_name = '63rd & Western Ave - north corner' THEN '63rd & Western Ave N'
      WHEN start_station_name = '63rd & Western Ave - south corner' THEN '63rd & Western Ave S'
      WHEN start_station_name = 'Broadway & Wilson - Truman College Vaccination Site' THEN 'Broadway & Wilson Ave'
      WHEN start_station_name = 'Elizabeth (May) St & Fulton St' THEN 'Elizabeth St & Fulton St'
      WHEN start_station_name = 'Laflin St &51st ST' THEN 'Laflin St & 51st St'
      WHEN start_station_name = 'Woodlawn Ave & 63rd St - NE' THEN 'Woodlawn Ave & 63rd St N'
      WHEN start_station_name = 'Woodlawn Ave & 63rd St - SE' THEN 'Woodlawn Ave & 63rd St S'
      ELSE start_station_name
    END,
  end_station_name =
    CASE
      WHEN end_station_name = '63rd & Western Ave - north corner' THEN '63rd & Western Ave N'
      WHEN end_station_name = '63rd & Western Ave - south corner' THEN '63rd & Western Ave S'
      WHEN end_station_name = 'Broadway & Wilson - Truman College Vaccination Site' THEN 'Broadway & Wilson Ave'
      WHEN end_station_name = 'Elizabeth (May) St & Fulton St' THEN 'Elizabeth St & Fulton St'
      WHEN end_station_name = 'Laflin St &51st ST' THEN 'Laflin St & 51st St'
      WHEN end_station_name = 'Woodlawn Ave & 63rd St - NE' THEN 'Woodlawn Ave & 63rd St N'
      WHEN end_station_name = 'Woodlawn Ave & 63rd St - SE' THEN 'Woodlawn Ave & 63rd St S'
      ELSE end_station_name
    END
WHERE 
  start_station_name IN (
    '63rd & Western Ave - north corner',
    '63rd & Western Ave - south corner',
    'Broadway & Wilson - Truman College Vaccination Site',
    'Elizabeth (May) St & Fulton St',
    'Laflin St &51st ST',
    'Woodlawn Ave & 63rd St - NE',
    'Woodlawn Ave & 63rd St - SE'
  ) 
  OR 
  end_station_name IN (
    '63rd & Western Ave - north corner',
    '63rd & Western Ave - south corner',
    'Broadway & Wilson - Truman College Vaccination Site',
    'Elizabeth (May) St & Fulton St',
    'Laflin St &51st ST',
    'Woodlawn Ave & 63rd St - NE',
    'Woodlawn Ave & 63rd St - SE'
  );

This statement modified 28,008 rows in the dataset.

Now, let's remove the following substrings from the start_station_name and end_station_name fields:

Public Rack -
Public  Rack -
Pubic Rack -
City Rack -
*
 - Charging
 - midblock
 - midblock south
 (Temp)

We'll do that using regex—thank you ChatGPT for this one:

SELECT
  DISTINCT start_station_name,
  REGEXP_REPLACE(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '') AS cleaned_station_name
FROM `divvy_dataset.2022_tripdata_clean`
WHERE REGEXP_CONTAINS(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');

UPDATE `divvy_dataset.2022_tripdata_clean`
SET start_station_name = REGEXP_REPLACE(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '')
WHERE REGEXP_CONTAINS(start_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');

SELECT
  DISTINCT end_station_name,
  REGEXP_REPLACE(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '') AS cleaned_station_name
FROM `divvy_dataset.2022_tripdata_clean`
WHERE REGEXP_CONTAINS(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');

UPDATE `divvy_dataset.2022_tripdata_clean`
SET end_station_name = REGEXP_REPLACE(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))', '')
WHERE REGEXP_CONTAINS(end_station_name, r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))');

The first UPDATE statement modified 84,146 rows from the dataset, while the second one modified 87,316 rows.
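To sanity-check that pattern, here's the same regex applied with Python's re module to a few sample station names (the test strings are my own examples):

```python
import re

# Same pattern as the BigQuery REGEXP_REPLACE above
PATTERN = r'((Public|Pubic)\s+Rack\s+-\s+|City\s+Rack\s+-\s+|[*]| - Charging| - midblock(?:\s+south)?| \(Temp\))'

def clean_station_name(name: str) -> str:
    """Strip the prefixes and suffixes Divvy sometimes adds to station names."""
    return re.sub(PATTERN, '', name)
```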

The dataset now has 1,379 stations, down from 1,709.

Now, let's finish up by cleaning the coordinate data and adding the zip code for every station.

First, we will update the latitude and longitude fields with the following query:

UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.start_lat = stations_data.lat, tripdata.start_lng = stations_data.long
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.start_station_name = stations_data.station_name;

UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.end_lat = stations_data.lat, tripdata.end_lng = stations_data.long
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.end_station_name = stations_data.station_name;

The first UPDATE statement modified 4,694,387 rows in the dataset, while the second one modified 4,653,520 rows.

Let's quickly check that every station now has exactly 1 set of coordinates:

WITH coordinates_stations AS (
  SELECT
    start_station_name AS station_name,
    CONCAT(start_lat, ' ', start_lng) AS coordinate
  FROM `divvy_dataset.2022_tripdata_clean`
  WHERE start_station_name IS NOT NULL AND start_lat IS NOT NULL AND start_lng IS NOT NULL
  UNION DISTINCT -- UNION DISTINCT to get unique station names
  SELECT
    end_station_name AS station_name,
    CONCAT(end_lat, ' ', end_lng) AS coordinate
  FROM `divvy_dataset.2022_tripdata_clean`
  WHERE end_station_name IS NOT NULL AND end_lat IS NOT NULL AND end_lng IS NOT NULL
)
SELECT
  COUNT(station_name) AS stations_count,
  COUNT(coordinate) AS coordinate_count
FROM coordinates_stations;

Yay! There are now 1,379 unique coordinates, down from 1,952,581 coordinates:

| stations_count | coordinate_count |
| --- | --- |
| 1379 | 1379 |

And now, let's add start_station_zip and end_station_zip fields and populate them by joining the zip code data from the stations_data table:

ALTER TABLE `divvy_dataset.2022_tripdata_clean`
ADD COLUMN start_station_zip STRING,
ADD COLUMN end_station_zip STRING;

UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.start_station_zip = stations_data.zip_code
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.start_station_name = stations_data.station_name;

UPDATE `divvy_dataset.2022_tripdata_clean` AS tripdata
SET tripdata.end_station_zip = stations_data.zip_code
FROM `divvy_dataset.stations_data` AS stations_data
WHERE tripdata.end_station_name = stations_data.station_name;

The first UPDATE statement updated 4,694,387 rows and the second one 4,653,520.

Alright, we're finally done with the cleaning step and the dataset is now ready to be analyzed. That was honestly quite a lot of work, but now we can confidently use data that we know is trustworthy. Plus, we know this dataset inside out, which will make the analysis process a breeze!

See you in part 2!