if (!require("tidyverse")) install.packages("tidyverse")
if (!require("lubridate")) install.packages("lubridate")
if (!require("DT")) install.packages("DT")
Transit Data Analysis
Introduction
This analysis will examine the fiscal characteristics of major U.S. public transit systems using publicly available data. For more details on the problem description, please refer to: Mini-Project #01. The primary objective is to answer key questions related to transit agencies, focusing on areas such as farebox recovery performance, ridership trends, and operating expenses. The analysis will involve tasks such as renaming columns, recoding modes, and addressing instructor-specified questions using various transit data sources.
Data Sources
We will use data from the National Transit Database as our primary source. In particular, since we want to analyze farebox revenues, total number of trips, total number of vehicle miles traveled, and total revenues and expenses by source, we will need to analyze several different tables. (Clicking link will download the files.)
- The 2022 Fare Revenue table
- July Monthly Ridership tables
- The 2022 Operating Expenses reports
Library Setup
Install Required Packages
We will be analyzing various data from various sources. Following libraries are needed for this analysis. First check if the library is already installed and then install if not installed.
Load the packages
Once the packages are installed, those will be loaded to the workspace so that they can be used later.
library(tidyverse)
library(readxl)
library(readr)
library(lubridate)
library(DT)
Load Data
Since we have now setup libraries, we will now download the data to our project so that we can use later fo our analysis. You might get an error when trying to download the file programmatically. If the error persists, download the files manually and rename those and copy them to project folder.
Loading Fare Revenue Data
We will first Load Fare revenue data from 2022 Fare Revenue table. This table Contains data on revenues a transit agency earns from carrying passengers, organized by mode and type of service. Reported as funds earned, funds expended on operations, and funds expended on capital.
# Let's start with Fare Revenue
library(tidyverse)
if (!file.exists("2022_fare_revenue.xlsx")) {
# This should work _in theory_ but in practice it's still a bit finicky
# If it doesn't work for you, download this file 'by hand' in your
# browser and save it as "2022_fare_revenue.xlsx" in your project
# directory.
download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx",
destfile = "2022_fare_revenue.xlsx",
quiet = FALSE,
method = "wget"
)
}<- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
FARES select(
-`State/Parent NTD ID`,
-`Reporter Type`,
-`Reporting Module`,
-`TOS`,
-`Passenger Paid Fares`,
-`Organization Paid Fares`
|>
) filter(`Expense Type` == "Funds Earned During Period") |>
select(-`Expense Type`) |>
group_by(
`NTD ID`, # Sum over different `TOS` for the same `Mode`
`Agency Name`, # These are direct operated and sub-contracted
`Mode`
|> # of the same transit modality
) # Not a big effect in most munis (significant DO
# tends to get rid of sub-contractors), but we'll sum
# to unify different passenger experiences
summarize(`Total Fares` = sum(`Total Fares`)) |>
ungroup()
Next load Expenses
The 2022 Annual dataset containing data on expenses applied to operate public transportation services for each agency, by mode, and type of service operated. Divides expenses among NTD expense functions and object classes.
# Next, expenses
if (!file.exists("2022_expenses.csv")) {
# This should work _in theory_ but in practice it's still a bit finicky
# If it doesn't work for you, download this file 'by hand' in your
# browser and save it as "2022_expenses.csv" in your project
# directory.
download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true",
destfile = "2022_expenses.csv",
quiet = FALSE,
method = "wget"
)
}<- readr::read_csv("2022_expenses.csv") |>
EXPENSES select(
`NTD ID`,
`Agency`,
`Total`,
`Mode`
|>
) mutate(`NTD ID` = as.integer(`NTD ID`)) |>
rename(Expenses = Total) |>
group_by(`NTD ID`, `Mode`) |>
summarize(Expenses = sum(Expenses)) |>
ungroup()
Merge two and create Financials
We won’t need all the information from both the tables. So let’s just join these two tables on columnn NTD ID and Mode since both are present in both the tables and create FINANCIALs.
<- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`)) FINANCIALS
Monthly Transit Numbers
We will now load Contains monthly-updated service information reported by urban Full Reporters.
# Monthly Transit Numbers
library(tidyverse)
if (!file.exists("ridership.xlsx")) {
# This should work _in theory_ but in practice it's still a bit finicky
# If it doesn't work for you, download this file 'by hand' in your
# browser and save it as "ridership.xlsx" in your project
# directory.
download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx",
destfile = "ridership.xlsx",
quiet = FALSE,
method = "wget"
)
}<- readxl::read_xlsx("ridership.xlsx", sheet = "UPT") |>
TRIPS filter(`Mode/Type of Service Status` == "Active") |>
select(
-`Legacy NTD ID`,
-`Reporter Type`,
-`Mode/Type of Service Status`,
-`UACE CD`,
-`TOS`
|>
) pivot_longer(-c(`NTD ID`:`3 Mode`),
names_to = "month",
values_to = "UPT"
|>
) drop_na() |>
mutate(month = my(month)) # Parse _m_onth _y_ear date specs
<- readxl::read_xlsx("ridership.xlsx", sheet = "VRM") |>
MILES filter(`Mode/Type of Service Status` == "Active") |>
select(
-`Legacy NTD ID`,
-`Reporter Type`,
-`Mode/Type of Service Status`,
-`UACE CD`,
-`TOS`
|>
) pivot_longer(-c(`NTD ID`:`3 Mode`),
names_to = "month",
values_to = "VRM"
|>
) drop_na() |>
group_by(
`NTD ID`, `Agency`, `UZA Name`,
`Mode`, `3 Mode`, month
|>
) summarize(VRM = sum(VRM)) |>
ungroup() |>
mutate(month = my(month)) # Parse _m_onth _y_ear date specs
Create USAGE by joining two tables
<- inner_join(TRIPS, MILES) |>
USAGE mutate(`NTD ID` = as.integer(`NTD ID`))
Now, let’s review the data so far.
sample_n(USAGE, 1000) |>
mutate(month = as.character(month)) |>
::datatable(options = list(
DTpageLength = 5
))
Tasks
Now, we will complete the tasks mentioned in this page
Rename a column: UZA Name
to metro_area
.
<- USAGE |> rename(metro_area = "UZA Name") USAGE
We will also rename few other columns to make them more readable
<- USAGE |>
USAGE rename(Passenger_Trips = UPT, Vehicle_Miles = VRM)
Mode
column
Find Unique Modes and Print.
<- USAGE |>
unique_modes distinct(Mode)
print(unique_modes)
# A tibble: 18 × 1
Mode
<chr>
1 DR
2 FB
3 MB
4 SR
5 TB
6 VP
7 CB
8 RB
9 LR
10 YR
11 MG
12 CR
13 AR
14 TR
15 HR
16 IP
17 PB
18 CC
Now we will get the meaning of these symbols from NDT website. Once we have the meaning for each Acronyms, we will replace using case-when
.
<- USAGE |>
USAGE mutate(Mode = case_when(
== "DR" ~ "Demand Response",
Mode == "FB" ~ "Ferryboat",
Mode == "MB" ~ "Motorbus",
Mode == "SR" ~ "Streetcar Rail",
Mode == "TB" ~ "Trolleybus",
Mode == "VP" ~ "Vanpool",
Mode == "CB" ~ "Commuter Bus",
Mode == "RB" ~ "Bus Rapid Transit",
Mode == "LR" ~ "Light Rail",
Mode == "YR" ~ "Hybrid Rail",
Mode == "MG" ~ "Monorail/Automated Guideway",
Mode == "CR" ~ "Commuter Rail",
Mode == "AR" ~ "Alaska Railroad",
Mode == "TR" ~ "Aerial Tramway",
Mode == "HR" ~ "Heavy Rail",
Mode == "IP" ~ "Inclined Plane",
Mode == "PB" ~ "Publico",
Mode == "CC" ~ "Cable Car",
Mode TRUE ~ "Unknown"
))
dplyr
1. What transit agency had the most total VRM in this sample?
|>
USAGE group_by(Agency) |>
summarize(Total_VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
arrange(desc(Total_VRM)) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Total_VRM", digits = 0, mark = ",")
2. What transit mode had the most total VRM in this sample?
|>
USAGE group_by(Mode) |>
summarize(Total_VRM = sum(Vehicle_Miles, na.rm=TRUE)) |>
arrange(desc(Total_VRM)) |>
datatable(options = list(pageLength = 1, dom = 't'), # Only display top row
rownames = FALSE) |>
formatRound("Total_VRM", digits = 0, mark = ",")
3. How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
<- USAGE |>
total_trips filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2024-05-01") |>
summarize(Total_Trips = sum(Passenger_Trips, na.rm = TRUE)) |>
pull(Total_Trips)
<- sprintf(
message "There were %s trips taken on the NYC Subway (Heavy Rail) in May 2024.",
format(total_trips, big.mark = ",")
)
cat(message)
There were 180,458,819 trips taken on the NYC Subway (Heavy Rail) in May 2024.
5. How much did NYC subway ridership fall between April 2019 and April 2020?
To solve this, we will first find ridership for 2019 and 2020 separately. Then we will subtract to get change and get percentage.
# Filter and summarize data for April 2019
<- USAGE |>
april_2019 filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2019-04-01") |>
summarize(Total_Trips_2019 = sum(Passenger_Trips, na.rm = TRUE)) |>
pull(Total_Trips_2019)
# Filter and summarize data for April 2020
<- USAGE |>
april_2020 filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2020-04-01") |>
summarize(Total_Trips_2020 = sum(Passenger_Trips, na.rm = TRUE)) |>
pull(Total_Trips_2020)
# Calculate the absolute difference and percentage drop
<- april_2019 - april_2020
ridership_difference <- (ridership_difference / april_2019) * 100
percentage_drop
# Print the custom message with the result and percentage drop
<- sprintf(
message "NYC subway ridership fell by %s trips between April 2019 and April 2020, which is a %.2f%% decrease.",
format(ridership_difference, big.mark = ","), percentage_drop
)
cat(message)
NYC subway ridership fell by 211,969,660 trips between April 2019 and April 2020, which is a 91.28% decrease.
Find three more interesting transit facts in this data other than those above.
1. Top 5 Transit Agencies by Total Passenger Trips
|>
USAGE group_by(Agency) |>
summarize(Total_Trips = sum(Passenger_Trips, na.rm = TRUE)) |>
arrange(desc(Total_Trips)) |>
head(5) |>
datatable(options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
formatRound("Total_Trips", digits = 0, mark = ",")
2. Top 5 Transit Modes by Total Vehicle Miles
<- USAGE |>
top_modes_vrm group_by(Mode) |>
summarize(Total_VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
arrange(desc(Total_VRM)) |>
head(5)
datatable(top_modes_vrm, options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
formatRound("Total_VRM", digits = 0, mark = ",")
3. Top 5 Agencies with Largest Decrease in Ridership between 2019 and 2020
<- USAGE |>
ridership_decline group_by(Agency) |>
summarize(
Trips_2019 = sum(ifelse(month == "2019-04-01", Passenger_Trips, NA), na.rm = TRUE),
Trips_2020 = sum(ifelse(month == "2020-04-01", Passenger_Trips, NA), na.rm = TRUE)
|>
) mutate(Decline = Trips_2019 - Trips_2020) |>
arrange(desc(Decline)) |>
head(5)
datatable(ridership_decline, options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
formatRound(c("Trips_2019", "Trips_2020", "Decline"), digits = 0, mark = ",")
Create a new table from USAGE
that has annual total (sum) UPT
and VRM
for 2022. This will require use of the group_by
, summarize
, and filter
functions. You will also want to use the year
function, to extract a year from the month
column.
The resulting table should have the following columns:
NTD ID
Agency
metro_area
Mode
UPT
VRM
Make sure to ungroup
your table after creating it.
Name this table USAGE_2022_ANNUAL
.
This will be done with following command.
# Create the USAGE_2022_ANNUAL table
<- USAGE |>
USAGE_2022_ANNUAL # Extract the year from the month column
mutate(year = year(month)) |>
# Filter for the year 2022
filter(year == 2022) |>
# Group by the necessary columns
group_by(`NTD ID`, Agency, metro_area, Mode) |>
# Summarize total UPT and VRM for the year
summarize(UPT = sum(Passenger_Trips, na.rm = TRUE), VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
# Ungroup the table
ungroup()
Let’s verify that all the columns are there.
cat(colnames(USAGE_2022_ANNUAL), sep = "\n")
NTD ID
Agency
metro_area
Mode
UPT
VRM
Now, Let’s join with FINANCIALS
to create single table and create USAGE_AND_FINANCIALS. However, before joining, let’s view the table Financials:
|>
FINANCIALS ::datatable(options = list(pageLength = 5)) DT
Here we notice that Mode is Acronym. However, USAGE_2022_ANNUAL
doesn’t have acronyms. To make sure we can join both the table, let’s change FINANCIALS
mode to match USAGE_2022_ANNUAL.
<- FINANCIALS |>
FINANCIALS mutate(Mode = case_when(
== "DR" ~ "Demand Response",
Mode == "FB" ~ "Ferryboat",
Mode == "MB" ~ "Motorbus",
Mode == "SR" ~ "Streetcar Rail",
Mode == "TB" ~ "Trolleybus",
Mode == "VP" ~ "Vanpool",
Mode == "CB" ~ "Commuter Bus",
Mode == "RB" ~ "Bus Rapid Transit",
Mode == "LR" ~ "Light Rail",
Mode == "YR" ~ "Hybrid Rail",
Mode == "MG" ~ "Monorail/Automated Guideway",
Mode == "CR" ~ "Commuter Rail",
Mode == "AR" ~ "Alaska Railroad",
Mode == "TR" ~ "Aerial Tramway",
Mode == "HR" ~ "Heavy Rail",
Mode == "IP" ~ "Inclined Plane",
Mode == "PB" ~ "Publico",
Mode == "CC" ~ "Cable Car",
Mode TRUE ~ "Unknown"
))
Now, let’s join to create USAGE_AND_FINANCIALS
. We will join NTD ID
and Mode
as they are present in both the tables to create USAGE_AND_FINANCIALS
.
<- left_join(
USAGE_AND_FINANCIALS
USAGE_2022_ANNUAL,
FINANCIALS,join_by(`NTD ID`, Mode)
|>
) drop_na()
Let’s view few records to make sure we have them:
|>
USAGE_AND_FINANCIALS ::datatable(options = list(pageLength = 5)) DT
Before we answer the questions, we will rename few columns to make them more readable:
<- USAGE_AND_FINANCIALS |>
USAGE_AND_FINANCIALS rename(Passenger_Trips = UPT, Vehicle_Miles = VRM)
Using the USAGE_AND_FINANCIALS
table, answer the following questions:
1. Which transit system (agency and mode) had the most UPT in 2022?
|>
USAGE_AND_FINANCIALS select(Agency, Mode, Passenger_Trips) |>
arrange(desc(Passenger_Trips)) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Passenger_Trips", digits = 0, mark = ",")
2. Which transit system (agency and mode) had the highest farebox recovery (Total Fares to Expenses)?
|>
USAGE_AND_FINANCIALS mutate(Farebox_Recovery = `Total Fares` / Expenses) |>
filter(!is.na(`Expenses`) &`Expenses`>0) |>
arrange(desc(Farebox_Recovery)) |>
select(Agency, Mode, Farebox_Recovery) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Farebox_Recovery", mark = ",")
3 Which transit system (agency and mode) has the lowest expenses per UPT?
|>
USAGE_AND_FINANCIALS mutate(Expenses_per_UPT = Expenses / Passenger_Trips) |>
arrange(Expenses_per_UPT) |>
select(Agency, Mode, Expenses_per_UPT) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Expenses_per_UPT", mark = ",")
4. Which transit system (agency and mode) has the highest total fares per UPT?
|>
USAGE_AND_FINANCIALS mutate(Fares_per_UPT = `Total Fares` / Passenger_Trips) |>
arrange(desc(Fares_per_UPT)) |>
select(Agency, Mode, Fares_per_UPT) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Fares_per_UPT", mark = ",")
5. Which transit system (agency and mode) has the lowest expenses per VRM?
|>
USAGE_AND_FINANCIALS mutate(Expenses_per_VRM = Expenses / Vehicle_Miles) |>
arrange(Expenses_per_VRM) |>
select(Agency, Mode, Expenses_per_VRM) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Expenses_per_VRM", mark = ",")
6. Which transit system (agency and mode) has the highest total fares per VRM?
|>
USAGE_AND_FINANCIALS mutate(Fares_per_VRM = `Total Fares` / Vehicle_Miles) |>
arrange(desc(Fares_per_VRM)) |>
select(Agency, Mode, Fares_per_VRM) |>
datatable(
options = list(pageLength = 1, dom = "t"), # Only display top row
rownames = FALSE
|>
) formatRound("Fares_per_VRM", mark = ",")
Conclusion
In my view, the Transit Authority of Central Kentucky’s Vanpool stands out as the most efficient transit system due to its farebox recovery ratio exceeding 100%, meaning it generates more fare revenue than its operating costs. This high level of financial self-sufficiency is uncommon in public transit and makes it highly efficient from a financial sustainability perspective.
Overall, this was an interesting assignment to understand basics DT operations using Transportation data. This analysis provided valuable insights into the financial and operational performance of different transit agencies and modes.
Key points from this analysis
- MTA New York City Transit had the most passenger trips in 2022.
- NYC Subway saw a 91.28% drop in ridership between April 2019 and April 2020 due to COVID-19.
- Transit Authority of Central Kentucky’s Vanpool had the highest farebox recovery, covering costs effectively through fare revenue.
- North Carolina State University’s Motorbus service had the lowest expenses per passenger trip.
- The Motorbus mode, across all agencies, is the one that collects the most fare revenue for every mile the buses travel while carrying passengers.
- The transit system with the highest total fares per Vehicle Revenue Mile is the Chicago Water Taxi in the Ferryboat mode.