Transit Data Analysis

Introduction

This analysis will examine the fiscal characteristics of major U.S. public transit systems using publicly available data. For more details on the problem description, please refer to: Mini-Project #01. The primary objective is to answer key questions related to transit agencies, focusing on areas such as farebox recovery performance, ridership trends, and operating expenses. The analysis will involve tasks such as renaming columns, recoding modes, and addressing instructor-specified questions using various transit data sources.

Data Sources

We will use data from the National Transit Database as our primary source. In particular, since we want to analyze farebox revenues, total number of trips, total number of vehicle miles traveled, and total revenues and expenses by source, we will need to analyze several different tables. (Clicking link will download the files.)

The 2022 Fare Revenue table
July Monthly Ridership tables
The 2022 Operating Expenses reports

Library Setup

Install Required Packages

We will be analyzing various data from various sources. Following libraries are needed for this analysis. First check if the library is already installed and then install if not installed.

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("lubridate")) install.packages("lubridate")
if (!require("DT")) install.packages("DT")

Load the packages

Once the packages are installed, those will be loaded to the workspace so that they can be used later.

library(tidyverse)
library(readxl)
library(readr)
library(lubridate)
library(DT)

Load Data

Since we have now setup libraries, we will now download the data to our project so that we can use later fo our analysis. You might get an error when trying to download the file programmatically. If the error persists, download the files manually and rename those and copy them to project folder.

Loading Fare Revenue Data

We will first Load Fare revenue data from 2022 Fare Revenue table. This table Contains data on revenues a transit agency earns from carrying passengers, organized by mode and type of service. Reported as funds earned, funds expended on operations, and funds expended on capital.

# Let's start with Fare Revenue
library(tidyverse)
if (!file.exists("2022_fare_revenue.xlsx")) {
  # This should work _in theory_ but in practice it's still a bit finicky
  # If it doesn't work for you, download this file 'by hand' in your
  # browser and save it as "2022_fare_revenue.xlsx" in your project
  # directory.
  download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx",
    destfile = "2022_fare_revenue.xlsx",
    quiet = FALSE,
    method = "wget"
  )
}
FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
  select(
    -`State/Parent NTD ID`,
    -`Reporter Type`,
    -`Reporting Module`,
    -`TOS`,
    -`Passenger Paid Fares`,
    -`Organization Paid Fares`
  ) |>
  filter(`Expense Type` == "Funds Earned During Period") |>
  select(-`Expense Type`) |>
  group_by(
    `NTD ID`, # Sum over different `TOS` for the same `Mode`
    `Agency Name`, # These are direct operated and sub-contracted
    `Mode`
  ) |> # of the same transit modality
  # Not a big effect in most munis (significant DO
  # tends to get rid of sub-contractors), but we'll sum
  # to unify different passenger experiences
  summarize(`Total Fares` = sum(`Total Fares`)) |>
  ungroup()

Next load Expenses

The 2022 Annual dataset containing data on expenses applied to operate public transportation services for each agency, by mode, and type of service operated. Divides expenses among NTD expense functions and object classes.

# Next, expenses
if (!file.exists("2022_expenses.csv")) {
  # This should work _in theory_ but in practice it's still a bit finicky
  # If it doesn't work for you, download this file 'by hand' in your
  # browser and save it as "2022_expenses.csv" in your project
  # directory.
  download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true",
    destfile = "2022_expenses.csv",
    quiet = FALSE,
    method = "wget"
  )
}
EXPENSES <- readr::read_csv("2022_expenses.csv") |>
  select(
    `NTD ID`,
    `Agency`,
    `Total`,
    `Mode`
  ) |>
  mutate(`NTD ID` = as.integer(`NTD ID`)) |>
  rename(Expenses = Total) |>
  group_by(`NTD ID`, `Mode`) |>
  summarize(Expenses = sum(Expenses)) |>
  ungroup()

Merge two and create Financials

We won’t need all the information from both the tables. So let’s just join these two tables on columnn NTD ID and Mode since both are present in both the tables and create FINANCIALs.

FINANCIALS <- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))

Monthly Transit Numbers

We will now load Contains monthly-updated service information reported by urban Full Reporters.

# Monthly Transit Numbers
library(tidyverse)
if (!file.exists("ridership.xlsx")) {
  # This should work _in theory_ but in practice it's still a bit finicky
  # If it doesn't work for you, download this file 'by hand' in your
  # browser and save it as "ridership.xlsx" in your project
  # directory.
  download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx",
    destfile = "ridership.xlsx",
    quiet = FALSE,
    method = "wget"
  )
}
TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet = "UPT") |>
  filter(`Mode/Type of Service Status` == "Active") |>
  select(
    -`Legacy NTD ID`,
    -`Reporter Type`,
    -`Mode/Type of Service Status`,
    -`UACE CD`,
    -`TOS`
  ) |>
  pivot_longer(-c(`NTD ID`:`3 Mode`),
    names_to = "month",
    values_to = "UPT"
  ) |>
  drop_na() |>
  mutate(month = my(month)) # Parse _m_onth _y_ear date specs
MILES <- readxl::read_xlsx("ridership.xlsx", sheet = "VRM") |>
  filter(`Mode/Type of Service Status` == "Active") |>
  select(
    -`Legacy NTD ID`,
    -`Reporter Type`,
    -`Mode/Type of Service Status`,
    -`UACE CD`,
    -`TOS`
  ) |>
  pivot_longer(-c(`NTD ID`:`3 Mode`),
    names_to = "month",
    values_to = "VRM"
  ) |>
  drop_na() |>
  group_by(
    `NTD ID`, `Agency`, `UZA Name`,
    `Mode`, `3 Mode`, month
  ) |>
  summarize(VRM = sum(VRM)) |>
  ungroup() |>
  mutate(month = my(month)) # Parse _m_onth _y_ear date specs

Create USAGE by joining two tables

USAGE <- inner_join(TRIPS, MILES) |>
  mutate(`NTD ID` = as.integer(`NTD ID`))

Now, let’s review the data so far.

sample_n(USAGE, 1000) |>
  mutate(month = as.character(month)) |>
  DT::datatable(options = list(
    pageLength = 5
  ))

Tasks

Now, we will complete the tasks mentioned in this page

Task 1 - Creating Syntatic Names

Rename a column: UZA Name to metro_area.

USAGE <- USAGE |> rename(metro_area = "UZA Name")

We will also rename few other columns to make them more readable

USAGE <- USAGE |>
  rename(Passenger_Trips = UPT, Vehicle_Miles = VRM)

Task 2: Recoding the Mode column

Find Unique Modes and Print.

unique_modes <- USAGE |>
  distinct(Mode)

print(unique_modes)

# A tibble: 18 × 1
   Mode 
   <chr>
 1 DR   
 2 FB   
 3 MB   
 4 SR   
 5 TB   
 6 VP   
 7 CB   
 8 RB   
 9 LR   
10 YR   
11 MG   
12 CR   
13 AR   
14 TR   
15 HR   
16 IP   
17 PB   
18 CC

Now we will get the meaning of these symbols from NDT website. Once we have the meaning for each Acronyms, we will replace using case-when.

USAGE <- USAGE |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Motorbus",
    Mode == "SR" ~ "Streetcar Rail",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail/Automated Guideway",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramway",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

Task 3: Answering Instructor Specified Questions with dplyr

1. What transit agency had the most total VRM in this sample?

USAGE |>
  group_by(Agency) |>
  summarize(Total_VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
  arrange(desc(Total_VRM)) |>
  datatable(
    options = list(pageLength = 1, dom = "t"), # Only display top row
    rownames = FALSE
  ) |>
  formatRound("Total_VRM", digits = 0, mark = ",")

2. What transit mode had the most total VRM in this sample?

USAGE |>
  group_by(Mode) |>
  summarize(Total_VRM = sum(Vehicle_Miles, na.rm=TRUE)) |>
  arrange(desc(Total_VRM)) |>
  datatable(options = list(pageLength = 1, dom = 't'),  # Only display top row
          rownames = FALSE) |> 
  formatRound("Total_VRM", digits = 0, mark = ",")

3. How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?

total_trips <- USAGE |>
  filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2024-05-01") |>
  summarize(Total_Trips = sum(Passenger_Trips, na.rm = TRUE)) |>
  pull(Total_Trips)

message <- sprintf(
  "There were %s trips taken on the NYC Subway (Heavy Rail) in May 2024.",
  format(total_trips, big.mark = ",")
)

cat(message)

There were 180,458,819 trips taken on the NYC Subway (Heavy Rail) in May 2024.

5. How much did NYC subway ridership fall between April 2019 and April 2020?

To solve this, we will first find ridership for 2019 and 2020 separately. Then we will subtract to get change and get percentage.

# Filter and summarize data for April 2019
april_2019 <- USAGE |>
  filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2019-04-01") |>
  summarize(Total_Trips_2019 = sum(Passenger_Trips, na.rm = TRUE)) |>
  pull(Total_Trips_2019)

# Filter and summarize data for April 2020
april_2020 <- USAGE |>
  filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month == "2020-04-01") |>
  summarize(Total_Trips_2020 = sum(Passenger_Trips, na.rm = TRUE)) |>
  pull(Total_Trips_2020)

# Calculate the absolute difference and percentage drop
ridership_difference <- april_2019 - april_2020
percentage_drop <- (ridership_difference / april_2019) * 100

# Print the custom message with the result and percentage drop
message <- sprintf(
  "NYC subway ridership fell by %s trips between April 2019 and April 2020, which is a %.2f%% decrease.",
  format(ridership_difference, big.mark = ","), percentage_drop
)

cat(message)

NYC subway ridership fell by 211,969,660 trips between April 2019 and April 2020, which is a 91.28% decrease.

Task 4: Explore and Analyze

Find three more interesting transit facts in this data other than those above.

1. Top 5 Transit Agencies by Total Passenger Trips

USAGE |>
  group_by(Agency) |>
  summarize(Total_Trips = sum(Passenger_Trips, na.rm = TRUE)) |>
  arrange(desc(Total_Trips)) |>
  head(5) |>
  datatable(options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
  formatRound("Total_Trips", digits = 0, mark = ",")

2. Top 5 Transit Modes by Total Vehicle Miles

top_modes_vrm <- USAGE |>
  group_by(Mode) |>
  summarize(Total_VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
  arrange(desc(Total_VRM)) |>
  head(5)

datatable(top_modes_vrm, options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
  formatRound("Total_VRM", digits = 0, mark = ",")

3. Top 5 Agencies with Largest Decrease in Ridership between 2019 and 2020

ridership_decline <- USAGE |>
  group_by(Agency) |>
  summarize(
    Trips_2019 = sum(ifelse(month == "2019-04-01", Passenger_Trips, NA), na.rm = TRUE),
    Trips_2020 = sum(ifelse(month == "2020-04-01", Passenger_Trips, NA), na.rm = TRUE)
  ) |>
  mutate(Decline = Trips_2019 - Trips_2020) |>
  arrange(desc(Decline)) |>
  head(5)

datatable(ridership_decline, options = list(pageLength = 5, dom = "t"), rownames = FALSE) |>
  formatRound(c("Trips_2019", "Trips_2020", "Decline"), digits = 0, mark = ",")

Task 5: Table Summarization

Create a new table from USAGE that has annual total (sum) UPT and VRM for 2022. This will require use of the group_by, summarize, and filter functions. You will also want to use the year function, to extract a year from the month column.

The resulting table should have the following columns:

NTD ID
Agency
metro_area
Mode
UPT
VRM

Make sure to ungroup your table after creating it.

Name this table USAGE_2022_ANNUAL.

This will be done with following command.

# Create the USAGE_2022_ANNUAL table
USAGE_2022_ANNUAL <- USAGE |>
  # Extract the year from the month column
  mutate(year = year(month)) |>
  # Filter for the year 2022
  filter(year == 2022) |>
  # Group by the necessary columns
  group_by(`NTD ID`, Agency, metro_area, Mode) |>
  # Summarize total UPT and VRM for the year
  summarize(UPT = sum(Passenger_Trips, na.rm = TRUE), VRM = sum(Vehicle_Miles, na.rm = TRUE)) |>
  # Ungroup the table
  ungroup()

Let’s verify that all the columns are there.

cat(colnames(USAGE_2022_ANNUAL), sep = "\n")

NTD ID
Agency
metro_area
Mode
UPT
VRM

Now, Let’s join with FINANCIALS to create single table and create USAGE_AND_FINANCIALS. However, before joining, let’s view the table Financials:

FINANCIALS |>
  DT::datatable(options = list(pageLength = 5))

Here we notice that Mode is Acronym. However, USAGE_2022_ANNUAL doesn’t have acronyms. To make sure we can join both the table, let’s change FINANCIALS mode to match USAGE_2022_ANNUAL.

FINANCIALS <- FINANCIALS |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Motorbus",
    Mode == "SR" ~ "Streetcar Rail",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail/Automated Guideway",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramway",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

Now, let’s join to create USAGE_AND_FINANCIALS. We will join NTD ID and Mode as they are present in both the tables to create USAGE_AND_FINANCIALS.

 USAGE_AND_FINANCIALS <- left_join(
  USAGE_2022_ANNUAL,
  FINANCIALS,
  join_by(`NTD ID`, Mode)
) |>
  drop_na()

Let’s view few records to make sure we have them:

USAGE_AND_FINANCIALS |>
  DT::datatable(options = list(pageLength = 5))

Before we answer the questions, we will rename few columns to make them more readable:

USAGE_AND_FINANCIALS <- USAGE_AND_FINANCIALS |>
  rename(Passenger_Trips = UPT, Vehicle_Miles = VRM)

Task 6: Farebox Recovery Among Major Systems

Using the USAGE_AND_FINANCIALS table, answer the following questions:

1. Which transit system (agency and mode) had the most UPT in 2022?

 USAGE_AND_FINANCIALS |>
  select(Agency, Mode, Passenger_Trips) |>
  arrange(desc(Passenger_Trips)) |>
  datatable(
    options = list(pageLength = 1, dom = "t"), # Only display top row
    rownames = FALSE
  ) |>
  formatRound("Passenger_Trips", digits = 0, mark = ",")

2. Which transit system (agency and mode) had the highest farebox recovery (Total Fares to Expenses)?

    USAGE_AND_FINANCIALS |>
  mutate(Farebox_Recovery = `Total Fares` / Expenses) |>
  filter(!is.na(`Expenses`) &`Expenses`>0) |>
  arrange(desc(Farebox_Recovery)) |>
  select(Agency, Mode, Farebox_Recovery) |>
  datatable(
    options = list(pageLength = 1, dom = "t"), # Only display top row
    rownames = FALSE
  ) |>
  formatRound("Farebox_Recovery", mark = ",")

3 Which transit system (agency and mode) has the lowest expenses per UPT?

  USAGE_AND_FINANCIALS |>
  mutate(Expenses_per_UPT = Expenses / Passenger_Trips) |>
  arrange(Expenses_per_UPT) |>
  select(Agency, Mode, Expenses_per_UPT) |>
  datatable(
    options = list(pageLength = 1, dom = "t"), # Only display top row
    rownames = FALSE
  ) |>
  formatRound("Expenses_per_UPT", mark = ",")

4. Which transit system (agency and mode) has the highest total fares per UPT?

  USAGE_AND_FINANCIALS |>
   mutate(Fares_per_UPT = `Total Fares` / Passenger_Trips) |>
   arrange(desc(Fares_per_UPT)) |>
   select(Agency, Mode, Fares_per_UPT) |>
   datatable(
     options = list(pageLength = 1, dom = "t"), # Only display top row
     rownames = FALSE
   ) |>
   formatRound("Fares_per_UPT", mark = ",")

5. Which transit system (agency and mode) has the lowest expenses per VRM?

  USAGE_AND_FINANCIALS |>
   mutate(Expenses_per_VRM = Expenses / Vehicle_Miles) |>
   arrange(Expenses_per_VRM) |>
   select(Agency, Mode, Expenses_per_VRM) |>
   datatable(
     options = list(pageLength = 1, dom = "t"), # Only display top row
     rownames = FALSE
   ) |>
   formatRound("Expenses_per_VRM", mark = ",")

6. Which transit system (agency and mode) has the highest total fares per VRM?

  USAGE_AND_FINANCIALS |>
  mutate(Fares_per_VRM = `Total Fares` / Vehicle_Miles) |>
  arrange(desc(Fares_per_VRM)) |>
  select(Agency, Mode, Fares_per_VRM) |>
  datatable(
    options = list(pageLength = 1, dom = "t"), # Only display top row
    rownames = FALSE
  ) |>
  formatRound("Fares_per_VRM", mark = ",")

Conclusion

In my view, the Transit Authority of Central Kentucky’s Vanpool stands out as the most efficient transit system due to its farebox recovery ratio exceeding 100%, meaning it generates more fare revenue than its operating costs. This high level of financial self-sufficiency is uncommon in public transit and makes it highly efficient from a financial sustainability perspective.

Overall, this was an interesting assignment to understand basics DT operations using Transportation data. This analysis provided valuable insights into the financial and operational performance of different transit agencies and modes.

Key points from this analysis

MTA New York City Transit had the most passenger trips in 2022.
NYC Subway saw a 91.28% drop in ridership between April 2019 and April 2020 due to COVID-19.
Transit Authority of Central Kentucky’s Vanpool had the highest farebox recovery, covering costs effectively through fare revenue.
North Carolina State University’s Motorbus service had the lowest expenses per passenger trip.
The Motorbus mode, across all agencies, is the one that collects the most fare revenue for every mile the buses travel while carrying passengers.
The transit system with the highest total fares per Vehicle Revenue Mile is the Chicago Water Taxi in the Ferryboat mode.