In this project, we’ll take the perspective of a data analyst working for a major overseas airline company called GG Airlines that is contemplating an expansion into the U.S. domestic air market.
The company’s C.E.O wants to start small with an initial 5-plane investment before committing further resources, and has us asked us to identify key drivers of flight profitability and recommend a set of optimal routes given company objectives and constraints.
This is meant as a demonstration of my analytic thought process, not just a stand-alone final work product!
Objective
The goal of this analysis is to support GGA’s expansion by using 1Q2019 U.S. aviation data to recommend 5 potential round-trip routes based on volume, profitability, and any other relevant KPIs.
Background
Only direct dual-leg round-trip routes between medium or large U.S. airports were considered
Origin of Leg 1 = Destination of Leg 2 (CLT - RDU - CLT)
Each of the 5 planes will be dedicated to a single route (5 planes, 5 routes)
Punctuality is a core component of the GGA brand
Profitability Objective Function
Profitability was first calculated at the individual flight level and then aggregated across routes and carriers (discussed further later). At a high level, profitability can be broken down into revenue-generating fees minus operational and logistical costs:
Further detail regarding calculations and relevant assumptions can be found in the Appendix
Data Prep & Exploration
This project uses three real FAA data sets from the Q1 2019 related to flights, tickets, and airports. Our goal is to use these 3 data sets to create a final data frame organized with one row per round-trip route that contains all data needed to calculate profitability for each route.
Note: Each round-trip flight is uniquely identified by a combination of: carrier_code, plane_tail_number, carrier_flight_number, and flight_date.
We’ll start by loading all 3 preliminary data sets:
Code
# No Scientific Mode for numeric varsoptions(scipen =999)# Source Fnssource("Airline Functions.R")# List Data CSV Files in working directorylist.files(pattern =".csv") -> files2load# Iteratively load filelist and clean names while loadingmap(files2load, cleanreader) -> filelist# name dataframes to match corresponding filepath namenames(filelist) <-str_to_lower(str_remove_all(files2load, ".csv"))## Assign objectslist2env(filelist, .GlobalEnv)# Remove list to clear memoryrm(filelist)
The above code will import all CSV files currently residing in your working directory simultaneously.
Flights
We begin with the flights data set, which will form the foundation that we join our tickets and airports data sets to.
Data Prep
After loading our flights dataset, we perform some basic clean-up steps by parsing the distance and flight_date fields, removing duplicate records, and filtering out flights that are cancelled or whose origin and destination IATA codes are the same.
Code
# Replace fl_date, clean/parse distance from character to numeric variableflights |>mutate(fl_date =datefixer(fl_date),distance =str_replace_all(str_remove_all(distance, "\\.0"), "^-?0+", ""),distance =abs(parse_number(distance)) ) |># De-Dupedistinct() |># Format names for later join and clarityrename(origin_iata = origin,destin_iata = destination) |># Remove flights where origin = destination, flight is cancelled, or distance invalidfilter(origin_iata != destin_iata, cancelled ==0) -> flights
Missing Flights Data
After our initial wrangling steps, we need to check for missing data. We first select for any variables with > 0.01% missing data, and then visualize both the % of valid data remaining and the raw count of missing observations for each.
As Figure 1 shows below, only air-time and arrival_delays register more than 0.01% missingness, and both of those still have over 99% valid records.
TLDR: None of the flights variables initially appear to be missing an impactful number of data points
Code
flights |>missingval_viz()
Figure 1: Missingness patterns for variables missing values
While only air-time and arrival_delays register more than 0.01% missingness, the occupancy_rate, distance, or flight_number variables all had a tiny number of missing rows (< 0.01%).
As these variables are critical to profitability calculations, all records missing values on any of these variables are removed.
Code
# Filter out flights missing vars needed for profit aside from distanceflights |>filter(!is.na(op_carrier_fl_num),!is.na(arr_delay),!is.na(occupancy_rate),!is.na(distance)) -> flights
Airports
Our airports data set contains information on the airports each route departs/arrives in, which is needed to calculate operational & logistical costs.
Remember, our goal is to have a final format where each single row contains all necessary info on both the origin and destination legs of each route’s flights.
To facilitate joining our airports data to the flights data, we will transform the airports data to have 1 row for each possible combination of origin and destination airports.
Data Prep
The airports data has a single row and code for each airport, so we duplicate each airport’s code and label them as origin_iata and destination_iata codes.
This will allow us to create a cross-product grid of all possible origin and destination airport combinations, which can then be joined to all of the flights records such that origin and destination airport info is available for each flight.
Finally, we’ll also parse latitude and longitude from the coordinates field and remove any duplicate rows.
Code
airport_codes |># De-dupedistinct() |># Separate Lat/Longitude and convert to Numeric formseparate_wider_delim(cols = coordinates,delim =", ",names =c("latitude","longitude")) |>mutate(across(.cols =c("latitude", "longitude"),.fns =~as.numeric(.x))) |># Create origin and destin cols necessary for join with flights datamutate(origin_iata = iata_code,origin_type = type,destin_iata = iata_code,destin_type = type,origin_latitude = latitude,origin_longitude = longitude,destin_latitude = latitude,destin_longitude = longitude) -> airport_codes
Missing Airport Data
As before, we now need to identify any variables with at least 0.01% missing data and visualize both the % of valid data remaining and the raw number of missing records.
As Figure 2 shows, it appears that almost 85% of our airport data records are missing the iata_code values needed to join to our flights data.
We are also missing some iso_country values, but those records are all associated with airports on the African continent.
Code
airport_codes |>missingval_viz()
Figure 2: Initial missingness for airport_codes data
Lastly, after filtering for just U.S. medium and large airports, Figure 3 shows that 96% of the iata_code records are now valid.
Code
# Filter for just US medium/large airports airport_codes |>filter(iso_country =="US", type %in%c('medium_airport',"large_airport")) -> airport_codes_medlargeairport_codes_medlarge |>missingval_viz()
Figure 3: revised missingness for airport_codes data
While we’ve resolved the majority of our missing iata_code values, we still need to remove any remaining records with null values, as this field is needed as a join key.
Several IATA codes were shared by an open and a closed airport, but only one (Austin) was a med/large U.S. airport
Airports Data Cross-Product Self-Join
For our final wrangling step, we will use our new origin_iata_code and destination_code fields as keys for a self-join to create a full cross-product grid of all possible origin and destination pairs of medium & large U.S. airports.
Code
# Create grid of crossprods of origin and destination for flights data joinairport_codes_medlarge |># Cross Product Gridcrossprodder(complete = F, origin_iata, destin_iata) |># Inner Join back to self to pull in origin section metadatainner_join(airport_codes_medlarge |>select(origin_iata, origin_type, origin_latitude, origin_longitude),by ="origin_iata") |># Inner Join back to self to pull in destination section metadatainner_join(airport_codes_medlarge |>select(destin_iata, destin_type, destin_latitude, destin_longitude),by ="destin_iata") -> airport_codes_medlarge# Glimpseairport_codes_medlarge |>slice_sample(n =3) |>select(origin_iata, origin_type, destin_iata, destin_type) |>gt() |>tab_header(title =md("**Random Sample of Airport Grid for Validation**"))
Random Sample of Airport Grid for Validation
origin_iata
origin_type
destin_iata
destin_type
COS
large_airport
OSH
medium_airport
ADK
medium_airport
RWI
medium_airport
FKL
medium_airport
ANC
large_airport
Merge Flights & Airports Data
Now we we can join our flights data to our grid of origin/destination airport pairs to pull in metadata about the airports associated with each flight’s origin and destination.
We can do a quick comparison to see which data fields are unique and which are common to both data sets.
In this case, the origin_iata_code and destination_iata_code fields will serve as our primary keys for the join:
Code
# Cross-check common and unique columns before joincompare_df_cols(airport_codes, flights)
Currently our merged flights and airports data is structured with 1 row per flight, but our final format needs to have 1 row per round-trip route, wherein each row reflects both route legs.
We can accomplish this by self-joining our intermediate data set to itself such that the “left side/first copy” represents the route’s Leg #1, and the “right side/second copy” represents Leg #2.
Lastly, as our analysis is constrained to only round-trip routes, we now need to isolate only those routes where the destination of leg 1 is the origin of leg 2, and the destination of leg 2 is the origin of leg 1
If this is hard to follow, my apologies! Please check out the sample below to get a visual example of what we’re trying to accomplish
Code
# Round Trip identified by: origin 1 = dest 2, dest 2 = origin 1flights_medlarge |># Self Join where "Right" side = Leg 1, "Left" side = Leg 2inner_join(flights_medlarge, by =c("op_carrier", "op_carrier_fl_num", "tail_num","fl_date"),relationship ="many-to-many",suffix =c("_leg1","_leg2")) |>filter(# Filter for round trip: origin 1 = dest 2, dest 2 = origin 1 origin_iata_leg1 == destin_iata_leg2, destin_iata_leg1 == origin_iata_leg2 ) |>arrange(op_carrier, fl_date, op_carrier_fl_num,fl_date) -> flights_medlarge_roundtrip
Crowflies Route Distance & Route Clean-Up
Before proceeding to our tickets data, we have a few final clean-up steps:
Create a single-field round-trip route identifier (e.g. CLT-RDU)
Filter out the now-redundant secondary rows for each round-trip flight
(all necessary info for both legs is now in a single row).
We will also create a custom distance field that calculates the point-to-point or crow-flies distance for each flight/route.
We will use this field for validity checks of our pre-existing distance field, as well as for missing records.
Code
flights_medlarge_roundtrip |># round all numeric cols to 2 digits for claritymutate(across(where(is.numeric),~round(.x, 2))) |># Create single field identifying the route and calculate crow-fly distancemutate(route_iata =paste(origin_iata_leg1, destin_iata_leg1, sep ="-"),route_city_name =paste(origin_city_name_leg1, dest_city_name_leg1, sep ="-"),# Crow-Fly distance for each leg and then total crow flydistance_crowfly_leg1 =crowfly_distance_miles(lat1 = origin_latitude_leg1,lon1 = origin_longitude_leg1,lat2 = destin_latitude_leg1,lon2 = destin_longitude_leg1),distance_crowfly_leg2 =crowfly_distance_miles(lat1 = origin_latitude_leg2,lon1 = origin_longitude_leg2,lat2 = destin_latitude_leg2,lon2 = destin_longitude_leg2),distance_crowfly_total = distance_crowfly_leg1 + distance_crowfly_leg2) |># Round to 2 digits for numeric colsmutate(across(.cols =where(is.numeric),.fns =~round(.x, 2))) |># De-Dupedistinct() |># Filter out redundant secondary rows for each flight# we have 1 row per round-trip flight with all info for both legs on 1 rowfilter(row_number() ==1,.by =c(fl_date, op_carrier, tail_num, op_carrier_fl_num)) -> flights_medlarge_roundtrip # Random sample a few rows to validate processflights_medlarge_roundtrip |>slice_sample(n =3) |>select(origin_iata_leg1, destin_iata_leg1, distance_leg1, origin_iata_leg2, destin_iata_leg2, distance_leg2) |>gt() |>tab_header(title =md("**Random Sample of Round-Trips to Validate**"))
Random Sample of Round-Trips to Validate
origin_iata_leg1
destin_iata_leg1
distance_leg1
origin_iata_leg2
destin_iata_leg2
distance_leg2
BDL
DCA
313
DCA
BDL
313
CPR
SLC
320
SLC
CPR
320
LAX
SAN
109
SAN
LAX
109
We can now use our new distance_crowfly metrics to validate the original distance values.
As an example, we can see below in Figure 4 that for the Charlotte-Raleigh route, the 2 values aren’t identical but are close enough to be confident in the original distance values.
Code
# Display improvement for Charlotte to RDUflights_medlarge_roundtrip |>filter(route_iata =="CLT-RDU") |>distinct(route_iata, distance_leg1, distance_crowfly_leg1) |>gt() |>tab_header(title =md("**Original and Custom Distance Metrics**"),subtitle =md("*Charlotte-Raleigh route*"))
Original and Custom Distance Metrics
Charlotte-Raleigh route
route_iata
distance_leg1
distance_crowfly_leg1
CLT-RDU
130
148.78
Figure 4: Distance Values Inaccurately Low
While crowfly_distance has relatively inaccurate raw values, it correctly rank orders routes by distance
Tickets
The final tickets data contains a sample of tickets that can be joined to our flights/airports data to get ticket_fare information for the profitability calculations.
Tickets Data Clean-Up
We start by filtering for ticket info corresponding to round-trip flights only, parsing the ticket_fare values, and creating an origin/destination structure that matches our airports/flights data.
Code
# De-Dupe and filter for round triptickets |># De-Dupedistinct() |># Keep only round-tripfilter(roundtrip ==1) |># Parse ticket faremutate(itin_fare =parse_number(itin_fare)) |># Rename fields to be consistent with flights/airports datarename(origin_iata_leg1 = origin, destin_iata_leg1 = destination,op_carrier = reporting_carrier) |>mutate(origin_iata_leg2 = destin_iata_leg1,destin_iata_leg2 = origin_iata_leg1,route_iata =paste(origin_iata_leg1, destin_iata_leg1, sep ="-")) |># Keep only necessary fieldsselect(itin_id, op_carrier, route_iata,origin_iata_leg1, destin_iata_leg1, origin_iata_leg2, destin_iata_leg2, itin_fare) -> ticketsround
Missing Tickets Data
As before, we now need to identify any variables with at least 0.01% missing data and visualize both the % of valid data remaining and the raw number of missing records.
As seen below in Figure 5, only the ticket_fare field has at least 0.01% missingness, and even then only a tiny proportion of the ticket_fare values are missing.
Code
# Missing tickets values for round trip routesticketsround |>missingval_viz()
Figure 5: Missing data for round-trip tickets
Those null records are removed, as ticket_fare is needed for profitability calculations.
Next we need to aggregate our tickets data so that we have 1 row with a single estimated ticket_fare per route.
For simplicity, this analysis assumes all ticket fares are created equal, with no distinction between passenger class, age, or other factors.
We could calculate an average_ticket_fare for each route, but a review of the ticket_fare distribution in Figure 6 shows significant price variability amongst carriers, along with several major outliers
Code
# Significant inter-carrier fare variability and outliersticketsround |>doublebox_viz(op_carrier, itin_fare) +labs(x ="Carrier", y ="Ticket Fare Distribution",title ="Significant Fare Variability Amongst Carriers",subtitle ="Highly skewed distribution with major outliers") +scale_y_continuous(limits =c(0, 40000),breaks =seq(0, 40000, 5000),labels = \(x) paste0("$",x))
Figure 6: Boxplots of fare prices by carrier
Given the carrier variability, skewness, and outliers, a group-wise median_ticket_fare was calculated for each route and carrier combination.
Code
# Trimmed Avg and Median fares calculated, only median will be usedticketsround |>summarize(avg_fare =roundmean(x = itin_fare, trim =0.1),med_fare =median(itin_fare),.by =c(route_iata, op_carrier)) -> ticketsround_summary# Previewticketsround_summary |>filter(route_iata=="CLT-RDU") |>head() |>gt() |>tab_header(title =md("**Validation Preview of Fare Summary Metrics**"))
Validation Preview of Fare Summary Metrics
route_iata
op_carrier
avg_fare
med_fare
CLT-RDU
AA
328.6
382
CLT-RDU
OH
1159.0
1159
Merge Tickets to Flights & Airports
For our final formatting step, we need to join our tickets data to our flights/airports data.
We use a left-join was in case not all routes and carriers in the flights/airports data are present in the tickets data.
In cases where ticket_fare values were in fact missing for a specific carrier, the route-level median was imputed instead. If route-level fares were also unavailable, the median_imputed_fare was left NA.
Code
# Comparison of Common Cols to be used as primary keys for joincompare_df_cols(ticketsround_summary, flights_medlarge_roundtrip)$common_cols -> primarykeys_tickets_airports_flights# Join med large round trip routes to tickets dataflights_medlarge_roundtrip |>left_join(ticketsround_summary,by =c(primarykeys_tickets_airports_flights)) |>## Impute Mean/Median of route (across all carriers) where possiblemutate(avg_fare_imputed =if_else(is.na(avg_fare),roundmean(avg_fare, na.rm = T), avg_fare),med_fare_imputed =if_else(is.na(med_fare),median(med_fare, na.rm = T), med_fare),.by =c(route_iata)) -> flights_medlarge_roundtrip
Fortunately, a quick validation check shows that none of the routes that still have missing ticket_fare values are in the top 50 busiest routes
lack of output indicates there were no top 50 routes that matched the NAticket_fare criteria
As ticket_fares are needed for profitability, and the removal impact is likely to be minimal, the remaining records with NA values were removed.
Code
# Top 50 Routesflights_medlarge_roundtrip |>count(origin_iata_leg1, destin_iata_leg1,route_iata, sort = T) |>slice_head(n =50) -> top50routes# Check if any flights missing ticket fares are in top 50 busiest (NO)flights_medlarge_roundtrip |>filter(is.na(med_fare_imputed)) |>inner_join(top50routes)# Remove remaining NAflights_medlarge_roundtrip |>filter(!is.na(med_fare_imputed)) -> flights_medlarge_roundtrip
Profitability Calculations
The last step is to calculate the cost, revenue, and profit for each flight, which we can then aggregate up to the route and carrier level. T
To facilitate data exploration and validation, profitability metrics are calculated for both the individual legs and the overall round-trip flight.
Revenue
Revenue is comprised of ticket_fare and baggage_fee components, both of which are impacted by passenger_counts.
To derive passenger_counts, the occupancy_rates are multiplied by an assumed max capacity of 200 passengers. These passenger_countsin turn serve as multipliers for the median_ticket_fare and baggage_fee calculations.
Code
# TICKET Fees = Median Fare * Estimated Passenger Countflights_medlarge_roundtrip |>mutate(n_passengers_leg1 = occupancy_rate_leg1*200,n_passengers_leg2 = occupancy_rate_leg2*200,n_passengers_total = n_passengers_leg1 + n_passengers_leg1,avg_occ_rate_bothlegs =round(n_passengers_total/400, 2)) |>mutate(est_fare_revenue_total = n_passengers_total*med_fare_imputed) |># Baggage Fees = $35 * Total Number Passengers * 0.5 bags per passengermutate(n_bags_leg1 = n_passengers_leg1*0.5,n_bags_leg2 = n_passengers_leg2*0.5,n_bags_total = n_passengers_leg1 + n_passengers_leg2,est_bag_revenue_total = n_bags_total*35) |># Total Revenue Calc = Bag Rev + Fare Revmutate(est_all_revenue_total = est_bag_revenue_total + est_fare_revenue_total) -> flights_medlarge_roundtrip
Costs
Costs are comprised of mileage_variable, airport_size, and delay_time components.
mileage_variable costs are determined by an assumed $9.18 fee per distance mile traveled.
airport_size costs are based on a $5k/$10k charge for medium/large airports.
delay_time costs are based on a $75/min charge applied to each arrival and departure for each minute of delay_time past a 15-minute grace period.
Code
# Mileage-variable costsflights_medlarge_roundtrip |>mutate(mileage_var_costs_leg1 =9.18*distance_leg1,mileage_var_costs_leg2 =9.18*distance_leg2,mileage_var_costs_total = mileage_var_costs_leg1 + mileage_var_costs_leg2) |># Airport Cost = if med then 5k, if large then 10kmutate(destin_airport_cost_leg1 =if_else(destin_type_leg1 =="large_airport",true =10000,false =5000),destin_airport_cost_leg2 =if_else(destin_type_leg2 =="large_airport",true =10000,false =5000),airport_landing_cost_total = destin_airport_cost_leg1 + destin_airport_cost_leg2) |># Departure/Arrival Delay > 15min = $75/min# Subtract 15min grace period from all 4 flight legsmutate(across(.cols =contains("delay"),.fns =~.x-15,.names ="{.col}_adj")) |># Zero trunc adjusted delay time for all 4 flight legsmutate(across(.cols =contains("_adj"),.fns =~zero_trunc(.x))) |># Total adjusted delay time summed across all 4 flight legsmutate(arr_dep_delay_time_total = (dep_delay_leg1_adj + arr_delay_leg1_adj + dep_delay_leg2_adj + arr_delay_leg2_adj)) |># Delay costs based on adjusted delay times for all 4 flight legsmutate(dep_delay_cost_leg1 = dep_delay_leg1_adj*75,arr_delay_cost_leg1 = arr_delay_leg1_adj*75, dep_delay_cost_leg2 = dep_delay_leg2_adj*75, arr_delay_cost_leg2 = arr_delay_leg2_adj*75) |># Total delay cost based on summing across delay costs for 4 legsmutate(arr_dep_delay_cost_total = (dep_delay_cost_leg1 + dep_delay_cost_leg2 + arr_delay_cost_leg1 + arr_delay_cost_leg2)) |># Total Costs = milage variable + airport + delaymutate(est_all_costs_total = mileage_var_costs_total + airport_landing_cost_total + arr_dep_delay_cost_total) -> flights_medlarge_roundtrip
Profit
Profit is calculated for each individual round-trip flight by subtracting the flight-level total_costs from the total_revenue.
Code
# Total Profit = Total Revenue - Total Costflights_medlarge_roundtrip |>mutate(est_profit_total = est_all_revenue_total - est_all_costs_total) -> flights_medlarge_roundtrip# Validateflights_medlarge_roundtrip |>slice_sample(n =1) |>select(route_iata,op_carrier,tail_num, est_profit_total,est_all_revenue_total,est_all_costs_total) |>gt() |>tab_header(title =md("**Validation Sample of Flight-Level Profit Metrics**"))
Validation Sample of Flight-Level Profit Metrics
route_iata
op_carrier
tail_num
est_profit_total
est_all_revenue_total
est_all_costs_total
ATL-BHM
DL
N903DE
112457.8
134918
22460.24
Findings
We can now use our final dataset to evaluate each route and/or route/carrier combo using a variety of KPIs.
Top 10 Busiest Routes
At the route-level, we can see in Figure 7 that the LA-SanFran (LAX-SFO) route was the busiest overall route in 1Q2019, followed closely by a number of routes based out of Atlanta.
Code
## Top 10 Flights Bar Chart### Roundflights_medlarge_roundtrip |>singlebar_viz(x = route_iata, nrecs =10) +labs(x ="Route",title ="Top 10 Busiest Routes & Flights Recorded in 1Q2019")
Figure 7: Routes with the greatest number of flights
Further, when just the originating airport is considered, Figure 8 shows that the overwhelming majority of all round-trip routes originated out of Atlanta (ATL), with Dallas-Fort Worth (DFW) and Charlotte (CLT) coming in 2nd and 3rd.
We also see clear carrier-origin_hub patterns, with Delta dominating Atlanta and American Airlines and its subsidiaries covering Charlotte and Dallas-Fort Worth.
Code
flights_medlarge_roundtrip |>doublebar_viz(x = origin_iata_leg1,grp1 = op_carrier, nrecs =10, labels = F) +labs(x ="Originating Hub", y ="Flights", fill ="Carrier",title ="Delta's Atlanta and AA's Charlotte/Dallas Hubs")
Figure 8: Originating airports with greatest number of flights
Accordingly, route/carrier combinations were probed to see which routes operated by specific carriers were the busiest.
As shown below in Figure 9, Delta (DL) had the greatest number of 1Q2019 flights by far, the majority of which originated out of its main hub in Atlanta (ATL).
Interestingly, there were no individual route/carrier combinations that appeared overwhelmingly busier.
Code
flights_medlarge_roundtrip |>doublebar_viz(x = op_carrier,grp1 = route_iata, nrecs =10, labels = F) +labs(x ="Carrier", y ="Flights", fill ="Route",title ="Delta's Atlanta-based Round-Trip Routes were Busiest by Far")
Figure 9: Route-Carrier combos with greatest number of flights
Top 10 Profitable Routes
In order to probe high-level profitability, estimates for each individual round-trip flight were aggregated across routes and route/carrier combinations.
These route and route/carrier-level profit estimates were then rank-ordered to derive the top 10 results.
The following profitability estimates represent two levels of summary data, as first the group-wise median fares were imputed for flight-level estimates tjat were then averaged again to create route-level profitability
Routes
As seen in Figure 10, while most of the top 10 routes generate an average profit of between $150,000 to $250,000, the Vail, CO - New York route (EGE - JFK) is a significant outlier that generates an average profit of over $500,000.
What is unique about the outlier EGE-JFK (Vail, CO - New York) route?
Is there significant variation across carriers within specific routes?
e.g. Is a route-level estimate reliable?
Vail, CO - New York
When comparing the EGE-JFK route to the other Top 10 profitable routes, Table 1 below shows us that:
Vail-NY is a much lower-volume route, as while the passenger_counts are roughly equivalent, the total number of 1Q2019 recorded_flights is only 20% of the average flight count for the other top 10 routes
Vail-NY flights cost about $25,000 more per flight on average, but generate $350,000 more in average revenue than the other Top 10 routes!
Code
# Create flag separating JFK-Vail route from other top 10 and get statstop10routes_profit_summary |>mutate(route_jfk_ege =if_else(route_iata =="EGE-JFK","Vail - NYC","Other")) |>summarize(across(where(is.numeric),~roundmean(.x, digits =0)),.by = route_jfk_ege) |>select(route_jfk_ege, n_flights, avg_profit, avg_n_passengers,avg_cost_total, avg_revenue_total) |>gt() |>tab_header(title =md("**Vail - NY has Greater Revenue than other Top 10**"))
Table 1: Vail - NY route vs other Top 10 routes
Vail - NY has Greater Revenue than other Top 10
route_jfk_ege
n_flights
avg_profit
avg_n_passengers
avg_cost_total
avg_revenue_total
Vail - NYC
54
512848
249
51595
564443
Other
243
185361
258
27279
212640
Further, as seen below, the main driver in this revenue gap is the ticket_fares charged, with the Vail-NY route charging over 3x the other top 10 routes per passenger on average.
Code
#| label: tbl-vail_revenue_diffs# Create flag separating JFK-Vail route and get revenue infotop10routes_profit_summary |>mutate(route_jfk_ege =if_else(route_iata =="EGE-JFK","Vail - NYC","Other")) |>summarize(across(.cols =where(is.numeric),.fns =~roundmean(.x, digits =2)),.by = route_jfk_ege) |>select(route_jfk_ege, avg_occupancy_rate, avg_bag_revenue, avg_fare_revenue,avg_median_fare) |>gt() |>tab_header(title =md("**Vail - NY has 3x Higher Ticket Fares than other Top 10**"))
Vail - NY has 3x Higher Ticket Fares than other Top 10
route_jfk_ege
avg_occupancy_rate
avg_bag_revenue
avg_fare_revenue
avg_median_fare
Vail - NYC
0.62
9094
555350
2228.00
Other
0.65
9046
203594
788.44
Carrier Variability
Having determined that fare prices can be a significant driver of profit, particularly for low-volume routes, we now turn to Question 2 to probe possible profitability variations across carriers for these top 10 overall routes.
As seen in Figure 11, the following top 10 overall routes do have significant differences across carriers:
Cincinatti-Detroit (CVG-DTW)
Charlotte-Wilmington (CLT-ILM)
Atlanta-Columbus, MS(ATL-GTR)
Code
# Filter for those top 10 profitable routes and examine inter-carrier variationflights_medlarge_roundtrip |>filter(route_iata %in% top10routes_profit) |>ggplot(aes(x = route_iata,fill = op_carrier,y = est_profit_total)) +geom_boxplot() +theme_minimal() +coord_flip() +theme(plot.title =element_text(face ="bold", size =14),plot.subtitle =element_text(face ="italic"),plot.caption =element_text(color ="firebrick"),axis.title =element_text(face ="bold")) +labs(y ="Profit ($) per Round Trip Flight",fill ="Carrier",x ="Route",title ="Top 10 Most Profitable Routes Overall",subtitle ="Variation in Profitability by Carrier",caption ="Only routes w/20+ flights recorded included in profitability")
Figure 11: Carrier variability for top 10 profitable routes
This is significant because it suggests that:
Not all carriers experience the same profitability on these Top 10 routes
There may be other routes that provide even greater profitability for specific carriers
Routes & Carriers
Given the inter-carrier variability in profit that was observed, profitability statistics were next calculated for specific route/carrier combos.
Figure 12: Top 10 routes & carriers ordered by average profit
As seen above in Figure 11, there are a number of routes operated by specific carriers that did not appear in the overall T10 profit route list due to some carriers performing poorly on those routes
CLT-RDU, DFW-HSV, DTW-ROC, MSP-OMA
This can be seen in further detail by looking at the inter-carrier variation for these additional routes in Figure 13, where we see major differences in profit across carriers.
Takeaway: Certain routes may in fact have the potential to be highly profitable but do not appear so at first due to lower-performing carriers
Code
# IATA route codes for the new top routessetdiff(top10routes_carriers_profit, top10routes_profit) -> discreps# Viz - Filter for newer top routes and show carrier variabilityflights_medlarge_roundtrip |>filter(route_iata %in% discreps) |>ggplot(aes(x = route_iata,fill = op_carrier,y = est_profit_total)) +geom_boxplot() +theme_minimal() +coord_flip() +theme(plot.title =element_text(face ="bold", size =14),plot.subtitle =element_text(face ="italic"),plot.caption =element_text(color ="firebrick"),axis.title =element_text(face ="bold")) +labs(y ="Profit ($) per Round Trip Flight",fill ="Carrier",x ="Route",title ="Variation in Profitability by Carrier",subtitle ="Routes that are highly profitable for some carriers but not others",caption ="Only routes w/20+ flights recorded included in profitability")
Figure 13: Top routes with major profit discrepancies across carriers
Charlotte-Raleigh
Of particular note is the Charlotte-Raleigh(CLT-RDU) route, where the regional carrier OH (a subsidiary of American Airlines AA) generates a massively greater average profit than AA. As with the Vail-NY route, this difference is driven by higher fare_prices.
Code
flights_medlarge_roundtrip |>filter(route_iata =="CLT-RDU") |>summarize(avg_profit =roundmean(est_profit_total),avg_revenue =roundmean(est_all_revenue_total),avg_occupany =roundmean(avg_occ_rate_bothlegs),avg_fare =roundmean(med_fare_imputed),.by = op_carrier) |>gt() |>tab_header(title =md("**Revenue and Fare Differences for CLT-RDU Route**"),subtitle =md("*Regional subsidiary generates higher revenue due to 3.5x higher ticket prices*"))
Table 2: Charlotte-Raleigh differences in fares by carrier
Revenue and Fare Differences for CLT-RDU Route
Regional subsidiary generates higher revenue due to 3.5x higher ticket prices
op_carrier
avg_profit
avg_revenue
avg_occupany
avg_fare
AA
85837.54
109192.8
0.65
382
OH
308462.58
331149.4
0.69
1159
Profit, Revenue, Cost Interaction
Lastly, the distribution and trade-off relationships between profit and its constituent components is directly visualized below, along with callouts for the top 10 most-profitable.
Code
#| label: fig-top10profit_tradeoff#| fig-cap: Profit as a function of cost and revenue with top 10 highlighted# Base plot of cost, revenue, profitflights_medlarge_roundtrip |>flightdicer(grp1 = route_iata,grp2 = op_carrier,min_flights =10) |>arrange(desc(avg_profit)) |>mutate(rank_profit =row_number()) |>#slice_max(avg_profit, n = 10) |> mutate(across(.cols = (where(is.numeric) &!contains("occ")),.fns =~round(.x))) |>ggplot(aes(x = avg_cost_total,y = avg_revenue_total,color = avg_profit)) +geom_point() +theme_minimal() +scale_color_viridis_c(option ="inferno", direction =-1) +scale_y_continuous(limits =c(0, 600000),breaks =seq(0, 600000, 100000),labels = \(x) paste0("$",x)) +scale_x_continuous(limits =c(0, 100000),breaks =seq(0, 100000, 10000),labels = \(x) paste0("$",x)) +labs(x ="Avg Total Cost",y ="Avg Total Revenue",color ="Avg Profit",title ="Total Revenue, Cost, and Profit for all Routes/Carriers",subtitle ="Top 10 Profitable Routes/Carriers are Highlighted") -> p1# Identify Top 10 profitable route/carrier combos to create labelsflights_medlarge_roundtrip |>flightdicer(grp1 = route_iata, grp2 = op_carrier,min_flights =10) |>arrange(desc(avg_profit)) |>mutate(route_carrier =paste(route_iata, op_carrier, sep ="_via_")) |>slice_max(avg_profit, n =10) |>select(route_carrier, avg_cost_total, avg_revenue_total, avg_profit) -> top10labels# Add labels to main plot to call out top 10p1 +geom_label_repel(data = top10labels,aes(x = avg_cost_total, y = avg_revenue_total, label = route_carrier),box.padding =0.5,point.padding =1,force =1,max.overlaps =100,segment.color ="black",segment.size =0.5,color ="black",size =3)
Finally, the summary statistics for the top 10 most profitable routes/carriers are displayed below
Table 3: Summary stats for top 10 profitable routes by specific carriers
Summary Statistics for Top 10 Profitable Routes & Carriers
rank_profit
route_iata
op_carrier
avg_profit
avg_revenue_total
avg_cost_total
1
EGE-JFK
AA
512847.7
564443.2
51595.45
2
CLT-RDU
OH
308462.6
331149.4
22686.80
3
CLT-ILM
AA
227219.8
246270.7
19050.87
4
CVG-DTW
OO
226431.3
255727.7
29296.35
5
DFW-HSV
MQ
212906.1
244750.0
31843.91
6
DTW-ROC
G7
211990.7
243482.8
31492.12
7
JAC-MSP
DL
207088.6
240298.5
33209.92
8
CLT-MYR
OH
205724.0
224609.8
18885.79
9
CLT-MYR
AA
203170.4
225288.7
22118.29
10
MSP-OMA
9E
199564.4
228832.3
29267.90
Drivers & KPIs
Based on the above profitability analyses, we have some general ideas about what drives success:
Smaller regional airlines and/or low-volume routes can be highly successful by leveraging ticket_fare prices
Routes alone aren’t a guarantee of success due to variability across carriers
Profitability Heatmap
For the final analysis, correlation heatmaps were used upon the full set of routes/carriers to uncover the greatest drivers of profitability. Figure 14 shows that:
The biggest positive driver of average_profit is the average_ticket_fare charged, followed to a much lesser extent by passenger and baggage counts.
The average_ticket_fare has a moderate positive correlation with distance and a mild negative correlation with delay_time,
Longer flights tend to command higher fares, while longer delay times command reduced fares
The average_profit of a route is also negatively impacted slightly by airport_costs and distance
Even though greater distances yield higher fares, it appears insufficient to offset the profit decrease resulting from increased turn-around time, lower volume and increased delay potential
Based on the heatmap results, the following is concluded about profitability drivers:
ticket_fares are by far the best lever for raising profitability
Shorter distance flights can help boost profit by decreasing turn-around time and delays, while optimizing the number of possible flights
Maximizing occupancy_rates boosts profit by raising total revenue from ticket_fare and baggage_fees, and is especially critical for carriers with only a small number of planes.
Final Recommendation
Based on the findings detailed above, the limitations imposed by only having 5 planes, and the brand commitment to punctuality, the following approach is suggested:
Identify the top 25 most-profitable specific route/carrier combinations
Use quantile ranks to filter these 25 flights for those with higher ticket_fares, shorter distances, lower delay_times, and greater occupancy_rates
Focusing on reduced delay_times and distance also supports the key brand component of punctuality
Emulate the approach of highly-profitable low-volume regional airlines
Top 5 Recommended Routes
In following this approach, we land upon the final 5 routes:
Code
# Summary Stats for flights ordered by average profitflights_medlarge_roundtrip |>flightdicer(grp1 = route_iata, grp2 = op_carrier,min_flights =10) |>arrange(desc(avg_profit)) |># Filter for top 25 profit levelsslice_max(avg_profit, n =25) |># Filter for top quantile fare and occupancy rate, low tile distance and delayfilter(avg_median_fare_decile ==20, avg_occupancy_rate_decile >=15, avg_distance_decile <=5, avg_delay_time_decile <=5) |>select(route_iata, op_carrier, avg_profit) -> finaltop5# Pull in full route name for clarityfinaltop5 |>left_join(flights_medlarge_roundtrip |>distinct(route_iata, route_city_name),by =c("route_iata")) |>gt() |>tab_header(title =md("**Final Top 5 Routes**"))
Table 4: Final top 5 routes based on profit, delay, distance, occupancy, and fares
Final Top 5 Routes
route_iata
op_carrier
avg_profit
route_city_name
CLT-RDU
OH
308462.6
Charlotte, NC-Raleigh/Durham, NC
CLT-ILM
AA
227219.8
Charlotte, NC-Wilmington, NC
DTW-IND
OO
190335.6
Detroit, MI-Indianapolis, IN
CLE-DTW
9E
180941.8
Cleveland, OH-Detroit, MI
ATL-LEX
DL
170716.1
Atlanta, GA-Lexington, KY
Quantiles used for selecting routes range from 1 to 20 on each metric, where the 20th quantile corresponds to the top 5% raw values of each metric, and the 1st quintile equates to the bottom 5% raw values
ROI to Break Even
Using the average profitability estimates for our final top 5 flights, we can calculate the estimated number of flights required for each to recoup the initial $90m investment per plane.
As shown below, each of the 5 routes will require between 300 - 500 flights to break even.
Code
finaltop5 |>mutate(n_flights_breakeven =round(90000000/avg_profit)) |>gt() |>tab_header(title =md("**Number of Flights Required to Recoup Plane Investment**"))
Number of Flights Required to Recoup Plane Investment
route_iata
op_carrier
avg_profit
n_flights_breakeven
CLT-RDU
OH
308462.6
292
CLT-ILM
AA
227219.8
396
DTW-IND
OO
190335.6
473
CLE-DTW
9E
180941.8
497
ATL-LEX
DL
170716.1
527
KPI Monitoring
The following KPIs should be monitored to track progress going forward:
All metrics involved in profitability calculations
Proportion of on-time flights and resulting customer satisfaction (via survey data)
Number of round-trip flights that a single plane can take in a single flight day
Monitor demand and occupancy_rate to gauge if customer appetite is there for 2+ flights a day on the shorter-distance routes
Additional Considerations & Limitations
The following considerations should also be taken into account:
All analyses are based only on 1Q2019 data, and the world has certainly changed since then
Seasonality is not taken into account, and that is arguably one of the biggest factors in airline profit
Especially critical given ticket_fares are the biggest lever in profitability
Profitability estimates don’t take into account the infrastructure and efficiency advantage that some of the major carriers have, particularly at their hubs (e.g. Delta’s ATL hub)
Unrealistic to truly assume static universal operational costs
Hub-based model is critical to modern airline efficiency
Appendix
Cost & Revenue Assumptions
Calculations included the following assumptions:
Costs
Mileage-variable costs totaling $9.18/mile
$8.00 for fuel, oil, maintenance, crew
$1.18 for depreciation, insurance, other
Plane make/model variations not considered
Airport operational costs based on landing airport size assessed once per flight leg
$5000 for medium-size
$10000 for large-size
Delay costs can be assessed a total of 4 times as there are 2 flight legs that each have a departure and an arrival
Each departure and arrival given initial 15 minute grace period
$75 charged for every minute of delays past the grace period
Delays calculated separately for each departure and arrival, such that early arrivals or departures do not detract from earlier delay costs
Plane cost fixed at $90 million per plane
Plane cost was not considered in profitability, but was considered when calculating expected break-even points
Revenue
Max passenger capacity of 200 passengers for all 5 planes
Plane make/model variability not considered
Number of passengers per flight leg calculated based on occupancy rates
Baggage fees of $35 per bag assessed based on an expectation of 1 bag checked for 50% of passengers per flight leg
Baggage fees were first calculated per flight leg based on the expected number of passengers and then summed across both legs
Ticket prices are assessed using an estimated median fixed fee per route
fares estimated from incomplete representative sample