Final project: Airline Passenger Satisfaction Classification
2023-05-09
Chapter 1 Proposal
1.1 The data:
Airline Passenger Satisfaction, from Kaggle: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction
Variables:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level with the convenience of the departure/arrival time
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Delay at departure, in minutes
Arrival Delay in Minutes: Delay at arrival, in minutes
Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)
1.2 Modeling goal:
This dataset contains an airline passenger satisfaction survey. We want to know which factors are highly correlated with a satisfied (or dissatisfied) passenger, and to model and predict passenger satisfaction.
1.3 The models we intend to use:
Random forest (white box), LightGBM (black box), KNN (black box), logistic regression (white box)
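To make the modeling goal concrete, here is a minimal sketch of the logistic regression baseline using base R's glm. The data are simulated, and the object names (toy, fit, pred, accuracy), the two predictors chosen, and the 0.5 cutoff are all assumptions for illustration; this is not the model fitted in this project.

```r
# Simulated stand-in for the airline data (hypothetical values).
set.seed(7)
n <- 400
toy <- data.frame(
  Online.boarding = sample(1:5, n, replace = TRUE),
  Flight.Distance = round(runif(n, 100, 4000))
)

# Simulated outcome: higher boarding scores make "satisfied" more likely.
p <- plogis(-3 + 1.0 * toy$Online.boarding)
toy$satisfaction <- factor(
  ifelse(runif(n) < p, "satisfied", "neutral or dissatisfied"),
  levels = c("neutral or dissatisfied", "satisfied")
)

# With a factor response, glm treats the first level as "failure",
# so the model estimates P(satisfied).
fit <- glm(satisfaction ~ Online.boarding + Flight.Distance,
           data = toy, family = binomial)

# Classify with an assumed 0.5 probability cutoff and score in-sample.
pred <- ifelse(predict(fit, type = "response") > 0.5,
               "satisfied", "neutral or dissatisfied")
accuracy <- mean(pred == toy$satisfaction)
```

The fitted coefficients from summary(fit) can be read directly as changes in the log-odds of satisfaction, which is the sense in which logistic regression is listed as a white-box model here.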
From the plot_missing function, we can see that there are some missing values in the arrival delay variable. We will use the mice package to deal with them.
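As a rough sketch of this imputation step, the snippet below runs mice on a small made-up data frame. The toy data, the seed values, and the object names (toy, imp, toy_complete) are assumptions for illustration and not the code used in this project; only the column names and the use of mice follow the report. Leaving the method argument unset lets mice pick its default for numeric columns, predictive mean matching ("pmm").

```r
library(mice)

# Toy stand-in for the airline data (hypothetical values).
set.seed(1)
n <- 200
toy <- data.frame(
  Departure.Delay.in.Minutes = rpois(n, 15),
  Flight.Distance = round(runif(n, 100, 4000))
)
toy$Arrival.Delay.in.Minutes <- toy$Departure.Delay.in.Minutes + rpois(n, 3)

# Knock out a few arrival delays to mimic the missing values.
toy$Arrival.Delay.in.Minutes[sample(n, 10)] <- NA

# Count missing values per column before imputing.
colSums(is.na(toy))

# Five imputed data sets, five iterations each (the mice defaults).
imp <- mice(toy, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Extract the first completed data set; mice:: avoids a clash with tidyr::complete.
toy_complete <- mice::complete(imp, 1)
```

Checking anyNA(toy_complete) afterwards is a quick way to confirm that no missing values remain.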

The output from mice shows that all the missing values have been imputed successfully:
##
## iter imp variable
## 1 1 Arrival.Delay.in.Minutes
## 1 2 Arrival.Delay.in.Minutes
## 1 3 Arrival.Delay.in.Minutes
## 1 4 Arrival.Delay.in.Minutes
## 1 5 Arrival.Delay.in.Minutes
## 2 1 Arrival.Delay.in.Minutes
## 2 2 Arrival.Delay.in.Minutes
## 2 3 Arrival.Delay.in.Minutes
## 2 4 Arrival.Delay.in.Minutes
## 2 5 Arrival.Delay.in.Minutes
## 3 1 Arrival.Delay.in.Minutes
## 3 2 Arrival.Delay.in.Minutes
## 3 3 Arrival.Delay.in.Minutes
## 3 4 Arrival.Delay.in.Minutes
## 3 5 Arrival.Delay.in.Minutes
## 4 1 Arrival.Delay.in.Minutes
## 4 2 Arrival.Delay.in.Minutes
## 4 3 Arrival.Delay.in.Minutes
## 4 4 Arrival.Delay.in.Minutes
## 4 5 Arrival.Delay.in.Minutes
## 5 1 Arrival.Delay.in.Minutes
## 5 2 Arrival.Delay.in.Minutes
## 5 3 Arrival.Delay.in.Minutes
## 5 4 Arrival.Delay.in.Minutes
## 5 5 Arrival.Delay.in.Minutes
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## X id
## "" ""
## Gender Customer.Type
## "" ""
## Age Type.of.Travel
## "" ""
## Class Flight.Distance
## "" ""
## Inflight.wifi.service Departure.Arrival.time.convenient
## "" ""
## Ease.of.Online.booking Gate.location
## "" ""
## Food.and.drink Online.boarding
## "" ""
## Seat.comfort Inflight.entertainment
## "" ""
## On.board.service Leg.room.service
## "" ""
## Baggage.handling Checkin.service
## "" ""
## Inflight.service Cleanliness
## "" ""
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## "" "pmm"
## satisfaction
## ""
## PredictorMatrix:
## X id Gender Customer.Type Age Type.of.Travel Class
## X 0 1 0 0 1 0 0
## id 1 0 0 0 1 0 0
## Gender 1 1 0 0 1 0 0
## Customer.Type 1 1 0 0 1 0 0
## Age 1 1 0 0 0 0 0
## Type.of.Travel 1 1 0 0 1 0 0
## Flight.Distance Inflight.wifi.service
## X 1 1
## id 1 1
## Gender 1 1
## Customer.Type 1 1
## Age 1 1
## Type.of.Travel 1 1
## Departure.Arrival.time.convenient Ease.of.Online.booking
## X 1 1
## id 1 1
## Gender 1 1
## Customer.Type 1 1
## Age 1 1
## Type.of.Travel 1 1
## Gate.location Food.and.drink Online.boarding Seat.comfort
## X 1 1 1 1
## id 1 1 1 1
## Gender 1 1 1 1
## Customer.Type 1 1 1 1
## Age 1 1 1 1
## Type.of.Travel 1 1 1 1
## Inflight.entertainment On.board.service Leg.room.service
## X 1 1 1
## id 1 1 1
## Gender 1 1 1
## Customer.Type 1 1 1
## Age 1 1 1
## Type.of.Travel 1 1 1
## Baggage.handling Checkin.service Inflight.service Cleanliness
## X 1 1 1 1
## id 1 1 1 1
## Gender 1 1 1 1
## Customer.Type 1 1 1 1
## Age 1 1 1 1
## Type.of.Travel 1 1 1 1
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## X 1 1 0
## id 1 1 0
## Gender 1 1 0
## Customer.Type 1 1 0
## Age 1 1 0
## Type.of.Travel 1 1 0
## Number of logged events: 5
## it im dep meth out
## 1 0 0 constant Gender
## 2 0 0 constant Customer.Type
## 3 0 0 constant Type.of.Travel
## 4 0 0 constant Class
## 5 0 0 constant satisfaction
We visualize the numerical variables below.



Here we check the distributions of the score variables. These behave more like categorical (ordinal) variables than numerical ones.

Here we plot the distributions of the character variables. We find that the genders of the customers are quite balanced. The majority of the trips are business travel. There are more neutral or dissatisfied reviews than satisfied ones, but not by a large margin.
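The class balance described above can also be checked numerically. The following is a small sketch on a hand-made data frame; the six rows and the names toy, char_cols and shares are hypothetical, and only the column names come from the dataset.

```r
# Six made-up passengers (hypothetical values).
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  Type.of.Travel = c("Business travel", "Business travel", "Personal Travel",
                     "Business travel", "Business travel", "Personal Travel"),
  satisfaction = c("satisfied", "neutral or dissatisfied",
                   "neutral or dissatisfied", "satisfied",
                   "neutral or dissatisfied", "neutral or dissatisfied"),
  stringsAsFactors = FALSE
)

# Pick out the character columns and compute the share of each level.
char_cols <- names(toy)[sapply(toy, is.character)]
shares <- lapply(toy[char_cols], function(x) prop.table(table(x)))

shares$satisfaction
```

Passing one of these tables to barplot, for example barplot(shares$satisfaction), draws the corresponding bar chart.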



## X Inflight.wifi.service Departure.Arrival.time.convenient
## Min. :-0.0043215 Min. :-0.00249 Min. :-0.004861
## 1st Qu.:-0.0008216 1st Qu.: 0.12121 1st Qu.: 0.011893
## Median : 0.0007393 Median : 0.13472 Median : 0.070119
## Mean : 0.0669499 Mean : 0.26709 Mean : 0.176147
## 3rd Qu.: 0.0016382 3rd Qu.: 0.34005 3rd Qu.: 0.218589
## Max. : 1.0000000 Max. : 1.00000 Max. : 1.000000
## Ease.of.Online.booking Gate.location Food.and.drink
## Min. :0.001913 Min. :-0.035428 Min. :-0.002162
## 1st Qu.:0.030944 1st Qu.:-0.002494 1st Qu.: 0.032185
## Median :0.038833 Median : 0.002313 Median : 0.059073
## Mean :0.224940 Mean : 0.145529 Mean : 0.233672
## 3rd Qu.:0.420518 3rd Qu.: 0.170661 3rd Qu.: 0.404512
## Max. :1.000000 Max. : 1.000000 Max. : 1.000000
## Online.boarding Seat.comfort Inflight.entertainment
## Min. :0.001002 Min. :0.0000435 Min. :-0.004861
## 1st Qu.:0.078926 1st Qu.:0.0496160 1st Qu.: 0.083950
## Median :0.204462 Median :0.1226578 Median : 0.299691
## Mean :0.256455 Mean :0.2683176 Mean : 0.339342
## 3rd Qu.:0.367796 3rd Qu.:0.4973836 3rd Qu.: 0.515371
## Max. :1.000000 Max. :1.0000000 Max. : 1.000000
## On.board.service Leg.room.service Baggage.handling Checkin.service
## Min. :-0.02837 Min. :-0.005873 Min. :-0.0005263 Min. :-0.03543
## 1st Qu.: 0.06398 1st Qu.: 0.064434 1st Qu.: 0.0554440 1st Qu.: 0.06525
## Median : 0.13197 Median : 0.123950 Median : 0.0957927 Median : 0.15314
## Mean : 0.25072 Mean : 0.212240 Mean : 0.2433687 Mean : 0.18395
## 3rd Qu.: 0.38782 3rd Qu.: 0.327593 3rd Qu.: 0.3738770 3rd Qu.: 0.21879
## Max. : 1.00000 Max. : 1.000000 Max. : 1.0000000 Max. : 1.00000
## Inflight.service Cleanliness
## Min. :-0.0001341 Min. :-0.00383
## 1st Qu.: 0.0522451 1st Qu.: 0.05248
## Median : 0.0887792 Median : 0.12322
## Mean : 0.2451461 Mean : 0.27344
## 3rd Qu.: 0.3867554 3rd Qu.: 0.49464
## Max. : 1.0000000 Max. : 1.00000
The correlation matrix shows that arrival delay and departure delay are very highly correlated, which is expected: a flight that takes off late is very likely to arrive late as well. The other numerical variables show little correlation with each other, also as expected.
Variables that fall into the “customer in-flight experience” area, such as food and drink, in-flight entertainment and seat comfort, are positively correlated with each other. One possible reason is that airlines manage these factors under one department or plan: if they decide to improve the food and drink, they might improve the in-flight entertainment as well. It is surprising, however, that in-flight service shows no correlation with cleanliness.
Variables such as WiFi service, online booking, and gate location also tend to be positively correlated. We speculate that larger airlines might have better online booking sites, access to better gate locations, and so on.
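The kind of check discussed above can be sketched in base R as follows. The data are simulated, and the 0.8 threshold and the names toy, cor_mat, idx and high_pairs are assumptions for illustration rather than the project's actual code.

```r
# Simulated delays: arrival delay is departure delay plus noise (hypothetical values).
set.seed(42)
n <- 500
dep <- rexp(n, rate = 1 / 15)
toy <- data.frame(
  Departure.Delay.in.Minutes = dep,
  Arrival.Delay.in.Minutes = dep + rnorm(n, sd = 5),
  Age = sample(18:80, n, replace = TRUE)
)

# Pearson correlation matrix of the numeric columns.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) where |r| exceeds the assumed threshold.
idx <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx]
)
high_pairs
```

Because the score variables are ordinal ratings, calling cor with method = "spearman" is a reasonable alternative to the Pearson default used in this sketch.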