Chapter 5 Reflections

Overall, We have tried three models: Logistic Regression, Random Forest and Catboost, ordering from most interpretable and least interpretable. Logistic Regression has an accuracy of 0.9248, while random forest has an accuracy of 0.963, and Catboost, the best model in terms of accuracy, has an accuracy of 0.964 . In our data pre-processing, we have utilized the ‘mice’ package to impute the missing values, and scaled some of the numerical variables to normal. We also converted the categorical variables to factors. The mice package shows great capability of detecting and imputing empty values. In our logistic regression model, we have detected that the departure delay variable is showing unusual pattern. It is extremely unlikely that customers will feel more satisfied when airlines departs late; However our result shows such. We suspect the existence of Simpson’s paradox; But we can not confirm it since we do not know if our data is a subset of a larger population. Removing either the departure delay or the arrival delay makes it behaving normally. We can see that since the departure delay is a confounding variable for both Arrival delay and satisfaction: If a plane departs late, it is very likely to arrive late. It is likely that the confounding effect from the departure delay variable caused its unnatural results. The random forest model and catboost model have very close accuracies. Both machine learning models are less interpretable. Although their accuracies are very close, their features are more different. For Catboost, wifi service has really high importance, and then type of travel and online-boarding. For random forest, online-boarding is the highest, following wifi-service and class. The big picture of variable importance are similar for both machine learning models, with some differences. It is very different for logistic regression. The top three variables with most importance are type of travel, customer type and checkin service, which is vastly different from the machine learning models. In conclusion, from the models we tested, we have seen that the major factors of customer satisfaction wifi service, type of travel, online boarding and customer service. It is interesting that all models have accuracies over 92% but their features show different behaviors.