Chapter 3 Random Forest

Then we will try the Random Forest model

Results of Random Forest model

## [1] 0.9631198

Details of the random forest model

## 
## Call:
##  randomForest(formula = satisfaction ~ Gender + Customer.Type +      Age + Type.of.Travel + Class + Flight.Distance + Inflight.wifi.service +      Departure.Arrival.time.convenient + Ease.of.Online.booking +      Gate.location + Food.and.drink + Online.boarding + Seat.comfort +      Inflight.entertainment + On.board.service + Leg.room.service +      Baggage.handling + Checkin.service + Inflight.service + Cleanliness +      Departure.Delay.in.Minutes + Arrival.Delay.in.Minutes, data = train,      controls = cforest_control(mtry = 2, mincriterion = 0)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 3.71%
## Confusion matrix:
##                         neutral or dissatisfied satisfied class.error
## neutral or dissatisfied                   57695      1184  0.02010904
## satisfied                                  2667     42358  0.05923376

We can see that the random forest model did yield better accuracy than the logistic regression model. From the variable importance plot, the online-boarding, wifi-service and class variables have the most importance.

Although random forest is not very interpretable, there are some ways; One of them is partial dependence plot. Obviously we can not create partial plots for every variable; To connect with a challenge that we encounter earlier, we will build a partial dependence plot between the ‘Departure delay’ variable and the target variable. We can see that after 150, it is very unlikely for passengers to give a satisfied feedback. It is demonstrated that the dissatisfaction accumulates very quickly after the delay is more than 30 minutes. This variable is behaving a lot more normal in the random forest model.