COMPARISON OF BAGGING, BOOSTING, AND STACKING ENSEMBLE MODELS FOR AIRLINE CUSTOMER SATISFACTION ANALYSIS

Following the end of the COVID-19 pandemic and its lockdowns, air travel has soared, with one report noting a 30.1% year-over-year increase in passengers. The rise in passenger numbers is an opportunity for airline carriers to recoup losses from the lockdowns, and competition heats up as rival carriers try to lure new and returning customers to their services. To remain competitive, more and more companies are turning to machine learning to analyze large amounts of data and gain an edge over their competitors, with ensemble learning being one of the many methods employed for this analysis. In this study, the Decision Tree, Random Forest, Boosting, and Stacking methods are chosen for a comparative study, supplied with an airline satisfaction dataset that has been cleaned of null values and given corrected data types, and compared with each other using confusion matrices, precision, recall, F1 score, and accuracy metrics, ROC curves, and feature importances. The results show that the three chosen ensemble methods are almost identical in overall success rate, with the Bagging method reaching 96.117%, Boosting 96.037%, and Stacking 96.264%; overall, Stacking has the highest rate of all. These results show the almost negligible differences among the three main ensemble learning methods in terms of efficacy. Additional studies with larger datasets and more varieties of ensemble learning methods could strengthen the overall judgement of these results.

With the complexity of identifying and analyzing overall customer satisfaction, airline industries have turned to machine learning (Janiesch et al., 2021). Combining multiple machine learning algorithms, so that their combined outputs can perform more complex calculations for better results, is called Ensemble Learning (Zhou, 2009).
Past studies on analyses of a multidimensional problem have seen higher

Problem Limitations
With the problems for this study identified, the next step is to determine the focus of this study; the limitations of this thesis are:
1. The number of ensemble learning techniques to be used for this study.
2. The factors of airline satisfaction will be the values and parameters in a dataset, which will be used for the ensemble learning study.
3. The expected results will be the comparison of the performance of the chosen ensemble learning methods.
The second dataset used for this study is the "test" dataset (Figure 2), which contains the same columns and data types as the "training" dataset (Figure 3) and has 25,977 records.
The "test" dataset will be used for the actual evaluation of the ensemble learning methods, while the "training" dataset will be used for training the models. Some unneeded columns will be removed in the preprocessing stage later. The description of each column of the "training" and "test" datasets used for this study is as follows:
1. Gender: contains a binary data type, between male and female.
2. Customer Type: contains binary data, between loyal and disloyal customers.
Figure 11 shows the correlation heatmap between the data columns within the "training" dataset; note the strong correlation between departure delay in minutes and arrival delay in minutes. The "test" dataset has similar columns and features, as shown in Figure 3.
Interestingly, Figure 12 shows the differences in satisfaction levels by age, from the youngest (5) to the oldest (79) within the dataset: people aged 39-60 usually have higher satisfaction rates than those aged 7-38 and 61-79.
Although gender can affect the ratings of certain aspects of an airline, Figure 13 shows that in overall satisfaction levels there is only a slight difference, where women tend to be slightly more dissatisfied than men, while satisfied levels remain about the same for both genders.
Data exploration has also shown a stark contrast between loyal and disloyal customer types: loyal customers rate more frequently than disloyal customers, and the level of satisfaction is lowest among disloyal customers (Figure 14).

Evaluation Method
To determine the performance of each ensemble learning model, four metrics will be used in this study to measure its efficacy. Figure 17 shows an example of an ROC curve.
Feature importance will also be used for the evaluation: it lists all the available data columns used in machine learning, weighted with scores, where the higher the score, the larger the effect that data column has on the model being used for the prediction. Figure 18 shows an example of feature importance.
Figure 17. ROC curve, where scores above 0.5 indicate a more accurate test, while scores under 0.5 indicate a less accurate one. Values larger than 0.5 also indicate that the model has an ability to discriminate between classes.
Figure 18. Bar chart of feature importance. The higher the score, the more that feature affects the overall model scoring.
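As a supplementary illustration (not part of the original study's code), the area under the ROC curve summarized above can be computed directly from its rank-statistic definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python sketch:

```python
def roc_auc(labels, scores):
    # AUC as a rank statistic: the fraction of (positive, negative) pairs
    # in which the positive example receives the higher score (ties count 0.5).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that ranks positives above negatives 3 times out of 4 pairs.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to the diagonal "no discrimination" line described above, and 1.0 to a perfect ranking.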

Calculation Methods
The following section details the calculation methods of all the ensemble learning methods and algorithms that will be used for the analysis and discussion of this study.
The sampling done in this study is stratified random sampling, grouped according to age; the time taken for the flight to arrive, where in each row the flight distance is divided by 20 and added to the arrival delay in minutes; and overtake, where in each row the departure delay in minutes is subtracted from the arrival delay in minutes. The calculations presented in this section are divided into two parts: the calculations used to measure the results for the evaluation, and the calculations used by the ensemble learning algorithms themselves.
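The two derived grouping variables and the stratified draw can be sketched as follows. This is an illustrative reconstruction, not the study's own code; the column names (`flight_distance`, `arrival_delay_min`, `departure_delay_min`, `age`) are assumptions standing in for the dataset's headers.

```python
import random
from collections import defaultdict

def derived_keys(row):
    # "Time taken to arrive": flight distance / 20 plus arrival delay in minutes.
    time_to_arrive = row["flight_distance"] / 20 + row["arrival_delay_min"]
    # "Overtake": departure delay subtracted from arrival delay.
    overtake = row["arrival_delay_min"] - row["departure_delay_min"]
    return time_to_arrive, overtake

def stratified_sample(rows, key_fn, frac, seed=42):
    # Draw roughly `frac` of the rows from every stratum defined by key_fn,
    # so each group keeps its share of the sample.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key_fn(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * frac))
        sample.extend(rng.sample(group, k))
    return sample

# Toy rows: 30 records cycling through three age strata.
rows = [{"flight_distance": 400, "arrival_delay_min": 10,
         "departure_delay_min": 5, "age": 30 + (i % 3) * 10}
        for i in range(30)]
sample = stratified_sample(rows, lambda r: r["age"], frac=0.5)
```

With 10 rows per age stratum and `frac=0.5`, each stratum contributes 5 rows, so every group stays represented in the sample.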
To gauge the efficacy of the ensemble learning model, firstly, the formula to calculate precision is as follows:

Precision = TP / (TP + FP)

Moving on, the equation to calculate the recall score is:

Recall = TP / (TP + FN)

These two results are then used in the F1 score calculation formula, which is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

In addition to these three equations, the accuracy of the model will also be calculated.
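The three formulas above (plus accuracy) translate directly into code. A minimal sketch, using an illustrative confusion-matrix count rather than the study's actual results:

```python
def precision(tp, fp):
    # Share of predicted positives that are actually positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Share of actual positives the model retrieves.
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    # Share of all samples classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for illustration: 90 TP, 10 FP, 30 FN, 70 TN.
p = precision(90, 10)          # 0.9
r = recall(90, 30)             # 0.75
f = f1_score(p, r)             # ~0.818
a = accuracy(90, 70, 10, 30)   # 0.8
```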

RESULTS AND DISCUSSION
Results
INTRODUCTION
As of 2023, the total number of passengers boarding airlines has increased by 30.1% compared to the previous year, showing a strong recovery from the COVID-19 pandemic that is expected to continue (Airlines IATA, 2023), with another source predicting that global passenger traffic will fully recover by 2024 and may reach 9.4 billion passengers (Figure 1).

Figure 1. The passenger traffic by each region from 2019 to the 2024 prediction. Source: (Airports Council International, 2023).
To identify and analyze overall customer satisfaction, airline industries have turned to machine learning, specifically ensemble learning, for making complex calculations and reporting on customer satisfaction analysis. Machine learning, in its basic definition, describes the ability of a system to learn from given data, related to analytics and solving given problems, which works by slowly learning meaningful patterns and relationships between pieces of data through examples and observations. The survey responses used for the data collection have values that are either Boolean or numeric, with no open-ended questions. The dataset, "Airline Passenger Satisfaction", containing US passenger details together with their rating of each of the airline's aspects, modified to be more cleaned-up than a previous dataset by the author TJ Klien, was obtained from the open-source dataset website Kaggle. It will be called the "training" dataset, and features 25 columns and 103,904 records.

Figure 3. Data attributes of the "training" dataset.
As shown in Figure 4 and Figure 5, the "unnamed" column together with the "id" column seen in Figure 1 has been removed, and most data types from numbers 6 to 19 have been changed into the "category" data type to better fit as inputs for ensemble learning. The next problem is the issue of missing values present in both datasets. Figure 6 lists the missing values in the "training" dataset, and Figure 7 shows the number of missing values present in the "test" dataset. In Figure 6 and Figure 7, the missing values fall in the Arrival Delay in Minutes column.
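The preprocessing steps described above (dropping the index-artifact and id columns, converting rating-like columns to the "category" dtype, and removing rows with missing arrival delays) can be sketched with pandas. The inline CSV is a tiny stand-in for the Kaggle files, and its column names mirror but are not guaranteed to match the dataset's exact headers:

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle "training"/"test" CSVs (illustrative values).
csv = io.StringIO(
    "Unnamed: 0,id,Gender,Customer Type,Arrival Delay in Minutes\n"
    "0,1,Male,Loyal Customer,5.0\n"
    "1,2,Female,disloyal Customer,\n"
    "2,3,Female,Loyal Customer,0.0\n"
)
df = pd.read_csv(csv)

# Drop the unneeded index-artifact and id columns.
df = df.drop(columns=["Unnamed: 0", "id"])

# Convert label-like columns to the "category" data type.
for col in ["Gender", "Customer Type"]:
    df[col] = df[col].astype("category")

# Remove the rows with a missing Arrival Delay in Minutes value.
df = df.dropna(subset=["Arrival Delay in Minutes"])
```

The same pipeline would be applied to both the "training" and the "test" files before model training and evaluation.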

Figure 4. The data attributes of the "training" dataset after the fix.

Figure 5. The data attributes of the "test" dataset after the fix.

Figure 6. The list of total missing values in each column of the "training" dataset.

Figure 7. The list of total missing values in each column of the "test" dataset.

Figure 8. The list of total missing values in the "training" dataset after the fix.

Figure 9. The list of total missing values in the "test" dataset after the fix.
Exploratory Data Analysis
To better understand the characteristics of, and gain insight into, the dataset used for the training, an exploratory data analysis was conducted.

Figure 10. Pie chart showing the overall numbers of neutral or dissatisfied and satisfied passengers within the dataset; the "training" dataset is nearly balanced between both levels of satisfaction.

Figure 11. Correlation heatmap showing the relationship of each column with one another in the "training" dataset. Note the strong correlation between the departure delay and arrival delay columns.

Figure 13. Bar chart showing the differences in satisfaction levels between the two genders.

Figure 14. Bar chart showing the total numbers of satisfied and neutral or dissatisfied passengers, split between loyal and disloyal customer types.

Figure 15 shows an interesting insight: the longer the flight distance, the higher the passenger's level of satisfaction regarding the in-flight entertainment and leg room.

Figure 15. Box chart and histogram plot showing the relations of flight distance and in-flight entertainment with satisfaction levels.
The four metrics used to measure efficacy are: Precision, the accuracy of the model in predicting positive labels from the given data; Recall, which calculates how much of the actual positive data the model can retrieve as true positive labels; F1 score, a calculation weighted from the precision and recall results; and Accuracy, which measures how often the model classifies data correctly. All four metrics will be laid out in tables comparing the methods with each other, supplemented with confusion matrices showing the predicted values along four dimensions: True Positive, where the model accurately predicts a positive data sample; False Negative, where the model incorrectly predicts a positive sample as negative; False Positive, where the model incorrectly predicts a negative sample as positive; and True Negative, where the model accurately predicts a negative data sample. Figure 16 shows what a confusion matrix looks like.
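The four confusion-matrix cells just defined can be counted directly from paired label lists. A minimal sketch (illustrative, with 1 as the positive class):

```python
def confusion_matrix(y_true, y_pred):
    # Count the four cells of a binary confusion matrix (1 = positive class).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Five toy predictions: two hits, one miss in each direction, one correct reject.
counts = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # (tp, fn, fp, tn)
```

These counts feed directly into the precision, recall, F1, and accuracy formulas from the Calculation Methods section.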

Figure 16. Confusion matrix and what each value signifies.
For the Bagging ensemble learning, the methods used will be the Random Forest classifier and the Decision Tree classifier; the Boosting method uses XGBoost, one of the most used sub-methods of Boosting; and the Stacking method will employ the standard Blending approach, which is widely used. ROC, or Receiver Operating Characteristic, will also be used for comparison and evaluation between the ensemble learning models: the ROC curve shows the test accuracy, where the closer the curve is to the top and left-hand border, the more accurate the test is, and vice versa. The last important equation is the accuracy of the ensemble learning model, which is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Next are the formulas of the ensemble learning models from chapter 2 that will be used for this study. The first is for Bagging:

H(x) = (1/B) Σ_{b=1..B} h_b(x)

where h_b(x) represents the weak learners present in the machine learning model, each trained on a bootstrap set generated from the training data. Next is the equation for the Boosting method:

F_m(x) = F_{m-1}(x) + α_m h_m(x)

where h_m(x) is a weak classifier built from the training data, and each new model F_m attempts to correct the errors of the previous model F_{m-1}. Finally, the formula for the Stacking method is:

ŷ(x) = Σ_{n=1..N} w_n f_n(x)

where f_n(x) is the output of the nth base model, N denotes the number of base models, and w_n denotes the weight of the nth base model for the input x.
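The four model families above can be sketched with scikit-learn. This is an illustrative setup on synthetic data, not the study's actual pipeline: `GradientBoostingClassifier` stands in for XGBoost to keep the sketch scikit-learn only, and `StackingClassifier` stands in for the manual blending approach; the data is generated rather than the airline CSVs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification stand-in for the airline dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "bagging_rf": RandomForestClassifier(n_estimators=100, random_state=42),
    # Gradient boosting as a stand-in for the study's XGBoost model.
    "boosting": GradientBoostingClassifier(random_state=42),
    # Stacking: base-model outputs are combined by a logistic-regression meta-learner.
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=42)),
                    ("gb", GradientBoostingClassifier(random_state=42))],
        final_estimator=LogisticRegression()),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

The fitted tree-based models also expose `feature_importances_`, which is the quantity plotted in the feature-importance figures.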

Figure 19, Figure 20, and Figure 21 are the test results of the Decision Tree ensemble learning method. Figure 22, Figure 23, and Figure 24 are the test results of the Random Forest ensemble learning method.

Figure 25, Figure 26, and Figure 27 are the test results of the Boosting ensemble learning method. Figure 28, Figure 29, and Figure 30 are the test results of the Stacking ensemble learning method.

Figure 19. Confusion matrix and precision, recall, F1 score, and support scores of the Decision Tree ensemble learning method.

Figure 20. Feature importance for the Decision Tree ensemble learning method.

Figure 21. ROC curve for the Decision Tree ensemble learning method.

Figure 22. Confusion matrix and precision, recall, F1 score, and support scores for the Random Forest ensemble learning method.

Figure 23. Feature importance of the Random Forest ensemble learning method.

Figure 24. ROC curve of the Random Forest ensemble learning method.

Figure 25. Confusion matrix and precision, recall, F1 score, and support scores for the Boosting ensemble learning method.

Figure 26. Feature importance chart of the Boosting ensemble learning method.

Figure 28. Confusion matrix and precision, recall, F1 score, and support scores for the Stacking ensemble learning method.

Figure 29. Feature importance chart of the Stacking ensemble learning method.

Figure 30. ROC curve of the Stacking ensemble learning method.

The ensemble learning methods used for this study, which are Random Forest and Decision Tree for the Bagging methods, Boosting, and Stacking, have shown different results from each other. Random Forest, Boosting, and Stacking have success rates of 96.117%, 96.037%, and 96.264% respectively, while the Decision Tree method performed most poorly, with a success rate of 89.63%. Among all the ensemble learning methods, Stacking has the highest overall success rate, showing that ensemble learning can identify the aspects of each passenger's satisfaction with the flight across the many factors of satisfaction within the given dataset; the worst-performing method is the Decision Tree. This study hopefully acts as guidance for ensemble learning enthusiasts of all levels of experience, and provides some knowledge and insight for those interested in gauging airline satisfaction. Future research on ensemble learning plans to expand to more ensemble learning methods, with larger datasets, for a better analysis of the efficacy of the available ensemble learning methods.