Harnessing Predictive Modelling for Education Index: A Dual Approach with Random Forest and Multiple Linear Regression
DOI:
https://doi.org/10.19166/fastjst.v10i1.10367
Keywords:
Education Index, Random Forest, Multiple Linear Regression, Socioeconomic
Abstract
This study explores predictive modeling of the Education Index (EI) through a dual approach that combines Random Forest regression and Multiple Linear Regression (MLR). The data, obtained from "Our World in Data" and spanning 1990–2022, integrate socio-economic and infrastructure indicators, including GDP per capita, government spending on education, and access to electricity. The study covers 20 countries categorized by income level: Low-Income (Vietnam, Nepal, Myanmar, Pakistan, Zimbabwe), Lower-Middle-Income (Ghana, Bolivia, Cambodia, Egypt, Bangladesh), Upper-Middle-Income (Argentina, Brazil, Peru, Russia, Mexico) and High-Income (Germany, Italy, Portugal, Iceland, Greece). The analysis shows that Random Forest outperforms MLR, with higher accuracy and lower error rates, while MLR provides better interpretability of variable relationships. Random Forest regression achieves an R² of 99.34%, compared with 94% for MLR. GDP per capita, primary and secondary completion rates, and internet usage significantly influence EI, underscoring the importance of economic conditions and infrastructure for educational outcomes. This study contributes to the field by offering comparative insights into machine learning and traditional statistical methods for educational analytics, providing a robust basis for policy development to enhance global education standards.
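The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. The data below are synthetic stand-ins, not the study's "Our World in Data" dataset, and the column meanings, sample size, split ratio, and hyperparameters are all assumptions chosen only for illustration. The sketch fits both models on the same training split and reports R² and mean squared error on a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the study's dataset).
# The three columns are hypothetical scaled indicators, e.g. GDP per capita,
# completion rate, and internet usage.
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = (0.3 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * np.sqrt(X[:, 2])
     + rng.normal(0.0, 0.02, size=n))

# Hold out 20% of the rows for evaluation (an assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "MLR": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }
    print(f"{name}: R2={results[name]['r2']:.4f}, MSE={results[name]['mse']:.6f}")

# MLR exposes one coefficient per predictor, which is what makes it easy to
# interpret; the forest exposes impurity-based feature importances instead.
mlr_coefficients = models["MLR"].coef_
rf_importances = models["Random Forest"].feature_importances_
```

Which model scores higher on this toy data says nothing about the study's results; the point is only that both models are evaluated on the same held-out rows with the same metrics, and that each offers a different handle on variable influence.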
License
Copyright (c) 2026 Olivia Kristianti Kusuma, Jessica, Grace Felicia Christy Widjaya, Helena Margaretha, Ferry Vincenttius Ferdinand, Kie Van Ivanky Saputra

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
“Authors who publish with this journal agree to the following terms:
1) Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC-BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
2) Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
3) Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website). The final published PDF should be used and bibliographic details that credit the publication in this journal should be included.”


