Memanfaatkan R untuk Preprocessing Data yang Efisien dalam Analisis Prediktif
[Leveraging R for Efficient Data Preprocessing in Predictive Analytics]
DOI: https://doi.org/10.19166/jstfast.v9i2.10366
Keywords: Data mining, Data preprocessing, Predictive analysis, R programming
Abstract
The digital era has triggered a data explosion that demands efficient data preprocessing. The R programming language, supported by a wide range of packages, offers effective solutions for preprocessing tasks, particularly the handling of missing values. This study aims to demonstrate the practical and efficient use of R in improving the quality of predictive models and to provide a practical guide for academics and practitioners. The research adopts a descriptive-exploratory approach through a case study that uses R for data preprocessing. The stages are data collection, data cleaning and transformation, result visualization, optional predictive analysis, and systematic documentation as a practical guide. Data imputation in R begins with an analysis of variable correlations and distributions using scatter plot matrices and histograms, followed by the selection of an appropriate imputation method such as linear regression, mean, or median. R facilitates this process through its comprehensive functions and visualization tools. Because this study does not address all aspects of data preprocessing, or of missing-data handling in particular, future research should explore alternative imputation techniques such as k-nearest neighbors (kNN) as well as other preprocessing components.
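The imputation workflow described in the abstract (inspect correlations and distributions, then choose between regression, mean, and median imputation) could look roughly like the following base R sketch. The toy data, the variable names (`x`, `y`, `w`, `z`), and the rule of thumb that maps each variable to a method are assumptions made for illustration; they are not taken from the study.

```r
# Minimal sketch on hypothetical toy data, not the study's dataset.
if (!interactive()) pdf(NULL)  # avoid writing Rplots.pdf when run as a script

set.seed(42)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)
dat <- data.frame(
  x = x,
  y = 2 * x + rnorm(n, sd = 5),   # strongly related to x
  w = rnorm(n, mean = 10, sd = 2), # roughly symmetric, unrelated to x
  z = rexp(n, rate = 0.1)          # right-skewed, unrelated to x
)

# Knock out 20 values in each of y, w, and z to mimic missing data.
for (v in c("y", "w", "z")) dat[[v]][sample(n, 20)] <- NA

# Step 1: inspect relationships and distributions.
pairs(dat)                                   # scatter plot matrix
print(round(cor(dat, use = "pairwise.complete.obs"), 2))
hist(dat$w, main = "w", xlab = "w")          # symmetric shape
hist(dat$z, main = "z", xlab = "z")          # skewed shape

# Step 2: impute, choosing the method per variable.
# y correlates with x, so predict the missing values from a linear model.
fit <- lm(y ~ x, data = dat)                 # rows with NA are dropped by default
miss_y <- is.na(dat$y)
dat$y[miss_y] <- predict(fit, newdata = dat[miss_y, , drop = FALSE])

# w is roughly symmetric, so the mean is a reasonable fill value.
dat$w[is.na(dat$w)] <- mean(dat$w, na.rm = TRUE)

# z is skewed, so the median is less affected by the long tail.
dat$z[is.na(dat$z)] <- median(dat$z, na.rm = TRUE)

print(colSums(is.na(dat)))                   # remaining missing values per column
```

Mean and median imputation shrink the variance of the filled variable, which is one reason to check the imputed distribution against the original before modeling.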
Bahasa Indonesia Abstract (translated): The digital era has produced a data explosion that demands efficient data preprocessing. The R language, with its many supporting packages, offers effective solutions for preprocessing, particularly the handling of missing values. This study aims to demonstrate the use of R to improve the quality of predictive models and to provide a practical guide for academics and practitioners. The methodology is descriptive-exploratory, with a case study that uses R for data preprocessing. The stages are data collection, data cleaning and transformation, visualization of the results, and documentation of the steps as a practical guide. The study runs an experiment on simulated data: a large, already clean dataset is made incomplete using the R package messy. Data imputation in R starts with an analysis of variable correlations and distributions using a scatter plot matrix and histograms, followed by the choice of a suitable imputation method such as linear regression, mean, or median. R simplifies this process through its comprehensive functions and visualizations. The evaluation compares the distribution shape of the original data with that of the cleaned simulated data. The results show that the distribution shapes of the two datasets do not differ significantly.
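The simulation and evaluation design described above (start from clean data, make it incomplete, impute, then compare distribution shapes) can be sketched as follows. This is an illustration under stated assumptions: the `score` variable is hypothetical, missing values are injected with base R `sample()` as a stand-in for the messy package, and median imputation is only one of the methods the abstract names.

```r
# Illustrative sketch only; the data and the 10% missingness rate are assumptions.
if (!interactive()) pdf(NULL)  # avoid writing Rplots.pdf when run as a script

set.seed(123)
clean <- data.frame(score = rnorm(1000, mean = 70, sd = 8))  # hypothetical clean data

# Make the data incomplete (base R stand-in for the messy package).
n_missing <- 100
na_idx <- sample(nrow(clean), size = n_missing)
incomplete <- clean
incomplete$score[na_idx] <- NA

# Clean the simulated data again, here with median imputation.
imputed <- incomplete
imputed$score[is.na(imputed$score)] <- median(imputed$score, na.rm = TRUE)

# Compare the shape of the two distributions side by side.
op <- par(mfrow = c(1, 2))
hist(clean$score, main = "Original", xlab = "score")
hist(imputed$score, main = "Imputed", xlab = "score")
par(op)

# Numeric comparison of location and spread.
print(rbind(original = summary(clean$score),
            imputed  = summary(imputed$score)))
```

Histograms and summary statistics give only an informal comparison; the abstract does not say which formal criterion, if any, the study used.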
License
Copyright (c) 2025 I Gusti Agung Anom Yudistira

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
“Authors who publish with this journal agree to the following terms:
1) Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC-BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
2) Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
3) Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website). The final published PDF should be used and bibliographic details that credit the publication in this journal should be included.”
