Machine Learning for Early Prediction of Chronic Kidney Disease: Using Imbalanced and Limited-Size Data Sets (Preprint)
BACKGROUND Chronic kidney disease (CKD) is a worldwide public health problem that is usually diagnosed in its late stages, which increases public health costs and mortality rates. Late diagnosis is even more critical in low- and middle-income countries because of high poverty levels, many hard-to-reach locations, and primary care that is sometimes absent or precarious. Investment in early prediction is therefore necessary to alleviate these issues.

OBJECTIVE The purpose of this study is to assist the early prediction of CKD while addressing problems related to imbalanced and limited-size data sets.

METHODS To address our multi-class problem (low risk, medium risk, high risk, and very high risk), we used data from the medical records of 60 Brazilians with or without a diagnosis of CKD, containing the following attributes: hypertension, diabetes mellitus (DM), creatinine, urea, albuminuria, age, gender, and glomerular filtration rate. We used two approaches for oversampling: (1) manual augmentation with data validated by an experienced nephrologist and (2) automated augmentation with the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, and borderline-SMOTE support vector machine. We implemented classification models based on these data sets and the following algorithms: decision tree (DT), random forest, and multi-class AdaBoosted DTs. We also applied the overall local accuracy and local class accuracy methods for dynamic classifier selection, and the k-nearest oracles-union, k-nearest oracles-eliminate, and META-DES methods for dynamic ensemble selection. We analyzed the models' performance using hold-out validation, multiple stratified cross-validation (CV), and nested CV. We also computed feature importance using feature selection methods.

RESULTS The best performance was achieved by the DT and multi-class AdaBoosted DT classification models, trained on data oversampled with SMOTE and validated with the multiple stratified CV and nested CV methods.
The DT model presented the highest accuracy score (98.99%) for both multiple stratified CV and nested CV, followed by the multi-class AdaBoosted DTs (97.99% and 98.00%, respectively).

CONCLUSIONS The SMOTE and multiple stratified CV or nested CV methods provided reliable results for this imbalanced and limited-size data set. During CKD monitoring based on the DT model, and assuming a previous diabetes mellitus (DM) evaluation, the user only needs to perform two blood tests: creatinine and urea. Thus, the DT model can assist in designing systems for the early prediction of CKD, providing easy interpretation and cost reduction.
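
The combination of SMOTE with stratified CV described above can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation: the functions `smote` and `stratified_folds`, the two classes, and the randomly generated toy values are all hypothetical and stand in for the real four-class clinical data. The sketch also assumes that oversampling is applied only to the training part of each fold, so that no synthetic point derived from a test record leaks into training; the abstract does not state where in the pipeline the oversampling was applied.

```python
import math
import random
from collections import Counter, defaultdict


def smote(X, y, k=3, seed=0):
    """Oversample each minority class up to the majority-class count.

    Each synthetic row is a random point on the segment between a real
    minority row and one of its k nearest same-class neighbours.
    Classes with a single row are left unchanged (nothing to interpolate).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # k nearest same-class neighbours of the base row, excluding itself
            others = [r for r in rows if r is not base]
            neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


def stratified_folds(y, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)  # deal indices round-robin
    for f in range(n_splits):
        test = sorted(folds[f])
        train = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield train, test


# Toy data only (randomly generated, NOT patient data): two hypothetical
# features per row and an imbalanced 12-versus-6 label distribution.
data_rng = random.Random(42)
X = [[data_rng.gauss(1.0, 0.2), data_rng.gauss(30, 5)] for _ in range(12)]
X += [[data_rng.gauss(3.0, 0.5), data_rng.gauss(80, 10)] for _ in range(6)]
y = ["low"] * 12 + ["high"] * 6

for train_idx, test_idx in stratified_folds(y, n_splits=3):
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    # Oversample the training fold only; the test fold stays untouched.
    X_bal, y_bal = smote(X_tr, y_tr, k=3)
    print(dict(Counter(y_tr)), "->", dict(Counter(y_bal)),
          "| test size:", len(test_idx))
```

With this toy split, every training fold holds 8 majority and 4 minority rows before oversampling and 8 of each afterwards, while each test fold keeps its 6 original rows. In practice, a maintained library implementation of SMOTE and of stratified or nested CV would normally be preferred over a hand-written one.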