A Comparison of Resampling Techniques for Medical Data Using Machine Learning
Data imbalance with respect to the class labels has been recognised as a challenging problem for machine learning techniques as it has a direct impact on the classification model’s performance. In an imbalanced dataset, most of the instances belong to one class, while far fewer instances are associated with the remaining classes. Most of the machine learning algorithms tend to favour the majority class and ignore the minority classes leading to classification models being generated that cannot be generalised. This paper investigates the problem of class imbalance for a medical application related to autism spectrum disorder (ASD) screening to identify the ideal data resampling method that can stabilise classification performance. To achieve the aim, experimental analyses to measure the performance of different oversampling and under-sampling techniques have been conducted on a real imbalanced ASD dataset related to adults. The results produced by multiple classifiers on the considered datasets showed superiority in terms of specificity, sensitivity, and precision, among others, when adopting oversampling techniques in the pre-processing phase.