Different Approaches to Reducing Bias in Classification of Medical Data by Ensemble Learning Methods

Author(s):  
Adem Doganer

In this study, different models were created to reduce bias using ensemble learning methods. Reducing the bias error improves classification performance. To increase classification performance, the most appropriate ensemble learning method and the ideal sample size were investigated, and the bias values and learning performances of different ensemble learning methods were compared. The AdaBoost ensemble learning method provided the lowest bias value at a sample size of n = 250, while the Stacking ensemble learning method provided the lowest bias value at n = 500, 750, 1000, 2000, 4000, 6000, 8000, 10000, and 20000. When learning performances were compared, the AdaBoost ensemble learning method with the RBF classifier achieved the best performance at n = 250 (ACC = 0.956, AUC = 0.987), and the AdaBoost ensemble learning method with the REPTree classifier achieved the best performance at n = 20000 (ACC = 0.990, AUC = 0.999). In conclusion, for bias reduction, stacking-based methods performed better than the other methods.
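The ACC and AUC figures quoted above can be reproduced from a classifier's outputs with a few lines of code. The following is a minimal sketch, assuming binary labels coded as 0/1 and a real-valued score per sample; the function names `accuracy` and `auc` are illustrative and not taken from the study. AUC is computed here through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    sample has the higher score; ties contribute 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off
    print(accuracy(labels, preds), auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; a sort-based rank computation would be the usual choice at sample sizes such as n = 20000.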

2021
Vol 21 (S2)
Author(s):
Kun Zeng
Yibin Xu
Ge Lin
Likeng Liang
Tianyong Hao

Abstract. Background: Eligibility criteria are the primary strategy for screening the target participants of a clinical trial. Automated classification of clinical trial eligibility criteria text using machine learning methods improves recruitment efficiency and so reduces the cost of clinical research. However, existing methods suffer from poor classification performance due to the complexity and imbalance of eligibility criteria text data. Methods: An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models including Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal Loss is used as the loss function to address the data imbalance problem. Metric learning is employed to train the embedding of each base model so that features are easier to distinguish. Soft Voting is applied to obtain the final classification of the ensemble model. The dataset is from standard evaluation task 3 of the 5th China Health Information Processing Conference and contains 38,341 eligibility criteria texts in 44 categories. Results: Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that our model achieved a significant improvement. Conclusions: A model for classifying eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that our ensemble model significantly improved classification performance. In addition, metric learning improved the word embedding representation, and the focal loss reduced the impact of data imbalance on model performance.
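Two of the components named in the Methods, Focal Loss and Soft Voting, are compact enough to illustrate directly. The sketch below is not the authors' implementation: it assumes each base model outputs a probability vector over the classes, uses the common focal loss form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), and treats soft voting as an unweighted average of those probability vectors. The names `focal_loss` and `soft_vote`, the defaults `gamma=2.0` and `alpha=1.0`, and the clamping constant are illustrative assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one sample.

    probs  : predicted class probabilities (should sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers cross-entropy
    alpha  : scalar weight (assumed constant here, not per-class)
    """
    p_t = min(max(probs[target], eps), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_vote(prob_vectors):
    """Average the probability vectors of several models, then take argmax.

    Returns (predicted_class_index, averaged_probabilities).
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


if __name__ == "__main__":
    # A confidently correct sample is down-weighted far more than a hard one.
    print(focal_loss([0.9, 0.1], target=0))
    print(focal_loss([0.3, 0.7], target=0))
    # Three hypothetical base models disagree; the averaged vote decides.
    print(soft_vote([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified samples, so rare, hard categories contribute relatively more to training, which is why this loss suits imbalanced data.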


The problem of medical data classification is analyzed, and existing classification methods are reviewed from various aspects. However, the efficiency of classification algorithms remains in question. To improve classification performance, this paper presents an algorithm based on a Class Level disease Convergence and Divergence (CLDC) measure. For any dimension of medical data, its convergence or divergence indicates the support for the disease class. First, the data set is preprocessed to remove noisy data points. The method then estimates the disease convergence/divergence measure on different dimensions. The convergence measure is computed from the frequency of dimensional matches, whereas the divergence is estimated from the dimensional matches of other classes. From these measures, a disease support factor is estimated. The disease support value is used to classify the data point, which improves classification performance.
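The abstract does not give formulas for the CLDC measure, so the following is only one possible reading, sketched to make the idea concrete. It assumes categorical attributes, defines convergence as the mean per-dimension frequency with which a sample's values occur among training records of the candidate class, defines divergence as the same frequency among records of all other classes, and takes the support factor as their difference. All of these definitions, and the names `match_frequency`, `support`, and `classify`, are assumptions rather than the paper's method.

```python
def match_frequency(rows, dim, value):
    """Fraction of rows whose attribute at index `dim` equals `value`."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[dim] == value) / len(rows)


def support(sample, train, cls):
    """Assumed support factor: convergence minus divergence for class `cls`.

    train is a list of (attributes, label) pairs.
    """
    own = [x for x, y in train if y == cls]
    others = [x for x, y in train if y != cls]
    n_dims = len(sample)
    convergence = sum(match_frequency(own, d, sample[d]) for d in range(n_dims)) / n_dims
    divergence = sum(match_frequency(others, d, sample[d]) for d in range(n_dims)) / n_dims
    return convergence - divergence


def classify(sample, train):
    """Assign the class with the highest assumed support factor."""
    classes = sorted({y for _, y in train})
    return max(classes, key=lambda c: support(sample, train, c))


if __name__ == "__main__":
    # Toy, made-up records: (temperature, cough) -> label
    train = [
        (("high", "yes"), "flu"),
        (("high", "no"), "flu"),
        (("low", "no"), "healthy"),
        (("low", "no"), "healthy"),
    ]
    print(classify(("high", "yes"), train))
```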


Author(s):  
Naeem Ahmed Mahoto
Abdul Hafeez Babar

The sparse nature of medical data makes knowledge discovery and prediction a complex analysis task. Machine learning algorithms have produced promising results for diversified data. This chapter constructs an effective classification model for medical data analysis. In particular, nine classification models, namely Naïve Bayes, decision trees (J48 and Random Forest), multilayer perceptron, radial basis function, k-nearest neighbors, single conjunctive rule learner, support vector machine, and simple logistic, have been applied to develop an effective model. In addition, the classification models have been used in conjunction with ensemble learning methods, since ensemble methods significantly improve the predictive outcomes of classification models. The classification models were evaluated using accuracy, F-measure, precision, and recall. The empirical results revealed that combining ensemble learning methods with classification models produces better predictions than a single classification model for medical data.


Author(s):  
WIDYA DWI ARYATI
MUHAMMAD SIDDIQ WINARKO
GERRY MAY SUSANTO
ARRY YANUAR

Objective: New psychoactive substances (NPS) have been rapidly developed to avoid legal entanglement. In 2013–2018, the number of cathinone-derived compounds increased from 30 to 89. In 2016, of 56 NPS compounds, 21 were identified as cannabinoid-derived; only 43 were regulated in the narcotics law. Artificial intelligence, such as machine and deep learning, is a method of data processing and object recognition, including human poses and image classifications. Methods: Herein, the machine and deep learning methods for cathinone- and cannabinoid-derived compound classification were compared using pharmacophore modeling as the reference method. For classifying cathinone-derived compounds, the structure was transformed into fingerprints, which was used as a learning parameter for the machine and deep learning methods. Contrarily, the physicochemical properties and fingerprint shape were utilized as learning materials for the deep learning method to classify the cannabinoid-derived substances. Results: Consequently, in the cathinone-derived compound classification, the deep learning method produced the accuracy and Cohen kappa values of 0.9932 and 0.992, respectively. Furthermore, such values in the pharmacophore modeling method were higher than those in the machine learning method (0.911 and 0.708 vs. 0.718 and 0.673, respectively). In the cannabinoid-derived compound classification, the deep learning method with the fingerprint form had the highest accuracy and Cohen kappa values (0.9904 and 0.9876). Such values in this method with the descriptor form were higher than those in the pharmacophore modeling method (0.8958 and 0.8622 vs. 0.68 and 0.396, respectively). Conclusion: The deep learning method has the potential in the NPS classification.
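The Results report Cohen's kappa alongside accuracy. As a reminder of how the two differ, here is a minimal sketch of the standard kappa statistic, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The function name `cohen_kappa` and the toy labels are illustrative; no data from the study is used.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: identical to plain accuracy.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum((true_counts[c] / n) * (pred_counts[c] / n) for c in labels)
    if p_e == 1.0:
        # Both label sets are a single identical class; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # agreement at chance level
```

Because kappa subtracts chance agreement, a method can have moderate accuracy and a much lower kappa, as with the 0.68 and 0.396 pair quoted above.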


2021
Vol 13 (19)
pp. 3945
Author(s):
Bin Wang
Linghui Xia
Dongmei Song
Zhongwei Li
Ning Wang

Sea ice information in the Arctic region is essential for climatic change monitoring and ship navigation. Although many sea ice classification methods have been put forward, the accuracy and usability of classification systems can still be improved. In this paper, a two-round weight voting strategy-based ensemble learning method is proposed for refining sea ice classification. The proposed method includes three main steps. (1) The preferable features of sea ice are constituted by polarization features (HH, HV, HH/HV) and the top six GLCM-derived texture features via a random forest. (2) The initial classification maps can then be generated by an ensemble learning method, which includes six base classifiers (NB, DT, KNN, LR, ANN, and SVM). The tuned voting weights by a genetic algorithm are employed to obtain the category score matrix and, further, the first coarse classification result. (3) Some pixels may be misclassified due to their corresponding numerically close score value. By introducing an experiential score threshold, each pixel is identified as a fuzzy or an explicit pixel. The fuzzy pixels can then be further rectified based on the local similarity of the neighboring explicit pixels, thereby yielding the final precise classification result. The proposed method was examined on 18 Sentinel-1 EW images, which were captured in the Northeast Passage from November 2019 to April 2020. The experiments show that the proposed method can effectively maintain the edge profile of sea ice and restrain noise from SAR. It is superior to the current mainstream ensemble learning algorithms with the overall accuracy reaching 97%. The main contribution of this study is proposing a superior weight voting strategy in the ensemble learning method for sea ice classification of Sentinel-1 imagery, which is of great significance for guiding secure ship navigation and ice hazard forecasting in winter.
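Steps (2) and (3) can be illustrated with a small sketch. It is not the authors' implementation and makes several assumptions: the voting weights are given by hand rather than tuned by a genetic algorithm, a pixel is called fuzzy when the gap between its two highest category scores falls below the threshold, and a fuzzy pixel is rectified by taking the majority label among its explicit 8-neighbors, which stands in for the paper's local-similarity rule. The function names are hypothetical.

```python
from collections import Counter


def weighted_scores(votes, weights, n_classes):
    """Category scores for one pixel: sum of the weights of the classifiers voting for each class."""
    scores = [0.0] * n_classes
    for label, w in zip(votes, weights):
        scores[label] += w
    return scores


def classify_pixel(votes, weights, n_classes, threshold):
    """Return (coarse_label, is_fuzzy); needs n_classes >= 2.

    A pixel is treated as fuzzy when the top two scores are numerically close.
    """
    scores = weighted_scores(votes, weights, n_classes)
    ranked = sorted(range(n_classes), key=lambda c: scores[c], reverse=True)
    best, second = ranked[0], ranked[1]
    return best, (scores[best] - scores[second]) < threshold


def refine(labels, fuzzy):
    """Second round: relabel each fuzzy pixel from its explicit 8-neighbors.

    labels, fuzzy : 2-D lists of equal shape (class index, bool).
    A fuzzy pixel with no explicit neighbor keeps its coarse label.
    """
    rows, cols = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for r in range(rows):
        for c in range(cols):
            if not fuzzy[r][c]:
                continue
            neighbors = [
                labels[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and not fuzzy[rr][cc]
            ]
            if neighbors:
                out[r][c] = Counter(neighbors).most_common(1)[0][0]
    return out


if __name__ == "__main__":
    # Three hypothetical classifiers vote on one pixel with hand-set weights.
    print(classify_pixel([0, 1, 1], [0.5, 0.3, 0.25], n_classes=2, threshold=0.1))
    labels = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    fuzzy = [[False] * 3, [False, True, False], [False] * 3]
    print(refine(labels, fuzzy))
```

Reading neighbor labels from the coarse map rather than the partially refined one keeps the result independent of scan order.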


2014
Vol 945-949
pp. 2505-2508
Author(s):
Xiao Yu Chen
Bo Liu
Xin Xia

ReliefF feature selection and the LogitBoost ensemble learning method are applied to data mining of 2126 fetal cardiotocograms (CTGs). Using either the 10 critical features selected by ReliefF or the full 21 features, the LogitBoost algorithm outperforms the other three ensemble learning methods (Stacking, Bagging, and AdaBoostM1) in nearly all comparisons of ACC (%) and AUC for classification. With the critical features from ReliefF, LogitBoost reaches an ACC of 94.45% and an AUC of 0.977.

