Label Noise Cleaning with an Adaptive Ensemble Method Based on Noise Detection Metric

Sensors ◽  
2020 ◽  
Vol 20 (23) ◽  
pp. 6718
Author(s):  
Wei Feng ◽  
Yinghui Quan ◽  
Gabriel Dauphin

Real-world datasets are often contaminated with label noise; labeling is not a clear-cut process, and reliable labeling methods tend to be expensive or time-consuming. Depending on the learning technique used, such label noise is potentially harmful: it can require a larger training set, make the trained model more complex and more prone to overfitting, and yield less accurate predictions. This work proposes a cleaning technique called the ensemble method based on the noise detection metric (ENDM). From the corrupted training set, an ensemble classifier is first learned and used to derive four metrics assessing the likelihood that a sample is mislabeled. For each metric, three thresholds are set to maximize the classification performance on a corrupted validation dataset when using three different ensemble classifiers, namely Bagging, AdaBoost and k-nearest neighbor (k-NN). These thresholds are used to identify and then either remove or correct the corrupted samples. The effectiveness of the ENDM is demonstrated on the classification of 15 public datasets. A comparative analysis is conducted against the homogeneous-ensemble-based majority-vote and consensus-vote methods, two popular ensemble-based label noise filters.
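The core filtering step can be sketched as follows. This is a minimal Python/NumPy illustration of the idea, not the authors' implementation: the ensemble predictions, the single disagreement metric, and the 0.6 threshold are all chosen arbitrarily for the example.

```python
import numpy as np

# Toy labels for 8 samples and simulated predictions from a 5-member ensemble.
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])
preds = np.array([
    [0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 1],
])

# Noise-detection metric: fraction of ensemble members disagreeing with the label.
disagreement = (preds != labels).mean(axis=0)

# Samples whose metric exceeds the tuned threshold are flagged as mislabeled
# and removed (they could alternatively be relabeled to the ensemble's vote).
threshold = 0.6
clean_idx = np.flatnonzero(disagreement < threshold)
print(clean_idx)  # sample 2, contradicted by every member, is dropped
```

In the paper the threshold itself is tuned on a corrupted validation set per metric and per downstream classifier; here it is simply fixed.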

2021 ◽  
pp. 107907
Author(s):  
Shuyin Xia ◽  
Longhai Huang ◽  
Guoyin Wang ◽  
Xinbo Gao ◽  
Yabin Shao ◽  
...  

2020 ◽  
Vol 34 (04) ◽  
pp. 6853-6860
Author(s):  
Xuchao Zhang ◽  
Xian Wu ◽  
Fanglan Chen ◽  
Liang Zhao ◽  
Chang-Tien Lu

The success of training accurate models strongly depends on the availability of a sufficient collection of precisely labeled data. However, real-world datasets contain erroneously labeled data samples that substantially hinder the performance of machine learning models. Meanwhile, well-labeled data is usually expensive to obtain and only a limited amount is available for training. In this paper, we consider the problem of training a robust model by using large-scale noisy data in conjunction with a small set of clean data. To leverage the information contained in the clean labels, we propose a novel self-paced robust learning algorithm (SPRL) that trains the model in a process from more reliable (clean) data instances to less reliable (noisy) ones under the supervision of well-labeled data. The self-paced learning process hedges the risk of selecting corrupted data into the training set. Moreover, theoretical analyses on the convergence of the proposed algorithm are provided under mild assumptions. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed approach achieves a considerable improvement in effectiveness and robustness over existing methods.
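The self-paced selection idea can be illustrated on a toy estimation problem. The sketch below is not SPRL itself: the data, the squared-error "loss", and the threshold schedule (`lam`) are all invented for illustration; the point is only that low-loss (reliable) samples are admitted first, starting from a small clean seed set.

```python
import numpy as np

# Toy 1-D data: most samples are reliable (near 1.0), two are corrupted.
data = np.array([0.9, 1.1, 1.0, 0.95, 1.05, 8.0, -6.0])
clean_seed = data[:3]            # small well-labeled set supervising the start
est = clean_seed.mean()

lam = 0.5                        # self-paced threshold on the per-sample loss
for _ in range(5):
    loss = (data - est) ** 2     # reliability proxy: loss under current model
    selected = data[loss < lam]  # admit easy (low-loss) samples first
    est = selected.mean()        # refit on the currently trusted subset
    lam *= 1.5                   # gradually relax to admit harder samples

print(est)  # the corrupted values 8.0 and -6.0 never enter training
```

Because the threshold grows slowly, the grossly corrupted samples stay outside the training set for the whole schedule, which is exactly the risk-hedging behavior the abstract describes.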


2019 ◽  
Vol 19 (01) ◽  
pp. 1940009 ◽  
Author(s):  
AHMAD MOHSIN ◽  
OLIVER FAUST

Cardiovascular disease has been the leading cause of death worldwide. Electrocardiogram (ECG)-based heart disease diagnosis is simple, fast, cost-effective and non-invasive. However, interpreting ECG waveforms can be taxing for a clinician who has to deal with hundreds of patients during a day. We propose computing machinery to reduce the workload of clinicians and to streamline the clinical work processes. Replacing human labor with machine work can lead to cost savings. Furthermore, it is possible to improve the diagnosis quality by reducing inter- and intra-observer variability. To support that claim, we created a computer program that recognizes normal, Dilated Cardiomyopathy (DCM), Hypertrophic Cardiomyopathy (HCM) or Myocardial Infarction (MI) ECG signals. The computer program combined Discrete Wavelet Transform (DWT) based feature extraction and K-Nearest Neighbor (K-NN) classification for discriminating the signal classes. The system was verified with tenfold cross-validation based on labeled data from the PTB diagnostic ECG database. During the validation, we adjusted the number of neighbors [Formula: see text] for the machine learning algorithm. For [Formula: see text], the training set accuracy and cross-validation accuracy were 98.33% and 95%, respectively. However, for [Formula: see text], the training set accuracy remained constant while the cross-validation accuracy dropped drastically to 80%. Hence, the setting [Formula: see text] prevails. Furthermore, a confusion matrix showed that normal data was identified with 96.7% accuracy, 99.6% sensitivity and 99.4% specificity. This corresponds to an error rate of 3.3%: for every 30 normal signals, the classifier will mislabel only 1 of them as HCM. With these results, we are confident that the proposed system can improve the speed and accuracy with which normal and diseased subjects are identified. Diseased subjects can be treated earlier, which improves their probability of survival.
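The DWT-plus-k-NN pipeline can be sketched in miniature. This is an illustrative Python/NumPy toy, not the paper's system: a single level of the Haar wavelet stands in for the feature extraction, and the "ECG" segments are synthetic flat vs. oscillating signals.

```python
import numpy as np

def haar_level1(signal):
    # One level of the Haar DWT: approximation and detail coefficients.
    s = np.asarray(signal, dtype=float).reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2)
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2)
    return np.concatenate([approx, detail])

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest training samples (Euclidean distance).
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

# Toy "ECG" segments: class 0 is flat, class 1 oscillates.
flat = [np.full(8, a) for a in (0.9, 1.0, 1.1)]
osc = [np.tile([a, -a], 4) for a in (0.9, 1.0, 1.1)]
X = np.array([haar_level1(s) for s in flat + osc])
y = np.array([0, 0, 0, 1, 1, 1])

query = haar_level1(np.tile([0.95, -0.95], 4))
print(knn_predict(X, y, query, k=3))  # → 1 (oscillating class)
```

In the real system the wavelet decomposition is deeper and k is tuned by tenfold cross-validation, as the abstract describes; a too-large k degrades cross-validation accuracy while leaving training accuracy unchanged.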


2004 ◽  
Vol 8 (3) ◽  
pp. 141-154
Author(s):  
Virginia Wheway

Ensemble classification techniques such as bagging (Breiman, 1996a), boosting (Freund & Schapire, 1997) and arcing algorithms (Breiman, 1997) have received much attention in recent literature. Such techniques have been shown to lead to reduced classification error on unseen cases. Even when the ensemble is trained well beyond zero training set error, the ensemble continues to exhibit improved classification error on unseen cases. Despite many studies and conjectures, the reasons behind this improved performance and understanding of the underlying probabilistic structures remain open and challenging problems. More recently, diagnostics such as edge and margin (Breiman, 1997; Freund & Schapire, 1997; Schapire et al., 1998) have been used to explain the improvements made when ensemble classifiers are built. This paper presents some interesting results from an empirical study performed on a set of representative datasets using the decision tree learner C4.5 (Quinlan, 1993). An exponential-like decay in the variance of the edge is observed as the number of boosting trials is increased; i.e., boosting appears to ‘homogenise’ the edge. Some initial theory is presented which indicates that a lack of correlation between the errors of individual classifiers is a key factor in this variance reduction.
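The "edge" diagnostic is simple to compute. The sketch below is an illustrative Python/NumPy example with made-up predictions and classifier weights, not data from the study: the edge of a sample is the total weight of the ensemble members that misclassify it, and the quantity the paper tracks is the variance of this edge over the training set.

```python
import numpy as np

# Rows = ensemble members (e.g. boosting rounds), columns = training samples.
labels = np.array([0, 1, 1, 0, 1])
preds = np.array([
    [0, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 0, 1],
])
w = np.array([0.5, 0.3, 0.2])   # normalised classifier weights

# Edge of a sample: total weight of the classifiers that misclassify it.
errors = (preds != labels).astype(float)
edge = errors.T @ w

print(edge.var())  # boosting is observed to shrink this variance over rounds
```

Recomputing this after each boosting round would reproduce the kind of decay curve the paper reports.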


2018 ◽  
Vol 275 ◽  
pp. 2374-2383 ◽  
Author(s):  
Maryam Sabzevari ◽  
Gonzalo Martínez-Muñoz ◽  
Alberto Suárez

Minerals ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 1128
Author(s):  
Sebeom Park ◽  
Dahee Jung ◽  
Hoang Nguyen ◽  
Yosoon Choi

This study proposes a method for diagnosing problems in truck ore transport operations in underground mines using four machine learning models (i.e., Gaussian naïve Bayes (GNB), k-nearest neighbor (kNN), support vector machine (SVM), and classification and regression tree (CART)) and data collected by an Internet of Things system. A limestone underground mine with an applied mine production management system (using a tablet computer and Bluetooth beacon) is selected as the research area, and log data related to the truck travel time are collected. The machine learning models are trained and verified using the collected data, and grid search through 5-fold cross-validation is performed to improve the prediction accuracy of the models. The accuracy of CART is highest when the parameters leaf and split are set to 1 and 4, respectively (94.1%). In the validation of the machine learning models performed using the validation dataset (1500), the accuracy of the CART was 94.6%, and the precision and recall were 93.5% and 95.7%, respectively. In addition, it is confirmed that the F1 score reaches values as high as 94.6%. Through field application and analysis, it is confirmed that the proposed CART model can be utilized as a tool for monitoring and diagnosing the status of truck ore transport operations.
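The parameter tuning described above follows a standard grid-search pattern. The skeleton below is a hedged illustration in plain Python: `cv_accuracy` is a stub standing in for real 5-fold cross-validation of a CART model, seeded with the 94.1% the study reports for leaf = 1, split = 4; the grid values themselves are hypothetical.

```python
from itertools import product

# Hypothetical grid over CART's minimum-samples-per-leaf and per-split.
grid = {"leaf": [1, 2, 4], "split": [2, 4, 8]}

def cv_accuracy(leaf, split):
    # Stub for 5-fold cross-validated accuracy; a real implementation would
    # train and score a CART model on each fold here. The 0.941 value mirrors
    # the accuracy the study reports for leaf=1, split=4.
    reported = {(1, 4): 0.941}
    return reported.get((leaf, split), 0.90)

best = max(product(grid["leaf"], grid["split"]), key=lambda p: cv_accuracy(*p))
print(best)  # → (1, 4)
```

In practice this loop would be replaced by a library grid-search utility that handles the fold splitting and refitting automatically.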


2016 ◽  
Vol 7 (4) ◽  
Author(s):  
Mochammad Yusa ◽  
Ema Utami ◽  
Emha T. Luthfi

Abstract. Readmission is associated with quality measures on patients in hospitals. The many attributes related to diabetic patients, such as medication, ethnicity, race, lifestyle and age, make the calculation of quality of care complicated. Classification techniques from data mining can address this problem. In this paper, three classifiers, i.e. Decision Tree, k-Nearest Neighbor (k-NN) and Naive Bayes with various parameter settings, are evaluated using the 10-Fold Cross Validation technique. Performance is assessed in terms of Accuracy, Mean Absolute Error (MAE) and the Kappa Statistic. The selected dataset consists of 47 attributes and 49,735 records. The results show that the k-NN classifier with k=100 performs better in terms of accuracy and the Kappa Statistic, while Naive Bayes outperforms the other classifiers in terms of MAE. Keywords: k-NN, naive bayes, diabetes, readmission
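Of the three evaluation measures, the Kappa Statistic is the least commonly implemented by hand, so a minimal sketch may help; this is an illustrative Python/NumPy implementation of Cohen's kappa on toy labels, not code from the study.

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    # kappa = (p_o - p_e) / (1 - p_e): agreement corrected for chance.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)                        # observed agreement
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)  # chance agreement
              for c in classes)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0]))  # → 0.5
```

Unlike raw accuracy, kappa discounts the agreement a classifier would achieve by chance, which matters for imbalanced readmission data.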


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e14553-e14553
Author(s):  
Gordon Vansant ◽  
Adam Jendrisak ◽  
Ramsay Sutton ◽  
Sarah Orr ◽  
David Lu ◽  
...  

e14553 Background: Different cancer subtypes can often be effectively treated with similar Rx classes (i.e. platinum or taxane Rx), yet within a disease patient therapy benefit can be variable. The origins of precision medicine derive from pathologic sub-stratification to guide therapy (e.g. SCLC vs. NSCLC). Using the Epic Sciences platform, we performed FCP analysis of ~100,000 single CTCs from multiple indications and sought to utilize high-resolution digital pathology and machine learning to index metastatic cancers for the purpose of improving our understanding of therapy response and precision medicine. Methods: FCP analysis (single-cell digital pathology features of cellular and sub-cellular morphometrics) was performed on 92,300 CTCs collected from prostate (1,641 pts, 70,747 CTCs), breast (268 pts, 8,718 CTCs), NSCLC (110 pts, 1,884 CTCs), SCLC (141 pts, 8,872 CTCs) and bladder (65 pts, 2,079 CTCs) cancer pts. After pre-processing the raw data, a training set was balanced by sampling the same number of CTCs from each indication. K-means clustering was applied to the training set and the optimized number of clusters was determined using the elbow approach. After generating the clusters on the training set, the cluster centers were extracted from k-means and used to train a k-Nearest Neighbor (k-NN) classifier to predict the cluster assignment for the remaining CTCs (test set). Results: The optimized number of clusters was 9. The percentage and characteristics of CTCs in each indication are listed below. BCa CTCs were more enriched in cluster c1, which had higher CK expression, while SCLC and some mCRPC shared the small-cell features (c5). Conclusions: Heterogeneous CTC phenotypic subtypes were observed across multiple indications. Each indication harbored subtype heterogeneity and shared clusters with other disease subtypes. Analyses relating patient cluster subtypes to prognosis and therapy benefit are ongoing. Analysis linking CTC subtype genotypes (by single-cell sequencing) to patient survival in multiple indications is also ongoing. [Table: see text]
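The two-stage procedure in the Methods (k-means on a balanced training set, then nearest-center assignment of held-out cells) can be sketched as follows. This is an illustrative Python/NumPy toy with synthetic 2-D "features" and k = 2, not the study's 9-cluster pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D "CTC features": two well-separated morphological populations.
train = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Minimal k-means (k = 2), initialised with one point from each blob.
centers = train[[0, 20]].copy()
for _ in range(10):
    assign = ((train[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
    centers = np.array([train[assign == j].mean(axis=0) for j in range(2)])

# The fitted centers then act as a nearest-center classifier for new cells,
# standing in for the k-NN step applied to the held-out CTCs.
test = np.array([[0.1, -0.2], [5.2, 4.9]])
pred = ((test[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
print(pred)  # → [0 1]
```

Choosing k itself would be done with the elbow approach mentioned in the Methods, i.e. by plotting within-cluster variance against k.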


Author(s):  
Abbas Keramati ◽  
Niloofar Yousefi ◽  
Amin Omidvar

Credit scoring has become a very important issue due to the recent growth of the credit industry. As its first objective, this chapter provides an academic database of the literature and proposes a classification scheme to classify the articles. Its second objective is to suggest employing the Optimally Weighted Fuzzy K-Nearest Neighbor (OWFKNN) algorithm for credit scoring. To show the performance of this method, two real-world datasets from the UCI database are used. In the classification task, the empirical results demonstrate that the OWFKNN outperforms the conventional KNN and fuzzy KNN methods as well as the other methods compared. In the predictive accuracy of the probability of default, the OWFKNN also shows the best performance. The results in this chapter suggest that the OWFKNN approach is effective in estimating default probabilities and is a promising method for classification.
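The fuzzy k-NN rule that OWFKNN builds on can be sketched briefly. The Python/NumPy code below is an illustrative baseline, not the chapter's algorithm: it uses plain inverse-distance memberships (Keller-style fuzzy k-NN), whereas OWFKNN additionally optimises the neighbor weights; the toy data and parameter values are invented.

```python
import numpy as np

def fuzzy_knn_membership(X, y, x, k=3, m=2.0):
    # Fuzzy k-NN: class memberships from the k nearest neighbors, weighted
    # by inverse distance**(2/(m-1)). OWFKNN learns optimal weights instead.
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[idx], 1e-12) ** (2.0 / (m - 1.0))
    classes = np.unique(y)
    memb = np.array([w[y[idx] == c].sum() for c in classes]) / w.sum()
    return classes, memb

# Toy applicants: class 0 = good credit, class 1 = default.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
classes, memb = fuzzy_knn_membership(X, y, np.array([0.2, 0.5]))
print(classes[memb.argmax()])  # → 0
```

The membership vector itself, rather than the hard label, is what makes the approach useful for estimating default probabilities.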

