A Label Noise Robust Stacked Auto-Encoder Algorithm for Inaccurate Supervised Classification Problems

2019 ◽  
Vol 2019 ◽  
pp. 1-19
Author(s):  
Zi-yang Wang ◽  
Xiao-yi Luo ◽  
Jun Liang

In real applications, label noise and feature noise are the two main sources of noise. Like feature noise, label noise is highly detrimental to the training of classification models. Motivated by the success of deep learning in ordinary classification problems, this paper proposes a new framework, LNC-SDAE, for datasets corrupted with label noise, also known as inaccurate supervision problems. The LNC-SDAE framework consists of a preliminary label noise cleansing stage and a stacked denoising auto-encoder. In the cleansing stage, the idea of K-fold cross-validation is applied to detect and relabel mislabeled samples. The cleansed training dataset is then fed into the stacked denoising auto-encoder, which learns a robust representation for classification. A corrupted UCI standard dataset and a corrupted real industrial dataset are used for testing, each containing a proportion of label noise that varies from 0% to 30%. The experimental results demonstrate the effectiveness of LNC-SDAE and show that the representation it learns is robust.
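The K-fold relabeling idea mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which classifier the authors use for cleansing, so the nearest-centroid model, the function names, and the toy data below are all assumptions chosen to keep the example dependency-free; the stacked denoising auto-encoder itself is not shown.

```python
import random
from collections import defaultdict

def fit_centroids(X, y):
    """Return {label: mean feature vector} for the labelled samples."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)

def kfold_relabel(X, y, k=5, seed=0):
    """Relabel every sample using a model trained on the other k-1 folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cleaned = list(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            cleaned[i] = predict_centroid(model, X[i])
    return cleaned

# Toy data (hypothetical): two well-separated clusters, two labels flipped on purpose.
X = [[0.1 * i, 0.1 * i] for i in range(10)] + [[10 + 0.1 * i, 10 + 0.1 * i] for i in range(10)]
true_labels = [0] * 10 + [1] * 10
noisy_labels = list(true_labels)
noisy_labels[0], noisy_labels[15] = 1, 0  # injected label noise
cleaned_labels = kfold_relabel(X, noisy_labels, k=5)
```

Because each sample is predicted by a model that never saw its own (possibly wrong) label, a mislabeled point in a well-separated cluster tends to be pulled back to the label of its neighbours. On harder data a confidence threshold before relabeling would likely be needed.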

Author(s):  
Yuhong Huang ◽  
Wenben Chen ◽  
Xiaoling Zhang ◽  
Shaofu He ◽  
Nan Shao ◽  
...  

Aim: After neoadjuvant chemotherapy (NACT), tumor shrinkage pattern is a more reasonable outcome than pathological complete response (pCR) for deciding whether breast-conserving surgery (BCS) is possible. The aim of this article was to establish a machine learning model combining radiomics features from multiparametric MRI (mpMRI) and clinicopathologic characteristics for early prediction of tumor shrinkage pattern prior to NACT in breast cancer. Materials and Methods: This study included 199 patients with breast cancer who successfully completed NACT and subsequently underwent breast surgery. For each patient, 4,198 radiomics features were extracted from the segmented 3D regions of interest (ROI) in mpMRI sequences such as T1-weighted dynamic contrast-enhanced imaging (T1-DCE), fat-suppressed T2-weighted imaging (T2WI), and apparent diffusion coefficient (ADC) map. Feature selection and supervised machine learning algorithms were used to identify the predictors correlated with tumor shrinkage pattern as follows: (1) reducing the feature dimension using ANOVA and the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation, (2) splitting the dataset into a training dataset and a testing dataset and constructing prediction models using 12 classification algorithms, and (3) assessing model performance through the area under the curve (AUC), accuracy, sensitivity, and specificity. We also compared the most discriminative model across different molecular subtypes of breast cancer. Results: The Multilayer Perceptron (MLP) neural network achieved higher AUC and accuracy than the other classifiers. The radiomics model achieved a mean AUC of 0.975 (accuracy = 0.912) on the training dataset and 0.900 (accuracy = 0.828) on the testing dataset with 30-round 6-fold cross-validation. When incorporating clinicopathologic characteristics, the mean AUC was 0.985 (accuracy = 0.930) on the training dataset and 0.939 (accuracy = 0.870) on the testing dataset. The model further achieved good AUC on the testing dataset with 30-round 5-fold cross-validation in three molecular subtypes of breast cancer, as follows: (1) HR+/HER2–: 0.901 (accuracy = 0.816), (2) HER2+: 0.940 (accuracy = 0.865), and (3) TN: 0.837 (accuracy = 0.811). Conclusions: Our machine learning model combining radiomics features and clinical characteristics could provide a potential tool to predict tumor shrinkage patterns prior to NACT. The prediction model may be valuable in guiding NACT and surgical treatment in breast cancer.
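The four performance measures named in this abstract (AUC, accuracy, sensitivity, specificity) can be computed as in the sketch below. It uses the standard textbook definitions, with AUC taken as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one; the labels, scores, and the 0.5 decision threshold are made-up illustration values, not data from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_sensitivity_specificity(y_true, y_pred):
    """Accuracy, true-positive rate, and true-negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model outputs for six patients.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
acc, sens, spec = accuracy_sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Unlike the other three measures, AUC does not depend on the decision threshold, which is one reason it is usually reported alongside them.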


AI ◽  
2020 ◽  
Vol 1 (4) ◽  
pp. 539-557 ◽  
Author(s):  
Barath Narayanan ◽  
Russell Hardie ◽  
Vignesh Krishnaraja ◽  
Christina Karam ◽  
Venkata Davuluru

The coronavirus disease 2019 (COVID-19) global pandemic has severely impacted lives across the globe. Respiratory disorders in COVID-19 patients are caused by lung opacities similar to viral pneumonia. A Computer-Aided Detection (CAD) system for the detection of COVID-19 using chest radiographs would provide a second opinion for radiologists. For this research, we utilize publicly available datasets that have been marked by radiologists into two classes (COVID-19 and non-COVID-19). We address the class imbalance problem associated with the training dataset by proposing a novel transfer-to-transfer learning approach, where we break a highly imbalanced training dataset into a group of balanced mini-sets and apply transfer learning between these. We demonstrate the efficacy of the method using well-established deep convolutional neural networks. Our proposed training mechanism is more robust to limited training data and class imbalance. We study the performance of our algorithm(s) based on 10-fold cross-validation and two hold-out validation experiments to demonstrate its efficacy. We achieved an overall sensitivity of 0.94 for the hold-out validation experiments containing 2265 and 2139 chest radiographs marked as COVID-19, respectively. For the 10-fold cross-validation experiment, we achieve an overall Area under the Receiver Operating Characteristic curve (AUC) value of 0.996 for COVID-19 detection. This paper serves as a proof of concept that an automated detection approach can be developed with a limited set of COVID-19 images, and in areas with a scarcity of trained radiologists.
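One way to picture the step of breaking an imbalanced dataset into balanced mini-sets is sketched below. The abstract does not describe how the mini-sets are built, so this is an assumption: the majority class is cut into chunks the size of the minority class, each chunk is paired with the full minority class, and a leftover chunk that is too small is dropped. The function name and sample identifiers are hypothetical, and the transfer learning between mini-sets is not shown.

```python
import random

def balanced_minisets(majority, minority, seed=0):
    """Pair equal-sized chunks of the majority class with the whole minority class."""
    shuffled = list(majority)
    random.Random(seed).shuffle(shuffled)
    size = len(minority)
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    # Keep only full chunks so every mini-set is exactly balanced.
    return [(chunk, list(minority)) for chunk in chunks if len(chunk) == size]

# Hypothetical image identifiers: 10 non-COVID-19 samples, 3 COVID-19 samples.
majority = [f"neg_{i}" for i in range(10)]
minority = ["pos_0", "pos_1", "pos_2"]
minisets = balanced_minisets(majority, minority)
```

A network could then be trained on the first mini-set and fine-tuned on each following one in turn, which is one plausible reading of applying transfer learning between the mini-sets.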


2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and ensemble combinations, with the Breast Cancer Wisconsin and Hepatitis data sets as training datasets. It presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combination: boosting, bagging and stacking. The results show that for the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination achieved the same, highest accuracy (97.51%) compared to other combinations of ensemble classifiers. For the Hepatitis data set, the stacking+Multi-Layer Perceptron (MLP) combination achieved the highest accuracy, at 86.25%. Using ensemble classifiers may therefore improve the result. In future, a multi-classifier approach will be proposed that introduces a fusion at the classification level between these classifiers to obtain higher classification accuracies.
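Of the three ensemble methods named in this abstract, bagging is the simplest to sketch: train the same base learner on several bootstrap resamples and take a majority vote. The paper pairs bagging with Naïve Bayes inside a data mining tool; the 1-nearest-neighbour base learner, the one-dimensional toy data, and the class names below are substitutions made only to keep the example short.

```python
import random
from collections import Counter

def one_nn_predict(train, x):
    """Label of the training point closest to x; train is a list of (value, label)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bagging_predict(train, x, n_models=11, seed=0):
    """Majority vote over base learners trained on bootstrap resamples of train."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample len(train) points with replacement.
        bootstrap = [rng.choice(train) for _ in train]
        votes.append(one_nn_predict(bootstrap, x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-feature training set with two classes.
train = [(0.0, "benign"), (0.5, "benign"), (1.0, "benign"), (1.5, "benign"),
         (8.0, "malignant"), (8.5, "malignant"), (9.0, "malignant"), (9.5, "malignant")]
low_prediction = bagging_predict(train, 1.2)
high_prediction = bagging_predict(train, 8.7)
```

An odd number of models avoids tied votes in the two-class case.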


2018 ◽  
Vol 7 (2.27) ◽  
pp. 93
Author(s):  
Pooja Thakur ◽  
Mandeep Singh ◽  
Harpreet Singh ◽  
Prashant Singh Rana

H1B work visas are used to hire highly skilled foreign workers at low wages in America, which helps firms but affects the U.S. economy adversely. More than 100,000 people apply each year for visas for higher studies as well as for work, and the number increases every year. Foreign applicants are selected by a lottery system that does not follow any foolproof method, so the results create a loophole between US-based and foreign workers. We examine petitions filed from 2015 to 2017 in order to develop a better machine learning prediction model that forecasts the outcome of a petition in advance and indicates whether the petition is worthy or not. In this work, we use seven classification models, including Decision Tree, C5.0, Random Forest, Naïve Bayes, Neural Network and SVM, which predict the status of a petition as certified, denied, withdrawn or certified-withdrawn. The predictions of these models are evaluated on accuracy. C5.0 performs best as a single model, with an accuracy of 94.62, but the proposed model, built by an ensemble method, gives a better accuracy of 95.4, validated by 10-fold cross-validation.


Author(s):  
D. Mabuni ◽  
S. Aquter Babu

In machine learning, the data used matters more than the logic of the program. With very large and moderately sized datasets it is possible to obtain robust and high classification accuracies, but not with small and very small datasets. In particular, only large training datasets have the potential to produce robust decision tree classification results. Classification results obtained from only one pair of training and testing datasets are not reliable. The cross-validation technique uses many random folds of the same dataset for training and validation. To obtain reliable and statistically sound classification results, the same algorithm must be applied to different pairs of training and validation datasets. To overcome the problem of using only a single training dataset and a single testing dataset, the existing k-fold cross-validation technique uses a cross-validation plan to obtain improved decision tree classification accuracy. In this paper, a new cross-validation technique called prime fold is proposed, tested experimentally, and verified on many benchmark UCI machine learning datasets. The decision tree classification accuracies obtained with prime fold are observed to be far better than those obtained with existing techniques.
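The standard k-fold scheme that this abstract takes as its baseline can be sketched as a generator of training and validation index pairs. This shows ordinary k-fold splitting only; the abstract does not describe how prime fold works, so it is not reproduced here. The function name and seed are illustrative assumptions.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # Training set = every fold except the one held out for validation.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(kfold_indices(10, 5))
```

Each sample lands in exactly one validation fold, so averaging the k validation accuracies uses every sample once for testing, which is what makes the estimate more reliable than a single train/test pair.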


2019 ◽  
Vol 8 (4) ◽  
pp. 2588-2593

In the domain of Soft Computing, Support Vector Machines (SVMs) have acquired considerable significance. They are widely used for making predictions, owing to their ability to generalize. This paper is about the development of SVM-based classification models for the prediction of rice yield in India. Experiments have been conducted involving the one-against-one multi-classification method, k-fold cross-validation and a polynomial kernel function for SVM training. Rice production data of India has been sourced from the Directorate of Economics and Statistics, Ministry of Agriculture, Government of India, for this work. The best prediction accuracy for the 4-year relative average increase, 75.06%, was achieved using the 4-fold cross-validation method. MATLAB software has been used for experimentation in this work.
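Two of the ingredients listed in this abstract, the polynomial kernel and one-against-one voting, can be sketched without training an actual SVM. The kernel follows the usual form K(x, z) = (x · z + c)^d; the degree and offset are assumed defaults, not values from the paper. The pairwise decision function, class names, and reference values are hypothetical stand-ins for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree

def one_vs_one_predict(classes, pairwise_winner, x):
    """Run one binary decision per class pair and return the class with most votes."""
    votes = Counter(pairwise_winner(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for trained binary classifiers: the class whose
# reference value is closer to x wins the pairwise contest.
reference = {"low": 0.0, "medium": 5.0, "high": 10.0}

def closest_reference(a, b, x):
    return a if abs(x - reference[a]) <= abs(x - reference[b]) else b

kernel_value = polynomial_kernel([1, 2], [3, 4])
predicted = one_vs_one_predict(list(reference), closest_reference, 4.2)
```

With three classes the one-against-one scheme needs three binary classifiers; in general it needs k(k-1)/2 for k classes.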


Ciencia Unemi ◽  
2018 ◽  
Vol 11 (28) ◽  
pp. 8-17
Author(s):  
Pavel Novoa-Hernández ◽  
Dailín Cobos-Valdes ◽  
Eduardo Samaniego-Mena ◽  
Milvio Novoa-Pérez

This work proposes a new model for assessing biological risk in biopharmaceutical processes. The proposal extends an existing model; its main novelty is that the determinations of the consequence and probability levels of risk are treated as supervised classification problems. Specifically, two classification models based on decision trees were obtained, whose most important advantages are: 1) a smaller number of indicators for determining consequences and probabilities, and 2) an order of measurement of the indicators, based on their importance. To analyze the benefits of the new model, three case studies of real pharmaceutical processes were considered. Compared with the previous model, the new one offers similar results while notably simplifying the biological risk assessment process.


2019 ◽  
Vol 21 (4) ◽  
pp. 1425-1436 ◽  
Author(s):  
Xiangxiang Zeng ◽  
Yue Zhong ◽  
Wei Lin ◽  
Quan Zou

Identification of disease-associated circular RNAs (circRNAs) is of critical importance, especially with the dramatic increase in the amount of circRNAs. However, the availability of experimentally validated disease-associated circRNAs is limited, which restricts the development of effective computational methods. To our knowledge, systematic approaches for the prediction of disease-associated circRNAs are still lacking. In this study, we propose the use of deep forests combined with positive-unlabeled learning methods to predict potential disease-related circRNAs. In particular, a heterogeneous biological network involving 17 961 circRNAs, 469 miRNAs, and 248 diseases was constructed, and then 24 meta-path-based topological features were extracted. We applied 5-fold cross-validation on 15 disease data sets to benchmark the proposed approach and other competitive methods and used Recall@k and PRAUC@k to evaluate their performance. In general, our method performed better than the other methods. In addition, the performance of all methods improved with the accumulation of known positive labels. Our results provided a new framework to investigate the associations between circRNA and disease and might improve our understanding of its functions.
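Recall@k, one of the two evaluation measures named in this abstract, is commonly defined as the share of known positives that appear among the k highest-scored candidates. The sketch below uses that common definition, which may differ in detail from the one in the paper; the circRNA identifiers and scores are invented, and PRAUC@k is not shown.

```python
def recall_at_k(scores, positives, k):
    """Fraction of known positives found among the k highest-scored candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for item in ranked[:k] if item in positives)
    return hits / len(positives)

# Hypothetical association scores for one disease.
scores = {"circA": 0.9, "circB": 0.8, "circC": 0.4, "circD": 0.3, "circE": 0.1}
positives = {"circA", "circC", "circE"}
recall_top2 = recall_at_k(scores, positives, 2)
recall_top3 = recall_at_k(scores, positives, 3)
```

A ranking measure of this kind suits positive-unlabeled settings, where unlabeled candidates cannot safely be counted as negatives.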


2018 ◽  
Vol 2018 ◽  
pp. 1-20 ◽  
Author(s):  
Guido Bologna ◽  
Yoichi Hayashi

One way to make the knowledge stored in an artificial neural network more intelligible is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem. Many techniques have been introduced to generate rules from single neural networks, but very few were proposed for ensembles. Moreover, experiments were rarely assessed by 10-fold cross-validation trials. In this work, based on the Discretized Interpretable Multilayer Perceptron (DIMLP), experiments were performed on 10 repetitions of stratified 10-fold cross-validation trials over 25 binary classification problems. The DIMLP architecture allowed us to produce rules from DIMLP ensembles, boosted shallow trees (BSTs), and Support Vector Machines (SVM). The complexity of rulesets was measured with the average number of generated rules and average number of antecedents per rule. From the 25 used classification problems, the most complex rulesets were generated from BSTs trained by “gentle boosting” and “real boosting.” Moreover, we clearly observed that the less complex the rules were, the better their fidelity was. In fact, rules generated from decision stumps trained by modest boosting were, for almost all the 25 datasets, the simplest with the highest fidelity. Finally, in terms of average predictive accuracy and average ruleset complexity, the comparison of some of our results to those reported in the literature proved to be competitive.
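The two complexity measures in this abstract (number of rules and average number of antecedents per rule) and the notion of fidelity can be made concrete with a small sketch. The rule representation is hypothetical, and fidelity is taken in its usual rule-extraction sense, the fraction of samples on which the extracted rules agree with the model they were extracted from; the abstract itself does not spell out the definition.

```python
def ruleset_complexity(ruleset):
    """Return (number of rules, average number of antecedents per rule)."""
    n_rules = len(ruleset)
    avg_antecedents = sum(len(rule["antecedents"]) for rule in ruleset) / n_rules
    return n_rules, avg_antecedents

def fidelity(rule_predictions, model_predictions):
    """Fraction of samples where the rules give the same class as the model."""
    agree = sum(1 for r, m in zip(rule_predictions, model_predictions) if r == m)
    return agree / len(model_predictions)

# Hypothetical ruleset: each rule is a list of antecedents plus a predicted class.
ruleset = [
    {"antecedents": ["x1 > 0.5", "x3 <= 2.0"], "predicted_class": 1},
    {"antecedents": ["x2 <= 0.1"], "predicted_class": 0},
]
n_rules, avg_antecedents = ruleset_complexity(ruleset)
agreement = fidelity([1, 0, 1, 1], [1, 0, 0, 1])
```

Fidelity compares the rules with the model, not with the true labels, so a ruleset can have high fidelity and still inherit the model's errors.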


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249396
Author(s):  
Sabit Ahmed ◽  
Afrida Rahman ◽  
Md. Al Mehedi Hasan ◽  
Md Khaled Ben Islam ◽  
Julia Rahman ◽  
...  

Post-translational modification (PTM) involves covalent modification after the biosynthesis process and plays an essential role in the study of cell biology. Lysine phosphoglycerylation is a newly discovered reversible type of PTM that affects glycolytic enzyme activities and is responsible for a wide variety of diseases, such as heart failure, arthritis, and degeneration of the nervous system. Our goal is to computationally characterize potential phosphoglycerylation sites to understand their functionality and causality more accurately. In this study, a novel computational tool, referred to as predPhogly-Site, has been developed to predict phosphoglycerylation sites in proteins. It utilizes the probabilistic sequence-coupling information among the amino acid residues near phosphoglycerylation sites, along with a variable cost adjustment for the skewed training dataset, to enhance the prediction characteristics. It achieved around 99% accuracy with more than 0.96 MCC and 0.97 AUC in both 10-fold cross-validation and an independent test. Moreover, the standard deviation in 10-fold cross-validation is almost negligible. This performance indicates that predPhogly-Site remarkably outperformed the existing prediction tools and can be used as a promising predictor, preferably with its web interface at http://103.99.176.239/predPhogly-Site.
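The MCC reported in this abstract is the Matthews correlation coefficient, which is often preferred over accuracy on skewed datasets because it uses all four cells of the confusion matrix. A minimal sketch of the standard formula follows; the counts are invented for illustration and are not results from the paper.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)); 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Hypothetical confusion-matrix counts.
mcc_example = matthews_corrcoef(tp=90, tn=95, fp=5, fn=10)
mcc_perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).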

