Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data

PLoS ONE ◽  
2012 ◽  
Vol 7 (7) ◽  
pp. e39932 ◽  
Author(s):  
Enrico Glaab ◽  
Jaume Bacardit ◽  
Jonathan M. Garibaldi ◽  
Natalio Krasnogor
2021 ◽  
Vol 12 (2) ◽  
pp. 2422-2439

Cancer classification is one of the main objectives of analyzing large biological datasets. Machine learning algorithms (MLAs) have been used extensively for this task, and several popular MLAs are available to classify new samples into normal or cancer populations. However, most of them yield lower accuracy in the presence of outliers, which leads to incorrect classification of samples. Hence, in this study, we present a robust approach for the efficient and precise classification of samples using noisy gene expression datasets (GEDs). We examine the performance of the proposed procedure in comparison with five popular traditional MLAs (SVM, LDA, KNN, Naïve Bayes, and Random Forest) using both simulated and real gene expression data. We also considered several rates of outliers (10%, 20%, and 50%). The results obtained from simulated data confirm that, in the presence of outliers, the traditional MLAs produce better results when applied through our proposed procedure, which uses the proposed modified datasets. Further transcriptome analysis found significant involvement of these extra features in cancer diseases. The results indicated that our proposed procedure improves the performance of the traditional MLAs. Hence, we recommend applying the proposed procedure instead of the traditional procedure for cancer classification.
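The abstract does not say how outliers were generated, so the following is only an illustrative sketch of what contaminating a gene expression matrix at the stated rates (10%, 20%, and 50%) might look like. The function name `contaminate`, the additive `shift`, and the choice of one outlying sample per affected gene are all assumptions made here for illustration, not the authors' procedure.

```python
import random

def contaminate(matrix, rate, shift=10.0, seed=0):
    """Return a contaminated copy of a genes x samples matrix.

    Hypothetical scheme (not from the paper): a fraction `rate` of the
    genes each get one randomly chosen sample shifted by `shift`.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in matrix]          # copy so the input is untouched
    n_genes = len(noisy)
    n_hit = round(rate * n_genes)               # number of genes to contaminate
    hit_rows = rng.sample(range(n_genes), n_hit)
    for g in hit_rows:
        s = rng.randrange(len(noisy[g]))        # pick one sample in that gene
        noisy[g][s] += shift                    # turn it into an outlier
    return noisy, sorted(hit_rows)

# Small simulated dataset: 20 genes x 6 samples of standard normal values.
data_rng = random.Random(1)
clean = [[data_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(20)]

for rate in (0.10, 0.20, 0.50):
    noisy, hit = contaminate(clean, rate)
    print(f"rate={rate:.0%}: {len(hit)} of {len(clean)} genes contaminated")
```

A classifier comparison would then be run on both `clean` and `noisy` versions of the same data to see how much each algorithm degrades.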


2019 ◽  
Vol 20 (1) ◽  
pp. 15-20
Author(s):  
Ho Sun Shon ◽  
YearnGui Yi ◽  
Kyoung Ok Kim ◽  
Eun-Jong Cha ◽  
Kyung-Ah Kim

2019 ◽  
Vol 28 ◽  
pp. 69-80
Author(s):  
M Shahjaman ◽  
MM Rashid ◽  
MI Asifuzzaman ◽  
H Akter ◽  
SMS Islam ◽  
...  

Classification of samples into one or more populations is one of the main objectives of gene expression data (GED) analysis. Many machine learning algorithms have been employed in several studies to perform this task. However, these studies did not consider the problem of outliers. GEDs are often contaminated by outliers because of the several steps involved in the data-generating process, from hybridization of DNA samples to image analysis. Most of the algorithms produce higher false positives and lower accuracies in the presence of outliers, particularly when the number of replicates in the biological conditions is small. Therefore, in this paper, a comprehensive study has been carried out among five popular machine learning algorithms (SVM, RF, Naïve Bayes, k-NN and LDA) using both simulated and real gene expression datasets, in the absence and presence of outliers. Three different rates of outliers (5%, 10% and 50%) and six performance indices (TPR, FPR, TNR, FNR, FDR and AUC) were considered to investigate the performance of the five machine learning algorithms. Both simulated and real GED analysis results revealed that SVM performed comparatively better than the other four algorithms (RF, Naïve Bayes, k-NN and LDA) for both small and large sample sizes. J. bio-sci. 28: 69-80, 2020
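For readers unfamiliar with the six performance indices named above, the sketch below computes them under their standard definitions for a binary problem (labels 1 = positive, 0 = negative). The helper names and the toy labels and scores are hypothetical; AUC is computed with the rank-based (Mann-Whitney) formulation, which may differ from the implementation used in the study.

```python
def _ratio(num, den):
    """Safe division: returns NaN when the denominator is zero."""
    return num / den if den else float("nan")

def rates(y_true, y_pred):
    """TPR, FPR, TNR, FNR and FDR from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TPR": _ratio(tp, tp + fn),  # sensitivity
        "FPR": _ratio(fp, fp + tn),
        "TNR": _ratio(tn, fp + tn),  # specificity
        "FNR": _ratio(fn, tp + fn),
        "FDR": _ratio(fp, tp + fp),
    }

def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))

# Toy example (made-up values): threshold classifier scores at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

result = rates(y_true, y_pred)
result["AUC"] = auc(y_true, scores)
print(result)
```

The five rate indices depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason studies of this kind report both.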

