2021 ◽  
Vol 12 (2) ◽  
pp. 2422-2439

Cancer classification is one of the main objectives of analyzing big biological datasets. Machine learning algorithms (MLAs) have been used extensively for this task, and several popular MLAs are available in the literature to classify new samples into normal or cancer populations. Nevertheless, most of them yield lower accuracies in the presence of outliers, which leads to incorrect classification of samples. Hence, in this study, we present a robust approach for the efficient and precise classification of samples using noisy gene expression datasets (GEDs). We examine the performance of the proposed procedure in comparison with five popular traditional MLAs (SVM, LDA, KNN, Naïve Bayes, Random forest) using both simulated and real gene expression data analysis. We also considered several rates of outliers (10%, 20%, and 50%). The results obtained from simulated data confirm that, in the presence of outliers, the traditional MLAs produce better results when applied to the modified datasets generated by our proposed procedure. Further transcriptome analysis found significant involvement of these extra features in cancer diseases. The results indicated the performance improvement of the traditional MLAs with our proposed procedure. Hence, we propose to apply the proposed procedure instead of the traditional procedure for cancer classification.


Author(s):  
Ching Wei Wang

One of the most active areas of research in supervised machine learning has been the study of methods for constructing good ensembles of classifiers. The main discovery is that an ensemble classifier often performs much better than the single classifiers that make it up. Recent studies (Dettling, 2004, Tan & Gilbert, 2003) have confirmed the utility of ensemble machine learning algorithms for gene expression analysis. The motivation of this work is to investigate a suitable machine learning algorithm for classification and prediction on gene expression data. The research starts by analyzing the behavior and weaknesses of three popular ensemble machine learning methods (Bagging, Boosting, and Arcing), followed by the presentation of a new ensemble machine learning algorithm. The proposed method is evaluated against the existing ensemble machine learning algorithms over 12 gene expression datasets (Alon et al., 1999; Armstrong et al., 2002; Ash et al., 2000; Catherine et al., 2003; Dinesh et al., 2002; Gavin et al., 2002; Golub et al., 1999; Scott et al., 2002; van ’t Veer et al., 2002; Yeoh et al., 2002; Zembutsu et al., 2002). The experimental results show that the proposed algorithm greatly outperforms existing methods, achieving high accuracy in classification. The chapter is organized as follows: the ensemble machine learning approach and three popular ensembles (i.e., Bagging, Boosting, and Arcing) are introduced in the Background section; the analyses of existing ensembles, details of the proposed algorithm, and experimental results are presented in the Method section, followed by a discussion of future trends and the conclusion.
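
As a rough illustration of the Bagging idea named in the abstract above (bootstrap resampling followed by majority voting), the following is a minimal sketch. The nearest-centroid base learner, the function names, and the toy data are all assumptions chosen for brevity; they are not the chapter's proposed algorithm or its experimental setup.

```python
import random
from collections import Counter

def fit_centroids(rows):
    """Stand-in base learner (an assumption): one mean vector per class."""
    sums, counts = {}, Counter()
    for x, label in rows:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(model[label], x)))

def bagging_fit(rows, n_models=25, seed=0):
    """Train one base learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]
        models.append(fit_centroids(sample))
    return models

def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy two-gene expression profiles (hypothetical values).
rows = [
    ([0.0, 0.1], "normal"), ([0.2, 0.0], "normal"), ([0.1, 0.3], "normal"),
    ([2.0, 2.1], "tumor"), ([2.2, 1.9], "tumor"), ([1.9, 2.3], "tumor"),
]
models = bagging_fit(rows, n_models=11, seed=1)
print(bagging_predict(models, [0.1, 0.1]), bagging_predict(models, [2.1, 2.0]))
```

Boosting and Arcing differ in that they reweight or adaptively resample the training examples between rounds instead of drawing independent bootstrap samples.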


2019 ◽  
Vol 28 ◽  
pp. 69-80
Author(s):  
M Shahjaman ◽  
MM Rashid ◽  
MI Asifuzzaman ◽  
H Akter ◽  
SMS Islam ◽  
...  

Classification of samples into one or more populations is one of the main objectives of gene expression data (GED) analysis. Many machine learning algorithms have been employed in several studies to perform this task. However, these studies did not consider the problem of outliers. GEDs are often contaminated by outliers because of the several steps involved in the data generating process, from hybridization of DNA samples to image analysis. Most of the algorithms produce higher false positives and lower accuracies in the presence of outliers, particularly when the number of replicates in the biological conditions is small. Therefore, in this paper, a comprehensive study has been carried out among five popular machine learning algorithms (SVM, RF, Naïve Bayes, k-NN and LDA) using both simulated and real gene expression datasets, in the absence and presence of outliers. Three different rates of outliers (5%, 10% and 50%) and six performance indices (TPR, FPR, TNR, FNR, FDR and AUC) were considered to investigate the performance of the five machine learning algorithms. Both simulated and real GED analysis results revealed that SVM produced comparatively better performance than the other four algorithms (RF, Naïve Bayes, k-NN and LDA) for both small and large sample sizes. J. bio-sci. 28: 69-80, 2020
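
The six performance indices listed in the abstract above have standard definitions in terms of a binary confusion matrix, and the sketch below computes them under those textbook definitions. The function names and toy labels are illustrative assumptions, and the paper may compute the indices differently (for example, averaged over simulation runs).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """TPR, FPR, TNR, FNR and FDR; NaN when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "TPR": ratio(tp, tp + fn),  # sensitivity
        "FPR": ratio(fp, fp + tn),
        "TNR": ratio(tn, fp + tn),  # specificity
        "FNR": ratio(fn, tp + fn),
        "FDR": ratio(fp, tp + fp),
    }

def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.7]
print(rates(*confusion_counts(y_true, y_pred)), auc(y_true, scores))
```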


2020 ◽  
Author(s):  
Dmitry Rychkov ◽  
Jessica Neely ◽  
Tomiko Oskotsky ◽  
Steven Yu ◽  
Noah Perlmutter ◽  
...  

Abstract. Background/Purpose: There is an urgent need to identify effective biomarkers for early diagnosis of rheumatoid arthritis (RA) and to accurately monitor disease activity. Here we define an RA meta-profile using publicly available cross-tissue gene expression data and apply machine learning to identify putative biomarkers, which we further validate on independent datasets. Methods: We carried out a comprehensive search for publicly available microarray gene expression data in the NCBI Gene Expression Omnibus database for whole blood and synovial tissues from RA patients and healthy controls. The raw data from 13 synovium datasets with 284 samples and 14 blood datasets with 1,885 samples were downloaded and processed. The datasets for each tissue were merged, batch corrected and split into training and test sets. We then developed and applied a robust feature selection pipeline to identify genes dysregulated in both tissues and highly associated with RA. From the training data, we identified a set of overlapping differentially expressed genes following the condition of co-directionality. The classification performance of each gene in the resulting set was evaluated on the testing sets using the area under a receiver operating characteristic curve. Five independent datasets were used to validate and threshold the feature selected (FS) genes. Finally, we defined the RA Score, composed of the geometric mean of the selected RA Score Panel genes, and demonstrated its clinical utility. Results: This feature selection pipeline resulted in a set of 25 upregulated and 28 downregulated genes. To assess the robustness of these FS genes, we trained a Random Forest machine learning model with this set of 53 genes and then with the set of 33 overlapping genes differentially expressed in both tissues and tested on the validation cohorts. The model with FS genes outperformed the model with common DE genes with AUC 0.89 ± 0.04 vs 0.87 ± 0.04.
The FS genes were further validated on the 5 independent datasets resulting in 10 upregulated genes, TNFAIP6, S100A8, TNFSF10, DRAM1, LY96, QPCT, KYNU, ENTPD1, CLIC1, and ATP6V0E1, which are involved in innate immune system pathways, including neutrophil degranulation and apoptosis. There were also three downregulated genes, HSP90AB1, NCL, and CIRBP, that are involved in metabolic processes and T-cell receptor regulation of apoptosis. To investigate the clinical utility of the 13 validated genes, the RA Score was developed and found to be highly correlated with the disease activity score based on the 28 examined joints (DAS28) (r = 0.33 ± 0.03, p = 7e-9) and able to distinguish osteoarthritis (OA) from RA samples (OR 0.57, 95% CI [0.34, 0.80], p = 8e-10). Moreover, the RA Score was not significantly different for rheumatoid factor (RF) positive and RF-negative RA sub-phenotypes (p = 0.9) and also distinguished polyarticular juvenile idiopathic arthritis (polyJIA) from healthy individuals in 10 independent pediatric cohorts (OR 1.15, 95% CI [1.01, 1.3], p = 2e-4), suggesting the generalizability of this score in clinical applications. The RA Score was also able to monitor the treatment effect among RA patients (t-test of treated vs untreated, p = 2e-4). Finally, we performed immunoblotting analysis of 6 proteins in unstimulated PBMC lysates from an independent cohort of 8 newly diagnosed RA patients and 7 healthy controls, where two proteins, TNFAIP6/TSG6 and HSP90AB1/HSP90, were validated and the S100A8 protein showed near significant up-regulation. Conclusion: The RA Score, consisting of 13 putative biomarkers identified through a robust feature selection procedure on public data and validated using multiple independent data sets, could be useful in the diagnosis and treatment monitoring of RA.
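
The abstract above describes the RA Score as a geometric mean over panel genes. The sketch below shows only that building block for the ten upregulated genes it names. The function name and the sample values are hypothetical, and the abstract does not say how the upregulated and downregulated panels are combined into a single score, so that step is deliberately left out.

```python
from statistics import geometric_mean

# Upregulated panel genes as listed in the abstract.
UP_PANEL = ["TNFAIP6", "S100A8", "TNFSF10", "DRAM1", "LY96",
            "QPCT", "KYNU", "ENTPD1", "CLIC1", "ATP6V0E1"]

def panel_geometric_mean(expression, panel):
    """Geometric mean of the panel genes present in one sample.

    `expression` maps gene symbol -> expression value; values are
    expected to be strictly positive for the geometric mean.
    """
    values = [expression[gene] for gene in panel if gene in expression]
    if not values:
        raise ValueError("no panel genes found in sample")
    return geometric_mean(values)

# Hypothetical sample containing two of the panel genes.
sample = {"TNFAIP6": 2.0, "S100A8": 8.0, "ACTB": 100.0}
print(panel_geometric_mean(sample, UP_PANEL))
```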


2021 ◽  
Author(s):  
Jingyi Zhang ◽  
Farhan Ibrahim ◽  
Doaa Altarawy ◽  
Lenwood S Heath ◽  
Sarah Tulin

Abstract. Background: Gene regulatory network (GRN) inference can now take advantage of powerful machine learning algorithms to predict the entire landscape of gene-to-gene interactions with the potential to complement traditional experimental methods in building gene networks. However, the dynamical nature of embryonic development -- representing the time-dependent interactions between thousands of transcription factors, signaling molecules, and effector genes -- is one of the most challenging arenas for GRN prediction. Results: In this work, we show that successful GRN predictions for developmental systems from gene expression data alone can be obtained with the Priors Enriched Absent Knowledge (PEAK) network inference algorithm. PEAK is a noise-robust method that models gene expression dynamics via ordinary differential equations and selects the best network based on information-theoretic criteria coupled with the machine learning algorithm Elastic net. We test our GRN prediction methodology using two gene expression data sets for the purple sea urchin (S. purpuratus) and cross-check our results against existing GRN models that have been constructed and validated by over 30 years of experimental results. Our results found a remarkably high degree of sensitivity in identifying known gene interactions in the network (maximum 76.32%). We also generated 838 novel predictions for interactions that have not yet been described, which provide a resource for researchers to use to further complete the sea urchin GRN. Conclusions: GRN predictions that match known gene interactions can be produced using gene expression data alone from developmental time series experiments.
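
For readers unfamiliar with the Elastic net mentioned above, its usual regression objective can be written as follows. This is the common textbook parameterization, not a formula taken from the paper: the symbols (response vector $y$, design matrix $X$ with $n$ rows, coefficients $\beta$, penalty strength $\lambda \ge 0$, mixing weight $\alpha \in [0, 1]$) are assumptions, and the abstract does not specify how PEAK couples this penalty to its differential-equation model.

```latex
\[
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
\;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1
\;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
\]
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression; intermediate values keep the sparsity of the lasso while handling correlated predictors more gracefully.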


2019 ◽  
Author(s):  
Tom M George ◽  
Pietro Lio

Abstract. Machine learning algorithms are revolutionising how information can be extracted from complex and high-dimensional data sets via intelligent compression. For example, unsupervised Autoencoders train a deep neural network with a low-dimensional “bottlenecked” central layer to reconstruct input vectors. Variational Autoencoders (VAEs) have shown promise at learning meaningful latent spaces for text, image and, more recently, gene-expression data. In the latter case they have been shown capable of capturing biologically relevant features such as a patient's sex or tumour type. Here we train a VAE on ovarian cancer transcriptomes from The Cancer Genome Atlas and show that, in many cases, the latent space learns an encoding predictive of cisplatin chemotherapy resistance. We analyse the effectiveness of such an architecture across a wide range of hyperparameters and use a state-of-the-art clustering algorithm, t-SNE, to embed the data in a two-dimensional manifold and visualise the predictive power of the trained latent spaces. By correlating genes to resistance-predictive encodings we are able to extract biological processes likely responsible for platinum resistance. Finally, we demonstrate that variational autoencoders can reliably encode gene expression data contaminated with significant amounts of Gaussian and dropout noise, a necessary feature if this technique is to be applicable to other data sets, including those in non-medical fields.
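
The two kinds of contamination named in the abstract above, Gaussian noise and dropout noise, can be sketched as simple corruptions of an expression vector. The function names, noise levels, and input values below are illustrative assumptions; the abstract does not state the noise parameters or how the corrupted data were fed to the VAE.

```python
import random

def add_gaussian_noise(vector, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [value + rng.gauss(0.0, sigma) for value in vector]

def add_dropout_noise(vector, rate, rng):
    """Zero each entry independently with probability `rate`."""
    return [0.0 if rng.random() < rate else value for value in vector]

rng = random.Random(0)
expression = [1.0, 2.0, 3.0, 4.0]          # hypothetical expression values
noisy = add_gaussian_noise(expression, 0.5, rng)
dropped = add_dropout_noise(expression, 0.25, rng)
print(noisy, dropped)
```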


PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0261926
Author(s):  
Jingyi Zhang ◽  
Farhan Ibrahim ◽  
Emily Najmulski ◽  
George Katholos ◽  
Doaa Altarawy ◽  
...  

Gene regulatory network (GRN) inference can now take advantage of powerful machine learning algorithms to complement traditional experimental methods in building gene networks. However, the dynamical nature of embryonic development–representing the time-dependent interactions between thousands of transcription factors, signaling molecules, and effector genes–is one of the most challenging arenas for GRN prediction. In this work, we show that successful GRN predictions for a developmental network from gene expression data alone can be obtained with the Priors Enriched Absent Knowledge (PEAK) network inference algorithm. PEAK is a noise-robust method that models gene expression dynamics via ordinary differential equations and selects the best network based on information-theoretic criteria coupled with the machine learning algorithm Elastic Net. We test our GRN prediction methodology using two gene expression datasets for the purple sea urchin, Strongylocentrotus purpuratus, and cross-check our results against existing GRN models that have been constructed and validated by over 30 years of experimental results. Our results find a remarkably high degree of sensitivity in identifying known gene interactions in the network (maximum 81.58%). We also generate novel predictions for interactions that have not yet been described, which provide a resource for researchers to use to further complete the sea urchin GRN. Published ChIPseq data and spatial co-expression analysis further support a subset of the top novel predictions. We conclude that GRN predictions that match known gene interactions can be produced using gene expression data alone from developmental time series experiments.
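
Sensitivity, as reported in the abstract above, is conventionally the fraction of known interactions that a method recovers. The sketch below computes it for directed regulator-to-target edges. The function name and the gene labels are hypothetical, and the paper's exact matching rules (for example, whether edge direction or sign must agree) are not given in the abstract.

```python
def edge_sensitivity(predicted_edges, known_edges):
    """Fraction of known (regulator, target) edges present among the predictions."""
    known = set(known_edges)
    if not known:
        raise ValueError("known_edges must not be empty")
    recovered = known & set(predicted_edges)
    return len(recovered) / len(known)

# Hypothetical toy network; edges are directed, so ("D", "A") does not match ("A", "D").
known = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
predicted = [("A", "B"), ("B", "C"), ("D", "A"), ("C", "D"), ("B", "D")]
print(edge_sensitivity(predicted, known))
```

Predicted edges that are absent from the known network, such as ("B", "D") here, do not lower sensitivity; they correspond to the novel predictions the abstract describes.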


Feature Selection techniques are generally employed to remove inessential attributes before a machine learning technique is applied. Feature Selection thus plays an extremely important role by eliminating unnecessary features that do not contribute to, and sometimes degrade, the performance and prediction accuracy of the machine learning technique. As the dimensionality of data grows, Feature Selection becomes even more important because it reduces the dimensions of the data and hence the memory and computational cost required by machine learning techniques. Support vector machine-recursive feature elimination (SVM-RFE) has proven to be an efficient wrapper feature selection technique that continues to be widely used in many applications, especially in the classification of gene expression data. For this kind of data, not only the precision of classification but also the stability of the Feature Selection method plays an important role. Nonetheless, the topic of stability is ignored in the study of feature selection algorithms. To improve the stability of the RFE method, a fusion of Information Gain and RFE (IG-RFE-SVM) is proposed in this paper. Experimental studies show that IG-RFE-SVM outperforms the SVM-RFE method in terms of stability.
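
The recursive elimination loop behind RFE can be sketched as follows: refit on the surviving features, rank them by squared weight, and drop the weakest one each round. In SVM-RFE the weights come from a trained linear SVM; to stay dependency-free, this sketch swaps in a much simpler stand-in (the difference between class means), so it illustrates the loop only and is not SVM-RFE itself, nor the IG-RFE-SVM fusion proposed in the paper. All names and data are hypothetical.

```python
def mean_gap_weights(X, y, features):
    """Stand-in for linear SVM weights (an assumption): class-mean difference per feature."""
    weights = {}
    for f in features:
        pos = [row[f] for row, label in zip(X, y) if label == 1]
        neg = [row[f] for row, label in zip(X, y) if label == 0]
        weights[f] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return weights

def recursive_feature_elimination(X, y, n_keep, weight_fn=mean_gap_weights):
    """Repeatedly remove the feature with the smallest squared weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        weights = weight_fn(X, y, remaining)  # recompute on surviving features
        worst = min(remaining, key=lambda f: weights[f] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[0.0, 5.0, 1.0], [0.1, 5.1, 1.2], [3.0, 5.0, 1.1], [3.1, 5.1, 1.5]]
y = [0, 0, 1, 1]
kept, dropped_order = recursive_feature_elimination(X, y, n_keep=1)
print(kept, dropped_order)
```

Stability, in this setting, concerns how much the selected feature set changes when the loop is rerun on perturbed or resampled training data.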

