PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning

A Robust Procedure for Machine Learning Algorithms Using Gene Expression Data

Biointerface Research in Applied Chemistry ◽

10.33263/briac122.24222439 ◽

2021 ◽

Vol 12 (2) ◽

pp. 2422-2439

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Learning Algorithms ◽

Simulated Data ◽

Cancer Classification ◽

Machine Learning Algorithms ◽

Expression Data ◽

Traditional Procedure

Cancer classification is one of the main objectives of analyzing large biological datasets. Machine learning algorithms (MLAs) have been used extensively for this task, and several popular MLAs are available in the literature to classify new samples into normal or cancer populations. However, most of them yield lower accuracy in the presence of outliers, which leads to misclassified samples. In this study, we therefore present a robust approach for the efficient and precise classification of samples using noisy gene expression datasets (GEDs). We evaluate the proposed procedure with five popular traditional MLAs (SVM, LDA, KNN, Naïve Bayes, and random forest) on both simulated and real gene expression data, at several outlier rates (10%, 20%, and 50%). The results from simulated data confirm that, in the presence of outliers, the traditional MLAs produce better results when applied to the modified datasets generated by our procedure. Further transcriptome analysis found significant involvement of these extra features in cancer diseases. The results indicate that the proposed procedure improves the performance of the traditional MLAs. Hence, we propose applying it instead of the traditional procedure for cancer classification.

Cancer Classification of Gene Expression Data using Machine Learning Models

2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM) ◽

10.1109/hnicem.2018.8666435 ◽

2018 ◽

Author(s):

Joseph M. De Guia ◽

Madhavi Devaraj ◽

Larry A. Vea

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Learning Models ◽

Machine Learning Models

Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data

F1000Research ◽

10.12688/f1000research.10529.1 ◽

2016 ◽

Vol 5 ◽

pp. 2927 ◽

Cited By ~ 9

Author(s):

Linh Nguyen ◽

Cuong C Dang ◽

Pedro J. Ballester

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Cell Line ◽

Cell Lines ◽

Gene Expression Data ◽

Single Gene ◽

Cancer Cell Line ◽

Expression Data ◽

Gene Markers ◽

Pan Cancer

Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than K-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: We now know that this type of model can predict in vitro tumour response to these drugs. These models can thus be further investigated on in vivo tumour models.
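
The precision and recall trade-off reported in this abstract can be illustrated with a small sketch. The labels below are invented for illustration and are not GDSC data, and the function name `precision_recall` is hypothetical. The sketch only shows how a marker that flags very few cell lines can reach high precision while missing most of the sensitive ones.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = sensitive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy labels: 10 cell lines, 5 of them truly sensitive.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# A conservative predictor flags one cell line and is right about it.
single_gene = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A broader predictor flags six cell lines, four of them correctly.
multi_gene = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

sg = precision_recall(y_true, single_gene)
mg = precision_recall(y_true, multi_gene)
print("conservative:", sg)
print("broader:", mg)
```

In this toy case the conservative predictor has perfect precision but recovers only one of five sensitive cell lines, which is the pattern the abstract attributes to single-gene markers.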

Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms

Cognitive Informatics, Computer Modelling, and Cognitive Science ◽

10.1016/b978-0-12-819443-0.00014-3 ◽

2020 ◽

pp. 273-317

Author(s):

Maniruzzaman ◽

Jahanur Rahman ◽

Benojir Ahammed ◽

Menhazul Abedin ◽

Harman S. Suri ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Statistical Characterization ◽

Microarray Gene

Lightweight Convolutional Neural Network for Breast Cancer Classification Using RNA-Seq Gene Expression Data

IEEE Access ◽

10.1109/access.2019.2960722 ◽

2019 ◽

Vol 7 ◽

pp. 185338-185348 ◽

Cited By ~ 6

Author(s):

Murtada K. Elbashir ◽

Mohamed Ezz ◽

Mohanad Mohammed ◽

Said S. Saloum

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Gene Expression ◽

Convolutional Neural Network ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Rna Seq ◽

Breast Cancer Classification

Abstract 5104: Pan-cancer classification on gene expression data by neural network

10.1158/1538-7445.am2019-5104 ◽

2019 ◽

Author(s):

Kijin Yu ◽

Bong-Hyun Kim ◽

Peter Chang Whan Lee

Keyword(s):

Neural Network ◽

Gene Expression ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Pan Cancer

Hybrid Correlation based Gene Selection for Accurate Cancer Classification of Gene Expression Data

International Journal of Computer Applications ◽

10.5120/6170-8591 ◽

2012 ◽

Vol 43 (14) ◽

pp. 13-18 ◽

Cited By ~ 3

Author(s):

Vibhav Prakash Singh ◽

Singh Gaurav Arvind ◽

Arindam G Mahapatra

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Selection ◽

Cancer Classification ◽

Expression Data ◽

Selection For

Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data

10.1101/095224 ◽

2016 ◽

Author(s):

Linh C. Nguyen ◽

Cuong C. Dang ◽

Pedro J. Ballester

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Cell Line ◽

Cell Lines ◽

Gene Expression Data ◽

Single Gene ◽

Cancer Cell Line ◽

Expression Data ◽

Gene Markers ◽

Pan Cancer

Abstract: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than K-fold cross-validation. Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG.
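
The difference between a time-ordered test set and K-fold cross-validation can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `release` field and both function names are hypothetical, and the toy records are invented rather than taken from GDSC.

```python
def time_ordered_split(records, cutoff_release):
    """Train on records up to cutoff_release; test on strictly later ones."""
    train = [r for r in records if r["release"] <= cutoff_release]
    test = [r for r in records if r["release"] > cutoff_release]
    return train, test


def k_fold_splits(records, k):
    """Yield (train, test) pairs; fold i takes every k-th record from index i."""
    for i in range(k):
        test = records[i::k]
        train = [r for j, r in enumerate(records) if j % k != i]
        yield train, test


# Invented toy records: six measurements from release 1, three from release 2.
records = [
    {"cell_line": f"CL{i}", "release": 1 if i < 6 else 2} for i in range(9)
]

# Time-ordered validation: the test set holds only the newer release.
train, test = time_ordered_split(records, cutoff_release=1)

# K-fold validation: collect which releases appear in each test fold.
mixed = [
    {r["release"] for r in fold_test}
    for _, fold_test in k_fold_splits(records, k=3)
]
print(len(train), len(test), mixed)
```

With the time-ordered split, no measurement from the newer release is seen during training, whereas every K-fold test fold here mixes both releases, so any release-specific batch effect is also present in the training data.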

A Review on Recent Progress in Machine Learning and Deep Learning Methods for Cancer Classification on Gene Expression Data

Processes ◽

10.3390/pr9081466 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1466

Author(s):

Aina Umairah Mazlan ◽

Noor Azida Sahabudin ◽

Muhammad Akmal Remli ◽

Nor Syahidatul Nadiah Ismail ◽

Mohd Saberi Mohamad ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Deep Learning ◽

Gene Expression Data ◽

Recent Progress ◽

Cancer Classification ◽

Expression Data ◽

Classification Methods ◽

Healthcare Applications ◽

Learning Methods

Data-driven models with predictive ability are important in medicine and healthcare. However, the most challenging task in predictive modeling is constructing a prediction model, which can be addressed using machine learning (ML) methods. These methods learn and train the model from a gene expression dataset without being explicitly programmed. Due to the vast amount of gene expression data, this task becomes complex and time-consuming. This paper reviews recent progress in ML and deep learning (DL) for cancer classification, which has received increasing attention in bioinformatics and computational biology. The review focuses mainly on the development of cancer classification methods based on ML and DL. Although many methods have been applied to the cancer classification problem, recent progress shows that most of the successful techniques are those based on supervised and DL methods. In addition, the sources of the healthcare datasets are described. The development of many machine learning methods for insight analysis in cancer classification has brought considerable improvement in healthcare. Currently, there appears to be high demand for further development of efficient classification methods to address the expansion of healthcare applications.

Abstract 5104: Pan-cancer classification on gene expression data by neural network

10.1158/1538-7445.sabcs18-5104 ◽

2019 ◽

Author(s):

Kijin Yu ◽

Bong-Hyun Kim ◽

Peter Chang Whan Lee

Keyword(s):

Neural Network ◽

Gene Expression ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Pan Cancer
