Leveraging TCGA gene expression data to build predictive models for cancer drug response

Abstract. Background: Machine learning has been used to predict cancer drug response from multi-omics data generated from the sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients' primary tumor tissues to predict whether a patient will respond positively or negatively to two chemotherapeutics: 5-Fluorouracil and Gemcitabine. Results: We focused on 5-Fluorouracil and Gemcitabine because, under our exclusion criteria, they provide the largest numbers of patients within TCGA. Normalized gene expression data were clustered and used as the input features for the study. We used matching clinical trial data to ascertain the response of these patients. Multiple clustering and classification methods were compared for accuracy in predicting drug response. Clara and random forest were found to be the best clustering and classification methods, respectively. The results show that our models predict with up to 86% accuracy, despite the study's limited sample size. We also found that the genes most informative for predicting drug response were enriched in well-known cancer signaling pathways, and we highlight their potential significance in chemotherapy prognosis. Conclusions: Primary tumor gene expression is a good predictor of cancer drug response. Investment in larger datasets containing both patient gene expression and drug response is needed to support future work on machine learning models. Ultimately, such predictive models may aid oncologists in making critical treatment decisions.
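The abstract's "cluster the genes, then use the clusters as input features" step can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: a tiny alternating k-medoids stands in for Clara (which runs k-medoids on subsamples), each cluster is summarized by its mean expression per patient, and the toy matrix is made up. The random forest classifier is not reproduced here. All function names are hypothetical.

```python
from statistics import mean


def manhattan(a, b):
    """Manhattan distance between two equal-length expression profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(profiles, k, n_iter=20):
    """Tiny alternating k-medoids; returns one cluster label per profile.

    Stand-in for Clara, which applies k-medoids to subsamples of the data.
    Initialising with the first k profiles is an arbitrary assumption.
    """
    medoids = list(range(k))
    labels = []
    for _ in range(n_iter):
        # Assign every profile to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: manhattan(p, profiles[medoids[c]]))
            for p in profiles
        ]
        # Move each medoid to the member with the smallest total distance.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(
                    members,
                    key=lambda i: sum(
                        manhattan(profiles[i], profiles[j]) for j in members
                    ),
                )
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def cluster_mean_features(expr, labels, k):
    """Collapse a patients x genes matrix into patients x clusters means.

    Assumes every cluster has at least one gene.
    """
    return [
        [mean(v for v, lab in zip(row, labels) if lab == c) for c in range(k)]
        for row in expr
    ]


# Toy data (made up): 4 patients x 4 genes.
expr = [
    [1.0, 5.0, 1.2, 4.8],
    [1.0, 5.0, 0.8, 5.2],
    [5.0, 1.0, 5.1, 0.9],
    [5.0, 1.0, 4.9, 1.1],
]
gene_profiles = [list(col) for col in zip(*expr)]  # one profile per gene
labels = k_medoids(gene_profiles, k=2)
features = cluster_mean_features(expr, labels, k=2)
```

In this sketch the four genes fall into two co-expressed groups, so each patient is described by two cluster-level features instead of four gene-level ones; a classifier would then be trained on `features` together with the response labels.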


Abstract 676: Leveraging TCGA gene expression data to build predictive models for cancer drug response

10.1158/1538-7445.sabcs18-676 ◽

2019 ◽

Author(s):

Evan Clayton

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Predictive Models ◽

Drug Response ◽

Cancer Drug ◽

Expression Data


Abstract 676: Leveraging TCGA gene expression data to build predictive models for cancer drug response

10.1158/1538-7445.am2019-676 ◽

2019 ◽

Author(s):

Evan Clayton

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Predictive Models ◽

Drug Response ◽

Cancer Drug ◽

Expression Data


Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets

Statistical Analysis and Data Mining The ASA Data Science Journal ◽

10.1002/sam.11549 ◽

2021 ◽

Author(s):

Jessica Krepel ◽

Magdalena Kircher ◽

Moritz Kohls ◽

Klaus Jung

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Data Sets ◽

Expression Data ◽

Learning Models ◽

Independent Gene ◽

Machine Learning Models


Survival prediction and treatment optimization of multiple myeloma patients using machine-learning models based on clinical and gene expression data

Leukemia ◽

10.1038/s41375-021-01286-2 ◽

2021 ◽

Author(s):

Adrián Mosquera Orgueira ◽

Marta Sonia González Pérez ◽

José Ángel Díaz Arias ◽

Beatriz Antelo Rodríguez ◽

Natalia Alonso Vence ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Multiple Myeloma ◽

Gene Expression Data ◽

Survival Prediction ◽

Expression Data ◽

Learning Models ◽

Treatment Optimization ◽

Machine Learning Models


Comparative Study of Disease Classification Using Multiple Machine Learning Models Based on Landmark and Non-Landmark Gene Expression Data

Procedia Computer Science ◽

10.1016/j.procs.2021.05.028 ◽

2021 ◽

Vol 185 ◽

pp. 264-273

Author(s):

Xiaoqin Huang ◽

Jian Sun ◽

Satish Mahadevan Srinivasan ◽

Raghvinder S Sangwan

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Comparative Study ◽

Gene Expression Data ◽

Disease Classification ◽

Expression Data ◽

Learning Models ◽

Machine Learning Models


A Review on Recent Progress in Machine Learning and Deep Learning Methods for Cancer Classification on Gene Expression Data

Processes ◽

10.3390/pr9081466 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1466

Author(s):

Aina Umairah Mazlan ◽

Noor Azida Sahabudin ◽

Muhammad Akmal Remli ◽

Nor Syahidatul Nadiah Ismail ◽

Mohd Saberi Mohamad ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Deep Learning ◽

Gene Expression Data ◽

Recent Progress ◽

Cancer Classification ◽

Expression Data ◽

Classification Methods ◽

Healthcare Applications ◽

Learning Methods

Data-driven models with predictive ability are important in medicine and healthcare. However, the most challenging task in predictive modeling is constructing the prediction model, which can be addressed using machine learning (ML) methods. These methods train a model on a gene expression dataset without it being explicitly programmed. Because of the vast amount of gene expression data, this task is complex and time-consuming. This paper reviews recent progress in ML and deep learning (DL) for cancer classification, which has received increasing attention in bioinformatics and computational biology. The review focuses mostly on the development of cancer classification methods based on ML and DL. Although many methods have been applied to the cancer classification problem, recent progress shows that most of the successful techniques are based on supervised and DL methods. The sources of the healthcare datasets are also described. The development of many ML methods for insight analysis in cancer classification has brought considerable improvement to healthcare. Currently, there appears to be high demand for further development of efficient classification methods to address the expansion of healthcare applications.


Cancer Classification of Gene Expression Data using Machine Learning Models

2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM) ◽

10.1109/hnicem.2018.8666435 ◽

2018 ◽

Author(s):

Joseph M. De Guia ◽

Madhavi Devaraj ◽

Larry A. Vea

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Learning Models ◽

Machine Learning Models


Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data

F1000Research ◽

10.12688/f1000research.10529.2 ◽

2017 ◽

Vol 5 ◽

pp. 2927 ◽

Cited By ~ 4

Author(s):

Linh Nguyen ◽

Cuong C Dang ◽

Pedro J. Ballester

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Drug Sensitivity ◽

Single Gene ◽

Cancer Cell Line ◽

Expression Data ◽

Gene Markers ◽

Pan Cancer ◽

Machine Learning Models

Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by the Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC50 measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: Thanks to this unbiased validation, we now know that this type of model can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz.
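The precision/recall trade-off and the time-ordered validation described in this abstract can be made concrete with a small sketch. Everything in it is illustrative: the labels, predictions, record fields (`cell_line`, `release`) and function names are hypothetical and are not taken from the GDSC data or the authors' R code.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = sensitive, 0 = resistant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def split_by_release(records, cutoff):
    """Train on older measurements, test on newer ones.

    Unlike k-fold cross-validation, no test record shares a release
    (and hence a potential batch effect) with the training data.
    """
    train = [r for r in records if r["release"] <= cutoff]
    test = [r for r in records if r["release"] > cutoff]
    return train, test


# Made-up labels: four sensitive and four resistant cell lines.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A single-gene marker that flags only one cell line, correctly.
marker_pred = [1, 0, 0, 0, 0, 0, 0, 0]
# A multi-gene model that finds more sensitive lines but makes one error.
model_pred = [1, 1, 1, 0, 1, 0, 0, 0]

marker_pr = precision_recall(y_true, marker_pred)  # (1.0, 0.25)
model_pr = precision_recall(y_true, model_pred)    # (0.75, 0.75)

# Made-up records tagged with a hypothetical data-release number.
records = [
    {"cell_line": "A", "release": 1},
    {"cell_line": "B", "release": 1},
    {"cell_line": "C", "release": 2},
]
train, test = split_by_release(records, cutoff=1)
```

In this toy case the marker is never wrong when it fires but misses three of the four sensitive lines, which mirrors the high-precision, low-recall pattern the abstract reports for single-gene markers.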


Prediction Errors in Learning Drug Response from Gene Expression Data – Influence of Labeling, Sample Size, and Machine Learning Algorithm

PLoS ONE ◽

10.1371/journal.pone.0070294 ◽

2013 ◽

Vol 8 (7) ◽

pp. e70294 ◽

Cited By ~ 10

Author(s):

Immanuel Bayer ◽

Philip Groth ◽

Sebastian Schneckener

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Sample Size ◽

Gene Expression Data ◽

Drug Response ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Prediction Errors ◽

Expression Data


Predicting the targets of IRF8 and NFATc1 during osteoclast differentiation using the machine learning method framework cTAP

BMC Genomics ◽

10.1186/s12864-021-08159-z ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Honglin Wang ◽

Pujan Joshi ◽

Seung-Hyun Hong ◽

Peter F. Maye ◽

David W. Rowe ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Target Genes ◽

Osteoclast Differentiation ◽

Target Prediction ◽

Data Sets ◽

Expression Data ◽

Learning Models ◽

Machine Learning Models

Abstract. Background: Interferon regulatory factor-8 (IRF8) and nuclear factor-activated T cells c1 (NFATc1) are two transcription factors that have an important role in osteoclast differentiation. Thanks to ChIP-seq technology, scientists can now estimate potential genome-wide target genes of IRF8 and NFATc1. However, finding target genes that are consistently up-regulated or down-regulated across different studies is hard because it requires analysis of a large number of high-throughput expression studies from a comparable context. Method: We have developed a machine-learning-based method, called the Cohort-based TF target prediction system (cTAP), to overcome this problem. This method assumes that the pathway involving the transcription factors of interest features multiple "functional groups" of marker genes pertaining to the biological process concerned. It uses two notions, Gene-Present Sufficiently (GP) and Gene-Absent Insufficiently (GA), in addition to log2 fold changes of differentially expressed genes for the prediction. Target prediction is made by applying multiple machine-learning models, which learn the patterns of GP and GA from log2 fold changes and four types of Z scores from the normalized cohort's gene expression data. The learned patterns are then associated with the putative transcription factor targets to identify genes that consistently exhibit Up/Down gene regulation patterns within the cohort. We applied this method to 11 publicly available GEO data sets related to osteoclastogenesis. Result: Our experiment identified a small number of Up/Down IRF8 and NFATc1 target genes as relevant to osteoclast differentiation. The machine learning models using GP and GA produced NFATc1 and IRF8 target genes different from those obtained using log2 fold change alone. Our literature survey revealed that all predicted target genes have known roles in bone remodeling, specifically related to the immune system and osteoclast formation and functions, suggesting confidence and validity in our method. Conclusion: cTAP was motivated by recognizing that biologists tend to use Z score values present in data sets for the analysis. However, using cTAP effectively presupposes assembling a sizable cohort of gene expression data sets within a comparable context. As public gene expression data repositories grow, the need to use cohort-based analysis methods like cTAP will become increasingly important.
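The two quantities this abstract builds on, log2 fold changes and Z scores, can be sketched as follows. This is a generic illustration only: the abstract does not define GP, GA, or its four types of Z scores, so none of them is reproduced here. The `consistent_direction` helper, its thresholds, and the expression values are hypothetical.

```python
import math
from statistics import mean, pstdev


def log2_fold_change(treated, control):
    """log2 ratio of treated to control expression (both must be positive)."""
    return math.log2(treated / control)


def z_scores(values):
    """Standardise values across a cohort using the population std. dev."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def consistent_direction(fold_changes, threshold=1.0, min_fraction=0.8):
    """Label a gene "Up" or "Down" if enough data sets agree, else None.

    Both the threshold and the required fraction are illustrative
    assumptions, not values taken from cTAP.
    """
    n = len(fold_changes)
    up = sum(1 for fc in fold_changes if fc >= threshold)
    down = sum(1 for fc in fold_changes if fc <= -threshold)
    if up / n >= min_fraction:
        return "Up"
    if down / n >= min_fraction:
        return "Down"
    return None


# Made-up (treated, control) expression of one gene in five data sets.
pairs = [(8, 2), (6, 2), (9, 3), (4, 2), (10, 2)]
fold_changes = [log2_fold_change(t, c) for t, c in pairs]
direction = consistent_direction(fold_changes)  # "Up"
standardised = z_scores(fold_changes)
```

A single data set can show a large fold change by chance; requiring agreement across the cohort is one simple way to express the "consistently up-regulated or down-regulated" criterion the abstract describes.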
