Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data

Machine learning approaches to predict lupus disease activity from gene expression data

Scientific Reports ◽

10.1038/s41598-019-45989-0 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 11

Author(s):

Brian Kegerreis ◽

Michelle D. Catalina ◽

Prathyusha Bachali ◽

Nicholas S. Geraci ◽

Adam C. Labonte ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Disease Activity ◽

Gene Expression Data ◽

Learning Approaches ◽

Expression Data ◽

Lupus Disease Activity

The Advances in Cancer Survival Prediction by Gene Expression Data; Using Machine Learning Approaches

Acta healthmedica ◽

10.19082/ah136 ◽

2017 ◽

Vol 2 (1) ◽

pp. 136-136

Author(s):

Marjan Ghazisaeedi ◽

Azadeh Bashiri

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Cancer Survival ◽

Survival Prediction ◽

Learning Approaches ◽

Expression Data

A method of gene expression data transfer from cell lines to cancer patients for machine-learning prediction of drug efficiency

Cell Cycle ◽

10.1080/15384101.2017.1417706 ◽

2018 ◽

Vol 17 (4) ◽

pp. 486-491 ◽

Cited By ~ 22

Author(s):

Nicolas Borisov ◽

Victor Tkachev ◽

Maria Suntsova ◽

Olga Kovalchuk ◽

Alex Zhavoronkov ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Cancer Patients ◽

Cell Lines ◽

Gene Expression Data ◽

Data Transfer ◽

Expression Data ◽

Drug Efficiency

Integrating Gene Ontology Based Grouping and Ranking into the Machine Learning Algorithm for Gene Expression Data Analysis

10.1007/978-3-030-87101-7_20 ◽

2021 ◽

pp. 205-214

Author(s):

Malik Yousef ◽

Ahmet Sayıcı ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Ontology ◽

Data Analysis ◽

Gene Expression Data ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Expression Data ◽

Gene Expression Data Analysis

Leveraging TCGA gene expression data to build predictive models for cancer drug response

BMC Bioinformatics ◽

10.1186/s12859-020-03690-4 ◽

2020 ◽

Vol 21 (S14) ◽

Cited By ~ 3

Author(s):

Evan A. Clayton ◽

Toyya A. Pujol ◽

John F. McDonald ◽

Peng Qiu

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Predictive Models ◽

Drug Response ◽

Cancer Drug ◽

Expression Data ◽

Classification Methods ◽

Clustering And Classification ◽

Machine Learning Models

Abstract. Background: Machine learning has been utilized to predict cancer drug response from multi-omics data generated from sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients’ primary tumor tissues to predict whether a patient will respond positively or negatively to two chemotherapeutics: 5-Fluorouracil and Gemcitabine. Results: We focused on 5-Fluorouracil and Gemcitabine because, based on our exclusion criteria, they provide the largest numbers of patients within TCGA. Normalized gene expression data were clustered and used as the input features for the study. We used matching clinical trial data to ascertain the response of these patients via multiple classification methods. Multiple clustering and classification methods were compared for prediction accuracy of drug response. Clara and random forest were found to be the best clustering and classification methods, respectively. The results show our models predict with up to 86% accuracy, despite the study’s limitation of sample size. We also found the genes most informative for predicting drug response were enriched in well-known cancer signaling pathways and highlighted their potential significance in chemotherapy prognosis. Conclusions: Primary tumor gene expression is a good predictor of cancer drug response. Investment in larger datasets containing both patient gene expression and drug response is needed to support future work of machine learning models. Ultimately, such predictive models may aid oncologists with making critical treatment decisions.

Classification models for Invasive Ductal Carcinoma Progression, based on gene expression data-trained supervised machine learning

Scientific Reports ◽

10.1038/s41598-020-60740-w ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Shikha Roy ◽

Rakesh Kumar ◽

Vaibhav Mittal ◽

Dinesh Gupta

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Invasive Ductal Carcinoma ◽

Gene Expression Data ◽

Ductal Carcinoma ◽

Supervised Machine Learning ◽

Expression Data ◽

Classification Models

Machine learning analysis of gene expression data reveals novel diagnostic and prognostic biomarkers and identifies therapeutic targets for soft tissue sarcomas

PLoS Computational Biology ◽

10.1371/journal.pcbi.1006826 ◽

2019 ◽

Vol 15 (2) ◽

pp. e1006826 ◽

Cited By ~ 14

Author(s):

David G. P. van IJzendoorn ◽

Karoly Szuhai ◽

Inge H. Briaire-de Bruijn ◽

Marie Kostine ◽

Marieke L. Kuijjer ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Soft Tissue ◽

Gene Expression Data ◽

Therapeutic Targets ◽

Soft Tissue Sarcomas ◽

Prognostic Biomarkers ◽

Expression Data ◽

Learning Analysis

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
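
Illustrative note (not part of the abstract): two of the classical corrections named above, Bonferroni and Benjamini-Hochberg adjusted p-values, can be sketched in a few lines of Python. This is a minimal sketch using the textbook definitions, not the authors' code; the function names and the example p-values are hypothetical and chosen only for illustration.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjustment (step-up procedure controlling the FDR).

    The p-value with ascending rank r (1-based) is scaled by m / r, then a
    running minimum taken from the largest rank downward keeps the adjusted
    values monotone in the original ordering of the p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending by p-value
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (illustration only, not study data).
example_pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(example_pvals)
bh = benjamini_hochberg(example_pvals)
for p, b, q in zip(example_pvals, bonf, bh):
    print(f"raw={p:.3f}  bonferroni={b:.3f}  bh={q:.3f}")
```

Genes would then be called dysregulated when the adjusted value falls below a pre-specified level. Bonferroni is the more conservative of the two, since its adjusted values are never smaller than the Benjamini-Hochberg ones.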

Studying Microarray Gene Expression Data of Schizophrenic Patients for Derivation of a Diagnostic Signature through the Aid of Machine Learning

Biometrics & Biostatistics International Journal ◽

10.15406/bbij.2016.04.00106 ◽

2016 ◽

Vol 4 (5) ◽

Author(s):

Aristotelis Chatziioannou

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Diagnostic Signature ◽

Schizophrenic Patients ◽

Microarray Gene

Maximizing the Reusability of Public Gene Expression Data by Predicting Missing Metadata

10.1101/792382 ◽

2019 ◽

Author(s):

Pei-Yau Lung ◽

Xiaodong Pang ◽

Yan Li ◽

Jinfeng Zhang

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Expression Data ◽

Missing Values ◽

Expression Data ◽

New Approach ◽

Machine Learning Methods ◽

Differential Gene ◽

Missing Variables ◽

Better Than

Abstract. Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we develop a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We propose a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we show that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
