cTAP: A Machine Learning Framework for Predicting Target Genes of a Transcription Factor using a Cohort of Gene Expression Data Sets

Abstract. Background: Interferon regulatory factor-8 (IRF8) and nuclear factor of activated T cells c1 (NFATc1) are two transcription factors that play an important role in osteoclast differentiation. Thanks to ChIP-seq technology, scientists can now estimate potential genome-wide target genes of IRF8 and NFATc1. However, finding target genes that are consistently up-regulated or down-regulated across different studies is hard because it requires analysis of a large number of high-throughput expression studies from a comparable context. Method: We have developed a machine-learning-based method, called the Cohort-based TF target prediction system (cTAP), to overcome this problem. This method assumes that the pathway involving the transcription factors of interest is characterized by multiple “functional groups” of marker genes pertaining to the biological process concerned. It uses two notions, Gene-Present Sufficiently (GP) and Gene-Absent Insufficiently (GA), in addition to log2 fold changes of differentially expressed genes, for the prediction. Target prediction is made by applying multiple machine-learning models, which learn the patterns of GP and GA from log2 fold changes and four types of Z scores from the normalized cohort’s gene expression data. The learned patterns are then associated with the putative transcription factor targets to identify genes that consistently exhibit Up/Down gene regulation patterns within the cohort. We applied this method to 11 publicly available GEO data sets related to osteoclastogenesis. Result: Our experiment identified a small number of Up/Down IRF8 and NFATc1 target genes as relevant to osteoclast differentiation. The machine-learning models using GP and GA produced NFATc1 and IRF8 target genes different from those obtained using log2 fold change alone. Our literature survey revealed that all predicted target genes have known roles in bone remodeling, specifically related to the immune system and osteoclast formation and function, which supports the validity of our method. Conclusion: cTAP was motivated by recognizing that biologists tend to use Z score values present in data sets for the analysis. However, using cTAP effectively presupposes assembling a sizable cohort of gene expression data sets within a comparable context. As public gene expression data repositories grow, cohort-based analysis methods like cTAP will become increasingly important.
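The two basic quantities named in the abstract, log2 fold change and Z score, can be illustrated with a minimal Python sketch. This is not the cTAP implementation: the GP/GA notions and the four Z-score types are not defined in the abstract, so only a plain Z score and a simple "same direction in most data sets" rule are shown. The function names, the pseudocount, the thresholds, and the toy cohort are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean treated to mean control expression.

    The pseudocount (an assumption) avoids division by zero.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def consistent_direction(fold_changes, threshold=1.0, min_fraction=0.8):
    """Return 'Up', 'Down', or None from per-data-set log2 fold changes.

    A gene is called only if at least min_fraction of the data sets
    exceed the threshold in the same direction (hypothetical rule).
    """
    n = len(fold_changes)
    up = sum(fc >= threshold for fc in fold_changes)
    down = sum(fc <= -threshold for fc in fold_changes)
    if up / n >= min_fraction:
        return "Up"
    if down / n >= min_fraction:
        return "Down"
    return None


# Hypothetical toy cohort: gene -> one (treated, control) replicate pair per data set.
cohort = {
    "GENE_A": [([30, 34], [7, 7]), ([60, 68], [15, 15]), ([15, 15], [3, 3])],
    "GENE_B": [([7, 7], [31, 31]), ([10, 12], [40, 40]), ([20, 20], [21, 21])],
}

calls = {}
for gene, datasets in cohort.items():
    fold_changes = [log2_fold_change(t, c) for t, c in datasets]
    calls[gene] = consistent_direction(fold_changes)

# Z scores of one small expression vector, for comparison across data sets.
standardized = z_scores([1.0, 2.0, 3.0])
```

In this toy cohort GENE_A rises in all three data sets, while GENE_B falls clearly in only two of three and is therefore left uncalled under the 80% rule.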

Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets

Statistical Analysis and Data Mining: The ASA Data Science Journal ◽ 10.1002/sam.11549 ◽ 2021

Author(s): Jessica Krepel ◽ Magdalena Kircher ◽ Moritz Kohls ◽ Klaus Jung

Keyword(s): Gene Expression ◽ Machine Learning ◽ Gene Expression Data ◽ Data Sets ◽ Expression Data ◽ Learning Models ◽ Independent Gene ◽ Machine Learning Models

Unsupervised Machine Learning for Data Encoding applied to Ovarian Cancer Transcriptomes

10.1101/855593 ◽ 2019

Author(s): Tom M George ◽ Pietro Lio

Keyword(s): Gene Expression ◽ Machine Learning ◽ Ovarian Cancer ◽ Gene Expression Data ◽ Clustering Algorithm ◽ Machine Learning Algorithms ◽ Data Sets ◽ Dimensional Manifold ◽ Expression Data ◽ Wide Range

Abstract. Machine learning algorithms are revolutionising how information can be extracted from complex and high-dimensional data sets via intelligent compression. For example, unsupervised autoencoders train a deep neural network with a low-dimensional “bottlenecked” central layer to reconstruct input vectors. Variational autoencoders (VAEs) have shown promise at learning meaningful latent spaces for text, image and, more recently, gene expression data. In the latter case they have been shown capable of capturing biologically relevant features such as a patient's sex or tumour type. Here we train a VAE on ovarian cancer transcriptomes from The Cancer Genome Atlas and show that, in many cases, the latent space learns an encoding predictive of cisplatin chemotherapy resistance. We analyse the effectiveness of such an architecture across a wide range of hyperparameters and use a state-of-the-art clustering algorithm, t-SNE, to embed the data in a two-dimensional manifold and visualise the predictive power of the trained latent spaces. By correlating genes to resistance-predictive encodings we are able to extract biological processes likely responsible for platinum resistance. Finally, we demonstrate that variational autoencoders can reliably encode gene expression data contaminated with significant amounts of Gaussian and dropout noise, a necessary feature if this technique is to be applicable to other data sets, including those in non-medical fields.
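The two kinds of contamination mentioned at the end of the abstract can be sketched in a few lines of Python. The abstract does not say how the authors injected noise, so the functions below, their names, and the toy expression vector are assumptions: Gaussian noise is taken to be additive with zero mean, and dropout noise is taken to set each value to zero independently with a fixed probability.

```python
import random


def add_gaussian_noise(vector, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in vector]


def add_dropout_noise(vector, drop_prob, rng):
    """Set each entry to zero independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in vector]


# Hypothetical expression vector for one sample; a seeded generator
# keeps the corruption reproducible.
rng = random.Random(0)
expression = [5.2, 0.0, 3.1, 8.7, 1.4]
noisy = add_dropout_noise(add_gaussian_noise(expression, 0.5, rng), 0.2, rng)
```

A robustness check in this spirit would encode both `expression` and `noisy` with the trained model and compare the resulting latent vectors.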

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

BMC Bioinformatics ◽ 10.1186/s12859-020-03755-4 ◽ 2020 ◽ Vol 21 (1)

Author(s): Elisabetta Manduchi ◽ Weixuan Fu ◽ Joseph D. Romano ◽ Stefano Ruberto ◽ Jason H. Moore

Keyword(s): Gene Expression ◽ Machine Learning ◽ Big Data ◽ Gene Expression Data ◽ Data Sets ◽ Biomedical Data ◽ Training Procedure ◽ Expression Data ◽ Differential Gene ◽ Automated Machine Learning

Abstract. Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
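The idea of regressing out a covariate without leakage can be sketched as follows. This is not the TPOT extension's API; it is a minimal standard-library illustration with hypothetical function names and toy data. The key point is that the covariate model is fitted on the training rows of a fold only, and the same fitted coefficients are then used to residualize both the training and the held-out rows.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x for a single covariate."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def residualize(x, y, coef):
    """Subtract the covariate's fitted contribution from y."""
    a, b = coef
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def adjust_fold(covariate, feature, train_idx, test_idx):
    """Fit on training rows only, then apply the same fit to both splits."""
    x_train = [covariate[i] for i in train_idx]
    y_train = [feature[i] for i in train_idx]
    x_test = [covariate[i] for i in test_idx]
    y_test = [feature[i] for i in test_idx]
    coef = fit_line(x_train, y_train)  # held-out rows never influence the fit
    return residualize(x_train, y_train, coef), residualize(x_test, y_test, coef)


# Toy data (an assumption): the feature is exactly 2 * covariate + 1.
covariate = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
feature = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
train_res, test_res = adjust_fold(covariate, feature, [0, 1, 2, 3], [4, 5])
```

Fitting the covariate model on all rows before splitting would let held-out samples shape the adjustment, which is the leakage the abstract refers to.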

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

10.1101/2020.08.24.265116 ◽ 2020

Author(s): Elisabetta Manduchi ◽ Weixuan Fu ◽ Joseph D. Romano ◽ Stefano Ruberto ◽ Jason H. Moore

Keyword(s): Gene Expression ◽ Machine Learning ◽ Big Data ◽ Gene Expression Data ◽ Data Sets ◽ Biomedical Data ◽ Training Procedure ◽ Expression Data ◽ Differential Gene ◽ Automated Machine Learning

Abstract. Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. Results: We present an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We then describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.

BICORN: An R package for integrative inference of de novo cis-regulatory modules

10.1101/560557 ◽ 2019

Author(s): Xi Chen

Keyword(s): Gene Expression ◽ Transcription Factor ◽ Gene Expression Data ◽ Regulatory Networks ◽ Target Genes ◽ De Novo ◽ R Package ◽ Expression Data ◽ Regulatory Modules ◽ Regulatory Module

Abstract. BICORN is an R package developed to integrate prior transcription factor binding information and gene expression data for cis-regulatory module (CRM) inference. BICORN searches for a list of candidate CRMs from binary bindings on potential target genes. Applying Gibbs sampling, BICORN samples CRMs for each gene using the fitting performance of transcription factor activities and regulation strengths of TFs in each CRM on gene expression. Consequently, sparse regulatory networks are inferred as functional CRMs regulating target genes. The BICORN package is implemented in R and is available at https://cran.r-project.org/web/packages/BICORN/index.html.

A method of gene expression data transfer from cell lines to cancer patients for machine-learning prediction of drug efficiency

Cell Cycle ◽ 10.1080/15384101.2017.1417706 ◽ 2018 ◽ Vol 17 (4) ◽ pp. 486-491 ◽ Cited By ~ 22

Author(s): Nicolas Borisov ◽ Victor Tkachev ◽ Maria Suntsova ◽ Olga Kovalchuk ◽ Alex Zhavoronkov ◽ ...

Keyword(s): Gene Expression ◽ Machine Learning ◽ Cancer Patients ◽ Cell Lines ◽ Gene Expression Data ◽ Data Transfer ◽ Expression Data ◽ Drug Efficiency

Predicting transcription factor binding motifs from DNA-binding domains, chromatin accessibility and gene expression data

Nucleic Acids Research ◽ 10.1093/nar/gkx358 ◽ 2017 ◽ Vol 45 (10) ◽ pp. 5666-5677 ◽ Cited By ~ 5

Author(s): Mahdi Zamanighomi ◽ Zhixiang Lin ◽ Yong Wang ◽ Rui Jiang ◽ Wing Hung Wong

Keyword(s): Gene Expression ◽ Transcription Factor ◽ DNA Binding ◽ Gene Expression Data ◽ Chromatin Accessibility ◽ Expression Data ◽ Binding Motifs ◽ DNA Binding Domains ◽ Binding Domains ◽ Transcription Factor Binding Motifs

Transcription factor regulation can be accurately predicted from the presence of target gene signatures in microarray gene expression data

Nucleic Acids Research ◽ 10.1093/nar/gkq149 ◽ 2010 ◽ Vol 38 (11) ◽ pp. e120-e120 ◽ Cited By ~ 134

Author(s): Ahmed Essaghir ◽ Federica Toffalini ◽ Laurent Knoops ◽ Anders Kallin ◽ Jacques van Helden ◽ ...

Keyword(s): Gene Expression ◽ Transcription Factor ◽ Gene Expression Data ◽ Target Gene ◽ Microarray Gene Expression Data ◽ Expression Data ◽ Microarray Gene Expression ◽ Gene Signatures ◽ Transcription Factor Regulation ◽ Microarray Gene

Analyzing Large Gene Expression Data Sets

Computational Text Analysis ◽ 10.1093/oso/9780198567400.003.0014 ◽ 2006

Author(s): Soumya Raychaudhuri

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Expression Analysis ◽ Gene Expression Analysis ◽ Data Sets ◽ Expression Data ◽ Clustering Methods ◽ Biologically Relevant ◽ Large Gene ◽ Functional Coherence

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since a myriad of genes are examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. The data we are examining are enormously complex: each gene is associated with dozens if not hundreds of expression values, as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single-condition experiments. Here we apply the strategies introduced in Chapter 6 for assessing the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We then apply functional coherence measures to gene expression analysis; as an example, we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.