Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Elisabetta Manduchi ◽  
Weixuan Fu ◽  
Joseph D. Romano ◽  
Stefano Ruberto ◽  
Jason H. Moore

Abstract
Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, biomedical data often contain baseline characteristics of the study subjects, or batch effects, that need to be adjusted for in order to better isolate the effects of the features of interest on the target. The ability to perform covariate adjustments is therefore particularly important for applications of AutoML to biomedical big data analysis.
Results: We developed an approach to adjust for covariates affecting the features and/or the target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.
Conclusions: In this work we address covariate adjustment, an important need in the context of AutoML that is particularly crucial for applications to bioinformatics and medical informatics. To this end, we present a substantial extension of TPOT, a genetic programming-based AutoML approach. We show the utility of this extension through applications to large toxicogenomics and differential gene expression data sets. The method is broadly applicable to many other scenarios in the biomedical field.
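The core mechanism described here, re-fitting the covariate regression on training data only so that nothing from the held-out folds leaks into the adjustment, can be illustrated outside TPOT with a small scikit-learn sketch. The `CovariateResidualizer` transformer, the column layout, and the toy data below are assumptions made for illustration, not the authors' resAdj TPOT code; adjusting the target would need analogous per-fold handling of y, which is part of why a dedicated extension is required rather than a plain pipeline.

```python
# Minimal sketch of leakage-free covariate adjustment (not the resAdj TPOT code).
# Assumption: X holds feature columns followed by covariate columns; the covariate
# regression is re-fit on every training fold because it lives inside the Pipeline.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class CovariateResidualizer(BaseEstimator, TransformerMixin):
    """Regress each feature on the covariates and pass on the residuals."""

    def __init__(self, n_covariates):
        self.n_covariates = n_covariates  # covariates are the last columns of X

    def fit(self, X, y=None):
        feats, covs = X[:, :-self.n_covariates], X[:, -self.n_covariates:]
        self.model_ = LinearRegression().fit(covs, feats)  # one multi-output fit
        return self

    def transform(self, X):
        feats, covs = X[:, :-self.n_covariates], X[:, -self.n_covariates:]
        return feats - self.model_.predict(covs)  # residualized features


# Toy data: 200 samples, 20 features, 2 covariates appended as extra columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 22))
y = X[:, 0] + 0.5 * X[:, 20] + rng.normal(size=200)  # target partly driven by a covariate

pipe = Pipeline([
    ("adjust", CovariateResidualizer(n_covariates=2)),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
# Because adjustment sits inside the pipeline, each CV split fits the covariate
# regression on its training fold only -- no information reaches the test fold.
print(cross_val_score(pipe, X, y, cv=5, scoring="r2"))
```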


2019 ◽  
Author(s):  
Pei-Yau Lung ◽  
Xiaodong Pang ◽  
Yan Li ◽  
Jinfeng Zhang

Abstract
Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we develop a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We propose a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performs better, in terms of maximizing the reusability of data with missing values, than pipelines using commonly used metrics such as the F1-score. We also find that different variables may need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we show that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.


2019 ◽  
Author(s):  
Tom M George ◽  
Pietro Lio

Abstract
Machine learning algorithms are revolutionising how information can be extracted from complex and high-dimensional data sets via intelligent compression. For example, unsupervised autoencoders train a deep neural network with a low-dimensional “bottlenecked” central layer to reconstruct input vectors. Variational autoencoders (VAEs) have shown promise at learning meaningful latent spaces for text, image and, more recently, gene-expression data. In the latter case they have been shown to be capable of capturing biologically relevant features such as a patient's sex or tumour type. Here we train a VAE on ovarian cancer transcriptomes from The Cancer Genome Atlas and show that, in many cases, the latent space learns an encoding predictive of cisplatin chemotherapy resistance. We analyse the effectiveness of such an architecture across a wide range of hyperparameters and use t-SNE, a state-of-the-art dimensionality-reduction algorithm, to embed the data in a two-dimensional manifold and visualise the predictive power of the trained latent spaces. By correlating genes with resistance-predictive encodings we are able to extract biological processes likely responsible for platinum resistance. Finally, we demonstrate that variational autoencoders can reliably encode gene expression data contaminated with significant amounts of Gaussian and dropout noise, a necessary feature if this technique is to be applicable to other data sets, including those in non-medical fields.
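For readers unfamiliar with the architecture, the PyTorch sketch below shows the three ingredients the abstract relies on: an encoder producing a mean and log-variance, the reparameterization trick, and a reconstruction-plus-KL loss, here applied to an input corrupted with Gaussian noise. The layer sizes, latent dimension and noise level are illustrative assumptions, not the hyperparameters explored in the paper.

```python
# Minimal VAE sketch for expression-like data (PyTorch). All sizes are
# illustrative assumptions, not the configuration used in the study.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpressionVAE(nn.Module):
    def __init__(self, n_genes=5000, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(512, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, n_genes))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar


def vae_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                 # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl


# One training step on noisy toy data; the Gaussian corruption mimics the
# robustness experiment: reconstruct the clean profile from a noisy input.
model = ExpressionVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 5000)                # stand-in for normalized expression vectors
x_noisy = x + 0.1 * torch.randn_like(x)  # corrupted input
x_hat, mu, logvar = model(x_noisy)
loss = vae_loss(x_hat, x, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```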


2020 ◽  
Vol 16 (11) ◽  
pp. e1007450
Author(s):  
Pei-Yau Lung ◽  
Dongrui Zhong ◽  
Xiaodong Pang ◽  
Yan Li ◽  
Jinfeng Zhang

Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all of the predicted data; instead, one should use only the subset that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better, in terms of maximizing the reusability of data with missing values, than pipelines using commonly used metrics such as the F1-score. We also found that different variables may need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
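The abstract does not spell out how PCAP is computed, but its stated motivation, crediting only the cases whose missing metadata can be predicted accurately rather than rewarding accuracy on a shrinking retained subset, suggests a quantity along the following lines. The confidence threshold and the per-case confidence score are assumptions made purely for illustration and may differ from the paper's definition.

```python
# One plausible reading of a PCAP-style metric (the paper's exact definition may
# differ): the proportion of ALL cases that are both confidently and correctly
# predicted, so discarding hard cases is not rewarded the way plain accuracy on
# the retained subset would be.
import numpy as np


def pcap_like(y_true, y_pred, confidence, threshold=0.9):
    confident = confidence >= threshold
    correct = (y_pred == y_true) & confident
    return correct.sum() / len(y_true)  # denominator: all cases, not just retained ones


y_true = np.array(["liver", "kidney", "liver", "heart", "liver"])
y_pred = np.array(["liver", "kidney", "heart", "heart", "liver"])
conf = np.array([0.97, 0.95, 0.55, 0.92, 0.40])
print(pcap_like(y_true, y_pred, conf))  # 3/5 = 0.6: two cases are too uncertain to count
```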


BMC Genomics ◽  
2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Honglin Wang ◽  
Pujan Joshi ◽  
Seung-Hyun Hong ◽  
Peter F. Maye ◽  
David W. Rowe ◽  
...  

Abstract
Background: Interferon regulatory factor 8 (IRF8) and nuclear factor of activated T cells c1 (NFATc1) are two transcription factors that play an important role in osteoclast differentiation. Thanks to ChIP-seq technology, scientists can now estimate potential genome-wide target genes of IRF8 and NFATc1. However, finding target genes that are consistently up-regulated or down-regulated across different studies is hard, because it requires analysis of a large number of high-throughput expression studies from a comparable context.
Method: We have developed a machine learning-based method, the Cohort-based TF target prediction system (cTAP), to overcome this problem. The method assumes that the pathway involving the transcription factors of interest features multiple “functional groups” of marker genes pertaining to the biological process concerned. It uses two notions, Gene-Present Sufficiently (GP) and Gene-Absent Insufficiently (GA), in addition to log2 fold changes of differentially expressed genes, for the prediction. Target prediction is made by applying multiple machine-learning models, which learn the patterns of GP and GA from log2 fold changes and four types of Z scores computed from the normalized cohort gene expression data. The learned patterns are then associated with the putative transcription factor targets to identify genes that consistently exhibit Up/Down gene regulation patterns within the cohort. We applied this method to 11 publicly available GEO data sets related to osteoclastogenesis.
Result: Our experiment identified a small number of Up/Down IRF8 and NFATc1 target genes relevant to osteoclast differentiation. The machine learning models using GP and GA produced NFATc1 and IRF8 target genes different from those obtained using log2 fold change alone. Our literature survey revealed that all predicted target genes have known roles in bone remodeling, specifically related to the immune system and to osteoclast formation and function, lending confidence to the validity of our method.
Conclusion: cTAP was motivated by the recognition that biologists tend to rely on the Z score values present in individual data sets for their analyses. However, using cTAP effectively presupposes assembling a sizable cohort of gene expression data sets within a comparable context. As public gene expression data repositories grow, the need for cohort-based analysis methods like cTAP will become increasingly important.
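The abstract leaves GP, GA and the four Z-score variants undefined, so the sketch below only mirrors the general recipe it describes: assemble per-gene features from log2 fold changes and cohort-level Z scores, then train a classifier to flag targets with consistent Up/Down behaviour. The column names, the single Z-score construction and the placeholder labels are assumptions for illustration, not the cTAP implementation.

```python
# Sketch of cohort-level feature construction and Up/Down target prediction in the
# spirit of cTAP (not the authors' implementation; GP, GA and the paper's four
# Z-score types are not reproduced here).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_genes, n_studies = 500, 11  # e.g. a cohort of 11 GEO data sets

# Simulated per-study log2 fold changes (treatment vs control) for each gene.
log2fc = pd.DataFrame(rng.normal(size=(n_genes, n_studies)),
                      columns=[f"study_{i}_log2fc" for i in range(n_studies)])
# One simple cohort Z score per study: how extreme each gene's change is within that study.
zscores = (log2fc - log2fc.mean()) / log2fc.std()
zscores.columns = [c.replace("log2fc", "z") for c in zscores.columns]
features = pd.concat([log2fc, zscores], axis=1)

# Placeholder for curated Up/Down labels of putative TF targets (hypothetical).
labels = pd.Series(rng.integers(0, 2, size=n_genes))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, features, labels, cv=5, scoring="f1").round(2))
```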


Cell Cycle ◽  
2018 ◽  
Vol 17 (4) ◽  
pp. 486-491 ◽  
Author(s):  
Nicolas Borisov ◽  
Victor Tkachev ◽  
Maria Suntsova ◽  
Olga Kovalchuk ◽  
Alex Zhavoronkov ◽  
...  

Author(s):  
Soumya Raychaudhuri

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single-condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes and thereby enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We then apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to identify the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.
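As a concrete, if crude, stand-in for this workflow, expression profiles can be clustered with k-means and each cluster scored for literature coherence using TF-IDF vectors of the genes' associated text. The toy documents and the mean pairwise cosine-similarity score below are assumptions for illustration, not the coherence measures developed in the book.

```python
# Sketch: k-means expression clusters scored by a crude literature-coherence
# measure (mean pairwise cosine similarity of TF-IDF vectors of each gene's text).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(12)]
expression = rng.normal(size=(12, 30))  # 12 genes x 30 conditions (toy data)
docs = {g: f"toy abstract about {g} ribosome biogenesis and stress response"
        for g in genes}                 # placeholder literature for each gene

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)
tfidf = TfidfVectorizer().fit_transform([docs[g] for g in genes])

for k in range(3):
    idx = np.where(clusters == k)[0]
    if len(idx) < 2:
        continue  # coherence is undefined for singleton clusters
    sims = cosine_similarity(tfidf[idx])
    coherence = sims[np.triu_indices_from(sims, k=1)].mean()  # mean off-diagonal similarity
    print(f"cluster {k}: {len(idx)} genes, literature coherence {coherence:.2f}")
```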

