scholarly journals DataRemix: a universal data transformation for optimal inference from gene expression datasets

Author(s):  
Weiguang Mao ◽  
Javad Rahimikollu ◽  
Ryan Hausler ◽  
Maria Chikina

Abstract Motivation RNA-seq technology provides unprecedented power in the assessment of the transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation network and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study. Results We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain. Availabilityand implementation DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix). Supplementary information Supplementary data are available at Bioinformatics online.

2018 ◽  
Author(s):  
Weiguang Mao ◽  
Ryan Hausler ◽  
Maria Chikina

AbstractMotivationRNAseq technology provides unprecedented power in the assesment of the transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation network and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study.ResultsWe describe a generalization of SVD-based reconstruction for which the common techniques of whitening, rank-k approximation, and removing the top k principle components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweight the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via Thompson Sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally we reanalyze the Depression Gene Networks (DGN) dataset, and we highlight new trans-eQTL networks which were not reported in the initial study.AvailabilityDataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix)[email protected]


2019 ◽  
Vol 36 (8) ◽  
pp. 2608-2610
Author(s):  
Aritro Nath ◽  
Jeremy Chang ◽  
R Stephanie Huang

Abstract Summary MicroRNAs (miRNAs) are critical post-transcriptional regulators of gene expression. Due to challenges in accurate profiling of small RNAs, a vast majority of public transcriptome datasets lack reliable miRNA profiles. However, the biological consequence of miRNA activity in the form of altered protein-coding gene (PCG) expression can be captured using machine-learning algorithms. Here, we present iMIRAGE (imputed miRNA activity from gene expression), a convenient tool to predict miRNA expression using PCG expression of the test datasets. The iMIRAGE package provides an integrated workflow for normalization and transformation of miRNA and PCG expression data, along with the option to utilize predicted miRNA targets to impute miRNA activity from independent test PCG datasets. Availability and implementation The iMIRAGE package for R, along with package documentation and vignette, is available at https://aritronath.github.io/iMIRAGE/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Zhenfeng Wu ◽  
Weixiang Liu ◽  
Xiufeng Jin ◽  
Deshui Yu ◽  
Hua Wang ◽  
...  

AbstractData normalization is a crucial step in the gene expression analysis as it ensures the validity of its downstream analyses. Although many metrics have been designed to evaluate the current normalization methods, the different metrics yield inconsistent results. In this study, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric mSCC to evaluate 14 commonly used normalization methods, achieving consistency in our evaluation results using both bulk RNA-seq and scRNA-seq data from the same library construction protocol. This consistency has validated the underlying theory that a sucessiful normalization method simultaneously maximizes the number of uniform genes and minimizes the correlation between the expression profiles of gene pairs. This consistency can also be used to analyze the quality of gene expression data. The gene expression data, normalization methods and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to evaluate methods (particularly some data-driven methods or their own methods) and then select a best one for data normalization in the gene expression analysis.


2020 ◽  
Author(s):  
Xianjun Dong ◽  
Xiaoqi Li ◽  
Tzuu-Wang Chang ◽  
Scott T Weiss ◽  
Weiliang Qiu

Genome-wide association studies (GWAS) have revealed thousands of genetic loci for common diseases. One of the main challenges in the post-GWAS era is to understand the causality of the genetic variants. Expression quantitative trait locus (eQTL) analysis has been proven to be an effective way to address this question by examining the relationship between gene expression and genetic variation in a sufficiently powered cohort. However, it is often tricky to determine the sample size at which a variant with a specific allele frequency will be detected to associate with gene expression with sufficient power. This is particularly demanding with single-cell RNAseq studies. Therefore, a user-friendly tool to perform power analysis for eQTL at both bulk tissue and single-cell level will be critical. Here, we presented an R package called powerEQTL with flexible functions to calculate power, minimal sample size, or detectable minor allele frequency in both bulk tissue and single-cell eQTL analysis. A user-friendly, program-free web application is also provided, allowing customers to calculate and visualize the parameters interactively.


2020 ◽  
Vol 36 (15) ◽  
pp. 4301-4308
Author(s):  
Stephan Seifert ◽  
Sven Gundlach ◽  
Olaf Junge ◽  
Silke Szymczak

Abstract Motivation High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. Results The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. Availability and implementation An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Yue Hu ◽  
Xuegong Zhang

With the development of single-cell sequencing technologies, parallel sequencing the transcriptome and genome is becoming available and will bring us the opportunity to uncover association between genotype and phenotype at single-cell level. Due to the special characteristics of single-cell sequencing data, new method is needed to identify eQTL from single-cell data. We developed an R package SCeQTL that uses zero-inflated negative binomial regression to do eQTL analysis on single-cell data. It can distinguish two type of gene-expression differences among different genotype groups. It can also be used for finding gene expression variations associated with other grouping factors like cell lineages or cell types.


2020 ◽  
Vol 36 (12) ◽  
pp. 3916-3917 ◽  
Author(s):  
Daniele Mercatelli ◽  
Gonzalo Lopez-Garcia ◽  
Federico M Giorgi

Abstract Motivation Gene network inference and master regulator analysis (MRA) have been widely adopted to define specific transcriptional perturbations from gene expression signatures. Several tools exist to perform such analyses but most require a computer cluster or large amounts of RAM to be executed. Results We developed corto, a fast and lightweight R package to infer gene networks and perform MRA from gene expression data, with optional corrections for copy-number variations and able to run on signatures generated from RNA-Seq or ATAC-Seq data. We extensively benchmarked it to infer context-specific gene networks in 39 human tumor and 27 normal tissue datasets. Availability and implementation Cross-platform and multi-threaded R package on CRAN (stable version) https://cran.r-project.org/package=corto and Github (development release) https://github.com/federicogiorgi/corto. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (8) ◽  
pp. 2486-2491 ◽  
Author(s):  
Serin Zhang ◽  
Jiang Shao ◽  
Disa Yu ◽  
Xing Qiu ◽  
Jinfeng Zhang

Abstract Motivation Combining gene expression (GE) profiles generated from different platforms enables previously infeasible studies due to sample size limitations. Several cross-platform normalization methods have been developed to remove the systematic differences between platforms, but they may also remove meaningful biological differences among datasets. In this work, we propose a novel approach that removes the platform, not the biological differences. Dubbed as ‘MatchMixeR’, we model platform differences by a linear mixed effects regression (LMER) model, and estimate them from matched GE profiles of the same cell line or tissue measured on different platforms. The resulting model can then be used to remove platform differences in other datasets. By using LMER, we achieve better bias-variance trade-off in parameter estimation. We also design a computationally efficient algorithm based on the moment method, which is ideal for ultra-high-dimensional LMER analysis. Results Compared with several prominent competing methods, MatchMixeR achieved the highest after-normalization concordance. Subsequent differential expression analyses based on datasets integrated from different platforms showed that using MatchMixeR achieved the best trade-off between true and false discoveries, and this advantage is more apparent in datasets with limited samples or unbalanced group proportions. Availability and implementation Our method is implemented in a R-package, ‘MatchMixeR’, freely available at: https://github.com/dy16b/Cross-Platform-Normalization. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Valentin Junet ◽  
Judith Farrés ◽  
José M. Mas ◽  
Xavier Daura

AbstractMotivationCross-(multi)platform normalization of gene-expression microarray data remains an unresolved issue. Despite the existence of several algorithms, they are either constrained by the need to normalize all samples of all platforms together, compromising scalability and reuse, by adherence to the platforms of a specific provider, or simply by poor performance. In addition, many of the methods presented in the literature have not been specifically tested against multi-platform data and/or other methods applicable in this context. Thus, we set out to develop a normalization algorithm appropriate for gene-expression studies based on multiple, potentially large microarray sets collected along multiple platforms and at different times, applicable in systematic studies aimed at extracting knowledge from the wealth of microarray data available in public repositories; for example, for the extraction of Real-World Data to complement data from Randomized Controlled Trials. Our main focus or criterion for performance was on the capacity of the algorithm to properly separate samples from different biological groups.ResultsWe present CuBlock, an algorithm addressing this objective, together with a strategy to validate cross-platform normalization methods. To validate the algorithm and benchmark it against existing methods, we used two distinct data sets, one specifically generated for testing and standardization purposes and one from an actual experimental study. Using these data sets, we benchmarked CuBlock against ComBat (Johnson et al., 2007), YuGene (Lê Cao et al., 2014), DBNorm (Meng et al., 2017), Shambhala (Borisov et al., 2019) and a simple log2 transform as reference. We note that many other popular normalization methods are not applicable in this context. CuBlock was the only algorithm in this group that could always and clearly differentiate the underlying biological groups after mixing the data, from up to six different platforms in this study.AvailabilityCuBlock can be downloaded from https://www.mathworks.com/matlabcentral/fileexchange/[email protected], [email protected] informationSupplementary data are available at bioRxiv online.


2018 ◽  
Author(s):  
Nanne Aben ◽  
Johan A. Westerhuis ◽  
Yipeng Song ◽  
Henk A.L. Kiers ◽  
Magali Michaut ◽  
...  

AbstractMotivationIn biology, we are often faced with multiple datasets recorded on the same set of objects, such as multi-omics and phenotypic data of the same tumors. These datasets are typically not independent from each other. For example, methylation may influence gene expression, which may, in turn, influence drug response. Such relationships can strongly affect analyses performed on the data, as we have previously shown for the identification of biomarkers of drug response. Therefore, it is important to be able to chart the relationships between datasets.ResultsWe present iTOP, a methodology to infera topology of relationships between datasets. We base this methodology on the RV coefficient, a measure of matrix correlation, which can be used to determine how much information is shared between two datasets. We extended the RV coefficient for partial matrix correlations, which allows the use of graph reconstruction algorithms, such as the PC algorithm, to infer the topologies. In addition, since multi-omics data often contain binary data (e.g. mutations), we also extended the RV coefficient for binary data. Applying iTOP to pharmacogenomics data, we found that gene expression acts as a mediator between most other datasets and drug response: only proteomics clearly shares information with drug response that is not present in gene expression. Based on this result, we used TANDEM, a method for drug response prediction, to identify which variables predictive of drug response were distinct to either gene expression or proteomics.AvailabilityAn implementation of our methodology is available in the R package iTOP on CRAN. Additionally, an R Markdown document with code to reproduce all figures is provided as Supplementary [email protected] and [email protected] informationSupplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document