iMIRAGE: an R package to impute microRNA expression using protein-coding genes

Aritro Nath; Jeremy Chang; R Stephanie Huang

doi:10.1093/bioinformatics/btz939

Bioinformatics ◽

10.1093/bioinformatics/btz939 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2608-2610

Author(s):

Aritro Nath ◽

Jeremy Chang ◽

R Stephanie Huang

Keyword(s):

Gene Expression ◽

Small RNAs ◽

Transcriptional Regulators ◽

R Package ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Expression Data ◽

Protein Coding ◽

Altered Protein ◽

Independent Test

Abstract

Summary: MicroRNAs (miRNAs) are critical post-transcriptional regulators of gene expression. Due to challenges in accurate profiling of small RNAs, a vast majority of public transcriptome datasets lack reliable miRNA profiles. However, the biological consequence of miRNA activity in the form of altered protein-coding gene (PCG) expression can be captured using machine-learning algorithms. Here, we present iMIRAGE (imputed miRNA activity from gene expression), a convenient tool to predict miRNA expression using PCG expression of the test datasets. The iMIRAGE package provides an integrated workflow for normalization and transformation of miRNA and PCG expression data, along with the option to utilize predicted miRNA targets to impute miRNA activity from independent test PCG datasets.

Availability and implementation: The iMIRAGE package for R, along with package documentation and vignette, is available at https://aritronath.github.io/iMIRAGE/index.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
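The imputation idea in this abstract can be illustrated with a minimal Python sketch. It is not the iMIRAGE API: the function name `knn_impute`, the choice of k-nearest-neighbour regression, and the toy numbers are all assumptions made here for illustration, since the abstract only states that machine-learning algorithms are used. A training cohort with both PCG and miRNA measurements is used to predict miRNA expression in a test cohort that has PCG data only.

```python
import math

def knn_impute(train_pcg, train_mirna, test_pcg, k=3):
    """Predict one miRNA's expression for each test sample (illustrative only).

    train_pcg:   per-sample PCG expression vectors from the training cohort
    train_mirna: measured miRNA expression for the same training samples
    test_pcg:    per-sample PCG vectors from a dataset lacking miRNA profiles
    """
    predictions = []
    for sample in test_pcg:
        # Euclidean distance from this test sample to every training sample.
        dists = [math.dist(sample, ref) for ref in train_pcg]
        # Indices of the k closest training samples.
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        # Imputed value: mean miRNA expression of those neighbours.
        predictions.append(sum(train_mirna[i] for i in nearest) / len(nearest))
    return predictions

# Hypothetical toy data: two PCGs, six training samples in two clusters.
train_pcg = [[1.0, 0.0], [1.1, 0.1], [0.9, 0.2],
             [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_mirna = [2.0, 2.2, 1.8, 8.0, 8.3, 7.7]
test_pcg = [[1.0, 0.1], [5.0, 5.0]]

imputed = knn_impute(train_pcg, train_mirna, test_pcg, k=3)
```

In practice both cohorts would need to be normalized onto a comparable scale before distances are meaningful, which is the role the abstract assigns to the package's normalization and transformation workflow.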

Designing a general method for predicting the regulatory relationships between long noncoding RNAs and protein-coding genes based on multi-omics characteristics

Bioinformatics ◽

10.1093/bioinformatics/btz886 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2025-2032

Author(s):

Yuwei Zhang ◽

Tianfei Yi ◽

Huihui Ji ◽

Guofang Zhao ◽

Yang Xi ◽

...

Keyword(s):

Noncoding RNA ◽

Cross Validation ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Protein Coding ◽

Protein Coding Genes ◽

Negative Results ◽

Independent Test ◽

Multiple Characteristics ◽

Test Sets

Abstract

Motivation: Long noncoding RNA (lncRNA) has been verified to interact with other biomolecules, especially protein-coding genes (PCGs), thus playing essential regulatory roles in life activities and disease development. However, the inner mechanisms of most lncRNA–PCG relationships are still unclear. Our study investigated the characteristics of true lncRNA–PCG relationships and constructed a novel predictor with machine learning algorithms.

Results: We obtained 307 true lncRNA–PCG pairs from a database and found significant differences in multiple characteristics between true and random lncRNA–PCG sets. In addition, 3-fold cross-validation and predictions on independent test sets show high AUC values for LR, SVM and RF, among which RF performs best, with an average AUC of 0.818 for cross-validation and 0.823 and 0.853 for the two independent test sets, respectively. In a case study, some candidate lncRNA–PCG relationships in colorectal cancer were found, and the HOTAIR–COMP interaction was highlighted as an example. The proportion of reported pairs among the predicted positive results was significantly higher than that among the negative results (P < 0.05).

Supplementary information: Supplementary data are available at Bioinformatics online.
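The evaluation metric reported above, AUC averaged over cross-validation folds, can be sketched in a few lines of Python. This is a generic illustration, not the authors' code: the function names and the toy scores are hypothetical, and the AUC is computed from its pairwise definition (the probability that a true pair is scored above a random pair, with ties counted as one half).

```python
def auc(scores, labels):
    """AUC from its pairwise definition; labels are 1 (true pair) or 0 (random pair)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(folds):
    """Average AUC over held-out folds, each given as (scores, labels)."""
    return sum(auc(s, y) for s, y in folds) / len(folds)

# Hypothetical classifier scores on three held-out folds.
folds = [
    ([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]),
    ([0.7, 0.2], [1, 0]),
    ([0.6, 0.6], [1, 0]),
]
fold_aucs = [auc(s, y) for s, y in folds]
average = mean_auc(folds)
```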

runibic: a Bioconductor package for parallel row-based biclustering of gene expression data

10.1101/210682 ◽

2017 ◽

Cited By ~ 1

Author(s):

Patryk Orzechowski ◽

Artur Pańszczyk ◽

Xiuzhen Huang ◽

Jason H. Moore

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Memory Management ◽

Large Scale ◽

R Package ◽

Supplementary Information ◽

Sequential Algorithm ◽

Expression Data ◽

Bioconductor Package ◽

Parallel Version

Abstract

Motivation: Biclustering (also called co-clustering) is an unsupervised technique for the simultaneous analysis of the rows and columns of an input matrix. Since its first application to gene expression data, multiple algorithms have been proposed. Only a handful of them have been able to provide accurate results while being fast enough for large-scale genomic datasets.

Results: In this paper we introduce a Bioconductor package with a parallel version of the UniBic biclustering algorithm, one of the most accurate biclustering methods developed so far. For convenience, we have wrapped the algorithm in an R package called runibic. The package includes: (1) a parallel version that is a couple of times faster than the original sequential algorithm, (2) much more efficient memory management, (3) modularity, which allows new methods to be built on top of the provided one, and (4) integration with modern Bioconductor packages such as SummarizedExperiment, ExpressionSet and biclust.

Availability: The package is implemented in R (3.4) and will be available in the new release of Bioconductor (3.6). Currently it can be downloaded from the following URL: http://github.com/athril/runibic/

Supplementary information: Supplementary information is available in the vignette of the package.

Using multiple measurements of tissue to estimate subject- and cell-type-specific gene expression

Bioinformatics ◽

10.1093/bioinformatics/btz619 ◽

2019 ◽

Vol 36 (3) ◽

pp. 782-788 ◽

Cited By ~ 6

Author(s):

Jiebiao Wang ◽

Bernie Devlin ◽

Kathryn Roeder

Keyword(s):

Gene Expression ◽

Empirical Bayes ◽

Brain Regions ◽

Tissue Level ◽

Supplementary Information ◽

Specific Gene ◽

Expression Data ◽

Cell Type ◽

Multiple Measurements ◽

Cell Type Specific

Abstract

Motivation: Patterns of gene expression, quantified at the level of tissue or cells, can inform on etiology of disease. There are now rich resources for tissue-level (bulk) gene expression data, which have been collected from thousands of subjects, and resources involving single-cell RNA-sequencing (scRNA-seq) data are expanding rapidly. The latter yields cell type information, although the data can be noisy and typically are derived from a small number of subjects.

Results: Complementing these approaches, we develop a method to estimate subject- and cell-type-specific (CTS) gene expression from tissue using an empirical Bayes method that borrows information across multiple measurements of the same tissue per subject (e.g. multiple regions of the brain). Analyzing expression data from multiple brain regions from the Genotype-Tissue Expression project (GTEx) reveals CTS expression, which then permits downstream analyses, such as identification of CTS expression Quantitative Trait Loci (eQTL).

Availability and implementation: We implement this method as an R package MIND, hosted on https://github.com/randel/MIND.

Supplementary information: Supplementary data are available at Bioinformatics online.

Causal Inference Engine: a platform for directional gene set enrichment analysis and inference of active transcriptional regulators

Nucleic Acids Research ◽

10.1093/nar/gkz1046 ◽

2019 ◽

Cited By ~ 1

Author(s):

Saman Farahmand ◽

Corey O’Connor ◽

Jill A Macoska ◽

Kourosh Zarringhalam

Keyword(s):

Gene Expression ◽

Causal Inference ◽

Gene Expression Data ◽

Transcriptional Regulators ◽

Regulatory Mechanisms ◽

Gene Interactions ◽

Expression Data ◽

Inference Algorithms ◽

TF Gene ◽

Mode Of Regulation

Abstract

Inference of active regulatory mechanisms underlying specific molecular and environmental perturbations is essential for understanding cellular response. The success of inference algorithms relies on the quality and coverage of the underlying network of regulator–gene interactions. Several commercial platforms provide large and manually curated regulatory networks and functionality to perform inference on these networks. Adaptation of such platforms for open-source academic applications has been hindered by the lack of availability of accurate, high-coverage networks of regulatory interactions and integration of efficient causal inference algorithms. In this work, we present CIE, an integrated platform for causal inference of active regulatory mechanisms from differential gene expression data. Using a regularized Gaussian Graphical Model (GGM), we construct a transcriptional regulatory network by integrating publicly available ChIP-seq experiments with gene-expression data from tissue-specific RNA-seq experiments. Our GGM approach identifies high confidence transcription factor (TF)–gene interactions and annotates the interactions with information on mode of regulation (activation vs. repression). Benchmarks against manually curated databases of TF–gene interactions show that our method can accurately detect mode of regulation. We demonstrate the ability of our platform to identify active transcriptional regulators by using controlled in vitro overexpression and stem-cell differentiation studies and utilize our method to investigate transcriptional mechanisms of fibroblast phenotypic plasticity.

AssessORF: combining evolutionary conservation and proteomics to assess prokaryotic gene predictions

Bioinformatics ◽

10.1093/bioinformatics/btz714 ◽

2019 ◽

Author(s):

Deepank R Korandla ◽

Jacob M Wozniak ◽

Anaamika Campeau ◽

David J Gonzalez ◽

Erik S Wright

Keyword(s):

R Package ◽

Evolutionary Conservation ◽

Supplementary Information ◽

Bioconductor Package ◽

Gene Finding ◽

Proteomics Data ◽

Protein Coding ◽

New Approach ◽

Protein Coding Genes ◽

Clear Winner

Abstract

Motivation: A core task of genomics is to identify the boundaries of protein-coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding and, therefore, most developers select their own benchmarking strategy.

Results: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88–95% in agreement with the available evidence, with Glimmer performing the worst but no clear winner. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites.

Availability and implementation: AssessORF is available as an R package via the Bioconductor package repository.

Supplementary information: Supplementary data are available at Bioinformatics online.

DataRemix: a universal data transformation for optimal inference from gene expression datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa745 ◽

2020 ◽

Cited By ~ 1

Author(s):

Weiguang Mao ◽

Javad Rahimikollu ◽

Ryan Hausler ◽

Maria Chikina

Keyword(s):

Gene Expression ◽

R Package ◽

Supplementary Information ◽

eQTL Analysis ◽

Thompson Sampling ◽

Normalization Methods ◽

Special Cases ◽

Biological Signals ◽

Gene Correlation ◽

Value Decomposition

Abstract

Motivation: RNA-seq technology provides unprecedented power in the assessment of transcript abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation networks and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study.

Results: We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain.

Availability and implementation: DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix).

Supplementary information: Supplementary data are available at Bioinformatics online.
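A three-parameter SVD reweighting of the kind this abstract describes can be sketched in Python with NumPy. The abstract does not give the formula, so the form used here is an assumption: with X = U S V^T, the top k components are rescaled by raising their singular values to a power p, and the remaining residual is scaled by mu. The function name `remix` is hypothetical and this is not the DataRemix package API.

```python
import numpy as np

def remix(X, k, p, mu):
    """Assumed form: U_k S_k^p V_k^T + mu * (X - U_k S_k V_k^T)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    top_k = (Uk * sk) @ Vtk              # rank-k approximation of X
    residual = X - top_k                 # everything outside the top k components
    return (Uk * sk**p) @ Vtk + mu * residual

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # hypothetical samples x genes matrix

unchanged = remix(X, k=2, p=1.0, mu=1.0)   # p = 1, mu = 1 returns X itself
rank_two = remix(X, k=2, p=1.0, mu=0.0)    # mu = 0 keeps only the rank-k part
whitened = remix(X, k=4, p=0.0, mu=0.0)    # p = 0 at full rank sets all singular values to 1
```

Under this assumed parameterization, rank-k approximation and whitening fall out as particular settings of (k, p, mu), which matches two of the special cases the abstract names; intermediate values of p and mu interpolate between them.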

Analysis of RDR1/RDR2/RDR6-independent small RNAs in Arabidopsis thaliana improves MIRNA annotations and reveals novel siRNA loci

10.1101/238691 ◽

2017 ◽

Cited By ~ 1

Author(s):

Seth Polydore ◽

Michael J. Axtell

Keyword(s):

Gene Expression ◽

Arabidopsis Thaliana ◽

Small RNA ◽

Small RNAs ◽

RNA Seq ◽

Triple Mutant ◽

Physiological Mechanisms ◽

Protein Coding ◽

Regulate Gene Expression ◽

RNA Biogenesis

Summary: Plant small RNAs (sRNAs) regulate key physiological mechanisms through post-transcriptional and transcriptional silencing of gene expression. sRNAs fall into two major categories: those that are reliant on RNA Dependent RNA Polymerases (RDRs) for biogenesis and those that are not. Known RDR-dependent sRNAs include phased and repeat-associated short interfering RNAs, while known RDR-independent sRNAs are primarily microRNAs and other hairpin-derived sRNAs. In this study, we produced and analyzed small RNA-seq libraries from rdr1/rdr2/rdr6 triple mutant plants. Only a small fraction of all sRNA loci were RDR1/RDR2/RDR6-independent; most of these were microRNA loci or associated with predicted hairpin precursors. We found 58 previously annotated microRNA loci that were reliant on RDR1, −2, or −6 function, casting doubt on their classification. We also found 38 RDR1/2/6-independent small RNA loci that are not MIRNAs or otherwise hairpin-derived, and did not fit into other known paradigms for small RNA biogenesis. These 38 small RNA-producing loci have novel biogenesis mechanisms, and are frequently located in the vicinity of protein-coding genes. Altogether, our analysis suggests that these 38 loci represent one or more new types of small RNAs in Arabidopsis thaliana.

Significance Statement: Small RNAs regulate gene expression in plants and are produced through a variety of previously described mechanisms. Here, we examine a set of previously undiscovered small RNA-producing loci that are produced by novel mechanisms.

Space-log: a novel approach to inferring gene-gene networks using SPACE model with log penalty

F1000Research ◽

10.12688/f1000research.26128.2 ◽

2022 ◽

Vol 9 ◽

pp. 1159

Author(s):

Qian (Vicky) Wu ◽

Wei Sun ◽

Li Hsu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Regulatory Networks ◽

Penalized Regression ◽

R Package ◽

Expression Data ◽

Computationally Efficient ◽

P Gene ◽

Novel Approach

Gene expression data have been used to infer gene-gene networks (GGN), where an edge between two genes implies the conditional dependence of these two genes given all the other genes. Such gene-gene networks are often referred to as gene regulatory networks since they may reveal expression regulation. Most existing methods for identifying GGN employ penalized regression with an L1 (lasso), L2 (ridge), or elastic net penalty, the last of which spans the range from L1 to L2. However, for high-dimensional gene expression data, a penalty that spans the range between L0 and L1, such as the log penalty, is often needed for variable selection consistency. Thus, we develop a novel method that employs the log penalty within the framework of an earlier network identification method, space (Sparse PArtial Correlation Estimation), and implement it in an R package, space-log. We show that space-log is computationally efficient (source code implemented in C) and performs well compared with other methods, particularly for networks with hubs.

Space-log is open source and available at GitHub, https://github.com/wuqian77/SpaceLog
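The contrast between the penalties mentioned above can be made concrete with a small Python sketch. The exact penalty used by space-log is not given in the abstract, so the form below, lam * log(|beta| + eps) shifted to equal zero at beta = 0, is an assumption, as are the function names and the value of eps.

```python
import math

def l0_penalty(beta, lam=1.0):
    # Flat cost for any nonzero coefficient, regardless of its size.
    return lam * (beta != 0)

def l1_penalty(beta, lam=1.0):
    # Lasso: cost grows linearly with coefficient size.
    return lam * abs(beta)

def log_penalty(beta, lam=1.0, eps=0.01):
    # Assumed log penalty, shifted so that the penalty is exactly 0 at beta = 0.
    return lam * (math.log(abs(beta) + eps) - math.log(eps))

# Extra cost of growing a coefficient by one unit, starting from 0 and from 1.
l1_steps = [l1_penalty(1) - l1_penalty(0), l1_penalty(2) - l1_penalty(1)]
log_steps = [log_penalty(1) - log_penalty(0), log_penalty(2) - log_penalty(1)]
```

The L1 steps are equal, so large coefficients keep being shrunk at the same rate, whereas the log penalty charges most for moving away from zero and little for further growth. That behaviour, closer to a flat L0 cost, is what places it between L0 and L1.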

NormExpression: an R package to normalize gene expression data using evaluated methods

10.1101/251140 ◽

2018 ◽

Cited By ~ 3

Author(s):

Zhenfeng Wu ◽

Weixiang Liu ◽

Xiufeng Jin ◽

Deshui Yu ◽

Hua Wang ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Expression Profiles ◽

R Package ◽

Data Normalization ◽

Expression Data ◽

Normalization Methods ◽

Normalize Gene Expression

Abstract

Data normalization is a crucial step in gene expression analysis as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate the current normalization methods, the different metrics yield inconsistent results. In this study, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric, mSCC, to evaluate 14 commonly used normalization methods, achieving consistency in our evaluation results using both bulk RNA-seq and scRNA-seq data from the same library construction protocol. This consistency has validated the underlying theory that a successful normalization method simultaneously maximizes the number of uniform genes and minimizes the correlation between the expression profiles of gene pairs. This consistency can also be used to analyze the quality of gene expression data. The gene expression data, normalization methods and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to evaluate methods (particularly some data-driven methods or their own methods) and then select the best one for data normalization in gene expression analysis.
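One plausible reading of the AUCVC metric can be sketched in Python. The abstract gives only the name and the idea of "uniform genes", so everything below is an assumption: a gene is counted as uniform at a cutoff if its coefficient of variation (CV) across samples is at most that cutoff, cutoffs sweep from 0 to 1, and the metric is the area under the resulting fraction-versus-cutoff curve. The function names are hypothetical and this is not the NormExpression API.

```python
import statistics

def coefficient_of_variation(values):
    mean = statistics.fmean(values)
    # Genes with zero mean are treated as maximally non-uniform.
    return statistics.pstdev(values) / mean if mean > 0 else float("inf")

def aucvc(expr, steps=100):
    """expr: list of genes, each a list of normalized values across samples."""
    cvs = [coefficient_of_variation(gene) for gene in expr]
    cutoffs = [i / steps for i in range(steps + 1)]
    # Fraction of genes counted as uniform at each CV cutoff.
    fractions = [sum(cv <= c for cv in cvs) / len(cvs) for c in cutoffs]
    # Trapezoidal rule over cutoffs in [0, 1].
    return sum((fractions[i] + fractions[i + 1]) / 2 / steps for i in range(steps))

stable = [[5.0, 5.0, 5.0, 5.0], [2.0, 2.0, 2.0, 2.0]]     # CV = 0 for every gene
noisy = [[0.0, 0.0, 0.0, 10.0], [0.0, 0.0, 0.0, 4.0]]     # CV > 1 for every gene
score_stable = aucvc(stable)
score_noisy = aucvc(noisy)
score_mixed = aucvc(stable + noisy)
```

Under this reading, a normalization method that leaves more genes with low CV earns a larger area, which is consistent with the abstract's statement that a successful method maximizes the number of uniform genes.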

target: an R package to predict combined function of transcription factors

F1000Research ◽

10.12688/f1000research.52173.2 ◽

2021 ◽

Vol 10 ◽

pp. 344

Author(s):

Mahmoud Ahmed ◽

Deok Ryong Kim

Keyword(s):

Gene Expression ◽

Transcription Factor ◽

Transcription Factors ◽

Binding Sites ◽

R Package ◽

Expression Data ◽

Bioconductor Package ◽

ChIP Experiment ◽

Binding Data ◽

Two Factors

Researchers use ChIP binding data to identify potential transcription factor binding sites. Similarly, they use gene expression data from sequencing or microarrays to quantify the effect of the factor overexpression or knockdown on its targets. Therefore, the integration of the binding and expression data can be used to improve the understanding of a transcription factor function. Here, we implemented the binding and expression target analysis (BETA) in an R/Bioconductor package. This algorithm ranks the targets based on the distances of their assigned peaks from the factor ChIP experiment and the signed statistics from gene expression profiling with factor perturbation. We further extend BETA to integrate two sets of data from two factors to predict their targets and their combined functions. In this article, we briefly describe the workings of the algorithm and provide a workflow with a real dataset for using it. The gene targets and the aggregate functions of transcription factors YY1 and YY2 in HeLa cells were identified. Using the same datasets, we identified the shared targets of the two factors, which were found to be, on average, more cooperatively regulated.
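The ranking step described above, which combines peak distances with expression statistics, can be sketched in Python. This is not the target package API. The exponential distance weighting and the 100 kb scale are assumptions modelled on how BETA is commonly described, the rank product is one simple way to combine the two rankings, and all names and numbers are hypothetical.

```python
import math

def regulatory_potential(peak_distances, scale=100_000):
    # Assumed weighting: peaks closer to the gene contribute more.
    return sum(math.exp(-(0.5 + 4 * d / scale)) for d in peak_distances)

def rank_descending(values):
    # Rank 1 goes to the largest value.
    order = sorted(values, key=values.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(order)}

def rank_product(peaks, stats):
    """peaks: gene -> peak distances (bp); stats: gene -> signed expression statistic."""
    genes = sorted(set(peaks) & set(stats))
    n = len(genes)
    binding_rank = rank_descending({g: regulatory_potential(peaks[g]) for g in genes})
    expression_rank = rank_descending({g: abs(stats[g]) for g in genes})
    # Smaller products indicate stronger combined evidence of being a target.
    return {g: (binding_rank[g] / n) * (expression_rank[g] / n) for g in genes}

# Hypothetical genes: A has a nearby peak and a strong response to perturbation.
peaks = {"A": [1_000], "B": [50_000], "C": [90_000, 95_000]}
stats = {"A": 5.0, "B": -1.0, "C": 0.5}
scores = rank_product(peaks, stats)
best = min(scores, key=scores.get)
```

The sign of each statistic is dropped for ranking here; it would be kept separately to label a target as up- or down-regulated by the factor.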
