Defining data-driven primary transcript annotations with primaryTranscriptAnnotation in R

Warren D Anderson; Fabiana M Duarte; Mete Civelek; Michael J Guertin

doi:10.1093/bioinformatics/btaa011

Bioinformatics ◽

10.1093/bioinformatics/btaa011 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2926-2928 ◽

Cited By ~ 1

Author(s):

Warren D Anderson ◽

Fabiana M Duarte ◽

Mete Civelek ◽

Michael J Guertin

Keyword(s):

Regulatory Networks ◽

De Novo ◽

R Package ◽

Data Driven ◽

Transcriptional Unit ◽

Supplementary Information ◽

Primary Transcript ◽

Nascent Transcript ◽

Unbiased Manner ◽

Mrna Gene

Abstract

Summary: Nascent transcript measurements derived from run-on sequencing experiments are critical for the investigation of transcriptional mechanisms and regulatory networks. However, conventional mRNA gene annotations significantly differ from the boundaries of primary transcripts. New primary transcript annotations are needed to accurately interpret run-on data. We developed the primaryTranscriptAnnotation R package to infer the transcriptional start and termination sites of primary transcripts from genomic run-on data. We then used these inferred coordinates to annotate transcriptional units identified de novo. This package provides the novel utility to integrate data-driven primary transcript annotations with transcriptional unit coordinates identified in an unbiased manner. Highlighting the importance of using accurate primary transcript coordinates, we demonstrate that this new methodology increases the detection of differentially expressed transcripts and provides more accurate quantification of RNA polymerase pause indices.

Availability and implementation: https://github.com/WarrenDavidAnderson/genomicsRpackage/tree/master/primaryTranscriptAnnotation

Supplementary information: Supplementary data are available at Bioinformatics online.
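To illustrate why the abstract above ties pause-index accuracy to transcript coordinates, here is a minimal Python sketch. It is not code from the primaryTranscriptAnnotation package. It assumes the commonly used definition of a pause index (mean read density in a promoter-proximal window divided by mean density over the rest of the gene body). The function name `pause_index`, the 150 bp window, and the toy coverage vector are all hypothetical.

```python
from typing import Sequence


def pause_index(
    coverage: Sequence[float],
    tss: int,
    tts: int,
    pause_window: int = 150,
) -> float:
    """Mean density in the promoter-proximal window over mean gene-body density.

    coverage     -- per-base read counts on the transcribed strand (hypothetical input)
    tss, tts     -- 0-based transcript start (inclusive) and end (exclusive)
    pause_window -- width of the promoter-proximal window; 150 bp is an assumption
    """
    if not 0 <= tss < tts <= len(coverage):
        raise ValueError("tss/tts must lie within the coverage vector")
    pause_end = min(tss + pause_window, tts)
    pause = coverage[tss:pause_end]
    body = coverage[pause_end:tts]
    if not body:
        raise ValueError("transcript is too short to have a gene-body region")
    pause_density = sum(pause) / len(pause)
    body_density = sum(body) / len(body)
    if body_density == 0:
        return float("inf")
    return pause_density / body_density


# Toy locus: the annotated start is at 0, but transcription really begins at
# position 200 with a pause peak, followed by uniform gene-body signal.
toy_coverage = [0] * 200 + [10] * 100 + [1] * 700

annotated = pause_index(toy_coverage, tss=0, tts=1000)    # window misses the peak
inferred = pause_index(toy_coverage, tss=200, tts=1000)   # window covers the peak
```

In this toy case the misplaced start puts the pause window over empty sequence, so the peak is counted as gene-body signal and the index collapses to zero, whereas the data-driven start recovers a clear pause signal.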

Defining data-driven primary transcript annotations with primaryTranscriptAnnotation in R

10.1101/779587 ◽

2019 ◽

Author(s):

Warren D. Anderson ◽

Fabiana M. Duarte ◽

Mete Civelek ◽

Michael J. Guertin

Keyword(s):

Regulatory Networks ◽

De Novo ◽

Cell Types ◽

R Package ◽

Data Driven ◽

Transcriptional Unit ◽

Primary Transcript ◽

Nascent Transcript ◽

Unbiased Manner ◽

Transcription Data

Nascent transcript measurements derived from run-on sequencing experiments are critical for the investigation of transcriptional mechanisms and regulatory networks. However, conventional gene annotations specify the boundaries of mRNAs, which significantly differ from the boundaries of primary transcripts. Moreover, transcript isoforms with distinct transcription start and end coordinates can vary between cell types. Therefore, new primary transcript annotations are needed to accurately interpret run-on data. We developed the primaryTranscriptAnnotation R package to infer the transcriptional start and termination sites of annotated genes from genomic run-on data. We then used these inferred coordinates to annotate transcriptional units identified de novo. Hence, this package provides the novel utility to integrate data-driven primary transcript annotations with transcriptional unit coordinates identified in an unbiased manner. Our analyses demonstrated that this new methodology increases the sensitivity for detecting differentially expressed transcripts and provides more accurate quantification of RNA polymerase pause indices, consistent with the importance of using accurate primary transcript coordinates for interpreting genomic nascent transcription data.

Availability: https://github.com/WarrenDavidAnderson/genomicsRpackage/tree/master/primaryTranscriptAnnotation

De Novo Protein Design for Novel Folds using Guided Conditional Wasserstein Generative Adversarial Networks (gcWGAN)

10.1101/769919 ◽

2019 ◽

Cited By ~ 4

Author(s):

Mostafa Karimi ◽

Shaowen Zhu ◽

Yue Cao ◽

Yang Shen

Keyword(s):

Protein Design ◽

Sequence Space ◽

De Novo ◽

Sequence Data ◽

Generative Models ◽

Current Data ◽

Data Driven ◽

Supplementary Information ◽

Generative Adversarial Networks ◽

Sequence Structure

Abstract

Motivation: Facing data quickly accumulating on protein sequence and structure, this study addresses the following question: to what extent could current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds?

Results: We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate higher yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of the basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space.

Availability: Data and source codes will be available upon [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

sismonr: simulation of in silico multi-omic networks with adjustable ploidy and post-transcriptional regulation in R

Bioinformatics ◽

10.1093/bioinformatics/btaa002 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2938-2940

Author(s):

Olivia Angelin-Bonnet ◽

Patrick J Biggs ◽

Samantha Baldwin ◽

Susan Thomson ◽

Matthieu Vignes

Keyword(s):

In Silico ◽

Regulatory Networks ◽

Anthocyanin Biosynthesis ◽

Expression Profiles ◽

R Package ◽

Supplementary Information ◽

Protein Coding ◽

Stochastic Simulation Algorithms ◽

Post Transcriptional Regulation ◽

Biosynthesis Regulation

Abstract

Summary: We present sismonr, an R package for an integral generation and simulation of in silico biological systems. The package generates gene regulatory networks, which include protein-coding and non-coding genes along with different transcriptional and post-transcriptional regulations. The effect of genetic mutations on the system behaviour is accounted for via the simulation of genetically different in silico individuals. The ploidy of the system is not restricted to the usual haploid or diploid situations but can be defined by the user to higher ploidies. A choice of stochastic simulation algorithms allows us to simulate the expression profiles of the genes in the in silico system. We illustrate the use of sismonr by simulating the anthocyanin biosynthesis regulation pathway for three genetically distinct in silico plants.

Availability and implementation: The sismonr package is implemented in R and Julia and is publicly available on the CRAN repository (https://CRAN.R-project.org/package=sismonr). A detailed tutorial is available from GitHub at https://oliviaab.github.io/sismonr/.

Supplementary information: Supplementary data are available at Bioinformatics online.

MEScan: a powerful statistical framework for genome-scale mutual exclusivity analysis of cancer mutations

Bioinformatics ◽

10.1093/bioinformatics/btaa957 ◽

2020 ◽

Author(s):

Sisheng Liu ◽

Jinpeng Liu ◽

Yanqi Xie ◽

Tingting Zhai ◽

Eugene W Hinderer ◽

...

Keyword(s):

Mutation Rate ◽

De Novo ◽

R Package ◽

Supplementary Information ◽

Driver Mutations ◽

Mutual Exclusivity ◽

Statistical Framework ◽

Gene Sets ◽

Genome Wide ◽

Background Mutation Rate

Abstract

Motivation: Cancer somatic driver mutations associated with genes within a pathway often show a mutually exclusive pattern across a cohort of patients. This mutually exclusive mutational signal has been frequently used to distinguish driver from passenger mutations and to investigate relationships among driver mutations. Current methods for de novo discovery of mutually exclusive mutational patterns are limited because the heterogeneity in background mutation rate can confound mutational patterns, and the presence of highly mutated genes can lead to spurious patterns. In addition, most methods only focus on a limited number of pre-selected genes and are unable to perform genome-wide analysis due to computational inefficiency.

Results: We introduce a statistical framework, MEScan, for accurate and efficient mutual exclusivity analysis at the genomic scale. Our framework contains a fast and powerful statistical test for mutual exclusivity with adjustment of the background mutation rate and impact of highly mutated genes, and a multi-step procedure for genome-wide screening with the control of false discovery rate. We demonstrate that MEScan more accurately identifies mutually exclusive gene sets than existing methods and is at least two orders of magnitude faster than most methods. By applying MEScan to data from four different cancer types and pan-cancer, we have identified several biologically meaningful mutually exclusive gene sets.

Availability and implementation: MEScan is available as an R package at https://github.com/MarkeyBBSRF/MEScan.

Supplementary information: Supplementary data are available at Bioinformatics online.

Plasmid Profiler: Comparative Analysis of Plasmid Content in WGS Data

10.1101/121350 ◽

2017 ◽

Cited By ~ 2

Author(s):

Adrian Zetner ◽

Jennifer Cabral ◽

Laura Mataseje ◽

Natalie C Knox ◽

Philip Mabon ◽

...

Keyword(s):

Comparative Analysis ◽

De Novo ◽

Sequence Data ◽

Health Agency ◽

R Package ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Supplementary Information ◽

Plasmid Content ◽

Link Type

Abstract

Summary: Comparative analysis of bacterial plasmids from whole genome sequence (WGS) data generated from short read sequencing is challenging. This is due to the difficulty in identifying contigs harbouring plasmid sequence data, and further difficulty in assembling such contigs into a full plasmid. As such, few software programs and bioinformatics pipelines exist to perform comprehensive comparative analyses of plasmids within and amongst sequenced isolates. To address this gap, we have developed Plasmid Profiler, a pipeline to perform comparative plasmid content analysis without the need for de novo assembly. The pipeline is designed to rapidly identify plasmid sequences by mapping reads to a plasmid reference sequence database. Predicted plasmid sequences are then annotated with their incompatibility group, if known. The pipeline allows users to query plasmids for genes or regions of interest and visualize results as an interactive heat map.

Availability and Implementation: Plasmid Profiler is freely available software released under the Apache 2.0 open source software license.

A stand-alone version of the entire Plasmid Profiler pipeline is available as a Docker container at https://hub.docker.com/r/phacnml/plasmidprofiler_0_1_6/

The conda recipe for the Plasmid R package is available at: https://anaconda.org/bioconda/r-plasmidprofiler

The custom Plasmid Profiler R package is also available as a CRAN package at https://cran.r-project.org/web/packages/Plasmidprofiler/index.html

Galaxy tools associated with the pipeline are available as a Galaxy tool suite at https://toolshed.g2.bx.psu.edu/repository?repository_id=55e082200d16a504

The source code is available at: https://github.com/phac-nml/plasmidprofiler

The Galaxy implementation is available at: https://github.com/phac-nml/plasmidprofiler-galaxy

Contact: Email: [email protected]: National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, Canada

Supplementary information: Documentation: http://plasmid-profiler.readthedocs.io/en/latest/

GNET2: an R package for constructing gene regulatory networks from transcriptomic data

Bioinformatics ◽

10.1093/bioinformatics/btaa902 ◽

2020 ◽

Author(s):

Chen Chen ◽

Jie Hou ◽

Xiaowen Shi ◽

Hua Yang ◽

James A Birchler ◽

...

Keyword(s):

Gene Regulatory Networks ◽

Regulatory Networks ◽

Data Exchange ◽

Graphical Model ◽

R Package ◽

Supplementary Information ◽

Original Algorithm ◽

Transcriptomic Data ◽

Regulatory Module ◽

Gene Regulatory

Abstract

Motivation: The Gene Network Estimation Tool (GNET) is designed to build gene regulatory networks (GRNs) from transcriptomic gene expression data with a probabilistic graphical model. The data preprocessing, model construction and visualization modules of the original GNET software were developed on different programming platforms, which made the software inconvenient for users to deploy and use.

Results: Here, we present GNET2, an improved implementation of GNET as an integrated R package. GNET2 provides more flexibility for parameter initialization and regulatory module construction based on the core iterative modeling process of the original algorithm. The data exchange interface of GNET2 is handled within an R session automatically. Given the growing demand for regulatory network reconstruction from transcriptomic data, GNET2 offers a convenient option for GRN inference on large datasets.

Availability and implementation: The source code of GNET2 is available at https://github.com/jianlin-cheng/GNET2.

Supplementary information: Supplementary data are available at Bioinformatics online.

BICORN: An R package for integrative inference of de novo cis-regulatory modules

10.1101/560557 ◽

2019 ◽

Author(s):

Xi Chen

Keyword(s):

Gene Expression ◽

Transcription Factor ◽

Gene Expression Data ◽

Regulatory Networks ◽

Target Genes ◽

De Novo ◽

R Package ◽

Expression Data ◽

Regulatory Modules ◽

Regulatory Module

Abstract

BICORN is an R package developed to integrate prior transcription factor binding information and gene expression data for cis-regulatory module (CRM) inference. BICORN searches for a list of candidate CRMs from binary bindings on potential target genes. Applying Gibbs sampling, BICORN samples CRMs for each gene using the fitting performance of transcription factor activities and regulation strengths of TFs in each CRM on gene expression. Consequently, sparse regulatory networks are inferred as functional CRMs regulating target genes. The BICORN package is implemented in R and is available at https://cran.r-project.org/web/packages/BICORN/index.html.

Iterative Supervised Principal Component Analysis-Driven Ligand Design for Regioselective Ti-Catalyzed Pyrrole Synthesis

10.26434/chemrxiv.12284378 ◽

2020 ◽

Author(s):

Xin Yi See ◽

Benjamin Reiner ◽

Xuelan Wen ◽

T. Alexander Wheeler ◽

Channing Klein ◽

...

Keyword(s):

Principal Component Analysis ◽

De Novo ◽

Principal Component ◽

Component Analysis ◽

Catalyst Design ◽

Data Driven ◽

Initial Reaction ◽

Training Set ◽

Reaction Conditions ◽

Component Loadings

Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via Ti-catalyzed formal [2+2+1] cycloaddition of phenyl propyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space along with k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (>90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

Rna Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance.

Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets.

Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Supplementary information: Supplementary data are available at Bioinformatics online.
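The idea described above, storing a subset as a reference to its parent assay rather than as a copy, can be sketched in a few lines of Python. This is only a conceptual illustration, not the ExperimentSubset API (which is an R/Bioconductor package). The names `SubsetView`, `SubsetStore`, `create_subset`, `get` and `parent_of` are hypothetical, and the sketch assumes subsets are taken directly from a top-level assay rather than from another subset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence

Matrix = List[List[float]]


@dataclass
class SubsetView:
    """A named subset kept as row/column indices into a parent assay."""
    parent: str
    rows: List[int]
    cols: List[int]


@dataclass
class SubsetStore:
    """Holds full assays once; subsets only record indices (hypothetical design)."""
    assays: Dict[str, Matrix]
    subsets: Dict[str, SubsetView] = field(default_factory=dict)

    def create_subset(
        self,
        name: str,
        parent: str,
        rows: Optional[Sequence[int]] = None,
        cols: Optional[Sequence[int]] = None,
    ) -> None:
        matrix = self.assays[parent]
        n_rows = len(matrix)
        n_cols = len(matrix[0]) if matrix else 0
        # None means "keep everything" along that dimension.
        row_idx = list(range(n_rows)) if rows is None else list(rows)
        col_idx = list(range(n_cols)) if cols is None else list(cols)
        self.subsets[name] = SubsetView(parent, row_idx, col_idx)

    def get(self, name: str) -> Matrix:
        """Materialise the subset on demand from the parent assay."""
        view = self.subsets[name]
        matrix = self.assays[view.parent]
        return [[matrix[r][c] for c in view.cols] for r in view.rows]

    def parent_of(self, name: str) -> str:
        """Provenance: which assay the subset was derived from."""
        return self.subsets[name].parent


store = SubsetStore(assays={"counts": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
store.create_subset("filtered", parent="counts", rows=[0, 2], cols=[1, 2])
filtered = store.get("filtered")
```

Because only the index lists are stored, the parent matrix exists once in memory, and the link back to the parent is available at any time through `parent_of`.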

BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R

Bioinformatics ◽

10.1093/bioinformatics/btab121 ◽

2021 ◽

Author(s):

Darawan Rinchai ◽

Jessica Roelands ◽

Mohammed Toufiq ◽

Wouter Hendrickx ◽

Matthew C Altman ◽

...

Keyword(s):

Transcript Abundance ◽

R Package ◽

Supplementary Information ◽

Illustrative Case ◽

Bioinformatic Tools ◽

Transcriptional Module ◽

Wide Range ◽

Downstream Analysis ◽

Computing Module ◽

Parallel Workflow

Abstract

Motivation: We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires.

Results: We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module level, and the results to be displayed as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states.

Availability: The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module

Supplementary information: Supplementary data are available at Bioinformatics online.
