baerhunter: An R package for the discovery and analysis of expressed non-coding regions in bacterial RNA-seq data

AbstractSummaryStandard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a method implemented in R, that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative.AvailabilityThe baerhunter R package is available from: https://github.com/irilenia/[email protected]

Download Full-text

baerhunter: an R package for the discovery and analysis of expressed non-coding regions in bacterial RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz643 ◽

2019 ◽

Cited By ~ 1

Author(s):

A Ozuna ◽

D Liberto ◽

R M Joyce ◽

K B Arnvig ◽

I Nobeli

Keyword(s):

Small Rnas ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Untranslated Regions ◽

Rna Seq ◽

Coding Regions ◽

Bacterial Rna ◽

Non Coding Rnas ◽

Downstream Analysis

Abstract Summary Standard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a coverage-based method implemented in R, that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative. Availability and implementation The baerhunter R package is available from: https://github.com/irilenia/baerhunter Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

TrancriptomeReconstructoR, A Data-Driven Annotation of Complex Transcriptomes

10.21203/rs.3.rs-131404/v1 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing number of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts.We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A.thaliana genome research.Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.

Download Full-text

Identification, annotation and visualisation of extreme changes in splicing from RNA-seq experiments with SwitchSeq

10.1101/005967 ◽

2014 ◽

Cited By ~ 6

Author(s):

Mar Gonzàlez-Porta ◽

Alvis Brazma

Keyword(s):

Enrichment Analysis ◽

R Package ◽

Third Party ◽

Pathway Enrichment Analysis ◽

Rna Seq ◽

Differential Splicing ◽

Protein Coding ◽

Downstream Analysis ◽

Intuitive Manner ◽

Abundant Transcript

In the past years, RNA sequencing has become the method of choice for the study of transcriptome composition. When working with this type of data, several tools exist to quantify differences in splicing across conditions and to address the significance of those changes. However, the number of genes predicted to undergo differential splicing is often high, and further interpretation of the results becomes a challenging task. Here we present SwitchSeq, a novel set of tools designed to help the users in the interpretation of differential splicing events that affect protein coding genes. More specifically, we provide a framework to identify switch events, i.e., cases where, for a given gene, the identity of the most abundant transcript changes across conditions. The identified events are then annotated by incorporating information from several public databases and third-party tools, and are further visualised in an intuitive manner with the independent R package tviz. All the results are displayed in a self-contained HTML document, and are also stored in txt and json format to facilitate the integration with any further downstream analysis tools. Such analysis approach can be used complementarily to Gene Ontology and pathway enrichment analysis, and can also serve as an aid in the validation of predicted changes in mRNA and protein abundance. The latest version of SwitchSeq, including installation instructions and use cases, can be found at https://github.com/mgonzalezporta/SwitchSeq. Additionally, the plot capabilities are provided as an independent R package at https://github.com/mgonzalezporta/tviz.

Download Full-text

TOP the Transcription Orientation Pipeline and its use to investigate the transcription of non-coding regions: assessment with CRISPR direct repeats and intergenic sequences

10.1101/2020.01.15.903914 ◽

2020 ◽

Author(s):

Kimberley Houenoussi ◽

Roudaina Boukheloua ◽

Jean-Philippe Vernadet ◽

Daniel Gautheret ◽

Gilles Vergnaud ◽

...

Keyword(s):

Defense Mechanism ◽

Statistical Tests ◽

Cell Metabolism ◽

Rna Seq ◽

Direct Repeats ◽

Coding Regions ◽

Repeat Sequences ◽

Crispr Array ◽

Intergenic Sequences ◽

Non Coding Rnas

AbstractA large proportion of non-coding sequences in prokaryotes are transcribed, playing an important role in the cell metabolism and defense against exogenous elements. This is the case of small RNAs and of clustered regularly interspaced short palindromic repeats “CRISPR” arrays. The CRISPR-Cas system is a defense mechanism that protects bacterial and archaeal genomes against invasions by mobile genetic elements such as viruses and plasmids. The CRISPR array, made of repeats separated by unique sequences called spacers, is transcribed but the nature of the promoter and of the transcription regulation is not well known. We describe the Transcription Orientation Pipeline (TOP) which makes use of transcriptome sequence reads to recover those corresponding to a selected sequence, and determine the direction of the transcription. CRISPR repeat sequences extracted from CRISPRCasdb were used to test the performances of the program. Statistical tests show that CRISPR elements can be reliably oriented with as little as 100 mapped reads. TOP was applied to all the available RNA-Seq Illumina sequencing archives from species possessing a CRISPR array, allowing comparisons with programs dedicated to the orientation of CRISPR repeats. In addition TOP was used to analyze small non-coding RNAs in Staphylococcus aureus, demonstrating that it is a valuable and convenient tool to investigate the transcription orientation of any sequence of interest.Availability and implementationTOPs is implemented in Python and is freely available via the I2BC github repository at https://github.com/i2bc/TOP.

Download Full-text

TrancriptomeReconstructoR: data-driven annotation of complex transcriptomes

10.1101/2020.12.10.418897 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

AbstractBackgroundThe quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing number of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.ResultsWe developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5’ and 3’ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts.We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A.thaliana genome research.ConclusionsOur proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.

Download Full-text

Bacterial Small RNAs in the Genus Herbaspirillum spp.

International Journal of Molecular Sciences ◽

10.3390/ijms20010046 ◽

2018 ◽

Vol 20 (1) ◽

pp. 46 ◽

Cited By ~ 1

Author(s):

Amanda Carvalho Garcia ◽

Vera dos Santos ◽

Teresa Santos Cavalcanti ◽

Luiz Collaço ◽

Hans Graf

Keyword(s):

Small Rnas ◽

Physiological Responses ◽

Rna Seq ◽

Herbaspirillum Seropedicae ◽

Experimental Conditions ◽

Environmental Signals ◽

Base Sequences ◽

Non Coding Rnas ◽

Rna Interaction ◽

Spacer Dna

The genus Herbaspirillum includes several strains isolated from different grasses. The identification of non-coding RNAs (ncRNAs) in the genus Herbaspirillum is an important stage studying the interaction of these molecules and the way they modulate physiological responses of different mechanisms, through RNA–RNA interaction or RNA–protein interaction. This interaction with their target occurs through the perfect pairing of short sequences (cis-encoded ncRNAs) or by the partial pairing of short sequences (trans-encoded ncRNAs). However, the companion Hfq can stabilize interactions in the trans-acting class. In addition, there are Riboswitches, located at the 5′ end of mRNA and less often at the 3′ end, which respond to environmental signals, high temperatures, or small binder molecules. Recently, CRISPR (clustered regularly interspaced palindromic repeats), in prokaryotes, have been described that consist of serial repeats of base sequences (spacer DNA) resulting from a previous exposure to exogenous plasmids or bacteriophages. We identified 285 ncRNAs in Herbaspirillum seropedicae (H. seropedicae) SmR1, expressed in different experimental conditions of RNA-seq material, classified as cis-encoded ncRNAs or trans-encoded ncRNAs and detected RNA riboswitch domains and CRISPR sequences. The results provide a better understanding of the participation of this type of RNA in the regulation of the metabolism of bacteria of the genus Herbaspirillum spp.

Download Full-text

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only few scRNA-seq count data simulators are available, often showing poor or not demonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim allows to generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably or better than one of the most used scRNA-seq simulator, Splat. In particular, SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/? q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

flexiMAP: a regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa854 ◽

2020 ◽

Author(s):

Krzysztof J Szkop ◽

David S Moss ◽

Irene Nobeli

Keyword(s):

Simulated Data ◽

Alternative Polyadenylation ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Beta Regression ◽

Rna Seq ◽

Good Balance ◽

Flexible Modeling ◽

Specificity And Sensitivity

Abstract Motivation We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Data-based RNA-seq Simulations by Binomial Thinning

10.1101/758524 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Gerard

Keyword(s):

Theoretical Model ◽

Single Cell ◽

Differential Expression Analysis ◽

Simulated Data ◽

Real Data ◽

Theoretical Models ◽

Simulation Method ◽

R Package ◽

Rna Seq ◽

Ideal Model

AbstractWith the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in un-substantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Net-work: https://cran.r-project.org/package=seqgendiff.

Download Full-text