baerhunter: An R package for the discovery and analysis of expressed non-coding regions in bacterial RNA-seq data

2019 ◽  
Author(s):  
A. Ozuna ◽  
D. Liberto ◽  
R. M. Joyce ◽  
K.B. Arnvig ◽  
I. Nobeli

Abstract Summary: Standard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements, e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a method implemented in R that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative. Availability: The baerhunter R package is available from: https://github.com/irilenia/baerhunter

Author(s):  
A Ozuna ◽  
D Liberto ◽  
R M Joyce ◽  
K B Arnvig ◽  
I Nobeli

Abstract Summary: Standard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements, e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a coverage-based method implemented in R that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative. Availability and implementation: The baerhunter R package is available from: https://github.com/irilenia/baerhunter Supplementary information: Supplementary data are available at Bioinformatics online.
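The abstract describes the method only as coverage-based, so the following Python sketch illustrates the general idea rather than baerhunter's actual algorithm: scan per-base read depth on one strand and report contiguous runs above a depth cut-off that fall outside existing annotations. The function name, the `min_depth` and `min_length` defaults, and the single-strand simplification are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def expressed_unannotated_regions(
    coverage: List[int],
    annotated: List[Interval],
    min_depth: int = 5,    # hypothetical depth cut-off
    min_length: int = 40,  # hypothetical minimum feature length
) -> List[Interval]:
    """Return runs with depth >= min_depth that lie outside annotated features."""
    # Mark every position already covered by an annotated feature.
    masked = [False] * len(coverage)
    for start, end in annotated:
        for pos in range(max(start, 0), min(end, len(coverage))):
            masked[pos] = True

    regions: List[Interval] = []
    run_start: Optional[int] = None
    for pos, depth in enumerate(coverage):
        is_candidate = depth >= min_depth and not masked[pos]
        if is_candidate and run_start is None:
            run_start = pos  # a new candidate run begins here
        elif not is_candidate and run_start is not None:
            if pos - run_start >= min_length:
                regions.append((run_start, pos))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(coverage) - run_start >= min_length:
        regions.append((run_start, len(coverage)))
    return regions


if __name__ == "__main__":
    depth = [0] * 10 + [8] * 50 + [0] * 10 + [9] * 50 + [0] * 5 + [7] * 10
    print(expressed_unannotated_regions(depth, annotated=[(70, 120)]))
```

In practice a tool of this kind would presumably treat each strand separately and distinguish UTR-like runs adjacent to genes from stand-alone intergenic or antisense runs; this sketch does neither.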


2020 ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


2014 ◽  
Author(s):  
Mar Gonzàlez-Porta ◽  
Alvis Brazma

In recent years, RNA sequencing has become the method of choice for the study of transcriptome composition. When working with this type of data, several tools exist to quantify differences in splicing across conditions and to assess the significance of those changes. However, the number of genes predicted to undergo differential splicing is often high, and further interpretation of the results becomes a challenging task. Here we present SwitchSeq, a novel set of tools designed to help users interpret differential splicing events that affect protein coding genes. More specifically, we provide a framework to identify switch events, i.e., cases where, for a given gene, the identity of the most abundant transcript changes across conditions. The identified events are then annotated by incorporating information from several public databases and third-party tools, and are further visualised in an intuitive manner with the independent R package tviz. All the results are displayed in a self-contained HTML document, and are also stored in txt and json format to facilitate integration with further downstream analysis tools. Such an analysis approach can complement Gene Ontology and pathway enrichment analysis, and can also serve as an aid in the validation of predicted changes in mRNA and protein abundance. The latest version of SwitchSeq, including installation instructions and use cases, can be found at https://github.com/mgonzalezporta/SwitchSeq. Additionally, the plotting capabilities are provided as an independent R package at https://github.com/mgonzalezporta/tviz.
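The definition of a switch event given above (the identity of the most abundant transcript of a gene changes across conditions) can be sketched in a few lines of Python. This is a minimal illustration of that definition only, not SwitchSeq's implementation: the data layout, function names and tie-breaking rule are assumptions, and no expression threshold or significance filter is applied.

```python
from typing import Dict, Optional, Tuple

# gene -> transcript -> abundance (e.g. mean TPM within one condition)
Expression = Dict[str, Dict[str, float]]


def dominant_transcript(abundances: Dict[str, float]) -> Optional[str]:
    """Most abundant transcript, or None if the gene is not expressed."""
    if not abundances or max(abundances.values()) <= 0:
        return None
    # Sorting first makes ties resolve to the alphabetically first transcript.
    return max(sorted(abundances), key=lambda tx: abundances[tx])


def find_switch_events(
    cond_a: Expression, cond_b: Expression
) -> Dict[str, Tuple[str, str]]:
    """Genes whose dominant transcript differs between the two conditions."""
    events: Dict[str, Tuple[str, str]] = {}
    for gene in sorted(cond_a.keys() & cond_b.keys()):
        top_a = dominant_transcript(cond_a[gene])
        top_b = dominant_transcript(cond_b[gene])
        if top_a is not None and top_b is not None and top_a != top_b:
            events[gene] = (top_a, top_b)
    return events


if __name__ == "__main__":
    control = {"g1": {"t1": 10.0, "t2": 3.0}, "g2": {"t3": 5.0, "t4": 1.0}}
    treated = {"g1": {"t1": 2.0, "t2": 9.0}, "g2": {"t3": 7.0, "t4": 2.0}}
    print(find_switch_events(control, treated))
```

Genes that are unexpressed in either condition are skipped here, since a change from "no transcript" to "some transcript" is a change in expression rather than a switch between isoforms.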


2020 ◽  
Author(s):  
Kimberley Houenoussi ◽  
Roudaina Boukheloua ◽  
Jean-Philippe Vernadet ◽  
Daniel Gautheret ◽  
Gilles Vergnaud ◽  
...  

Abstract: A large proportion of non-coding sequences in prokaryotes are transcribed, playing an important role in cell metabolism and defense against exogenous elements. This is the case for small RNAs and for clustered regularly interspaced short palindromic repeats (CRISPR) arrays. The CRISPR-Cas system is a defense mechanism that protects bacterial and archaeal genomes against invasions by mobile genetic elements such as viruses and plasmids. The CRISPR array, made of repeats separated by unique sequences called spacers, is transcribed, but the nature of the promoter and of the transcription regulation is not well known. We describe the Transcription Orientation Pipeline (TOP), which makes use of transcriptome sequence reads to recover those corresponding to a selected sequence and determine the direction of transcription. CRISPR repeat sequences extracted from CRISPRCasdb were used to test the performance of the program. Statistical tests show that CRISPR elements can be reliably oriented with as few as 100 mapped reads. TOP was applied to all the available RNA-Seq Illumina sequencing archives from species possessing a CRISPR array, allowing comparisons with programs dedicated to the orientation of CRISPR repeats. In addition, TOP was used to analyze small non-coding RNAs in Staphylococcus aureus, demonstrating that it is a valuable and convenient tool to investigate the transcription orientation of any sequence of interest. Availability and implementation: TOP is implemented in Python and is freely available via the I2BC github repository at https://github.com/i2bc/TOP.
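The abstract does not say which statistical test TOP applies, so the sketch below only illustrates the general idea of orienting a sequence from strand-specific read counts: count the reads matching each strand and call an orientation when the imbalance is unlikely under a 50/50 null. The exact two-sided binomial test, the `alpha` threshold and the function names are assumptions, and whether "forward" reads correspond to the sense strand depends on the library protocol.

```python
from math import comb
from typing import Tuple


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    if n == 0:
        return 1.0
    tail = min(k, n - k)
    # The null distribution is symmetric, so double the smaller tail (capped at 1).
    p_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)


def call_orientation(
    forward: int, reverse: int, alpha: float = 0.01  # hypothetical threshold
) -> Tuple[str, float]:
    """Return the called orientation and the p-value supporting it."""
    p_value = binomial_two_sided_p(forward, forward + reverse)
    if forward == reverse or p_value >= alpha:
        return "undetermined", p_value
    return ("forward" if forward > reverse else "reverse"), p_value


if __name__ == "__main__":
    for fwd, rev in [(90, 10), (52, 48), (10, 90)]:
        orientation, p = call_orientation(fwd, rev)
        print(fwd, rev, orientation, f"{p:.3g}")
```

With around 100 mapped reads, a strongly skewed count such as 90 versus 10 gives a very small p-value under this test, while a near-even split such as 52 versus 48 is left undetermined.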


2018 ◽  
Vol 20 (1) ◽  
pp. 46 ◽  
Author(s):  
Amanda Carvalho Garcia ◽  
Vera dos Santos ◽  
Teresa Santos Cavalcanti ◽  
Luiz Collaço ◽  
Hans Graf

The genus Herbaspirillum includes several strains isolated from different grasses. The identification of non-coding RNAs (ncRNAs) in the genus Herbaspirillum is an important step in studying the interactions of these molecules and the way they modulate physiological responses through different mechanisms, via RNA–RNA or RNA–protein interaction. Interaction with their targets occurs through the perfect pairing of short sequences (cis-encoded ncRNAs) or through the partial pairing of short sequences (trans-encoded ncRNAs); in the trans-acting class, the chaperone Hfq can stabilize these interactions. In addition, there are riboswitches, located at the 5′ end of mRNA and less often at the 3′ end, which respond to environmental signals, high temperatures, or small ligand molecules. Recently, CRISPR (clustered regularly interspaced short palindromic repeats) have been described in prokaryotes; they consist of serial repeats of base sequences separated by spacer DNA resulting from previous exposure to exogenous plasmids or bacteriophages. We identified 285 ncRNAs in Herbaspirillum seropedicae (H. seropedicae) SmR1, expressed under different experimental conditions in RNA-seq data, classified them as cis-encoded or trans-encoded ncRNAs, and detected riboswitch domains and CRISPR sequences. The results provide a better understanding of the participation of this type of RNA in the regulation of metabolism in bacteria of the genus Herbaspirillum.


2020 ◽  
Vol 36 (10) ◽  
pp. 3115-3123 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation: Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, and they often show poor or undemonstrated similarity to real data. Results: In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Krzysztof J Szkop ◽  
David S Moss ◽  
Irene Nobeli

Abstract Motivation: We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results: We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation: The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
David Gerard

Abstract: With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations of theoretical models, this can result in unsubstantiated claims about a method's performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.

