MassBlast: A workflow to accelerate RNA-seq and DNA database analysis

Mapping Intimacies ◽

10.1101/131953 ◽

2017 ◽

Cited By ~ 4

Author(s):

André Veríssimo ◽

Jean-Etienne Bassard ◽

Alice Julien-Laferrière ◽

Marie-France Sagot ◽

Susana Vinga

Keyword(s):

Source Code ◽

Rna Seq ◽

Database Analysis ◽

User Input ◽

Manual Curation ◽

Dna Database ◽

New Protein Families ◽

Full Analysis ◽

New Protein ◽

Tools And Methods

Summary: Current workflows for sequence analysis depend heavily on user input and manual curation. New specialized tools and methods are appearing all the time, but the actions required for a full analysis are disconnected and very time-consuming. The software we propose, MassBlast, combines BLAST+ and an automated workflow analysis to filter the results and significantly improve the annotation of multiple sequencing databases for exploring new biosynthetic pathways and new protein families, among other applications. MassBlast is fully configurable and reproducible. Availability and Implementation: The MassBlast package is written in Ruby. Source code and releases are freely available from GitHub (https://github.com/averissimo/mass-blast) for all major platforms (Linux, MS Windows and OS X) under the GPLv3 license.

Meta-Analysis of Oxidative Transcriptomes in Insects

Antioxidants ◽

10.3390/antiox10030345 ◽

2021 ◽

Vol 10 (3) ◽

pp. 345

Author(s):

Hidemasa Bono

Keyword(s):

Oxidative Stress ◽

Stress Response ◽

Meta Analysis ◽

Adherens Junction ◽

Enrichment Analysis ◽

Oxidative Stress Response ◽

Insect Species ◽

Rna Seq ◽

Manual Curation ◽

Analysis Workflow

Data accumulation in public databases has resulted in extensive use of meta-analysis, a statistical analysis that combines the results of multiple studies. Oxidative stress occurs when there is an imbalance between free radical activity and antioxidant activity, which can be studied in insects by transcriptome analysis. This study aimed to apply a meta-analysis approach to evaluate insect oxidative transcriptomes using publicly available data. We collected oxidative stress response-related RNA sequencing (RNA-seq) data for a wide variety of insect species, mainly from public gene expression databases, by manual curation. Only RNA-seq data of Drosophila melanogaster were found and were systematically analyzed using a newly developed RNA-seq analysis workflow for species without a reference genome sequence. The results were evaluated by two metric methods to construct a reference dataset for oxidative stress response studies. Many genes were found to be downregulated under oxidative stress and related to organ system process (GO:0003008) and adherens junction organization (GO:0034332) by gene enrichment analysis. A cross-species analysis was also performed. RNA-seq data of Caenorhabditis elegans were curated, since no RNA-seq data of insect species are currently available in public databases. This method, including the workflow developed, represents a powerful tool for deciphering conserved networks in oxidative stress response.

Leveraging Curation Among Escherichia coli Pathway/Genome Databases Using Ortholog-Based Annotation Propagation

Frontiers in Microbiology ◽

10.3389/fmicb.2021.614355 ◽

2021 ◽

Vol 12 ◽

Author(s):

Suzanne Paley ◽

Ingrid M. Keseler ◽

Markus Krummenacker ◽

Peter D. Karp

Keyword(s):

Escherichia Coli ◽

Protein Complexes ◽

Limited Resources ◽

Genome Database ◽

Single Strain ◽

Manual Curation ◽

Genome Databases ◽

New Knowledge ◽

K 12 ◽

New Protein

Updating genome databases to reflect newly published molecular findings for an organism was hard enough when only a single strain of a given organism had been sequenced. With multiple sequenced strains now available for many organisms, the challenge has grown significantly because of the still-limited resources available for the manual curation that corrects errors and captures new knowledge. We have developed a method to automatically propagate multiple types of curated knowledge from genes and proteins in one genome database to their orthologs in uncurated databases for related strains, imposing several quality-control filters to reduce the chances of introducing errors. We have applied this method to propagate information from the highly curated EcoCyc database for Escherichia coli K–12 to databases for 480 other Escherichia coli strains in the BioCyc database collection. The increase in value and utility of the target databases after propagation is considerable. Target databases received updates for an average of 2,535 proteins each. In addition to widespread addition and regularization of gene and protein names, 97% of the target databases were improved by the addition of at least 200 new protein complexes, at least 800 new or updated reaction assignments, and at least 2,400 sets of GO annotations.

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second-best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
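
To make the idea of counting k-mers and discarding likely-erroneous ones concrete, here is a minimal Python sketch. It is not the CQF-deNoise algorithm: it uses an ordinary in-memory counter and a fixed, hypothetical `min_count` threshold applied after counting, whereas the abstract describes removing false k-mers dynamically during counting. The reads and function names are made up for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count=2):
    """Keep only k-mers seen at least min_count times.

    min_count is a hypothetical threshold; rare k-mers are treated
    here as likely products of sequencing errors.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads: the second and third differ from the first by one base each.
reads = ["ACGTACGT", "ACGTACGA", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
filtered = drop_rare_kmers(counts, min_count=2)
```

In this toy input, the k-mers introduced by the single-base differences each occur once and are dropped, while the shared k-mers survive.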

Evaluation of SQL Injection Vulnerability Detection Tools

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a2648.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 1747-1751

Keyword(s):

Static Analysis ◽

Web Applications ◽

Source Code ◽

False Negative ◽

Vulnerability Detection ◽

Sql Injection ◽

User Input ◽

Root Cause ◽

Analysis Tools ◽

Free Open Source

SQL injection vulnerabilities have been predominant in database-driven web applications for almost a decade. Exploiting such vulnerabilities enables attackers to gain unauthorized access to back-end databases by manipulating user input to alter the original SQL statements. Testing web applications for SQL injection vulnerabilities before deployment is essential to eliminate them. However, checking for such vulnerabilities by hand is tedious, difficult, and time-consuming. Web vulnerability static analysis tools are software tools that automatically identify the root cause of SQL injection vulnerabilities in web application source code. In this paper, we test and evaluate three free/open source static analysis tools using eight web applications with numerous known vulnerabilities, primarily for false negative rates. The evaluation results were compared and analysed, and they indicate a need to improve the tools.
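
As an illustration of how manipulated user input can alter a SQL statement, the following Python sketch contrasts a query built by string concatenation with a parameterized one. It uses the standard-library `sqlite3` module with an in-memory database; the table, the data, and the function names are hypothetical and are not taken from the paper or from the tools it evaluates.

```python
import sqlite3

# Hypothetical in-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def lookup_unsafe(name):
    """Vulnerable: user input is spliced directly into the SQL text."""
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_safe(name):
    """Parameterized: the driver passes the input as data, not as SQL."""
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()


# The payload closes the quoted string and appends an always-true condition.
payload = "nobody' OR '1'='1"
unsafe_rows = lookup_unsafe(payload)  # every row matches
safe_rows = lookup_safe(payload)      # no user has this literal name
```

The concatenated version is the kind of root cause a static analysis tool would be expected to flag; the parameterized version treats the same payload as an ordinary, non-matching name.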

LIONS: analysis suite for detecting and quantifying transposable element initiated transcription from RNA-seq

Bioinformatics ◽

10.1093/bioinformatics/btz130 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3839-3841 ◽

Cited By ~ 6

Author(s):

Artem Babaian ◽

I Richard Thompson ◽

Jake Lever ◽

Liane Gagnier ◽

Mohammad M Karimi ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Test Data ◽

Source Code ◽

Supplementary Information ◽

Transcriptional Networks ◽

Supplementary Data ◽

Rna Seq ◽

Transcriptional Initiation ◽

Instruction Manual

Summary: Transposable elements (TEs) influence the evolution of novel transcriptional networks, yet the specific and meaningful interpretation of how TE-derived transcriptional initiation contributes to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of RNA-seq data to specifically detect and quantify TE-initiated transcripts. Availability and implementation: Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/LIONS. Supplementary information: Supplementary data are available at Bioinformatics online.

STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa755 ◽

2020 ◽

Cited By ~ 1

Author(s):

Massimo Andreatta ◽

Santiago J Carmona

Keyword(s):

Single Cell ◽

Distance Measure ◽

Source Code ◽

Cell Types ◽

R Package ◽

Computational Method ◽

Biological Variability ◽

Rna Seq ◽

Batch Effects ◽

Guide Trees

Summary: STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation: Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.

consensusDE: an R package for assessing consensus of multiple RNA-seq algorithms with RUV correction

10.1101/692582 ◽

2019 ◽

Author(s):

Ashley J. Waardenberg ◽

Matt A. Field

Keyword(s):

Differential Expression ◽

Simulated Data ◽

R Package ◽

Bioconductor Package ◽

Rna Seq ◽

User Input ◽

Extensive Evaluation ◽

The Impact ◽

Transcript Database ◽

Multiple Algorithms

Extensive evaluation of RNA-seq methods has demonstrated that no single algorithm consistently outperforms all others. Removal of unwanted variation (RUV) has also been proposed as a method for stabilizing differential expression (DE) results. Despite this, it remains a challenge to run multiple RNA-seq algorithms to identify significant differences common to multiple algorithms, whilst also integrating and assessing the impact of RUV in all algorithms. consensusDE was developed to automate the process of identifying significant DE by combining the results from multiple algorithms with minimal user input and with the option to automatically integrate RUV. consensusDE only requires a table describing the sample groups, a directory containing BAM files or preprocessed count tables and an optional transcript database for annotation. It supports merging of technical replicates and paired analyses, and outputs a compendium of plots to guide the user in subsequent analyses. Herein, we also assess the ability of RUV to improve DE stability when combined with multiple algorithms through application to real and simulated data. We find that, although RUV demonstrated improved FDR in a setting of low replication, the effect was algorithm specific and diminished with increased replication, reinforcing increased replication for recovery of true DE genes. We finish by offering some rules and considerations for the application of RUV in a consensus-based setting. consensusDE is freely available, implemented in R and available as a Bioconductor package, under the GPL-3 license, along with a comprehensive vignette describing functionality: http://bioconductor.org/packages/consensusDE/
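
The notion of "significant differences common to multiple algorithms" can be sketched as a set intersection. The Python example below is only an illustration of that idea, not the consensusDE API (consensusDE is an R package): the method names, gene names, adjusted p-values and the `alpha` cutoff are all invented, and the package's actual rule for combining results may differ.

```python
def consensus_significant(results, alpha=0.05):
    """Return the genes whose adjusted p-value is below alpha in every method.

    results maps a method name to a {gene: adjusted_p_value} table.
    """
    significant_sets = [
        {gene for gene, p in table.items() if p < alpha}
        for table in results.values()
    ]
    if not significant_sets:
        return set()
    return set.intersection(*significant_sets)


# Made-up adjusted p-values from three hypothetical DE methods.
results = {
    "method_a": {"gene1": 0.001, "gene2": 0.20, "gene3": 0.03},
    "method_b": {"gene1": 0.010, "gene2": 0.04, "gene3": 0.04},
    "method_c": {"gene1": 0.002, "gene2": 0.01, "gene3": 0.30},
}
consensus = consensus_significant(results)
```

Requiring agreement from every method is the strictest possible choice; a looser variant could accept genes called significant by a majority of methods.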

Viewing RNA-seq data on the entire human genome

F1000Research ◽

10.12688/f1000research.9762.1 ◽

2017 ◽

Vol 6 ◽

pp. 596 ◽

Cited By ~ 2

Author(s):

Eric M. Weitz ◽

Lorena Pantano ◽

Jingzhi Zhu ◽

Bennett Upton ◽

Ben Busby

Keyword(s):

Web Application ◽

Source Code ◽

Gene Expression Omnibus ◽

Expression Data ◽

Rna Seq ◽

Sequence Read Archive ◽

Data Pipeline ◽

Genome Wide ◽

Small Team ◽

Genome Wide Expression

RNA-Seq Viewer is a web application that enables users to visualize genome-wide expression data from NCBI’s Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO) databases. The application prototype was created by a small team during a three-day hackathon facilitated by NCBI at Brandeis University. The backend data pipeline was developed and deployed on a shared AWS EC2 instance. Source code is available at https://github.com/NCBI-Hackathons/rnaseqview.

ASGAL: Aligning RNA-Seq Data to a Splicing Graph to Detect Novel Alternative Splicing Events

10.1101/260372 ◽

2018 ◽

Author(s):

Luca Denti ◽

Raffaella Rizzi ◽

Stefano Beretta ◽

Gianluca Della Vedova ◽

Marco Previtali ◽

...

Keyword(s):

Alternative Splicing ◽

Experimental Analysis ◽

Reference Genome ◽

Gene Annotation ◽

Source Code ◽

Rna Seq ◽

Computationally Expensive ◽

Transcriptome Analyses ◽

Alternative Splicing Events ◽

Complicated Task

Background: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. The latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph. Results: We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the main goal of detecting novel alternative splicing events. ASGAL receives as input the annotated transcripts of a gene and an RNA-Seq sample, and it computes (1) the spliced alignments of each read, and (2) a list of novel events with respect to the gene annotation. Conclusions: An experimental analysis shows that, by aligning reads directly to the splicing graph, ASGAL better predicts alternative splicing events when compared to tools requiring spliced alignments of the RNA-Seq data to a reference genome. To the best of our knowledge, ASGAL is the first tool that detects novel alternative splicing events by directly aligning reads to a splicing graph. Availability: Source code, documentation, and data are available for download at http://asgal.algolab.eu.

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

10.1101/2021.02.18.431773 ◽

2021 ◽

Author(s):

R.E. Rivera-Vicéns ◽

C. Garcia Escudero ◽

N. Conci ◽

M. Eitel ◽

G. Wörheide

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Rna Seq ◽

Analysis Pipeline ◽

User Input ◽

Genome Data ◽

Differential Gene ◽

Transcriptomic Level ◽

Genome Information

The use of RNA-Seq data and the generation of de novo transcriptome assemblies have been pivotal for studies in ecology and evolution. This is especially true for non-model organisms, where no genome information is available; yet, studies of differential gene expression, DNA enrichment bait design, and phylogenetics can all be accomplished with the data gathered at the transcriptomic level. Multiple tools are available for transcriptome assembly; however, no single tool can provide the best assembly for all datasets. Therefore, a multi-assembler approach, followed by a reduction step, is often sought to generate an improved representation of the assembly. To reduce errors in these complex analyses while at the same time attaining reproducibility and scalability, automated workflows have been essential in the analysis of RNA-Seq data. However, most of these tools are designed for species where genome data is used as reference for the assembly process, limiting their use in non-model organisms. We present TransPi, a comprehensive pipeline for de novo transcriptome assembly that requires minimal user input without sacrificing the ability to perform a thorough analysis. A combination of different model organisms, k-mer sets, read lengths, and read quantities was used for assessing the tool. Furthermore, a total of 49 non-model organisms, spanning different phyla, were also analyzed. Compared to approaches using single assemblers only, TransPi produces higher BUSCO completeness percentages and a concurrent significant reduction in duplication rates. TransPi is easy to configure and can be deployed seamlessly using Conda, Docker and Singularity.