LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins

Anastasia A Gulyaeva; Andrey I Sigorskih; Elena S Ocheredko; Dmitry V Samborskiy; Alexander E Gorbalenya

doi:10.1093/bioinformatics/btaa065


Bioinformatics ◽

10.1093/bioinformatics/btaa065 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2731-2739 ◽

Cited By ~ 1

Author(s):

Anastasia A Gulyaeva ◽

Andrey I Sigorskih ◽

Elena S Ocheredko ◽

Dmitry V Samborskiy ◽

Alexander E Gorbalenya

Keyword(s):

Rna Virus ◽

Sequence Similarity ◽

Statistical Significance ◽

R Package ◽

Similarity Score ◽

Supplementary Information ◽

Accurate Estimation ◽

Local Alignment ◽

Multidomain Proteins ◽

Multidomain Protein

Abstract Motivation To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus, whole proteins are commonly used as queries, which complicates establishing homology for similarities close to cutoff levels of statistical significance. Results In this article, we describe an iterative approach, called LAMPA, LArge Multidomain Protein Annotator, that resolves the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used the Pfam database for annotating 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: number of and coverage by identified homologous regions, and number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins. This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized the obtained results based on how, in computational experiments, the statistical significance of HHsearch hits for local alignment similarity scores depends on the lengths and diversities of query-target pairs. Availability and implementation The LAMPA 1.0.0 R package is available on GitHub (https://github.com/Gorbalenya-Lab/LAMPA). Supplementary information Supplementary data are available at Bioinformatics online.
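The idea of defining "ever smaller queries" at each iteration can be illustrated with a minimal sketch. This is not LAMPA's implementation (LAMPA is an R package built on TMHMM and HHsearch); the function name `uncovered_regions`, the `min_len` parameter, and the example coordinates are hypothetical. The sketch only shows one plausible way to derive the next round of queries: take the stretches of the protein that no accepted hit covers yet.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue coordinates


def uncovered_regions(length: int, hits: List[Interval], min_len: int = 1) -> List[Interval]:
    """Return the stretches of a protein not covered by any accepted hit.

    Hypothetical helper: in an iterative scheme, each returned stretch
    could be searched again as a shorter, separate query.
    """
    regions = []
    next_free = 1  # first residue not yet known to be covered
    for start, end in sorted(hits):
        if start > next_free:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= length:
        regions.append((next_free, length))
    # Drop stretches too short to be useful queries (threshold is an assumption).
    return [(s, e) for s, e in regions if e - s + 1 >= min_len]


# Illustrative values only: a 1000-residue polyprotein with two accepted hits.
polyprotein_length = 1000
accepted_hits = [(100, 300), (500, 600)]
next_queries = uncovered_regions(polyprotein_length, accepted_hits, min_len=50)
print(next_queries)  # [(1, 99), (301, 499), (601, 1000)]
```

Each returned interval is shorter than the original protein, so a hit that was borderline against the full-length query gets re-evaluated against a smaller search space in the next round.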


DiscoRhythm: an easy-to-use web application and R package for discovering rhythmicity

Bioinformatics ◽

10.1093/bioinformatics/btz834 ◽

2019 ◽

Cited By ~ 2

Author(s):

Matthew Carlucci ◽

Algimantas Kriščiūnas ◽

Haohan Li ◽

Povilas Gibas ◽

Karolis Koncevičius ◽

...

Keyword(s):

Web Application ◽

Statistical Significance ◽

R Package ◽

Biological Data ◽

Supplementary Information ◽

Statistical Knowledge ◽

Health And Disease ◽

Phase Amplitude ◽

Almost All ◽

User Friendly

Abstract Motivation Biological rhythmicity is fundamental to almost all organisms on Earth and plays a key role in health and disease. Identification of oscillating signals could lead to novel biological insights, yet its investigation is impeded by the extensive computational and statistical knowledge required to perform such analysis. Results To address this issue, we present DiscoRhythm (Discovering Rhythmicity), a user-friendly application for characterizing rhythmicity in temporal biological data. DiscoRhythm is available as a web application or an R/Bioconductor package for estimating phase, amplitude, and statistical significance using four popular approaches to rhythm detection (Cosinor, JTK Cycle, ARSER, and Lomb-Scargle). We optimized these algorithms for speed, improving their execution times up to 30-fold to enable rapid analysis of -omic-scale datasets in real-time. Informative visualizations, interactive modules for quality control, dimensionality reduction, periodicity profiling, and incorporation of experimental replicates make DiscoRhythm a thorough toolkit for analyzing rhythmicity. Availability and Implementation The DiscoRhythm R package is available on Bioconductor (https://bioconductor.org/packages/DiscoRhythm), with source code available on GitHub (https://github.com/matthewcarlucci/DiscoRhythm) under a GPL-3 license. The web application is securely deployed over HTTPS (https://disco.camh.ca) and is freely available for use worldwide. Local instances of the DiscoRhythm web application can be created using the R package or by deploying the publicly available Docker container (https://hub.docker.com/r/mcarlucci/discorhythm). Supplementary information Supplementary data are available at Bioinformatics online.


DEUS: an R package for accurate small RNA profiling based on differential expression of unique sequences

Bioinformatics ◽

10.1093/bioinformatics/btz495 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4834-4836

Author(s):

Tim Jeske ◽

Peter Huypens ◽

Laura Stirm ◽

Selina Höckele ◽

Christine M Wurmser ◽

...

Keyword(s):

Differential Expression ◽

Small Rna ◽

Sequence Similarity ◽

Differential Expression Analysis ◽

R Package ◽

Supplementary Information ◽

Small Rna Sequencing ◽

Sequencing Data ◽

Rna Sequences ◽

Rna Profiling

Abstract Summary Despite the fundamental role of small RNAs in various biological processes, the analysis of small RNA sequencing data remains a challenging task. Major obstacles arise when short RNA sequences map to multiple locations in the genome, align to unannotated regions, or have undergone post-transcriptional changes that hamper accurate mapping. In order to tackle these issues, we present a novel profiling strategy that circumvents the need for read mapping to a reference genome by utilizing the actual read sequences to determine expression intensities. After differential expression analysis of individual sequence counts, significant sequences are annotated against user-defined feature databases and clustered by sequence similarity. This strategy enables a more comprehensive and concise representation of small RNA populations without any data loss or data distortion. Availability and implementation Code and documentation of our R package are available at http://ibis.helmholtz-muenchen.de/deus/. Supplementary information Supplementary data are available at Bioinformatics online.


ThETA: transcriptome-driven efficacy estimates for gene-based TArget discovery

Bioinformatics ◽

10.1093/bioinformatics/btaa518 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4214-4216

Author(s):

Mario Failli ◽

Jussi Paananen ◽

Vittorio Fortino

Keyword(s):

Statistical Significance ◽

R Package ◽

Fold Change ◽

Supplementary Information ◽

Disease Genes ◽

Gene Target ◽

Tissue Specific ◽

Target Discovery ◽

Disease Associations ◽

Target Disease

Abstract Summary Estimating efficacy of gene–target-disease associations is a fundamental step in drug discovery. An important data source for this laborious task is RNA expression, which can provide gene–disease associations on the basis of expression fold change and statistical significance. However, simply using the log-fold change can lead to numerous false-positive associations. On the other hand, more sophisticated methods that utilize gene co-expression networks do not consider tissue specificity. Here, we introduce Transcriptome-driven Efficacy estimates for gene-based TArget discovery (ThETA), an R package that enables non-expert users to use novel efficacy scoring methods for drug–target discovery. In particular, ThETA allows users to search for gene perturbations (therapeutics) that reverse disease-gene expression and for genes that are closely related to disease-genes in tissue-specific networks. ThETA also provides functions to integrate efficacy evaluations obtained with different approaches and to build an overall efficacy score, which can be used to identify and prioritize gene(target)–disease associations. Finally, ThETA implements visualizations to show tissue-specific interconnections between target and disease-genes, and to indicate biological annotations associated with the top selected genes. Availability and implementation ThETA is freely available for academic use at https://github.com/vittoriofortino84/ThETA. Supplementary information Supplementary data are available at Bioinformatics online.


Detecting High Scoring Local Alignments in Pangenome Graphs

Bioinformatics ◽

10.1093/bioinformatics/btab077 ◽

2021 ◽

Author(s):

Tizian Schulz ◽

Roland Wittler ◽

Sven Rahmann ◽

Faraz Hach ◽

Jens Stoye

Keyword(s):

Sequence Similarity ◽

Query Sequence ◽

Heuristic Method ◽

Supplementary Information ◽

De Bruijn Graph ◽

Local Alignment ◽

Memory Usage ◽

Sequence Comparisons ◽

De Bruijn Graphs ◽

De Bruijn

Abstract Motivation Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information Supplementary data are available at Bioinformatics online.
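The abstract does not say how the parameters of the exponential-tail distribution are estimated. As a generic illustration only, the sketch below fits the rate of an exponential tail above a chosen threshold by maximum likelihood and uses it to score how unusual an alignment score is. The function names, the threshold, and the sample scores are assumptions made for this example, and the continuous exponential is itself an approximation for integer alignment scores.

```python
import math
from typing import Iterable


def fit_tail_rate(scores: Iterable[float], threshold: float) -> float:
    """Maximum-likelihood rate of an exponential tail above `threshold`.

    For excesses x = s - threshold modelled as Exp(rate), the MLE is
    rate = n / sum(x). Scores below the threshold are ignored.
    """
    excesses = [s - threshold for s in scores if s >= threshold]
    total = sum(excesses)
    if not excesses or total <= 0:
        raise ValueError("need at least one score strictly above the threshold")
    return len(excesses) / total


def tail_probability(score: float, threshold: float, rate: float) -> float:
    """P(S >= score | S >= threshold) under the fitted exponential tail."""
    return math.exp(-rate * max(0.0, score - threshold))


# Made-up scores from alignments assumed to be spurious (e.g. shuffled queries).
background_scores = [5, 10, 11, 12, 13, 14]
rate = fit_tail_rate(background_scores, threshold=10)
print(rate)                                 # 0.5
print(tail_probability(14, 10, rate))       # exp(-2), roughly 0.135
```

A small tail probability for an observed score suggests that the alignment is unlikely to be a spurious finding under the fitted background model.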


Connecting mathematical models to genomes: joint estimation of model parameters and genome-wide marker effects on these parameters

Bioinformatics ◽

10.1093/bioinformatics/btaa129 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3169-3176 ◽

Cited By ~ 1

Author(s):

Akio Onogi

Keyword(s):

Mathematical Models ◽

R Package ◽

Joint Estimation ◽

Supplementary Information ◽

Joint Analysis ◽

Accurate Estimation ◽

Model Parameters ◽

Statistical Framework ◽

Genome Wide ◽

Estimation Of Model Parameters

Abstract Motivation Parameters of mathematical models used in biology may be genotype-specific and regarded as new traits. Therefore, an accurate estimation of these parameters and the association mapping on the estimated parameters can lead to important findings regarding the genetic architecture of biological processes. In this study, a statistical framework for a joint analysis (JA) of model parameters and genome-wide marker effects on these parameters was proposed and evaluated. Results In the simulation analyses based on different types of mathematical models, the JA inferred the model parameters and identified the responsible genomic regions more accurately than the independent analysis (IA). The JA of real plant data provided interesting insights into photosensitivity, which were uncovered by the IA. Availability and implementation The statistical framework is provided by the R package GenomeBasedModel available at https://github.com/Onogi/GenomeBasedModel. All R and C++ scripts used in this study are also available at the site. Supplementary information Supplementary data are available at Bioinformatics online.


Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees

Bioinformatics ◽

10.1093/bioinformatics/btaa614 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5007-5013 ◽

Cited By ~ 1

Author(s):

Martin R Smith

Keyword(s):

Phylogenetic Trees ◽

R Package ◽

Similarity Score ◽

Supplementary Information ◽

Information Theoretic ◽

Tree Comparison ◽

New Information ◽

Alternative Measures ◽

Tree Similarity ◽

Probabilistic Measures

Abstract Motivation The Robinson–Foulds (RF) metric is widely used by biologists, linguists and chemists to quantify similarity between pairs of phylogenetic trees. The measure tallies the number of bipartition splits that occur in both trees—but this conservative approach ignores potential similarities between almost-identical splits, with undesirable consequences. ‘Generalized’ RF metrics address this shortcoming by pairing splits in one tree with similar splits in the other. Each pair is assigned a similarity score, the sum of which enumerates the similarity between two trees. The challenge lies in quantifying split similarity: existing definitions lack a principled statistical underpinning, resulting in misleading tree distances that are difficult to interpret. Here, I propose probabilistic measures of split similarity, which allow tree similarity to be measured in natural units (bits). Results My new information-theoretic metrics outperform alternative measures of tree similarity when evaluated against a broad suite of criteria, even though they do not account for the non-independence of splits within a single tree. Mutual clustering information exhibits none of the undesirable properties that characterize other tree comparison metrics, and should be preferred to the RF metric. Availability and implementation The methods discussed in this article are implemented in the R package ‘TreeDist’, archived at https://dx.doi.org/10.5281/zenodo.3528123. Supplementary information Supplementary data are available at Bioinformatics online.
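A small sketch of the classic RF tally may make the shortcoming concrete. It is not taken from the TreeDist package; the helper names and the two six-leaf trees are invented for illustration. Each tree is represented by its set of non-trivial bipartition splits, and splits are compared for exact identity only.

```python
from typing import FrozenSet, Iterable, Set

Split = FrozenSet[FrozenSet[str]]  # an unordered pair of leaf sets


def make_split(side: Iterable[str], leaves: Iterable[str]) -> Split:
    """Build a bipartition from one side; the other side is the complement."""
    side_set = frozenset(side)
    return frozenset({side_set, frozenset(leaves) - side_set})


def shared_splits(tree_a: Set[Split], tree_b: Set[Split]) -> int:
    """Number of splits that occur in both trees."""
    return len(tree_a & tree_b)


def rf_distance(tree_a: Set[Split], tree_b: Set[Split]) -> int:
    """Robinson-Foulds distance: splits found in exactly one of the trees."""
    return len(tree_a ^ tree_b)


leaves = "ABCDEF"
tree_1 = {make_split("AB", leaves), make_split("ABC", leaves), make_split("EF", leaves)}
tree_2 = {make_split("AB", leaves), make_split("ABD", leaves), make_split("EF", leaves)}

print(shared_splits(tree_1, tree_2))  # 2
print(rf_distance(tree_1, tree_2))    # 2
```

The splits ABC|DEF and ABD|CEF differ by swapping a single pair of leaves, yet the tally treats them as having nothing in common. Generalized RF metrics address this by pairing such splits and assigning each pair a similarity score.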


Incremental BLAST: incremental addition of new sequence databases through e-value correction

10.1101/476218 ◽

2018 ◽

Cited By ~ 1

Author(s):

Sajal Dash ◽

Sarthok Rahman ◽

Heather M. Hines ◽

Wu-chun Feng

Keyword(s):

Sequence Similarity ◽

Computational Cost ◽

Supplementary Information ◽

Local Alignment ◽

Arbitrary Sequence ◽

Search Results ◽

Blast Search ◽

Link Type ◽

Ncbi Blast ◽

Incremental Addition

Abstract Motivation Search results from local alignment search tools use statistical parameters sensitive to the size of the database. NCBI BLAST, for example, reports important matches using similarity scores and expect values (e-values) calculated against database size. Over the course of an investigation, the database grows and the best matches may change. To update the results of a sequence similarity search and find the best hits, bioinformaticians must rerun the BLAST search against the entire database; this translates into irredeemably spent time, money, and computational resources. Results We develop an efficient way to redeem spent BLAST search effort by introducing Incremental BLAST. This tool makes use of previous BLAST search results as it conducts new searches on only the incremental part of the database, recomputes statistical metrics such as e-values, and combines these two sets of results to produce updated results. We develop statistics for correcting e-values of any BLAST result against any arbitrary sequence database. The experimental results and accuracy analysis demonstrate that Incremental BLAST can provide search results identical to NCBI BLAST at a significantly reduced computational cost. We present three case studies to showcase different use cases where Incremental BLAST can make biological discovery more efficient at a reduced cost. This tool can be used to update sequence BLAST searches during the course of genomic and transcriptomic projects, such as re-annotation projects, and to conduct incremental addition of taxon-specific sequences to a BLAST database. Incremental BLAST performs (1 + δ)/δ times faster than NCBI BLAST for δ fraction of database growth. Availability Incremental BLAST is available at https://bitbucket.org/sajal000/incremental-blast. Supplementary information Supplementary data are available at https://bitbucket.org/sajal000/incremental-blast.
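The sketch below illustrates the two quantitative ideas in this abstract under a deliberately simplified model. It assumes the basic Karlin-Altschul relation, in which an e-value is proportional to the size of the database searched, and ignores the effective-length corrections that real BLAST applies; it is not the correction statistics developed by the authors. All function names, hit names, and sizes are hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (subject name, e-value)


def rescale_evalue(evalue: float, searched_size: float, total_size: float) -> float:
    """Rescale an e-value from the database size it was computed against
    to a new total size, assuming e-value is proportional to database size."""
    return evalue * total_size / searched_size


def merge_hits(old_hits: List[Hit], new_hits: List[Hit],
               old_size: float, increment_size: float) -> List[Hit]:
    """Combine earlier results with results from the increment only,
    expressing every e-value against the combined database."""
    total = old_size + increment_size
    merged = [(name, rescale_evalue(e, old_size, total)) for name, e in old_hits]
    merged += [(name, rescale_evalue(e, increment_size, total)) for name, e in new_hits]
    return sorted(merged, key=lambda hit: hit[1])


def expected_speedup(delta: float) -> float:
    """Ratio of a full rerun, cost (1 + delta), to an increment-only search, cost delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return (1 + delta) / delta


# Illustrative numbers: a database of 1,000,000 residues grows by 25%.
merged = merge_hits([("seqA", 1e-6)], [("seqB", 1e-8)],
                    old_size=1_000_000, increment_size=250_000)
print([name for name, _ in merged])  # ['seqB', 'seqA']
print(expected_speedup(0.25))        # 5.0
```

The speedup figure follows from simple arithmetic: if the original database has size N and grows by δN, a full rerun searches (1 + δ)N while the incremental search covers only δN, giving the ratio (1 + δ)/δ quoted in the abstract.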


A Similarity Searching System for Biological Phenotype Images Using Deep Convolutional Encoder-decoder Architecture

Current Bioinformatics ◽

10.2174/1574893614666190204150109 ◽

2019 ◽

Vol 14 (7) ◽

pp. 628-639 ◽

Cited By ~ 10

Author(s):

Bizhi Wu ◽

Hangxiao Zhang ◽

Limei Lin ◽

Huiyuan Wang ◽

Yubang Gao ◽

...

Keyword(s):

Neural Network ◽

Retrieval System ◽

Sequence Similarity ◽

Local Alignment ◽

Similarity Searching ◽

Loss Of Function ◽

Biological Images ◽

The Neural Network ◽

Convolutional Autoencoder ◽

Biological Phenotype

Background: The BLAST (Basic Local Alignment Search Tool) algorithm has been widely used for sequence similarity searching. Analogously, public phenotype images should be efficiently retrievable using biological images as queries to identify phenotypes with high similarity. Despite the accumulation of genotype-phenotype mapping data, a system for searching for similar phenotypes is not available because of the bottleneck of image processing. Objective: In this study, we focus on the identification of similar query phenotypic images by searching the biological phenotype database, including information about loss-of-function and gain-of-function. Methods: We propose a deep convolutional autoencoder architecture to segment the biological phenotypic images and develop a phenotype retrieval system to enable a better understanding of genotype–phenotype correlation. Results: This study shows how a deep convolutional autoencoder architecture can be trained on images from biological phenotypes to achieve state-of-the-art performance in a phenotypic image retrieval system. Conclusion: Taken together, the phenotype analysis system can provide further information on the correlation between genotype and phenotype. Additionally, it is obvious that the neural network model of image segmentation and the phenotype retrieval system are equally suitable for any species that has enough phenotype images to train the neural network.


ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

Rna Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.


Structure Unveils Relationships between RNA Virus Polymerases

Viruses ◽

10.3390/v13020313 ◽

2021 ◽

Vol 13 (2) ◽

pp. 313

Author(s):

Heli A. M. Mönttinen ◽

Janne J. Ravantti ◽

Minna M. Poranen

Keyword(s):

Phylogenetic Tree ◽

Rna Viruses ◽

Rna Virus ◽

Sequence Similarity ◽

Protein Structures ◽

Structural Similarity ◽

Functional Differentiation ◽

Comparison Method ◽

Homologous Structure ◽

Biological Entities

RNA viruses are the fastest evolving known biological entities. Consequently, the sequence similarity between homologous viral proteins disappears quickly, limiting the usability of traditional sequence-based phylogenetic methods in the reconstruction of relationships and evolutionary history among RNA viruses. Protein structures, however, typically evolve more slowly than sequences, and structural similarity can still be evident, when no sequence similarity can be detected. Here, we used an automated structural comparison method, homologous structure finder, for comprehensive comparisons of viral RNA-dependent RNA polymerases (RdRps). We identified a common structural core of 231 residues for all the structurally characterized viral RdRps, covering segmented and non-segmented negative-sense, positive-sense, and double-stranded RNA viruses infecting both prokaryotic and eukaryotic hosts. The grouping and branching of the viral RdRps in the structure-based phylogenetic tree follow their functional differentiation. The RdRps using protein primer, RNA primer, or self-priming mechanisms have evolved independently of each other, and the RdRps cluster into two large branches based on the used transcription mechanism. The structure-based distance tree presented here follows the recently established RdRp-based RNA virus classification at genus, subfamily, family, order, class and subphylum ranks. However, the topology of our phylogenetic tree suggests an alternative phylum level organization.
