CorGAT: a tool for the functional annotation of SARS-CoV-2 genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa1047 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Marco Antonio Tangaro ◽

Pietro Mandreoli ◽

David S Horner ◽

...

Keyword(s):

Functional Annotation ◽

Ad Hoc ◽

State Of The Art ◽

Supplementary Information ◽

Genomic Sequences ◽

Supplementary Data ◽

Evolutionary Patterns ◽

Genomic Variants ◽

Art Methods ◽

Available Resources

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparison with other state-of-the-art methods, we demonstrate that, by providing a more comprehensive and richer annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availability and implementation Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.


CPS analysis: self-contained validation of biomedical data clustering

Bioinformatics ◽

10.1093/bioinformatics/btaa165 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3516-3521 ◽

Cited By ~ 1

Author(s):

Lixiang Zhang ◽

Lin Lin ◽

Jia Li

Keyword(s):

Data Clustering ◽

State Of The Art ◽

R Package ◽

Research Community ◽

Supplementary Information ◽

Biomedical Data ◽

Data Generation ◽

Supplementary Data ◽

Point Set ◽

Class Labels

Abstract Motivation Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community. Results We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for high-dimensional data, and to provide a more comprehensive view of potentially interesting subgroups in the data. Applying it to three usage scenarios for biomedical data, we demonstrate that CPS analysis is more effective for evaluating the uncertainty of clusters than state-of-the-art measurements. We also showcase how to use CPS analysis to select data generation technologies or visualization methods. Availability and implementation The method is implemented in an R package called OTclust, available on CRAN. Supplementary information Supplementary data are available at Bioinformatics online.


VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations

Bioinformatics ◽

10.1093/bioinformatics/btz482 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4854-4856 ◽

Cited By ~ 8

Author(s):

James D Stephenson ◽

Roman A Laskowski ◽

Andrew Nightingale ◽

Matthew E Hurles ◽

Janet M Thornton

Keyword(s):

Protein Sequence ◽

Structural Information ◽

Protein Structures ◽

Supplementary Information ◽

Supplementary Data ◽

Web Tool ◽

Genomic Variants ◽

Structural Context ◽

Pathogenic Variants ◽

Transcript Evidence

Abstract Motivation Understanding the protein structural context of genomic variants, and how they are distributed across proteins, can help to separate benign from pathogenic variants and reveal molecular consequences. However, mapping genomic coordinates to protein structures is non-trivial, complicated by alternative splicing and transcript evidence. Results Here we present VarMap, a web tool for mapping a list of chromosome coordinates to canonical UniProt sequences and associated protein 3D structures, including validation checks, and annotating them with structural information. Availability and implementation https://www.ebi.ac.uk/thornton-srv/databases/VarMap. Supplementary information Supplementary data are available at Bioinformatics online.
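
To make the coordinate-mapping step concrete, the following is a minimal sketch of the simplest case only: a forward-strand transcript whose coding exons are known. It is not VarMap's pipeline, which resolves transcripts against canonical UniProt sequences and runs validation checks; the function name and the exon coordinates are hypothetical.

```python
from typing import List, Optional, Tuple


def genomic_to_residue(
    position: int, cds_exons: List[Tuple[int, int]]
) -> Optional[Tuple[int, int]]:
    """Map a 1-based genomic position to (residue number, position in codon).

    Assumes a forward-strand transcript; cds_exons holds the coding part of
    each exon as inclusive 1-based (start, end) pairs in ascending order.
    Returns None when the position falls outside the coding sequence.
    """
    coding_bases_before = 0  # coding bases in exons already passed
    for start, end in cds_exons:
        if start <= position <= end:
            cds_index = coding_bases_before + (position - start)  # 0-based
            return cds_index // 3 + 1, cds_index % 3 + 1
        coding_bases_before += end - start + 1
    return None


# Hypothetical two-exon coding sequence: a codon spans the exon junction.
toy_exons = [(100, 104), (200, 210)]
print(genomic_to_residue(104, toy_exons))  # second base of residue 2
print(genomic_to_residue(200, toy_exons))  # third base of residue 2
print(genomic_to_residue(150, toy_exons))  # intronic, so None
```

Even this toy case shows why splicing complicates the mapping: residue 2 is encoded by bases on both sides of an intron, and a different transcript of the same gene would assign the same genomic position to a different residue.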


Temporal network alignment via GoT-WAVE

Bioinformatics ◽

10.1093/bioinformatics/btz119 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3527-3529 ◽

Cited By ~ 3

Author(s):

David Aparício ◽

Pedro Ribeiro ◽

Tijana Milenković ◽

Fernando Silva

Keyword(s):

User Interface ◽

State Of The Art ◽

Source Code ◽

Network Alignment ◽

Supplementary Information ◽

Temporal Network ◽

Temporal Networks ◽

Supplementary Data ◽

Node Similarity ◽

User Friendly

Abstract Motivation Network alignment (NA) finds conserved regions between two networks. NA methods optimize node conservation (NC) and edge conservation. Dynamic graphlet degree vectors are a state-of-the-art dynamic NC measure, used within the fastest and most accurate NA method for temporal networks: DynaWAVE. Here, we use graphlet-orbit transitions (GoTs), a different graphlet-based measure of temporal node similarity, as a new dynamic NC measure within DynaWAVE, resulting in GoT-WAVE. Results On synthetic networks, GoT-WAVE improves DynaWAVE’s accuracy by 30% and speed by 64%. On real networks, when optimizing only dynamic NC, the methods are complementary. Furthermore, only GoT-WAVE supports directed edges. Hence, GoT-WAVE is a promising new temporal NA algorithm, which efficiently optimizes dynamic NC. We provide a user-friendly interface and source code for GoT-WAVE. Availability and implementation http://www.dcc.fc.up.pt/got-wave/ Supplementary information Supplementary data are available at Bioinformatics online.


PATO: Pangenome Analysis Toolkit

10.1101/2021.01.30.428878 ◽

2021 ◽

Author(s):

Miguel D. Fernández-de-Bobadilla ◽

Alba Talavera-Rodríguez ◽

Lucía Chacón ◽

Fernando Baquero ◽

Teresa M. Coque ◽

...

Keyword(s):

Population Structure ◽

Statistical Analysis ◽

Core Genome ◽

State Of The Art ◽

Source Code ◽

Supplementary Information ◽

Complete Analysis ◽

Large Set ◽

Supplementary Data ◽

Desktop Computer

Abstract Motivation Comparative genomics is a growing field but one that will eventually be overtaken by sample size studies and the increase of available genomes in public databases. We present the Pangenome Analysis Toolkit (PATO) designed to simultaneously analyze thousands of genomes using a desktop computer. The tool performs common tasks of pangenome analysis such as core-genome definition and accessory genome properties and includes new features that help characterize population structure, annotate pathogenic features and create gene sharedness networks. PATO has been developed in R to integrate with the large set of tools available for genetic, phylogenetic and statistical analysis in this environment. Results PATO can perform the most demanding bioinformatic analyses in minutes with an accuracy comparable to state-of-the-art software but 20–30 times faster. PATO also integrates all the necessary functions for the complete analysis of the most common objectives in microbiology studies. Lastly, PATO includes the necessary tools for visualizing the results and can be integrated with other analytical packages available in R. Availability The source code for PATO is freely available at https://github.com/irycisBioinfo/PATO under the GPLv3 license. Supplementary information Supplementary data are available at Bioinformatics online.


VINYL: Variant prIoritizatioN by survivaL analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa1067 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Pietro Mandreoli ◽

Marco Antonio Tangaro ◽

Anna Maria D’Erchia ◽

Sandro Sorrentino ◽

...

Keyword(s):

Genetic Variants ◽

Functional Annotation ◽

State Of The Art ◽

Clinical Applications ◽

Automated System ◽

Supplementary Information ◽

Variant Prioritization ◽

Sequencing Technologies ◽

Pathological Conditions ◽

Equivalent State

Abstract Motivation Clinical applications of genome re-sequencing technologies typically generate large amounts of data that need to be carefully annotated and interpreted to identify genetic variants potentially associated with pathological conditions. In this context, accurate and reproducible methods for the functional annotation and prioritization of genetic variants are of fundamental importance. Results In this paper, we present VINYL, a flexible and fully automated system for the functional annotation and prioritization of genetic variants. Extensive analyses of both real and simulated datasets suggest that VINYL can identify clinically relevant genetic variants more accurately than equivalent state-of-the-art methods, allowing a more rapid and effective prioritization of genetic variants in different experimental settings. As such, we believe that VINYL can establish itself as a valuable tool to assist healthcare operators and researchers in clinical genomics investigations. Availability VINYL is available at http://beaconlab.it/VINYL and https://github.com/matteo14c/VINYL. Supplementary information Supplementary data are available at Bioinformatics online.


HaploGrouper: a generalized approach to haplogroup classification

Bioinformatics ◽

10.1093/bioinformatics/btaa729 ◽

2020 ◽

Author(s):

Anuradha Jagadeesan ◽

S Sunna Ebenesersdóttir ◽

Valdis B Guðmundsdóttir ◽

Elisabet Linda Thordardottir ◽

Kristjan H S Moore ◽

...

Keyword(s):

Mitochondrial Dna ◽

Phylogenetic Tree ◽

Y Chromosome ◽

State Of The Art ◽

Supplementary Information ◽

Sequence Variants ◽

Use Case ◽

Supplementary Data ◽

Human Mitochondrial Dna ◽

Comparable Accuracy

Abstract Motivation We introduce HaploGrouper, versatile software for classifying haplotypes into haplogroups on the basis of a known phylogenetic tree. A typical use case for this software is the assignment of haplogroups to human mitochondrial DNA (mtDNA) or Y-chromosome haplotypes. Existing state-of-the-art haplogroup-calling software is typically hard-wired to work only with either mtDNA or Y-chromosome haplotypes from humans. Results HaploGrouper exhibits comparable accuracy in these instances and has the advantage of being able to assign haplogroups to any kind of haplotypes from any species, given an extant annotated phylogenetic tree defined by sequence variants. Availability and implementation The software is available at the following URL: https://gitlab.com/bio_anth_decode/haploGrouper. Supplementary information Supplementary data are available at Bioinformatics online.
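
The idea of classifying a haplotype against a tree defined by sequence variants can be illustrated with a deliberately simplified sketch. It picks the deepest haplogroup whose defining variants, along the whole path from the root, are all observed in the haplotype. This is an assumption made for illustration: the abstract does not describe HaploGrouper's actual scoring, and the tree, haplogroup names and variant labels below are hypothetical.

```python
# Toy tree: haplogroup -> (parent, variants defining the branch leading to it).
# All names and variants are made up for the example.
TOY_TREE = {
    "root": (None, frozenset()),
    "A": ("root", frozenset({"73G"})),
    "A1": ("A", frozenset({"150T"})),
    "B": ("root", frozenset({"263G", "310C"})),
}


def path_to_root(tree, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path


def assign_haplogroup(tree, haplotype_variants):
    """Return the deepest haplogroup fully supported by the observed variants.

    A haplogroup is supported when every defining variant on its root-to-node
    path is present. Ties in depth are broken by dictionary order.
    """
    observed = set(haplotype_variants)
    best, best_depth = None, -1
    for node in tree:
        path = path_to_root(tree, node)
        if all(tree[n][1] <= observed for n in path):
            depth = len(path) - 1
            if depth > best_depth:
                best, best_depth = node, depth
    return best


print(assign_haplogroup(TOY_TREE, {"73G", "150T"}))  # deepest supported node
print(assign_haplogroup(TOY_TREE, {"263G"}))         # B needs both variants
```

Because the tree is plain data rather than code, the same function works for any species or locus, which is the generality the abstract emphasizes. A practical tool would also need to tolerate missing calls and back mutations, which this strict all-variants rule does not.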


QPARSE: searching for long-looped or multimeric G-quadruplexes potentially distinctive and druggable

Bioinformatics ◽

10.1093/bioinformatics/btz569 ◽

2019 ◽

Cited By ~ 1

Author(s):

Michele Berselli ◽

Enrico Lavezzo ◽

Stefano Toppo

Keyword(s):

Human Gene ◽

State Of The Art ◽

Comprehensive Analysis ◽

Supplementary Information ◽

Gene Promoters ◽

Supplementary Data ◽

Stem Loop ◽

Hiv 1 ◽

Rna And Dna ◽

The Web

Abstract Motivation G-quadruplexes (G4s) are non-canonical nucleic acid conformations that are widespread in all kingdoms of life and are emerging as important regulators both in RNA and DNA. Recently, two new higher-order architectures have been reported: adjacent interacting G4s, and G4s with stable long loops forming stem-loop structures. As there are no specialized tools to identify these conformations, we developed QPARSE. Results QPARSE can exhaustively search for degenerate potential quadruplex-forming sequences (PQSs) containing bulges and/or mismatches at genomic level, as well as either multimeric or long-looped PQS (MPQS and LLPQS, respectively). While its performance on known reference datasets is comparable with the state of the art, what is more interesting is its performance in the identification of MPQS and LLPQS, which present algorithms are not designed to search for. We report a comprehensive analysis of MPQS in human gene promoters and the analysis of LLPQS on three experimentally validated case studies from HIV-1, BCL2, and hTERT. Availability QPARSE is freely accessible on the web at http://www.medcomp.medicina.unipd.it/qparse/index or downloadable from GitHub as a Python 2.7 program: https://github.com/B3rse/qparse. Supplementary information Supplementary data are available at Bioinformatics online.
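
For orientation, the canonical, non-degenerate PQS motif can be sketched as a regular expression: four runs of three or more guanines separated by three short loops. This is only the textbook pattern, not QPARSE's algorithm, which additionally handles bulges, mismatches, long loops and multimers. The function name and the 1 to 7 nucleotide loop bounds are assumptions chosen for the example.

```python
import re

# Four G-runs (>= 3 Gs) separated by three loops of 1-7 nucleotides.
# The loop bounds follow a commonly used definition and are an assumption
# here, not a QPARSE parameter.
PQS_PATTERN = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def find_canonical_pqs(sequence):
    """Return (start, end, motif) for non-overlapping canonical PQS hits.

    Coordinates are 0-based, end-exclusive. Only one strand is scanned.
    """
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in PQS_PATTERN.finditer(seq)]


# Human telomeric-repeat-like toy sequence.
print(find_canonical_pqs("AAGGGTTAGGGTTAGGGTTAGGGAA"))
```

A fixed pattern like this cannot represent a G-run interrupted by a bulge or a loop long enough to fold into a stem-loop, which is the gap the abstract says QPARSE was built to address.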


FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at the publisher’s Web site.


MolRep: A Deep Representation Learning Library for Molecular Property Prediction

10.1101/2021.01.13.426489 ◽

2021 ◽

Author(s):

Jiahua Rao ◽

Shuangjia Zheng ◽

Ying Song ◽

Jianwen Chen ◽

Chengtao Li ◽

...

Keyword(s):

State Of The Art ◽

Source Code ◽

Representation Learning ◽

Supplementary Information ◽

Data Sets ◽

Supplementary Data ◽

Property Prediction ◽

Average Rank ◽

Benchmark Data ◽

Classification Tasks

Abstract Summary Recently, novel representation learning algorithms have shown potential for predicting molecular properties. However, unified frameworks have not yet emerged for fairly measuring algorithmic progress, and experimental procedures of different representation models often lack rigor and are hardly reproducible. Herein, we have developed MolRep by unifying 16 state-of-the-art models across 4 popular molecular representations for application and comparison. Furthermore, we ran more than 12.5 million experiments to optimize hyperparameters for each method on 12 common benchmark data sets. As a result, CMPNN achieves the best results, ranking 1st in 5 out of 12 tasks with an average rank of 1.75. ECC performs well in classification tasks and MAT in regression (both ranked 1st for 3 tasks), with average ranks of 2.71 and 2.6, respectively. Availability The source code is available at: https://github.com/biomed-AI/MolRep Supplementary information Supplementary data are available online.
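
The "average rank" summary used above can be sketched in a few lines: rank the models within each task, then average each model's rank over the tasks. The function name, model labels and scores below are hypothetical and are not MolRep results. The sketch also assumes a single metric direction and no tied scores, whereas a real benchmark that mixes classification and regression metrics would need a direction per task.

```python
def average_ranks(scores, higher_is_better=True):
    """Compute each model's mean rank across tasks.

    scores: {task: {model: score}}. Rank 1 is the best model on a task.
    Ties are not handled specially; tied models keep their sorted order.
    """
    totals = {}
    counts = {}
    for task_scores in scores.values():
        ordered = sorted(task_scores, key=task_scores.get, reverse=higher_is_better)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Made-up scores for three placeholder models on two tasks.
toy_scores = {
    "task1": {"X": 0.90, "Y": 0.80, "Z": 0.70},
    "task2": {"X": 0.60, "Y": 0.90, "Z": 0.50},
}
print(average_ranks(toy_scores))
```

In this toy example X and Y each win one task and share the same average rank, which shows why the abstract reports both the number of first places and the average rank.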


Accurate prediction of protein beta-aggregation with generalized statistical potentials

Bioinformatics ◽

10.1093/bioinformatics/btz912 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2076-2081

Author(s):

Gabriele Orlando ◽

Alexandra Silva ◽

Sandra Macedo-Ribeiro ◽

Daniele Raimondi ◽

Wim Vranken

Keyword(s):

State Of The Art ◽

Point Mutations ◽

The State ◽

Supplementary Information ◽

In Vitro Mutagenesis ◽

Statistical Potentials ◽

Structural Environment ◽

Art Methods ◽

Definition Of

Abstract Motivation Protein beta-aggregation is an important but poorly understood phenomenon involved in diseases as well as in beneficial physiological processes. Although it has been investigated for over 50 years, very little is known about its mechanisms of action. Moreover, the identification of regions involved in aggregation is still an open problem and the state-of-the-art methods are often inadequate in real case applications. Results In this article we present AgMata, an unsupervised tool for the identification of such regions from amino acid sequence based on a generalized definition of statistical potentials that includes biophysical information. The tool outperforms the state-of-the-art methods on two different benchmarks. As a case study, we applied our tool to human ataxin-3, a protein involved in Machado–Joseph disease. Interestingly, AgMata identifies aggregation-prone residues that share the very same structural environment. Additionally, it successfully predicts the outcome of in vitro mutagenesis experiments, identifying point mutations that lead to an alteration of the aggregation propensity of the wild-type ataxin-3. Availability and implementation A Python implementation of the tool is available at https://bitbucket.org/bio2byte/agmata. Supplementary information Supplementary data are available at Bioinformatics online.
