VIRULIGN: fast codon-correct alignment and annotation of viral genomes

Abstract Virus sequence data are an essential resource for reconstructing the spatiotemporal dynamics of viral spread and for informing treatment and prevention strategies. However, the potential benefit for these applications critically depends on accurate and correctly annotated alignments of genetically heterogeneous data. VIRULIGN was built for fast codon-correct alignments of large datasets, with standardized genome annotation and various alignment export formats. VIRULIGN is freely available at https://github.com/rega-cev/virulign as an open source software project.

Genomic epidemiology reveals multiple introductions of SARS-CoV-2 followed by community and nosocomial spread, Germany, February to May 2020

Eurosurveillance ◽

10.2807/1560-7917.es.2021.26.43.2002066 ◽

2021 ◽

Vol 26 (43) ◽

Author(s):

Maximilian Muenchhoff ◽

Alexander Graf ◽

Stefan Krebs ◽

Caroline Quartucci ◽

Sandra Hasmann ◽

...

Keyword(s):

Healthcare Workers ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Local Level ◽

University Hospital ◽

Metropolitan Region ◽

Viral Genomes ◽

Genomic Epidemiology ◽

Viral Spread ◽

Spatio Temporal

Background In the SARS-CoV-2 pandemic, viral genomes are available at unprecedented speed, but spatio-temporal bias in genome sequence sampling precludes phylogeographical inference without additional contextual data. Aim We applied genomic epidemiology to trace SARS-CoV-2 spread on an international, national and local level, to illustrate how transmission chains can be resolved to the level of a single event and single person using integrated sequence data and spatio-temporal metadata. Methods We investigated 289 COVID-19 cases at a university hospital in Munich, Germany, between 29 February and 27 May 2020. Using the ARTIC protocol, we obtained near full-length viral genomes from 174 SARS-CoV-2-positive respiratory samples. Phylogenetic analyses using the Auspice software were employed in combination with anamnestic reporting of travel history, interpersonal interactions and perceived high-risk exposures among patients and healthcare workers to characterise cluster outbreaks and establish likely scenarios and timelines of transmission. Results We identified multiple independent introductions in the Munich Metropolitan Region during the first weeks of the first pandemic wave, mainly by travellers returning from popular skiing areas in the Alps. In these early weeks, the rate of presumable hospital-acquired infections among patients and in particular healthcare workers was high (9.6% and 54%, respectively) and we illustrated how transmission chains can be dissected at high resolution combining virus sequences and spatio-temporal networks of human interactions. Conclusions Early spread of SARS-CoV-2 in Europe was catalysed by superspreading events and regional hotspots during the winter holiday season. Genomic epidemiology can be employed to trace viral spread and inform effective containment strategies.

ViralMSA: Massively scalable reference-guided multiple sequence alignment of viral genomes

10.1101/2020.04.20.052068 ◽

2020 ◽

Cited By ~ 1

Author(s):

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Genomic Sequence ◽

Sequence Data ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Alignment Tool ◽

Multiple Sequence Alignment Tool ◽

Algorithmic Techniques

Abstract Motivation In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment scale poorly with respect to the number of sequences. Results ViralMSA is a user-friendly reference-guided multiple sequence alignment tool that leverages the algorithmic techniques of read mappers to enable the multiple sequence alignment of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. Availability ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact [email protected]

ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa743 ◽

2020 ◽

Author(s):

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Genomic Sequence ◽

Sequence Data ◽

Supplementary Information ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Algorithmic Techniques ◽

User Friendly

Abstract Motivation In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
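The limitation noted in this abstract, that insertions relative to the reference are omitted, follows from placing each mapped genome into reference coordinates. The Python sketch below illustrates the idea under stated assumptions and is not ViralMSA's own code. The function name `project_onto_reference` is hypothetical, and the input is assumed to be a query sequence with a 0-based mapping start and a CIGAR string of the kind a read mapper reports in SAM output.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_onto_reference(query, ref_start, cigar, ref_len):
    """Place a mapped query into reference coordinates (hypothetical helper).

    Returns a string of exactly ref_len characters. Columns not covered by
    the query, and columns deleted in the query, are gaps ('-'). Bases that
    are inserted relative to the reference have no reference column and are
    therefore dropped.
    """
    row = ["-"] * ref_len
    q = 0          # next unread position in the query
    r = ref_start  # next reference column to fill (0-based)
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Match or mismatch: copy query bases into reference columns.
            row[r:r + n] = query[q:q + n]
            q += n
            r += n
        elif op in "DN":
            # Deletion or skipped region: advance the reference, keep gaps.
            r += n
        elif op in "IS":
            # Insertion or soft clip: consume query bases, emit nothing.
            q += n
        # H (hard clip) and P (padding) consume neither sequence.
    return "".join(row)


# The two inserted bases "TT" are lost in the projected row.
print(project_onto_reference("CGTTTAC", 1, "3M2I2M", 8))  # -CGTAC--
```

Because every row has the reference length, the rows can be stacked into a multiple sequence alignment without any pairwise comparison between queries, which is consistent with the linear scaling described in the abstract.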

BN++ – A Biological Information System

Journal of Integrative Bioinformatics ◽

10.1515/jib-2006-34 ◽

2006 ◽

Vol 3 (2) ◽

pp. 148-161 ◽

Cited By ~ 11

Author(s):

Jan Küntzer ◽

Torsten Blum ◽

Andreas Gerasch ◽

Christina Backes ◽

Andreas Hildebrandt ◽

...

Keyword(s):

Data Warehouse ◽

Open Source Software ◽

Regulatory Networks ◽

Sequence Data ◽

Biological Information ◽

Biochemical Network ◽

Biochemical Data ◽

Protein Interaction Data ◽

Class Library ◽

Essential Resource

Summary Recent years have seen an explosive growth in the amount of biochemical data available. Numerous databases have been established and are being used as an essential resource by biologists around the world. The sheer amount and heterogeneity of these data pose a major challenge: data integration and, based thereupon, the integrative analysis of these data. We present BN++, the biochemical network library, a powerful software package for integrating, analyzing, and visualizing biochemical data in the context of networks. BN++ is based on a comprehensive and extensible object model (BioCore), which has been implemented as a C++ framework, a Java class library, and a relational database. The C++ framework is used to efficiently import, integrate, and analyze the data, which is stored in a data warehouse. The Java-based viewer (BiNA) provides a powerful platform-independent visualization of the data using sophisticated graph layout algorithms. Currently, the data warehouse imports and integrates data from about a dozen important databases including, among others, sequence data, metabolic and regulatory networks, and protein interaction data. We illustrate the usefulness of BN++ with a few select example applications. Availability: BN++ is open source software available from our website at www.bnplusplus.org.

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Nature ◽

10.1038/s41586-021-03205-y ◽

2021 ◽

Vol 590 (7845) ◽

pp. 290-299 ◽

Cited By ~ 22

Author(s):

Daniel Taliun ◽

Daniel N. Harris ◽

Michael D. Kessler ◽

Jedidiah Carlson ◽

...

Keyword(s):

Rare Variants ◽

Sequence Data ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Phenotypic Data ◽

Treatment And Prevention ◽

Genome Wide ◽

Diverse Backgrounds ◽

Unmapped Reads

Abstract The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

Nextstrain: real-time tracking of pathogen evolution

10.1101/224048 ◽

2017 ◽

Cited By ~ 21

Author(s):

James Hadfield ◽

Colin Megill ◽

Sidney M. Bell ◽

John Huddleston ◽

Barney Potter ◽

...

Keyword(s):

Public Health ◽

Real Time ◽

Web Application ◽

Sequence Data ◽

Data Types ◽

Bioinformatics Pipeline ◽

Public Health Importance ◽

Viral Genomes ◽

Effective Public Health ◽

Interactive Visualisation

Abstract Summary Understanding the spread and evolution of pathogens is important for effective public health measures and surveillance. Nextstrain consists of a database of viral genomes, a bioinformatics pipeline for phylodynamics analysis, and an interactive visualisation platform. Together these present a real-time view into the evolution and spread of a range of viral pathogens of high public health importance. The visualization integrates sequence data with other data types such as geographic information, serology, or host species. Nextstrain compiles our current understanding into a single accessible location, publicly available for use by health professionals, epidemiologists, virologists and the public alike. Availability and implementation All code (predominantly JavaScript and Python) is freely available from github.com/nextstrain and the web-application is available at nextstrain.org.

VGEA: an RNA viral assembly toolkit

PeerJ ◽

10.7717/peerj.12129 ◽

2021 ◽

Vol 9 ◽

pp. e12129

Author(s):

Paul E. Oluniyi ◽

Fehintola Ajogbasile ◽

Judith Oguzie ◽

Jessica Uwanibe ◽

Adeyemi Kayode ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Workflow Management ◽

Viral Population ◽

Lassa Virus ◽

Viral Genomes ◽

Bioinformatics Tools ◽

Reference Sequences ◽

Genome Assemblies

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information; these data can be used to study viral epidemiology, evolution and transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics because of their high mutation rate and the quasispecies they form within a single infected host, which creates the need for advanced bioinformatics tools that assemble consensus genomes representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry out de novo or reference-assisted assembly of viral genomes and assess the quality of the genomes obtained. Most of these tools, however, exist as standalone workflows and usually require huge computational resources. Here we present VGEA (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer.
VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License.

Making Genomic Surveillance Deliver: A Lineage Classification and Nomenclature System to Inform Rabies Virus Elimination

10.1101/2021.10.13.464180 ◽

2021 ◽

Author(s):

Kathryn Campbell ◽

Robert J Gifford ◽

Joshua Singer ◽

Verity Hill ◽

Aine O'Toole ◽

...

Keyword(s):

Rabies Virus ◽

Sequence Data ◽

Classification Systems ◽

Virus Elimination ◽

Data Resource ◽

Genomic Tools ◽

Control Programmes ◽

History Of ◽

Virus Sequence ◽

Globally Distributed

The availability of pathogen sequence data and the use of genomic surveillance are rapidly increasing. Genomic tools and classification systems need updating to reflect this. Here, rabies virus is used as an example to showcase the potential value of updated genomic tools to enhance surveillance, to better understand epidemiological dynamics and to improve disease control. Previous studies have described the evolutionary history of rabies virus; however, the resulting taxonomy lacks the definition necessary to identify incursions, lineage turnover and transmission routes at high resolution. Here we propose a lineage classification system based on the dynamic nomenclature used for SARS-CoV-2, defining a lineage by phylogenetic methods for tracking virus spread and comparing sequences across geographic areas. We demonstrate this system through application to the globally distributed Cosmopolitan clade of rabies virus, defining 73 total lineages within the clade, beyond the 22 previously reported. We further show how integration of this tool with a new rabies virus sequence data resource (RABV-GLUE) enables rapid application, for example, highlighting lineage dynamics relevant to control and elimination programmes, such as identifying importations and their sources, and areas of persistence and transmission, including transboundary incursions. This system and the tools developed should be useful for coordinating and targeting control programmes and monitoring progress as we work towards eliminating dog-mediated rabies, as well as having potential for broad application to the surveillance of other viruses.

Porcine Teschoviruses Comprise at Least Eleven Distinct Serotypes: Molecular and Evolutionary Aspects

Journal of Virology ◽

10.1128/jvi.75.4.1620-1631.2001 ◽

2001 ◽

Vol 75 (4) ◽

pp. 1620-1631 ◽

Cited By ~ 85

Author(s):

Roland Zell ◽

Malte Dauber ◽

Andi Krumbholz ◽

Andreas Henke ◽

Eckhard Birch-Hirschfeld ◽

...

Keyword(s):

Neutralizing Antibodies ◽

Sequence Data ◽

Nucleotide Sequencing ◽

Genome Region ◽

Viral Genomes ◽

Field Isolates ◽

Porcine Enterovirus ◽

Group I ◽

Close Relationship ◽

Serological Data

ABSTRACT Nucleotide sequencing and phylogenetic analysis of 10 recognized prototype strains of the porcine enterovirus (PEV) cytopathic effect (CPE) group I reveal a close relationship of the viral genomes to the previously sequenced strain F65, supporting the concept of a reclassification of this virus group into a new picornavirus genus. Also, nucleotide sequences of the polyprotein-encoding genome region or the P1 region of 28 historic strains and recent field isolates were determined. The data suggest that several closely related but antigenically and molecularly distinct serotypes constitute one species within the proposed genus Teschovirus. Based on sequence data and serological data, we propose a new serotype with strain Dresden as prototype. This hitherto unrecognized serotype is closely related to porcine teschovirus 1 (PTV-1, former PEV-1), but induces type-specific neutralizing antibodies. Sequencing of field isolates collected from animals presenting with neurological disorders proves that serotypes other than PTV-1 may also cause polioencephalomyelitis of swine.
