A Review on Viral Data Sources and Integration Methods for COVID-19 Mitigation

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of COVID-19 pandemics, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation, while setting the research ground for contrasting possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects.

A review on viral data sources and search systems for perspective mitigation of COVID-19

Briefings in Bioinformatics ◽

10.1093/bib/bbaa359 ◽

2020 ◽

Author(s):

Anna Bernasconi ◽

Arif Canakoglu ◽

Marco Masseroli ◽

Pietro Pinoli ◽

Stefano Ceri

Keyword(s):

Special Interest ◽

Critical Period ◽

Sequence Data ◽

Research Community ◽

Data Sources ◽

Common Variants ◽

Viral Sequence ◽

Genome Sequences ◽

Host Genotype ◽

Viral Sequences

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of COVID-19 pandemics, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for contrasting possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.

Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples

Frontiers in Microbiology ◽

10.3389/fmicb.2021.664560 ◽

2021 ◽

Vol 12 ◽

Author(s):

Kai Song

Keyword(s):

DNA Sequences ◽

Viral Genome ◽

Metagenomic Data ◽

Viral Sequence ◽

Genome Sequences ◽

Sequence Identification ◽

Viral Genes ◽

Eukaryotic DNA ◽

Viral Sequences

Metagenomes can be considered mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could provide insight into virus–host relationships and expand viral databases. Current alignment-based methods are unsuitable for identifying viral sequences from metagenome sequences because most assembled metagenomic contigs are short and possess few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model using a likelihood test to distinguish viral and bacterial sequences. Compared with two other state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, my proposed method outperformed VirFinder and performed similarly to PPR-Meta for short contigs with length less than 400 bp. VirMC outperformed VirFinder and PPR-Meta for identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC showed better performance in assembling viral-genome sequences from metagenomic data (based on filtering potential bacterial reads). Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help classify healthy and diseased statuses. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
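The general idea behind a Markov-chain likelihood score, as described in the abstract above, can be illustrated with a small sketch. This is not the VirMC implementation: the chain order `k`, the add-one pseudocount, the uniform fallback for unseen contexts, and the function names `train_markov`, `log_likelihood`, and `score` are all assumptions made for illustration, and the training sequences are toy strings rather than real genomes.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k=2, pseudo=1.0):
    """Estimate log P(next base | previous k bases) from training sequences.

    Hypothetical helper: order k and the pseudocount are illustrative choices.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            ctx, nxt = s[i - k:i], s[i]
            # Skip windows containing ambiguous bases such as N.
            if nxt in ALPHABET and set(ctx) <= set(ALPHABET):
                counts[ctx][nxt] += 1
    model = {}
    for ctx, c in counts.items():
        total = sum(c.values()) + pseudo * len(ALPHABET)
        model[ctx] = {b: math.log((c[b] + pseudo) / total) for b in ALPHABET}
    return model


def log_likelihood(seq, model, k=2):
    """Sum the log transition probabilities of seq under one model."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    seq = seq.upper()
    ll = 0.0
    for i in range(k, len(seq)):
        ctx, nxt = seq[i - k:i], seq[i]
        if nxt not in ALPHABET or not set(ctx) <= set(ALPHABET):
            continue
        ll += model.get(ctx, {}).get(nxt, uniform)
    return ll


def score(seq, viral_model, bacterial_model, k=2):
    """Log-likelihood ratio: positive favors the viral model."""
    return log_likelihood(seq, viral_model, k) - log_likelihood(seq, bacterial_model, k)


if __name__ == "__main__":
    # Toy training data (hypothetical, for illustration only).
    viral = train_markov(["AT" * 50], k=2)
    bacterial = train_markov(["GC" * 50], k=2)
    for contig in ("ATATATAT", "GCGCGCGC"):
        label = "viral-like" if score(contig, viral, bacterial) > 0 else "bacterial-like"
        print(contig, label)
```

A contig is scored by the difference between its log-likelihoods under the two trained chains, so no alignment or gene prediction is needed. In practice the decision threshold would be calibrated on labeled data rather than fixed at zero.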

Methodology of Big Data Integration from A Priori Unknown Heterogeneous Data Sources

Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence - CSAI '18 ◽

10.1145/3297156.3297249 ◽

2018 ◽

Author(s):

Alexey Samoylov ◽

Nikolay Sergeev ◽

Margarita Kucherova ◽

Boris Denisov

Keyword(s):

Big Data ◽

Data Integration ◽

A Priori ◽

Heterogeneous Data ◽

Data Sources ◽

Heterogeneous Data Sources

Challenges in evaluating the use of viral sequence data to identify HIV transmission networks for public health

Statistical Communications in Infectious Diseases ◽

10.1515/scid-2019-0019 ◽

2020 ◽

Vol 12 (s1) ◽

Author(s):

Rami Kantor ◽

John P. Fulton ◽

Jon Steingrimsson ◽

Vladimir Novitsky ◽

Mark Howison ◽

...

Keyword(s):

Public Health ◽

United States ◽

HIV Transmission ◽

Sequence Data ◽

The United States ◽

Viral Sequence ◽

Transmission Networks ◽

New Methods ◽

HIV Epidemic ◽

The World

Great efforts are devoted to ending the HIV epidemic as it continues to have profound public health consequences in the United States and throughout the world, and new interventions and strategies are continuously needed. The use of HIV sequence data to infer transmission networks holds much promise to direct public health interventions where they are most needed. As these new methods are being implemented, evaluating their benefits is essential. In this paper, we recognize challenges associated with such evaluation, and make the case that overcoming these challenges is key to the use of HIV sequence data in routine public health actions to disrupt HIV transmission networks.

Limited Genetic Diversity Detected in Middle East Respiratory Syndrome-Related Coronavirus Variants Circulating in Dromedary Camels in Jordan

Viruses ◽

10.3390/v13040592 ◽

2021 ◽

Vol 13 (4) ◽

pp. 592

Author(s):

Stephanie N. Seifert ◽

Jonathan E. Schulz ◽

Stacy Ricklefs ◽

Michael Letko ◽

Elangeni Yabba ◽

...

Keyword(s):

Genetic Diversity ◽

Middle East ◽

United Arab Emirates ◽

Sequence Data ◽

Case Fatality ◽

High Sensitivity ◽

Middle East Respiratory Syndrome ◽

Full Genome Sequence ◽

Genome Sequences ◽

Dromedary Camels

Middle East respiratory syndrome-related coronavirus (MERS-CoV) is a persistent zoonotic pathogen with frequent spillover from dromedary camels to humans in the Arabian Peninsula, resulting in limited outbreaks of MERS with a high case-fatality rate. Full genome sequence data from camel-derived MERS-CoV variants show diverse lineages circulating in domestic camels with frequent recombination. More than 90% of the available full MERS-CoV genome sequences derived from camels are from just two countries, the Kingdom of Saudi Arabia (KSA) and United Arab Emirates (UAE). In this study, we employ a novel method to amplify and sequence the partial MERS-CoV genome with high sensitivity from nasal swabs of infected camels. We recovered more than 99% of the MERS-CoV genome from field-collected samples with greater than 500 TCID50 equivalent per nasal swab from camel herds sampled in Jordan in May 2016. Our subsequent analyses of 14 camel-derived MERS-CoV genomes show a striking lack of genetic diversity circulating in Jordan camels relative to MERS-CoV genome sequences derived from large camel markets in KSA and UAE. The low genetic diversity detected in Jordan camels during our study is consistent with a lack of endemic circulation in these camel herds and reflective of data from MERS outbreaks in humans dominated by nosocomial transmission following a single introduction as reported during the 2015 MERS outbreak in South Korea. Our data suggest transmission of MERS-CoV among two camel herds in Jordan in 2016 following a single introduction event.

Occurrence and Expression of Gene Transfer Agent Genes in Marine Bacterioplankton

Applied and Environmental Microbiology ◽

10.1128/aem.02129-07 ◽

2008 ◽

Vol 74 (10) ◽

pp. 2933-2939 ◽

Cited By ~ 69

Author(s):

Erin J. Biers ◽

Kui Wang ◽

Catherine Pennington ◽

Robert Belas ◽

Feng Chen ◽

...

Keyword(s):

Gene Transfer ◽

Sequence Data ◽

Genome Sequences ◽

Metagenomic Sequence ◽

Marine Bacterioplankton ◽

Extracellular Release ◽

Gene Transfer Agent ◽

Metagenomic Sequence Data ◽

Search For Homologs

Genes with homology to the transduction-like gene transfer agent (GTA) were observed in genome sequences of three cultured members of the marine Roseobacter clade. A broader search for homologs for this host-controlled virus-like gene transfer system identified likely GTA systems in cultured Alphaproteobacteria, and particularly in marine bacterioplankton representatives. Expression of GTA genes and extracellular release of GTA particles (∼50 to 70 nm) was demonstrated experimentally for the Roseobacter clade member Silicibacter pomeroyi DSS-3, and intraspecific gene transfer was documented. GTA homologs are surprisingly infrequent in marine metagenomic sequence data, however, and the role of this lateral gene transfer mechanism in ocean bacterioplankton communities remains unclear.

Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea

International Journal of Systematic and Evolutionary Microbiology ◽

10.1099/ijs.0.054171-0 ◽

2014 ◽

Vol 64 (Pt_2) ◽

pp. 316-324 ◽

Cited By ~ 258

Author(s):

Jongsik Chun ◽

Fred A. Rainey

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Original Research ◽

rRNA Gene ◽

New Taxon ◽

Genome Sequences ◽

Microbial World ◽

Content Type ◽

Link Type ◽

Type Strains

The polyphasic approach used today in the taxonomy and systematics of the Bacteria and Archaea includes the use of phenotypic, chemotaxonomic and genotypic data. The use of 16S rRNA gene sequence data has revolutionized our understanding of the microbial world and led to a rapid increase in the number of descriptions of novel taxa, especially at the species level. It has allowed in many cases for the demarcation of taxa into distinct species, but its limitations in a number of groups have resulted in the continued use of DNA–DNA hybridization. As technology has improved, next-generation sequencing (NGS) has provided a rapid and cost-effective approach to obtaining whole-genome sequences of microbial strains. Although some 12 000 bacterial or archaeal genome sequences are available for comparison, only 1725 of these are of actual type strains, limiting the use of genomic data in comparative taxonomic studies when there are nearly 11 000 type strains. Efforts to obtain complete genome sequences of all type strains are critical to the future of microbial systematics. The incorporation of genomics into the taxonomy and systematics of the Bacteria and Archaea coupled with computational advances will boost the credibility of taxonomy in the genomic era. This special issue of International Journal of Systematic and Evolutionary Microbiology contains both original research and review articles covering the use of genomic sequence data in microbial taxonomy and systematics. It includes contributions on specific taxa as well as outlines of approaches for incorporating genomics into new strain isolation to new taxon description workflows.

Whole genome characterization of strains belonging to the Ralstonia solanacearum species complex and in silico analysis of TaqMan assays for detection in this heterogenous species complex

European Journal of Plant Pathology ◽

10.1007/s10658-020-02190-8 ◽

2021 ◽

Author(s):

Viola Kurm ◽

Ilse Houwers ◽

Claudia E. Coipan ◽

Peter Bonants ◽

Cees Waalwijk ◽

...

Keyword(s):

Ralstonia Solanacearum ◽

In Silico ◽

Species Complex ◽

Sequence Data ◽

In Silico Analysis ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequences ◽

PCR Assays

Identification and classification of members of the Ralstonia solanacearum species complex (RSSC) is challenging due to the heterogeneity of this complex. Whole genome sequence data of 225 strains were used to classify strains based on average nucleotide identity (ANI) and multilocus sequence analysis (MLSA). Based on the ANI score (>95%), 191 out of 192 (99.5%) RSSC strains could be grouped into the three species R. solanacearum, R. pseudosolanacearum, and R. syzygii, and into the four phylotypes within the RSSC (I, II, III, and IV). R. solanacearum phylotype II could be split into two groups (IIA and IIB), from which IIB clustered in three subgroups (IIBa, IIBb and IIBc). This division by ANI was in accordance with MLSA. The IIB subgroups found by ANI and MLSA also differed in the number of SNPs in the primer and probe sites of various assays. An in-silico analysis of eight TaqMan and 11 conventional PCR assays was performed using the whole genome sequences. Based on this analysis, several cases of potential false positives or false negatives can be expected upon the use of these assays for their intended target organisms. Two TaqMan assays and two PCR assays targeting the 16S rDNA sequence should be able to detect all phylotypes of the RSSC. We conclude that the increasing availability of whole genome sequences is not only useful for classification of strains, but also shows potential for selection and evaluation of clade-specific nucleic acid-based amplification methods within the RSSC.

VirusDIP: Virus Data Integration Platform

10.1101/2020.06.08.139451 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lina Wang ◽

Fengzhen Chen ◽

Xueqin Guo ◽

Lijin You ◽

Xiaoxia Yang ◽

...

Keyword(s):

Sequence Alignment ◽

Sequence Data ◽

Data Retrieval ◽

Viral Sequence ◽

Origin And Evolution ◽

Alignment Tool ◽

Public Data ◽

Virus Research ◽

Global Initiative ◽

Tree Building

Motivation: The Coronavirus Disease 2019 (COVID-19) pandemic poses a huge threat to human public health. Viral sequence data plays an important role in the scientific prevention and control of epidemics. A comprehensive virus database will be vitally useful for virus data retrieval and deep analysis. To promote sharing of virus data, several virus databases and related analysis tools have been created. Results: To facilitate virus research and promote the global sharing of virus data, we present here VirusDIP, a one-stop service platform for archiving, integration, access, and analysis of virus data. It accepts the submission of viral sequence data from all over the world and currently integrates data resources from the National GeneBank Database (CNGBdb), Global initiative on sharing all influenza data (GISAID), and National Center for Biotechnology Information (NCBI). Moreover, based on the comprehensive data resources, BLAST sequence alignment tool and multi-party security computing tools are deployed for multi-sequence alignment, phylogenetic tree building and global trusted sharing. VirusDIP is gradually establishing cooperation with more databases, and paving the way for the analysis of virus origin and evolution. All public data in VirusDIP are freely available for all researchers worldwide. Availability: https://db.cngb.org/virus/ [email protected]

An Ontology-based Visual Analytics for Apple Variety Testing

10.5194/egusphere-egu21-15804 ◽

2021 ◽

Author(s):

Ekaterina Chuprikova ◽

Abraham Mejia Aguilar ◽

Roberto Monsorno

Keyword(s):

Data Mining ◽

Data Analysis ◽

Data Integration ◽

Visual Analytics ◽

Agricultural Sector ◽

Environmental Data ◽

Data Sources ◽

Apple Variety ◽

Testing Program ◽

Variety Testing

Increasing agricultural production challenges, such as climate change, environmental concerns, energy demands, and growing expectations from consumers triggered the necessity for innovation using data-driven approaches such as visual analytics. Although the visual analytics concept was introduced more than a decade ago, the latest developments in the data mining capacities made it possible to fully exploit the potential of this approach and gain insights into high complexity datasets (multi-source, multi-scale, and different stages). The current study focuses on developing prototypical visual analytics for an apple variety testing program in South Tyrol, Italy. Thus, the work aims (1) to establish a visual analytics interface enabled to integrate and harmonize information about apple variety testing and its interaction with climate by designing a semantic model; and (2) to create a single visual analytics user interface that can turn the data into knowledge for domain experts. This study extends the visual analytics approach with a structural way of data organization (ontologies), data mining, and visualization techniques to retrieve knowledge from an extensive collection of apple variety testing program and environmental data. The prototype stands on three main components: ontology, data analysis, and data visualization. Ontologies provide a representation of expert knowledge and create standard concepts for data integration, opening the possibility to share the knowledge using a unified terminology and allowing for inference. Building upon relevant semantic models (e.g., agri-food experiment ontology, plant trait ontology, GeoSPARQL), we propose to extend them based on the apple variety testing and climate data.
Data integration and harmonization through developing an ontology-based model provides a framework for integrating relevant concepts and relationships between them, data sources from different repositories, and defining a precise specification for the knowledge retrieval. Besides, as the variety testing is performed on different locations, the geospatial component can enrich the analysis with spatial properties. Furthermore, the visual narratives designed within this study will give a better-integrated view of data entities' relations and the meaningful patterns and clustering based on semantic concepts. Therefore, the proposed approach is designed to improve decision-making about variety management through an interactive visual analytics system that can answer "what" and "why" about fruit-growing activities. Thus, the prototype has the potential to go beyond the traditional ways of organizing data by creating an advanced information system enabled to manage heterogeneous data sources and to provide a framework for more collaborative scientific data analysis. This study unites various interdisciplinary aspects and, in particular: Big Data analytics in the agricultural sector and visual methods; thus, the findings will contribute to the EU priority program in digital transformation in the European agricultural sector. This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 894215.
