Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples

Frontiers in Microbiology ◽

10.3389/fmicb.2021.664560 ◽

2021 ◽

Vol 12 ◽

Author(s):

Kai Song

Keyword(s):

DNA Sequences ◽

Viral Genome ◽

Metagenomic Data ◽

Viral Sequence ◽

Genome Sequences ◽

Sequence Identification ◽

Viral Genes ◽

Eukaryotic DNA ◽

Viral Sequences

Metagenomes can be considered mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could provide insight into virus–host relationships and expand viral databases. Current alignment-based methods are unsuitable for identifying viral sequences in metagenomes because most assembled metagenomic contigs are short and contain few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences in metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model using a likelihood test to distinguish viral from bacterial sequences. Compared with two other state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, VirMC outperformed VirFinder and performed similarly to PPR-Meta on short contigs (less than 400 bp). VirMC outperformed both VirFinder and PPR-Meta at identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC also showed better performance in assembling viral genome sequences from metagenomic data (based on filtering potential bacterial reads). Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help classify healthy and diseased statuses. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
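
The Markov-chain scoring idea described in this abstract can be illustrated with a minimal sketch. This is not the VirMC implementation: the chain order, the pseudocount, the toy training sequences, and the function names `train_markov`, `log_likelihood`, and `llr_score` are all assumptions made for illustration. The sketch fits one fixed-order Markov chain per class and scores a contig by the log-likelihood ratio between the two models.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"

def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, nxt = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if set(ctx + nxt) <= set(ALPHABET):
                counts[ctx][nxt] += 1
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in ALPHABET
        }
    return model

def log_likelihood(seq, model, order=2):
    """Sum the log transition probabilities along the sequence."""
    seq = seq.upper()
    ll = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if ctx in model and nxt in model[ctx]:
            ll += model[ctx][nxt]
    return ll

def llr_score(seq, viral_model, bacterial_model, order=2):
    """Positive scores favour the viral model, negative the bacterial model."""
    return (log_likelihood(seq, viral_model, order)
            - log_likelihood(seq, bacterial_model, order))

# Toy training data (hypothetical, chosen only to make the two classes differ).
viral_model = train_markov(["ATATATATATATATAT", "TATATATATATA"], order=2)
bacterial_model = train_markov(["GCGCGCGCGCGCGC", "CGCGCGCGCGCG"], order=2)
print(llr_score("ATATATAT", viral_model, bacterial_model))
```

In practice the decision threshold on the score, and the chain order, would have to be chosen on held-out data; the abstract does not state which values the author used.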

Integration of viral DNA sequences in cells transformed by adenovirus 2 or SV40

Proceedings of the Royal Society of London Series B Biological Sciences ◽

10.1098/rspb.1980.0144 ◽

1980 ◽

Vol 210 (1180) ◽

pp. 423-435 ◽

Cited By ~ 2

Keyword(s):

Cell Lines ◽

DNA Sequences ◽

Viral Genome ◽

Viral DNA ◽

Viral Genomes ◽

Viral DNAs ◽

Cellular DNA ◽

Viral Insertion ◽

Viral Sequences ◽

Heteroduplex Formation

We have cloned and propagated in prokaryotic vectors the viral DNA sequences that are integrated in a variety of cells transformed by adenovirus 2 or SV40. Analysis of the clones reveals that the viral DNA sequences sometimes are arranged in a simple fashion, collinear with the viral genome; in other cell lines there are complex arrangements of viral sequences in which tracts of the viral genome are inverted with respect to each other. In several cases the nucleotide sequences at the joints between cell and viral sequences have been determined: usually there is a sharp transition between cellular and viral DNAs. The viral sequences are integrated at different locations within the genomes of different cell lines; likewise there is no specific site on the viral genomes at which integration occurs. Sometimes the viral sequences are integrated within repetitive cellular DNA, and sometimes within unique sequences. In some cases there is evidence that the viral sequences along with the flanking cell DNA have been amplified after integration. The sequences that flank the viral insertion in the line of SV40-transformed rat cells known as 14B have been used as probes to isolate, from untransformed rat cells, clones that carry the region of the chromosome in which integration occurred. Analysis of the structure of these clones by restriction endonuclease digestion and heteroduplex formation shows that a rearrangement of cellular sequences has occurred, presumably as a consequence of integration.

Draft Genome Sequences of Six Novel Picorna-Like Viruses from Washington State Spiders

Genome Announcements ◽

10.1128/genomea.01705-16 ◽

2017 ◽

Vol 5 (9) ◽

Cited By ~ 4

Author(s):

Ryan C. Shean ◽

Negar Makhsous ◽

Rodney L. Crawford ◽

Keith R. Jerome ◽

Alexander L. Greninger

Keyword(s):

Amino Acid ◽

Viral Genome ◽

Draft Genome ◽

Amino Acid Identity ◽

Washington State ◽

Genome Sequences ◽

Acid Identity ◽

Spider Species ◽

Viral Sequences

ABSTRACT We report draft genome sequences of six novel Picornavirales members from six different spider species found in Washington state. These six viral sequences distinctly clustered together phylogenetically with less than 35% amino acid identity to the closest reference viral genome.
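
The "amino acid identity" figure quoted above can be made concrete with a small sketch. It assumes the two protein sequences have already been aligned (gaps written as `-`) and that identity is counted over ungapped columns only; other denominators are also in common use, and the abstract does not say which one the authors applied. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0

# Five ungapped columns, four of them identical.
print(percent_identity("MKT-LV", "MRTALV"))  # 80.0
```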

Simulation Study and Comparative Evaluation of Viral Contiguous Sequence Identification Tools

10.21203/rs.3.rs-287089/v1 ◽

2021 ◽

Author(s):

Cody Glickman ◽

Jo Hendrix ◽

Michael Strong

Keyword(s):

Machine Learning ◽

Microbial Communities ◽

State Of The Art ◽

Metagenomic Data ◽

Bioinformatic Tools ◽

Sequence Identification ◽

Tool Performance ◽

Bacterial Genes ◽

Viral Sequences ◽

Read Distribution

Abstract Background: Viruses, including bacteriophage, are important components of environmental and human associated microbial communities. Viruses can act as extracellular reservoirs of bacterial genes, can mediate microbiome dynamics, and can influence the virulence of clinical pathogens. It is essential, therefore, to have robust sequence analysis methods in place to detect and annotate viral elements within microbial communities. Various targeted metagenomic analysis techniques detect viral sequences, but these methods often exclude large and genome integrated viruses. In this study, we evaluate and compare the ability of nine state-of-the-art bioinformatic tools, including Vibrant, VirSorter, VirSorter2, VirFinder, DeepVirFinder, MetaPhinder, JGI Earth Virome Pipeline, Kraken 2, and VirBrant, to identify viral contiguous sequences (contigs) across simulated metagenomes with different read distributions, taxonomic compositions, and complexities. Results: Of the tools tested in this study, VirSorter achieved the best F1 score while Vibrant had the highest average F1 score at predicting integrated prophages. Though less balanced in its precision and recall, Kraken 2 had the highest average precision by a substantial margin. We introduced the machine learning tool, VirBrant, which demonstrated an improvement in average F1 score over tools such as MetaPhinder. The tool utilizes machine learning with both protein compositional and nucleotide features. The addition of nucleotide features improves the precision and recall compared to the protein compositional features alone. Viral identification by all tools was not impacted by underlying read distribution but did improve with contig length. Tool performance was inversely related to taxonomic complexity and varied by the phage host. Rhizobium and Enterococcus phage were identified consistently by the tools, whereas Neisseria phage were commonly missed in this study. Conclusion: This study benchmarked the performance of nine state-of-the-art bioinformatic tools to identify viral contigs across different simulation conditions. This study explored the ability of the tools to identify integrated prophage elements traditionally excluded from targeted sequencing approaches. Our comprehensive analysis of viral identification tools to assess their performance in a variety of situations provides valuable insights to viral researchers looking to mine viral elements from publicly available metagenomic data.
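
The precision, recall, and F1 metrics used to rank the tools can be sketched as follows. This is a generic illustration, not the authors' evaluation code; the function name `precision_recall_f1` and the contig identifiers are hypothetical. It treats each tool's output as a set of contigs called viral and compares it with the set of truly viral contigs in a simulated metagenome.

```python
def precision_recall_f1(true_viral, predicted_viral):
    """Compute precision, recall, and F1 from two collections of contig IDs."""
    true_viral = set(true_viral)
    predicted_viral = set(predicted_viral)
    tp = len(true_viral & predicted_viral)   # viral contigs correctly called
    fp = len(predicted_viral - true_viral)   # non-viral contigs called viral
    fn = len(true_viral - predicted_viral)   # viral contigs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical example: four truly viral contigs, three predictions, two correct.
truth = {"contig_1", "contig_2", "contig_3", "contig_4"}
calls = {"contig_1", "contig_2", "contig_5"}
print(precision_recall_f1(truth, calls))
```

A tool that makes few but mostly correct calls scores high precision and low recall, which is the "less balanced" pattern the abstract reports for Kraken 2; F1 is the harmonic mean of the two, so it penalizes that imbalance.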

B-lymphocytic hairy cells contain no HTLV-II DNA sequences

Blood ◽

10.1182/blood.v72.4.1428.bloodjournal7241428 ◽

1988 ◽

Vol 72 (4) ◽

pp. 1428-1430

Author(s):

T Lion ◽

N Razvi ◽

HM Golomb ◽

RH Brownstein

Keyword(s):

T Cell ◽

B Cell ◽

DNA Sequences ◽

Viral Genome ◽

Mononuclear Cells ◽

Hairy Cell ◽

Cell Leukemia ◽

Cell Form ◽

Hairy Cells ◽

Viral Sequences

HTLV-II has been found in some cases of the rare T-cell form of hairy-cell leukemia (HCL) and in a leukopenic chronic T-cell leukemia mimicking HCL. We asked whether the virus is implicated in the more frequent B-cell form of HCL. DNA extracted from the mononuclear cells derived from spleen (eight cases) or peripheral blood (eight cases) of 16 patients with the B-cell form of HCL was probed. No viral sequences were detected at levels of sensitivity as low as one viral genome in five cells. Therefore HTLV-II may not be involved in the B-cell form of HCL.

Simulation study and comparative evaluation of viral contiguous sequence identification tools

BMC Bioinformatics ◽

10.1186/s12859-021-04242-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Cody Glickman ◽

Jo Hendrix ◽

Michael Strong

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Gene Content ◽

Metagenomic Data ◽

Bioinformatic Tools ◽

Sequence Identification ◽

Tool Performance ◽

Bacterial Genes ◽

Viral Sequences ◽

Read Distribution

Abstract Background: Viruses, including bacteriophages, are important components of environmental and human associated microbial communities. Viruses can act as extracellular reservoirs of bacterial genes, can mediate microbiome dynamics, and can influence the virulence of clinical pathogens. Various targeted metagenomic analysis techniques detect viral sequences, but these methods often exclude large and genome integrated viruses. In this study, we evaluate and compare the ability of nine state-of-the-art bioinformatic tools, including Vibrant, VirSorter, VirSorter2, VirFinder, DeepVirFinder, MetaPhinder, Kraken 2, Phybrid, and a BLAST search using identified proteins from the Earth Virome Pipeline to identify viral contiguous sequences (contigs) across simulated metagenomes with different read distributions, taxonomic compositions, and complexities. Results: Of the tools tested in this study, VirSorter achieved the best F1 score while Vibrant had the highest average F1 score at predicting integrated prophages. Though less balanced in its precision and recall, Kraken 2 had the highest average precision by a substantial margin. We introduced the machine learning tool, Phybrid, which demonstrated an improvement in average F1 score over tools such as MetaPhinder. The tool utilizes machine learning with both gene content and nucleotide features. The addition of nucleotide features improves the precision and recall compared to the gene content features alone. Viral identification by all tools was not impacted by underlying read distribution but did improve with contig length. Tool performance was inversely related to taxonomic complexity and varied by the phage host. For instance, Rhizobium and Enterococcus phages were identified consistently by the tools, whereas Neisseria prophage sequences were commonly missed in this study. Conclusion: This study benchmarked the performance of nine state-of-the-art bioinformatic tools to identify viral contigs across different simulation conditions. This study explored the ability of the tools to identify integrated prophage elements traditionally excluded from targeted sequencing approaches. Our comprehensive analysis of viral identification tools to assess their performance in a variety of situations provides valuable insights to viral researchers looking to mine viral elements from publicly available metagenomic data.

Rapid screening and identification of viral pathogens in metagenomic data

BMC Medical Genomics ◽

10.1186/s12920-021-01138-z ◽

2021 ◽

Vol 14 (S6) ◽

Author(s):

Shiyang Song ◽

Liangxiao Ma ◽

Xintian Xu ◽

Han Shi ◽

Xuan Li ◽

...

Keyword(s):

Viral Genome ◽

Rapid Screening ◽

Metagenomic Data ◽

Viral Sequence ◽

Viral Pathogens ◽

Viral Pathogen ◽

Genome Reconstruction ◽

Screening And Identification ◽

Metagenomics Data ◽

NGS Data

Abstract Background: Virus screening and viral genome reconstruction are urgent and crucial for the rapid identification of viral pathogens, i.e., tracing the source and understanding the pathogenesis when a viral outbreak occurs. Next-generation sequencing (NGS) provides an efficient and unbiased way to identify viral pathogens in host-associated and environmental samples without prior knowledge. Despite the availability of software, data analysis still requires human operations. A mature pipeline is urgently needed when thousands of samples must be rapidly screened for viral pathogens and have their viral genomes reconstructed. Results: In this paper, we present a rapid and accurate workflow that screens metagenomics sequencing data for viral pathogens and other compositions and enables a reference-based assembler to reconstruct viral genomes. We tested our workflow on several metagenomics datasets, including NGS data from a SARS-CoV-2 patient sample, pangolin tissues, and Middle East Respiratory Syndrome (MERS)-infected cells. Our workflow demonstrated high accuracy and efficiency when identifying target viruses from large-scale NGS metagenomics data. It was flexible when working with a broad range of NGS datasets, from small (kb) to large (100 Gb), and each task took from a few minutes to a few hours to complete. The workflow also automatically generates reports that incorporate visualized feedback (e.g., metagenomics data quality statistics, host and viral sequence compositions, details about each of the identified viral pathogens and their coverages, and reassembled viral pathogen sequences based on their closest references). Conclusions: Overall, our system enabled the rapid screening and identification of viral pathogens from metagenomics data, providing an important piece to support viral pathogen research during a pandemic. 
The visualized report contains information from raw sequence quality to a reconstructed viral sequence, which allows non-professional people to screen their samples for viruses by themselves (Additional file 1).

A Review on Viral Data Sources and Integration Methods for COVID-19 Mitigation

10.20944/preprints202008.0133.v1 ◽

2020 ◽

Author(s):

Anna Bernasconi ◽

Arif Canakoglu ◽

Marco Masseroli ◽

Pietro Pinoli ◽

Stefano Ceri

Keyword(s):

Data Integration ◽

Special Interest ◽

Critical Period ◽

Sequence Data ◽

Research Community ◽

Data Sources ◽

Viral Sequence ◽

Genome Sequences ◽

Host Genotype ◽

Viral Sequences

With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation, while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects.

B-lymphocytic hairy cells contain no HTLV-II DNA sequences

Blood ◽

10.1182/blood.v72.4.1428.1428 ◽

1988 ◽

Vol 72 (4) ◽

pp. 1428-1430 ◽

Cited By ~ 5

Author(s):

T Lion ◽

N Razvi ◽

HM Golomb ◽

RH Brownstein

Keyword(s):

T Cell ◽

B Cell ◽

DNA Sequences ◽

Viral Genome ◽

Mononuclear Cells ◽

Hairy Cell ◽

Cell Leukemia ◽

Cell Form ◽

Hairy Cells ◽

Viral Sequences

Abstract HTLV-II has been found in some cases of the rare T-cell form of hairy-cell leukemia (HCL) and in a leukopenic chronic T-cell leukemia mimicking HCL. We asked whether the virus is implicated in the more frequent B-cell form of HCL. DNA extracted from the mononuclear cells derived from spleen (eight cases) or peripheral blood (eight cases) of 16 patients with the B-cell form of HCL was probed. No viral sequences were detected at levels of sensitivity as low as one viral genome in five cells. Therefore HTLV-II may not be involved in the B-cell form of HCL.

A review on viral data sources and search systems for perspective mitigation of COVID-19

Briefings in Bioinformatics ◽

10.1093/bib/bbaa359 ◽

2020 ◽

Author(s):

Anna Bernasconi ◽

Arif Canakoglu ◽

Marco Masseroli ◽

Pietro Pinoli ◽

Stefano Ceri

Keyword(s):

Special Interest ◽

Critical Period ◽

Sequence Data ◽

Research Community ◽

Data Sources ◽

Common Variants ◽

Viral Sequence ◽

Genome Sequences ◽

Host Genotype ◽

Viral Sequences

Abstract With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.

Molecular Evidence for Nosocomial Transmission of Human Immunodeficiency Virus from a Surgeon to One of His Patients

Journal of Virology ◽

10.1128/jvi.72.5.4537-4540.1998 ◽

1998 ◽

Vol 72 (5) ◽

pp. 4537-4540 ◽

Cited By ~ 48

Author(s):

Alain Blanchard ◽

Stéphane Ferris ◽

Sophie Chamaret ◽

Denise Guétard ◽

Luc Montagnier

Keyword(s):

Human Immunodeficiency Virus ◽

Viral Genome ◽

PCR Amplification ◽

Nosocomial Transmission ◽

Molecular Evidence ◽

Immunodeficiency Virus ◽

Reference Sequences ◽

HIV Type 1 ◽

Viral Sequences

ABSTRACT We have investigated the molecular evidence in favor of the transmission of human immunodeficiency virus (HIV) from an HIV-infected surgeon to one of his patients. After PCR amplification, the env and gag sequences from the viral genome were cloned and sequenced. Phylogenetic analysis revealed that the viral sequences derived from the surgeon and his patient are closely related, which strongly suggests that nosocomial transmission occurred. In addition, these viral sequences belong to group M of HIV type 1 but are divergent from the reference sequences of the known subtypes.
