Accurate viral genome reconstruction and host assignment with proximity-ligation sequencing

2021 ◽  
Author(s):  
Gherman Uritskiy ◽  
Maximillian Press ◽  
Christine Sun ◽  
Guillermo Dominguez Huerta ◽  
Ahmed A. Zayed ◽  
...  

Viruses play crucial roles in the ecology of microbial communities, yet they remain relatively understudied in their native environments. Despite many advancements in high-throughput whole-genome sequencing (WGS), sequence assembly, and annotation of viruses, the reconstruction of full-length viral genomes directly from metagenomic sequencing is possible only for the most abundant phages and requires long-read sequencing technologies. Additionally, the prediction of their cellular hosts remains difficult from conventional metagenomic sequencing alone. To address these gaps in the field and to accelerate the study of viruses directly in their native microbiomes, we developed an end-to-end bioinformatics platform for viral genome reconstruction and host attribution from metagenomic data using proximity-ligation sequencing (i.e., Hi-C). We demonstrate the capabilities of the platform by recovering and characterizing the metavirome of a variety of metagenomes, including a fecal microbiome that has also been sequenced with accurate long reads, allowing for the assessment and benchmarking of the new methods. The platform can accurately extract numerous near-complete viral genomes even from highly fragmented short-read assemblies and can reliably predict their cellular hosts with minimal false positives. To our knowledge, this is the first software for performing these tasks. Being significantly cheaper than long-read sequencing of comparable depth, the incorporation of proximity-ligation sequencing in microbiome research shows promise to greatly accelerate future advancements in the field.

Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of biological samples for pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening because of the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we applied other approaches and showed that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity and therefore necessitates the use of combined approaches, including those based on machine learning methods.


mSystems ◽  
2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Robert H. Mills ◽  
Yoshiki Vázquez-Baeza ◽  
Qiyun Zhu ◽  
Lingjing Jiang ◽  
James Gaffney ◽  
...  

ABSTRACT Although genetic approaches are the standard in microbiome analysis, proteome-level information is largely absent. This discrepancy warrants a better understanding of the relationship between gene copy number and protein abundance, as this is crucial information for inferring protein-level changes from metagenomic data. As it remains unknown how metaproteomic systems evolve during dynamic disease states, we leveraged a 4.5-year fecal time series using samples from a single patient with colonic Crohn’s disease. Utilizing multiplexed quantitative proteomics and shotgun metagenomic sequencing of eight time points in technical triplicate, we quantified over 29,000 protein groups and 110,000 genes and compared them to five protein biomarkers of disease activity. Broad-scale observations were consistent between data types, including overall clustering by principal-coordinate analysis and fluctuations in Gene Ontology terms related to Crohn’s disease. Through linear regression, we determined genes and proteins fluctuating in conjunction with inflammatory metrics. We discovered conserved taxonomic differences relevant to Crohn’s disease, including a negative association of Faecalibacterium and a positive association of Escherichia with calprotectin. Despite concordant associations of genera, the specific genes correlated with these metrics were drastically different between metagenomic and metaproteomic data sets. This resulted in the generation of unique functional interpretations dependent on the data type, with metaproteome evidence for previously investigated mechanisms of dysbiosis. An example of one such mechanism was a connection between urease enzymes, amino acid metabolism, and the local inflammation state within the patient. This proof-of-concept approach prompts further investigation of the metaproteome and its relationship with the metagenome in biologically complex systems such as the microbiome. 
IMPORTANCE A majority of current microbiome research relies heavily on DNA analysis. However, as the field moves toward understanding the microbial functions related to healthy and disease states, it is critical to evaluate how changes in DNA relate to changes in proteins, which are functional units of the genome. This study tracked the abundance of genes and proteins as they fluctuated during various inflammatory states in a 4.5-year study of a patient with colonic Crohn’s disease. Our results indicate that despite a low level of correlation, taxonomic associations were consistent in the two data types. While there was overlap of the data types, several associations were uniquely discovered by analyzing the metaproteome component. This case study provides unique and important insights into the fundamental relationship between the genes and proteins of a single individual’s fecal microbiome associated with clinical consequences.


2021 ◽  
Author(s):  
Guangyang Wang ◽  
Shenghui Li ◽  
Qiulong Yan ◽  
Ruochun Guo ◽  
Yue Zhang ◽  
...  

Abstract Background: Viruses in the human gut have been linked to health and disease. Deciphering the gut virome depends on metagenomic sequencing of virus-like particles purified from fecal specimens. A major limitation of conventional viral metagenomic sequencing is the low recoverability of viral genomes from the metagenomic dataset. Results: Herein, we developed an optimized method for viral amplification and metagenomic sequencing to maximize the recovery of viral genomes. Using 5 fecal specimens with multiple repetitions, we determined the optimal number of PCR cycles for high-fidelity enzyme-based amplification and the reliability of multiple displacement amplification in virome DNA preparation, verified the reproducibility of the whole optimized viral metagenomic experimental process, and tested the capability of long-read sequencing to improve viral metagenomic assembly. Based on our optimized results, we generated 151 high-quality viral genomes using the combined data from short-read (15 cycles of PCR amplification) and long-read sequencing. Genomic analysis of these viruses found that most (60.3%) of them were previously unknown and showed a remarkable diversity of viral functions, notably the existence of 206 viral auxiliary metabolic genes. Finally, we compared the viral metagenomic and bulk metagenomic sequencing approaches and revealed significant differences in the efficiency and coverage of viral identification between them. Conclusions: Our study demonstrates the potential of optimized experimental and sequencing strategies for uncovering viral genomes from fecal specimens, which will facilitate future research on genome-level characterization of complex viral communities.


2021 ◽  
Vol 14 (S6) ◽  
Author(s):  
Shiyang Song ◽  
Liangxiao Ma ◽  
Xintian Xu ◽  
Han Shi ◽  
Xuan Li ◽  
...  

Abstract Background Virus screening and viral genome reconstruction are urgent and crucial for the rapid identification of viral pathogens, i.e., tracing the source and understanding the pathogenesis when a viral outbreak occurs. Next-generation sequencing (NGS) provides an efficient and unbiased way to identify viral pathogens in host-associated and environmental samples without prior knowledge. Despite the availability of software, data analysis still requires manual intervention. A mature pipeline is urgently needed when thousands of samples must be rapidly screened for viral pathogens and processed for viral genome reconstruction. Results In this paper, we present a rapid and accurate workflow that screens metagenomic sequencing data for viral pathogens and other components and uses a reference-based assembler to reconstruct viral genomes. We tested our workflow on several metagenomic datasets, including NGS data from a SARS-CoV-2 patient sample, pangolin tissues, and Middle East Respiratory Syndrome (MERS)-infected cells. Our workflow demonstrated high accuracy and efficiency when identifying target viruses from large-scale NGS metagenomic data. It was flexible across a broad range of NGS dataset sizes, from small (kb) to large (100 Gb), and took from a few minutes to a few hours to complete each task. In addition, our workflow automatically generates reports that incorporate visualized feedback (e.g., metagenomic data quality statistics, host and viral sequence compositions, details about each of the identified viral pathogens and their coverages, and reassembled viral pathogen sequences based on their closest references). Conclusions Overall, our system enabled the rapid screening and identification of viral pathogens from metagenomic data, providing an important piece to support viral pathogen research during a pandemic.
The visualized report contains information from raw sequence quality to a reconstructed viral sequence, which allows non-professional people to screen their samples for viruses by themselves (Additional file 1).


Author(s):  
Eva F. Caceres ◽  
William H. Lewis ◽  
Felix Homa ◽  
Tom Martin ◽  
Andreas Schramm ◽  
...  

Abstract Asgard archaea is a recently proposed superphylum currently comprising five recognised phyla: Lokiarchaeota, Thorarchaeota, Odinarchaeota, Heimdallarchaeota and Helarchaeota. Members of this group have been identified based on culture-independent approaches, with several metagenome-assembled genomes (MAGs) reconstructed to date. However, most of these genomes consist of several relatively small contigs, and, until recently, no complete Asgard archaea genome was available. Large-scale phylogenetic analyses suggest that Asgard archaea represent the closest archaeal relatives of eukaryotes. In addition, members of this superphylum encode proteins that were originally thought to be specific to eukaryotes, including components of the trafficking machinery, cytoskeleton and endosomal sorting complexes required for transport (ESCRT). Yet, these findings have been questioned on the basis that the genome sequences that underpin them were assembled from metagenomic data and could have been subject to contamination and other assembly artefacts. Even though several lines of evidence indicate that the previously reported findings were not affected by these issues, access to high-quality and preferably fully closed Asgard archaea genomes is needed to definitively close this debate. Current long-read sequencing technologies such as Oxford Nanopore allow the generation of long reads in a high-throughput manner, making them suitable for use in metagenomics. Although the use of long reads is still limited in this field, recent analyses have shown that it is feasible to obtain complete or near-complete genomes of abundant members of mock communities and metagenomes of various levels of complexity. Here, we show that long-read metagenomics can be successfully applied to obtain near-complete genomes of low-abundance members of complex communities from sediment samples.
We were able to reconstruct six MAGs from different Lokiarchaeota lineages that show high completeness and low fragmentation, with one of them being a near-complete genome only consisting of three contigs. Our analyses confirm that the eukaryote-like features previously associated with Lokiarchaeota are not the result of contamination or assembly artefacts, and can indeed be found in the newly reconstructed genomes.


Genes ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 220 ◽  
Author(s):  
Gherman Uritskiy ◽  
Jocelyne DiRuggiero

In the past decades, the study of microbial life through shotgun metagenomic sequencing has rapidly expanded our understanding of environmental, synthetic, and clinical microbial communities. Here, we review how shotgun metagenomics has affected the field of halophilic microbial ecology, including functional potential reconstruction, virus–host interactions, pathway selection, strain dispersal, and novel genome discoveries. However, there still remain pitfalls and limitations from conventional metagenomic analysis being applied to halophilic microbial communities. Deconvolution of halophilic metagenomes has been difficult due to the high G + C content of these microbiomes and their high intraspecific diversity, which has made both metagenomic assembly and binning a challenge. Halophiles are also underrepresented in public genome databases, which in turn slows progress. With this in mind, this review proposes experimental and analytical strategies to overcome the challenges specific to the halophilic microbiome, from experimental designs to data acquisition and the computational analysis of metagenomic sequences. Finally, we speculate about the potential applications of other next-generation sequencing technologies in halophilic communities. RNA sequencing, long-read technologies, and chromosome conformation assays, not initially intended for microbiomes, are becoming available in the study of microbial communities. Together with recent analytical advancements, these new methods and technologies have the potential to rapidly advance the field of halophile research.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhixing Feng ◽  
Jose C. Clemente ◽  
Brandon Wong ◽  
Eric E. Schadt

Abstract Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.


2020 ◽  
Author(s):  
Zhixing Feng ◽  
Jose Clemente ◽  
Brandon Wong ◽  
Eric E. Schadt

Abstract Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants, that is, determining whether multiple variants are from the same haplotype, play an instrumental role in deciphering cellular genetic heterogeneity, but are still difficult because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have provided an unprecedented opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥ 0.011%) from long-read metagenomic data. Our approach, therefore, presents a significant advance towards the complete deciphering of cellular genetic heterogeneity.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i3-i11
Author(s):  
Anuradha Wickramarachchi ◽  
Vijini Mallawaarachchi ◽  
Vaibhav Rajan ◽  
Yu Lin

Abstract Motivation Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. Results We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. Availability and implementation The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. Supplementary information Supplementary data are available at Bioinformatics online.
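One of the two signals the abstract names, oligonucleotide composition, can be illustrated with a minimal Python sketch. This is not MetaBCC-LR's implementation: the function names, the choice of k = 3, and the merging of each k-mer with its reverse complement are assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical helper names; not taken from the MetaBCC-LR source code.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def canonical(kmer):
    """Pick the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def composition_profile(read, k=3):
    """Normalized frequency vector of canonical k-mers in one read.

    k-mers containing characters other than A/C/G/T (e.g. N) are skipped.
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    # Fixed, sorted feature order so vectors from different reads are comparable.
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


profile = composition_profile("ACGTACGTTTGACCA")
print(len(profile))
```

Because each k-mer is merged with its reverse complement, a read and its reverse complement map to the same vector, so strand orientation does not affect a downstream clustering step. The k-mer coverage histogram, the second signal mentioned in the abstract, would additionally require k-mer counts over the whole dataset and is not sketched here.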


2019 ◽  
Author(s):  
Krithika Arumugam ◽  
Caner Bağci ◽  
Irina Bessarab ◽  
Sina Beier ◽  
Benjamin Buchfink ◽  
...  

Abstract Background Short-read sequencing technologies have long been the work-horse of microbiome analysis. Continuing technological advances are making the application of long-read sequencing to metagenomic samples increasingly feasible. Results We demonstrate that whole bacterial chromosomes can be obtained from a complex community by applying MinION sequencing to a sample from an EBPR bio-reactor, producing 6 Gb of sequence that assembles into multiple closed bacterial chromosomes. We provide a simple pipeline for processing such data, which includes a new approach to correcting erroneous frame-shifts. Conclusions Advances in long-read sequencing technology and corresponding algorithms will allow the routine extraction of whole chromosomes from environmental samples, providing a more detailed picture of individual members of a microbiome.

