Assembly methods for nanopore-based metagenomic sequencing: a comparative study

2019 ◽  
Author(s):  
Adriel Latorre-Pérez ◽  
Pascual Villalba-Bermell ◽  
Javier Pascual ◽  
Manuel Porcar ◽  
Cristina Vilanova

ABSTRACT. Background: Metagenomic sequencing has led to the recovery of previously unexplored microbial genomes. However, short-read sequencing platforms often produce highly fragmented metagenomes, complicating downstream analyses. Third-generation sequencing technologies, such as MinION, could lead to more contiguous assemblies due to their ability to generate long reads. Nevertheless, there is a lack of studies evaluating the suitability of the available assembly tools for this new type of data. Findings: We benchmarked the ability of different short-read and long-read tools to assemble two different commercially available mock communities, and observed remarkable differences in the resulting assemblies depending on the software of choice. Short-read metagenomic assemblers proved unsuitable for MinION data. Among the long-read assemblers tested, Flye and Canu were the only ones that performed well on all the datasets. These tools were able to retrieve complete individual genomes directly from the metagenome, and assembled a bacterial genome in only two contigs in the best scenario. Despite the intrinsically high error rate of long-read technologies, Canu and Flye led to highly accurate assemblies (~99.4-99.8% accuracy). However, errors still had an impact on the prediction of biosynthetic gene clusters. Conclusions: MinION metagenomic sequencing data proved sufficient for assembling low-complexity microbial communities, leading to the recovery of highly complete and contiguous individual genomes. This work is the first systematic evaluation of the performance of different assembly tools on MinION data, and may help other researchers willing to use this technology to choose the most appropriate software depending on their goals. Future work is still needed to assess the performance of Oxford Nanopore MinION data on more complex microbiomes.

2019 ◽  
Author(s):  
Nicola De Maio ◽  
Liam P. Shaw ◽  
Alasdair Hubbard ◽  
Sophie George ◽  
Nick Sanderson ◽  
...  

ABSTRACT. Illumina sequencing allows rapid, cheap and accurate whole-genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods affect assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures, and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.
IMPACT STATEMENT. Illumina short-read sequencing is frequently used for tasks in bacterial genomics, such as assessing which species are present within samples, checking if specific genes of interest are present within individual isolates, and reconstructing the evolutionary relationships between strains. However, while short-read sequencing can reveal significant detail about the genomic content of bacterial isolates, it is often insufficient for assessing genomic structure: how different genes are arranged within genomes, and particularly which genes are on plasmids, which are potentially highly mobile components of the genome that frequently carry antimicrobial resistance elements. This is because Illumina short reads are typically too short to span repetitive structures in the genome, making it impossible to accurately reconstruct these repetitive regions. One solution is to complement Illumina short reads with long reads generated with SMRT Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) sequencing platforms. Using this approach, called ‘hybrid assembly’, we show that we can automatically and fully reconstruct complex bacterial genomes of Enterobacteriaceae isolates in the majority of cases (best-performing method: 17/20 isolates). In particular, by comparing different methods we find that using the assembler Unicycler with Illumina and ONT reads represents a low-cost, high-quality approach for reconstructing bacterial genomes using publicly available software.
DATA SUMMARY. Raw sequencing data and assemblies have been deposited in NCBI under BioProject Accession PRJNA422511 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422511). We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.


2021 ◽  
Vol 12 ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
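As a rough illustration of how the precision and recall of an SV callset could be computed against a truth set, the Python sketch below matches calls by chromosome, type, and breakpoint proximity. The tuple layout, the 500 bp matching window, and the function names are assumptions made for this example; they are not taken from the EViNCe pipeline.

```python
from typing import List, Tuple

# An SV is represented as (chromosome, start, sv_type); this layout is hypothetical.
SV = Tuple[str, int, str]

def matches(call: SV, truth: SV, max_dist: int = 500) -> bool:
    """Assumed matching rule: same chromosome and type, starts within max_dist bp."""
    return (call[0] == truth[0] and call[2] == truth[2]
            and abs(call[1] - truth[1]) <= max_dist)

def precision_recall(calls: List[SV], truth: List[SV], max_dist: int = 500):
    # A call counts as a true positive if it matches any truth SV.
    tp_calls = sum(any(matches(c, t, max_dist) for t in truth) for c in calls)
    # A truth SV counts as recovered if any call matches it.
    found_truth = sum(any(matches(c, t, max_dist) for c in calls) for t in truth)
    precision = tp_calls / len(calls) if calls else 0.0
    recall = found_truth / len(truth) if truth else 0.0
    return precision, recall

# Toy data, invented for the example.
truth = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS"), ("chr2", 300, "DEL")]
calls = [("chr1", 1100, "DEL"), ("chr1", 9000, "DEL")]
precision, recall = precision_recall(calls, truth)
print(precision, recall)
```

Real benchmarks usually also compare SV length and genotype when matching; the looser rule here only keeps the example short.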


2020 ◽  
Author(s):  
Nicolas Dierckxsens ◽  
Tong Li ◽  
Joris R. Vermeesch ◽  
Zhi Xie

ABSTRACT. Despite the rapid evolution of new sequencing technologies, structural variation detection remains poorly ascertained. The high discrepancy between the results of structural variant analysis programs makes it difficult to assess their performance on real datasets. Accurate simulations of structural variation distributions and sequencing data of the human genome are crucial for the development and benchmarking of new tools. In order to gain better insight into the detection of structural variation with long sequencing reads, we created a realistic simulated model to thoroughly compare SV detection methods and the impact of the chosen sequencing technology and sequencing depth. To achieve this, we developed Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it revealed the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. Our findings were also supported by the latest structural variation benchmark set developed by the GIAB Consortium. With these findings, we developed a new method (combiSV) that can combine the results from five different SV callers into a superior call set with increased recall and precision. Both Sim-it and combiSV are open source and can be downloaded at https://github.com/ndierckx/.
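The idea of merging several callers' output into one call set can be sketched with a simple majority vote. This is not combiSV's algorithm, which the abstract does not describe; the position binning, the `min_support` threshold, and the caller names below are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SV = Tuple[str, int, str]  # (chromosome, start, sv_type); hypothetical layout

def combine_callsets(callsets: Dict[str, List[SV]], min_support: int = 2,
                     bin_size: int = 500) -> List[SV]:
    """Keep SVs reported by at least min_support callers.

    Calls are grouped by chromosome, type, and a coarse position bin. This is
    a simplification: two nearby calls that straddle a bin edge are not merged.
    """
    support = defaultdict(set)   # bin key -> names of callers reporting it
    representative = {}          # bin key -> first call seen in that bin
    for caller, calls in callsets.items():
        for chrom, start, sv_type in calls:
            key = (chrom, start // bin_size, sv_type)
            support[key].add(caller)
            representative.setdefault(key, (chrom, start, sv_type))
    return sorted(representative[key] for key, callers in support.items()
                  if len(callers) >= min_support)

# Toy callsets, invented for the example.
callsets = {
    "caller_a": [("chr1", 1010, "DEL"), ("chr2", 7000, "INS")],
    "caller_b": [("chr1", 1050, "DEL")],
    "caller_c": [("chr1", 1200, "DEL"), ("chr3", 40, "DUP")],
}
consensus = combine_callsets(callsets)
print(consensus)
```

Raising `min_support` trades recall for precision, which is the tension any consensus approach has to balance.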


2019 ◽  
Author(s):  
Lolita Lecompte ◽  
Pierre Peterlongo ◽  
Dominique Lavenier ◽  
Claire Lemaitre

Abstract. Motivation: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short-read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long-read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results: We present a novel method to genotype known SVs from long-read sequencing data. The method is based on the generation of a set of reference sequences that represent the two alleles of each structural variant. Long reads are aligned to these reference sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype insertions and deletions with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short-read SV genotyping approaches. Availability: https://github.com/llecompte/[email protected]


2018 ◽  
Author(s):  
Jana Ebler ◽  
Marina Haukness ◽  
Trevor Pesout ◽  
Tobias Marschall ◽  
Benedict Paten

Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants. Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be accurately genotyped. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
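To make the forward-backward step concrete, here is a minimal Python sketch of the algorithm on a generic two-state toy HMM. It is not the authors' model, whose hidden states are bipartitions of reads; the states, probabilities, and observations below are invented for illustration, and no scaling is applied, so it is only suitable for short sequences.

```python
from typing import List, Sequence

def forward_backward(init: Sequence[float],
                     trans: Sequence[Sequence[float]],
                     emit: Sequence[Sequence[float]],
                     obs: Sequence[int]) -> List[List[float]]:
    """Return P(state at position t | all observations) for every position t."""
    n_states, n_obs = len(init), len(obs)

    # Forward pass: fwd[t][s] = P(obs[0..t], state_t = s)
    fwd = [[init[s] * emit[s][obs[0]] for s in range(n_states)]]
    for t in range(1, n_obs):
        fwd.append([
            emit[s][obs[t]] * sum(fwd[t - 1][r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ])

    # Backward pass: bwd[t][s] = P(obs[t+1..] | state_t = s)
    bwd = [[1.0] * n_states for _ in range(n_obs)]
    for t in range(n_obs - 2, -1, -1):
        for s in range(n_states):
            bwd[t][s] = sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                            for r in range(n_states))

    # Combine both passes and normalise at each position.
    posteriors = []
    for t in range(n_obs):
        joint = [fwd[t][s] * bwd[t][s] for s in range(n_states)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors

# Toy parameters (assumptions): two hidden states, two observation symbols.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
posteriors = forward_backward(init, trans, emit, obs=[0, 0, 1])

# Pick the most likely state at each position, as the abstract describes for genotypes.
best_states = [max(range(len(p)), key=lambda s: p[s]) for p in posteriors]
print(posteriors, best_states)
```

Picking the per-position maximum of the posterior, rather than the single best path, is what distinguishes this from Viterbi decoding.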


2020 ◽  
Vol 36 (17) ◽  
pp. 4568-4575
Author(s):  
Lolita Lecompte ◽  
Pierre Peterlongo ◽  
Dominique Lavenier ◽  
Claire Lemaitre

Abstract. Motivation: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short-read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long-read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results: We present a novel method to genotype known SVs from long-read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long-read genotyping tools, and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short-read SV genotyping approaches. Availability and implementation: https://github.com/llecompte/SVJedi.git. Supplementary information: Supplementary data are available at Bioinformatics online.
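Once informative alignments have been counted for each allele, a genotype can be estimated from the two counts. The sketch below uses a plain binomial likelihood with an assumed error rate; it is an illustration of the general idea, not SVJedi's actual model, and the function names and the `error` parameter are hypothetical. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb
from typing import Dict

def genotype_likelihoods(ref_reads: int, alt_reads: int,
                         error: float = 0.05) -> Dict[str, float]:
    """Binomial likelihood of the observed counts under genotypes 0/0, 0/1, 1/1.

    `error` is an assumed probability that a read supports the wrong allele.
    """
    n = ref_reads + alt_reads
    # Expected fraction of reads supporting the alternative (SV) allele.
    alt_fraction = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {
        gt: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for gt, p in alt_fraction.items()
    }

def call_genotype(ref_reads: int, alt_reads: int, error: float = 0.05) -> str:
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error)
    return max(likelihoods, key=likelihoods.get)

# Toy read counts, invented for the example.
print(call_genotype(10, 0))   # mostly reference-supporting reads
print(call_genotype(6, 5))    # balanced support
print(call_genotype(0, 12))   # mostly SV-supporting reads
```

With no informative reads at all, every genotype has the same likelihood, so a real tool would report the site as ungenotyped rather than return a call.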


2018 ◽  
Author(s):  
Stephen M. J. Pollo ◽  
Sarah J. Reiling ◽  
Janneke Wit ◽  
Matthew L. Workentine ◽  
Rebecca A. Guy ◽  
...  

Abstract. Background: Genomes of the parasite Giardia duodenalis are relatively small for eukaryotic genomes, yet there are only six publicly available. Difficulties in assembling the tetraploid G. duodenalis genome from short-read sequencing data likely contribute to this lack of genomic information. We sequenced three isolates of G. duodenalis (AWB, BGS, and beaver) on the Oxford Nanopore Technologies MinION, whose long reads have the potential to address genomic areas that are problematic for short reads. Results: Using a hybrid approach that combines MinION long reads and Illumina short reads to take advantage of the continuity of the long reads and the accuracy of the short reads, we generated reference-quality genomes for each isolate. The genomes for two of the isolates were evaluated against the available reference genomes for comparison. The third genome, for which there is no previous data, was then assembled. The long reads were used to find structural variants in each isolate to examine heterozygosity. Consistent with previous findings based on SNPs, Giardia BGS was found to be considerably more heterozygous than the other isolates, which are from Assemblage A. We also find an enrichment of variant-specific surface proteins in some of the structural variant regions. Conclusions: Our results show that the MinION can be used to generate reference-quality genomes in Giardia and further be used to identify structural variant regions that are an important source of genetic variation not previously examined in these parasites.


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Shannon N. Fenlon ◽  
Yuemin Celina Chee ◽  
Jacqueline Lai Yuen Chee ◽  
Yeen Hui Choy ◽  
Alexis Jiaying Khng ◽  
...  

Abstract. Objectives: The availability of matched sequencing data for the same sample across different sequencing platforms is a necessity for validation and effective comparison of sequencing platforms. A commonly sequenced sample is the lab-adapted MG1655 strain of Escherichia coli; however, this strain is not fully representative of the more complex and dynamic genomes of pathogenic E. coli strains. Data description: We present six new sequencing data sets for another E. coli strain, UTI89, which is an extraintestinal pathogenic strain isolated from a patient suffering from a urinary tract infection. We now provide matched whole-genome sequencing data generated using the PacBio RSII, Oxford Nanopore MinION R9.4, Ion Torrent, ABI SOLiD, and Illumina NextSeq sequencers. Together with other publicly available datasets, UTI89 has a nearly complete suite of data generated on most second- and third-generation sequencers. These data can be used as an additional validation set for new sequencing technologies and analytical methods. More than being another E. coli strain, however, UTI89 is pathogenic, with a 10% larger genome, additional pathogenicity islands, and a large plasmid, features that are common among other naturally occurring and disease-causing E. coli isolates. These data therefore provide a more medically relevant test set for development of algorithms.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
David Pellow ◽  
Alvah Zorea ◽  
Maraike Probst ◽  
Ori Furman ◽  
Arik Segal ◽  
...  

Abstract. Background: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is a lack of computational tools enabling the analysis of plasmids in metagenomic samples. Results: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets. Conclusions: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated, such as a human gut plasmidome. SCAPP is open-source software available from: https://github.com/Shamir-Lab/SCAPP.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Jean-Marc Aury ◽  
Benjamin Istace

Abstract. Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions, where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes, and they typically switch from one haplotype to another. Here we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

