scholarly journals Trycycler: consensus long-read assemblies for bacterial genomes

2021 ◽  
Author(s):  
Ryan R Wick ◽  
Louise M Judd ◽  
Louise T Cerdeira ◽  
Jane Hawkey ◽  
Guillaume Meric ◽  
...  

Assembly of bacterial genomes from long-read data (generated by Oxford Nanopore or Pacific Biosciences platforms) can often be complete: a single contig for each chromosome or plasmid in the genome. However, even complete bacterial genome assemblies constructed solely from long reads still contain a variety of errors, and different assemblies of the same genome often contain different errors. Here, we present Trycycler, a tool which produces a consensus assembly from multiple input assemblies of the same genome. Benchmarking using both simulated and real sequencing reads showed that Trycycler consensus assemblies contained fewer errors than any of those constructed with a single long-read assembler. Post-assembly polishing with Medaka and Pilon further reduced errors and yielded the most accurate genome assemblies in our study. As Trycycler can require human judgement and manual intervention, its output is not deterministic, and different users can produce different Trycycler assemblies from the same input data. However, we demonstrated that multiple users with minimal training converge on similar assemblies that are consistently more accurate than those produced by automated assembly tools. We therefore recommend Trycycler+Medaka+Pilon as an ideal approach for generating high-quality bacterial reference genomes.

2017 ◽  
Author(s):  
Ryan R. Wick ◽  
Louise M. Judd ◽  
Claire L. Gorrie ◽  
Kathryn E. Holt

AbstractIllumina sequencing platforms have enabled widespread bacterial whole genome sequencing. While Illumina data is appropriate for many analyses, its short read length limits its ability to resolve genomic structure. This has major implications for tracking the spread of mobile genetic elements, including those which carry antimicrobial resistance determinants. Fully resolving a bacterial genome requires long-read sequencing such as those generated by Oxford Nanopore Technologies (ONT) platforms. Here we describe our use of the ONT MinION to sequence 12 isolates of Klebsiella pneumoniae on a single flow cell. We assembled each genome using a combination of ONT reads and previously available Illumina reads, and little to no manual intervention was needed to achieve fully resolved assemblies using the Unicycler hybrid assembler. Assembling only ONT reads with Canu was less effective, resulting in fewer resolved genomes and higher error rates even following error correction with Nanopolish. We demonstrate that multiplexed ONT sequencing is a valuable tool for high-throughput bacterial genome finishing. Specifically, we advocate the use of Illumina sequencing as a first analysis step, followed by ONT reads as needed to resolve genomic structure.Data summarySequence read files for all 12 isolates have been deposited in SRA, accessible through these NCBI BioSample accession numbers: SAMEA3357010, SAMEA3357043, SAMN07211279, SAMN07211280, SAMEA3357223, SAMEA3357193, SAMEA3357346, SAMEA3357374, SAMEA3357320, SAMN07211281, SAMN07211282, SAMEA3357405.A full list of SRA run accession numbers (both Illumina reads and ONT reads) for these samples are available in Table S1.Assemblies and sequencing reads corresponding to each stage of processing and analysis are provided in the following figshare project: https://figshare.com/projects/Completing_bacterial_genome_assemblies_with_multiplex_MinION_sequencing/23068Source code is provided in the following public GitHub repositories: https://github.com/rrwick/Bacterial-genome-assemblies-with-multiplex-MinION-sequencinghttps://github.com/rrwick/Porechophttps://github.com/rrwick/Fast5-to-FastqImpact StatementLike many research and public health laboratories, we frequently perform large-scale bacterial comparative genomics studies using Illumina sequencing, which assays gene content and provides the high-confidence variant calls needed for phylogenomics and transmission studies. However, problems often arise with resolving genome assemblies, particularly around regions that matter most to our research, such as mobile genetic elements encoding antibiotic resistance or virulence genes. These complexities can often be resolved by long sequence reads generated with PacBio or Oxford Nanopore Technologies (ONT) platforms. While effective, this has proven difficult to scale, due to the relatively high costs of generating long reads and the manual intervention required for assembly. Here we demonstrate the use of barcoded ONT libraries sequenced in multiplex on a single ONT MinION flow cell, coupled with hybrid assembly using Unicycler, to resolve 12 large bacterial genomes. Minor manual intervention was required to fully resolve small plasmids in five isolates, which we found to be underrepresented in ONT data. Cost per sample for the ONT sequencing was equivalent to Illumina sequencing, and there is potential for significant savings by multiplexing more samples on the ONT run. This approach paves the way for high-throughput and cost-effective generation of completely resolved bacterial genomes to become widely accessible.


2021 ◽  
Author(s):  
Valentine Murigneux ◽  
Leah W. Roberts ◽  
Brian M. Forde ◽  
Minh-Duy Phan ◽  
Nguyen Thi Khanh Nhu ◽  
...  

AbstractOxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers; however, easy and automated construction of high-quality bacterial genomes remains challenging. Here we present MicroPIPE: a reproducible end-to-end bacterial genome assembly pipeline for ONT and Illumina sequencing. To construct MicroPIPE, we evaluated the performance of several tools for genome reconstruction and assessed overall genome accuracy using ONT both natively and with Illumina. Further validation of MicroPIPE was carried out using 11 sequence type (ST)131 Escherichia coli and eight publicly available Gram-negative and Gram-positive bacterial isolates. MicroPIPE uses Singularity containers and the workflow manager Nextflow and is available at https://github.com/BeatsonLab-MicrobialGenomics/micropipe.


2019 ◽  
Author(s):  
Nicola De Maio ◽  
Liam P. Shaw ◽  
Alasdair Hubbard ◽  
Sophie George ◽  
Nick Sanderson ◽  
...  

ABSTRACTIllumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods impact on assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or from SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the Enterobacteriaceae family, as these frequently have highly plastic, repetitive genetic structures and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies. Both strategies facilitate high-quality genome reconstruction. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.IMPACT STATEMENTIllumina short-read sequencing is frequently used for tasks in bacterial genomics, such as assessing which species are present within samples, checking if specific genes of interest are present within individual isolates, and reconstructing the evolutionary relationships between strains. However, while short-read sequencing can reveal significant detail about the genomic content of bacterial isolates, it is often insufficient for assessing genomic structure: how different genes are arranged within genomes, and particularly which genes are on plasmids – potentially highly mobile components of the genome frequently carrying antimicrobial resistance elements. This is because Illumina short reads are typically too short to span repetitive structures in the genome, making it impossible to accurately reconstruct these repetitive regions. One solution is to complement Illumina short reads with long reads generated with SMRT Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) sequencing platforms. Using this approach, called ‘hybrid assembly’, we show that we can automatically fully reconstruct complex bacterial genomes of Enterobacteriaceae isolates in the majority of cases (best-performing method: 17/20 isolates). In particular, by comparing different methods we find that using the assembler Unicycler with Illumina and ONT reads represents a low-cost, high-quality approach for reconstructing bacterial genomes using publicly available software.DATA SUMMARYRaw sequencing data and assemblies have been deposited in NCBI under BioProject Accession PRJNA422511 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA422511). We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Valentine Murigneux ◽  
Leah W. Roberts ◽  
Brian M. Forde ◽  
Minh-Duy Phan ◽  
Nguyen Thi Khanh Nhu ◽  
...  

Abstract Background Oxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers due to the accessibility and affordability of its devices. However, easy and automated construction of high-quality bacterial genomes using nanopore reads remains challenging. Here we aimed to create a reproducible end-to-end bacterial genome assembly pipeline using ONT in combination with Illumina sequencing. Results We evaluated the performance of several popular tools used during genome reconstruction, including base-calling, filtering, assembly, and polishing. We also assessed overall genome accuracy using ONT both natively and with Illumina. All steps were validated using the high-quality complete reference genome for the Escherichia coli sequence type (ST)131 strain EC958. Software chosen at each stage were incorporated into our final pipeline, MicroPIPE. Further validation of MicroPIPE was carried out using 11 additional ST131 E. coli isolates, which demonstrated that complete circularised chromosomes and plasmids could be achieved without manual intervention. Twelve publicly available Gram-negative and Gram-positive bacterial genomes (with available raw ONT data and matched complete genomes) were also assembled using MicroPIPE. We found that revised basecalling and updated assembly of the majority of these genomes resulted in improved accuracy compared to the current publicly available complete genomes. Conclusions MicroPIPE is built in modules using Singularity container images and the bioinformatics workflow manager Nextflow, allowing changes and adjustments to be made in response to future tool development. Overall, MicroPIPE provides an easy-access, end-to-end solution for attaining high-quality bacterial genomes. MicroPIPE is available at https://github.com/BeatsonLab-MicrobialGenomics/micropipe.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Jean-Marc Aury ◽  
Benjamin Istace

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ryan R. Wick ◽  
Louise M. Judd ◽  
Louise T. Cerdeira ◽  
Jane Hawkey ◽  
Guillaume Méric ◽  
...  

AbstractWhile long-read sequencing allows for the complete assembly of bacterial genomes, long-read assemblies contain a variety of errors. Here, we present Trycycler, a tool which produces a consensus assembly from multiple input assemblies of the same genome. Benchmarking showed that Trycycler assemblies contained fewer errors than assemblies constructed with a single tool. Post-assembly polishing further reduced errors and Trycycler+polishing assemblies were the most accurate genomes in our study. As Trycycler requires manual intervention, its output is not deterministic. However, we demonstrated that multiple users converge on similar assemblies that are consistently more accurate than those produced by automated assembly tools.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Caroline Belser ◽  
Franc-Christophe Baurens ◽  
Benjamin Noel ◽  
Guillaume Martin ◽  
Corinne Cruaud ◽  
...  

AbstractLong-read technologies hold the promise to obtain more complete genome assemblies and to make them easier. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to know the complete organization of chromosomes, which remained complicated before even when using genetic maps. However, generating a gapless and telomere-to-telomere assembly is still not trivial, and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long-reads. We generated a genome coverage of 177X from a single PromethION flowcell with near 17X with reads longer than 75 kbp. From the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.


2015 ◽  
Author(s):  
Rene L Warren ◽  
Benjamin P Vandervalk ◽  
Steven JM Jones ◽  
Inanc Birol

Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Established and emerging long read technologies show great promise in this regard, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they could be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a solution that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. Re-scaffolding the colossal white spruce assembly draft (PG29, 20 Gbp) and how LINKS scales to larger genomes is also presented. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. Availability: http://www.bcgsc.ca/bioinfo/software/links


2019 ◽  
Vol 35 (21) ◽  
pp. 4239-4246 ◽  
Author(s):  
Pierre Marijon ◽  
Rayan Chikhi ◽  
Jean-Stéphane Varré

Abstract Motivation Long-read genome assembly tools are expected to reconstruct bacterial genomes nearly perfectly; however, they still produce fragmented assemblies in some cases. It would be beneficial to understand whether these cases are intrinsically impossible to resolve, or if assemblers are at fault, implying that genomes could be refined or even finished with little to no additional experimental cost. Results We propose a set of computational techniques to assist inspection of fragmented bacterial genome assemblies, through careful analysis of assembly graphs. By finding paths of overlapping raw reads between pairs of contigs, we recover potential short-range connections between contigs that were lost during the assembly process. We show that our procedure recovers 45% of missing contig adjacencies in fragmented Canu assemblies, on samples from the NCTC bacterial sequencing project. We also observe that a simple procedure based on enumerating weighted Hamiltonian cycles can suggest likely contig orderings. In our tests, the correct contig order is ranked first in half of the cases and within the top-three predictions in nearly all evaluated cases, providing a direction for finishing fragmented long-read assemblies. Availability and implementation https://gitlab.inria.fr/pmarijon/knot . Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Michael Schmid ◽  
Daniel Frei ◽  
Andrea Patrignani ◽  
Ralph Schlapbach ◽  
Jürg E. Frey ◽  
...  

AbstractGenerating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, we here show that Pseudomonas koreensis P19E3 harbors multiple, near identical repeat pairs up to 70 kilobase pairs in length. Beyond long repeats, the P19E3 assembly was further complicated by a shufflon region. Its complex genome could not be de novo assembled with long reads produced by Pacific Biosciences’ technology, but required very long reads from the Oxford Nanopore Technology. Another important factor for a full genomic resolution was the choice of assembly algorithm.Importantly, a repeat analysis indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this dark matter for de novo genome assembly of prokaryotes. Several of these dark matter genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors, other genomes were closed employing labor-intense steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assemblers capable of resolving long, near identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.


Sign in / Sign up

Export Citation Format

Share Document