New approaches for metagenome assembly with short reads

Martin Ayling; Matthew D Clark; Richard M Leggett

doi:10.1093/bib/bbz020

New approaches for assembly of short-read metagenomic data

10.7287/peerj.preprints.27332 ◽

2018 ◽

Author(s):

Martin Ayling ◽

Matthew D Clark ◽

Richard M Leggett

Keyword(s):

Genome Assembly ◽

Metagenomic Data ◽

Short Read ◽

New Approaches ◽

Single Genome ◽

New Type ◽

Multiple Genomes ◽

Assembly Algorithms ◽

Genome Assemblies

In recent years, the use of longer-range read data combined with advances in assembly algorithms has stimulated big improvements in the contiguity and quality of genome assemblies. However, these advances have not directly transferred to metagenomic datasets, as assumptions made by the single genome assembly algorithms do not apply when assembling multiple genomes at varying levels of abundance. The development of dedicated assemblers for metagenomic data was a relatively late innovation and for many years, researchers had to make do using tools designed for single genomes. This has changed in the last few years and we have seen the emergence of a new type of tool built using different principles. In this review, we describe the challenges inherent in metagenomic assemblies and compare the different approaches taken by these novel assembly tools.



Significantly improving the quality of genome assemblies through curation

10.1101/2020.08.12.247734 ◽

2020 ◽

Cited By ~ 2

Author(s):

Kerstin Howe ◽

William Chow ◽

Joanna Collins ◽

Sarah Pelan ◽

Damon-Lee Pointon ◽

...

Keyword(s):

Data Sets ◽

Data Generation ◽

Research Projects ◽

Automated Assembly ◽

Assembly Quality ◽

Assembly Strategy ◽

Assembly Evaluation ◽

Assembly Algorithms ◽

Genome Assemblies

Abstract Background Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Results Whilst working towards improved data sets and fully automated pipelines, assembly evaluation and curation is actively employed to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. Conclusions We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.


Significantly improving the quality of genome assemblies through curation

GigaScience ◽

10.1093/gigascience/giaa153 ◽

2021 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Kerstin Howe ◽

William Chow ◽

Joanna Collins ◽

Sarah Pelan ◽

Damon-Lee Pointon ◽

...

Keyword(s):

Genome Assembly ◽

Data Generation ◽

Research Projects ◽

Automated Assembly ◽

Assembly Quality ◽

Assembly Strategy ◽

Assembly Evaluation ◽

Assembly Algorithms ◽

Genome Assemblies

Abstract Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.


dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies

BMC Genomics ◽

10.1186/s12864-019-6070-x ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Gokhan Yavas ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Quality Assessment ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Quality Score ◽

De Novo Genome Assembly ◽

Genome Assemblies ◽

Reference Genomes ◽

Better Than

Abstract Background Accurate de novo genome assembly has become reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly. Results To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET) that generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and increasing average contig length and coverage, as expected. For genome builds, dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies. 
Conclusions The dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. The dnAQET can help researchers to identify the most suitable assembly tools and to select high quality assemblies generated.


Assessment the Quality of Genome Assemblies by using QUAST Tool for Metagenomics

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6435.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 4253-4259

Keyword(s):

Genome Sequencing ◽

Reference Genome ◽

Assessment Tool ◽

Quality Assessment Tool ◽

Assembly Evaluation ◽

Assembly Algorithms ◽

Genome Assemblies ◽

Modern Tool ◽

Assembly Software

A number of assembly algorithms have emerged, but because of the constraints of genome sequencing techniques none is perfect. Various methods for comparing assemblers have been developed, but none is yet a recognized standard. The problem of evaluating assemblies of previously unsequenced species has not been considered, because most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes. For comparing and evaluating genome assemblies we have used QUAST (Quality Assessment Tool). This tool is used to assess the quality of leading assembly software by evaluating quality metrics. QUAST can evaluate assemblies both with and without a reference genome, and it is a modern tool for genome assembly evaluation based on alignment of contigs to a reference. In this study we demonstrate QUAST performance by comparing several leading genome assemblers on three metagenomic datasets.


Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
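
The figure of 120 in the abstract above is consistent with every possible ordering of five assemblies (5! = 120), assuming that the order in which assemblies are merged matters. That reading is an assumption, as the abstract does not spell it out, and the labels below are hypothetical placeholders rather than the actual Assemblathon entries. A minimal Python sketch of the count:

```python
from itertools import permutations
from math import factorial

# Hypothetical labels standing in for the top five assemblies.
assemblies = ["A", "B", "C", "D", "E"]

# Every ordering in which the five assemblies could be merged,
# assuming merge order matters.
orders = list(permutations(assemblies))

print(len(orders))                   # number of orderings
print(len(orders) == factorial(5))   # agrees with 5!
```

Enumerating the orderings explicitly, rather than just computing 5!, gives a list that could be iterated over to drive one merge run per ordering.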


Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity

10.1101/462788 ◽

2018 ◽

Cited By ~ 5

Author(s):

C. Titus Brown ◽

Dominik Moritz ◽

Michael P. O’Brien ◽

Felix Reidl ◽

Taylor Reiter ◽

...

Keyword(s):

Sequence Variation ◽

Retrieval System ◽

Genomic Sequence ◽

Information Retrieval System ◽

Software Implementation ◽

Metagenomic Data ◽

Data Sets ◽

Dna Assembly ◽

Strain Variation ◽

Metagenome Assembly

Genomes computationally inferred from large metagenomic data sets are often incomplete and may be missing functionally important content and strain variation. We introduce an information retrieval system for large metagenomic data sets that exploits the sparsity of DNA assembly graphs to efficiently extract subgraphs surrounding an inferred genome. We apply this system to recover missing content from genome bins and show that substantial genomic sequence variation is present in a real metagenome. Our software implementation is available at https://github.com/spacegraphcats/spacegraphcats under the 3-Clause BSD License.


A comprehensive investigation of metagenome assembly by linked-read sequencing

Microbiome ◽

10.1186/s40168-020-00929-3 ◽

2020 ◽

Vol 8 (1) ◽

Author(s):

Lu Zhang ◽

Xiaodong Fang ◽

Herui Liao ◽

Zhenmiao Zhang ◽

Xin Zhou ◽

...

Keyword(s):

Genome Assembly ◽

Simulated Data ◽

Read Depth ◽

Marginal Effect ◽

Assembly Quality ◽

Microbial Genomes ◽

Long Reads ◽

Metagenome Assembly ◽

Dna Fragment

Abstract Background The human microbiota are complex systems with important roles in our physiological activities and diseases. Sequencing the microbial genomes in the microbiota can help in our interpretation of their activities. The vast majority of the microbes in the microbiota cannot be isolated for individual sequencing. Current metagenomics practices use short-read sequencing to simultaneously sequence a mixture of microbial genomes. However, these results are in ambiguity during genome assembly, leading to unsatisfactory microbial genome completeness and contig continuity. Linked-read sequencing is able to remove some of these ambiguities by attaching the same barcode to the reads from a long DNA fragment (10–100 kb), thus improving metagenome assembly. However, it is not clear how the choices for several parameters in the use of linked-read sequencing affect the assembly quality. Results We first examined the effects of read depth (C) on metagenome assembly from linked-reads in simulated data and a mock community. The results showed that C positively correlated with the length of assembled sequences but had little effect on their qualities. The latter observation was corroborated by tests using real data from the human gut microbiome, where C demonstrated minor impact on the sequence quality as well as on the proportion of bins annotated as draft genomes. On the other hand, metagenome assembly quality was susceptible to read depth per fragment (CR) and DNA fragment physical depth (CF). For the same C, deeper CR resulted in more draft genomes while deeper CF improved the quality of the draft genomes. We also found that average fragment length (μFL) had marginal effect on assemblies, while fragments per partition (NF/P) impacted the off-target reads involved in local assembly, namely, lower NF/P values would lead to better assemblies by reducing the ambiguities of the off-target reads. 
In general, the use of linked-reads improved the assembly for contig N50 when compared to Illumina short-reads, but not when compared to PacBio CCS (circular consensus sequencing) long-reads. Conclusions We investigated the influence of linked-read sequencing parameters on metagenome assembly comprehensively. While the quality of genome assembly from linked-reads cannot rival that from PacBio CCS long-reads, the case for using linked-read sequencing remains persuasive due to its low cost and high base-quality. Our study revealed that the probable best practice in using linked-reads for metagenome assembly was to merge the linked-reads from multiple libraries, where each had sufficient CR but a smaller amount of input DNA.
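
The abstract above compares technologies by contig N50. As a reminder of what that metric measures, here is a minimal Python sketch using the common definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running total
        # to at least half of the assembly length.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (total 290, half is 145).
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70, since 80 + 70 = 150 >= 145
```

N50 rewards contiguity only; it says nothing about correctness, which is why the abstracts in this listing pair it with misassembly counts, completeness, or reference-based scores.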


Parameter exploration improves the accuracy of long-read genome assembly

10.1101/2021.05.28.446135 ◽

2021 ◽

Author(s):

Anurag Priyam ◽

Alicja Witwicka ◽

Anindita Brahma ◽

Eckart Stolle ◽

Yannick Wurm

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Error Rates ◽

Fine Tuning ◽

Sequencing Error ◽

High Quality ◽

Long Read ◽

Genome Assemblies ◽

Error Profiles

Long-molecule sequencing is now routinely applied to generate high-quality reference genome assemblies. However, datasets differ in repeat composition, heterozygosity, read lengths and error profiles. The assembly parameters that provide the best results could thus differ across datasets. By integrating four complementary and biologically meaningful metrics, we show that simple fine-tuning of assembly parameters can substantially improve the quality of long-read genome assemblies. In particular, modifying estimates of sequencing error rates improves some metrics more than two-fold. We provide a flexible software, CompareGenomeQualities, that automates comparisons of assembly qualities for researchers wanting a straightforward mechanism for choosing among multiple assemblies.
