Assembly Algorithms: Latest Research Papers

Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the cost of generating short reads from the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to handle the increasing amount of sequence data. We make a thorough comparison of de novo assembly algorithms so that new users can clearly understand the main approaches: overlap-layout-consensus, de Bruijn graph, string-graph-based assembly, and hybrid approaches. We also address the computational efficiency of each algorithm, the challenges faced by the assembly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without a reference genome. We hope that this analysis will help further the application of de novo sequencing and support the future development of assembly algorithms.
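To make the overlap-layout-consensus idea named in the abstract above more concrete, here is a minimal Python sketch of its overlap and layout steps only. It assumes error-free reads from a single strand and merges them greedily by exact suffix-prefix overlap. The function names `longest_overlap` and `greedy_merge` and the example reads are hypothetical illustrations and are not taken from the paper or from any assembler.

```python
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap.

    Assumption: reads are error-free and come from one strand, which real
    sequencing data does not guarantee.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = longest_overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    print(greedy_merge(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

A greedy merge like this can collapse repeats incorrectly, which is one reason the impact of repeats is singled out as a challenge; production assemblers use far more careful layout and consensus stages.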

Widespread false gene gains caused by duplication errors in genome assemblies

10.1101/2021.04.09.438957 ◽ 2021

Author(s): Byung June Ko ◽ Chul Lee ◽ Juwan Kim ◽ Arang Rhie ◽ DongAhn Yoo ◽ ...

Keyword(s): Gene Family ◽ Zebra Finch ◽ Whole Genome ◽ Sequencing Errors ◽ A Minor ◽ Genomic Regions ◽ Assembly Algorithms ◽ Genome Assemblies ◽ Sequence Errors

Abstract False duplications in genome assemblies lead to false biological conclusions. We quantified false duplications in previous genome assemblies and their new counterparts of the same species (platypus, zebra finch, Anna’s hummingbird) generated by the Vertebrate Genomes Project (VGP). Whole genome alignments revealed that 4 to 16% of the sequences were falsely duplicated in the previous assemblies, impacting hundreds to thousands of genes. These led to overestimated gene family expansions. The main source of the false duplications was heterotype duplications, where the haplotype sequences were more divergent than other parts of the genome, leading the assembly algorithms to classify them as separate genes or genomic regions. A minor source was sequencing errors. Although present in a smaller proportion, we observed false duplications remaining in the VGP assemblies that can be identified and purged. This study highlights the need for more advanced assembly methods that better separate haplotypes and sequence errors, and the need for cautious analyses of gene gains.

False gene and chromosome losses affected by assembly and sequence errors

10.1101/2021.04.09.438906 ◽ 2021

Author(s): Juwan Kim ◽ Chul Lee ◽ Byung June Ko ◽ DongAhn Yoo ◽ Sohyoung Won ◽ ...

Keyword(s): Genomic Sequence ◽ Protein Coding ◽ Manual Curation ◽ Genome Wide ◽ Long Reads ◽ High Gene ◽ Assembly Algorithms ◽ Genome Assemblies ◽ Regulatory Landscapes ◽ High Gene Density

Many genome assemblies have been found to be incomplete and to contain mis-assemblies. The Vertebrate Genomes Project (VGP) has been producing assemblies with an emphasis on being as complete and error-free as possible, utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. Here we evaluate these new vertebrate genome assemblies relative to the previous references for the same species, including a mammal (platypus), two birds (zebra finch, Anna's hummingbird), and a fish (climbing perch). We found that 3 to 11% of genomic sequence was entirely missing in the previous reference assemblies, which included nearly entire GC-rich and repeat-rich microchromosomes with high gene density. Genome-wide, between 25 and 60% of the genes were either completely or partially missing in the previous assemblies, and this was in part due to a bias in GC-rich 5'-proximal promoters and 5' exon regions. Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the VGP assemblies.
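The abstract above attributes much of the missing sequence to GC-rich regions. As a rough illustration of how GC richness is usually quantified, the following Python sketch computes the GC fraction of a sequence and of fixed-size windows along it. The function names, the window and step sizes, and the choice to ignore non-ACGT characters such as N are assumptions made for this example, not details from the study.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of `seq`.

    Assumption: ambiguous bases such as N are ignored; a sequence with
    no A/C/G/T bases is reported as 0.0.
    """
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq: str, window: int = 4, step: int = 4) -> list[tuple[int, float]]:
    """GC fraction of each full window, as (start position, fraction) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    for start, gc in gc_windows("GGCCAATTGCGC", window=4, step=4):
        print(start, gc)
```

In practice the windows would be kilobases wide rather than four bases; the small defaults are only there to keep the example readable.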

Benchmarking of long-read assemblers for prokaryote whole genome sequencing

F1000Research ◽ 10.12688/f1000research.21782.4 ◽ 2021 ◽ Vol 8 ◽ pp. 2138

Author(s): Ryan R. Wick ◽ Kathryn E. Holt

Keyword(s): Data Sets ◽ Computationally Efficient ◽ Oxford Nanopore ◽ Long Read ◽ Sequencing Platforms ◽ Computational Resources ◽ Assembly Algorithms ◽ Oxford Nanopore Technologies ◽ Sequence Errors ◽ Multiple Assembly

Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled – one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of eight long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, NextDenovo/NextPolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v2.1 produced reliable assemblies and was good with plasmids, but it performed poorly with circularisation and had the longest runtimes of all assemblers tested. Flye v2.8 was also reliable and made the smallest sequence errors, though it used the most RAM. Miniasm/Minipolish v0.3/v0.1.3 was the most likely to produce clean contig circularisation. NECAT v20200803 was reliable and good at circularisation but tended to make larger sequence errors. NextDenovo/NextPolish v2.3.1/v1.3.1 was reliable with chromosome assembly but bad with plasmid assembly. Raven v1.3.0 was reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.7.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish, NextDenovo/NextPolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
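Contig circularisation, one of the assessment criteria listed above, concerns whether a circular replicon is reported exactly once, with no bases duplicated or missing where its two ends meet. The abstract does not describe how the benchmark measured this, so the Python sketch below is only an illustration under simplifying assumptions: exact matches, a single strand, and hypothetical function names. It shows two related checks: whether a contig is a rotation of a reference circle, and how a duplicated terminal overlap could be trimmed.

```python
def same_circular_sequence(contig: str, reference: str) -> bool:
    """True if `contig` is a rotation of `reference`.

    A circular molecule has no fixed start, so two linear strings describe
    the same circle when one appears inside the other written twice.
    Assumption: exact match on one strand (no reverse complement, no errors).
    """
    return len(contig) == len(reference) and contig in reference + reference


def trim_terminal_overlap(contig: str, min_len: int = 3) -> str:
    """Drop a duplicated end if the last n bases repeat the first n bases."""
    for n in range(len(contig) // 2, min_len - 1, -1):
        if contig[:n] == contig[-n:]:
            return contig[:-n]
    return contig


if __name__ == "__main__":
    print(same_circular_sequence("GTACA", "ACAGT"))
    print(trim_terminal_overlap("ACGTTGCAACG"))
```

Real long-read contigs contain errors, so a practical check would rely on alignment rather than exact string comparison.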

Gene Sequence Assembly Algorithm Model Based on the DBG Strategy and Its Application

Journal of Healthcare Engineering ◽ 10.1155/2021/6676194 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-11

Author(s): Haihe Shi ◽ Gang Wu

Keyword(s): Gene Sequence ◽ Sequence Assembly ◽ Feature Model ◽ De Bruijn Graph ◽ Specific Sequence ◽ Component Library ◽ Assembly Algorithm ◽ Assembly Algorithms ◽ Abstract Algorithm ◽ Domain Level

With the continuous development of sequencing technology, the amount of bioinformatics data has increased geometrically, and this massive amount of data places more stringent requirements on sequence assembly. The sequence assembly algorithm based on the DBG (De Bruijn graph) strategy is a key algorithm in bioinformatics and is widely used in the domain of gene sequence assembly. Current research in the domain of sequence assembly mostly focuses on optimizing specific steps of a specific algorithm, and research on domain-level, highly abstract algorithm frameworks is lacking. To some extent, this leads to redundancy among sequence assembly algorithms, and some problems may be caused by manual algorithm selection. This paper analyzes the domain of DBGSA and establishes a feature model of this domain. Based on the production programming method, the DBGSA algorithm components are interactively designed. With the support of the PAR platform, the DBGSA algorithm component library is formally implemented, and the component library is then used to assemble specific algorithms. This research adds domain-level research to the domain of sequence assembly and implements the DBGSA component library, which can assemble specific sequence assembly algorithms, ensuring the efficiency of algorithm development and the reliability of the generated assembly algorithms. At the same time, it also provides a valuable reference for solving problems in the domain of sequence assembly.
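For readers unfamiliar with the DBG strategy mentioned above, the following Python sketch shows the basic idea in its simplest textbook form: every distinct k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a contig is read off by following nodes that have a single outgoing edge. This is not the DBGSA component library or the PAR platform described in the paper; the function names and the example reads are hypothetical, and error correction, reverse complements, and repeat resolution are deliberately left out.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes of its k-mers."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph: dict[str, list[str]] = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the output deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unique_path(graph: dict[str, list[str]], start: str) -> str:
    """Extend a contig from `start` while the current node has one successor.

    The walk stops at a branch, a dead end, or a node already visited.
    """
    path, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:
            break
        seen.add(node)
        path += node[-1]
    return path


if __name__ == "__main__":
    g = build_de_bruijn(["ATGGC", "GGCGT"], k=3)
    print(g)
    print(walk_unique_path(g, "AT"))
```

The choice of k is the main design trade-off in this family of methods: a small k merges reads easily but tangles the graph at repeats, while a large k separates repeats but needs longer error-free overlaps.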

Building pan-genome infrastructures for crop plants and their use in association genetics

DNA Research ◽ 10.1093/dnares/dsaa030 ◽ 2021 ◽ Vol 28 (1)

Author(s): Murukarthick Jayakodi ◽ Mona Schreiber ◽ Nils Stein ◽ Martin Mascher

Keyword(s): High Throughput Sequencing ◽ Sequence Diversity ◽ Cultivated Plants ◽ Future Research ◽ Pan Genome ◽ Genomic Studies ◽ Assembly Algorithms ◽ User Friendly ◽ Multiple Reference ◽ Reference Genomes

Abstract Pan-genomic studies aim at representing the entire sequence diversity within a species to provide useful resources for evolutionary studies, functional genomics and breeding of cultivated plants. Cost reductions in high-throughput sequencing and advances in sequence assembly algorithms have made it possible to create multiple reference genomes along with a catalogue of all forms of genetic variations in plant species with large and complex or polyploid genomes. In this review, we summarize the current approaches to building pan-genomes as an in silico representation of plant sequence diversity and outline relevant methods for their effective utilization in linking structural with phenotypic variation. We propose as future research avenues (i) transcriptomic and epigenomic studies across multiple reference genomes and (ii) the development of user-friendly and feature-rich pan-genome browsers.

Significantly improving the quality of genome assemblies through curation

GigaScience ◽ 10.1093/gigascience/giaa153 ◽ 2021 ◽ Vol 10 (1) ◽ Cited By ~ 1

Author(s): Kerstin Howe ◽ William Chow ◽ Joanna Collins ◽ Sarah Pelan ◽ Damon-Lee Pointon ◽ ...

Keyword(s): Genome Assembly ◽ Data Generation ◽ Research Projects ◽ Automated Assembly ◽ Assembly Quality ◽ Assembly Strategy ◽ Assembly Evaluation ◽ Assembly Algorithms ◽ Genome Assemblies

Abstract Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.

Long-read assemblies reveal structural diversity in genomes of organelles - an example with Acacia pycnantha

10.1101/2020.12.22.423164 ◽ 2020

Author(s): Anna E. Syme ◽ Todd G.B. McLay ◽ Frank Udovicic ◽ David J. Cantrill ◽ Daniel J. Murphy

Keyword(s): Mitochondrial Genome ◽ Chloroplast Genome ◽ De Novo ◽ Genomic Structure ◽ Structural Diversity ◽ Mitochondrial Genomes ◽ Long Reads ◽ Organelle Genomes ◽ Long Read ◽ Assembly Algorithms

Abstract Although organelle genomes are typically represented as single, static, circular molecules, there is evidence that the chloroplast genome exists in two structural haplotypes and that the mitochondrial genome can display multiple circular, linear or branching forms. We sequenced and assembled chloroplast and mitochondrial genomes of the Golden Wattle, Acacia pycnantha, using long reads, iterative baiting to extract organelle-only reads, and several assembly algorithms to explore genomic structure. Using a de novo assembly approach agnostic to previous hypotheses about structure, we found different assemblies revealed contrasting arrangements of genomic segments; a hypothesis supported by mapped reads spanning alternate paths.

Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing

International Journal of Molecular Sciences ◽ 10.3390/ijms21239161 ◽ 2020 ◽ Vol 21 (23) ◽ pp. 9161

Author(s): Zhao Chen ◽ David L. Erickson ◽ Jianghong Meng

Keyword(s): Virulence Genes ◽ Bacterial Pathogens ◽ Error Rates ◽ Nanopore Sequencing ◽ Long Reads ◽ Oxford Nanopore ◽ Genomic Analyses ◽ Long Read ◽ Genome Analyses ◽ Assembly Algorithms

Oxford Nanopore sequencing can be used to obtain complete bacterial genomes. However, the error rates of Oxford Nanopore long reads are higher than those of Illumina short reads. Long-read assemblers using a variety of assembly algorithms have been developed to overcome this deficiency, but they have not been benchmarked for genomic analyses of bacterial pathogens using Oxford Nanopore long reads. In this study, the long-read assemblers Canu, Flye, Miniasm/Racon, Raven, Redbean, and Shasta were therefore benchmarked using Oxford Nanopore long reads of bacterial pathogens. Ten species were tested for mediocre- and low-quality simulated reads, and 10 species were tested for real reads. Raven was the most robust assembler, obtaining complete and accurate genomes. All Miniasm/Racon and Raven assemblies of mediocre-quality reads provided accurate antimicrobial resistance (AMR) profiles, while the Raven assembly of Klebsiella variicola with low-quality reads was the only assembly with an accurate AMR profile among all assemblers and species. All assemblers functioned well for predicting virulence genes using mediocre-quality and real reads, whereas only the Raven assemblies of low-quality reads had accurate numbers of virulence genes. Regarding multilocus sequence typing (MLST), Miniasm/Racon was the most effective assembler for mediocre-quality reads, while only the Raven assemblies of Escherichia coli O157:H7 and K. variicola with low-quality reads showed positive MLST results. Miniasm/Racon and Raven were the best performers for MLST using real reads. The Miniasm/Racon and Raven assemblies showed accurate phylogenetic inference. For the pan-genome analyses, Raven was the strongest assembler for simulated reads, whereas Miniasm/Racon and Raven performed the best for real reads. Overall, the most robust and accurate assembler was Raven, closely followed by Miniasm/Racon.

Benchmarking of long-read assemblers for prokaryote whole genome sequencing

F1000Research ◽ 10.12688/f1000research.21782.3 ◽ 2020 ◽ Vol 8 ◽ pp. 2138 ◽ Cited By ~ 2

Author(s): Ryan R. Wick ◽ Kathryn E. Holt

Keyword(s): Data Sets ◽ Computationally Efficient ◽ Oxford Nanopore ◽ Long Read ◽ Sequencing Platforms ◽ Computational Resources ◽ Assembly Algorithms ◽ Oxford Nanopore Technologies ◽ Sequence Errors ◽ Multiple Assembly

Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled – one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of eight long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, NextDenovo/NextPolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v2.0 produced reliable assemblies and was good with plasmids, but it performed poorly with circularisation and had the longest runtimes of all assemblers tested. Flye v2.8 was also reliable and made the smallest sequence errors, though it used the most RAM. Miniasm/Minipolish v0.3/v0.1.3 was the most likely to produce clean contig circularisation. NECAT v20200119 was reliable and good at circularisation but tended to make larger sequence errors. NextDenovo/NextPolish v2.3.0/v1.2.4 was reliable with chromosome assembly but bad with plasmid assembly. Raven v1.1.10 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.5.1 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.

assembly algorithms
Recently Published Documents

Empirical evaluation of methods for de novo genome assembly

Widespread false gene gains caused by duplication errors in genome assemblies

False gene and chromosome losses affected by assembly and sequence errors

Benchmarking of long-read assemblers for prokaryote whole genome sequencing

Gene Sequence Assembly Algorithm Model Based on the DBG Strategy and Its Application

Building pan-genome infrastructures for crop plants and their use in association genetics

Significantly improving the quality of genome assemblies through curation

Long-read assemblies reveal structural diversity in genomes of organelles - an example with Acacia pycnantha

Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing

Benchmarking of long-read assemblers for prokaryote whole genome sequencing
