The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies

Abstract The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8–15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to “polish” the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.

Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

10.1101/2021.12.08.471868 ◽ 2021

Author(s): Alaina Shumate ◽ Brandon Wong ◽ Geo Pertea ◽ Mihaela Pertea

Keyword(s): RNA Sequencing ◽ Open Source Software ◽ Transcriptome Assembly ◽ Simulated Data ◽ Real Data ◽ Short Read ◽ Short Reads ◽ Long Reads ◽ Long Read ◽ Improved Accuracy

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are unable to span multiple exons. Long-read technology can capture full-length transcripts, but its high error rate often leads to mis-identified splice sites, and its low throughput makes quantification difficult. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

NextPolish: a fast and efficient genome polishing tool for long-read assembly

Bioinformatics ◽ 10.1093/bioinformatics/btz891 ◽ 2019 ◽ Vol 36 (7) ◽ pp. 2253-2255 ◽ Cited By ~ 11

Author(s): Jiang Hu ◽ Junpeng Fan ◽ Zongyi Sun ◽ Shanlin Liu

Keyword(s): Error Rates ◽ Supplementary Information ◽ Sequencing Technologies ◽ Large Numbers ◽ Long Reads ◽ Long Read ◽ Genome Assemblies ◽ Polishing Tool ◽ Sequence Errors ◽ Plant Arabidopsis Thaliana

Abstract Motivation Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled with long reads. This new tool consists of two interlinked modules that are designed to score and count K-mers from high quality short reads, and to polish genome assemblies containing large numbers of base errors. Results When evaluated for the speed and efficiency using human and a plant (Arabidopsis thaliana) genomes, NextPolish outperformed Pilon by correcting sequence errors faster, and with a higher correction accuracy. Availability and implementation NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish. Supplementary information Supplementary data are available at Bioinformatics online.

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqab034 ◽ 2021 ◽ Vol 3 (2)

Author(s): Jean-Marc Aury ◽ Benjamin Istace

Keyword(s): Single Molecule ◽ Direct Consequence ◽ High Quality ◽ Sequencing Errors ◽ Coding Regions ◽ Sequencing Technologies ◽ Long Reads ◽ Oxford Nanopore ◽ Long Read ◽ Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Education in the genomics era: Generating high-quality genome assemblies in university courses

GigaScience ◽ 10.1093/gigascience/giaa058 ◽ 2020 ◽ Vol 9 (6) ◽ Cited By ~ 3

Author(s): Stefan Prost ◽ Sven Winter ◽ Jordi De Raad ◽ Raphael T F Coimbra ◽ Magnus Wolf ◽ ...

Keyword(s): Low Cost ◽ Genomic Data ◽ Master's Level ◽ Genome Data ◽ Sequencing Technologies ◽ Third Generation Sequencing ◽ University Courses ◽ Hands On ◽ Genome Assemblies ◽ High Quality Genome

Abstract Recent advances in genome sequencing technologies have simplified the generation of genome data and reduced the costs for genome assemblies, even for complex genomes like those of vertebrates. More practically oriented genomic courses can prepare university students for the increasing importance of genomic data used in biological and medical research. Low-cost third-generation sequencing technology, along with publicly available data, can be used to teach students how to process genomic data, assemble full chromosome-level genomes, and publish the results in peer-reviewed journals, or preprint servers. Here we outline experiences gained from 2 master's-level courses and discuss practical considerations for teaching hands-on genome assembly courses.

ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

10.1101/2020.01.13.905240 ◽ 2020

Author(s): Lauren Coombe ◽ Vladimir Nikolić ◽ Justin Chu ◽ Inanc Birol ◽ René L. Warren

Keyword(s): Reference Sequence ◽ Biological Research ◽ Closely Related Species ◽ Draft Assembly ◽ Short Read ◽ Sequencing Technologies ◽ Long Read ◽ Genome Assemblies ◽ High Quality Genome ◽ Reference Human Genome

Abstract Summary The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]
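
The idea of an ordered minimizer sketch can be illustrated with a short Python sketch. This is not ntJoin's implementation: the function names, the tiny k and window values, and the use of lexicographic order in place of a hash are all assumptions chosen for readability. It only shows how a sequence is reduced to an ordered list of window minimizers, and how minimizers unique to both a draft and a reference could serve as anchors between them.

```python
from collections import Counter


def minimizer_sketch(seq, k=5, w=4):
    """Return the ordered minimizers of seq as (kmer, position) pairs.

    Assumption: the minimizer of a window of w consecutive k-mers is the
    lexicographically smallest k-mer (leftmost on ties). Real tools usually
    order k-mers by a hash value instead.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        # Consecutive windows often share a minimizer; record it once.
        if not sketch or sketch[-1][1] != pos:
            sketch.append((window[offset], pos))
    return sketch


def shared_unique_minimizers(sketch_a, sketch_b):
    """Minimizers occurring exactly once in each sketch, in sketch_a order."""
    count_a = Counter(kmer for kmer, _ in sketch_a)
    count_b = Counter(kmer for kmer, _ in sketch_b)
    return [kmer for kmer, _ in sketch_a
            if count_a[kmer] == 1 and count_b[kmer] == 1]


if __name__ == "__main__":
    # Toy sequences (hypothetical): the draft is a fragment of the reference.
    reference = "GATTACAGGCTTAACCGGTTACGATCGA"
    draft = reference[5:22]
    anchors = shared_unique_minimizers(minimizer_sketch(draft),
                                       minimizer_sketch(reference))
    print(anchors)
```

Because each sketch keeps minimizers in sequence order, the order of shared anchors along the draft can be compared with their order along the reference, which is the kind of information a graph-based mapping step would consume.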

HAHap: a read-based haplotyping method using hierarchical assembly

PeerJ ◽ 10.7717/peerj.5852 ◽ 2018 ◽ Vol 6 ◽ pp. e5852

Author(s): Yu-Yu Lin ◽ Ping Chun Wu ◽ Pei-Lung Chen ◽ Yen-Jen Oyang ◽ Chien-Yu Chen

Keyword(s): Simulated Data ◽ Real Data ◽ Error Rates ◽ Lower Number ◽ Nucleotide Polymorphism ◽ Single Nucleotide ◽ Hierarchical Assembly ◽ Sequencing Technologies ◽ Error Corrections ◽ Selection Of

Background The need for read-based phasing arises with advances in sequencing technologies. The minimum error correction (MEC) approach is the primary trend to resolve haplotypes by reducing conflicts in a single nucleotide polymorphism-fragment matrix. However, it is frequently observed that the solution with the optimal MEC might not be the real haplotypes, due to the fact that MEC methods consider all positions together and sometimes the conflicts in noisy regions might mislead the selection of corrections. To tackle this problem, we present a hierarchical assembly-based method designed to progressively resolve local conflicts. Results This study presents HAHap, a new phasing algorithm based on hierarchical assembly. HAHap leverages high-confident variant pairs to build haplotypes progressively. The phasing results by HAHap on both real and simulated data, compared to other MEC-based methods, revealed better phasing error rates for constructing haplotypes using short reads from whole-genome sequencing. We compared the number of error corrections (ECs) on real data with other methods, and it reveals the ability of HAHap to predict haplotypes with a lower number of ECs. We also used simulated data to investigate the behavior of HAHap under different sequencing conditions, highlighting the applicability of HAHap in certain situations.
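
The minimum error correction (MEC) criterion mentioned above can be made concrete with a small example. The code below is a generic illustration of the MEC score on a toy SNP-fragment matrix, not the HAHap algorithm; the matrix, the 0/1/None encoding, and the brute-force search are assumptions that only work at toy scale.

```python
from itertools import product


def mismatches(fragment, haplotype):
    """Count covered sites where a fragment disagrees with a haplotype."""
    return sum(1 for allele, h in zip(fragment, haplotype)
               if allele is not None and allele != h)


def mec_score(fragments, haplotype):
    """MEC score: each fragment is assigned to the closer of the haplotype
    and its complement, and the remaining disagreements are summed."""
    complement = [1 - h for h in haplotype]
    return sum(min(mismatches(f, haplotype), mismatches(f, complement))
               for f in fragments)


def best_haplotype(fragments, n_sites):
    """Exhaustive search over all haplotypes; feasible only for tiny inputs."""
    candidates = (list(h) for h in product((0, 1), repeat=n_sites))
    return min(candidates, key=lambda h: mec_score(fragments, h))


# Hypothetical SNP-fragment matrix: rows are reads, columns are heterozygous
# sites, None marks a site the read does not cover.
fragments = [
    [0, 0, None, None],
    [None, 0, 1, None],
    [1, 1, None, None],
    [None, 1, 0, 0],
    [None, None, 1, 0],  # conflicts with the fragment above at sites 3 and 4
]

if __name__ == "__main__":
    hap = best_haplotype(fragments, 4)
    print(hap, mec_score(fragments, hap))
```

The last fragment cannot be reconciled with the fourth one, so no haplotype pair reaches a score of zero. Cases like this, where a noisy region forces a correction somewhere, are where the choice of which entries to correct can diverge from the true haplotypes.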

Fast-SG: An alignment-free algorithm for hybrid assembly

10.1101/209122 ◽ 2017

Author(s): Alex Di Genova ◽ Gonzalo A. Ruz ◽ Marie-France Sagot ◽ Alejandro Maass

Keyword(s): De Novo ◽ Reference Level ◽ Hybrid Assembly ◽ Short Read ◽ Sequencing Technologies ◽ Alignment Free ◽ Long Reads ◽ Long Read ◽ Definition Of ◽ Large Genomes

Abstract Long read sequencing technologies are the ultimate solution for genome repeats, allowing near reference level reconstructions of large genomes. However, long read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods which combine short and long read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. In this paper, we propose a new method, called FAST-SG, which uses a new ultra-fast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. FAST-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how FAST-SG outperforms the state-of-the-art short read aligners when building the scaffolding graph, and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using FAST-SG with shallow long read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878).

Finding Long Tandem Repeats In Long Noisy Reads

Bioinformatics ◽ 10.1093/bioinformatics/btaa865 ◽ 2020

Author(s): Shinichi Morishita ◽ Kazuki Ichikawa ◽ Gene Myers

Keyword(s): Tandem Repeat ◽ Error Rate ◽ Tandem Repeats ◽ Repeat Unit ◽ Error Rates ◽ De Bruijn Graph ◽ Frequency Distributions ◽ Sequencing Technologies ◽ Long Reads ◽ Repeat Expansions

Abstract Motivation Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10,000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10%-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time. Results Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder (TRF), a widely used program for finding tandem repeats, in terms of sensitivity. Software availability https://github.com/morisUtokyo/mTR
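
The first stage described above, scanning fixed-size windows for skewed k-mer frequency distributions, can be sketched as follows. This is a simplified illustration and not the mTR code: the scoring rule (fraction of k-mer occurrences whose k-mer appears at least twice in the window), the window size, step, threshold, and the error-free toy read are all assumptions.

```python
import random
from collections import Counter


def repeated_kmer_fraction(window, k=5):
    """Fraction of k-mer occurrences in the window whose k-mer appears at
    least twice there; close to 1.0 inside a tandem repeat."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c >= 2) / total


def candidate_windows(read, window=100, step=50, k=5, threshold=0.9):
    """Return (start, score) for windows that may lie within a tandem repeat."""
    hits = []
    for start in range(0, len(read) - window + 1, step):
        score = repeated_kmer_fraction(read[start:start + window], k)
        if score >= threshold:
            hits.append((start, score))
    return hits


def random_dna(n, rng):
    """Pseudo-random flank sequence for the toy read."""
    return "".join("ACGT"[int(rng.random() * 4)] for _ in range(n))


# Toy read (hypothetical): 200 nt flank, 40 copies of a 5 nt unit, 200 nt flank.
rng = random.Random(1)
read = random_dna(200, rng) + "ACGTT" * 40 + random_dna(200, rng)

if __name__ == "__main__":
    for start, score in candidate_windows(read):
        print(start, round(score, 2))
```

With real long reads the unit copies carry errors, so the score inside a repeat would fall below 1.0 and the threshold would need tuning; the abstract's point is that enough short k-mers survive error-free across many copies for the signal to remain visible.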

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqaa075 ◽ 2020 ◽ Vol 2 (3)

Author(s): Cheng He ◽ Guifang Lin ◽ Hairong Wei ◽ Haibao Tang ◽ Frank F White ◽ ...

Keyword(s): Copy Number ◽ Error Rates ◽ Genome Sequences ◽ Short Reads ◽ Sequencing Technologies ◽ Insertion And Deletion ◽ Novel Approach ◽ Long Reads ◽ Long Read ◽ Genome Assemblies

Abstract Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated for computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public uses.
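
A k-mer abundance difference of this kind can be illustrated with a small calculation. The abstract does not give the formula, so the log-ratio used below, log2((c + m) / (m * (n + 1))) with c the k-mer count in reads, n its copy number in the assembly, and m the typical read depth of a single-copy k-mer, should be read as an assumption. It is chosen because it gives 0 when reads and assembly agree, -1 for an assembly k-mer with no read support, and +1 for a single-copy k-mer missing from the assembly. Counting is forward-strand only, which is a further simplification.

```python
import math
import statistics
from collections import Counter


def count_kmers(sequences, k):
    """Count forward-strand k-mers across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kad(read_count, assembly_copies, mode_depth):
    """Assumed KAD definition: log2((c + m) / (m * (n + 1)))."""
    return math.log2((read_count + mode_depth)
                     / (mode_depth * (assembly_copies + 1)))


def kad_profile(reads, assembly, k):
    """KAD value for every k-mer seen in the reads or in the assembly."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers([assembly], k)
    # Assumption: the most common read k-mer count approximates the depth
    # of a single-copy k-mer.
    mode_depth = statistics.mode(list(read_counts.values()))
    kmers = set(read_counts) | set(asm_counts)
    return {kmer: kad(read_counts[kmer], asm_counts[kmer], mode_depth)
            for kmer in kmers}


# Toy data (hypothetical): ten error-free reads of the true sequence and an
# assembly carrying one substitution (A -> T at the fifth base).
reads = ["ACGTAGCA"] * 10
assembly = "ACGTTGCA"

if __name__ == "__main__":
    for kmer, value in sorted(kad_profile(reads, assembly, 4).items()):
        print(kmer, value)
```

In this toy case every assembly k-mer overlapping the substitution gets a value of -1, while the true k-mers it displaced get +1, which matches the abstract's claim that simple KAD categories can point to base errors.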

Lep-Anchor: automated construction of linkage map anchored haploid genomes

Bioinformatics ◽ 10.1093/bioinformatics/btz978 ◽ 2020 ◽ Vol 36 (8) ◽ pp. 2359-2364 ◽ Cited By ~ 4

Author(s): Pasi Rastas

Keyword(s): Linkage Map ◽ De Novo ◽ Simulated Data ◽ Real Data ◽ Significant Loss ◽ Linkage Maps ◽ Additional Information ◽ Correct Orientation ◽ Genome Assemblies ◽ Speed Accuracy

Abstract Motivation Linkage mapping provides a practical way to anchor de novo genome assemblies into chromosomes and to detect chimeric or otherwise erroneous contigs. Such anchoring improves with higher number of markers and individuals, as long as the mapping software can handle all the information. Recent software Lep-MAP3 can robustly construct linkage maps for millions of genotyped markers and on thousands of individuals, providing optimal maps for genome anchoring. For such large datasets, automated and robust genome anchoring tool is especially valuable and can significantly reduce intensive computational and manual work involved. Results Here, we present a software Lep-Anchor (LA) to anchor genome assemblies automatically using dense linkage maps. As the main novelty, it takes into account the uncertainty of the linkage map positions caused by low recombination regions, cross type or poor mapping data quality. Furthermore, it can automatically detect and cut chimeric contigs, and use contig–contig, single read or alternative genome assembly alignments as additional information on contig order and orientations and to collapse haplotype contigs. We demonstrate the performance of LA using real data and show that it outperforms ALLMAPS on anchoring completeness and speed. Accuracy-wise LA and ALLMAPS are about equal, but at the expense of lower completeness of ALLMAPS. The software Chromonomer was faster than the other two methods but has major limitations and is lower in accuracy. We also show that with additional information, such as contig–contig and read alignments, the anchoring completeness can be improved by up to 70% without significant loss in accuracy. Based on simulated data, we conclude that the anchoring accuracy can be improved by utilizing information about map position uncertainty. Accuracy is the rate of contigs in correct orientation and completeness is the number of contigs with inferred orientation.
Availability and implementation Lep-Anchor is available with the source code under GNU general public license from http://sourceforge.net/projects/lep-anchor. All the scripts and code used to produce the reported results are included with Lep-Anchor.
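
The two evaluation measures defined at the end of the abstract can be written out as a short function. The denominator for accuracy is not stated explicitly, so taking it over the contigs that received an orientation is an assumption, as are the contig names and the "+"/"-"/None encoding.

```python
def anchoring_metrics(true_orientation, inferred_orientation):
    """Return (accuracy, completeness) for a set of anchored contigs.

    completeness: number of contigs with an inferred orientation.
    accuracy: share of those contigs whose inferred orientation is correct
              (assumed denominator; the abstract only says "rate").
    """
    inferred = {contig: orient
                for contig, orient in inferred_orientation.items()
                if orient is not None}
    completeness = len(inferred)
    correct = sum(1 for contig, orient in inferred.items()
                  if true_orientation[contig] == orient)
    accuracy = correct / completeness if completeness else 0.0
    return accuracy, completeness


# Hypothetical example: four contigs, one left unoriented, one flipped.
truth = {"c1": "+", "c2": "-", "c3": "+", "c4": "-"}
inferred = {"c1": "+", "c2": "+", "c3": "+", "c4": None}

if __name__ == "__main__":
    print(anchoring_metrics(truth, inferred))
```

Under these definitions a tool can raise its accuracy by declining to orient uncertain contigs, which is why the abstract reports the two numbers together.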
