BiSCoT: improving large eukaryotic genome assemblies with optical maps

PeerJ ◽

10.7717/peerj.10150 ◽

2020 ◽

Vol 8 ◽

pp. e10150

Author(s):

Benjamin Istace ◽

Caroline Belser ◽

Jean-Marc Aury

Keyword(s):

Human Genome ◽

Fasta File ◽

Plant Genomes ◽

Link Type ◽

Long Reads ◽

Entire Chromosome ◽

Long Read ◽

Optical Maps ◽

Genome Assemblies

Motivation: Long-read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct the structure of entire chromosomes or chromosome arms. However, existing tools are often too conservative, and the organization of contigs into scaffolds is not always optimal. Results: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes the files generated during Bionano scaffolding to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads, and it significantly improved the contiguity and quality of the assemblies. BiSCoT generates a FASTA file of the assembly as well as an AGP file that describes the new organization of the input assembly. Availability: BiSCoT and the improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and on PyPI at https://pypi.org/project/biscot/.
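
The AGP file mentioned in this abstract is a tab-separated text format with nine columns per line. The following is a minimal sketch of reading one such line, assuming the standard AGP 2.x column layout. The function name `parse_agp_line` and the two sample lines are illustrative assumptions and are not taken from BiSCoT or its output.

```python
def parse_agp_line(line):
    """Parse one non-comment AGP line into a dict (assumes AGP 2.x, 9 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = {
        "object": fields[0],            # scaffold being described
        "object_beg": int(fields[1]),   # 1-based start on the scaffold
        "object_end": int(fields[2]),   # 1-based inclusive end on the scaffold
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in ("N", "U"):
        # Gap line: columns 6-9 describe the gap.
        record.update(
            gap_length=int(fields[5]),
            gap_type=fields[6],
            linkage=fields[7],
            linkage_evidence=fields[8],
        )
    else:
        # Sequence line: columns 6-9 describe the placed contig.
        record.update(
            component_id=fields[5],
            component_beg=int(fields[6]),
            component_end=int(fields[7]),
            orientation=fields[8],
        )
    return record


# Hypothetical sample lines, made up for illustration only.
SAMPLE_AGP = [
    "scaffold_1\t1\t1000\t1\tW\tcontig_1\t1\t1000\t+",
    "scaffold_1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tmap",
]
records = [parse_agp_line(line) for line in SAMPLE_AGP]
```

Each record then says which contig (or gap) occupies which interval of a scaffold, which is the information needed to rebuild the scaffold sequence from the input contigs.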

BiSCoT: Improving large eukaryotic genome assemblies with optical maps

10.1101/674721 ◽

2019 ◽

Cited By ~ 1

Author(s):

Benjamin Istace ◽

Caroline Belser ◽

Jean-Marc Aury

Keyword(s):

Human Genome ◽

Fasta File ◽

Plant Genomes ◽

Link Type ◽

Long Reads ◽

Entire Chromosome ◽

Long Read ◽

Optical Maps ◽

Genome Assemblies

Motivation: Long-read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct the structure of entire chromosomes or chromosome arms. However, existing tools are often too conservative, and the organization of contigs into scaffolds is not always optimal. Results: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes the files generated during Bionano scaffolding to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads, and it significantly improved the contiguity and quality of the assemblies. BiSCoT generates a FASTA file of the assembly as well as an AGP file that describes the new organization of the input assembly. Availability: BiSCoT and the improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and on PyPI at https://pypi.org/project/biscot/.

Gene Ontology Meta Annotator for Plants

10.1101/809988 ◽

2019 ◽

Cited By ~ 1

Author(s):

Kokulapalan Wimalanathan ◽

Carolyn J. Lawrence-Dill

Keyword(s):

Gene Ontology ◽

Go Annotation ◽

Plant Genomes ◽

Link Type ◽

Gene Structures ◽

And Performance ◽

Genome Assemblies ◽

Genome Scale ◽

Per Gene

Annotating gene structures and functions on genome assemblies is necessary to make assembly resources useful for biological inference. Gene Ontology (GO) term assignment is the most pervasively used functional annotation system, and new methods for GO assignment have improved the quality of GO-based function predictions. The Gene Ontology Meta Annotator for Plants (GOMAP) is an optimized, high-throughput, and reproducible pipeline for genome-scale GO annotation of plant genomes. GOMAP's methods have been shown to increase the number of genes annotated and the number of annotations assigned per gene, as well as the quality (based on F-score) of GO assignments in maize. Here we report on the pipeline's availability and performance for annotating large, repetitive plant genomes and describe how to deploy GOMAP to annotate additional plant genomes. We containerized GOMAP to increase portability and reproducibility, and optimized its performance for HPC environments. GOMAP has been used to annotate multiple maize lines and is currently being deployed to annotate other species, including wheat, rice, barley, cotton, and soy. Instructions along with access to the GOMAP Singularity container are freely available online at https://gomap-singularity.readthedocs.io/en/latest/. A list of annotated genomes and links to data is maintained at https://dill-picl.org/projects/gomap/gomap-datasets/.

Merfin: improved variant filtering and polishing via k-mer validation

10.1101/2021.07.16.452324 ◽

2021 ◽

Author(s):

Giulio Formenti ◽

Arang Rhie ◽

Brian P Walenz ◽

Francoise Thibaud-Nissen ◽

Kishwar Shafin ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Read Mapping ◽

Mapping Algorithm ◽

Copy Numbers ◽

Long Reads ◽

Variant Filtering ◽

Long Read ◽

Finishing Tool

Read mapping and variant calling approaches have been widely used for accurate genotyping and for improving the consensus quality of assemblies built from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and of the variant caller's internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of variant calls and reduced frameshift errors when applied to PacBio HiFi, PacBio CLR, or Nanopore long-read based assemblies. We demonstrate its utility by polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.
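
The idea of validating a call against k-mers observed in the reads can be illustrated with a deliberately simplified sketch. It assumes a presence/absence test only: an edit is kept if it reduces the number of k-mers around the site that never occur in the reads. Merfin itself reasons about expected k-mer multiplicity (copy number), which this toy version does not model, and all function names here are hypothetical.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def missing_kmers(seq, pos, span, k, read_counts):
    """Count k-mers overlapping seq[pos:pos+span] that are absent from the reads."""
    start = max(0, pos - k + 1)
    end = min(len(seq) - k, pos + span - 1)
    return sum(
        1 for i in range(start, end + 1) if read_counts[seq[i:i + k]] == 0
    )


def keep_variant(assembly, pos, ref, alt, k, read_counts):
    """Keep the edit ref->alt at pos only if it removes unsupported k-mers.

    Simplified assumption: support means 'seen at least once in the reads'.
    """
    if assembly[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the assembly at pos")
    edited = assembly[:pos] + alt + assembly[pos + len(ref):]
    before = missing_kmers(assembly, pos, len(ref), k, read_counts)
    after = missing_kmers(edited, pos, len(alt), k, read_counts)
    return after < before


# Toy data: one error-free read and an assembly with a single wrong base.
truth = "ACGTTGCAAGGCT"
read_counts = count_kmers([truth], 4)
draft = "ACGTTGTAAGGCT"  # position 6 is T instead of C
```

With this toy data, the correcting edit at position 6 is kept and the reverse edit applied to the error-free sequence is rejected, without consulting any alignment or caller score.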

HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies

10.1101/2019.12.19.882506 ◽

2019 ◽

Cited By ~ 4

Author(s):

Ritu Kundu ◽

Joshua Casey ◽

Wing-Kin Sung

Keyword(s):

Partial Order ◽

Human Genome ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Population Scale ◽

Genome Assemblies ◽

Large Genomes

Efforts towards making population-scale long-read genome assemblies (especially of human genomes) viable have intensified recently with the emergence of many fast assemblers. Because these fast assemblers rely on polishing for the accuracy of their assemblies, polishing is crucial. We present HyPo, a Hybrid Polisher that utilises short as well as long reads within a single run to polish a long-read assembly of small and large genomes. It exploits unique genomic k-mers to selectively polish segments of contigs using partial order alignment of selected read segments. As demonstrated on human genome assemblies, HyPo generates significantly more accurate polished assemblies in about one-third of the time and with about half the memory requirements of Racon (currently the most widely used polisher).

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions, where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes, and they typically switch from one haplotype to another. Here we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

10.1101/2020.04.17.035287 ◽

2020 ◽

Author(s):

Yuxuan Yuan ◽

Philipp E. Bayer ◽

Robyn Anderson ◽

HueyTyng Lee ◽

Chon-Kit Kenneth Chan ◽

...

Keyword(s):

Genome Assembly ◽

Chinese Spring ◽

Complete Genome ◽

Reference Genome ◽

Computing Time ◽

Link Type ◽

Recent Advances ◽

Long Read ◽

Genome Assemblies

Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads that can span repetitive regions. However, the overlap-based assembly methods routinely used for these data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel, followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250-CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA
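
The binning step described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the RefKA implementation: bins are fixed-size slices of a single reference string, k-mers that span a bin boundary are ignored, and the function names are hypothetical. The last two lines only check the arithmetic reported in the abstract (wall hours times CPUs).

```python
from collections import defaultdict


def split_into_bins(reference, bin_size):
    """Cut a reference sequence into consecutive fixed-size bins."""
    return [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_unique_kmers(bins, k):
    """For each bin, return the k-mers that occur in that bin and no other."""
    owners = defaultdict(set)  # k-mer -> indices of the bins containing it
    for idx, bin_seq in enumerate(bins):
        for kmer in kmers(bin_seq, k):
            owners[kmer].add(idx)
    unique = [set() for _ in bins]
    for kmer, idxs in owners.items():
        if len(idxs) == 1:
            unique[next(iter(idxs))].add(kmer)
    return unique


# Toy reference of two 6-base bins that share the 3-mers ACG and TAC.
toy_bins = split_into_bins("ACGTACTACGGA", 6)
toy_unique = bin_unique_kmers(toy_bins, 3)

# Arithmetic check of the benchmark figures quoted in the abstract.
cpu_hours = 168 * 250  # wall hours x CPUs
```

A read carrying k-mers from `toy_unique[i]` could then be assigned to bin `i`, which is what lets the bins be assembled independently and in parallel.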

Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies

10.1101/2020.03.15.992941 ◽

2020 ◽

Cited By ~ 15

Author(s):

Arang Rhie ◽

Brian P. Walenz ◽

Sergey Koren ◽

Adam M. Phillippy

Keyword(s):

De Novo ◽

High Accuracy ◽

Link Type ◽

Base Level ◽

Project Home Page ◽

Set Operations ◽

Assembly Evaluation ◽

Long Read ◽

Genome Assemblies ◽

Reference Genomes

Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation. Availability of data and material: Project name: Merqury; Project home page: https://github.com/marbl/merqury, https://github.com/marbl/meryl; Archived version: https://github.com/marbl/merqury/releases/tag/v1.0; Operating system(s): Platform independent; Programming language: C++, Java, Perl; Other requirements: gcc 4.8 or higher, java 1.6 or higher; License: Public domain (see https://github.com/marbl/merqury/blob/master/README.license); Any restrictions to use by non-academics: No restrictions applied.
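
A rough sketch of how k-mer set operations can yield accuracy and completeness estimates is given below. It assumes the commonly used k-mer survival formulation, in which the fraction of assembly k-mers also found in the reads is raised to the power 1/k to approximate per-base correctness and is then converted to a Phred-scaled QV. For simplicity it works on distinct k-mers held in Python sets; the function names are hypothetical and this is not Merqury's code.

```python
import math


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_qv(assembly_kmers, read_kmers, k):
    """Phred-scaled consensus quality from the share of read-supported k-mers."""
    total = len(assembly_kmers)
    if total == 0:
        raise ValueError("assembly has no k-mers")
    shared = len(assembly_kmers & read_kmers)
    p_base_correct = (shared / total) ** (1.0 / k)
    error = 1.0 - p_base_correct
    return math.inf if error == 0 else -10.0 * math.log10(error)


def kmer_completeness(assembly_kmers, read_kmers):
    """Fraction of read k-mers that are recovered in the assembly."""
    if not read_kmers:
        raise ValueError("read set has no k-mers")
    return len(assembly_kmers & read_kmers) / len(read_kmers)


# Toy example: four distinct assembly 3-mers, three of them seen in the reads.
asm = kmer_set("ACGTACGT", 3)
reads = {"ACG", "CGT", "GTA", "TTT"}
qv = kmer_qv(asm, reads, 3)
completeness = kmer_completeness(asm, reads)
```

In practice the read k-mer set would first be filtered to drop low-count k-mers that likely come from sequencing errors; that filtering is omitted here.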

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

10.1101/715722 ◽

2019 ◽

Cited By ~ 21

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Ryan Lorig-Roach ◽

Marina Haukness ◽

Hugh E. Olsen ◽

...

Keyword(s):

Human Genome ◽

De Novo ◽

Proximity Ligation ◽

Current State ◽

Human Genomes ◽

Sequencing Method ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Assembly Performance

Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity, and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta, a de novo long-read assembler, and MarginPolish & HELEN, a suite of nanopore assembly polishing algorithms. On a single commercial compute node, Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid, and trio-binned human samples in terms of accuracy, cost, and time, and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that the addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
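
Two of the metrics quoted in this abstract have simple definitions that a short sketch can make concrete: the Phred-scaled quality value that pairs 99.9% identity with QV30, and the N50 of a set of read lengths. The function names are illustrative, and the N50 definition used is the usual one (the length at which the longest reads, taken in descending order, first cover at least half of all bases).

```python
import math


def identity_to_qv(identity):
    """Convert a fractional identity (0 <= identity < 1) to a Phred-scaled QV."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(1.0 - identity)


def n50(lengths):
    """Return the N50 of a non-empty list of sequence lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# 99.9% identity means one error per 1,000 bases, i.e. a QV of about 30.
qv_from_identity = identity_to_qv(0.999)
# Toy read lengths: total 20, and the two longest reads (6 + 5) cover half.
toy_n50 = n50([2, 3, 4, 5, 6])
```

Because the QV scale is logarithmic, each additional 10 points corresponds to a tenfold lower error rate, so QV40 would mean one error per 10,000 bases.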

SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification

10.1101/118083 ◽

2017 ◽

Cited By ~ 9

Author(s):

Manuel Tardaguila ◽

Lorena de la Fuente ◽

Cristina Marti ◽

Cécile Pereira ◽

Francisco Jose Pardo-Palacios ◽

...

Keyword(s):

High Throughput Sequencing ◽

Full Length ◽

The Novel ◽

Extensive Evaluation ◽

Long Reads ◽

Long Read ◽

Novel Transcripts ◽

Mouse Transcriptome

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes 47 descriptors that can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of, and characterizing, the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, result more frequently in novel ORFs than in novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection and are variable in protein changes with respect to the principal isoform of their genes. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.
