scholarly journals BiSCoT: improving large eukaryotic genome assemblies with optical maps

PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10150
Author(s):  
Benjamin Istace ◽  
Caroline Belser ◽  
Jean-Marc Aury

Motivation Long read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct entire chromosome or chromosome arms structure. However, the existing tools are often too conservative and organization of contigs into scaffolds is not always optimal. Results We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes files generated during a Bionano scaffolding in order to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads and improved significantly the contiguity and quality of the assemblies. BiSCoT generates a fasta file of the assembly as well as an AGP file which describes the new organization of the input assembly. Availability BiSCoT and improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and Pypi at https://pypi.org/project/biscot/.

2019 ◽  
Author(s):  
Benjamin Istace ◽  
Caroline Belser ◽  
Jean-Marc Aury

ABSTRACTMotivationLong read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct entire chromosome or chromosome arms structure. However, the existing tools are often too conservative and organization of contigs into scaffolds is not always optimal.ResultsWe developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes files generated during a Bionano scaffolding in order to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads and improved significantly the contiguity and quality of the assemblies. BiSCoT generates a fasta file of the assembly as well as an AGP file which describes the new organization of the input assembly.AvailabilityBiSCoT and improved assemblies are freely available on Github at http://www.genoscope.cns.fr/biscot and Pypi at https://pypi.org/project/biscot/.


2019 ◽  
Author(s):  
Kokulapalan Wimalanathan ◽  
Carolyn J. Lawrence-Dill

AbstractAnnotating gene structures and functions to genome assemblies is a must to make assembly resources useful for biological inference. Gene Ontology (GO) term assignment is the most pervasively used functional annotation system, and new methods for GO assignment have improved the quality of GO-based function predictions. GOMAP, the Gene Ontology Meta Annotator for Plants (GOMAP) is an optimized, high-throughput, and reproducible pipeline for genome-scale GO annotation for plant genomes. GOMAP’s methods have been shown to expand and improve the number of genes annotated and annotations assigned per gene as well as the quality (based on F-score) of GO assignments in maize. Here we report on the pipeline’s availability and performance for annotating large, repetitive plant genomes and describe how to deploy GOMAP to annotate additional plant genomes. We containerized GOMAP to increase portability and reproducibility, and optimized its performance for HPC environments. GOMAP has been used to annotate multiple maize lines, and is currently being deployed to annotate other species including wheat, rice, barley, cotton, soy, and others. Instructions along with access to the GOMAP Singularity container are freely available online at https://gomap-singularity.readthedocs.io/en/latest/. A list of annotated genomes and links to data is maintained at https://dill-picl.org/projects/gomap/gomap-datasets/.


2021 ◽  
Author(s):  
Giulio Formenti ◽  
Arang Rhie ◽  
Brian P Walenz ◽  
Francoise Thibaud-Nissen ◽  
Kishwar Shafin ◽  
...  

Read mapping and variant calling approaches have been widely used for accurate genotyping and improving consensus quality assembled from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of a variant call and reduced frameshift errors when applied to PacBio HiFi, PacBio CLR, or Nanopore long read based assemblies. We demonstrate the utility while polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.


Author(s):  
Ritu Kundu ◽  
Joshua Casey ◽  
Wing-Kin Sung

ABSTRACTEfforts towards making population-scale long read genome assemblies (especially human genomes) viable have intensified recently with the emergence of many fast assemblers. The reliance of these fast assemblers on polishing for the accuracy of assemblies makes it crucial. We present HyPo–a Hybrid Polisher–that utilises short as well as long reads within a single run to polish a long read assembly of small and large genomes. It exploits unique genomic kmers to selectively polish segments of contigs using partial order alignment of selective read-segments. As demonstrated on human genome assemblies, Hypo generates significantly more accurate polished assemblies in about one-third time with about half the memory requirements in comparison to Racon (the widely used polisher currently).


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Jean-Marc Aury ◽  
Benjamin Istace

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.


2015 ◽  
Author(s):  
Ivan Sovic ◽  
Mile Sikic ◽  
Andreas Wilm ◽  
Shannon Nicole Fenlon ◽  
Swaine Chen ◽  
...  

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


2020 ◽  
Author(s):  
Yuxuan Yuan ◽  
Philipp E. Bayer ◽  
Robyn Anderson ◽  
HueyTyng Lee ◽  
Chon-Kit Kenneth Chan ◽  
...  

AbstractRecent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads which can span repetitive regions. However, overlap based assembly methods routinely used for this data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250 CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA


Author(s):  
Arang Rhie ◽  
Brian P. Walenz ◽  
Sergey Koren ◽  
Adam M. Phillippy

AbstractRecent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.Availability of data and materialProject name: MerquryProject home page: https://github.com/marbl/merqury, https://github.com/marbl/merylArchived version: https://github.com/marbl/merqury/releases/tag/v1.0Operating system(s): Platform independentProgramming language: C++, Java, PerlOther requirements: gcc 4.8 or higher, java 1.6 or higherLicense: Public domain (see https://github.com/marbl/merqury/blob/master/README.license) Any restrictions to use by non-academics: No restrictions applied


2019 ◽  
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Ryan Lorig-Roach ◽  
Marina Haukness ◽  
Hugh E. Olsen ◽  
...  

AbstractPresent workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.


2017 ◽  
Author(s):  
Manuel Tardaguila ◽  
Lorena de la Fuente ◽  
Cristina Marti ◽  
Cécile Pereira ◽  
Francisco Jose Pardo-Palacios ◽  
...  

ABSTRACTHigh-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes 47 descriptors that can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of and characterizing the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, result more frequently in novel ORFs than novel UTRs and are enriched in both general metabolic and neural specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases we find that alternative isoforms are elusive to proteogenomics detection and are variable in protein changes with respect to the principal isoform of their genes. SQANTI allows the user to maximize the analytical outcome of long read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.


Sign in / Sign up

Export Citation Format

Share Document