yacrd and fpa: upstream tools for long-read genome assembly

Pierre Marijon; Rayan Chikhi; Jean-Stéphane Varré

doi:10.1093/bioinformatics/btaa262

yacrd and fpa: upstream tools for long-read genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btaa262 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3894-3896 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability and implementation https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary information Supplementary data are available at Bioinformatics online.


yacrd and fpa: upstream tools for long-read genome assembly

10.1101/674036 ◽

2019 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Link Type ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools, yacrd and fpa, which respectively perform chimera removal and read scrubbing, and filter out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability https://github.com/natir/yacrd and https://github.com/natir/[email protected] Supplementary information Supplementary data are available online.


TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation https://github.com/ablab/TandemTools. Supplementary information Supplementary data are available at Bioinformatics online.


PDR: a new genome assembly evaluation metric based on genetics concerns

Bioinformatics ◽

10.1093/bioinformatics/btaa704 ◽

2020 ◽

Author(s):

Luyu Xie ◽

Limsoon Wong

Keyword(s):

Genome Assembly ◽

Pairwise Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Genetic Studies ◽

A Genome ◽

Assembly Evaluation ◽

Evaluation Metric

Abstract Motivation Existing genome assembly evaluation metrics provide only limited insight on specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose, here, a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. Results Our results on publicly available datasets affirm PDR’s ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicated the error introduced by approximation is extremely small and thus negligible. Availability and implementation https://github.com/XLuyu/PDR. Supplementary information Supplementary data are available at Bioinformatics online.


LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

DNA Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation Linked-Reads technologies combine both the high-quality and low cost of short-reads sequencing and long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performances. Availability and implementation LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances.


Nanopype: a modular and scalable nanopore data processing pipeline

Bioinformatics ◽

10.1093/bioinformatics/btz461 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4770-4772

Author(s):

Pay Giesselmann ◽

Sara Hetzel ◽

Franz-Josef Müller ◽

Alexander Meissner ◽

Helene Kretzmer

Keyword(s):

Data Processing ◽

Supplementary Information ◽

Nanopore Sequencing ◽

Third Generation ◽

Supplementary Data ◽

Seamless Integration ◽

Short Read ◽

Processing Pipeline ◽

Bioinformatics Software ◽

Long Read

Abstract Summary Long-read third-generation nanopore sequencing enables researchers to now address a range of questions that are difficult to tackle with short read approaches. The rapidly expanding user base and continuously increasing throughput have sparked the development of a growing number of specialized analysis tools. However, streamlined processing of nanopore datasets using reproducible and transparent workflows is still lacking. Here we present Nanopype, a nanopore data processing pipeline that integrates a diverse set of established bioinformatics software while maintaining consistent and standardized output formats. Seamless integration into compute cluster environments makes the framework suitable for high-throughput applications. As a result, Nanopype facilitates comparability of nanopore data analysis workflows and thereby should enhance the reproducibility of biological insights. Availability and implementation https://github.com/giesselmann/nanopype, https://nanopype.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.


ARBitR: an overlap-aware genome assembly scaffolder for linked reads

Bioinformatics ◽

10.1093/bioinformatics/btaa975 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Genomic Sequencing ◽

Supplementary Data ◽

Genome Assemblies ◽

General Public License

Abstract Summary Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken. Availability and implementation ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3. Supplementary information Supplementary data are available at Bioinformatics online.


Assembly Graph Browser: interactive visualization of assembly graphs

Bioinformatics ◽

10.1093/bioinformatics/btz072 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3476-3478 ◽

Cited By ~ 2

Author(s):

Alla Mikheenko ◽

Mikhail Kolmogorov

Keyword(s):

Genome Assembly ◽

Open Problem ◽

Interactive Visualization ◽

Supplementary Information ◽

Supplementary Data ◽

New Approach ◽

Repeat Analysis

Abstract Summary Currently, most genome assembly projects focus on contigs and scaffolds rather than assembly graphs that provide a more comprehensive representation of an assembly. Since interactive visualization of large assembly graphs remains an open problem, we developed an Assembly Graph Browser (AGB) tool that visualizes large assembly graphs, extending the functionality of previously developed visualization approaches. Assembly Graph Browser includes a number of novel functions including repeat analysis, construction of the contracted assembly graphs (i.e. the graphs obtained by collapsing a selected set of edges) and a new approach to visualizing large assembly graphs. Availability and implementation http://www.github.com/almiheenko/AGB. Supplementary information Supplementary data are available at Bioinformatics online.


MsPAC: a tool for haplotype-phased structural variant detection

Bioinformatics ◽

10.1093/bioinformatics/btz618 ◽

2019 ◽

Vol 36 (3) ◽

pp. 922-924 ◽

Cited By ~ 3

Author(s):

Oscar L Rodriguez ◽

Anna Ritz ◽

Andrew J Sharp ◽

Ali Bashir

Keyword(s):

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Structural Variant ◽

Long Read ◽

One Step ◽

Variant Detection ◽

Next Generation Sequencing NGS ◽

Generation Sequencing

Abstract Summary While next-generation sequencing (NGS) has dramatically increased the availability of genomic data, phased genome assembly and structural variant (SV) analyses are limited by NGS read lengths. Long-read sequencing from Pacific Biosciences and NGS barcoding from 10x Genomics hold the potential for far more comprehensive views of individual genomes. Here, we present MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software) and convert assemblies into high-quality, phased SV predictions. MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved, diploid genomes. Availability and implementation https://github.com/oscarlr/MsPAC. Supplementary information Supplementary data are available at Bioinformatics online.


SiLiCO: A Simulator of Long Read Sequencing in PacBio and Oxford Nanopore

10.1101/076901 ◽

2016 ◽

Cited By ~ 2

Author(s):

Ethan Alexander García Baker ◽

Sara Goodwin ◽

W. Richard McCombie ◽

Olivia Mendivil Ramos

Keyword(s):

Reference Data ◽

Supplementary Information ◽

Data Sets ◽

Simulation Tool ◽

Supplementary Data ◽

Structural Variants ◽

Oxford Nanopore ◽

Long Read ◽

Sequencing Platforms ◽

Core Facilities

Abstract Summary Long read sequencing platforms, which include the widely used Pacific Biosciences (PacBio) platform and the emerging Oxford Nanopore platform, aim to produce sequence fragments in excess of 15-20 kilobases, and have proved advantageous in the identification of structural variants and easing genome assembly. However, long read sequencing remains relatively expensive and error prone, and failed sequencing runs represent a significant problem for genomics core facilities. To quantitatively assess the underlying mechanics of sequencing failure, it is essential to have highly reproducible and controllable reference data sets to which sequencing results can be compared. Here, we present SiLiCO, the first in silico simulation tool to generate standardized sequencing results from both of the leading long read sequencing platforms. Availability SiLiCO is an open source package written in Python. It is freely available at https://www.github.com/ethanagbaker/SiLiCO under the GNU GPL 3.0 license. Contact <emails> Supplementary information Supplementary data are available at Bioinformatics online.


AGOUTI: improving genome assembly and annotation using transcriptome data

10.1101/033019 ◽

2015 ◽

Cited By ~ 1

Author(s):

Simo V. Zhang ◽

Luting Zhuo ◽

Matthew W. Hahn

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

Supplementary Information ◽

Gene Identification ◽

Supplementary Data ◽

Transcriptome Data ◽

RNA Seq ◽

Separate Gene ◽

Gene Models ◽

Genome Assemblies

Abstract Summary Current genome assemblies consist of thousands of contigs. These incomplete and fragmented assemblies lead to errors in gene identification, such that single genes spread across multiple contigs are annotated as separate gene models. We present AGOUTI (Annotated Genome Optimization Using Transcriptome Information), a tool that uses RNA-seq data to simultaneously combine contigs into scaffolds and fragmented gene models into single models. We show that AGOUTI improves both the contiguity of genome assemblies and the accuracy of gene annotation, providing updated versions of each as output. Availability The software is implemented in Python and is available from github.com/svm-zhang/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
