TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Alla Mikheenko; Andrey V Bzikadze; Alexey Gurevich; Karen H Miga; Pavel A Pevzner

doi:10.1093/bioinformatics/btaa440

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract. Motivation: Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results: To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation: https://github.com/ablab/TandemTools. Supplementary information: Supplementary data are available at Bioinformatics online.

TandemMapper and TandemQUAST: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

10.1101/2019.12.23.887158 ◽

2019 ◽

Cited By ~ 3

Author(s):

Alla Mikheenko ◽

Andrey V. Bzikadze ◽

Alexey Gurevich ◽

Karen H. Miga ◽

Pavel A. Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Open Problem ◽

Tandem Repeats ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Standard Tool ◽

Long Read ◽

Eukaryotic Genomes

Abstract. Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there is no standard tool for their quality assessment. Moreover, since the mapping of long error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. To address these problems, we developed the tandemMapper tool for mapping reads to ETRs and the tandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that tandemQUAST not only reveals errors in and evaluates ETR assemblies, but also improves them. To illustrate how tandemMapper and tandemQUAST work, we apply them to recently generated assemblies of human centromeres.

yacrd and fpa: upstream tools for long-read genome assembly

10.1101/674036 ◽

2019 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Link Type ◽

Long Read

Abstract. Motivation: Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results: We introduce two tools, yacrd and fpa, which respectively perform chimera removal and read scrubbing, and filter out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability: https://github.com/natir/yacrd and https://github.com/natir/[email protected] Supplementary information: Supplementary data are available online.

yacrd and fpa: upstream tools for long-read genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btaa262 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3894-3896 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Long Read

Abstract. Motivation: Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results: We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability and implementation: https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary information: Supplementary data are available at Bioinformatics online.
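
To make the idea of filtering read overlaps concrete, the following is a minimal sketch, not fpa's implementation or command-line interface. It assumes overlaps are stored as tab-separated PAF records (query name, length, start, end, strand, target name, length, start, end, followed by match statistics). The function name `filter_overlaps` and the parameters `min_length` and `drop_self` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator


def filter_overlaps(paf_lines: Iterable[str],
                    min_length: int = 2000,
                    drop_self: bool = True) -> Iterator[str]:
    """Yield only the PAF records that pass two simple filters.

    Hypothetical illustration: the thresholds and rules are assumptions,
    not the behaviour of any published tool.
    """
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip incomplete records
        query, target = fields[0], fields[5]
        query_span = int(fields[3]) - int(fields[2])
        target_span = int(fields[8]) - int(fields[7])
        if drop_self and query == target:
            continue  # a read overlapping itself carries no assembly signal
        if min(query_span, target_span) < min_length:
            continue  # short overlaps are more likely to be spurious
        yield line


records = [
    "r1\t9000\t0\t5000\t+\tr2\t8000\t3000\t8000\t4500\t5000\t60",
    "r1\t9000\t0\t500\t+\tr3\t7000\t6500\t7000\t450\t500\t60",
    "r2\t8000\t0\t8000\t+\tr2\t8000\t0\t8000\t8000\t8000\t60",
]
kept = list(filter_overlaps(records))
print(len(kept))
```

Streaming records one at a time, as the generator does here, keeps memory use flat regardless of how large the overlap file is.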

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

10.1101/2021.05.18.444610 ◽

2021 ◽

Author(s):

Brandon K. B. Seah ◽

Estienne C. Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Neighboring Element ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Element Elimination

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.

Nanopype: a modular and scalable nanopore data processing pipeline

Bioinformatics ◽

10.1093/bioinformatics/btz461 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4770-4772

Author(s):

Pay Giesselmann ◽

Sara Hetzel ◽

Franz-Josef Müller ◽

Alexander Meissner ◽

Helene Kretzmer

Keyword(s):

Data Processing ◽

Supplementary Information ◽

Nanopore Sequencing ◽

Third Generation ◽

Supplementary Data ◽

Seamless Integration ◽

Short Read ◽

Processing Pipeline ◽

Bioinformatics Software ◽

Long Read

Abstract. Summary: Long-read third-generation nanopore sequencing enables researchers to now address a range of questions that are difficult to tackle with short read approaches. The rapidly expanding user base and continuously increasing throughput have sparked the development of a growing number of specialized analysis tools. However, streamlined processing of nanopore datasets using reproducible and transparent workflows is still lacking. Here we present Nanopype, a nanopore data processing pipeline that integrates a diverse set of established bioinformatics software while maintaining consistent and standardized output formats. Seamless integration into compute cluster environments makes the framework suitable for high-throughput applications. As a result, Nanopype facilitates comparability of nanopore data analysis workflows and thereby should enhance the reproducibility of biological insights. Availability and implementation: https://github.com/giesselmann/nanopype, https://nanopype.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics online.

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz349 ◽

2019 ◽

Vol 35 (14) ◽

pp. i61-i70 ◽

Cited By ~ 4

Author(s):

Ivan Tolstoganov ◽

Anton Bankevich ◽

Zhoutao Chen ◽

Pavel A Pevzner

Keyword(s):

Narrow Range ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Hybrid Assembly ◽

De Bruijn Graphs ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

New Applications

Abstract. Motivation: The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. Results: We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. Availability and implementation: Source code and installation manual for cloudSPAdes are available at https://github.com/ablab/spades/releases/tag/cloudspades-paper. Supplementary information: Supplementary data are available at Bioinformatics online.
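
For readers unfamiliar with the data structure named in this abstract, here is a minimal sketch of a textbook de Bruijn graph built from reads. It is not the cloudSPAdes algorithm and ignores barcodes, reverse complements, and sequencing errors. The function name `de_bruijn` is chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer found in a read adds an edge from its prefix of length
    k-1 to its suffix of length k-1.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = de_bruijn(["ACGTAC", "GTACG"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Reads shorter than k contribute no edges, because the inner loop range is empty for them.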

Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz484 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4809-4811 ◽

Cited By ~ 8

Author(s):

Robert S Harris ◽

Monika Cechova ◽

Kateryna D Makova

Keyword(s):

Tandem Repeats ◽

Error Rates ◽

Superior Performance ◽

Supplementary Information ◽

Whole Genome Sequencing Data ◽

Dna Repeats ◽

Sequencing Data ◽

Heat Shock Stress ◽

Noise Cancelling ◽

Long Read

Abstract. Summary: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response. Availability and implementation: NCRF is implemented in C, supported by several Python scripts, and is available in Bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder. Supplementary information: Supplementary data are available at Bioinformatics online.

FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

Bioinformatics ◽

10.1093/bioinformatics/btz206 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3953-3960 ◽

Cited By ~ 3

Author(s):

Ergude Bao ◽

Fei Xie ◽

Changjin Song ◽

Dandan Song

Keyword(s):

High Throughput ◽

The Self ◽

Supplementary Information ◽

Third Generation ◽

Performance Tests ◽

Sequencing Errors ◽

The Third ◽

Fast Speed ◽

Long Reads ◽

Long Read

Abstract. Motivation: The third generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For the projects with long reads only, it is challenging to make correction with fast speed, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017). Results: Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT’s fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0–50.6% larger throughput than MECAT. FLAS is 2–13× faster compared to the self-correction algorithms other than MECAT, and its throughput is also 9.8–281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1–29.8% larger N50 sizes than MECAT. Availability and implementation: The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas. Supplementary information: Supplementary data are available at Bioinformatics online.
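
The N50 statistic used in this abstract to compare assemblies can be computed as sketched below. This follows the common definition (the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length). The function name `n50` and the toy contig lengths are illustrative and are not taken from the paper.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths summing to 300
print(n50(contigs))  # prints 70, since 80 + 70 = 150 reaches half of 300
```

Because N50 depends only on the length distribution, a higher value indicates more contiguous contigs but says nothing about their correctness.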

Structural variant analysis for linked-read sequencing data with gemtools

Bioinformatics ◽

10.1093/bioinformatics/btz239 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4397-4399 ◽

Cited By ~ 2

Author(s):

S U Greer ◽

H P Ji

Keyword(s):

Supplementary Information ◽

Supplementary Data ◽

Structural Variants ◽

Sequencing Data ◽

Structural Variant ◽

Single Dna Molecules ◽

Long Reads ◽

Depth Analysis ◽

Basic Functions ◽

Variant Analysis

Abstract. Summary: Linked-read sequencing generates synthetic long reads which are useful for the detection and analysis of structural variants (SVs). The software associated with 10× Genomics linked-read sequencing, Long Ranger, generates the essential output files (BAM, VCF, SV BEDPE) necessary for downstream analyses. However, performing downstream analyses requires users to customize their own tools to handle the unique features of linked-read sequencing data. Here, we describe gemtools, a collection of tools for the downstream and in-depth analysis of SVs from linked-read data. Gemtools uses the barcoded aligned reads and the megabase-scale phase blocks to determine haplotypes of SV breakpoints and delineate complex breakpoint configurations at the resolution of single DNA molecules. The gemtools package is a suite of tools that provides the user with the flexibility to perform basic functions on their linked-read sequencing output in order to address even more questions. Availability and implementation: The gemtools package is freely available for download at: https://github.com/sgreer77/gemtools. Supplementary information: Supplementary data are available at Bioinformatics online.

Tandem repeat interval pattern identifies animal taxa

Bioinformatics ◽

10.1093/bioinformatics/btab124 ◽

2021 ◽

Author(s):

Balaram Bhattacharyya ◽

Uddalak Mitra ◽

Ramkishore Bhattacharyya

Keyword(s):

Information Content ◽

Tandem Repeat ◽

Tandem Repeats ◽

Ordered Set ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

Genome Sequences ◽

Significant Achievement

Abstract. Motivation: We discover that the maximality of information content among intervals of tandem repeats (TRs) in animal genomes segregates over taxa, such that taxa identification becomes swift and accurate. Successive TRs of a motif occur at intervals over the sequence, forming a trail of TRs of the motif across the genome. We present a method, Tandem Repeat Information Mining (TRIM), that mines 4^k TR trails, one for each motif of length k, from a whole genome sequence and extracts the information content within intervals of the trails. The TRIM vector formed from the ordered set of interval entropies becomes instrumental for genome segregation. Results: Reconstruction of the correct phylogeny for animals from whole genome sequences demonstrates the precision of TRIM. Identification of animal taxa by the TRIM vector upon feature selection is the most significant achievement. These results suggest that the Tandem Repeat Interval Pattern (TRIP) is a taxa-specific constitutional characteristic of animal genomes. Availability and implementation: Source and executable code of TRIM along with a usage manual are available at https://github.com/BB-BiG/TRIM. Supplementary information: Supplementary data are available at Bioinformatics online.
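
The abstract's pipeline (one TR trail per motif of length k, then an entropy per trail) can be illustrated with the rough sketch below. It is an interpretation under stated assumptions, not the TRIM implementation: a tandem repeat is taken to be at least two consecutive exact copies of a motif, an interval is the distance between the start positions of successive repeats, and the information content is the Shannon entropy of the interval lengths. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import product
from typing import List


def tr_starts(seq: str, motif: str, min_copies: int = 2) -> List[int]:
    """Start positions of exact tandem repeats of `motif` (assumed definition)."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [match.start() for match in pattern.finditer(seq)]


def interval_entropy(starts: List[int]) -> float:
    """Shannon entropy (bits) of the gaps between successive repeat starts."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if not gaps:
        return 0.0
    total = len(gaps)
    counts = Counter(gaps)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def trim_like_vector(seq: str, k: int) -> List[float]:
    """One entropy per motif of length k, in lexicographic motif order.

    There are 4**k such motifs over the alphabet A, C, G, T.
    """
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    return [interval_entropy(tr_starts(seq, motif)) for motif in motifs]


sequence = "ACACGGACACTTTTACAC"  # toy sequence, not real genome data
vector = trim_like_vector(sequence, k=2)
print(len(vector))
```

The vector length grows as 4^k, so practical values of k are small; how the published method selects k and performs feature selection is not specified in the abstract.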
