yacrd and fpa: upstream tools for long-read genome assembly

Mapping Intimacies ◽

10.1101/674036 ◽

2019 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Link Type ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools, yacrd and fpa, which respectively perform chimera removal and read scrubbing, and filter out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability https://github.com/natir/yacrd and https://github.com/natir/[email protected] Supplementary information Supplementary data are available online.

yacrd and fpa: upstream tools for long-read genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btaa262 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3894-3896 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability and implementation https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary information Supplementary data are available at Bioinformatics online.
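To illustrate what filtering read overlaps can look like in practice, here is a minimal Python sketch that works on overlap records in PAF format (the tab-separated format written by overlappers such as minimap2). The function name `filter_paf`, the 2000-base threshold, and the two criteria (drop self-overlaps, drop short overlaps) are assumptions chosen for illustration; the abstract does not state which criteria fpa applies, and this is not fpa's implementation.

```python
def filter_paf(lines, min_len=2000):
    """Yield PAF lines that are not self-overlaps and span >= min_len on both reads.

    Only standard PAF columns are used: query name (1), query start/end (3, 4),
    target name (6), target start/end (8, 9). The criteria are illustrative.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        qname, qstart, qend = fields[0], int(fields[2]), int(fields[3])
        tname, tstart, tend = fields[5], int(fields[7]), int(fields[8])
        if qname == tname:
            continue  # a read overlapping itself carries no assembly information
        if min(qend - qstart, tend - tstart) < min_len:
            continue  # overlap too short to be trusted under this (assumed) threshold
        yield line


# Three made-up overlap records: one plausible, one self-overlap, one short.
paf = [
    "\t".join(["r1", "10000", "0", "5000", "+", "r2", "12000", "7000", "12000", "4500", "5000", "60"]),
    "\t".join(["r1", "10000", "0", "10000", "+", "r1", "10000", "0", "10000", "10000", "10000", "60"]),
    "\t".join(["r3", "8000", "100", "600", "-", "r4", "9000", "200", "700", "400", "500", "60"]),
]
kept = list(filter_paf(paf))
```

Because the function is a generator, records can be streamed from the overlapper and discarded as they arrive, which is one way to avoid writing every overlap to disk.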

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation https://github.com/ablab/TandemTools. Supplementary information Supplementary data are available at Bioinformatics online.

PDR: a new genome assembly evaluation metric based on genetics concerns

Bioinformatics ◽

10.1093/bioinformatics/btaa704 ◽

2020 ◽

Author(s):

Luyu Xie ◽

Limsoon Wong

Keyword(s):

Genome Assembly ◽

Pairwise Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Genetic Studies ◽

A Genome ◽

Assembly Evaluation ◽

Evaluation Metric

Abstract Motivation Existing genome assembly evaluation metrics provide only limited insight into specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. Results Our results on publicly available datasets affirm PDR’s ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicate that the error introduced by approximation is extremely small and thus negligible. Availability and implementation https://github.com/XLuyu/PDR. Supplementary information Supplementary data are available at Bioinformatics online.

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

10.1101/2020.04.17.035287 ◽

2020 ◽

Author(s):

Yuxuan Yuan ◽

Philipp E. Bayer ◽

Robyn Anderson ◽

HueyTyng Lee ◽

Chon-Kit Kenneth Chan ◽

...

Keyword(s):

Genome Assembly ◽

Chinese Spring ◽

Complete Genome ◽

Reference Genome ◽

Computing Time ◽

Link Type ◽

Recent Advances ◽

Long Read ◽

Genome Assemblies

Abstract Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads which can span repetitive regions. However, overlap-based assembly methods routinely used for these data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250-CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA
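The binning idea described in this abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not RefKA's implementation: the function names, the fixed bin size, the choice of k, and the "most shared unique k-mers wins" rule are all hypothetical, and the sketch ignores reverse complements, sequencing errors, and k-mers that span bin boundaries.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_bin(reference, bin_size, k):
    """Split the reference into fixed-size bins and keep k-mers found in exactly one bin."""
    bins = [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]
    bin_kmers = [kmers(b, k) for b in bins]
    counts = Counter(km for ks in bin_kmers for km in ks)
    return [{km for km in ks if counts[km] == 1} for ks in bin_kmers]


def assign_reads(reads, unique, k):
    """Assign each read to the bin sharing the most unique k-mers (None if no hit)."""
    assignment = defaultdict(list)
    for name, seq in reads.items():
        read_kmers = kmers(seq, k)
        hits = [len(read_kmers & u) for u in unique]
        best = max(range(len(unique)), key=hits.__getitem__)
        assignment[best if hits[best] > 0 else None].append(name)
    return dict(assignment)


# Made-up reference and two error-free "reads" cut from it.
reference = "ACGTACGGTTCAGGCATTAGCCGATAGGCTTAACG"
unique = unique_kmers_per_bin(reference, bin_size=12, k=5)
reads = {"r1": reference[2:11], "r2": reference[25:35]}
bins = assign_reads(reads, unique, k=5)
```

Once reads are grouped this way, each group can be assembled independently, which is what makes the per-bin step parallelisable. The reported figures are also mutually consistent: 168 wall hours on 250 CPUs is 168 × 250 = 42,000 CPU hours.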

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances online.
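Two of the operations named in this abstract, indexing alignments by barcode and counting barcodes shared between two genomic regions, can be sketched as follows. This is a plain-Python illustration and not the LRez API: the record layout (dicts with `chrom`, `pos`, and `barcode` keys), the function names, and the half-open region convention are assumptions made for the example, whereas LRez itself works on BAM and FASTQ files.

```python
from collections import defaultdict


def index_by_barcode(alignments):
    """Build a barcode -> list-of-alignments index for fast per-barcode lookup."""
    index = defaultdict(list)
    for aln in alignments:
        index[aln["barcode"]].append(aln)
    return index


def barcodes_in_region(alignments, chrom, start, end):
    """Collect barcodes of alignments starting in the half-open region [start, end)."""
    return {
        a["barcode"]
        for a in alignments
        if a["chrom"] == chrom and start <= a["pos"] < end
    }


def common_barcodes(alignments, region_a, region_b):
    """Count barcodes observed in both regions."""
    shared = barcodes_in_region(alignments, *region_a) & barcodes_in_region(alignments, *region_b)
    return len(shared)


# Made-up alignment records for illustration.
alignments = [
    {"chrom": "chr1", "pos": 100, "barcode": "AAAC"},
    {"chrom": "chr1", "pos": 150, "barcode": "AAAG"},
    {"chrom": "chr1", "pos": 5000, "barcode": "AAAC"},
    {"chrom": "chr1", "pos": 5100, "barcode": "AACT"},
    {"chrom": "chr2", "pos": 100, "barcode": "AAAG"},
]
index = index_by_barcode(alignments)
n_shared = common_barcodes(alignments, ("chr1", 0, 1000), ("chr1", 4000, 6000))
```

A high count of shared barcodes between two distant regions suggests that reads from both came from the same long DNA molecules, which is the long-range signal that scaffolding and structural variant calling rely on.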

Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽

2017 ◽

Cited By ~ 6

Author(s):

Robert J. Vickerstaff ◽

Richard J. Harrison

Keyword(s):

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Mapping Software ◽

Outcrossing Species ◽

Supplementary Material ◽

Novel Approaches ◽

Similar Accuracy ◽

General Public License

Abstract Summary Crosslink is genetic mapping software for outcrossing species designed to run efficiently on large datasets by combining the best from existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining a similar accuracy. Availability and implementation Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.

pyseer: a comprehensive tool for microbial pangenome-wide association studies

10.1101/266312 ◽

2018 ◽

Cited By ~ 1

Author(s):

John A Lees ◽

Marco Galardini ◽

Stephen D Bentley ◽

Jeffrey N Weiser ◽

Jukka Corander

Keyword(s):

Input Data ◽

Association Studies ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Supplementary Data ◽

New Methods ◽

Link Type ◽

Genome Wide

Abstract Summary Genome-wide association studies (GWAS) in microbes face different challenges from those in eukaryotes and have been addressed by a number of different methods. pyseer brings these techniques together in one package tailored to microbial GWAS, allows greater flexibility of the input data used, and adds new methods to interpret the association results. Availability and Implementation pyseer is written in Python and is freely available at https://github.com/mgalardini/pyseer, or can be installed through pip. Documentation and a tutorial are available at http://[email protected] and [email protected] Supplementary information Supplementary data are available online.

PhyloFold: Precise and Swift Prediction of RNA Secondary Structures to Incorporate Phylogeny among Homologs

10.1101/2020.03.05.975797 ◽

2020 ◽

Author(s):

Masaki Tagashira

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Prediction Accuracy ◽

Structural Alignment ◽

Source Code ◽

Secondary Structures ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Structural Alignments

Abstract Motivation The simultaneous consideration of sequence alignment and RNA secondary structure, or structural alignment, is known to help predict more accurate secondary structures of homologs. However, the consideration is heavy and can be done only roughly to decompose structural alignments. Results The PhyloFold method, which predicts secondary structures of homologs considering likely pairwise structural alignments, was developed in this study. The method shows the best prediction accuracy while demanding comparable running time compared to conventional methods. Availability The source code of the programs implemented in this study is available at “https://github.com/heartsh/phylofold” and “https://github.com/heartsh/phyloalifold”. Contact “[email protected]”. Supplementary information Supplementary data are available.

Nanopype: a modular and scalable nanopore data processing pipeline

Bioinformatics ◽

10.1093/bioinformatics/btz461 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4770-4772

Author(s):

Pay Giesselmann ◽

Sara Hetzel ◽

Franz-Josef Müller ◽

Alexander Meissner ◽

Helene Kretzmer

Keyword(s):

Data Processing ◽

Supplementary Information ◽

Nanopore Sequencing ◽

Third Generation ◽

Supplementary Data ◽

Seamless Integration ◽

Short Read ◽

Processing Pipeline ◽

Bioinformatics Software ◽

Long Read

Abstract Summary Long-read third-generation nanopore sequencing enables researchers to now address a range of questions that are difficult to tackle with short read approaches. The rapidly expanding user base and continuously increasing throughput have sparked the development of a growing number of specialized analysis tools. However, streamlined processing of nanopore datasets using reproducible and transparent workflows is still lacking. Here we present Nanopype, a nanopore data processing pipeline that integrates a diverse set of established bioinformatics software while maintaining consistent and standardized output formats. Seamless integration into compute cluster environments makes the framework suitable for high-throughput applications. As a result, Nanopype facilitates comparability of nanopore data analysis workflows and thereby should enhance the reproducibility of biological insights. Availability and implementation https://github.com/giesselmann/nanopype, https://nanopype.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.

VCFShark: how to squeeze a VCF file

10.1101/2020.12.18.423437 ◽

2020 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Danek

Keyword(s):

Web Site ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Order Of Magnitude ◽

Better Than ◽

De Facto Standards

Abstract Summary VCF files containing the results of sequencing projects take a lot of space. We propose VCFShark, which squeezes them up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.
