PDR: a new genome assembly evaluation metric based on genetics concerns

Bioinformatics ◽

10.1093/bioinformatics/btaa704 ◽

2020 ◽

Author(s):

Luyu Xie ◽

Limsoon Wong

Keyword(s):

Genome Assembly ◽

Pairwise Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Genetic Studies ◽

A Genome ◽

Assembly Evaluation ◽

Evaluation Metric

Abstract Motivation Existing genome assembly evaluation metrics provide only limited insight on specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose, here, a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. Results Our results on publicly available datasets affirm PDR’s ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicated the error introduced by approximation is extremely small and thus negligible. Availability and implementation https://github.com/XLuyu/PDR. Supplementary information Supplementary data are available at Bioinformatics online.


WGA-LP: a pipeline for Whole Genome Assembly of contaminated reads

10.1101/2021.07.31.454518 ◽

2021 ◽

Author(s):

Nicolò Rossi ◽

Andrea Colautti ◽

Lucilla Iacumin ◽

Carla Piazza

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Whole Genome ◽

Art Programs ◽

Final Assembly ◽

Common Task ◽

A Genome ◽

Golden Standard ◽

Web App

Summary: Whole Genome Assembly (WGA) of bacterial genomes with short reads is a quite common task as DNA sequencing has become cheaper with the advances of its technology. The process of assembling a genome has no absolute golden standard (Del Angel et al., 2018) and requires performing a sequence of steps, each of which can involve combinations of many different tools. However, the quality of the final assembly is always strongly related to the quality of the input data. With this in mind we built WGA-LP, a package that connects state-of-the-art programs and novel scripts to check and improve the quality of both samples and resulting assemblies. WGA-LP, with its conservative decontamination approach, has been shown to be capable of creating high-quality assemblies even in the case of contaminated reads. Availability and Implementation: WGA-LP is available on GitHub (https://github.com/redsnic/WGA-LP) and Docker Hub (https://hub.docker.com/r/redsnic/wgalp). The web app for node visualization is hosted by shinyapps.io (https://redsnic.shinyapps.io/ContigCoverageVisualizer/). Contact: Nicolò Rossi, [email protected] Supplementary information: Supplementary data are available at bioRxiv online.


yacrd and fpa: upstream tools for long-read genome assembly

10.1101/674036 ◽

2019 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Link Type ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools, yacrd and fpa, which respectively perform chimera removal and read scrubbing, and filter out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability https://github.com/natir/yacrd and https://github.com/natir/[email protected] Supplementary information Supplementary data are available online.


yacrd and fpa: upstream tools for long-read genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btaa262 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3894-3896 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Long Read

Abstract Motivation Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability and implementation https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary information Supplementary data are available at Bioinformatics online.


Significantly improving the quality of genome assemblies through curation

GigaScience ◽

10.1093/gigascience/giaa153 ◽

2021 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Kerstin Howe ◽

William Chow ◽

Joanna Collins ◽

Sarah Pelan ◽

Damon-Lee Pointon ◽

...

Keyword(s):

Genome Assembly ◽

Data Generation ◽

Research Projects ◽

Automated Assembly ◽

Assembly Quality ◽

Assembly Strategy ◽

Assembly Evaluation ◽

Assembly Algorithms ◽

Genome Assemblies

Abstract Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation are actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.


Quality of prokaryote genome assembly: Indispensable issues of factors affecting prokaryote genome assembly quality

Gene ◽

10.1016/j.gene.2012.06.016 ◽

2012 ◽

Vol 505 (2) ◽

pp. 365-367 ◽

Cited By ~ 8

Author(s):

Adriana R. Carneiro ◽

Rommel Thiago Jucá Ramos ◽

Hivana Patricia Melo Barbosa ◽

Maria Paula C. Schneider ◽

Debmalya Barh ◽

...

Keyword(s):

Genome Assembly ◽

Assembly Quality ◽

Factors Affecting


CyTOFmerge: integrating mass cytometry data across multiple panels

Bioinformatics ◽

10.1093/bioinformatics/btz180 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4063-4071 ◽

Cited By ~ 3

Author(s):

Tamim Abdelaal ◽

Thomas Höllt ◽

Vincent van Unen ◽

Boudewijn P F Lelieveldt ◽

Frits Koning ◽

...

Keyword(s):

Single Cell ◽

Biological Sample ◽

Supplementary Information ◽

High Dimensional ◽

Single Cell Level ◽

Supplementary Data ◽

Mass Cytometry ◽

Cell Level ◽

Cellular Markers

Abstract Motivation High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection. Availability and implementation Implementation is available on GitHub (https://github.com/tabdelaal/CyTOFmerge). Supplementary information Supplementary data are available at Bioinformatics online.


TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation https://github.com/ablab/TandemTools. Supplementary information Supplementary data are available at Bioinformatics online.


LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation Linked-Reads technologies combine both the high quality and low cost of short-read sequencing and long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances online.
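
As a rough illustration of what "computing numbers of common barcodes between genomic regions" could mean, the sketch below counts the distinct barcodes shared by two regions using a set intersection. It is not LRez's API: the function name `common_barcodes`, the toy barcode strings, and the assumption that each region is represented as a plain list of barcode strings are all hypothetical.

```python
def common_barcodes(region_a, region_b):
    """Count distinct barcodes seen in both regions.

    Each argument is an iterable of barcode strings, one per read
    or alignment overlapping the region (an assumed representation).
    """
    # Sets drop repeated barcodes, so each barcode is counted once.
    return len(set(region_a) & set(region_b))


# Toy barcodes, far shorter than real ones, for illustration only.
barcodes_region_1 = ["AAC", "GGT", "AAC"]
barcodes_region_2 = ["GGT", "TTA", "CCA"]
shared = common_barcodes(barcodes_region_1, barcodes_region_2)
```

A high count of shared barcodes between two regions suggests that reads from both originate from the same long DNA molecules, which is the long-range signal the abstract refers to.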


Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
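
The figure of 120 is consistent with merging five assemblies in every possible order, since 5! = 120, assuming each permutation corresponds to a distinct merge order. A minimal Python sketch of that count, using placeholder assembly labels that do not come from the paper:

```python
from itertools import permutations

# Placeholder labels; the abstract does not name the five assemblies.
assemblies = ["A1", "A2", "A3", "A4", "A5"]

# Every ordering in which the five assemblies could be merged.
merge_orders = list(permutations(assemblies))

print(len(merge_orders))
```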


CNValidator: validating somatic copy-number inference

Bioinformatics ◽

10.1093/bioinformatics/bty1022 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2660-2662

Author(s):

Lucian P Smith ◽

Jon A Yamato ◽

Mary K Kuhner

Keyword(s):

Data Quality ◽

Copy Number ◽

Supplementary Information ◽

Tuning Parameter ◽

Supplementary Data ◽

Calling Algorithm ◽

Quality Issues ◽

Multiple Samples ◽

Parameter Values

Abstract Motivation CNValidator assesses the quality of somatic copy-number calls based on coherency of haplotypes across multiple samples from the same individual. It is applicable to any copy-number calling algorithm that makes calls independently for each sample. This test is useful in assessing the accuracy of copy-number calls, as well as choosing among alternative copy-number algorithms or tuning parameter values. Results On a dataset of somatic samples from individuals with Barrett’s Esophagus, CNValidator provided feedback on the correctness of sample ploidy calls and also detected data quality issues. Availability and implementation CNValidator is available on GitHub at https://github.com/kuhnerlab/CNValidator. Supplementary information Supplementary data are available at Bioinformatics online.
