Assembly Graph Browser: interactive visualization of assembly graphs

Bioinformatics ◽

10.1093/bioinformatics/btz072 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3476-3478 ◽

Cited By ~ 2

Author(s):

Alla Mikheenko ◽

Mikhail Kolmogorov

Keyword(s):

Genome Assembly ◽

Open Problem ◽

Interactive Visualization ◽

Supplementary Information ◽

Supplementary Data ◽

New Approach ◽

Repeat Analysis

Abstract. Summary: Currently, most genome assembly projects focus on contigs and scaffolds rather than assembly graphs that provide a more comprehensive representation of an assembly. Since interactive visualization of large assembly graphs remains an open problem, we developed an Assembly Graph Browser (AGB) tool that visualizes large assembly graphs, extending the functionality of previously developed visualization approaches. Assembly Graph Browser includes a number of novel functions including repeat analysis, construction of the contracted assembly graphs (i.e. the graphs obtained by collapsing a selected set of edges) and a new approach to visualizing large assembly graphs. Availability and implementation: http://www.github.com/almiheenko/AGB. Supplementary information: Supplementary data are available at Bioinformatics online.
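The abstract defines a contracted assembly graph as the graph obtained by collapsing a selected set of edges. The following is a minimal sketch of that operation, not AGB's actual implementation: the function name `contract_edges`, the edge-dictionary representation, and the use of a union-find structure are all assumptions made for illustration.

```python
def contract_edges(edges, to_collapse):
    """Collapse selected edges of a directed multigraph (illustrative sketch).

    edges: dict mapping edge id -> (source node, target node)
    to_collapse: set of edge ids whose endpoints should be merged
    Returns the surviving edges, with each endpoint replaced by the
    representative node of its merged group.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # The smaller label becomes the representative (assumes
            # node labels are mutually comparable).
            parent[max(ra, rb)] = min(ra, rb)

    for edge_id in to_collapse:
        u, v = edges[edge_id]
        union(u, v)

    return {
        edge_id: (find(u), find(v))
        for edge_id, (u, v) in edges.items()
        if edge_id not in to_collapse
    }


# Hypothetical four-edge cycle; collapsing e2 merges nodes 2 and 3.
example = {"e1": (1, 2), "e2": (2, 3), "e3": (3, 4), "e4": (4, 1)}
contracted = contract_edges(example, {"e2"})
```

Collapsed edges disappear from the result, and edges that were parallel or became self-loops after merging are kept, since a real tool may want to display or filter them separately.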

Overlap graph-based generation of haplotigs for diploids and polyploids

Bioinformatics ◽

10.1093/bioinformatics/btz255 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4281-4289 ◽

Cited By ~ 1

Author(s):

Jasmijn A Baaijens ◽

Alexander Schönhuth

Keyword(s):

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Supplementary Information ◽

Supplementary Data ◽

Specific Sequence ◽

New Approach ◽

Polyploid Genome

Abstract. Motivation: Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. Results: We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. Availability and implementation: POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++. Supplementary information: Supplementary data are available at Bioinformatics online.

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract. Motivation: Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results: We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information: Supplementary data are available at Bioinformatics Advances.
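One functionality named in the abstract is computing the number of barcodes shared between two genomic regions. The idea can be sketched as a set intersection, as below. This is not LRez's API: the function names, the `(chromosome, position, barcode)` tuple layout, and the half-open region convention are assumptions chosen for illustration, and a real tool would read barcodes from indexed BAM files rather than an in-memory list.

```python
def barcodes_in_region(alignments, chrom, start, end):
    """Return the set of barcodes of alignments starting in [start, end) on chrom."""
    return {
        barcode
        for (aln_chrom, pos, barcode) in alignments
        if aln_chrom == chrom and start <= pos < end
    }


def count_common_barcodes(alignments, region_a, region_b):
    """Count barcodes observed in both regions; each region is (chrom, start, end)."""
    shared = barcodes_in_region(alignments, *region_a) & barcodes_in_region(
        alignments, *region_b
    )
    return len(shared)


# Hypothetical alignments: (chromosome, start position, barcode).
alignments = [
    ("chr1", 100, "AAAC"),
    ("chr1", 150, "GGTT"),
    ("chr1", 5100, "AAAC"),
    ("chr1", 5200, "CCGA"),
    ("chr2", 120, "GGTT"),
]
n_shared = count_common_barcodes(
    alignments, ("chr1", 0, 1000), ("chr1", 5000, 6000)
)
```

A high shared-barcode count between two distant regions suggests that reads from both regions came from the same long DNA molecules, which is the signal that scaffolding and structural variant calling rely on.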

icHET: interactive visualization of cytoplasmic heteroplasmy

Bioinformatics ◽

10.1093/bioinformatics/btz300 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4411-4412 ◽

Cited By ~ 2

Author(s):

Vinhthuy Phan ◽

Diem-Trang Pham ◽

Caroline Melton ◽

Adam J Ramsey ◽

Bernie J Daigle ◽

...

Keyword(s):

Reference Genome ◽

Interactive Visualization ◽

Supplementary Information ◽

Supplementary Data ◽

Short Reads ◽

Genome Wide ◽

Computational Workflow ◽

Multiple Samples

Abstract. Summary: Although heteroplasmy has been studied extensively in animal systems, there is a lack of tools for analyzing, exploring and visualizing heteroplasmy at the genome-wide level in other taxonomic systems. We introduce icHET, which is a computational workflow that produces an interactive visualization that facilitates the exploration, analysis and discovery of heteroplasmy across multiple genomic samples. icHET works on short reads from multiple samples from any organism with an organellar reference genome (mitochondrial or plastid) and a nuclear reference genome. Availability and implementation: The software is available at https://github.com/vtphan/HeteroplasmyWorkflow. Supplementary information: Supplementary data are available at Bioinformatics online.

ARBitR: an overlap-aware genome assembly scaffolder for linked reads

Bioinformatics ◽

10.1093/bioinformatics/btaa975 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Genomic Sequencing ◽

Supplementary Data ◽

Genome Assemblies ◽

General Public License

Abstract. Summary: Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken. Availability and implementation: ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3. Supplementary information: Supplementary data are available at Bioinformatics online.
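The contrast drawn in the abstract, between joining contigs with a gap and joining them through an overlap, can be illustrated with a small sketch. This is not ARBitR's algorithm, which the abstract does not describe: the function `join_contigs`, the exact-match suffix-prefix test, and the `min_overlap` and `gap_size` parameters are all hypothetical simplifications.

```python
def join_contigs(left, right, min_overlap=3, gap_size=10):
    """Join two adjacent contigs, merging an exact suffix-prefix overlap if present.

    Tries the longest possible overlap first. If no exact overlap of at
    least min_overlap bases exists, falls back to a gap of 'N' characters,
    which is the gap-only behaviour the abstract attributes to existing tools.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")

    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            # Keep the shared bases once instead of duplicating them.
            return left + right[k:]

    return left + "N" * gap_size + right


# Hypothetical contigs sharing the four bases "ACGG".
merged = join_contigs("ACGTACGG", "ACGGTTCA")
gapped = join_contigs("AAAA", "CCCC", gap_size=2)
```

Real contig ends contain sequencing errors, so a practical tool would use alignment with mismatches rather than exact string comparison; the exact test here only keeps the example short.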

PDR: a new genome assembly evaluation metric based on genetics concerns

Bioinformatics ◽

10.1093/bioinformatics/btaa704 ◽

2020 ◽

Author(s):

Luyu Xie ◽

Limsoon Wong

Keyword(s):

Genome Assembly ◽

Pairwise Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Genetic Studies ◽

A Genome ◽

Assembly Evaluation ◽

Evaluation Metric

Abstract. Motivation: Existing genome assembly evaluation metrics provide only limited insight into specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. Results: Our results on publicly available datasets affirm PDR’s ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicate that the error introduced by the approximation is extremely small and thus negligible. Availability and implementation: https://github.com/XLuyu/PDR. Supplementary information: Supplementary data are available at Bioinformatics online.

GfaViz: flexible and interactive visualization of GFA sequence graphs

Bioinformatics ◽

10.1093/bioinformatics/bty1046 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2853-2855 ◽

Cited By ~ 2

Author(s):

Giorgio Gonnella ◽

Niklas Niehus ◽

Stefan Kurtz

Keyword(s):

Interactive Visualization ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Command Line Interface ◽

Vector Graphics ◽

Fragment Assembly ◽

Or Groups ◽

Graphical Tool ◽

Standard Configuration

Abstract. Summary: The graphical fragment assembly (GFA) formats are emerging standard formats for the representation of sequence graphs. Although GFA 1 primarily targeted assembly graphs, the newer GFA 2 format introduces several features that make it suitable for representing other kinds of information, such as scaffolding graphs, variation graphs, alignment graphs and colored metagenomic graphs. Here, we present GfaViz, an interactive graphical tool for the visualization of sequence graphs in GFA format. The software supports all new features of GFA 2 and introduces conventions for their visualization. The user can choose between two different layouts and multiple styles for representing single elements or groups. All customizations can be stored in custom tags of the GFA format itself, without requiring external configuration files. Stylesheets are supported for storing standard configuration options for groups of files. The visualizations can be exported to raster and vector graphics formats. A command line interface allows for batch generation of images. Availability and implementation: GfaViz is available at https://github.com/ggonnella/gfaviz. Supplementary information: Supplementary data are available at Bioinformatics online.

AGOUTI: improving genome assembly and annotation using transcriptome data

10.1101/033019 ◽

2015 ◽

Cited By ~ 1

Author(s):

Simo V. Zhang ◽

Luting Zhuo ◽

Matthew W. Hahn

Keyword(s):

Genome Assembly ◽

Gene Annotation ◽

Supplementary Information ◽

Gene Identification ◽

Supplementary Data ◽

Transcriptome Data ◽

Rna Seq ◽

Separate Gene ◽

Gene Models ◽

Genome Assemblies

Abstract. Summary: Current genome assemblies consist of thousands of contigs. These incomplete and fragmented assemblies lead to errors in gene identification, such that single genes spread across multiple contigs are annotated as separate gene models. We present AGOUTI (Annotated Genome Optimization Using Transcriptome Information), a tool that uses RNA-seq data to simultaneously combine contigs into scaffolds and fragmented gene models into single models. We show that AGOUTI improves both the contiguity of genome assemblies and the accuracy of gene annotation, providing updated versions of each as output. Availability: The software is implemented in Python and is available from github.com/svm-zhang/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

yacrd and fpa: upstream tools for long-read genome assembly

10.1101/674036 ◽

2019 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Link Type ◽

Long Read

Abstract. Motivation: Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results: We introduce two tools, yacrd and fpa, which respectively perform chimera removal and read scrubbing, and filter out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability: https://github.com/natir/yacrd and https://github.com/natir/[email protected] Supplementary information: Supplementary data are available online.

SMART: SuperMaximal approximate repeats tool

Bioinformatics ◽

10.1093/bioinformatics/btz953 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2589-2591

Author(s):

Lorraine A K Ayad ◽

Panagiotis Charalampopoulos ◽

Solon P Pissis

Keyword(s):

State Of The Art ◽

Input Sequence ◽

The State ◽

Supplementary Information ◽

Greedy Heuristics ◽

Supplementary Data ◽

Repeat Analysis ◽

Analysis Tools ◽

Speed Up

Abstract. Summary: State-of-the-art repeat analysis tools rely on extending maximal repeated pairs to enumerate maximal k-mismatch repeats. These pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied to speed up the extension. Here, we introduce supermaximal k-mismatch repeats, which are linear in n and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly, and show that these elements are statistically much more significant than the output of the state-of-the-art. Availability and implementation: http://github.com/lorrainea/smart (GNU GPL v3.0). Supplementary information: Supplementary data are available at Bioinformatics online.

yacrd and fpa: upstream tools for long-read genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btaa262 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3894-3896 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Genome Assembly ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Disk Space ◽

Long Read

Abstract. Motivation: Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. Results: We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. Availability and implementation: https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary information: Supplementary data are available at Bioinformatics online.
