SGTK: a toolkit for visualization and assessment of scaffold graphs

2018 ◽  
Vol 35 (13) ◽  
pp. 2303-2305 ◽  
Author(s):  
Olga Kunyavskaya ◽  
Andrey D Prjibelski

Abstract Summary Scaffolding is an important step in every genome assembly pipeline: it orders contigs into longer sequences using various types of linkage information, such as mate-pair libraries and long reads. In this work, we operate with the notion of a scaffold graph, a graph whose vertices correspond to the assembled contigs and whose edges represent connections between them. We present a software package called Scaffold Graph ToolKit (SGTK) that allows users to construct and visualize scaffold graphs using different kinds of sequencing data. We show that the scaffold graph is useful for analyzing and assessing genome assemblies, and we demonstrate several use cases that can be helpful for both assembly software developers and their users. Availability and implementation SGTK is implemented in C++, Python and JavaScript and is freely available at https://github.com/olga24912/SGTK. Supplementary information Supplementary data are available at Bioinformatics online.
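The scaffold-graph idea above can be sketched in a few lines: contigs are vertices, and each piece of linkage evidence (e.g. a mate pair mapping to two different contigs) increments the weight of an edge between them. The class and method names below are invented for illustration and are not SGTK's actual API.

```python
from collections import defaultdict

class ScaffoldGraph:
    """Toy scaffold graph: vertices are contigs, edge weights count
    supporting links (e.g. mate pairs) between two contigs."""

    def __init__(self):
        self.edges = defaultdict(int)  # (contig_a, contig_b) -> link count

    def add_link(self, a, b):
        # One piece of linkage evidence connecting contig a to contig b.
        self.edges[(a, b)] += 1

    def supported_edges(self, min_links):
        # Keep only connections backed by enough independent evidence;
        # weakly supported edges are likely noise.
        return {e: n for e, n in self.edges.items() if n >= min_links}

g = ScaffoldGraph()
for pair in [("c1", "c2"), ("c1", "c2"), ("c1", "c2"), ("c2", "c3")]:
    g.add_link(*pair)
print(g.supported_edges(min_links=2))  # only c1->c2 has >= 2 links
```

Filtering edges by a minimum link count is the usual first step before ordering contigs along well-supported paths.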

2019 ◽  
Vol 35 (21) ◽  
pp. 4397-4399 ◽  
Author(s):  
S U Greer ◽  
H P Ji

Abstract Summary Linked-read sequencing generates synthetic long reads, which are useful for the detection and analysis of structural variants (SVs). The software associated with 10x Genomics linked-read sequencing, Long Ranger, generates the essential output files (BAM, VCF, SV BEDPE) necessary for downstream analyses. However, performing downstream analyses requires users to build custom tools to handle the unique features of linked-read sequencing data. Here, we describe gemtools, a collection of tools for the downstream and in-depth analysis of SVs from linked-read data. Gemtools uses the barcoded aligned reads and the Megabase-scale phase blocks to determine haplotypes of SV breakpoints and delineate complex breakpoint configurations at the resolution of single DNA molecules. The gemtools package provides the user with the flexibility to perform basic functions on their linked-read sequencing output in order to address a broader range of questions. Availability and implementation The gemtools package is freely available for download at: https://github.com/sgreer77/gemtools. Supplementary information Supplementary data are available at Bioinformatics online.
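The barcode-based haplotyping of SV breakpoints described above boils down to set intersection: an SV is assigned to the haplotype whose phase-block barcodes it shares most of. This is a hypothetical sketch of that logic, not gemtools' actual implementation; all names are illustrative.

```python
def assign_haplotype(sv_barcodes, hap1_barcodes, hap2_barcodes):
    """Assign an SV breakpoint to a haplotype by counting how many of its
    supporting barcodes are shared with each haplotype of a phase block."""
    h1 = len(set(sv_barcodes) & set(hap1_barcodes))
    h2 = len(set(sv_barcodes) & set(hap2_barcodes))
    if h1 > h2:
        return "HP1"
    if h2 > h1:
        return "HP2"
    return "ambiguous"  # equal (or no) barcode support on both haplotypes

print(assign_haplotype({"b1", "b2", "b3"}, {"b1", "b2"}, {"b4"}))  # HP1
```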


Author(s):  
Mengyang Xu ◽  
Lidong Guo ◽  
Xiao Du ◽  
Lei Li ◽  
Brock A Peters ◽  
...  

Abstract Motivation Achieving a near-complete understanding of how the genome of an individual affects that individual's phenotypes requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. Results To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova) and are comparable to a trio-binning-based third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997%, Q65). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. Availability The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. Supplementary information Supplementary data are available at Bioinformatics online.
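The core of any trio-binning step is simple to state: a read is assigned to the parent whose parent-unique k-mers it shares more of. The following is a minimal sketch of that classification rule under toy data, not HAST's actual implementation (real tools use large k, counting thresholds and error filtering).

```python
def kmers(seq, k):
    # All k-length substrings of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, maternal_kmers, paternal_kmers, k=4):
    """Assign a read to the parent whose unique k-mers it shares more of."""
    rk = kmers(read, k)
    m = len(rk & maternal_kmers)
    p = len(rk & paternal_kmers)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no discriminating k-mers

# Parent-unique k-mer sets: k-mers seen in one parent but not the other.
mat = kmers("ACGTACGTAA", 4) - kmers("TTTTGGGGCC", 4)
pat = kmers("TTTTGGGGCC", 4) - kmers("ACGTACGTAA", 4)
print(classify_read("ACGTACGT", mat, pat))  # maternal
```

Reads binned this way are then assembled per parent, which is what yields the haplotype-resolved assemblies described above.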


2019 ◽  
Author(s):  
Manuel Holtgrewe ◽  
Clemens Messerschmidt ◽  
Mikko Nieminen ◽  
Dieter Beule

Abstract Summary Management of raw sequencing data and its preprocessing (conversion into sequences and demultiplexing) remains a challenging topic for groups running sequencing devices. Existing solutions range from manual management of spreadsheets to very complex and customized LIMS systems that handle much more than just raw sequencing data. In this manuscript, we describe the software package DigestiFlow, which focuses on the management of Illumina flow cell sample sheets and raw data. It allows for automated extraction of information from flow cell data and management of sample sheets. Furthermore, it allows for the automated and reproducible conversion of Illumina base calls to sequences and the demultiplexing thereof using bcl2fastq and Picard Tools, followed by quality control report generation. Availability and implementation The software is available under the MIT license at https://github.com/bihealth/digestiflow-server. The client software components are available via Bioconda. Supplementary information Supplementary data are available at Bioinformatics online.
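At its simplest, the demultiplexing step described above routes each read to a sample by matching its index against the sample sheet. This toy sketch shows the idea only; it is not DigestiFlow or bcl2fastq internals (no index mismatch tolerance, no lanes, no dual indices).

```python
def demultiplex(reads, sample_sheet):
    """Route reads to samples by exact index match.

    reads: list of (index, sequence) pairs
    sample_sheet: dict mapping index -> sample name
    """
    bins = {}
    for index, seq in reads:
        # Unknown indices fall into the conventional "Undetermined" bin.
        sample = sample_sheet.get(index, "Undetermined")
        bins.setdefault(sample, []).append(seq)
    return bins

sheet = {"ACGT": "sampleA", "TTGC": "sampleB"}
result = demultiplex([("ACGT", "AAAA"), ("TTGC", "CCCC"), ("GGGG", "TTTT")], sheet)
print(result)
```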


2020 ◽  
Author(s):  
Mikko Kivikoski ◽  
Pasi Rastas ◽  
Ari Löytynoja ◽  
Juha Merilä

Abstract The utility of genome-wide sequencing data in biological research depends heavily on the quality of the reference genome. Although reference genomes have improved, it is evident that the assemblies could still be refined, especially in non-model study organisms. Here, we describe an integrative approach to improve the contiguity and haploidy of a reference genome assembly. Using two novel features of the Lep-Anchor software and a combination of dense linkage maps, overlap detection and bridging long reads, we generated an improved assembly of the nine-spined stickleback (Pungitius pungitius) reference genome. We were able to remove a significant number of haplotypic contigs, detect more genetic variation and improve the contiguity of the genome, especially that of the X chromosome. However, improved scaffolding cannot correct for mosaicism of erroneously assembled contigs, as demonstrated by a de novo assembly of a 1.7 Mbp inversion. Qualitatively similar gains were obtained with the genome of the three-spined stickleback (Gasterosteus aculeatus).
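One simple signal for the haplotypic-contig removal mentioned above is containment: if one contig's placement interval on the anchored assembly lies entirely within another's, the smaller contig is a candidate duplicate haplotype. This is an illustrative sketch of that one heuristic, not Lep-Anchor's actual algorithm; all names and intervals are invented.

```python
def contained(a, b):
    # Interval a = (start, end) lies entirely within interval b.
    return b[0] <= a[0] and a[1] <= b[1]

def haplotypic_candidates(placements):
    """Flag contigs whose anchored interval is contained in another contig's.

    placements: dict mapping contig name -> (start, end) on a chromosome.
    """
    flagged = set()
    for c1, iv1 in placements.items():
        for c2, iv2 in placements.items():
            if c1 != c2 and contained(iv1, iv2):
                flagged.add(c1)
    return flagged

print(haplotypic_candidates({"ctgA": (0, 100), "ctgB": (20, 60), "ctgC": (150, 200)}))
```

In practice sequence similarity between the two contigs would also be checked before discarding anything.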


2020 ◽  
Author(s):  
Jana Ebler ◽  
Wayne E. Clarke ◽  
Tobias Rausch ◽  
Peter A. Audano ◽  
Torsten Houwaart ◽  
...  

Abstract Typical analysis workflows map reads to a reference genome in order to detect genetic variants. Generating such alignments introduces reference biases, in particular against insertion alleles absent from the reference, and comes with a substantial computational burden. In contrast, recent k-mer-based genotyping methods are fast, but struggle in repetitive or duplicated regions of the genome. We propose a novel algorithm, called PanGenie, that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation. The given haplotypes enable our method to take advantage of linkage information to aid genotyping in regions poorly covered by unique k-mers, and provide access to regions otherwise inaccessible to short reads. Compared to classic mapping-based approaches, our approach is more than 4× faster at 30× coverage and, at the same time, reaches significantly better genotype concordance for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (>50 bp), where we are able to genotype >99.9% of all tested variants with over 90% accuracy at 30× short-read coverage, whereas the best competing tools either typed less than 60% of variants or reached accuracies below 70%. PanGenie thus enables the inclusion of this commonly neglected variant type in downstream analyses.
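To make the k-mer-based genotyping idea concrete: given counts of k-mers unique to the reference allele and to the alternative allele, the allele balance already suggests a genotype. The sketch below uses a crude allele-fraction threshold rule with invented cutoffs; PanGenie itself uses a proper statistical model plus haplotype-panel linkage, which this deliberately omits.

```python
def genotype(ref_count, alt_count, hom_cutoff=0.8):
    """Toy genotype call from allele-specific k-mer counts.

    Returns VCF-style genotypes: 0/0 (hom ref), 0/1 (het), 1/1 (hom alt),
    or ./. when there is no k-mer support at all.
    """
    total = ref_count + alt_count
    if total == 0:
        return "./."
    alt_frac = alt_count / total
    if alt_frac < 1 - hom_cutoff:
        return "0/0"
    if alt_frac > hom_cutoff:
        return "1/1"
    return "0/1"

print(genotype(30, 0), genotype(14, 16), genotype(1, 29))
```

The value of the haplotype panel in PanGenie is precisely that it rescues sites where such counts are uninformative (few or no unique k-mers).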


2019 ◽  
Author(s):  
Alex Di Genova ◽  
Elena Buena-Atienza ◽  
Stephan Ossowski ◽  
Marie-France Sagot

The continuous improvement of long-read sequencing technologies, along with the development of ad hoc algorithms, has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate assemblies of large, repeat-rich human genomes. To date, most human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61) and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan.
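For readers unfamiliar with the NG50 metric quoted above: it is the contig length L such that contigs of length at least L together cover at least half of the (estimated) genome size. This is the standard definition, not WENGAN-specific code.

```python
def ng50(contig_lengths, genome_size):
    """NG50: largest L such that contigs >= L cover half the genome size."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return 0  # assembly covers less than half the genome

print(ng50([100, 80, 50, 30, 10], genome_size=260))  # 80
```

NGA50 is computed the same way but on alignment blocks after breaking contigs at misassemblies, which is why it is always at most the NG50.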


2018 ◽  
Author(s):  
Jana Ebler ◽  
Marina Haukness ◽  
Trevor Pesout ◽  
Tobias Marschall ◽  
Benedict Paten

Motivation Current genotyping approaches for single nucleotide variants (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants. Results In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to be accurately genotyped. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
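The single-site core of posterior genotyping can be illustrated without the HMM machinery: given noisy allele observations from reads covering one site and a per-read error rate, each genotype implies a probability of observing the alternative allele, and Bayes' rule (here with a uniform prior) gives posterior genotype probabilities. This is a deliberately simplified sketch; the method above additionally links sites through read bipartitions and the forward-backward algorithm.

```python
def genotype_posteriors(obs, eps=0.15):
    """Posterior genotype probabilities at one site, uniform prior.

    obs: list of 0/1 allele calls from reads covering the site
    eps: per-read sequencing error rate
    """
    likes = {}
    # Probability of observing the alt allele under each true genotype:
    # hom-ref reads show alt only by error; hets half the time; hom-alt
    # reads show ref only by error.
    for g, p_alt in (("0/0", eps), ("0/1", 0.5), ("1/1", 1 - eps)):
        like = 1.0
        for o in obs:
            like *= p_alt if o == 1 else 1 - p_alt
        likes[g] = like
    z = sum(likes.values())  # normalize to a posterior (uniform prior)
    return {g: l / z for g, l in likes.items()}

post = genotype_posteriors([1, 1, 0, 1, 1, 1])
print(max(post, key=post.get))  # most likely genotype
```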


2020 ◽  
Vol 36 (20) ◽  
pp. 5000-5006 ◽  
Author(s):  
Lucile Broseus ◽  
Aubin Thomas ◽  
Andrew J Oldfield ◽  
Dany Severac ◽  
Emeric Dubois ◽  
...  

Abstract Motivation Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous ‘hybrid correction’ algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited to the more complex nature of transcriptome sequencing data. Results We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC), which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long-read technology. Availability and implementation TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. Supplementary information Supplementary data are available at Bioinformatics online.
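A weighted De Bruijn graph of the kind mentioned above can be built directly from short reads: nodes are (k-1)-mers, edges are k-mers, and each edge carries the number of reads supporting it, so edge weights reflect expression level. The sketch below shows only this construction, not TALC's correction algorithm.

```python
from collections import Counter

def weighted_dbg(reads, k):
    """Weighted De Bruijn graph: edge (prefix, suffix) -> k-mer count."""
    edges = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            # Each k-mer is an edge between its two (k-1)-mer ends;
            # its count across all reads is the edge weight.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

g = weighted_dbg(["ACGTA", "ACGTA", "ACGAA"], k=3)
print(g[("AC", "CG")])  # 3: edge supported by all three reads
```

Correcting a long read then amounts to finding a well-supported path through this graph that stays close to the read's sequence, with weights helping to choose among isoform branches.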


2019 ◽  
Vol 36 (7) ◽  
pp. 2253-2255 ◽  
Author(s):  
Jiang Hu ◽  
Junpeng Fan ◽  
Zongyi Sun ◽  
Shanlin Liu

Abstract Motivation Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled from long reads. It consists of two interlinked modules designed to score and count k-mers from high-quality short reads, and to polish genome assemblies containing large numbers of base errors. Results When evaluated for speed and efficiency on human and plant (Arabidopsis thaliana) genomes, NextPolish outperformed Pilon, correcting sequence errors faster and with higher accuracy. Availability and implementation NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish. Supplementary information Supplementary data are available at Bioinformatics online.
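The essence of short-read polishing is to overwrite draft bases where the short-read evidence clearly disagrees. The toy sketch below uses a simple majority vote over a per-position pileup; it is illustrative only (NextPolish scores k-mers and also handles indels, which this ignores).

```python
from collections import Counter

def polish(draft, pileups):
    """Replace each draft base by the majority base in its short-read pileup.

    draft: draft assembly sequence
    pileups: per-position lists of base observations from aligned short reads
    """
    out = []
    for base, column in zip(draft, pileups):
        if column:
            best, count = Counter(column).most_common(1)[0]
            # Only overwrite when a strict majority of reads agrees.
            out.append(best if count > len(column) / 2 else base)
        else:
            out.append(base)  # no short-read coverage: keep the draft base
    return "".join(out)

draft = "ACGT"
pileups = [["A", "A"], ["C", "C", "C"], ["T", "T", "G"], []]
print(polish(draft, pileups))  # ACTT: position 3 corrected, position 4 kept
```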


2020 ◽  
Vol 36 (17) ◽  
pp. 4568-4575 ◽  
Author(s):  
Lolita Lecompte ◽  
Pierre Peterlongo ◽  
Dominique Lavenier ◽  
Claire Lemaitre

Abstract Motivation Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnosis, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short-read data, dedicated tools are lacking to assess whether known SVs are present or not in a new long-read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results We present a novel method to genotype known SVs from long-read sequencing data. The method is based on generating, for each structural variant, a pair of representative sequences, one per allele. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. Applied to both simulated and real human datasets, SVJedi achieves high genotyping accuracy, outperforming other long-read genotyping tools as well as alternative approaches, namely SV discovery and short-read SV genotyping. Availability and implementation https://github.com/llecompte/SVJedi.git Supplementary information Supplementary data are available at Bioinformatics online.
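The representative-allele idea above is easy to illustrate for a deletion: take flanking sequence around the breakpoints so that a long read can align across either the reference allele or the deletion allele. The function and flank size below are illustrative, not SVJedi's actual code.

```python
def deletion_alleles(genome, del_start, del_end, flank):
    """Build the two representative sequences for a known deletion.

    The reference allele keeps the deleted segment plus flanks; the
    alternative allele joins the two flanks directly across the deletion.
    """
    ref_allele = genome[max(0, del_start - flank): del_end + flank]
    alt_allele = (genome[max(0, del_start - flank): del_start]
                  + genome[del_end: del_end + flank])
    return ref_allele, alt_allele

genome = "AAAACCCCGGGGTTTT"
ref, alt = deletion_alleles(genome, del_start=4, del_end=12, flank=4)
print(ref, alt)  # AAAACCCCGGGGTTTT AAAATTTT
```

Reads aligning end-to-end over the junction in `alt` support the deletion genotype; reads spanning the same region in `ref` support the reference allele, and the ratio of the two yields the genotype.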

