Structural variant analysis for linked-read sequencing data with gemtools

S U Greer; H P Ji

doi:10.1093/bioinformatics/btz239

Bioinformatics ◽

10.1093/bioinformatics/btz239 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4397-4399 ◽

Cited By ~ 2

Author(s):

S U Greer ◽

H P Ji

Keyword(s):

Supplementary Information ◽

Supplementary Data ◽

Structural Variants ◽

Sequencing Data ◽

Structural Variant ◽

Single Dna Molecules ◽

Long Reads ◽

Depth Analysis ◽

Basic Functions ◽

Variant Analysis

Abstract. Summary: Linked-read sequencing generates synthetic long reads which are useful for the detection and analysis of structural variants (SVs). The software associated with 10× Genomics linked-read sequencing, Long Ranger, generates the essential output files (BAM, VCF, SV BEDPE) necessary for downstream analyses. However, performing downstream analyses requires users to build their own tools to handle the unique features of linked-read sequencing data. Here, we describe gemtools, a collection of tools for the downstream and in-depth analysis of SVs from linked-read data. Gemtools uses the barcoded aligned reads and the megabase-scale phase blocks to determine haplotypes of SV breakpoints and delineate complex breakpoint configurations at the resolution of single DNA molecules. The gemtools package is a suite of tools that gives the user the flexibility to perform basic functions on their linked-read sequencing output in order to address additional questions. Availability and implementation: The gemtools package is freely available for download at https://github.com/sgreer77/gemtools. Supplementary information: Supplementary data are available at Bioinformatics online.

stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads

Frontiers in Genetics ◽

10.3389/fgene.2021.636239 ◽

2021 ◽

Vol 12 ◽

Author(s):

Junfu Guo ◽

Chang Shi ◽

Xi Chen ◽

Ou Wang ◽

Ping Liu ◽

...

Keyword(s):

Structural Variation ◽

Signal To Noise Ratio ◽

Recall Rate ◽

Structural Variants ◽

Analysis Pipeline ◽

Haplotype Phasing ◽

Base Level ◽

Long Reads ◽

Variant Analysis ◽

Complex Structural

Co-barcoded reads originating from long DNA fragments (mean length >30 kbp) maintain both single-base-level accuracy and long-range genomic information. We propose a pipeline, stLFRsv, to detect structural variation using co-barcoded reads. stLFRsv identifies abnormally large gaps between co-barcoded reads to detect potential breakpoints and reconstruct complex structural variants (SVs). Haplotype phasing by co-barcoded reads increases the signal-to-noise ratio, and barcode sharing profiles are used to filter out false positives. We integrate the short-read SV caller smoove for smaller variants with stLFRsv. The integrated pipeline was evaluated on the well-characterized genome HG002/NA24385, and 74.5% precision and a 22.4% recall rate were obtained for deletions. stLFRsv revealed some large variants not included in the benchmark set that were verified by long reads or assembly. For the HG001/NA12878 genome, stLFRsv also achieved the best performance for both resource usage and the detection of large variants. Our work indicates that co-barcoded read technology has the potential to improve genome completeness.
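The gap idea in this abstract can be illustrated with a minimal Python sketch. It is not the stLFRsv implementation: the function name `find_large_gaps`, the tuple layout of the reads, and the 30,000 bp threshold are all assumptions chosen for illustration. Reads are grouped by barcode and chromosome, sorted by position, and any jump between consecutive reads larger than the threshold is reported as a candidate breakpoint interval.

```python
from collections import defaultdict


def find_large_gaps(reads, max_gap=30_000):
    """Report unusually large gaps between reads that share a barcode.

    reads: iterable of (barcode, chrom, start, end) tuples (hypothetical layout).
    max_gap: assumed threshold in bp; not a value taken from stLFRsv.
    Returns a list of (barcode, chrom, gap_start, gap_end) tuples.
    """
    # Group read spans by (barcode, chromosome).
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    gaps = []
    for (barcode, chrom), spans in sorted(by_key.items()):
        spans.sort()
        prev_end = spans[0][1]
        for start, end in spans[1:]:
            # A jump larger than max_gap suggests a potential breakpoint.
            if start - prev_end > max_gap:
                gaps.append((barcode, chrom, prev_end, start))
            prev_end = max(prev_end, end)
    return gaps


reads = [
    ("BX1", "chr1", 100, 250),
    ("BX1", "chr1", 400, 550),
    ("BX1", "chr1", 90_000, 90_150),
    ("BX2", "chr1", 1_000, 1_150),
]
print(find_large_gaps(reads))
```

A real pipeline would additionally need the phasing and barcode-sharing filters the abstract mentions; this sketch covers only the gap scan.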

SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

10.1101/2020.10.27.356907 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

Abstract. Motivation: With the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes. Results: We introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reached higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual. Availability and Implementation: SVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/[email protected] Supplementary information: Supplementary data are available online.

Somatic variant analysis of linked-reads sequencing data with Lancet

Bioinformatics ◽

10.1093/bioinformatics/btaa888 ◽

2020 ◽

Author(s):

Rajeeva Musunuri ◽

Kanika Arora ◽

André Corvelo ◽

Minita Shah ◽

Jennifer Shelton ◽

...

Keyword(s):

Supplementary Information ◽

De Bruijn Graph ◽

Haplotype Structure ◽

Sequencing Data ◽

Somatic Variant ◽

Local Assembly ◽

De Bruijn ◽

Variant Analysis ◽

Colored De Bruijn Graph ◽

Commercial Research

Abstract. Summary: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation: Lancet is implemented in C++ and available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/lancet. Supplementary information: Supplementary data are available at Bioinformatics online.

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract. Motivation: Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results: To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation: https://github.com/ablab/TandemTools. Supplementary information: Supplementary data are available at Bioinformatics online.

schex avoids overplotting for large single-cell RNA-sequencing datasets

Bioinformatics ◽

10.1093/bioinformatics/btz907 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2291-2292 ◽

Cited By ~ 1

Author(s):

Saskia Freytag ◽

Ryan Lister

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract. Summary: Due to the scale and sparsity of single-cell RNA-sequencing data, traditional plots can obscure vital information. Our R package schex overcomes this by implementing hexagonal binning, which has the additional advantages of improving speed and reducing storage for resulting plots. Availability and implementation: schex is freely available from Bioconductor via http://bioconductor.org/packages/release/bioc/html/schex.html and its development version can be accessed on GitHub via https://github.com/SaskiaFreytag/schex. Supplementary information: Supplementary data are available at Bioinformatics online.
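Hexagonal binning itself can be sketched in a few lines of Python. This is not schex's R interface; the function names `hex_bin` and `_cube_round` and the `size` parameter (hexagon centre-to-corner distance) are assumptions made for illustration. Each 2D point, for example a cell's coordinates in a reduced-dimension embedding, is converted to axial hexagon coordinates and rounded to the nearest hexagon, so that a plot can draw one hexagon per bin instead of one marker per cell.

```python
import math
from collections import Counter


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so that
    # the cube constraint q + r + s == 0 still holds.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def hex_bin(points, size=1.0):
    """Count points per hexagon on a pointy-top hexagonal grid.

    points: iterable of (x, y) pairs.
    size: hexagon centre-to-corner distance (hypothetical parameter).
    Returns a Counter mapping axial (q, r) hexagon coordinates to counts.
    """
    counts = Counter()
    for x, y in points:
        q = (math.sqrt(3) / 3 * x - y / 3) / size
        r = (2 / 3 * y) / size
        counts[_cube_round(q, r)] += 1
    return counts


points = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (math.sqrt(3), 0.0)]
bins = hex_bin(points)
print(dict(bins))
```

The output holds one entry per occupied hexagon rather than one per point, which is where the overplotting and storage benefits described in the abstract come from.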

phylogenize: correcting for phylogeny reveals genes associated with microbial distributions

Bioinformatics ◽

10.1093/bioinformatics/btz722 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1289-1290

Author(s):

Patrick H Bradley ◽

Katherine S Pollard

Keyword(s):

Community Composition ◽

Human Microbiome ◽

Human Microbiome Project ◽

Shotgun Sequencing ◽

Supplementary Information ◽

Phylogenetic Comparative Methods ◽

Supplementary Data ◽

Sequencing Data ◽

Phylogenetic Regression ◽

Project Data

Abstract. Summary: Phylogenetic comparative methods are powerful but presently under-utilized ways to identify microbial genes underlying differences in community composition. These methods help to identify functionally important genes because they test for associations beyond those expected when related microbes occupy similar environments. We present phylogenize, a pipeline with web, QIIME 2 and R interfaces that allows researchers to perform phylogenetic regression on 16S amplicon and shotgun sequencing data and to visualize results. phylogenize applies broadly to both host-associated and environmental microbiomes. Using Human Microbiome Project and Earth Microbiome Project data, we show that phylogenize draws similar conclusions from 16S versus shotgun sequencing and reveals both known and candidate pathways associated with host colonization. Availability and implementation: phylogenize is available at https://phylogenize.org and https://bitbucket.org/pbradz/phylogenize. Supplementary information: Supplementary data are available at Bioinformatics online.

Quantification of aneuploidy in targeted sequencing data using ASCETS

Bioinformatics ◽

10.1093/bioinformatics/btaa980 ◽

2020 ◽

Author(s):

Liam F Spurr ◽

Mehdi Touat ◽

Alison M Taylor ◽

Adrian M Dubuc ◽

Juliann Shih ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Genomic Analysis ◽

Targeted Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Copy Number Changes ◽

Panel Sequencing ◽

Chromosome Level

Abstract. Summary: The expansion of targeted panel sequencing efforts has created opportunities for large-scale genomic analysis, but tools for copy-number quantification on panel data are lacking. We introduce ASCETS, a method for the efficient quantitation of arm and chromosome-level copy-number changes from targeted sequencing data. Availability and implementation: ASCETS is implemented in R and is freely available to non-commercial users on GitHub: https://github.com/beroukhim-lab/ascets, along with detailed documentation. Supplementary information: Supplementary data are available at Bioinformatics online.

Abstract 2176: Joint structural variant analysis of colorectal cancer whole genome sequencing data

10.1158/1538-7445.am2015-2176 ◽

2015 ◽

Author(s):

Esa Pitkänen ◽

Tatiana Cajuso ◽

Riku Katainen ◽

Sofie Lundgren ◽

Sari Tuupanen ◽

...

Keyword(s):

Colorectal Cancer ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Structural Variant ◽

Variant Analysis

SGTK: a toolkit for visualization and assessment of scaffold graphs

Bioinformatics ◽

10.1093/bioinformatics/bty956 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2303-2305 ◽

Cited By ~ 2

Author(s):

Olga Kunyavskaya ◽

Andrey D Prjibelski

Keyword(s):

Software Package ◽

Supplementary Information ◽

Sequencing Data ◽

Software Developers ◽

Long Reads ◽

Mate Pair ◽

Linkage Information ◽

Assembly Pipeline ◽

Genome Assemblies ◽

Assembly Software

Abstract. Summary: Scaffolding is an important step in every genome assembly pipeline; it orders contigs into longer sequences using various types of linkage information, such as mate-pair libraries and long reads. In this work, we operate with the notion of a scaffold graph: a graph whose vertices correspond to the assembled contigs and whose edges represent connections between them. We present a software package called Scaffold Graph ToolKit that allows users to construct and visualize scaffold graphs using different kinds of sequencing data. We show that the scaffold graph is useful for analyzing and assessing genome assemblies, and demonstrate several use cases that can be helpful for both assembly software developers and their users. Availability and implementation: SGTK is implemented in C++, Python and JavaScript and is freely available at https://github.com/olga24912/SGTK. Supplementary information: Supplementary data are available at Bioinformatics online.
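The scaffold graph described here maps naturally onto a weighted adjacency structure. The following Python sketch is only an illustration of that data structure and does not use SGTK's code or file formats; the function name `build_scaffold_graph`, the `(contig_a, contig_b)` link tuples, and the `min_support` cutoff are assumptions. Contig orientation, which a real scaffolder has to track, is deliberately ignored.

```python
from collections import defaultdict


def build_scaffold_graph(links, min_support=2):
    """Build a directed scaffold graph from pairwise contig links.

    links: iterable of (contig_a, contig_b) pairs, one per piece of
           linkage evidence (e.g. one mate pair or one long read).
    min_support: assumed minimum number of links needed to keep an edge.
    Returns {contig_a: {contig_b: support_count}}.
    """
    # Count how many independent links support each connection.
    support = defaultdict(int)
    for a, b in links:
        support[(a, b)] += 1

    # Keep only edges with enough supporting evidence.
    graph = defaultdict(dict)
    for (a, b), n in support.items():
        if n >= min_support:
            graph[a][b] = n
    return dict(graph)


links = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"), ("c1", "c2")]
graph = build_scaffold_graph(links)
print(graph)
```

Edge weights of this kind are one simple way to compare connections derived from different data sources on the same set of contigs.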

GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form

Bioinformatics ◽

10.1093/bioinformatics/btaa604 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4810-4812

Author(s):

Qingxi Meng ◽

Idoia Ochoa ◽

Mikel Hernaez

Keyword(s):

Single Cell ◽

Data Streams ◽

General Feature ◽

Supplementary Information ◽

Storage Space ◽

Supplementary Data ◽

Rna Seq ◽

Sequencing Data ◽

General Feature Format ◽

Original File

Abstract. Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. Availability and implementation: GPress is freely available at https://github.com/qm2/gpress. Supplementary information: Supplementary data are available at Bioinformatics online.
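The block-plus-index idea in this abstract can be sketched briefly in Python. This is not GPress code: zlib from the standard library stands in for BSC, the data transformations GPress applies before compression are omitted, and the function names, the `block_size` parameter, and the use of the first tab-separated field as the identifier are all assumptions for illustration. Records are compressed in fixed-size blocks, and an index table maps each identifier to the blocks that contain it, so a query decompresses only those blocks.

```python
import zlib


def compress_blocks(lines, block_size=2):
    """Compress records in blocks and index identifiers by block number.

    lines: list of tab-separated records; the first field is treated as
           the identifier (a simplification, not the GFF column layout).
    block_size: assumed number of records per compressed block.
    Returns (blocks, index), where index maps identifier -> block numbers.
    """
    blocks, index = [], {}
    for i in range(0, len(lines), block_size):
        chunk = lines[i:i + block_size]
        block_no = len(blocks)
        for line in chunk:
            key = line.split("\t")[0]
            hits = index.setdefault(key, [])
            if not hits or hits[-1] != block_no:
                hits.append(block_no)
        blocks.append(zlib.compress("\n".join(chunk).encode()))
    return blocks, index


def lookup(blocks, index, identifier):
    """Decompress only the blocks that hold the requested identifier."""
    found = []
    for block_no in index.get(identifier, []):
        records = zlib.decompress(blocks[block_no]).decode().split("\n")
        found.extend(r for r in records if r.split("\t")[0] == identifier)
    return found


lines = [
    "g1\tchr1\t100\t200",
    "g2\tchr1\t300\t400",
    "g3\tchr2\t50\t80",
]
blocks, index = compress_blocks(lines)
print(lookup(blocks, index, "g3"))
```

Smaller blocks make each lookup cheaper but usually compress less well, which is the trade-off any block-based scheme of this kind has to balance.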
