Efficient exact associative structure for sequencing data

Abstract Motivation A plethora of methods and applications share the fundamental need to associate information with words for high-throughput sequence analysis. Indexing billions of k-mers quickly becomes a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of the properties of k-mer sets to address this challenge. They exploit the overlaps shared among k-mers by using a de Bruijn graph as a compact k-mer set to provide lightweight structures. Results We present Blight, a static and exact index structure that associates unique identifiers with indexed k-mers, rejects alien k-mers, and scales to the largest k-mer sets with a low memory cost. The proposed index combines an extremely compact representation with very high throughput. Moreover, its construction from the de Bruijn graph sequences is efficient and does not need supplementary memory. The implementation indexes the k-mers of the human genome with 8 GB within 10 minutes and scales up to the large axolotl genome with 63 GB within 76 minutes. Furthermore, while being memory efficient, the index allows more than a million queries per second on a single CPU in our experiments, and the use of multiple cores raises its throughput. Finally, we also show how the index can practically represent metagenomic and transcriptomic sequencing data to highlight its wide range of applications. Availability The index is implemented as a C++ library, is open source under the AGPL3 license, and is available at github.com/Malfoy/Blight. It is designed as a user-friendly library and comes with sample usage code.
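The behaviour described in the abstract (each indexed k-mer maps to a unique identifier, and any k-mer outside the set is rejected) can be illustrated with a small Python sketch. This is not Blight's data structure or API: it uses a plain dictionary, which is exactly the memory-hungry baseline the abstract argues against. The class name `ExactKmerIndex`, its `query` method, and the choice to treat a k-mer and its reverse complement as the same key are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Using canonical k-mers is an assumption of this sketch, not a claim
    about how Blight handles strands.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every substring of length k, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class ExactKmerIndex:
    """Hypothetical static, exact index: k-mer -> unique integer identifier."""

    def __init__(self, sequences: Iterable[str], k: int) -> None:
        self.k = k
        self._ids: Dict[str, int] = {}
        for sequence in sequences:
            for kmer in kmers(sequence, k):
                # A new k-mer receives the next free identifier;
                # an already indexed k-mer keeps the one it has.
                self._ids.setdefault(canonical(kmer), len(self._ids))

    def __len__(self) -> int:
        return len(self._ids)

    def query(self, kmer: str) -> Optional[int]:
        """Return the identifier of an indexed k-mer, or None for an alien k-mer."""
        if len(kmer) != self.k:
            return None
        return self._ids.get(canonical(kmer))


index = ExactKmerIndex(["ACGTAC"], k=3)
print(len(index), index.query("CGT"), index.query("AAA"))
```

The identifiers form a dense range starting at zero, so they can serve as offsets into a separate array of per-k-mer values, which is one common way to associate information with k-mers.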

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

BMC Bioinformatics, 2015, Vol 16 (1). DOI: 10.1186/s12859-015-0709-7. Cited by ~64.

Author(s): Gaëtan Benoit, Claire Lemaitre, Dominique Lavenier, Erwan Drezen, Thibault Dayris, ...

Keyword(s): High Throughput, High Throughput Sequencing, De Bruijn Graph, Sequencing Data, High Throughput Sequencing Data, De Bruijn

SEED 2: a user-friendly platform for amplicon high-throughput sequencing data analyses

Bioinformatics, 2018, Vol 34 (13), pp. 2292-2294. DOI: 10.1093/bioinformatics/bty071. Cited by ~59.

Author(s): Tomáš Větrovský, Petr Baldrian, Daniel Morais

Keyword(s): High Throughput, High Throughput Sequencing, Sequencing Data, Data Analyses, High Throughput Sequencing Data, User Friendly

Somatic variant analysis of linked-reads sequencing data with Lancet

Bioinformatics, 2020. DOI: 10.1093/bioinformatics/btaa888.

Author(s): Rajeeva Musunuri, Kanika Arora, André Corvelo, Minita Shah, Jennifer Shelton, ...

Keyword(s): Supplementary Information, De Bruijn Graph, Haplotype Structure, Sequencing Data, Somatic Variant, Local Assembly, De Bruijn, Variant Analysis, Colored De Bruijn Graph, Commercial Research

Abstract Summary We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation Lancet is implemented in C++ and available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/lancet. Supplementary information Supplementary data are available at Bioinformatics online.

Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields

BMC Bioinformatics, 2020, Vol 21 (1). DOI: 10.1186/s12859-020-03740-x.

Author(s): Aranka Steyaert, Pieter Audenaert, Jan Fostier

Keyword(s): Genomic Sequence, Conditional Random Field, Accurate Determination, Next Generation Sequencing Data, De Bruijn Graph, Sequencing Data, De Bruijn Graphs, Sequencing Errors, Expectation Maximisation, De Bruijn

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
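The definition of multiplicity given in the abstract (the number of times a node's k-mer, or an arc's (k+1)-mer, occurs in the genomic sequence) can be made concrete with a short Python sketch. It computes the ground-truth multiplicities from a known sequence; it is not the CRF inference from read coverage that the paper proposes. The function name `multiplicities` is hypothetical, and the sketch counts on a single strand only, which is a simplifying assumption.

```python
from collections import Counter
from typing import Tuple


def multiplicities(genome: str, k: int) -> Tuple[Counter, Counter]:
    """Count occurrences of every k-mer (node) and (k+1)-mer (arc) in a sequence.

    Single-strand counting only; reverse complements are ignored
    (an assumption of this sketch).
    """
    nodes = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    arcs = Counter(genome[i:i + k + 1] for i in range(len(genome) - k))
    return nodes, arcs


# "ACG" occurs twice in this toy sequence, so its node has multiplicity 2.
nodes, arcs = multiplicities("ACGACGT", k=3)
print(nodes["ACG"], arcs["ACGA"], nodes["TTT"])
```

A node with multiplicity above one marks a repeat, and a k-mer with multiplicity zero that still shows up in the reads would be attributed to a sequencing error.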

COPAR: A ChIP-Seq Optimal Peak Analyzer

BioMed Research International, 2017, Vol 2017, pp. 1-4. DOI: 10.1155/2017/5346793.

Author(s): Binhua Tang, Xihan Wang, Victor X. Jin

Keyword(s): High Throughput, Genomic Feature, Data Sets, Sequencing Data, Genomic Features, Peak Alignment, Chip Sequencing, Quality Check, User Friendly, High Throughput Experiments

Sequencing data quality and peak alignment efficiency of ChIP-sequencing profiles are directly related to the reliability and reproducibility of NGS experiments. Until now, there has been no tool specifically designed for optimal peak alignment estimation and quality-related genomic feature extraction for ChIP-sequencing profiles. We developed COPAR, an open-source, user-friendly package, to statistically investigate, quantify, and visualize the optimal peak alignment and inherent genomic features using ChIP-seq data from NGS experiments. It provides a versatile perspective for biologists to perform quality checks for high-throughput experiments and optimize their experiment design. COPAR can process mapped ChIP-seq read files in BED format and output statistically sound results for multiple high-throughput experiments. We have deposited COPAR on GitHub under a GNU GPL license, together with three public ChIP-seq data sets verified with the package.

CRISPRCloud2: A cloud-based platform for deconvolving CRISPR screen data

2018. DOI: 10.1101/309302. Cited by ~3.

Author(s): Hyun-Hwan Jeong, Seon Young Kim, Maxime W.C. Rousseaux, Huda Y. Zoghbi, Zhandong Liu

Keyword(s): Cost Effectiveness, High Throughput, State Of The Art, Statistical Test, Query Interface, Sequencing Data, Crispr Screen, One Stop, User Friendly, Screening Approaches

Abstract The simplicity and cost-effectiveness of CRISPR technology have made high-throughput pooled screening approaches available to many. However, the large amount of sequencing data derived from these studies often yields unwieldy datasets that require considerable bioinformatic resources to deconvolute, resources which are simply not accessible to many wet labs. To address these needs, we have developed a cloud-based webtool, CRISPRCloud2, that provides state-of-the-art accuracy in mapping short reads to a CRISPR library, a powerful statistical test that aggregates information across multiple sgRNAs targeting the same gene, a user-friendly data visualization and query interface, as well as easy linking to other CRISPR tools and bioinformatics resources for target prioritization. CRISPRCloud2 is a one-stop shop for labs analyzing CRISPR screen data.

Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics, 2020. DOI: 10.1093/bioinformatics/btaa782.

Author(s): Borja Freire, Susana Ladra, Jose R Paramá, Leena Salmela

Keyword(s): High Throughput Sequencing, De Novo, Supplementary Information, De Bruijn Graph, Viral Quasispecies, Sequencing Data, De Bruijn Graphs, Sequencing Errors, High Throughput Sequencing Data, De Bruijn

Abstract Motivation RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. Availability and implementation viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information Supplementary data are available at Bioinformatics online.

Accelerating De Bruijn Graph-Based Genome Assembly for High-Throughput Short Read Data

2013 International Conference on Parallel and Distributed Systems, 2013. DOI: 10.1109/icpads.2013.68. Cited by ~1.

Author(s): Kun Zhao, Weiguo Liu, Gerrit Voss, Wolfgang Mueller-Wittig

Keyword(s): High Throughput, Genome Assembly, De Bruijn Graph, Short Read, De Bruijn

ANGSD-wrapper: utilities for analyzing next generation sequencing data

2016. DOI: 10.7287/peerj.preprints.1472.

Author(s): Arun Durvasula, Paul J Hoffman, Tyler V Kent, Chaochih Liu, Thomas J Y Kono, ...

Keyword(s): High Throughput, High Throughput Sequencing, Molecular Ecology, Principal Component, Next Generation Sequencing Data, Sequencing Data, Genome Data, High Throughput Sequencing Data, Genome Wide, User Friendly

High-throughput sequencing has changed many aspects of population genetics, molecular ecology, and related fields, affecting both experimental design and data analysis. The software package ANGSD allows users to perform a number of population genetic analyses on high-throughput sequencing data. ANGSD uses probabilistic approaches to calculate genome-wide descriptive statistics. The package makes use of genotype likelihood estimates rather than SNP calls and is specifically designed to produce more accurate results for samples with low sequencing depth. ANGSD makes use of full genome data while handling a wide array of sampling and experimental designs. Here we present ANGSD-wrapper, a set of wrapper scripts that provide a user-friendly interface for running ANGSD and visualizing results. ANGSD-wrapper supports multiple types of analyses, including estimates of nucleotide sequence diversity, neutrality tests, principal component analysis, estimation of admixture proportions for individual samples, and calculation of statistics that quantify recent introgression. ANGSD-wrapper also provides interactive graphing of ANGSD results to enhance data exploration. We demonstrate the usefulness of ANGSD-wrapper by analyzing resequencing data from populations of wild and domesticated Zea. ANGSD-wrapper is freely available from https://github.com/mojaveazure/angsd-wrapper.

TALC: Transcript-level Aware Long Read Correction

2020. DOI: 10.1101/2020.01.10.901728. Cited by ~1.

Author(s): Lucile Broseus, Aubin Thomas, Andrew J. Oldfield, Dany Severac, Emeric Dubois, ...

Keyword(s): Transcriptome Sequencing, Transcript Level, De Bruijn Graph, Rna Seq, Sequencing Data, Sequencing Technologies, Long Reads, Long Read, De Bruijn, Rna Transcript

Abstract Motivation Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. Results We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. Availability and Implementation TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]
