de bruijn
Recently Published Documents


TOTAL DOCUMENTS: 991 (FIVE YEARS 164)

H-INDEX: 38 (FIVE YEARS 5)

2022 ◽  
Vol 345 (4) ◽  
pp. 112780
Author(s):  
Daniel Gabric ◽  
Joe Sawada

2021 ◽  
Author(s):  
Jamshed Khan ◽  
Marek Kokot ◽  
Sebastian Deorowicz ◽  
Rob Patro

The de Bruijn graph has become a key data structure in modern computational genomics, and of keen interest is its compacted variant. The compacted de Bruijn graph provides a lossless representation of the graph, and it is often considerably more efficient to store and process than its non-compacted counterpart. Construction of the compacted de Bruijn graph resides upstream of many genomic analyses. As the quantity of sequencing data and the number of reference genomes on which to perform these analyses grow rapidly, efficient construction of the compacted graph becomes a computational bottleneck for these tasks. We present Cuttlefish 2, significantly advancing the existing state-of-the-art methods for construction of this graph. On a typical shared-memory machine, it reduces the construction of the compacted de Bruijn graph for 661K bacterial genomes (2.58 Tbp of input reference genomes) from about 4.5 days to 17–23 hours. Similarly, on sequencing data, it constructs the graph for a 1.52 Tbp white spruce read set in about 10 hours, while the closest competitor, which also uses considerably more memory, requires 54–58 hours. Cuttlefish 2 is implemented in C++14, and is available as open-source software under a BSD-3-Clause license at https://github.com/COMBINE-lab/cuttlefish.
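To make the idea of compaction concrete, here is a minimal Python sketch. It is a toy illustration only and not the Cuttlefish 2 algorithm: it assumes a node-centric graph in which an edge exists whenever two present k-mers overlap by k-1 characters, it ignores reverse complements, and the function name `compact_de_bruijn` is hypothetical. Each maximal non-branching path is collapsed into a single string (a unitig), which is what makes the compacted graph lossless yet smaller.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def compact_de_bruijn(seqs, k):
    """Return the sorted unitigs (maximal non-branching paths) as strings.

    Assumptions (not from the abstract): node-centric graph, forward
    strand only, edges inferred from (k-1)-overlaps between present k-mers.
    """
    nodes = set()
    for s in seqs:
        nodes.update(kmers(s, k))

    def successors(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in nodes]

    def predecessors(u):
        return [c + u[:-1] for c in "ACGT" if c + u[:-1] in nodes]

    def next_in_unitig(u):
        # u and v merge only if v is u's sole successor and u is v's sole predecessor.
        succ = successors(u)
        if len(succ) == 1 and succ[0] != u and len(predecessors(succ[0])) == 1:
            return succ[0]
        return None

    def prev_in_unitig(u):
        pred = predecessors(u)
        if len(pred) == 1 and pred[0] != u and len(successors(pred[0])) == 1:
            return pred[0]
        return None

    unitigs, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Walk backwards to the first node of this unitig (stop on a full cycle).
        first = start
        while True:
            p = prev_in_unitig(first)
            if p is None or p == start:
                break
            first = p
        # Walk forwards, collecting the path.
        path, cur = [first], first
        seen.add(first)
        while True:
            n = next_in_unitig(cur)
            if n is None or n in seen:
                break
            path.append(n)
            seen.add(n)
            cur = n
        # Spell the path: first k-mer plus the last character of each later k-mer.
        unitigs.append(path[0] + "".join(v[-1] for v in path[1:]))
    return sorted(unitigs)


if __name__ == "__main__":
    # Two sequences that share a prefix and then diverge.
    print(compact_de_bruijn(["ATGGCGT", "ATGGAGT"], 3))
```

In this example the eight 3-mers collapse into three unitigs: the shared prefix and one unitig per diverging branch.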


2021 ◽  
Author(s):  
Hans de Bruijn

We can hardly overestimate the importance of privacy in our data-driven world. Privacy breaches are not just about disclosing information. Personal data is used to profile and manipulate us – sometimes on such a large scale that it affects society as a whole. What can governments do to protect our privacy? In The Governance of Privacy, Hans de Bruijn first analyses the complexity of the governance challenge, using the metaphor of a journey. At the start, users have strong incentives to share data. Harvested data continue the journey that might lead to a privacy breach, but not necessarily – it can also lead to highly valued services. That is why both preparedness at the start of the journey and resilience during the journey are crucial to privacy protection. The book then explores three strategies to deal with governments, the market, and society. Governments can use the power of the law; they can exploit the power of the market by stimulating companies to compete on privacy; and they can empower society, strengthening its resilience in a data-driven world.


Author(s):  
Dmitry N. Shubin ◽  
Nikolai A. Kandaurov ◽  
Julia M. Serebrennikova ◽  
Kirill U. Sokolov

Foundations ◽  
2021 ◽  
Vol 1 (2) ◽  
pp. 256-264
Author(s):  
Takuya Yamano

A non-uniform (skewed) mixture of probability density functions occurs in various disciplines. One needs a measure of similarity to the respective constituents and its bounds. We introduce a skewed Jensen–Fisher divergence based on relative Fisher information, and provide some bounds in terms of the skewed Jensen–Shannon divergence and of the variational distance. The defined measure coincides with the definition from the skewed Jensen–Shannon divergence via the de Bruijn identity. Our results follow from applying the logarithmic Sobolev inequality and Poincaré inequality.
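For orientation, the standard identities behind this abstract can be written out as follows. This is a sketch under stated assumptions and not the paper's own notation: Z is a standard Gaussian independent of X, h is differential entropy, J is Fisher information, and in the relative form both densities are assumed to evolve under the same heat equation. The paper's skewed divergences add mixture weights that are not reproduced here.

```latex
% Classical de Bruijn identity
\frac{d}{dt}\, h\!\left(X + \sqrt{t}\,Z\right)
  = \frac{1}{2}\, J\!\left(X + \sqrt{t}\,Z\right)

% Relative Fisher information of p with respect to q
I(p \,\|\, q) = \int p(x)\, \left| \nabla \ln \frac{p(x)}{q(x)} \right|^{2} dx

% Relative form, assuming p_t and q_t both solve \partial_t f = \tfrac{1}{2}\,\Delta f
\frac{d}{dt}\, D_{\mathrm{KL}}\!\left(p_t \,\|\, q_t\right)
  = -\frac{1}{2}\, I\!\left(p_t \,\|\, q_t\right)
```

The last line is what links a Kullback–Leibler-based divergence (such as Jensen–Shannon) to a Fisher-information-based one (such as Jensen–Fisher).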


2021 ◽  
Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Gunnar Rätsch ◽  
André Kahles

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in the genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
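The two ideas in this abstract, rank-based indexing of attributes and delta-like coding, can be illustrated with a small Python sketch. It is an assumption-laden toy and not the authors' implementation: the classes `RankBitVector` and `CountColumn` are hypothetical names, the bit vector is uncompressed, and the delta coding runs along a simple list of positions instead of along graph neighbors.

```python
from itertools import accumulate


class RankBitVector:
    """Plain bit vector with a precomputed rank table (no compression)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1s in bits[:i]
        self.prefix = [0] + list(accumulate(self.bits))

    def rank1(self, i):
        """Number of set bits strictly before position i."""
        return self.prefix[i]


class CountColumn:
    """One annotation column: a bit per node, plus one count per set bit.

    The rank of a set bit is the index of its count in `values`, so the
    binary matrix itself provides the addressing for the attributes.
    """

    def __init__(self, counts_by_node, num_nodes):
        self.bv = RankBitVector(
            1 if node in counts_by_node else 0 for node in range(num_nodes)
        )
        self.values = [counts_by_node[node] for node in sorted(counts_by_node)]

    def count(self, node):
        if not self.bv.bits[node]:
            return 0  # node does not carry this label
        return self.values[self.bv.rank1(node)]


def delta_encode(values):
    """Store the first value, then differences between consecutive values."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


if __name__ == "__main__":
    col = CountColumn({1: 7, 4: 2, 5: 9}, num_nodes=6)
    print([col.count(n) for n in range(6)])
    # Positions that shift by 1 between neighbors become runs of 1s.
    print(delta_encode([100, 101, 102, 103]))
```

Runs of identical small deltas are far easier to compress than the raw positions, which is the regularity the abstract describes.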


2021 ◽  
Author(s):  
Hector Roux de Bezieux ◽  
Leandro Lima ◽  
Fanny Perraudeau ◽  
Arnaud Mary ◽  
Sandrine Dudoit ◽  
...  

Genome-wide association studies (GWAS), aiming to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single nucleotide polymorphisms to mobile genetic elements. Since many bacterial species include genes that are not shared among all strains, this approach avoids the reliance on a common reference genome. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects when trying to detect its association with a phenotype through k-mer-based GWAS. Here we propose to overcome this by testing covariates built from closed connected subgraphs of the de Bruijn graph (DBG) defined over genomic k-mers. These covariates are able to capture polymorphic genes as a single entity, improving k-mer-based GWAS in terms of power and interpretability. As the number of subgraphs is exponential in the number of nodes in the DBG, a method naively testing all possible subgraphs would result in very low statistical power due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has successfully been used to address both problems in similar contexts. We leverage this concept to test all closed connected subgraphs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. We illustrate this on both real and simulated datasets and also demonstrate how considering subgraphs leads to a more powerful and interpretable method. Our method integrates with existing visual tools to facilitate interpretation. We also provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_Recomb.


2021 ◽  
Vol 2090 (1) ◽  
pp. 012047
Author(s):  
Pedro J. Roig ◽  
Salvador Alcaraz ◽  
Katja Gilly ◽  
Cristina Bernad ◽  
Carlos Juiz

Working with ever-growing datasets can be time-consuming and resource-exhausting. To process the items in such datasets efficiently, de Bruijn sequences may be an interesting option because of their special characteristics: they allow every possible combination of data to be visited exactly once. Such sequences are one-dimensional, although the same principle can be extended to more dimensions, such as de Bruijn tori for two-dimensional patterns or de Bruijn hypertori for three-dimensional patterns, and could in principle be extended to arbitrarily many dimensions. In this context, the main features of all these de Bruijn shapes are presented, along with some particular instances that may be useful for pattern location in one, two, and three dimensions.
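As an illustration of the one-dimensional case, the following Python sketch generates a de Bruijn sequence with the standard Lyndon-word concatenation approach. The abstract does not say which construction the authors use, so this is one well-known option and not necessarily theirs; the function name `de_bruijn_sequence` is chosen here for illustration.

```python
def de_bruijn_sequence(alphabet, n):
    """Return a cyclic de Bruijn sequence of order n over `alphabet`.

    Read cyclically, every length-n word over the alphabet occurs
    exactly once. Built by concatenating Lyndon words whose length
    divides n, in lexicographic order.
    """
    k = len(alphabet)
    a = [0] * (n + 1)  # working word, 1-indexed; a[0] stays 0
    out = []

    def extend(t, p):
        if t > n:
            # a[1..p] is a Lyndon word; keep it if its length divides n.
            if n % p == 0:
                out.extend(a[1:p + 1])
            return
        # Repeat the current period.
        a[t] = a[t - p]
        extend(t + 1, p)
        # Or try every larger symbol, which starts a new period of length t.
        for symbol in range(a[t - p] + 1, k):
            a[t] = symbol
            extend(t + 1, t)

    extend(1, 1)
    return "".join(alphabet[i] for i in out)


def cyclic_windows(s, n):
    """All length-n substrings of s, wrapping around the end."""
    doubled = s + s[:n - 1]
    return [doubled[i:i + n] for i in range(len(s))]


if __name__ == "__main__":
    seq = de_bruijn_sequence("01", 3)
    print(seq)
    print(sorted(cyclic_windows(seq, 3)))
```

For a binary alphabet and n = 3 the sequence has length 8, and its eight cyclic windows are exactly the eight 3-bit words.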

