Squeakr: An Exact and Approximate k-mer Counting System

2017 ◽  
Author(s):  
Prashant Pandey ◽  
Michael A. Bender ◽  
Rob Johnson ◽  
Rob Patro

Abstract Motivation k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing (HTS) data. These algorithms span the gamut of the analysis pipeline, from k-mer counting (e.g., for estimating assembly parameters) to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this paper, we set forth the fundamental operations for maintaining multisets of k-mers and classify existing systems from a data-structural perspective. We then show how to build a k-mer-counting and multiset-representation system using the counting quotient filter (CQF), a feature-rich approximate membership query (AMQ) data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the CQF. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers. Results Squeakr takes 2×–4.3× less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5×–4.3× less memory than the state-of-the-art. It offers competitive counting performance and answers point queries (i.e., queries for the abundance of a particular k-mer) over an order of magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g., de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification. Availability: https://github.com/splatlab/ [email protected]
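To make the multiset operations above concrete, here is a minimal Python sketch of exact canonical k-mer counting; a plain dictionary stands in for the counting quotient filter, and all names are illustrative rather than taken from the Squeakr codebase.

```python
from collections import Counter

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = kmer.translate(comp)[::-1]
    return min(kmer, rc)

def count_kmers(reads, k):
    """Exact k-mer multiset; a dict stands in for Squeakr's counting quotient filter."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):      # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1  # insert operation
    return counts

counts = count_kmers(["ACGTACGTT", "CGTACG"], k=5)
print(counts["ACGTA"])  # point query: abundance of a particular k-mer
```

Squeakr exposes the same insert and point-query operations but backs them with the CQF, which keeps the structure compact while still supporting deletions and count updates.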

Author(s):  
Amatur Rahman ◽  
Paul Medvedev

Abstract Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order of magnitude less disk space than other lossless compression tools. The second is an exact static k-mer membership index, UST-FM, which we show reduces index size by 10–44% compared to other state-of-the-art low-memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
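As a rough illustration of the SPSS idea (not the UST algorithm itself), the following Python sketch glues k-mers that overlap by k−1 bases into longer strings whose combined k-mer content equals the input set; the function name and the greedy extension order are illustrative only.

```python
def spss_greedy(kmers, k):
    """Toy spectrum-preserving string set: start from an unused k-mer and greedily
    append any unused k-mer that overlaps the current string's suffix by k-1 bases.
    A stand-in for the path cover UST computes on the compacted de Bruijn graph;
    no near-optimality is implied."""
    unused = set(kmers)
    strings = []
    while unused:
        s = unused.pop()
        grew = True
        while grew:
            grew = False
            for kmer in list(unused):
                if kmer[:k - 1] == s[-(k - 1):]:
                    s += kmer[-1]           # each join saves k-1 characters
                    unused.remove(kmer)
                    grew = True
                    break
        strings.append(s)
    return strings

kmers = {"ACGT", "CGTA", "GTAC", "TTTT"}
spss = spss_greedy(kmers, k=4)
# The k-mer spectrum of the output equals the input set, in less total space.
assert {s[i:i + 4] for s in spss for i in range(len(s) - 3)} == kmers
print(spss)
```

Each join saves k−1 characters relative to storing the two k-mers separately, which is where the space reduction over a raw k-mer list comes from; UST performs the analogous gluing as a near-optimal path cover on the compacted de Bruijn graph.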


2020 ◽  
Vol 21 (S8) ◽  
Author(s):  
Nicola Prezza ◽  
Nadia Pisanti ◽  
Marinella Sciortino ◽  
Giovanna Rosone

Abstract Background In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs, and it was too slow and memory-consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT. Results In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel. Conclusions Results on a synthetic dataset covered at 30× (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. In these cases too, our tool exhibits a much higher sensitivity than the state-of-the-art tool.
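For readers unfamiliar with the BWT itself, a minimal rotation-based construction is sketched below; real read-set BWTs, as used by ebwt2InDel, are built with specialised memory-frugal algorithms rather than this quadratic-memory approach.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (fine for small inputs only)."""
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

# Positions sharing the same right-context end up adjacent after sorting, so the
# symbols preceding that context cluster together in the BWT; ebwt2InDel exploits
# such clusters in the BWT of a whole read set to spot SNPs and INDELs without a
# reference or an assembly.
print(bwt("GATTACA"))
```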


Author(s):  
P. Branco ◽  
L. Fiolhais ◽  
M. Goulão ◽  
P. Martins ◽  
P. Mateus ◽  
...  

Oblivious Transfer (OT) is a fundamental primitive in cryptography, supporting protocols such as Multi-Party Computation and Private Set Intersection (PSI), which are used in applications like contact discovery, remote diagnosis and contact tracing. Due to its fundamental nature, it is critically important that its execution remain secure even when arbitrarily composed with other instances of the same, or other, protocols. This property can be guaranteed by proving its security under the Universal Composability model. Herein, a 3-round Random Oblivious Transfer (ROT) protocol is proposed that achieves high computational efficiency in the Random Oracle Model. The security of the protocol is based on the Ring Learning With Errors assumption (for which no quantum solver is known). ROT is the basis for OT extensions and thus achieves wide applicability, without the overhead of compiling ROTs from OTs. Finally, the protocol is implemented on a server-class Intel processor and four application-class ARM processors, all with different architectures. The usage of vector instructions provides on average a 40% speedup. The implementation shows that our proposal is at least one order of magnitude faster than the state-of-the-art, and is suitable for a wide range of applications in embedded systems, IoT, desktops, and servers. From a memory footprint perspective, there is a small increase (16%) when compared to the state-of-the-art. This increase is marginal and should not prevent the usage of the proposed protocol on a multitude of devices. In sum, the proposal achieves up to 37k ROTs/s on an Intel server-class processor and up to 5k ROTs/s on an ARM application-class processor. A PSI application using the proposed ROT is up to 6.6 times faster than the related art.
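The sketch below mocks the ideal random-OT functionality that the proposed RLWE-based protocol realises, purely to show the interface that OT extensions and PSI build on; it contains none of the paper's cryptography, and the function name is illustrative.

```python
import secrets

def ideal_random_ot(n_bytes: int = 16):
    """Mock of the ideal random-OT functionality (what a ROT protocol delivers,
    stripped of all cryptography): the sender obtains two uniformly random
    messages, the receiver obtains a random choice bit and only the
    corresponding message."""
    m0 = secrets.token_bytes(n_bytes)
    m1 = secrets.token_bytes(n_bytes)
    b = secrets.randbits(1)
    sender_output = (m0, m1)
    receiver_output = (b, m1 if b else m0)
    return sender_output, receiver_output

# A chosen-input OT can later be derived from a random OT with one extra message
# (derandomisation), which is why cheap ROTs are the raw material for OT extension.
sender, receiver = ideal_random_ot()
b, mb = receiver
assert mb == sender[b]
```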


Plant Methods ◽  
2019 ◽  
Vol 15 (1) ◽  
Author(s):  
Haipeng Xiong ◽  
Zhiguo Cao ◽  
Hao Lu ◽  
Simon Madec ◽  
Liang Liu ◽  
...  

Abstract Background Grain yield of wheat is greatly associated with the population of wheat spikes, i.e., spike number m⁻². To obtain this index in a reliable and efficient way, it is necessary to count wheat spikes accurately and automatically. Computer vision technologies have shown great potential to automate this task effectively and at low cost. In particular, counting wheat spikes is a typical visual counting problem, extensively studied in computer vision under the name of object counting. TasselNet, which represents one of the state-of-the-art counting approaches, is a convolutional neural network-based local regression model and currently holds the best record on counting maize tassels. However, when applied to wheat spikes, it cannot predict accurate counts when spikes are only partially present. Results In this paper, we make an important observation that the counting performance of local regression networks can be significantly improved by adding visual context to the local patches. Meanwhile, such context can be treated as part of the receptive field without increasing the model capacity. We thus propose a simple yet effective contextual extension of TasselNet, termed TasselNetv2. When TasselNetv2 is implemented in a fully convolutional form, both training and inference can be greatly sped up by eliminating redundant computations. We also collected and labeled a large-scale wheat spike counting (WSC) dataset, with 1764 high-resolution images and 675,322 manually-annotated instances. Extensive experiments show that TasselNetv2 not only achieves state-of-the-art performance on the WSC dataset (91.01% counting accuracy) but is also more than an order of magnitude faster than TasselNet (13.82 fps on 912×1216 images). The generality of TasselNetv2 is further demonstrated by advancing the state of the art on both the Maize Tassels Counting and ShanghaiTech Crowd Counting datasets. Conclusions This paper describes TasselNetv2 for counting wheat spikes, which simultaneously addresses two important use cases in plant counting: improving the counting accuracy without increasing model capacity, and improving efficiency without sacrificing accuracy. It is a promising candidate for deployment in real-time systems with high-throughput demands. In particular, TasselNetv2 achieves sufficiently accurate results when trained from scratch with small networks, and adopting larger pre-trained networks can further boost accuracy. In practice, one can trade off performance and efficiency according to the application scenario. Code and models are made available at: https://tinyurl.com/TasselNetv2.
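The local-regression formulation can be made concrete with a small sketch of how per-patch count targets are derived from dot annotations; this illustrates the general idea behind TasselNet-style counting and is not code from the TasselNetv2 release.

```python
import numpy as np

def local_count_targets(dot_map: np.ndarray, patch: int, stride: int) -> np.ndarray:
    """Regression targets for a local counting network: the number of annotated
    points falling inside each (patch x patch) window, sampled every `stride`
    pixels. The network is trained to regress these per-window counts."""
    h, w = dot_map.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    return np.array([[dot_map[r:r + patch, c:c + patch].sum() for c in cols]
                     for r in rows])

# Toy 8x8 image with three annotated spike centres.
dots = np.zeros((8, 8))
dots[1, 1] = dots[2, 6] = dots[6, 3] = 1
print(local_count_targets(dots, patch=4, stride=4))
# When windows overlap (stride < patch), summing the targets overcounts; at
# inference the per-window predictions are normalised back to a global count.
```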


2017 ◽  
Author(s):  
Chelsea J.-T. Ju ◽  
Jyun-Yu Jiang ◽  
Ruirui Li ◽  
Zeyu Li ◽  
Wei Wang

Abstract k-mer profiling has become one of the trending approaches to analyzing read data generated by high-throughput sequencing technologies. The tasks of k-mer profiling include, but are not limited to, counting the frequencies and determining the occurrences of short sequences in a dataset. The notion of a k-mer has been extensively used to build de Bruijn graphs in genome or transcriptome assembly, which requires examining all possible k-mers present in the dataset. Recently, an alternative way of profiling has been proposed, which constructs a set of representative k-mers as genomic markers and profiles their occurrences in the sequencing data. This technique has been applied to both transcript quantification through RNA-Seq and taxonomic classification of metagenomic reads. Most of these applications use a set of fixed-size k-mers, since the majority of existing k-mer counters cannot process genomic sequences with variable-length k-mers. However, choosing the appropriate k is challenging, as it varies across applications. As pioneering work on profiling a set of variable-length k-mers, we propose TahcoRoll, an enhancement of the Aho-Corasick algorithm. More specifically, we use one bit to represent each nucleotide and integrate the rolling hash technique to construct an efficient in-memory data structure for this task. On both synthetic and real datasets, results show that TahcoRoll outperforms existing approaches in time efficiency, memory efficiency, or both, without using any disk space. In addition, compared to the most efficient state-of-the-art k-mer counters, such as KMC and MSBWT, TahcoRoll is the only approach that can process long-read data from both PacBio and Oxford Nanopore on a commodity desktop computer. The source code of TahcoRoll is implemented in C++14 and available at https://github.com/chelseaju/TahcoRoll.git.
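A rolling k-mer encoding is the core trick that keeps such scans fast. The sketch below uses a standard two-bit-per-nucleotide code updated in constant time per position; TahcoRoll's own scheme, with its one-bit representation and Aho-Corasick integration, is more involved.

```python
# Standard 2-bit nucleotide code; this is an illustrative stand-in, not TahcoRoll's.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def rolling_kmer_codes(read: str, k: int):
    """Yield (position, integer code) for every k-mer, updating the code in O(1)
    by shifting out the leftmost base and appending the new one."""
    mask = (1 << (2 * k)) - 1
    code = 0
    for i, base in enumerate(read):
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield i - k + 1, code

targets = {code for _, code in rolling_kmer_codes("ACGTT", 3)}   # marker k-mers
hits = [pos for pos, code in rolling_kmer_codes("TTACGTTAC", 3) if code in targets]
print(hits)   # positions in the read where a marker 3-mer occurs
```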


Author(s):  
Masaaki Nishino ◽  
Norihito Yasuda ◽  
Kengo Nakamura

Exact cover refers to the problem of finding a subfamily F of a given family of sets S with universe D such that F forms a partition of D. Knuth's Algorithm DLX is a state-of-the-art method for solving exact cover problems. Since DLX's running time depends on the cardinality of the input S, it can be slow if S is large. Our proposal improves on DLX by exploiting a novel data structure, DanceDD, which extends the zero-suppressed binary decision diagram (ZDD) with links that enable efficient modification of the structure. With DanceDD, we can represent S in a compressed form and, using link operations, perform the search in time linear in the size of the structure. The experimental results show that our method is an order of magnitude faster when the problem instance is highly compressible.
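For readers who want the exact-cover setting spelled out, here is a compact recursive version of Knuth's Algorithm X over ordinary Python sets; DLX implements the same search with dancing links, and DanceDD additionally compresses the set family so that the work tracks the compressed size.

```python
def exact_cover(universe, sets, partial=None):
    """Minimal recursive Algorithm X: pick an uncovered element, branch on the
    sets containing it, and recurse on the reduced instance."""
    if partial is None:
        partial = []
    if not universe:
        yield list(partial)
        return
    # Heuristic: branch on the element covered by the fewest sets.
    element = min(universe, key=lambda e: sum(e in s for s in sets.values()))
    for name, s in sets.items():
        if element not in s:
            continue
        remaining = {n: t for n, t in sets.items() if not (t & s)}  # drop conflicting sets
        partial.append(name)
        yield from exact_cover(universe - s, remaining, partial)
        partial.pop()

# Toy instance: partition {1..7} using a subfamily of the named sets.
S = {"A": {1, 4, 7}, "B": {1, 4}, "C": {4, 5, 7},
     "D": {3, 5, 6}, "E": {2, 3, 6, 7}, "F": {2, 7}}
print(list(exact_cover({1, 2, 3, 4, 5, 6, 7}, S)))   # e.g. [['B', 'D', 'F']]
```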


2021 ◽  
Author(s):  
Mehmet Aziz Yirik ◽  
Maria Sorokina ◽  
Christoph Steinbeck

The generation of constitutional isomer chemical spaces has been a subject of cheminformatics since the early 1960s, with applications in structure elucidation and elsewhere. To perform such generation efficiently, exhaustively, and free of isomorphic duplicates, the structure generator needs to ensure that canonical graphs are built during the generation step itself rather than by subsequent filtering.

Here we present MAYGEN, an open-source, pure-Java development of a constitutional isomer molecular generator. The principles of MAYGEN's architecture and algorithm are outlined, and the software is benchmarked against the state-of-the-art but closed-source solution MOLGEN, as well as against the best open-source solution, OMG. MAYGEN outperforms OMG by an order of magnitude and approaches, and occasionally exceeds, the performance of MOLGEN.


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Jin Zhao ◽  
Haodi Feng ◽  
Daming Zhu ◽  
Chi Zhang ◽  
Ying Xu

Abstract Background Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability to reconstruct transcripts. However, the massive volume of short reads makes de novo transcript assembly an algorithmic challenge. Results We develop a novel framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs' termini. These reads can be found in time linear in their lengths through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST provides two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented both strategies and compared them with state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy always performs better on recall, and also better on precision for smaller datasets, while the hybrid strategy leads on precision for large datasets. Conclusions DTA-SiST performs more competitively than the other de novo assemblers compared, especially on precision, thanks to its read-based contig extension strategy and its transcript extraction rules.
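The contig-extension step can be illustrated with a naive overlap computation; the quadratic scan below stands in for the suffix-tree lookup that lets DTA-SiST find the best-overlapping read in linear time, and all names are illustrative.

```python
def longest_overlap(contig: str, read: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `contig` that is a prefix of `read`.
    A naive O(|contig| * |read|) stand-in for a suffix-tree lookup."""
    best = 0
    limit = min(len(contig), len(read))
    for length in range(min_overlap, limit + 1):
        if contig.endswith(read[:length]):
            best = length
    return best

def extend_contig(contig: str, reads, min_overlap: int = 3) -> str:
    """Greedily extend the contig to the right with the best-overlapping read."""
    scored = [(longest_overlap(contig, r, min_overlap), r) for r in reads]
    best_len, best_read = max(scored)
    return contig + best_read[best_len:] if best_len else contig

print(extend_contig("ACGTACGA", ["CGATTG", "TACGATT", "GGGG"]))  # ACGTACGATT
```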


2015 ◽  
Vol 6 (1) ◽  
pp. 190-196 ◽  
Author(s):  
Elisa A. Paoli ◽  
Federico Masini ◽  
Rasmus Frydendal ◽  
Davide Deiana ◽  
Christian Schlaup ◽  
...  

Well-defined mass-selected Ru and RuO2 nanoparticles exhibit an order of magnitude improvement in the oxygen evolution activity, relative to the state-of-the-art, with a maximum at around 3–5 nm.


2017 ◽  
Author(s):  
Jérémy Gauthier ◽  
Charlotte Mouden ◽  
Tomasz Suchan ◽  
Nadir Alvarez ◽  
Nils Arrigo ◽  
...  

Abstract We present an original method to call variants de novo from Restriction site associated DNA Sequencing (RAD-Seq) data. RAD-Seq is a technique characterized by the sequencing of specific loci along the genome; it is widely employed in the field of evolutionary biology since it allows variant (mainly SNP) information to be exploited from entire populations at a reduced cost. Common RAD-dedicated tools, such as STACKS or IPyRAD, are based on all-versus-all read comparisons, which require substantial time and computing resources. Based on the variant caller DiscoSnp, initially designed for shotgun sequencing, DiscoSnp-RAD avoids this pitfall, as variants are detected by exploring the de Bruijn graph built from all the read datasets. We tested the implementation on RAD data from 259 specimens of Chiastocheta flies, morphologically assigned to 7 species. All individuals were successfully assigned to their species using both STRUCTURE and Maximum Likelihood phylogenetic reconstruction. Moreover, the identified variants revealed within-species structure and the existence of two populations linked to their geographic distributions. Furthermore, our results show that DiscoSnp-RAD is at least one order of magnitude faster than state-of-the-art tools. The overall results show that DiscoSnp-RAD is suitable for identifying variants from RAD data, and it stands out from other tools due to its completely different principle, which makes it significantly faster, in particular on large datasets. License: GNU Affero General Public License. Availability: https://github.com/GATB/ [email protected]
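A toy view of the de Bruijn graph exploration behind DiscoSnp-style variant calling is sketched below: a SNP appears as a (k−1)-mer that extends with two different bases, opening two parallel paths. The sketch only reports these branch openings and does not check that the paths reconverge, as the real tool does; names and the example reads are illustrative.

```python
def kmers(seq, k):
    """Set of k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def snp_branchings(kmer_set):
    """Report (k-1)-mers that extend with more than one base in the de Bruijn
    graph; a SNP bubble opens at such a branching."""
    extensions = {}
    for km in kmer_set:
        extensions.setdefault(km[:-1], set()).add(km[-1])
    return {prefix: bases for prefix, bases in extensions.items() if len(bases) > 1}

# Two alleles differing at one position (a SNP) seen across the read set.
reads = ["GATTACAGG", "GATTCCAGG"]
all_kmers = set().union(*(kmers(r, 4) for r in reads))
print(snp_branchings(all_kmers))   # {'ATT': {'A', 'C'}}
```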

