GRAPES-DD: exploiting decision diagrams for index-driven search in biological graph databases

Abstract Background: Graphs are mathematical structures widely used for expressing relationships among elements when representing biomedical and biological information. On top of these representations, several analyses are performed. A common task is the search for one substructure, called the query, within one graph, called the target. The problem is referred to as one-to-one subgraph search, and it is known to be NP-complete. Heuristics and indexing techniques can be applied to facilitate the search. Indexing techniques are also exploited when searching a collection of target graphs, referred to as the one-to-many subgraph problem. Filter-and-verification methods that use indexing approaches provide fast pruning of target graphs, or parts of them, that do not contain the query. The expensive verification phase is then performed only on the subset of promising targets. Indexing strategies extract graph features at a granularity level sufficient for a powerful filtering step. Features are stored in data structures that allow efficient access. Index size, querying time and filtering power are key points for the development of efficient subgraph searching solutions. Results: An existing approach, GRAPES, has been shown to perform well in terms of speed-up for both the one-to-one and one-to-many cases. However, it suffers from the size of the built index. For this reason, we propose GRAPES-DD, a modified version of GRAPES in which the indexing structure has been replaced with a Decision Diagram. Decision Diagrams are a broad class of data structures widely used to encode and manipulate functions efficiently. Experiments on biomedical structures and synthetic graphs have confirmed our expectation, showing that GRAPES-DD substantially reduces memory utilization compared to GRAPES without worsening the searching time.
Conclusion: The use of Decision Diagrams for searching in biochemical and biological graphs is completely new and potentially promising thanks to their ability to compactly encode sets by exploiting their structure and regularity, and to manipulate entire sets of elements at once instead of exploring each single element explicitly. Search strategies based on Decision Diagrams make indexing more affordable, for biochemical graphs and beyond, allowing us to potentially deal with huge and ever-growing collections of biochemical and biological structures.
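
The filter-and-verification scheme described in this abstract can be illustrated with a minimal Python sketch. It is not the GRAPES or GRAPES-DD implementation: the feature choice (counts of labelled simple paths with at most three vertices), the plain dictionary used as an index, and the names `path_features`, `passes_filter`, `contains` and `search` are all assumptions made for illustration. The verification step is a brute-force check, which is exponential and only suitable for toy inputs.

```python
from collections import Counter
from itertools import permutations

def path_features(labels, edges, max_len=3):
    """Count the label sequences of all simple paths with up to max_len vertices."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) < max_len:
            for w in adj[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for v in labels:
        extend([v])
    return counts

def passes_filter(query_feats, target_feats):
    # A target can contain the query only if it has at least as many
    # occurrences of every query feature.
    return all(target_feats[f] >= c for f, c in query_feats.items())

def contains(query, target):
    """Brute-force verification: try every injective vertex mapping."""
    q_labels, q_edges = query
    t_labels, t_edges = target
    t_adj = {frozenset(e) for e in t_edges}
    q_nodes = list(q_labels)
    for image in permutations(t_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if all(q_labels[v] == t_labels[m[v]] for v in q_nodes) and all(
            frozenset((m[u], m[v])) in t_adj for u, v in q_edges
        ):
            return True
    return False

def search(query, database):
    q_feats = path_features(*query)
    # In a real system the index is built once, offline.
    index = {name: path_features(*g) for name, g in database.items()}
    candidates = [n for n, f in index.items() if passes_filter(q_feats, f)]
    matches = [n for n in candidates if contains(query, database[n])]
    return candidates, matches

# Toy data (hypothetical): the query is a triangle of 'C' vertices.
query = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (2, 0)])
database = {
    "triangle_plus": ({i: "C" for i in range(4)}, [(0, 1), (1, 2), (2, 0), (2, 3)]),
    "hexagon": ({i: "C" for i in range(6)}, [(i, (i + 1) % 6) for i in range(6)]),
    "chain_N": ({0: "C", 1: "N", 2: "C"}, [(0, 1), (1, 2)]),
}
candidates, matches = search(query, database)
```

In this toy run the filter discards `chain_N` without any verification, while `hexagon` passes the filter but is rejected by verification. This shows why filtering power and index size both matter: short path features are cheap to store but let some false candidates through.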

GRAPES-DD: Exploiting Decision Diagrams for Index-Driven Biomedical Databases Search

10.21203/rs.3.rs-48943/v1 ◽

2020 ◽

Author(s):

Nicola Licheri ◽

Vincenzo Bonnici ◽

Marco Beccuti ◽

Rosalba Giugno

Keyword(s):

Data Structures ◽

Broad Class ◽

Biological Information ◽

Single Element ◽

Decision Diagrams ◽

Indexing Structure ◽

Indexing Techniques ◽

Verification Methods ◽

Granularity Level ◽

One To One

Abstract Background: Graphs are mathematical structures widely used for expressing relationships among elements when representing biomedical and biological information. On top of these representations, several analyses are conducted. A common task is the search for one substructure, called the query, within one graph, called the target. The problem is referred to as one-to-one subgraph search, and it has been shown to be NP-complete. However, heuristics and indexing techniques can be applied to facilitate the search. Such indexing techniques are also exploited when searching a collection of target graphs, referred to as the one-to-many subgraph problem. Filter-and-verification methods that use indexing approaches provide fast pruning of target graphs, or parts of them, that do not contain the query. The expensive verification phase is then performed only on the subset of promising targets. Indexing strategies extract graph features at a granularity level sufficient for a powerful filtering step. Features are stored in data structures that allow efficient access. Index size, querying time and filtering power are key points for the development of efficient subgraph searching solutions. Results: An existing subgraph approach, GRAPES, has been shown to perform well in terms of speed-up for both the one-to-one and one-to-many cases. However, it suffers from the size of the built index. For this reason, we propose GRAPES-DD, a modified version of GRAPES in which the indexing structure has been replaced with a Decision Diagram. Decision Diagrams are a broad class of data structures widely used to encode and manipulate functions efficiently. Our experiments on real biomedical structures and synthetic graphs have confirmed our expectation, showing that GRAPES-DD substantially reduces memory utilization with respect to GRAPES without worsening the searching time.
Conclusion: The use of Decision Diagrams for searching in biomedical graphs is completely new and potentially promising thanks to their ability (i) to compactly encode sets by exploiting their structure and regularity, and (ii) to manipulate entire sets of elements at once instead of exploring each single element explicitly. This work shows that search strategies based on Decision Diagrams make indexing more scalable and affordable, for biochemical graphs and beyond, allowing us to potentially deal with huge and ever-growing collections of biomedical structures.

DenseZDD: A Compact and Fast Index for Families of Sets

Algorithms ◽

10.3390/a11080128 ◽

2018 ◽

Vol 11 (8) ◽

pp. 128 ◽

Cited By ~ 1

Author(s):

Shuhei Denzumi ◽

Jun Kawahara ◽

Koji Tsuda ◽

Hiroki Arimura ◽

Shin-ichi Minato ◽

...

Keyword(s):

Data Structure ◽

Data Structures ◽

Information Integration ◽

Decision Diagrams ◽

Web Information Retrieval ◽

Binary Decision ◽

Web Information ◽

Set Operations ◽

Succinct Data Structure ◽

The Family

In this article, we propose a succinct data structure of zero-suppressed binary decision diagrams (ZDDs). A ZDD represents sets of combinations efficiently and we can perform various set operations on the ZDD without explicitly extracting combinations. Thanks to these features, ZDDs have been applied to web information retrieval, information integration, and data mining. However, to support rich manipulation of sets of combinations and update ZDDs in the future, ZDDs need too much space, which means that there is still room to be compressed. The paper introduces a new succinct data structure, called DenseZDD, for further compressing a ZDD when we do not need to conduct set operations on the ZDD but want to examine whether a given set is included in the family represented by the ZDD, and count the number of elements in the family. We also propose a hybrid method, which combines DenseZDDs with ordinary ZDDs. By numerical experiments, we show that the sizes of our data structures are three times smaller than those of ordinary ZDDs, and membership operations and random sampling on DenseZDDs are about ten times and three times faster than those on ordinary ZDDs for some datasets, respectively.

Bringing radiomics into a multi-omics framework for a comprehensive genotype–phenotype characterization of oncological diseases

Journal of Translational Medicine ◽

10.1186/s12967-019-2073-2 ◽

2019 ◽

Vol 17 (1) ◽

Cited By ~ 13

Author(s):

Mario Zanfardino ◽

Monica Franzese ◽

Katia Pane ◽

Carlo Cavaliere ◽

Serena Monti ◽

...

Keyword(s):

Data Integration ◽

Data Structures ◽

R Package ◽

Biological Information ◽

Phenotype Definition ◽

Current State ◽

Oncological Diseases ◽

Cancer Phenotype

Abstract Genomic and radiomic data integration, namely radiogenomics, can provide meaningful knowledge in cancer diagnosis, prognosis and treatment. Although several data structures based on multi-layer architectures have been proposed to combine multi-omic biological information, none of these has been designed and assessed to include radiomic data as well. To meet this need, we propose to use the MultiAssayExperiment (MAE), an R package that provides data structures and methods for manipulating and integrating multi-assay experiments, as a suitable tool to manage radiogenomic experiment data. To this aim, we first examine the role of radiogenomics in cancer phenotype definition, then the current state of radiogenomics data integration in public repositories and, finally, the challenges and limitations of including radiomics in MAE, designing an extended framework and showing its application on a case study from the TCGA-TCIA archives. Radiomic and genomic data from 91 patients have been successfully integrated in a single MAE object, demonstrating the suitability of the MAE data structure as a container of radiogenomic data.

Storing Set Families More Compactly with Top ZDDs

Algorithms ◽

10.3390/a14060172 ◽

2021 ◽

Vol 14 (6) ◽

pp. 172

Author(s):

Kotaro Matsuda ◽

Shuhei Denzumi ◽

Kunihiko Sadakane

Keyword(s):

Data Structures ◽

Binary Decision Diagrams ◽

Real Data ◽

Directed Acyclic Graphs ◽

Main Memory ◽

Large Set ◽

Decision Diagrams ◽

Binary Decision ◽

Acyclic Graphs

Zero-suppressed Binary Decision Diagrams (ZDDs) are data structures for representing set families in a compressed form. With ZDDs, many valuable operations on set families can be done in time polynomial in ZDD size. In some cases, however, the size of ZDDs for representing large set families becomes too huge to store them in the main memory. This paper proposes top ZDD, a novel representation of ZDDs which uses less space than existing ones. The top ZDD is an extension of the top tree, which compresses trees, to compress directed acyclic graphs by sharing identical subgraphs. We prove that navigational operations on ZDDs can be done in time poly-logarithmic in ZDD size, and show that there exist set families for which the size of the top ZDD is exponentially smaller than that of the ZDD. We also show experimentally that our top ZDDs have smaller sizes than ZDDs for real data.
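
The idea of compressing a structure by sharing identical substructures can be illustrated with a short Python sketch. This is not the top ZDD or top tree construction from the paper; it only shows the simpler step of turning a labelled binary tree into a DAG by giving identical subtrees a single canonical id. The names `compress`, `full_tree` and `size` are hypothetical.

```python
def compress(tree, table):
    """Return a canonical id for `tree`.

    `tree` is None or a tuple (label, left, right); `table` maps
    (label, left_id, right_id) to an id, so equal subtrees share one entry.
    """
    if tree is None:
        return 0                       # id 0 is reserved for the empty subtree
    label, left, right = tree
    key = (label, compress(left, table), compress(right, table))
    if key not in table:
        table[key] = len(table) + 1
    return table[key]

def full_tree(depth, label="x"):
    """Complete binary tree of the given depth with identical labels."""
    if depth == 0:
        return None
    sub = full_tree(depth - 1, label)
    return (label, sub, sub)

def size(tree):
    return 0 if tree is None else 1 + size(tree[1]) + size(tree[2])

tree = full_tree(10)
table = {}
root_id = compress(tree, table)
tree_nodes = size(tree)    # nodes in the uncompressed tree
dag_nodes = len(table)     # distinct subtrees after sharing
```

For this deliberately regular input, a tree with 1023 nodes collapses to 10 shared nodes, one per depth level. Real inputs are far less regular, so the achievable saving depends on how much repeated structure they contain.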

Parallel construction of binary tree based on sorting

Vestnik of Don State Technical University ◽

10.23947/1992-5980-2018-18-4-449-454 ◽

2019 ◽

Vol 18 (4) ◽

pp. 449-454

Author(s):

Ya. E. Romm ◽

D. A. Chabanyuk

Keyword(s):

Data Structures ◽

Binary Tree ◽

Time Complexity ◽

Binary Data ◽

Relational Databases ◽

Input Sequence ◽

Mutual Transformation ◽

One To One ◽

Parallel Construction ◽

Address Sorting

Introduction. Algorithms for parallel binary tree construction are developed. The algorithms are based on sorting and are described in a constructive form. For an N-element set, the time complexity has the estimates T(R) = O(1) and T(R) = O(log_2 N), where R = (N^2 - N)/2 is the number of processors. The tree is built with the uniqueness property. The algorithms are invariant with respect to the input sequence type. The work objective is to develop and study ways of accelerating the organization and transformation of tree-like data structures on the basis of stable, maximally parallel sorting algorithms, for application to the basic operations of information retrieval on databases. Materials and Methods. A one-to-one relation between the input element set and the binary tree built for it is established using a stable address sorting. The sorting provides maximum concurrency and, in operator form, establishes a one-to-one mapping of input and output indices. On this basis, methods for the mutual transformation of binary data structures are developed. Research Results. An efficient parallel algorithm for constructing a binary tree based on address sorting, with time complexity T(N^2) = O(log_2 N), is obtained. The algorithm differs from well-known analogues in its structure and in its logarithmic time complexity estimate, which yields a speed-up of order O(N^α), α ≥ 1, over those analogues. As an advanced version, a modification of the algorithm is suggested that provides maximally parallel construction of the binary tree based on a stable address sorting and an a priori calculation of the stored subtree root indices. This algorithm differs in structure and in its time complexity estimate T(1) = O(1). A similar estimate is achieved in a sequential version of the modified algorithm, which yields a speed-up of order O(N^α), α > 1, over known analogues. Discussion and Conclusions. The results obtained are focused on the creation of effective methods for dynamic database processing. The proposed methods and algorithms can form an algorithmic basis for advanced deterministic search on relational databases and information systems.
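
One way to read the processor count R = (N^2 - N)/2 is as one comparison per unordered pair of input elements. The sequential Python sketch below follows that reading: it computes each element's output address by counting pairwise comparisons, then builds a binary search tree over the sorted order by taking midpoints. Both the counting scheme and the midpoint construction are assumptions chosen for illustration, not the authors' algorithm, and the names `address_sort`, `build_tree` and `in_order` are hypothetical.

```python
from itertools import combinations

def address_sort(a):
    """Stable sort by counting.

    Each of the (N*N - N)//2 pairwise comparisons is independent of the
    others, so in principle each could run on its own processor.
    Returns (rank, order): rank[i] is the output position of input index i,
    and order[r] is the input index placed at output position r.
    """
    n = len(a)
    rank = [0] * n
    for i, j in combinations(range(n), 2):   # always i < j
        if a[j] < a[i]:
            rank[i] += 1      # a[j] precedes a[i] in the output
        else:
            rank[j] += 1      # ties keep input order, so the sort is stable
    order = [None] * n
    for i, r in enumerate(rank):
        order[r] = i
    return rank, order

def build_tree(order, lo=0, hi=None):
    """Binary search tree over sorted input indices, as (index, left, right)."""
    if hi is None:
        hi = len(order)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return (order[mid], build_tree(order, lo, mid), build_tree(order, mid + 1, hi))

def in_order(tree):
    return [] if tree is None else in_order(tree[1]) + [tree[0]] + in_order(tree[2])

data = [3, 1, 2, 1]
rank, order = address_sort(data)
tree = build_tree(order)
```

Because `rank` and `order` are inverse permutations of each other, the sketch makes the one-to-one mapping between input and output indices explicit, and an in-order traversal of the tree recovers the sorted order.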

A separation logic for negative dependence

Proceedings of the ACM on Programming Languages ◽

10.1145/3498719 ◽

2022 ◽

Vol 6 (POPL) ◽

pp. 1-29

Author(s):

Jialu Bao ◽

Marco Gaboardi ◽

Justin Hsu ◽

Joseph Tassarotti

Keyword(s):

Data Structures ◽

Algorithm Design ◽

Random Variables ◽

Separation Logic ◽

Negative Dependence ◽

Probabilistic Data ◽

Complete Proof ◽

Partial Operation ◽

Verification Methods ◽

Independence Of Random Variables

Formal reasoning about hashing-based probabilistic data structures often requires reasoning about random variables where when one variable gets larger (such as the number of elements hashed into one bucket), the others tend to be smaller (like the number of elements hashed into the other buckets). This is an example of negative dependence, a generalization of probabilistic independence that has recently found interesting applications in algorithm design and machine learning. Despite the usefulness of negative dependence for the analyses of probabilistic data structures, existing verification methods cannot establish this property for randomized programs. To fill this gap, we design LINA, a probabilistic separation logic for reasoning about negative dependence. Following recent works on probabilistic separation logic using separating conjunction to reason about the probabilistic independence of random variables, we use separating conjunction to reason about negative dependence. Our assertion logic features two separating conjunctions, one for independence and one for negative dependence. We generalize the logic of bunched implications (BI) to support multiple separating conjunctions, and provide a sound and complete proof system. Notably, the semantics for separating conjunction relies on a non-deterministic, rather than partial, operation for combining resources. By drawing on closure properties for negative dependence, our program logic supports a Frame-like rule for negative dependence and monotone operations. We demonstrate how LINA can verify probabilistic properties of hash-based data structures and balls-into-bins processes.
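
The bucket example in this abstract can be checked numerically for a tiny balls-into-bins process. The Python sketch below enumerates every equally likely assignment of balls to bins and computes the exact covariance between the counts of two bins. It is unrelated to LINA or its proof rules, and a negative covariance is only one consequence of negative dependence, not a definition of it. The function name `bin_count_covariance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def bin_count_covariance(balls, bins):
    """Exact Cov(X0, X1), where Xi is the number of balls landing in bin i.

    Each ball is thrown independently and uniformly, so all
    bins**balls assignments are equally likely.
    """
    outcomes = list(product(range(bins), repeat=balls))
    p = Fraction(1, len(outcomes))
    e0 = sum(o.count(0) for o in outcomes) * p
    e1 = sum(o.count(1) for o in outcomes) * p
    e01 = sum(o.count(0) * o.count(1) for o in outcomes) * p
    return e01 - e0 * e1

cov = bin_count_covariance(balls=3, bins=3)
```

For three balls and three bins the result is -1/3, which agrees with the multinomial formula -n/m^2 for n balls and m equally likely bins: a fuller bucket makes the other buckets emptier on average.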

ON THE COMPLEXITY OF MAPPING LINEAR CHAIN APPLICATIONS ONTO HETEROGENEOUS PLATFORMS

Parallel Processing Letters ◽

10.1142/s0129626409000298 ◽

2009 ◽

Vol 19 (03) ◽

pp. 383-397 ◽

Cited By ~ 4

Author(s):

ANNE BENOIT ◽

YVES ROBERT ◽

ERIC THIERRY

Keyword(s):

Large Scale ◽

Broad Class ◽

Linear Chain ◽

Real Life ◽

Interval Mapping ◽

Constant Factor ◽

Data Set ◽

Heterogeneous Platforms ◽

Mapping Problem ◽

One To One

In this paper, we explore the problem of mapping linear chain applications onto large-scale heterogeneous platforms. A series of data sets enter the input stage and progress from stage to stage until the final result is computed. An important optimization criterion that should be considered in such a framework is the latency, or makespan, which measures the response time of the system in order to process one single data set entirely. For such applications, which are representative of a broad class of real-life applications, we can consider one-to-one mappings, in which each stage is mapped onto a single processor. However, in order to reduce the communication cost, it seems natural to group stages into intervals. The interval mapping problem can be solved in a straightforward way if the platform has homogeneous communications: the whole chain is grouped into a single interval, which in turn is mapped onto the fastest processor. But the problem becomes harder when considering a fully heterogeneous platform. Indeed, we prove the NP-completeness of this problem. Furthermore, we prove that neither the interval mapping problem nor the similar one-to-one mapping problem can be approximated in polynomial time by any constant factor (unless P=NP).

ON THE PROPERTIES OF A GENERALIZED CLASS OF T-NORMS IN INTERVAL-VALUED FUZZY LOGICS

New Mathematics and Natural Computation ◽

10.1142/s1793005706000361 ◽

2006 ◽

Vol 02 (01) ◽

pp. 29-41 ◽

Cited By ~ 16

Author(s):

BART VAN GASSE ◽

CHRIS CORNELIS ◽

GLAD DESCHRIJVER ◽

ETIENNE E. KERRE

Keyword(s):

Residuated Lattice ◽

Broad Class ◽

Fuzzy Logics ◽

Intimate Connection ◽

Logical Connectives ◽

One To One ◽

Interval Valued

Since it does not generate any MTL-algebra (prelinear residuated lattice), the lattice [Formula: see text] of closed subintervals of [0, 1] falls outside the mainstream of research on formal fuzzy logics. However, due to the intimate connection between logical connectives on [Formula: see text] and those on [0, 1], many relevant logical properties can still be maintained, sometimes in a slightly weaker form. In this paper, we focus on a broad class of parametrized t-norms on [Formula: see text]. We derive their corresponding residual implicators, and examine commonly imposed logical properties. Importantly, we formally establish one-to-one correspondences between ∨-definability (respectively, weak divisibility) for t-norms of this class and strong ∨-definability (resp., divisibility) for their counterparts on [0, 1].

GENERIC PROGRAMMING OF PARALLEL APPLICATIONS WITH JANUS

Parallel Processing Letters ◽

10.1142/s0129626402000914 ◽

2002 ◽

Vol 12 (02) ◽

pp. 175-190 ◽

Cited By ~ 7

Author(s):

JENS GERLACH

Keyword(s):

Data Structures ◽

Broad Class ◽

Finite Difference Methods ◽

Parallel Applications ◽

Generic Programming ◽

Adaptive Finite Element ◽

Adaptive Finite Element Methods ◽

Data Parallel ◽

Language Extension ◽

Data Structures And Algorithms

Janus is a conceptual framework and C++ template library that provides a flexible and extensible collection of efficient data structures and algorithms for a broad class of data-parallel applications. In particular, finite difference methods, (adaptive) finite element methods, and data-parallel graph algorithms are supported. An outstanding advantage of a generic C++ framework is that it provides application-oriented abstractions that achieve high performance without relying on language extensions or non-standard compiler technology. The C++ template mechanism allows user-defined types to be plugged into the Janus data structures and algorithms. Moreover, Janus components can easily be combined with standard software packages of this field.

Simulating Uniform Hashing in Constant Time and Optimal Space

BRICS Report Series ◽

10.7146/brics.v9i27.21743 ◽

2002 ◽

Vol 9 (27) ◽

Author(s):

Anna Östlin ◽

Rasmus Pagh

Keyword(s):

Data Structures ◽

High Probability ◽

Broad Class ◽

Hash Functions ◽

Constant Time ◽

Algorithms And Data Structures ◽

Random Functions ◽

Performance Guarantees ◽

Optimal Space

Many algorithms and data structures employing hashing have been analyzed under the uniform hashing assumption, i.e., the assumption that hash functions behave like truly random functions. In this paper it is shown how to implement hash functions that can be evaluated on a RAM in constant time, and behave like truly random functions on any set of n inputs, with high probability. The space needed to represent a function is O(n) words, which is the best possible (and a polynomial improvement compared to previous fast hash functions). As a consequence, a broad class of hashing schemes can be implemented to meet, with high probability, the performance guarantees of their uniform hashing analysis.
