R3D3: A Doubly Opportunistic Data Structure for Compressing and Indexing Massive Data

2019 ◽  
pp. 58-66
Author(s):  
Máté Nagy ◽  
János Tapolcai ◽  
Gábor Rétvári

Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in information-theoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures that attain the best possible compression not only on the input data but also on the index. We present R3D3, which encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries to arbitrary positions in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains several-fold space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while supporting operations on the compressed data at comparable speed.
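To make the class/offset block coding that entropy-compressed bitvectors of this kind build on concrete, here is a minimal RRR-style sketch in Python. The block size, names, and the dense per-block rank samples are illustrative choices of ours, not R3D3's parameters; in particular, real designs sample the rank index sparsely, which is exactly where R3D3's nH0(1/2 + O(log C/C)) index bound comes from.

from math import comb

B = 15  # illustrative block size; R3D3's parameterization differs

def offset_of(blk):
    # Lexicographic rank of a B-bit pattern among all patterns with the
    # same popcount (standard enumerative coding, as in RRR).
    off, k = 0, sum(blk)
    for i, bit in enumerate(blk):
        if bit:
            off += comb(len(blk) - i - 1, k)
            k -= 1
    return off

def block_from(cls, off):
    # Inverse of offset_of: rebuild the B-bit pattern from (class, offset).
    bits, k = [], cls
    for i in range(B):
        c = comb(B - i - 1, k)
        if k > 0 and off >= c:
            bits.append(1)
            off -= c
            k -= 1
        else:
            bits.append(0)
    return bits

class CompressedBitvector:
    def __init__(self, bits):
        self.n = len(bits)
        self.blocks = []         # (class, offset) pairs; offsets need few bits
        self.rank_samples = [0]  # cumulative popcount before each block
        for s in range(0, self.n, B):
            blk = bits[s:s + B]
            blk += [0] * (B - len(blk))          # pad the last block
            cls = sum(blk)
            self.blocks.append((cls, offset_of(blk)))
            self.rank_samples.append(self.rank_samples[-1] + cls)

    def rank1(self, i):
        # Number of 1s in bits[0:i]: sampled prefix + partial block decode.
        q, r = divmod(i, B)
        total = self.rank_samples[q]
        if r:
            cls, off = self.blocks[q]
            total += sum(block_from(cls, off)[:r])
        return total

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 8
bv = CompressedBitvector(bits)
assert all(bv.rank1(i) == sum(bits[:i]) for i in range(len(bits) + 1))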


2018 ◽  
Vol 18 (3-4) ◽  
pp. 470-483 ◽  
Author(s):  
GREGORY J. DUCK ◽  
JOXAN JAFFAR ◽  
ROLAND H. C. YAP

Malformed data structures can lead to runtime errors such as arbitrary memory access or corruption. Despite this, reasoning over data-structure properties for low-level heap-manipulating programs remains challenging. In this paper we present a constraint-based program analysis that checks data-structure integrity, with respect to given target data-structure properties, as the heap is manipulated by the program. Our approach is to automatically generate a solver for the properties from the type definitions of the target program. The generated solver is implemented using a Constraint Handling Rules (CHR) extension of built-in heap, integer and equality solvers. A key property of our program analysis is that the target data-structure properties are shape neutral, i.e., the analysis does not check for properties relating to a given data-structure graph shape, such as doubly-linked lists versus trees. Nevertheless, the analysis can detect errors in a wide range of data-structure manipulating programs, including those that use lists, trees, DAGs, graphs, etc. We present an implementation that uses the Satisfiability Modulo Constraint Handling Rules (SMCHR) system. Experimental results show that our approach works well for real-world C programs.
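The notion of shape-neutral integrity can be illustrated with a toy dynamic checker, quite unlike the paper's static CHR-based analysis: given declared types, verify that every non-NULL pointer field references a live object of the correct type, regardless of whether the heap forms a list, tree, DAG, or general graph. All names and the heap encoding here are hypothetical.

# Heap model: address -> (type_name, {field: value}); pointers are ints, 0 = NULL.
TYPES = {
    "node": {"next": "node*", "data": "int"},  # same check works for lists, trees, DAGs
}

def check_heap(heap):
    """Shape-neutral integrity: every non-NULL pointer field must reference
    a live object of the field's declared pointee type."""
    errors = []
    for addr, (tname, fields) in heap.items():
        for f, ftype in TYPES[tname].items():
            v = fields.get(f)
            if ftype.endswith("*") and v:            # pointer field, non-NULL
                pointee = ftype[:-1]
                if v not in heap:
                    errors.append(f"{tname}@{addr}.{f}: dangling pointer {v}")
                elif heap[v][0] != pointee:
                    errors.append(f"{tname}@{addr}.{f}: points to {heap[v][0]}, expected {pointee}")
    return errors

heap = {
    1: ("node", {"next": 2, "data": 7}),
    2: ("node", {"next": 99, "data": 8}),   # 99 was freed: dangling
}
print(check_heap(heap))  # -> ['node@2.next: dangling pointer 99']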



2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence with the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM
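The community-design step can be sketched in miniature: a common way to simulate relative abundances (a distribution CAMISIM also uses) is a log-normal draw per genome, normalized to sum to one. The function, its parameters, and the genome names below are purely illustrative, not CAMISIM's API.

import random

def simulate_abundance_profile(genomes, mu=1.0, sigma=2.0, seed=42):
    """Toy analogue of a simulated community profile: draw a log-normal
    abundance per genome and normalize to relative abundances.
    (The real simulator has many more knobs; mu/sigma here are arbitrary.)"""
    rng = random.Random(seed)
    raw = {g: rng.lognormvariate(mu, sigma) for g in genomes}
    total = sum(raw.values())
    return {g: a / total for g, a in raw.items()}

profile = simulate_abundance_profile([f"genome_{i}" for i in range(10)])
print({g: round(a, 4) for g, a in profile.items()})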



2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
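One illustrative metric in the paper's spirit: group subjects by their predicate family (often called a characteristic set) and measure how few distinct families cover all subjects, a direct proxy for the structural redundancy an RDF compressor can exploit. The sample triples and the redundancy ratio below are ours, not the paper's exact definitions.

from collections import Counter

triples = [
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:name",  '"Bob"'),
    ("ex:bob",   "foaf:knows", "ex:alice"),
    ("ex:doc1",  "dc:title",   '"Report"'),
]

def characteristic_sets(triples):
    # Map each subject to the set of predicates it uses, then count how
    # many subjects share each predicate family.
    preds = {}
    for s, p, _ in triples:
        preds.setdefault(s, set()).add(p)
    return Counter(frozenset(ps) for ps in preds.values())

cs = characteristic_sets(triples)
subjects = sum(cs.values())
print(f"{subjects} subjects, {len(cs)} distinct predicate families")
print(f"structural redundancy ratio: {1 - len(cs) / subjects:.2f}")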



2013 ◽  
Vol 756-759 ◽  
pp. 1387-1391
Author(s):  
Xiao Dong Wang ◽  
Jun Tian

Building an efficient data structure for range selection problems is considered. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea of how the others would perform. The computation model used in this paper is the RAM model with word size w. Our data structure is a practical linear-space data structure that supports fast range selection queries after a preprocessing phase.
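Since the abstract's exact time bounds did not survive, here is a hedged, practical baseline for the problem itself: the k-th smallest element in A[l..r], answered with sorted sqrt-blocks plus a binary search over values. It is linear-space and simple, but it is emphatically not the paper's structure.

import bisect, math

class RangeSelect:
    """Baseline range selection: k-th smallest in a[l..r] (inclusive)."""
    def __init__(self, a):
        self.a = a
        self.b = max(1, int(math.sqrt(len(a))))
        self.blocks = [sorted(a[i:i + self.b]) for i in range(0, len(a), self.b)]
        self.sorted_all = sorted(a)

    def _rank(self, l, r, x):
        # Number of elements <= x in a[l..r].
        cnt = 0
        while l <= r and l % self.b:             # left ragged part
            cnt += self.a[l] <= x; l += 1
        while l + self.b - 1 <= r:               # full blocks via binary search
            cnt += bisect.bisect_right(self.blocks[l // self.b], x)
            l += self.b
        while l <= r:                            # right ragged part
            cnt += self.a[l] <= x; l += 1
        return cnt

    def select(self, l, r, k):
        # Smallest value whose rank in a[l..r] reaches k (k is 1-based).
        lo, hi = 0, len(self.sorted_all) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self._rank(l, r, self.sorted_all[mid]) >= k:
                hi = mid
            else:
                lo = mid + 1
        return self.sorted_all[lo]

rs = RangeSelect([5, 1, 4, 2, 3])
print(rs.select(1, 3, 2))  # 2nd smallest of [1, 4, 2] -> 2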



2015 ◽  
Vol 27 (2) ◽  
pp. 277-295 ◽  
Author(s):  
MAXIME CROCHEMORE ◽  
COSTAS S. ILIOPOULOS ◽  
ALESSIO LANGIU ◽  
FILIPPO MIGNOSI

Given a set $\mathcal{D}$ of q documents, the Longest Common Substring (LCS) problem asks, for any integer 2 ⩽ k ⩽ q, for the longest substring that appears in k of the documents. LCS is a well-studied problem with a wide range of applications in Bioinformatics, from microarrays to DNA sequence alignment and analysis. The problem was solved by Hui (2000, International Journal of Computer Science and Engineering, 15, 73–76) using a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees, coupled with the use of suffix trees. In this article, we present a simple method for solving the LCS problem by using suffix trees (STs) and classical union-find data structures. In turn, we show how this simple algorithm can be adapted to work with other space-efficient data structures, such as enhanced suffix arrays (ESA) and the compressed suffix tree.
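Before reaching for suffix trees, the problem statement itself can be captured by a short baseline: binary-search the answer length (the property "some length-L substring occurs in at least k documents" is monotone in L) and test each candidate length with per-document substring sets. This is far slower than the article's method; it serves as an executable specification, not a replacement.

def lcs_in_k_docs(docs, k):
    """Longest substring occurring in at least k of the documents."""
    def candidate(length):
        # Return some length-`length` substring present in >= k documents,
        # or None. Deduplicate per document so each counts once.
        if length == 0:
            return ""
        counts = {}
        for d in docs:
            for sub in {d[i:i + length] for i in range(len(d) - length + 1)}:
                counts[sub] = counts.get(sub, 0) + 1
        return next((s for s, c in counts.items() if c >= k), None)

    lo, hi, best = 0, max(map(len, docs)), ""
    while lo <= hi:                  # binary search on the answer length
        mid = (lo + hi) // 2
        s = candidate(mid)
        if s is not None:
            best, lo = s, mid + 1
        else:
            hi = mid - 1
    return best

print(lcs_in_k_docs(["bioinformatics", "informatics", "informal"], 3))  # -> "informa"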



Author(s):  
Т.В. Речкалов ◽  
М.Л. Цымблер

The PAM (Partitioning Around Medoids) algorithm is a partitioning clustering algorithm in which only objects from the input data set (medoids) are chosen as cluster centers. Medoid-based clustering is used in a wide range of applications: the segmentation of medical and satellite images, the analysis of DNA microarrays and texts, etc. Currently, there are parallel implementations of PAM for GPU and FPGA systems, but none for Intel Many Integrated Core (MIC) accelerators. In this paper, we propose a novel parallel clustering algorithm, PhiPAM, for Intel MIC systems. Computations are parallelized with OpenMP. The algorithm exploits a specialized memory data layout and loop tiling technique, which allow computations to be efficiently vectorized on Intel MIC systems. Experiments performed on real data sets show good scalability of the algorithm.
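A minimal sequential PAM, to show what the paper parallelizes: random initialization followed by a greedy swap phase over a precomputed distance table. The OpenMP tiling and MIC-friendly data layout of PhiPAM are precisely what this naive version lacks.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Minimal sequential PAM: random init + full swap phase."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # O(n^2) pairwise distance table, as in the classical algorithm.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        improved = False
        for mi in range(k):                       # try swapping each medoid
            for h in range(n):                    # ...with each non-medoid
                if h in medoids:
                    continue
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                c = dist[:, trial].min(axis=1).sum()
                if c < cost:
                    medoids, cost, improved = trial, c, True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 5.0)])
medoids, labels, cost = pam(X, k=2)
print(medoids, cost)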



2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers a wide range of methods that aim to identify observations considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global-level novelty. The problem is approached through a set of very general assumptions, making it innovative in relation to the current literature. We introduce test statistics capable of detecting novelty. They operate on local neighborhoods, and their null distribution is obtained by the permutation principle. We show that they are valid and able to find different types of novelty, e.g. location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.
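The testing recipe described above, a local-neighborhood statistic with a permutation null, might look like the following sketch, where the statistic is the mean nearest-neighbour distance from test points to the training set. This is a generic stand-in of ours; the paper's test statistics differ.

import numpy as np

def novelty_test(train, test, n_perm=999, seed=0):
    """Permutation test for global novelty."""
    rng = np.random.default_rng(seed)

    def stat(tr, te):
        # Mean distance from each test point to its nearest training point.
        d = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=-1)
        return d.min(axis=1).mean()

    observed = stat(train, test)
    pooled = np.vstack([train, test])
    n_tr = len(train)
    count = 0
    for _ in range(n_perm):
        # Under the null, train/test labels are exchangeable.
        perm = rng.permutation(len(pooled))
        count += stat(pooled[perm[:n_tr]], pooled[perm[n_tr:]]) >= observed
    return observed, (count + 1) / (n_perm + 1)   # statistic, p-value

rng = np.random.default_rng(7)
train = rng.normal(0, 1, (100, 2))
test = rng.normal(2, 1, (20, 2))                  # shifted: genuine novelty
print(novelty_test(train, test))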



Atmosphere ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 644
Author(s):  
Jingyan Huang ◽  
Michael Kwok Po Ng ◽  
Pak Wai Chan

The main aim of this paper is to propose a statistical indicator for wind shear prediction from Light Detection and Ranging (LIDAR) observational data. Accurate warning signals of wind shear are particularly important for aviation safety. The main challenges are that wind shear may result from a sustained change of the headwind and that wind shear velocities may span a wide range. Traditionally, aviation models based on terrain-induced settings are used to detect wind shear phenomena. Departing from these traditional methods, we study a statistical indicator that measures the variation of headwinds across multiple headwind profiles. Because the indicator value is nonnegative, a decision rule based on a one-sided normal distribution is employed to distinguish wind shear cases from non-wind shear cases. Experimental results based on real data sets obtained at the Hong Kong International Airport runway demonstrate that the proposed indicator is quite effective. Its prediction performance is better than that of supervised learning methods (LDA, KNN, SVM, and logistic regression). The model would also provide more accurate wind shear warnings for pilots and improve the performance of the Windshear and Turbulence Warning System.
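A schematic of the decision rule described above, with a stand-in indicator: the paper's indicator measures headwind variation across profiles and is nonnegative, so a one-sided normal threshold fitted to non-wind-shear baseline values yields an alert rule. Both functions below are guesses at the shape of the method, not its published form.

import numpy as np
from statistics import NormalDist

def headwind_variation(profiles):
    """Nonnegative variation indicator: mean over range gates of the
    standard deviation of headwind across successive profiles."""
    P = np.asarray(profiles)          # shape (n_profiles, n_gates)
    return P.std(axis=0).mean()

def windshear_alert(indicator, baseline_values, alpha=0.05):
    """One-sided normal rule: alert if the indicator exceeds the
    (1 - alpha) quantile fitted to non-wind-shear baseline indicators."""
    mu, sd = np.mean(baseline_values), np.std(baseline_values, ddof=1)
    threshold = mu + NormalDist().inv_cdf(1 - alpha) * sd
    return indicator > threshold, threshold

baseline = [headwind_variation(np.random.default_rng(s).normal(10, 1, (6, 40)))
            for s in range(50)]
gusty = np.random.default_rng(99).normal(10, 4, (6, 40))   # high variation
print(windshear_alert(headwind_variation(gusty), baseline))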



Algorithms ◽  
2020 ◽  
Vol 13 (11) ◽  
pp. 276
Author(s):  
Paniz Abedin ◽  
Arnab Ganguly ◽  
Solon P. Pissis ◽  
Sharma V. Thankachan

Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√(n log^ε n)) query time, where ε > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
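For intuition, and as a testing oracle for any fast structure, a naive quadratic query can be written directly from the problem definition. The occurrence convention below (a starting position inside [α,β]) is one plausible reading; the paper fixes its own.

def range_sus(T, alpha, beta):
    """Naive Range Shortest Unique Substring query (0-based, inclusive
    range of starting positions). Quadratic; for testing only."""
    n = len(T)
    for length in range(1, n + 1):        # shortest answer first
        starts = range(alpha, min(beta, n - length) + 1)
        counts = {}
        for p in starts:
            counts[T[p:p + length]] = counts.get(T[p:p + length], 0) + 1
        for p in starts:
            if counts[T[p:p + length]] == 1:
                return p, p + length - 1  # (i, j) of a shortest answer
    return None

print(range_sus("abaababa", 1, 5))  # -> (2, 3): "aa" starts only once in [1,5]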



2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

In this paper the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described are independent. If the features are in fact dependent, this may lead to incorrect explanations; hence, there have recently been attempts to appropriately model or estimate the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach of assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, flexible tools for modelling multivariate non-Gaussian distributions that can characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
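The conditional value function this line of work targets, v(S) = E[f(X) | X_S = x_S], can be sketched with exact subset enumeration and, in place of vine copulas, a Gaussian conditional sampler, a much cruder dependence model used here only to keep the sketch short. All names and parameters are ours; for small d only, as enumeration costs 2^d.

import itertools, math
import numpy as np

def gaussian_conditional_sample(mu, cov, known, x_known, unknown, rng, n=200):
    """Sample unknown coordinates given known ones under a fitted Gaussian,
    standing in for the paper's vine-copula conditionals."""
    A = cov[np.ix_(unknown, known)] @ np.linalg.inv(cov[np.ix_(known, known)])
    cmu = mu[unknown] + A @ (x_known - mu[known])
    ccov = cov[np.ix_(unknown, unknown)] - A @ cov[np.ix_(known, unknown)]
    ccov = (ccov + ccov.T) / 2                 # symmetrize numerically
    return rng.multivariate_normal(cmu, ccov, size=n)

def shapley_values(f, x, X_train, rng=None):
    """Exact-enumeration Shapley values with v(S) = E[f(X) | X_S = x_S]."""
    rng = rng or np.random.default_rng(0)
    d = len(x)
    mu, cov = X_train.mean(axis=0), np.cov(X_train, rowvar=False)

    def v(S):
        S = list(S)
        if len(S) == d:
            return f(x[None, :]).mean()        # everything conditioned on
        if not S:
            return f(X_train).mean()           # unconditional expectation
        U = [i for i in range(d) if i not in S]
        Z = gaussian_conditional_sample(mu, cov, S, x[S], U, rng)
        Xs = np.tile(x, (len(Z), 1))
        Xs[:, U] = Z
        return f(Xs).mean()

    phi = np.zeros(d)
    for j in range(d):
        others = [i for i in range(d) if i != j]
        for r in range(d):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
                phi[j] += w * (v(S + (j,)) - v(S))
    return phi

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[1, .8, 0], [.8, 1, 0], [0, 0, 1]], 500)
f = lambda X: X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2]
print(shapley_values(f, X[0], X))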


