Recursive Genetic Micro-Aggregation Technique: Information Loss, Disclosure Risk and Scoring Index

This research investigates the micro-aggregation problem in secure statistical databases by integrating the divide and conquer concept with a genetic algorithm. This is achieved by recursively dividing a micro-data set into two subsets based on the proximity distance similarity. On each subset the genetic operation “crossover” is performed until the convergence condition is satisfied. The recursion will be terminated if the size of the generated subset is satisfied. Eventually, the genetic operation “mutation” will be performed over all generated subsets that satisfied the variable group size constraint in order to maximize the objective function. Experimentally, the proposed micro-aggregation technique was applied to recommended real-life data sets. Results demonstrated a remarkable reduction in the computational time, which sometimes exceeded 70% compared to the state-of-the-art. Furthermore, a good equilibrium value of the Scoring Index (SI) was achieved by involving a linear combination of the General Information Loss (GIL) and the General Disclosure Risk (GDR).

Download Full-text

Blind Queries Applied to JSON Document Stores

Information ◽

10.3390/info10100291 ◽

2019 ◽

Vol 10 (10) ◽

pp. 291 ◽

Cited By ~ 2

Author(s):

Stefania Marrara ◽

Mauro Pelucchi ◽

Giuseppe Psaila

Keyword(s):

Real Life ◽

Open Data ◽

Large Data ◽

General Information ◽

Data Sets ◽

Heterogeneous Structure ◽

Data Set ◽

Document Database ◽

Job Vacancy ◽

Document Stores

Social Media, Web Portals and, in general, information systems offer their own Application Programming Interfaces (APIs), used to provide large data sets concerning every aspect of day-by-day life. APIs usually provide data sets as collections of JSON documents. The heterogeneous structure of JSON documents returned by different APIs constitutes a barrier to effectively query and analyze these data sets. The adoption of NoSQL document stores, such as MongoDB, is useful for gathering these data sets, but does not solve the problem of querying the final heterogeneous repository. The aim of this paper is to provide analysts with a tool, named HammerJDB, that allows for blind querying collections of JSON documents within a NoSQL document database. The idea below is that users may know the application domain but it may be that they are not aware of the real structures of the documents stored in the database—the tool for blind querying tries to bridge the gap, by adopting a query rewriting mechanism. This paper is an evolution of a technique for blind querying Open Data portals and of its implementation within the Hammer framework, presented in some previous work. In this paper, we evolve that approach in order to query a NoSQL document database by evolving the Hammer framework into the HammerJDB framework, which is able to work on MongoDB databases. The effectiveness of the new approach is evaluated on a data set (derived from a real-life one), containing job-vacancy ads collected from European job portals.

Download Full-text

Design and Comparative Analysis of New Personalized Recommender Algorithms with Specific Features for Large Scale Datasets

Mathematics ◽

10.3390/math8071106 ◽

2020 ◽

Vol 8 (7) ◽

pp. 1106

Author(s):

S. Bhaskaran ◽

Raja Marappan ◽

B. Santhi

Keyword(s):

Large Scale ◽

Real Life ◽

Optimization Methods ◽

Tuning Parameter ◽

Computational Time ◽

Data Set ◽

Significant Difference ◽

Minimum Number ◽

Tremendous Amount ◽

The Given

Nowadays, because of the tremendous amount of information that humans and machines produce every day, it has become increasingly hard to choose the more relevant content across a broad range of choices. This research focuses on the design of two different intelligent optimization methods using Artificial Intelligence and Machine Learning for real-life applications that are used to improve the process of generation of recommenders. In the first method, the modified cluster based intelligent collaborative filtering is applied with the sequential clustering that operates on the values of dataset, user′s neighborhood set, and the size of the recommendation list. This strategy splits the given data set into different subsets or clusters and the recommendation list is extracted from each group for constructing the better recommendation list. In the second method, the specific features-based customized recommender that works in the training and recommendation steps by applying the split and conquer strategy on the problem datasets, which are clustered into a minimum number of clusters and the better recommendation list, is created among all the clusters. This strategy automatically tunes the tuning parameter λ that serves the role of supervised learning in generating the better recommendation list for the large datasets. The quality of the proposed recommenders for some of the large scale datasets is improved compared to some of the well-known existing methods. The proposed methods work well when λ = 0.5 with the size of the recommendation list, |L| = 30 and the size of the neighborhood, |S| < 30. For a large value of |S|, the significant difference of the root mean square error becomes smaller in the proposed methods. For large scale datasets, simulation of the proposed methods when varying the user sizes and when the user size exceeds 500, the experimental results show that better values of the metrics are obtained and the proposed method 2 performs better than proposed method 1. The significant differences are obtained in these methods because the structure of computation of the methods depends on the number of user attributes, λ, the number of bipartite graph edges, and |L|. The better values of the (Precision, Recall) metrics obtained with size as 3000 for the large scale Book-Crossing dataset in the proposed methods are (0.0004, 0.0042) and (0.0004, 0.0046) respectively. The average computational time of the proposed methods takes <10 seconds for the large scale datasets and yields better performance compared to the well-known existing methods.

Download Full-text

A divide-and-conquer algorithm for quantum state preparation

Scientific Reports ◽

10.1038/s41598-021-85474-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Israel F. Araujo ◽

Daniel K. Park ◽

Francesco Petruccione ◽

Adenilton J. da Silva

Keyword(s):

Computational Cost ◽

Quantum Circuit ◽

Divide And Conquer ◽

Computational Time ◽

Quantum Computers ◽

Dimensional Vector ◽

Quantum Devices ◽

Divide And Conquer Algorithm ◽

Quantum State Preparation ◽

Quantum Device

AbstractAdvantages in several fields of research and industry are expected with the rise of quantum computers. However, the computational cost to load classical data in quantum computers can impose restrictions on possible quantum speedups. Known algorithms to create arbitrary quantum states require quantum circuits with depth O(N) to load an N-dimensional vector. Here, we show that it is possible to load an N-dimensional vector with exponential time advantage using a quantum circuit with polylogarithmic depth and entangled information in ancillary qubits. Results show that we can efficiently load data in quantum devices using a divide-and-conquer strategy to exchange computational time for space. We demonstrate a proof of concept on a real quantum device and present two applications for quantum machine learning. We expect that this new loading strategy allows the quantum speedup of tasks that require to load a significant volume of information to quantum devices.

Download Full-text

Computing Expectiles Using k-Nearest Neighbours Approach

Symmetry ◽

10.3390/sym13040645 ◽

2021 ◽

Vol 13 (4) ◽

pp. 645

Author(s):

Muhammad Farooq ◽

Sehrish Sarfraz ◽

Christophe Chesneau ◽

Mahmood Ul Hassan ◽

Muhammad Ali Raza ◽

...

Keyword(s):

Computational Cost ◽

Real Life ◽

Distance Measures ◽

Computational Time ◽

High Dimensional ◽

Test Error ◽

Nearest Neighbours ◽

Comparable Performance ◽

Asymmetric Least Squares ◽

Low Computational Cost

Expectiles have gained considerable attention in recent years due to wide applications in many areas. In this study, the k-nearest neighbours approach, together with the asymmetric least squares loss function, called ex-kNN, is proposed for computing expectiles. Firstly, the effect of various distance measures on ex-kNN in terms of test error and computational time is evaluated. It is found that Canberra, Lorentzian, and Soergel distance measures lead to minimum test error, whereas Euclidean, Canberra, and Average of (L1,L∞) lead to a low computational cost. Secondly, the performance of ex-kNN is compared with existing packages er-boost and ex-svm for computing expectiles that are based on nine real life examples. Depending on the nature of data, the ex-kNN showed two to 10 times better performance than er-boost and comparable performance with ex-svm regarding test error. Computationally, the ex-kNN is found two to five times faster than ex-svm and much faster than er-boost, particularly, in the case of high dimensional data.

Download Full-text

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Abstract Dealing with the shear size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes a.k.a. stragglers can significantly slow down computation as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop novel mathematical understanding for this framework demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.

Download Full-text

A MODIFIED RATIO TYPE ESTIMATOR OF FINITE POPULATION MEAN UNDER STRATIFIED RANDOM SAMPLING SCHEME

10.36106/4917049 ◽

2021 ◽

pp. 58-60

Author(s):

Naziru Fadisanku Haruna ◽

Ran Vijay Kumar Singh ◽

Samsudeen Dahiru

Keyword(s):

Mean Square Error ◽

Random Sampling ◽

Minimum Mean Square Error ◽

Real Life ◽

Population Data ◽

Auxiliary Variable ◽

Mean Square ◽

Data Set ◽

Population Mean ◽

The Mean

In This paper a modied ratio-type estimator for nite population mean under stratied random sampling using single auxiliary variable has been proposed. The expression for mean square error and bias of the proposed estimator are derived up to the rst order of approximation. The expression for minimum mean square error of proposed estimator is also obtained. The mean square error the proposed estimator is compared with other existing estimators theoretically and condition are obtained under which proposed estimator performed better. A real life population data set has been considered to compare the efciency of the proposed estimator numerically.

Download Full-text

Long term variations of the hydrochemical composition of deep thermal ground water in the Lower Bavarian Molasse Basin – Causes and Perspectives

10.5194/egusphere-egu21-4127 ◽

2021 ◽

Author(s):

Annette Dietmaier ◽

Thomas Baumann

Keyword(s):

Time Series ◽

Real Life ◽

Data Series ◽

Fluid Composition ◽

Molasse Basin ◽

Deep Groundwater ◽

Data Set ◽

Groundwater Exploitation ◽

Groundwater Aquifers

<p>The European Water Framework Directive (WFD) commits EU member states to achieve a good qualitative and quantitative status of all their water bodies.&#160; WFD provides a list of actions to be taken to achieve the goal of good status.&#160; However, this list disregards the specific conditions under which deep (> 400 m b.g.l.) groundwater aquifers form and exist.&#160; In particular, deep groundwater fluid composition is influenced by interaction with the rock matrix and other geofluids, and may assume a bad status without anthropogenic influences. Thus, a new concept with directions of monitoring and modelling this specific kind of aquifers is needed. Their status evaluation must be based on the effects induced by their exploitation. Here, we analyze long-term real-life production data series to detect changes in the hydrochemical deep groundwater characteristics which might be triggered by balneological and geothermal exploitation. We aim to use these insights to design a set of criteria with which the status of deep groundwater aquifers can be quantitatively and qualitatively determined. Our analysis is based on a unique long-term hydrochemical data set, taken from 8 balneological and geothermal sites in the molasse basin of Lower Bavaria, Germany, and Upper Austria. It is focused on a predefined set of annual hydrochemical concentration values. The data range dates back to 1937. Our methods include developing threshold corridors, within which a good status can be assumed, and developing cluster analyses, correlation, and piper diagram analyses. We observed strong fluctuations in the hydrochemical characteristics of the molasse basin deep groundwater during the last decades. Special interest is put on fluctuations that seem to have a clear start and end date, and to be correlated with other exploitation activities in the region. For example, during the period between 1990 and 2020, bicarbonate and sodium values displayed a clear increase, followed by a distinct dip to below-average values and a subsequent return to average values at site F. During the same time, these values showed striking irregularities at site B. Furthermore, we observed fluctuations in several locations, which come close to disqualifying quality thresholds, commonly used in German balneology. Our preliminary results prove the importance of using long-term (multiple decades) time series analysis to better inform quality and quantity assessments for deep groundwater bodies: most fluctuations would stay undetected within a < 5 year time series window, but become a distinct irregularity when viewed in the context of multiple decades. In the next steps, a quality assessment matrix and threshold corridors will be developed, which take into account methods to identify these fluctuations. This will ultimately aid in assessing the sustainability of deep groundwater exploitation and reservoir management for balneological and geothermal uses.</p>

Download Full-text

Fast incremental discovery of pointwise order dependencies

Proceedings of the VLDB Endowment ◽

10.14778/3401960.3401965 ◽

2020 ◽

Vol 13 (10) ◽

pp. 1669-1681

Author(s):

Zijing Tan ◽

Ai Ran ◽

Shuai Ma ◽

Sheng Qin

Keyword(s):

Real Life ◽

Effective Algorithm ◽

Data Set ◽

Ordering Semantics ◽

Synthetic Datasets ◽

Indexing Technique

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice D is typically large and keeps changing, and it is prohibitively expensive to compute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming at computing changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D with a set Δ D of tuple insertion updates. (1) We first propose a novel indexing technique for inputs Σ and D. We give algorithms to build and choose indexes for Σ and D , and to update indexes in response to Δ D. We show that POD violations w.r.t. Σ incurred by Δ D can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log (| D |). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and identified violations caused by Δ D. The PODs in Σ that become invalid on D + Δ D are efficiently detected with the proposed indexes, and further new valid PODs on D + Δ D are identified by refining those invalid PODs in Σ on D + Δ D. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that computes from scratch, up to orders of magnitude.

Download Full-text

Online information retrieval behaviour and economics of attention

Online Information Review ◽

10.1108/oir-05-2015-0139 ◽

2015 ◽

Vol 39 (6) ◽

pp. 779-794 ◽

Cited By ~ 2

Author(s):

Mustafa Utku Özmen

Keyword(s):

Information Retrieval ◽

Information Provision ◽

General Information ◽

Survival Duration ◽

Online Information ◽

Data Set ◽

Content Type ◽

Factors Affecting ◽

Quasi Experimental ◽

Information Providers

Purpose – The purpose of this paper is to analyse users’ attitudes towards online information retrieval and processing. The aim is to identify the characteristics of information that better capture the attention of the users and to provide evidence for the information retrieval behaviour of the users by studying online photo archives as information units. Design/methodology/approach – The paper analyses a unique quasi-experimental data of photo archive access counts collected by the author from an online newspaper. In addition to access counts of each photo in 500 randomly chosen photo galleries, characteristics of the photo galleries are also recorded. Survival (duration) analysis is used in order to analyse the factors affecting the share of the photo gallery viewed by a certain proportion of the initial number of viewers. Findings – The results of the survival analysis indicate that users are impatient in case of longer photo galleries; they lose attention faster and stop viewing earlier when gallery length is uncertain; they are attracted by keywords and initial presentation and they give more credit to specific rather than general information categories. Practical implications – Results of the study offer applicable implications for information providers, especially on the online domain. In order to attract more attention, entities can engage in targeted information provision by taking into account people’s attitude towards information retrieval and processing as presented in this paper. Originality/value – This paper uses a unique data set in a quasi-experimental setting in order to identify the characteristics of online information that users are attracted to.

Download Full-text

Variational inference using approximate likelihood under the coalescent with recombination

Genome Research ◽

10.1101/gr.273631.120 ◽

2021 ◽

pp. gr.273631.120

Author(s):

Xinhao Liu ◽

Huw A Ogilvie ◽

Luay Nakhleh

Keyword(s):

Simulated Data ◽

Variational Inference ◽

Divide And Conquer ◽

Data Sets ◽

Transition Rates ◽

Data Set ◽

Population Sizes ◽

Novel Method ◽

Approximate Likelihood ◽

Promising Avenue

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, are coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has comparable or better accuracy to previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation it is flexible enough to enable future implementations of all kinds of population models.

Download Full-text