Fast incremental discovery of pointwise order dependencies

2020 ◽  
Vol 13 (10) ◽  
pp. 1669-1681
Author(s):  
Zijing Tan ◽  
Ai Ran ◽  
Shuai Ma ◽  
Sheng Qin

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice, D is typically large and keeps changing, and it is prohibitively expensive to compute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming at computing the changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D updated with a set ΔD of tuple insertions. (1) We first propose a novel indexing technique for the inputs Σ and D. We give algorithms to build and choose indexes for Σ and D, and to update the indexes in response to ΔD. We show that POD violations w.r.t. Σ incurred by ΔD can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log(|D|). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and the identified violations caused by ΔD. The PODs in Σ that become invalid on D + ΔD are efficiently detected with the proposed indexes, and new PODs valid on D + ΔD are identified by refining those invalidated PODs. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that computes from scratch, by up to orders of magnitude.
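
As a rough illustration of why the violation check can run in time proportional to log(|D|), here is a minimal Python sketch, ours rather than the paper's data structure: a single POD of the assumed form "t1.A < t2.A implies t1.B <= t2.B" is indexed by keeping (A, B) pairs sorted, so that a newly inserted tuple only needs to be compared against its neighbors. Attribute names and the structure are illustrative assumptions.

```python
import bisect

class PodIndex:
    """Sketch of an index for one POD of the assumed form
    't1.A < t2.A implies t1.B <= t2.B' (names are illustrative,
    not the paper's notation). Assumes the current dataset D
    already satisfies the POD, so B is non-decreasing when the
    pairs are sorted by A (ties on A ordered by B)."""

    def __init__(self):
        self._pairs = []  # (A, B) pairs, kept sorted lexicographically

    def violates(self, a, b):
        """O(log |D|) check: does inserting (a, b) break the POD?"""
        # The predecessor holds the largest B among tuples with A < a.
        i = bisect.bisect_left(self._pairs, (a, float("-inf")))
        if i > 0 and self._pairs[i - 1][1] > b:
            return True
        # The successor holds the smallest B among tuples with A > a.
        j = bisect.bisect_right(self._pairs, (a, float("inf")))
        if j < len(self._pairs) and self._pairs[j][1] < b:
            return True
        return False

    def insert(self, a, b):
        """Record a non-violating tuple, preserving the invariant."""
        bisect.insort(self._pairs, (a, b))
```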

Interval data mining is used to extract unknown patterns, hidden rules, associations, etc. from interval-based data. The extraction of closed intervals is important because, by mining the set of closed intervals and their support counts, the support count of any interval can be computed easily. In this work, an incremental algorithm for computing closed intervals together with their support counts from an interval dataset is proposed. Many methods for mining closed intervals are available; most of them assume a static data set as input, and hence the algorithms are non-incremental. Real-life data sets, however, are dynamic by nature. An efficient incremental algorithm called CI-Tree has already been proposed for computing the closed intervals present in dynamic interval data, but it cannot compute the support values of the closed intervals. The proposed algorithm, called SCI-Tree, extracts all closed intervals together with their support values incrementally from the given interval data. Moreover, all frequent closed intervals can be computed for any user-defined minimum support with a single scan of the SCI-Tree, without revisiting the dataset. The proposed method has been tested on real-life and synthetic datasets, and the results are reported.
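
To make the notion of support concrete, the brute-force sketch below computes closed intervals and their supports under one common reading, which we assume here: the support of an interval [l, u] is the number of data intervals containing it, and an interval is closed if every strict enlargement has strictly smaller support. This reproduces the semantics an SCI-Tree-style structure would maintain incrementally, not the tree itself.

```python
from itertools import product

def closed_intervals(data):
    """Brute-force sketch, assuming containment support:
    support([l, u]) = number of data intervals [x, y] with
    x <= l and u <= y. Returns {(l, u): support} for every
    closed interval, i.e. one whose support drops under every
    strict enlargement."""
    lefts = sorted({l for l, _ in data})
    rights = sorted({u for _, u in data})

    def support(l, u):
        return sum(1 for x, y in data if x <= l and u <= y)

    result = {}
    for l, u in product(lefts, rights):
        if l > u:
            continue
        s = support(l, u)
        if s == 0:
            continue
        # Closed: every strictly larger candidate loses support.
        closed = all(
            support(l2, u2) < s
            for l2, u2 in product(lefts, rights)
            if l2 <= l and u <= u2 and (l2, u2) != (l, u)
        )
        if closed:
            result[(l, u)] = s
    return result

# Three toy interval records; the closed intervals are exactly the
# nonempty intersections of subsets of the data intervals:
print(closed_intervals([(1, 5), (2, 6), (3, 4)]))
# {(1, 5): 1, (2, 5): 2, (2, 6): 1, (3, 4): 3}
```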


2021 ◽  
Vol 16 (2) ◽  
pp. 1-31
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yuting Yang ◽  
Wensheng Gan ◽  
Philip S. Yu

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods are biased toward items that have a longer on-shelf time, as such items have a greater chance to generate a high utility. To eliminate this bias, the problem of on-shelf utility mining (OSUM) was introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods, and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation, with two upper bounds, time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed to facilitate the calculation of upper bounds and utilities. Substantial experimental results on several real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
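
The pruning logic shared by both methods can be sketched generically: an upper bound that overestimates the utility of every extension of a pattern lets the miner discard whole subtrees of the search space. The skeleton below is our simplification, with `upper_bound` standing in for either TPEU or TRSU and the callbacks left abstract; it is not the OSUMS pseudocode.

```python
def mine(seed, db, min_util, extend, utility, upper_bound):
    """Generic sketch of upper-bound pruning in high-utility
    sequential mining. 'extend' yields candidate extensions of a
    pattern; 'utility' and 'upper_bound' evaluate a pattern against
    the time-partitioned database db. Pruning is safe only because
    upper_bound(p) >= utility(q) for every extension q of p."""
    results, stack = [], [seed]
    while stack:
        p = stack.pop()
        if utility(p, db) >= min_util:
            results.append(p)
        for q in extend(p, db):
            # If even the optimistic bound cannot reach min_util,
            # no extension of q can be high utility either: prune.
            if upper_bound(q, db) >= min_util:
                stack.append(q)
    return results
```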


2021 ◽  
pp. 58-60
Author(s):  
Naziru Fadisanku Haruna ◽  
Ran Vijay Kumar Singh ◽  
Samsudeen Dahiru

In This paper a modied ratio-type estimator for nite population mean under stratied random sampling using single auxiliary variable has been proposed. The expression for mean square error and bias of the proposed estimator are derived up to the rst order of approximation. The expression for minimum mean square error of proposed estimator is also obtained. The mean square error the proposed estimator is compared with other existing estimators theoretically and condition are obtained under which proposed estimator performed better. A real life population data set has been considered to compare the efciency of the proposed estimator numerically.


2021 ◽  
Author(s):  
Annette Dietmaier ◽  
Thomas Baumann

The European Water Framework Directive (WFD) commits EU member states to achieve a good qualitative and quantitative status of all their water bodies. The WFD provides a list of actions to be taken to achieve the goal of good status. However, this list disregards the specific conditions under which deep (> 400 m b.g.l.) groundwater aquifers form and exist. In particular, deep groundwater fluid composition is influenced by interaction with the rock matrix and other geofluids, and may assume a bad status without anthropogenic influences. Thus, a new concept for monitoring and modelling this specific type of aquifer is needed. Their status evaluation must be based on the effects induced by their exploitation. Here, we analyze long-term real-life production data series to detect changes in the hydrochemical deep groundwater characteristics which might be triggered by balneological and geothermal exploitation. We aim to use these insights to design a set of criteria with which the status of deep groundwater aquifers can be quantitatively and qualitatively determined. Our analysis is based on a unique long-term hydrochemical data set, taken from 8 balneological and geothermal sites in the molasse basin of Lower Bavaria, Germany, and Upper Austria, and is focused on a predefined set of annual hydrochemical concentration values. The data range dates back to 1937. Our methods include developing threshold corridors, within which a good status can be assumed, as well as cluster, correlation, and Piper diagram analyses. We observed strong fluctuations in the hydrochemical characteristics of the molasse basin deep groundwater during the last decades. Special interest is put on fluctuations that seem to have a clear start and end date and to be correlated with other exploitation activities in the region. For example, during the period between 1990 and 2020, bicarbonate and sodium values displayed a clear increase, followed by a distinct dip to below-average values and a subsequent return to average values at site F. During the same time, these values showed striking irregularities at site B. Furthermore, we observed fluctuations in several locations which come close to disqualifying quality thresholds commonly used in German balneology. Our preliminary results demonstrate the importance of using long-term (multiple decades) time series analysis to better inform quality and quantity assessments for deep groundwater bodies: most fluctuations would stay undetected within a < 5 year time series window, but become a distinct irregularity when viewed in the context of multiple decades. In the next steps, a quality assessment matrix and threshold corridors will be developed which take into account methods to identify these fluctuations. This will ultimately aid in assessing the sustainability of deep groundwater exploitation and reservoir management for balneological and geothermal uses.
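
As a sketch of how a threshold corridor could operate (parameters, function name, and data are illustrative assumptions, not the authors' final criteria), the following snippet fits a corridor of mean ± k standard deviations on a baseline period of an annual concentration series and flags later values falling outside it:

```python
import numpy as np

def corridor_flags(years, values, baseline, k=2.0):
    """Illustrative 'threshold corridor' check: fit mean +/- k*std on
    a baseline period and flag annual values outside the corridor as
    potential exploitation-induced fluctuations."""
    years = np.asarray(years)
    values = np.asarray(values, dtype=float)
    base = values[(years >= baseline[0]) & (years <= baseline[1])]
    lo, hi = base.mean() - k * base.std(), base.mean() + k * base.std()
    outside = (values < lo) | (values > hi)
    return lo, hi, list(zip(years[outside], values[outside]))

# Hypothetical bicarbonate series with a bump in 1990-1994:
yrs = list(range(1980, 2000))
vals = [310 + (5 if 1990 <= y <= 1994 else 0) + (y % 3) for y in yrs]
print(corridor_flags(yrs, vals, baseline=(1980, 1989), k=2.0))
```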


Author(s):  
Rupam Mukherjee

For prognostics in industrial applications, the degree of anomaly of a test point from a baseline cluster is estimated using a statistical distance metric. Among the different statistical distance metrics, energy distance is an interesting concept based on Newton's law of gravitation, promising simpler computation than classical distance metrics. In this paper, we review the state-of-the-art formulations of energy distance and point out several reasons why they are not directly applicable to the anomaly-detection problem. We therefore propose a new energy-based metric, called the P-statistic, which addresses these issues, is applicable to anomaly detection, and retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data set.
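
For reference, the standard two-sample energy distance that such formulations start from is given below; the P-statistic itself is the paper's contribution and is not reproduced here.

```latex
% Sample energy distance between samples {x_i}_{i=1}^n and {y_j}_{j=1}^m:
\mathcal{E}_{n,m} \;=\;
\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\lVert x_i - y_j\rVert
\;-\; \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{k=1}^{n}\lVert x_i - x_k\rVert
\;-\; \frac{1}{m^{2}}\sum_{j=1}^{m}\sum_{l=1}^{m}\lVert y_j - y_l\rVert.
```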


2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented, and the calibrated weights based on different distance functions are derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the superiority of the proposed method.
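
One distance function typically compared in such studies is the chi-square distance of Deville and Särndal, for which the calibrated weights have a closed form. The sketch below (function name and toy data are ours) solves the calibration constraints for two auxiliary variables:

```python
import numpy as np

def calibrate_chi2(d, X, totals, q=None):
    """Calibration weights minimizing sum((w_i - d_i)^2 / (d_i q_i))
    subject to X'w = totals (chi-square distance; other distance
    functions generally require iterative solutions).
    d: design weights (n,); X: auxiliary matrix (n, p);
    totals: known population totals of the auxiliaries (p,)."""
    d = np.asarray(d, float)
    X = np.asarray(X, float)
    q = np.ones_like(d) if q is None else np.asarray(q, float)
    T = X.T @ ((d * q)[:, None] * X)            # sum d_i q_i x_i x_i'
    lam = np.linalg.solve(T, totals - X.T @ d)  # Lagrange multipliers
    return d * (1 + q * (X @ lam))              # w_i = d_i(1 + q_i x_i'lam)

# Toy example with two auxiliary variables:
w = calibrate_chi2(d=[10, 10, 12, 8, 10],
                   X=[[1, 2], [1, 3], [1, 5], [1, 4], [1, 6]],
                   totals=[52, 210])
print(w.round(3))  # these weights reproduce the known totals exactly
```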


2021 ◽  
Vol 50 (2) ◽  
pp. 16-37
Author(s):  
Valentin Todorov

In a number of recent articles, Riani, Cerioli, Atkinson, and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach, the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. The approach is applicable to various robust multivariate estimates like S- and MM-estimates, MVE and MCD, as well as to the Forward Search, in which monitoring is part of the robust method. A key tool for the detection of multivariate outliers and for the monitoring of robust estimates is the Mahalanobis distance, together with statistics related to it. However, the results obtained with this tool in the case of compositional data might be unrealistic, since compositional data contain relative rather than absolute information and need to be transformed to the usual Euclidean geometry before the standard statistical tools can be applied. Various transformations of compositional data have been introduced in the literature, and theoretical results exist on the equivalence of the additive, the centered, and the isometric logratio transformations in the context of outlier identification. To illustrate the problem of monitoring compositional data and to demonstrate the usefulness of monitoring in this case, we start with a simple example and then analyze a real-life data set presenting the technological structure of manufactured exports. The analysis is conducted with the R package fsdaR, which makes the analytical and graphical tools provided in the MATLAB FSDA library available to R users.
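
A minimal Python rendition of the underlying recipe, robust Mahalanobis distances computed in isometric logratio coordinates, is sketched below; it uses scikit-learn's MinCovDet as a stand-in for MCD and synthetic compositions, whereas the article's monitoring analyses rely on fsdaR/FSDA.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def ilr(X):
    """Isometric logratio transform (one standard basis) of
    compositions X of shape (n, D) with strictly positive parts."""
    X = np.asarray(X, float)
    n, D = X.shape
    Z = np.empty((n, D - 1))
    logX = np.log(X)
    for i in range(1, D):
        gm = logX[:, :i].mean(axis=1)  # log geometric mean of first i parts
        Z[:, i - 1] = np.sqrt(i / (i + 1)) * (gm - logX[:, i])
    return Z

# Robust Mahalanobis distances (MCD) in ilr coordinates, a common
# recipe for compositional outlier detection:
rng = np.random.default_rng(0)
comp = rng.dirichlet([8, 5, 3], size=100)   # toy 3-part compositions
Z = ilr(comp)
mcd = MinCovDet(random_state=0).fit(Z)
dist = np.sqrt(mcd.mahalanobis(Z))          # robust distances per row
print(dist[:5].round(2))
```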


Author(s):  
Hesham M. Al-Ammal

Detection of anomalies in a given data set is a vital step in several applications in cybersecurity, including intrusion detection, fraud detection, and social network analysis. Many of these techniques detect anomalies by examining graph-based data, since analyzing graphs makes it possible to capture relationships, communities, and anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and inter-dependencies. Although anomaly detection in graphs dates back to the 1990s, recent research has applied machine learning methods to anomaly detection over graphs. This chapter concentrates on static graphs (both labeled and unlabeled) and summarizes some of these recent machine learning studies, covering methods such as support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter also reflects on the successes and challenges of using these methods in the context of graph-based anomaly detection.
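
As a concrete instance of one family the chapter surveys, the sketch below (graph, features, and parameters are purely illustrative) extracts simple structural node features from a static unlabeled graph and scores nodes with a one-class SVM:

```python
import networkx as nx
import numpy as np
from sklearn.svm import OneClassSVM

# Toy static graph with two deliberately over-connected nodes planted in:
G = nx.barabasi_albert_graph(200, 3, seed=1)
G.add_edges_from((i, j) for i in [0, 1] for j in range(150, 160))

# Simple structural features per node: degree and clustering coefficient.
feats = np.array([[G.degree(v), nx.clustering(G, v)] for v in G.nodes()])

# One-class SVM: lower decision scores indicate more anomalous nodes.
scores = OneClassSVM(nu=0.05, gamma="scale").fit(feats).decision_function(feats)
print("most anomalous nodes:", np.argsort(scores)[:5])
```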


Symmetry ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 1211
Author(s):  
Mengjiao Zhang ◽  
Tiantian Xu ◽  
Zhao Li ◽  
Xiqing Han ◽  
Xiangjun Dong

As an important technology in computer science, data mining aims to mine hidden, previously unknown, and potentially valuable patterns from databases. High utility negative sequential rule (HUNSR) mining can provide more comprehensive decision-making information than high utility sequential rule (HUSR) mining by taking non-occurring events into account. HUNSR mining is much more difficult than HUSR mining because of two key intrinsic complexities: how to define the HUNSR mining problem, and how to calculate the antecedent's local utility value in a HUNSR, a key issue in calculating the rule's utility-confidence. To address these complexities, we propose a comprehensive algorithm called e-HUNSR, with the following contributions. (1) We formalize the problem of HUNSR mining by proposing a series of concepts. (2) We propose a novel data structure to store the relevant information of a HUNSR candidate (HUNSRC) and a method to efficiently calculate the local utility value and utility of a HUNSRC's antecedent. (3) We propose an efficient method to generate HUNSRCs based on high utility negative sequential patterns (HUNSPs) and a pruning strategy to prune meaningless HUNSRCs. To the best of our knowledge, e-HUNSR is the first algorithm to efficiently mine HUNSRs. The experimental results on two real-life and 12 synthetic datasets show that e-HUNSR is very efficient.
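
To ground the utility terminology, the snippet below computes the utility of a positive sequential pattern in a quantity-and-profit-annotated database, using the common maximum-over-occurrences convention from high-utility sequential mining; negative (non-occurring) items and the local utility of a HUNSR antecedent require the paper's additional machinery, which is not reproduced here. Data and names are illustrative.

```python
def pattern_utility(pattern, sequence, profit):
    """Utility of a pattern (list of items) in one sequence (list of
    (item, quantity) events): each matched event contributes
    quantity * profit, one occurrence sums its matched events, and
    the pattern's utility is the maximum over all occurrences."""
    best = [0.0]

    def search(pi, si, acc):
        if pi == len(pattern):
            best[0] = max(best[0], acc)
            return
        for k in range(si, len(sequence)):
            item, qty = sequence[k]
            if item == pattern[pi]:
                search(pi + 1, k + 1, acc + qty * profit[item])

    search(0, 0, 0.0)
    return best[0]

db = [
    [("a", 2), ("b", 1), ("a", 3), ("c", 2)],
    [("b", 2), ("a", 1), ("c", 1)],
]
profit = {"a": 4, "b": 3, "c": 1}
print(sum(pattern_utility(["a", "c"], s, profit) for s in db))  # 19.0
```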

