Fast incremental discovery of pointwise order dependencies

2020 ◽  
Vol 13 (10) ◽  
pp. 1669-1681
Author(s):  
Zijing Tan ◽  
Ai Ran ◽  
Shuai Ma ◽  
Sheng Qin

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice, D is typically large and keeps changing, and it is prohibitively expensive to compute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming at computing the changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D updated with a set ΔD of tuple insertions. (1) We first propose a novel indexing technique for the inputs Σ and D. We give algorithms to build and choose indexes for Σ and D, and to update the indexes in response to ΔD. We show that POD violations w.r.t. Σ incurred by ΔD can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log(|D|). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and the identified violations caused by ΔD. The PODs in Σ that become invalid on D + ΔD are efficiently detected with the proposed indexes, and new PODs valid on D + ΔD are identified by refining those invalidated PODs. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that computes from scratch, by up to orders of magnitude.
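
As a rough illustration of why the violation check can run in time proportional to log(|D|), here is a minimal Python sketch, ours rather than the paper's data structure: a single POD of the assumed form "t1.A < t2.A implies t1.B <= t2.B" is indexed by keeping (A, B) pairs sorted, so that a newly inserted tuple only needs to be compared against its neighbors. Attribute names and the structure are illustrative assumptions.

```python
import bisect

class PodIndex:
    """Sketch of an index for one POD of the assumed form
    't1.A < t2.A implies t1.B <= t2.B' (names are illustrative,
    not the paper's notation). Assumes the current dataset D
    already satisfies the POD, so B is non-decreasing when the
    pairs are sorted by A (ties on A ordered by B)."""

    def __init__(self):
        self._pairs = []  # (A, B) pairs, kept sorted lexicographically

    def violates(self, a, b):
        """O(log |D|) check: does inserting (a, b) break the POD?"""
        # The predecessor holds the largest B among tuples with A < a.
        i = bisect.bisect_left(self._pairs, (a, float("-inf")))
        if i > 0 and self._pairs[i - 1][1] > b:
            return True
        # The successor holds the smallest B among tuples with A > a.
        j = bisect.bisect_right(self._pairs, (a, float("inf")))
        if j < len(self._pairs) and self._pairs[j][1] < b:
            return True
        return False

    def insert(self, a, b):
        """Record a non-violating tuple, preserving the invariant."""
        bisect.insort(self._pairs, (a, b))
```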

Interval data mining is used to extract unknown patterns, hidden rules, associations, etc. from interval-based data. The extraction of closed intervals is important because, by mining the set of closed intervals and their support counts, the support count of any interval can be computed easily. In this work, an incremental algorithm for computing closed intervals together with their support counts from an interval dataset is proposed. Many methods for mining closed intervals are available; most of them assume a static data set as input, and hence the algorithms are non-incremental. Real-life data sets, however, are dynamic by nature. An efficient incremental algorithm called CI-Tree has already been proposed for computing the closed intervals present in dynamic interval data, but it cannot compute the support values of the closed intervals. The proposed algorithm, called SCI-Tree, extracts all closed intervals together with their support values incrementally from the given interval data. Moreover, all frequent closed intervals can be computed for any user-defined minimum support with a single scan of the SCI-Tree, without revisiting the dataset. The proposed method has been tested on real-life and synthetic datasets, and the results are reported.
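
To make the notion of support concrete, the brute-force sketch below computes closed intervals and their supports under one common reading, which we assume here: the support of an interval [l, u] is the number of data intervals containing it, and an interval is closed if every strict enlargement has strictly smaller support. This reproduces the semantics an SCI-Tree-style structure would maintain incrementally, not the tree itself.

```python
from itertools import product

def closed_intervals(data):
    """Brute-force sketch, assuming containment support:
    support([l, u]) = number of data intervals [x, y] with
    x <= l and u <= y. Returns {(l, u): support} for every
    closed interval, i.e. one whose support drops under every
    strict enlargement."""
    lefts = sorted({l for l, _ in data})
    rights = sorted({u for _, u in data})

    def support(l, u):
        return sum(1 for x, y in data if x <= l and u <= y)

    result = {}
    for l, u in product(lefts, rights):
        if l > u:
            continue
        s = support(l, u)
        if s == 0:
            continue
        # Closed: every strictly larger candidate loses support.
        closed = all(
            support(l2, u2) < s
            for l2, u2 in product(lefts, rights)
            if l2 <= l and u <= u2 and (l2, u2) != (l, u)
        )
        if closed:
            result[(l, u)] = s
    return result

# Three toy interval records; the closed intervals are exactly the
# nonempty intersections of subsets of the data intervals:
print(closed_intervals([(1, 5), (2, 6), (3, 4)]))
# {(1, 5): 1, (2, 5): 2, (2, 6): 1, (3, 4): 3}
```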


2021 ◽  
Vol 16 (2) ◽  
pp. 1-31
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yuting Yang ◽  
Wensheng Gan ◽  
Philip S. Yu

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods are biased toward items that have a longer on-shelf time, as such items have a greater chance to generate a high utility. To eliminate this bias, the problem of on-shelf utility mining (OSUM) was introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods, and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation, with two upper bounds, time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed to facilitate the calculation of upper bounds and utilities. Substantial experimental results on several real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
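
The pruning logic shared by both methods can be sketched generically: an upper bound that overestimates the utility of every extension of a pattern lets the miner discard whole subtrees of the search space. The skeleton below is our simplification, with `upper_bound` standing in for either TPEU or TRSU and the callbacks left abstract; it is not the OSUMS pseudocode.

```python
def mine(seed, db, min_util, extend, utility, upper_bound):
    """Generic sketch of upper-bound pruning in high-utility
    sequential mining. 'extend' yields candidate extensions of a
    pattern; 'utility' and 'upper_bound' evaluate a pattern against
    the time-partitioned database db. Pruning is safe only because
    upper_bound(p) >= utility(q) for every extension q of p."""
    results, stack = [], [seed]
    while stack:
        p = stack.pop()
        if utility(p, db) >= min_util:
            results.append(p)
        for q in extend(p, db):
            # If even the optimistic bound cannot reach min_util,
            # no extension of q can be high utility either: prune.
            if upper_bound(q, db) >= min_util:
                stack.append(q)
    return results
```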


2021 ◽  
pp. 58-60
Author(s):  
Naziru Fadisanku Haruna ◽  
Ran Vijay Kumar Singh ◽  
Samsudeen Dahiru

In This paper a modied ratio-type estimator for nite population mean under stratied random sampling using single auxiliary variable has been proposed. The expression for mean square error and bias of the proposed estimator are derived up to the rst order of approximation. The expression for minimum mean square error of proposed estimator is also obtained. The mean square error the proposed estimator is compared with other existing estimators theoretically and condition are obtained under which proposed estimator performed better. A real life population data set has been considered to compare the efciency of the proposed estimator numerically.


2021 ◽  
Author(s):  
Annette Dietmaier ◽  
Thomas Baumann

The European Water Framework Directive (WFD) commits EU member states to achieve a good qualitative and quantitative status of all their water bodies. The WFD provides a list of actions to be taken to achieve the goal of good status. However, this list disregards the specific conditions under which deep (> 400 m b.g.l.) groundwater aquifers form and exist. In particular, deep groundwater fluid composition is influenced by interaction with the rock matrix and other geofluids, and may assume a bad status without anthropogenic influences. Thus, a new concept for monitoring and modelling this specific type of aquifer is needed. Their status evaluation must be based on the effects induced by their exploitation. Here, we analyze long-term real-life production data series to detect changes in the hydrochemical deep groundwater characteristics which might be triggered by balneological and geothermal exploitation. We aim to use these insights to design a set of criteria with which the status of deep groundwater aquifers can be quantitatively and qualitatively determined. Our analysis is based on a unique long-term hydrochemical data set, taken from 8 balneological and geothermal sites in the molasse basin of Lower Bavaria, Germany, and Upper Austria, and is focused on a predefined set of annual hydrochemical concentration values. The data range dates back to 1937. Our methods include developing threshold corridors, within which a good status can be assumed, as well as cluster, correlation, and Piper diagram analyses. We observed strong fluctuations in the hydrochemical characteristics of the molasse basin deep groundwater during the last decades. Special interest is put on fluctuations that seem to have a clear start and end date and to be correlated with other exploitation activities in the region. For example, during the period between 1990 and 2020, bicarbonate and sodium values displayed a clear increase, followed by a distinct dip to below-average values and a subsequent return to average values at site F. During the same time, these values showed striking irregularities at site B. Furthermore, we observed fluctuations in several locations which come close to disqualifying quality thresholds commonly used in German balneology. Our preliminary results demonstrate the importance of using long-term (multiple decades) time series analysis to better inform quality and quantity assessments for deep groundwater bodies: most fluctuations would stay undetected within a < 5 year time series window, but become a distinct irregularity when viewed in the context of multiple decades. In the next steps, a quality assessment matrix and threshold corridors will be developed which take into account methods to identify these fluctuations. This will ultimately aid in assessing the sustainability of deep groundwater exploitation and reservoir management for balneological and geothermal uses.
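
As a sketch of how a threshold corridor could operate (parameters, function name, and data are illustrative assumptions, not the authors' final criteria), the following snippet fits a corridor of mean ± k standard deviations on a baseline period of an annual concentration series and flags later values falling outside it:

```python
import numpy as np

def corridor_flags(years, values, baseline, k=2.0):
    """Illustrative 'threshold corridor' check: fit mean +/- k*std on
    a baseline period and flag annual values outside the corridor as
    potential exploitation-induced fluctuations."""
    years = np.asarray(years)
    values = np.asarray(values, dtype=float)
    base = values[(years >= baseline[0]) & (years <= baseline[1])]
    lo, hi = base.mean() - k * base.std(), base.mean() + k * base.std()
    outside = (values < lo) | (values > hi)
    return lo, hi, list(zip(years[outside], values[outside]))

# Hypothetical bicarbonate series with a bump in 1990-1994:
yrs = list(range(1980, 2000))
vals = [310 + (5 if 1990 <= y <= 1994 else 0) + (y % 3) for y in yrs]
print(corridor_flags(yrs, vals, baseline=(1980, 1989), k=2.0))
```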


Author(s):  
Rupam Mukherjee

For prognostics in industrial applications, the degree of anomaly of a test point from a baseline cluster is estimated using a statistical distance metric. Among the different statistical distance metrics, energy distance is an interesting concept based on Newton's law of gravitation, promising simpler computation than classical distance metrics. In this paper, we review the state-of-the-art formulations of energy distance and point out several reasons why they are not directly applicable to the anomaly-detection problem. We therefore propose a new energy-based metric, called the P-statistic, which addresses these issues, is applicable to anomaly detection, and retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data set.
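
For reference, the standard two-sample energy distance that such formulations start from is given below; the P-statistic itself is the paper's contribution and is not reproduced here.

```latex
% Sample energy distance between samples {x_i}_{i=1}^n and {y_j}_{j=1}^m:
\mathcal{E}_{n,m} \;=\;
\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\lVert x_i - y_j\rVert
\;-\; \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{k=1}^{n}\lVert x_i - x_k\rVert
\;-\; \frac{1}{m^{2}}\sum_{j=1}^{m}\sum_{l=1}^{m}\lVert y_j - y_l\rVert.
```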


2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented, and the calibrated weights based on different distance functions are derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the superiority of the proposed method.
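
One distance function typically compared in such studies is the chi-square distance of Deville and Särndal, for which the calibrated weights have a closed form. The sketch below (function name and toy data are ours) solves the calibration constraints for two auxiliary variables:

```python
import numpy as np

def calibrate_chi2(d, X, totals, q=None):
    """Calibration weights minimizing sum((w_i - d_i)^2 / (d_i q_i))
    subject to X'w = totals (chi-square distance; other distance
    functions generally require iterative solutions).
    d: design weights (n,); X: auxiliary matrix (n, p);
    totals: known population totals of the auxiliaries (p,)."""
    d = np.asarray(d, float)
    X = np.asarray(X, float)
    q = np.ones_like(d) if q is None else np.asarray(q, float)
    T = X.T @ ((d * q)[:, None] * X)            # sum d_i q_i x_i x_i'
    lam = np.linalg.solve(T, totals - X.T @ d)  # Lagrange multipliers
    return d * (1 + q * (X @ lam))              # w_i = d_i(1 + q_i x_i'lam)

# Toy example with two auxiliary variables:
w = calibrate_chi2(d=[10, 10, 12, 8, 10],
                   X=[[1, 2], [1, 3], [1, 5], [1, 4], [1, 6]],
                   totals=[52, 210])
print(w.round(3))  # these weights reproduce the known totals exactly
```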


2021 ◽  
Vol 50 (2) ◽  
pp. 16-37
Author(s):  
Valentin Todorov

In a number of recent articles, Riani, Cerioli, Atkinson, and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach, the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. The approach is applicable to various robust multivariate estimates like S- and MM-estimates, MVE and MCD, as well as to the Forward Search, in which monitoring is part of the robust method. A key tool for the detection of multivariate outliers and for the monitoring of robust estimates is the Mahalanobis distance, together with statistics related to it. However, the results obtained with this tool in the case of compositional data might be unrealistic, since compositional data contain relative rather than absolute information and need to be transformed to the usual Euclidean geometry before the standard statistical tools can be applied. Various transformations of compositional data have been introduced in the literature, and theoretical results exist on the equivalence of the additive, the centered, and the isometric logratio transformations in the context of outlier identification. To illustrate the problem of monitoring compositional data and to demonstrate the usefulness of monitoring in this case, we start with a simple example and then analyze a real-life data set presenting the technological structure of manufactured exports. The analysis is conducted with the R package fsdaR, which makes the analytical and graphical tools provided in the MATLAB FSDA library available to R users.
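
A minimal Python rendition of the underlying recipe, robust Mahalanobis distances computed in isometric logratio coordinates, is sketched below; it uses scikit-learn's MinCovDet as a stand-in for MCD and synthetic compositions, whereas the article's monitoring analyses rely on fsdaR/FSDA.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def ilr(X):
    """Isometric logratio transform (one standard basis) of
    compositions X of shape (n, D) with strictly positive parts."""
    X = np.asarray(X, float)
    n, D = X.shape
    Z = np.empty((n, D - 1))
    logX = np.log(X)
    for i in range(1, D):
        gm = logX[:, :i].mean(axis=1)  # log geometric mean of first i parts
        Z[:, i - 1] = np.sqrt(i / (i + 1)) * (gm - logX[:, i])
    return Z

# Robust Mahalanobis distances (MCD) in ilr coordinates, a common
# recipe for compositional outlier detection:
rng = np.random.default_rng(0)
comp = rng.dirichlet([8, 5, 3], size=100)   # toy 3-part compositions
Z = ilr(comp)
mcd = MinCovDet(random_state=0).fit(Z)
dist = np.sqrt(mcd.mahalanobis(Z))          # robust distances per row
print(dist[:5].round(2))
```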


Author(s):  
Hesham M. Al-Ammal

Detection of anomalies in a given data set is a vital step in several applications in cybersecurity, including intrusion detection, fraud detection, and social network analysis. Many of these techniques detect anomalies by examining graph-based data, since analyzing graphs makes it possible to capture relationships, communities, and anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and inter-dependencies. Although anomaly detection in graphs dates back to the 1990s, recent research has applied machine learning methods to anomaly detection over graphs. This chapter concentrates on static graphs (both labeled and unlabeled) and summarizes some of these recent machine learning studies, covering methods such as support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter also reflects on the successes and challenges of using these methods in the context of graph-based anomaly detection.
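
As a concrete instance of one family the chapter surveys, the sketch below (graph, features, and parameters are purely illustrative) extracts simple structural node features from a static unlabeled graph and scores nodes with a one-class SVM:

```python
import networkx as nx
import numpy as np
from sklearn.svm import OneClassSVM

# Toy static graph with two deliberately over-connected nodes planted in:
G = nx.barabasi_albert_graph(200, 3, seed=1)
G.add_edges_from((i, j) for i in [0, 1] for j in range(150, 160))

# Simple structural features per node: degree and clustering coefficient.
feats = np.array([[G.degree(v), nx.clustering(G, v)] for v in G.nodes()])

# One-class SVM: lower decision scores indicate more anomalous nodes.
scores = OneClassSVM(nu=0.05, gamma="scale").fit(feats).decision_function(feats)
print("most anomalous nodes:", np.argsort(scores)[:5])
```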


Symmetry ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 1211
Author(s):  
Mengjiao Zhang ◽  
Tiantian Xu ◽  
Zhao Li ◽  
Xiqing Han ◽  
Xiangjun Dong

As an important technology in computer science, data mining aims to mine hidden, previously unknown, and potentially valuable patterns from databases. High utility negative sequential rule (HUNSR) mining can provide more comprehensive decision-making information than high utility sequential rule (HUSR) mining by taking non-occurring events into account. HUNSR mining is much more difficult than HUSR mining because of two key intrinsic complexities: how to define the HUNSR mining problem, and how to calculate the antecedent's local utility value in a HUNSR, a key issue in calculating the rule's utility-confidence. To address these complexities, we propose a comprehensive algorithm called e-HUNSR, with the following contributions. (1) We formalize the problem of HUNSR mining by proposing a series of concepts. (2) We propose a novel data structure to store the relevant information of a HUNSR candidate (HUNSRC) and a method to efficiently calculate the local utility value and utility of a HUNSRC's antecedent. (3) We propose an efficient method to generate HUNSRCs based on high utility negative sequential patterns (HUNSPs) and a pruning strategy to prune meaningless HUNSRCs. To the best of our knowledge, e-HUNSR is the first algorithm to efficiently mine HUNSRs. The experimental results on two real-life and 12 synthetic datasets show that e-HUNSR is very efficient.
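
To ground the utility terminology, the snippet below computes the utility of a positive sequential pattern in a quantity-and-profit-annotated database, using the common maximum-over-occurrences convention from high-utility sequential mining; negative (non-occurring) items and the local utility of a HUNSR antecedent require the paper's additional machinery, which is not reproduced here. Data and names are illustrative.

```python
def pattern_utility(pattern, sequence, profit):
    """Utility of a pattern (list of items) in one sequence (list of
    (item, quantity) events): each matched event contributes
    quantity * profit, one occurrence sums its matched events, and
    the pattern's utility is the maximum over all occurrences."""
    best = [0.0]

    def search(pi, si, acc):
        if pi == len(pattern):
            best[0] = max(best[0], acc)
            return
        for k in range(si, len(sequence)):
            item, qty = sequence[k]
            if item == pattern[pi]:
                search(pi + 1, k + 1, acc + qty * profit[item])

    search(0, 0, 0.0)
    return best[0]

db = [
    [("a", 2), ("b", 1), ("a", 3), ("c", 2)],
    [("b", 2), ("a", 1), ("c", 1)],
]
profit = {"a": 4, "b": 3, "c": 1}
print(sum(pattern_utility(["a", "c"], s, profit) for s in db))  # 19.0
```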

