Targeted Adaptable Sample for Accurate and Efficient Quantile Estimation in Non-Stationary Data Streams

The need to detect outliers or otherwise unusual data, which can be formalized as the estimation a particular quantile of a distribution, is an important problem that frequently arises in a variety of applications of pattern recognition, computer vision and signal processing. For example, our work was most proximally motivated by the practical limitations and requirements of many semi-automatic surveillance analytics systems that detect abnormalities in closed-circuit television (CCTV) footage using statistical models of low-level motion features. In this paper, we specifically address the problem of estimating the running quantile of a data stream with non-stationary stochasticity when the absolute (rather than asymptotic) memory for storing observations is severely limited. We make several major contributions: (i) we derive an important theoretical result that shows that the change in the quantile of a stream is constrained regardless of the stochastic properties of data; (ii) we describe a set of high-level design goals for an effective estimation algorithm that emerge as a consequence of our theoretical findings; (iii) we introduce a novel algorithm that implements the aforementioned design goals by retaining a sample of data values in a manner adaptive to changes in the distribution of data and progressively narrowing down its focus in the periods of quasi-stationary stochasticity; and (iv) we present a comprehensive evaluation of the proposed algorithm and compare it with the existing methods in the literature on both synthetic datasets and three large “real-world” streams acquired in the course of operation of an existing commercial surveillance system. Our results and their detailed analysis convincingly and comprehensively demonstrate that the proposed method is highly successful and vastly outperforms the existing alternatives, especially when the target quantile is high-valued and the available buffer capacity severely limited.

Download Full-text

Microcluster-Based Incremental Ensemble Learning for Noisy, Nonstationary Data Streams

Complexity ◽

10.1155/2020/6147378 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Sanmin Liu ◽

Shan Xue ◽

Fanzhen Liu ◽

Jieren Cheng ◽

Xiulai Li ◽

...

Keyword(s):

Ensemble Learning ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Majority Vote ◽

Stream Classification ◽

Model Stability ◽

Data Stream Classification ◽

Nonstationary Data ◽

Synthetic Datasets

Data stream classification becomes a promising prediction work with relevance to many practical environments. However, under the environment of concept drift and noise, the research of data stream classification faces lots of challenges. Hence, a new incremental ensemble model is presented for classifying nonstationary data streams with noise. Our approach integrates three strategies: incremental learning to monitor and adapt to concept drift; ensemble learning to improve model stability; and a microclustering procedure that distinguishes drift from noise and predicts the labels of incoming instances via majority vote. Experiments with two synthetic datasets designed to test for both gradual and abrupt drift show that our method provides more accurate classification in nonstationary data streams with noise than the two popular baselines.

Download Full-text

Detecting Metachanges in Data Streams from the Viewpoint of the MDL Principle

Entropy ◽

10.3390/e21121134 ◽

2019 ◽

Vol 21 (12) ◽

pp. 1134 ◽

Cited By ~ 1

Author(s):

Shintaro Fukushima ◽

Kenji Yamanishi

Keyword(s):

Data Streams ◽

Data Stream ◽

Minimum Description Length ◽

Detection Algorithm ◽

Change Points ◽

Detection Methods ◽

Code Length ◽

Warning Signals ◽

Mdl Principle ◽

Synthetic Datasets

This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points significantly vary, whereas metachanges along state mean that the magnitude of changes varies. It is practically important to detect metachanges because they may be early warning signals of important events. This paper introduces a novel notion of metachange statistics as a measure of the degree of a metachange. The key idea is to integrate metachanges along both time and state in terms of “code length” according to the minimum description length (MDL) principle. We develop an online metachange detection algorithm (MCD) based on the statistics to apply it to a data stream. With synthetic datasets, we demonstrated that MCD detects metachanges earlier and more accurately than existing methods. With real datasets, we demonstrated that MCD can lead to the discovery of important events that might be overlooked by conventional change detection methods.

Download Full-text

FlexSketch: Estimation of Probability Density for Stationary and Non-Stationary Data Streams

Sensors ◽

10.3390/s21041080 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1080

Author(s):

Namuk Park ◽

Songkuk Kim

Keyword(s):

Probability Distribution ◽

Probability Density ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Estimation Algorithm ◽

Accurate Estimation ◽

Probability Density Estimation ◽

Data History ◽

Changes Over Time

Efficient and accurate estimation of the probability distribution of a data stream is an important problem in many sensor systems. It is especially challenging when the data stream is non-stationary, i.e., its probability distribution changes over time. Statistical models for non-stationary data streams demand agile adaptation for concept drift while tolerating temporal fluctuations. To this end, a statistical model needs to forget old data samples and to detect concept drift swiftly. In this paper, we propose FlexSketch, an online probability density estimation algorithm for data streams. Our algorithm uses an ensemble of histograms, each of which represents a different length of data history. FlexSketch updates each histogram for a new data sample and generates probability distribution by combining the ensemble of histograms while monitoring discrepancy between recent data and existing models periodically. When it detects concept drift, a new histogram is added to the ensemble and the oldest histogram is removed. This allows us to estimate the probability density function with high update speed and high accuracy using only limited memory. Experimental results demonstrate that our algorithm shows improved speed and accuracy compared to existing methods for both stationary and non-stationary data streams.

Download Full-text

An Intuitionistic Calculus to Complex Abnormal Event Recognition on Data Streams

Security and Communication Networks ◽

10.1155/2021/3573753 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Zhao Lijun ◽

Hu Guiqiu ◽

Li Qingsheng ◽

Ding Guanhua

Keyword(s):

Data Streams ◽

Data Stream ◽

Parametric Representation ◽

Event Recognition ◽

Time Interval ◽

Time Data ◽

Multiple Systems ◽

Analysis System ◽

Data Analysis System ◽

High Level

Data mining in real-time data streams is associated with multiple types of uncertainty, which often leads the respective categorizers to make erroneous predictions related to the presence or absence of complex events. But recognizing complex abnormal events, even those that occur in extremely rare cases, offers significant support to decision-making systems. Therefore, there is a need for robust recognition mechanisms that will be able to predict or recognize when an abnormal event occurs or will occur on a data stream. Considering this need, this paper presents an Intuitionistic Tumbling Windows event calculus (ITWec) methodology. It is an innovative data analysis system that combines for the first time in the literature a set of multiple systems for Complex Abnormal Event Recognition (CAER). In the proposed system, the probabilities of the existence of a high-level complex abnormal event for each period are initially calculated nonparametrically, based on the probabilities of the low-level events associated with it. Because cumulative results are sought in consecutive, nonoverlapping sections of the data stream, the method uses the clearly defined rules of initialization and termination of the tumbling windows method, where there is an explicit determination of the time interval within which several blocks of a particular stream are investigated window. Finally, the number of maximum probable intervals in which an event is likely to occur based on a certain probability threshold is calculated, based on a parametric representation of intuitively fuzzy sets.

Download Full-text

An Efficient Data Stream Clustering Algorithm Based on Dynamic Grids

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.186.665 ◽

2011 ◽

Vol 186 ◽

pp. 665-670

Author(s):

Yun Wu ◽

Feng Gao

Keyword(s):

Data Streams ◽

Data Stream ◽

Arbitrary Shape ◽

Clustering Algorithm ◽

Stream Clustering ◽

Research Fields ◽

Efficient Data ◽

Data Stream Clustering ◽

Synthetic Datasets ◽

Clustering Data

Data mining based on data stream has become one of hot research fields. In this paper we present a novel algorithm for clustering data streams based on dynamic grids named DG-CluStream. DG-CluStream partitions and prunes grids dynamically, improves the accuracy of grids gradually through saving feature tuples of grids. The algorithm can discover clusters with arbitrary shape and is more efficient than those static methods due to a notable decrease on the number of the grids. Through fading coefficient, DG-CluStream can also deal with the problem of concept drifting efficiently. The experimental results on real datasets and synthetic datasets demonstrate promising availabilities of the approach.

Download Full-text

AFQN: approximate Qn estimation in data streams

Applied Intelligence ◽

10.1007/s10489-021-02614-w ◽

2021 ◽

Author(s):

Italo Epicoco ◽

Catiuscia Melle ◽

Massimo Cafaro ◽

Marco Pulimeno

Keyword(s):

Data Streams ◽

Data Stream ◽

Input Data ◽

Sliding Window ◽

Quantile Estimation ◽

Fast Detection ◽

Online Computation ◽

First Approximation ◽

Detection Of Outliers ◽

Novel Algorithm

AbstractWe present afqn (Approximate Fast Qn), a novel algorithm for approximate computation of the Qn scale estimator in a streaming setting, in the sliding window model. It is well-known that computing the Qn estimator exactly may be too costly for some applications, and the problem is a fortiori exacerbated in the streaming setting, in which the time available to process incoming data stream items is short. In this paper we show how to efficiently and accurately approximate the Qn estimator. As an application, we show the use of afqn for fast detection of outliers in data streams. In particular, the outliers are detected in the sliding window model, with a simple check based on the Qn scale estimator. Extensive experimental results on synthetic and real datasets confirm the validity of our approach by showing up to three times faster updates per second. Our contributions are the following ones: (i) to the best of our knowledge, we present the first approximation algorithm for online computation of the Qn scale estimator in a streaming setting and in the sliding window model; (ii) we show how to take advantage of our UDDSketch algorithm for quantile estimation in order to quickly compute the Qn scale estimator; (iii) as an example of a possible application of the Qn scale estimator, we discuss how to detect outliers in an input data stream.

Download Full-text

A proposal to enhance national capability to manage epidemics: The critical importance of expert statistical input including official statistics

Statistical Journal of the IAOS ◽

10.3233/sji-210808 ◽

2021 ◽

pp. 1-17

Author(s):

N.I. Fisher ◽

D.J. Trewin

Keyword(s):

Data Streams ◽

Age Groups ◽

Quantitative Information ◽

Economic Welfare ◽

Way Of Life ◽

Policy Makers ◽

The Social ◽

Novel Virus ◽

High Level ◽

Integration Analysis

Given the high level of global mobility, pandemics are likely to be more frequent, and with potentially devastating consequences for our way of life. With COVID-19, Australia is in relatively better shape than most other countries and is generally regarded as having managed the pandemic well. That said, we believe there is a critical need to start the process of learning from this pandemic to improve the quantitative information and related advice provided to policy makers. A dispassionate assessment of Australia’s health and economic response to the COVID-19 pandemic reveals some important inadequacies in the data, statistical analysis and interpretation used to guide Australia’s preparations and actions. For example, one key shortcoming has been the lack of data to obtain an early understanding of the extent of asymptomatic and mildly symptomatic cases or the differences across age groups, occupations or ethnic groups. Minimising the combined health, social and economic impacts of a novel virus depends critically on ongoing acquisition, integration, analysis, interpretation and presentation of a variety of data streams to inform the development, execution and monitoring of appropriate strategies. The article captures the essential quantitative components of such an approach for each of the four basic phases, from initial detection to post-pandemic. It also outlines the critical steps in each stage to enable policy makers to deal more efficiently and effectively with future such events, thus enhancing both the social and the economic welfare of its people. Although written in an Australian context, we believe most elements would apply to other countries as well.

Download Full-text

Recurring concept memory management in data streams: exploiting data stream concept evolution to improve performance and transparency

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00736-w ◽

2021 ◽

Author(s):

Ben Halstead ◽

Yun Sing Koh ◽

Patricia Riddle ◽

Russel Pears ◽

Mykola Pechenizkiy ◽

...

Keyword(s):

Data Streams ◽

Data Stream ◽

Memory Management ◽

Improve Performance ◽

Concept Evolution

Download Full-text

Analysis of Data Stream Processing At Edge Layer for Internet of Things

Journal of ISMAC - June 2019 ◽

10.36548/jismac.2020.1.003 ◽

2020 ◽

Vol 2 (1) ◽

pp. 26-37

Author(s):

Dr. Pasumponpandian

Keyword(s):

Internet Of Things ◽

Data Streams ◽

Data Stream ◽

Smart Cities ◽

Stream Processing ◽

Middle Layer ◽

Cloud Services ◽

Decentralized Systems ◽

Data Stream Processing ◽

Edge Layer

The progress of internet of things at a rapid pace and simultaneous development of the technologies and the processing capabilities has paved way for the development of decentralized systems that are relying on cloud services. Though the decentralized systems are founded on cloud complexities still prevail in transferring all the information’s that are been sensed through the IOT devices to the cloud. This because of the huge streams of information’s gathered by certain applications and the expectation to have a timely response, incurring minimized delay, computing energy and enhanced reliability. So this kind of decentralization has led to the development of middle layer between the cloud and the IOT, and was termed as the Edge layer, meaning bringing down the service of the cloud to the user edge. The paper puts forth the analysis of the data stream processing in the edge layer taking in the complexities involved in the computing the data streams of IOT in an edge layer and puts forth the real time analytics in the edge layer to examine the data streams of the internet of things offering a data- driven insight for parking system in the smart cities.

Download Full-text

COMPREHENSIVE EVALUATION OF THE WORK AT HEIGHT

Acta Metallurgica Slovaca ◽

10.12776/ams.v24i1.1006 ◽

2018 ◽

Vol 24 (1) ◽

pp. 100

Author(s):

Lenka Kissiková ◽

Ivan Dlugoš

Keyword(s):

Personal Protective Equipment ◽

Comprehensive Evaluation ◽

The State ◽

Test Facility ◽

Protective Equipment ◽

Fatal Accidents ◽

Mineral Oils ◽

High Level ◽

And Training

<p>The article evaluates the issue of work at heights in industry and reports statistics on fatal accidents at work, the source of which is a fall from above. It also deals with the assessment of the state of personal protective equipment already in use - for example, safety and working ropes and other accessories contaminated with facade paints, lyes, acids or mineral oils and their misuse and dangerous use. The state of the assessed personal protective equipment used was assessed in a test facility on test machines, where the safety of these devices was verified under certain conditions. The article also mentions the issue of inadequate training and training of high-level workers and the lack of training centers that carry out such training.</p>

Download Full-text