A State-of-the-Art Review of Data Stream Anonymization Schemes

Author(s):  
Aderonke B. Sakpere ◽  
Anne V. D. M. Kayem

Streaming data emerges from different electronic sources and needs to be processed in real time with minimal delay. Data streams can yield hidden and useful knowledge patterns when mined and analyzed. In spite of these benefits, the issue of privacy needs to be addressed before streaming data is released for mining and analysis purposes. Several techniques have emerged to address data privacy concerns. K-anonymity has received considerable attention over other privacy-preserving techniques because of its simplicity and efficiency in protecting data. Yet, k-anonymity cannot be applied directly to continuous data (data streams) because of their transient nature. In this chapter, the authors discuss the challenges faced by k-anonymity algorithms in enforcing privacy on data streams and review existing privacy techniques for handling data streams.
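To make the core difficulty concrete, the minimal Python sketch below (all names and parameters hypothetical, not the authors' scheme) buffers stream tuples into a fixed-size window and coarsens the (zip code, age) quasi-identifiers until every equivalence class holds at least k records; a real stream anonymizer would additionally bound how long any tuple may wait before release.

```python
from collections import defaultdict, deque

def generalize_age(age, level):
    """Coarsen an age: level 0 = exact, 1 = ten-year band, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return "*"

def anonymize_window(window, k):
    """Raise the generalization level until every equivalence class of
    (zip, age) quasi-identifiers contains at least k tuples; return the
    released classes and their sizes."""
    for level in range(3):
        classes = defaultdict(int)
        for zip_code, age in window:
            zip_gen = zip_code[:3] + "**" if level else zip_code
            classes[(zip_gen, generalize_age(age, level))] += 1
        if all(size >= k for size in classes.values()):
            return dict(classes)
    return {("*", "*"): len(window)}   # full suppression as a last resort

# Usage: buffer six stream tuples, then release them 2-anonymized.
stream = [("02139", 23), ("02139", 27), ("02141", 31),
          ("02141", 35), ("02139", 24), ("02141", 33)]
window = deque(maxlen=6)
for record in stream:
    window.append(record)
print(anonymize_window(list(window), k=2))
```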

Information ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 24
Author(s):  
Frederic Stahl ◽  
Thien Le ◽  
Atta Badii ◽  
Mohamed Medhat Gaber

This paper introduces a new and expressive algorithm for inducing descriptive rule-sets from streaming data in real-time in order to describe frequent patterns explicitly encoded in the stream. Data Stream Mining (DSM) is concerned with the automatic analysis of data streams in real-time. Rapid flows of data challenge the state-of-the-art processing and communication infrastructure, hence the motivation for research and innovation into real-time algorithms that analyse data streams on-the-fly and can automatically adapt to concept drifts. To date, DSM techniques have largely focused on predictive data mining applications that aim to forecast the value of a particular target feature of unseen data instances, answering questions such as whether a credit card transaction is fraudulent or not. A real-time, expressive and descriptive Data Mining technique for streaming data has not been previously established as part of the DSM toolkit. This has motivated the work reported in this paper, which has resulted in developing and validating a Generalised Rule Induction (GRI) tool, thus producing expressive rules as explanations that can be easily understood by human analysts. The expressiveness of decision models in data streams serves the objective of transparency, underpinning the vision of 'explainable AI', and yet is an area of research that has attracted less attention despite being of high practical importance. The algorithm introduced and described in this paper is termed Fast Generalised Rule Induction (FGRI). FGRI is able to induce descriptive rules incrementally for raw data from both categorical and numerical features. FGRI is able to adapt rule-sets on the fly to changes in the patterns encoded in the data stream (concept drift) as new data arrives and can thus be applied continuously in real-time. The paper also provides a theoretical, qualitative and empirical evaluation of FGRI.
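FGRI itself is specified in the paper; the toy sketch below (hypothetical identifiers, not the authors' algorithm) only conveys the flavour of incremental descriptive rule induction: support and confidence counts for candidate "IF a THEN b" rules are updated per arriving transaction, and exponential decay lets rules describing an outdated concept fade away, which is one simple way to react to concept drift.

```python
from collections import defaultdict
from itertools import combinations

class ToyRuleInducer:
    """Maintain decayed support counts for (antecedent -> consequent)
    item pairs over a stream of transactions, so that rules reflecting
    an outdated concept gradually fade away."""

    def __init__(self, decay=0.99, min_conf=0.6, min_support=3.0):
        self.decay = decay
        self.min_conf = min_conf
        self.min_support = min_support
        self.pair_count = defaultdict(float)   # (a, b) -> decayed support
        self.item_count = defaultdict(float)   # a -> decayed support

    def update(self, transaction):
        for counts in (self.pair_count, self.item_count):
            for key in counts:
                counts[key] *= self.decay      # age all existing counts
        for item in transaction:
            self.item_count[item] += 1.0
        for a, b in combinations(sorted(transaction), 2):
            self.pair_count[(a, b)] += 1.0
            self.pair_count[(b, a)] += 1.0

    def rules(self):
        """Yield descriptive 'IF a THEN b' rules passing both thresholds."""
        for (a, b), support in self.pair_count.items():
            if support >= self.min_support:
                conf = support / self.item_count[a]
                if conf >= self.min_conf:
                    yield f"IF {a} THEN {b}  (conf={conf:.2f})"

inducer = ToyRuleInducer()
for txn in [{"bread", "butter"}, {"bread", "butter", "jam"},
            {"bread", "butter"}, {"tea", "milk"}, {"bread", "butter"}]:
    inducer.update(txn)
print(list(inducer.rules()))
```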


Author(s):  
Enrico Di Minin ◽  
Christoph Fink ◽  
Anna Hausmann ◽  
Jens Kremer ◽  
Ritwik Kulkarni

Author(s):  
Prasanna Lakshmi Kompalli

Data arriving continuously from different sources is referred to as a data stream. Data stream mining is an online learning setting in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time, but they have also created many new challenges for researchers. The main features of this type of data are that it flows fast, arrives in large and continuously growing volumes, and may change its characteristics over time, a phenomenon termed concept drift. Because the literature on these problems is scattered, isolating the relevant work can be a grueling task for researchers and practitioners. This chapter addresses the problems in mining data streams with concept drift and aims to provide a consolidated overview of the techniques used for data stream mining under concept drift.
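As a concrete example of the kind of technique such an overview covers, the sketch below implements simplified DDM-style drift detection (class and parameter names are hypothetical): the classifier's running error rate is monitored, and drift is signalled when the rate rises significantly above the best level observed so far.

```python
import math
import random

class DriftDetector:
    """Simplified DDM-style detector: track the running error rate p and
    its standard deviation s, and signal drift once p + s exceeds the
    best (minimum) p_min + 3 * s_min observed so far."""

    MIN_INSTANCES = 30

    def __init__(self):
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add(self, was_error):
        self.n += 1
        self.errors += int(was_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.MIN_INSTANCES:
            return False
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3 * self.s_min:
            self.reset()               # start fresh after the drift
            return True
        return False

# Usage: the error rate jumps from ~10% to ~60% halfway through.
random.seed(0)
detector = DriftDetector()
for i in range(2000):
    err = random.random() < (0.1 if i < 1000 else 0.6)
    if detector.add(err):
        print("drift detected at instance", i)
```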


2020 ◽  
Vol 8 (4) ◽  
pp. 63-73
Author(s):  
Sikha Bagui ◽  
Katie Jin

This survey provides a thorough enumeration and analysis of existing methods for data stream processing, organized around the challenges facing streaming data: preprocessing of streaming data, detecting and dealing with concept drift, data reduction in the face of data streams, approximate queries, and blocking operations.
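Data reduction, one of the challenges listed above, is classically handled with reservoir sampling, which maintains a fixed-size uniform random sample of an unbounded stream; a minimal sketch of Vitter's Algorithm R:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Vitter's Algorithm R: after seeing i items, each item has
    probability k / i of occupying one of the k reservoir slots."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)   # uniform index in [0, i]
            if j < k:
                reservoir[j] = item    # replace a random slot
    return reservoir

# Usage: a uniform 10-item sample of a million-element stream.
print(reservoir_sample(range(1_000_000), k=10))
```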


2021 ◽  
Author(s):  
Christian Nordahl ◽  
Veselka Boeva ◽  
Håkan Grahn ◽  
Marie Persson Netz

Data has become an integral part of our society in recent years, arriving faster and in larger quantities than ever before. Traditional clustering algorithms rely on the availability of the entire dataset to model it correctly and efficiently. Such requirements cannot be met in the data stream clustering scenario, where data arrives continuously and must be analyzed on arrival. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities in clustering evolving data streams. Our results show that EvolveCluster captures evolving data stream behaviors and adapts accordingly.
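EvolveCluster's exact procedure is given in the paper; the scikit-learn sketch below (hypothetical, not the authors' implementation) illustrates only the general segment-wise idea behind evolutionary stream clustering: each arriving segment is clustered with the previous segment's centroids as the starting point, so the model evolves rather than restarts.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_stream(segments, n_clusters=2, seed=0):
    """Cluster consecutive stream segments, warm-starting each KMeans
    run with the previous segment's centroids so the clustering evolves."""
    centroids = None
    for t, segment in enumerate(segments):
        if centroids is None:
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        else:
            km = KMeans(n_clusters=n_clusters, init=centroids, n_init=1)
        km.fit(segment)
        centroids = km.cluster_centers_
        print(f"segment {t} centroids:\n{np.round(centroids, 2)}")

# Usage: two Gaussian clusters whose means drift a little per segment.
rng = np.random.default_rng(0)
segments = [np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                       for loc in ((0.5 * t, 0.0), (4.0, 4.0 - 0.5 * t))])
            for t in range(3)]
cluster_stream(segments)
```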


Author(s):  
J. C. Whittier ◽  
S. Nittel ◽  
I. Subasinghe

With live streaming sensors and sensor networks, increasingly large numbers of individual sensors are deployed in physical space. Sensor data streams are a fundamentally novel mechanism for delivering observations to information systems. They enable us to represent spatio-temporal continuous phenomena such as radiation accidents, toxic plumes, or earthquakes almost as instantaneously as they happen in the real world. Sensor data streams discretely sample an earthquake, while the earthquake is continuous over space and time. Programmers attempting to integrate many streams to analyze earthquake activity and scope need to write tedious application code to integrate potentially very large sets of asynchronously sampled, concurrent streams. In previous work, we proposed the field stream data model (Liang et al., 2016) for data stream engines. Abstracting the stream of an individual sensor as a temporal field, the field represents the Earth's movement at the sensor position as continuous. This simplifies analysis across many sensors significantly. In this paper, we undertake a feasibility study of using the field stream model and the open source Data Stream Engine (DSE) Apache Spark (Apache Spark, 2017) to implement real-time earthquake event detection with a subset of the 250 GPS sensor data streams of the Southern California Integrated GPS Network (SCIGN). The field-based real-time stream queries compute maximum displacement values over the latest query window of each stream and relate spatially neighboring streams to identify earthquake events and their extent. Further, we correlated the detected events with a USGS earthquake event feed. The query results are visualized in real time.
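The windowed displacement query described above can be approximated in plain Python, leaving the Spark specifics to the paper (all names and thresholds below are hypothetical): per sensor, keep the last few GPS fixes and report the maximum displacement from the window's first fix; a real detector would further require spatially neighboring sensors to agree before declaring an event.

```python
from collections import defaultdict, deque
import math

WINDOW = 5          # GPS fixes kept per sensor
THRESHOLD = 0.05    # displacement (m) treated as significant

windows = defaultdict(lambda: deque(maxlen=WINDOW))

def ingest(sensor_id, x, y):
    """Append a GPS fix and return the maximum displacement within the
    sensor's current window, measured from the window's first fix."""
    win = windows[sensor_id]
    win.append((x, y))
    x0, y0 = win[0]
    return max(math.hypot(px - x0, py - y0) for px, py in win)

def event_candidates(readings):
    """Return sensors whose windowed displacement exceeds THRESHOLD; a
    real detector would also require spatial neighbors to agree."""
    return {sid for sid, x, y in readings if ingest(sid, x, y) > THRESHOLD}

# Usage: sensor "P123" shifts abruptly while its neighbor stays put.
stream = [("P123", 0.00, 0.00), ("P456", 10.0, 10.0),
          ("P123", 0.00, 0.01), ("P456", 10.0, 10.0),
          ("P123", 0.08, 0.06), ("P456", 10.0, 10.0)]
for t in range(0, len(stream), 2):
    print(event_candidates(stream[t:t + 2]))
```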


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on the density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time; for example, a new cluster of data points may appear in the data stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
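TADILOF's incremental, time-aware machinery is the paper's contribution; the baseline it improves on can be sketched with scikit-learn's LocalOutlierFactor re-fitted over a sliding window, a naive approach whose full recomputation per point is exactly what incremental LOF variants avoid (the window size and neighbor count below are arbitrary).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sliding_window_lof(stream, window=100, k=10):
    """Naive baseline: re-fit LOF on the last `window` points and check
    whether the newest point is flagged; incremental variants such as
    TADILOF avoid this full recomputation at every arrival."""
    buf = []
    for idx, x in enumerate(stream):
        buf.append(x)
        buf = buf[-window:]
        if len(buf) > k:
            lof = LocalOutlierFactor(n_neighbors=k)
            labels = lof.fit_predict(np.asarray(buf))
            if labels[-1] == -1:       # -1 marks an outlier
                yield idx

# Usage: 300 Gaussian points with one obvious outlier injected.
rng = np.random.default_rng(1)
data = rng.normal(0, 1, size=(300, 2))
data[150] = (8.0, 8.0)
for idx in sliding_window_lof(data):
    print("outlier flagged at stream index", idx)
```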


2018 ◽  
Vol 7 (2) ◽  
pp. 270 ◽  
Author(s):  
Shyam Sunder Reddy K ◽  
Shoba Bindu C

Real-time data stream clustering has been widely used in many fields, as it can extract useful information from massive sets of data. Most existing density-based algorithms cluster data streams based on the density within micro-clusters. These algorithms ignore the data density in the areas between micro-clusters and recluster the micro-clusters based on erroneous assumptions about the distribution of data within and between them, leading to poor clustering results. This paper describes a novel density-based clustering algorithm for evolving data streams called MCDAStream, which clusters the data stream based on micro-cluster density and the attraction between micro-clusters. The attraction of micro-clusters characterizes the positional information of the data points in each micro-cluster. We generate better clustering results by considering both micro-cluster density and micro-cluster attraction. The quality of the proposed algorithm is evaluated on various synthetic and real datasets with distinct characteristics and quality metrics.
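Micro-clusters in this family of algorithms are commonly maintained as cluster feature vectors (N, LS, SS), which support constant-time insertion; the sketch below (hypothetical names, not MCDAStream itself) shows this bookkeeping, on top of which density and attraction statistics between micro-clusters would be layered.

```python
import numpy as np

class MicroCluster:
    """Cluster feature vector (N, LS, SS): point count, linear sum and
    squared sum. Insertion is O(1); centroid and radius are derived."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p * p

    def insert(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p * p

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # RMS deviation from the centroid, computed from N, LS, SS alone
        var = self.ss / self.n - self.centroid() ** 2
        return float(np.sqrt(np.maximum(var, 0.0).sum()))

def assign(point, clusters, max_radius=1.0):
    """Insert the point into the nearest micro-cluster if it is close
    enough; otherwise open a new micro-cluster for it."""
    p = np.asarray(point, dtype=float)
    if clusters:
        nearest = min(clusters, key=lambda c: np.linalg.norm(c.centroid() - p))
        if np.linalg.norm(nearest.centroid() - p) <= max_radius:
            nearest.insert(p)
            return
    clusters.append(MicroCluster(p))

clusters = []
for point in [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (0.1, -0.1)]:
    assign(point, clusters)
print(len(clusters), [np.round(c.centroid(), 2) for c in clusters])
```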


Author(s):  
Yulin Ding ◽  
Hui Lin ◽  
Rongrong Li

Recent breakthroughs in sensor networks have made it possible to collect and assemble increasing amounts of real-time observational data, observing dynamic phenomena at previously impossible time and space scales. Real-time observational data streams present potentially profound opportunities for real-time applications in disaster mitigation and emergency response by providing accurate and timely estimates of the environment's status. However, the data are always subject to inevitable anomalies (including errors and anomalous changes/events) caused by various effects produced by the environment being monitored. These "big but dirty" real-time observational data streams can rarely achieve their full potential in downstream real-time models or applications due to their low data quality. Therefore, timely and meaningful online data cleaning is a necessary prerequisite to ensure the quality, reliability, and timeliness of real-time observational data.

In general, a straightforward streaming data cleaning approach is to define various types of models/classifiers representing the normal behavior of sensor data streams and then classify incoming data as normal or erroneous according to its deviation from this model. The effectiveness of such models is affected by dynamic changes in the deployed environments. Due to the changing nature of the complicated process being observed, real-time observational data is characterized by diversity and dynamics, showing typical Big (Geo) Data characteristics. Dynamics and diversity are reflected not only in the data values but also in the complicated changing patterns of the data distributions. This means the distribution pattern of real-time observational data is not stationary or static but changing and dynamic. After the data pattern changes, the model must be adapted over time to cope with the changing patterns of real-time data streams. Otherwise, the model will not fit subsequent observational data streams, which may lead to large estimation errors. To achieve the best generalization error, an important challenge for a data cleaning methodology is to characterize the behavior of data stream distributions and adaptively update the model to include new information and remove old information. The complicated changing properties of the data invalidate traditional data cleaning methods, which rely on the assumption of a stationary data distribution, and drive the need for more dynamic and adaptive online data cleaning methods.

To overcome these shortcomings, this paper presents a change-semantics-constrained online filtering method for real-time observational data. Based on the principle that the filter parameter should vary in accordance with the data change patterns, this paper embeds a semantic description, which quantitatively depicts the change patterns in the data distribution, to self-adapt the filter parameter automatically. Real-time observational water level data streams from different precipitation scenarios are selected for testing. Experimental results show that, by means of this method, more accurate and reliable water level information can be obtained, which is a prerequisite for scientific and prompt flood assessment and decision-making.
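The paper's change-semantics machinery is domain-specific, but the underlying principle, letting the filter parameter vary with the detected change pattern, can be illustrated with a toy adaptive exponential smoother (all parameters hypothetical): innovations that are small relative to the running noise level are smoothed heavily, while a large innovation, such as an abrupt rise in water level, switches the filter to a high gain so it tracks the change quickly.

```python
import random

def adaptive_smoother(stream, alpha_low=0.05, alpha_high=0.6, k=3.0):
    """Exponential smoother whose gain adapts to the change pattern:
    innovations small relative to the running noise level are smoothed
    heavily, while a large innovation (a genuine level shift) switches
    to the high gain so the estimate tracks the change quickly."""
    est, var = None, 1.0
    for x in stream:
        if est is None:
            est = x                    # initialize on the first sample
        else:
            innovation = x - est
            var = 0.95 * var + 0.05 * innovation ** 2   # noise estimate
            alpha = alpha_high if abs(innovation) > k * var ** 0.5 else alpha_low
            est += alpha * innovation
        yield est

# Usage: noisy water-level readings with an abrupt rise at t = 50.
random.seed(3)
levels = [2.0 + random.gauss(0, 0.05) + (1.5 if t >= 50 else 0.0)
          for t in range(100)]
smoothed = list(adaptive_smoother(levels))
print(round(smoothed[49], 2), round(smoothed[60], 2))
```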

