OptStream: Releasing Time Series Privately

2019 ◽  
Vol 65 ◽  
pp. 423-456
Author(s):  
Ferdinando Fioretto ◽  
Pascal Van Hentenryck

Many applications of machine learning and optimization operate on data streams. While these datasets are fundamental to fueling decision-making algorithms, they often contain sensitive information about individuals, and their usage poses significant privacy risks. Motivated by an application in energy systems, this paper presents OptStream, a novel algorithm for releasing differentially private data streams under the w-event model of privacy. OptStream is a four-step procedure consisting of sampling, perturbation, reconstruction, and post-processing modules. First, the sampling module selects a small set of points to access in each period of interest. Then, the perturbation module adds noise to the sampled data points to guarantee privacy. Next, the reconstruction module re-assembles non-sampled data points from the perturbed sample points. Finally, the post-processing module uses convex optimization over the privacy-preserving output of the previous modules, as well as the privacy-preserving answers of additional queries on the data stream, to improve accuracy by redistributing the added noise. OptStream is evaluated on a test case involving the release of a real data stream from the largest European transmission operator. Experimental results show that OptStream may not only improve the accuracy of state-of-the-art methods by at least one order of magnitude but also support accurate load forecasting on the privacy-preserving data.
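
A minimal sketch of the sample/perturb/reconstruct/post-process pipeline described above, assuming Laplace perturbation, fixed-rate sampling, linear-interpolation reconstruction, and a simple non-negativity projection standing in for OptStream's convex post-processing (all names and parameters here are illustrative, not the authors' implementation):

```python
import numpy as np

def release_window(stream_window, sample_every=4, epsilon=1.0, sensitivity=1.0):
    """Privately release one w-length window of a data stream (a sketch)."""
    w = len(stream_window)
    # 1. Sampling: access only a subset of points in the window.
    idx = np.arange(0, w, sample_every)
    # 2. Perturbation: Laplace noise calibrated to the privacy budget.
    noisy = stream_window[idx] + np.random.laplace(0.0, sensitivity / epsilon, size=len(idx))
    # 3. Reconstruction: rebuild non-sampled points by interpolation.
    released = np.interp(np.arange(w), idx, noisy)
    # 4. Post-processing: project onto known constraints (e.g., loads >= 0).
    #    OptStream instead solves a convex optimization over extra queries.
    return np.maximum(released, 0.0)

private = release_window(np.abs(np.random.randn(32)) * 100.0)
```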


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent outlier detection methods based on the density-based local outlier factor (LOF) algorithm do not account for variations in the data over time; for example, a new cluster of data points may appear in the stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means for estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
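
For context, a naive sliding-window LOF baseline is sketched below using scikit-learn's LocalOutlierFactor. It recomputes LOF from scratch for each arriving point, which is exactly the cost that TADILOF's incremental updates and approximate LOF are designed to avoid; it is not the authors' algorithm:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sliding_lof(stream, window=200, k=20):
    """Flag each arriving point as outlier/inlier by recomputing LOF
    over the most recent `window` points (naive baseline)."""
    flags, buf = [], []
    for x in stream:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)  # time-aware: the oldest point expires
        if len(buf) >= k + 1:
            labels = LocalOutlierFactor(n_neighbors=k).fit_predict(np.asarray(buf))
            flags.append(labels[-1] == -1)  # is the newest point an outlier?
        else:
            flags.append(False)  # too few points to score reliably
    return flags
```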


2020 ◽  
Vol 11 (2) ◽  
pp. 19-36
Author(s):  
Umesh Kokate ◽  
Arviand V. Deshpande ◽  
Parikshit N. Mahalle

Evolution of data in a data stream environment generates patterns at different time instances. Cluster formation changes over time as the behaviour and membership of clusters change. Data stream clustering (DSC) allows us to investigate these changes in group behaviour. Changes in the behaviour of group members over time lead to the formation of new clusters and may make old clusters extinct; extinct clusters may also recur later. The problem is to identify and record these change patterns in evolving data streams. The knowledge obtained from these change patterns is then used for trend analysis over evolving data streams. To address this flexible clustering requirement, a density-based clustering method is proposed to dynamically cluster evolving data streams. A decay factor identifies the formation of new clusters and the diminishing of older clusters as data points arrive, indicating trends in the evolving stream, as sketched below.
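
A minimal sketch of decay-factor fading for micro-clusters (the exponential fading scheme follows the common DenStream-style convention; class and parameter names are illustrative, not this paper's exact formulation):

```python
class MicroCluster:
    """Density micro-cluster whose weight decays over time (a sketch)."""

    def __init__(self, t, decay=0.01):
        self.weight = 1.0
        self.last_update = t
        self.decay = decay

    def fade(self, t):
        # Exponential decay: clusters that stop receiving points diminish.
        self.weight *= 2.0 ** (-self.decay * (t - self.last_update))
        self.last_update = t

    def absorb(self, t):
        # A new point arrives: fade first, then reinforce the cluster.
        self.fade(t)
        self.weight += 1.0

# A cluster that stops absorbing points fades below a pruning threshold
# (extinction); a newly dense region spawns a fresh MicroCluster.
```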


2020 ◽  
Vol 10 (4) ◽  
pp. 287-298
Author(s):  
Piotr Duda ◽  
Krzysztof Przybyszewski ◽  
Lipo Wang

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of occurring concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
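
A hedged sketch of feature-importance-based drift detection in this spirit, using scikit-learn's random forest importances (the divergence measure and threshold are illustrative stand-ins, not RFFI's exact criterion):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importance_drift(X_old, y_old, X_new, y_new, threshold=0.15):
    """Flag concept drift when per-feature importances of forests trained
    on consecutive data chunks diverge (illustrative sketch)."""
    rf_old = RandomForestClassifier(n_estimators=50).fit(X_old, y_old)
    rf_new = RandomForestClassifier(n_estimators=50).fit(X_new, y_new)
    # L1 distance between the two normalized importance vectors.
    shift = np.abs(rf_old.feature_importances_ - rf_new.feature_importances_).sum()
    return shift > threshold, shift
```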


Author(s):  
LAKSHMI PRANEETHA

Nowadays, data streams are enormous and fast-changing, and their applications range from basic scientific and analytical tasks to critical business and financial ones. Useful information is abstracted from the stream and represented in the form of micro-clusters in the online phase; in the offline phase, micro-clusters are merged to form macro-clusters. The DBSTREAM technique captures the density between micro-clusters by means of a shared density graph in the online phase. The density data in this graph are then used in re-clustering to improve the formation of clusters, but DBSTREAM takes more time in handling corrupted data points. In this paper, an early pruning algorithm is applied before pre-processing of the information, and a Bloom filter is used for recognizing corrupted information. Our experiments on real-time datasets show that this approach improves the efficiency of macro-clusters by 90% and generates more micro-clusters within a short time.
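
A minimal Bloom filter for early pruning of known-corrupted records might look as follows (the filter size, hash count, and record-key format are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for recognizing known-corrupted record IDs
    before pre-processing (a sketch)."""

    def __init__(self, m=10_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # No false negatives; a tunable rate of false positives.
        return all(self.bits[p] for p in self._positions(item))

corrupted = BloomFilter()
corrupted.add("sensor-17:2020-06-01T12:00")
# Early pruning: skip any record the filter flags before clustering it.
```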


2020 ◽  
Vol 501 (2) ◽  
pp. 1803-1822
Author(s):  
Seunghwan Lim ◽  
Douglas Scott ◽  
Arif Babul ◽  
David J Barnes ◽  
Scott T Kay ◽  
...  

As progenitors of the most massive objects, protoclusters are key to tracing the evolution and star formation history of the Universe, and are responsible for ≳20 per cent of the cosmic star formation at z > 2. Using a combination of state-of-the-art hydrodynamical simulations and empirical models, we show that current galaxy formation models do not produce enough star formation in protoclusters to match observations. We find that the star formation rates (SFRs) predicted from the models are an order of magnitude lower than what is seen in observations, despite the relatively good agreement found for their mass-accretion histories, specifically that they lie on an evolutionary path to become Coma-like clusters at z ≃ 0. Using a well-studied protocluster core at z = 4.3 as a test case, we find that the star formation efficiency of protocluster galaxies is higher than predicted by the models. We show that a large part of the discrepancy can be attributed to a dependence of SFR on the numerical resolution of the simulations, with a roughly factor-of-3 drop in SFR when the spatial resolution decreases by a factor of 4. We also present predictions up to z ≃ 7. Compared to lower redshifts, we find that centrals (the most massive member galaxies) are more distinct from the other galaxies, while protocluster galaxies are less distinct from field galaxies. All these results suggest that, as a rare and extreme population at high z, protoclusters can help constrain galaxy formation models tuned to match the average population at z ≃ 0.


METRON ◽  
2021 ◽  
Author(s):  
Giovanni Saraceno ◽  
Claudio Agostinelli ◽  
Luca Greco

A weighted likelihood technique for robust estimation of multivariate wrapped distributions of data points scattered on a p-dimensional torus is proposed. The occurrence of outliers in the sample at hand can badly compromise inference for standard techniques such as the maximum likelihood method. Therefore, there is a need to handle such model inadequacies in the fitting process by a robust technique and an effective downweighting of observations not following the assumed model. Furthermore, employing a robust method can help in situations of hidden and unexpected substructures in the data. Here, it is suggested to build a set of data-dependent weights based on the Pearson residuals and to solve the corresponding weighted likelihood estimating equations. In particular, robust estimation is carried out by using a Classification EM algorithm whose M-step is enhanced by the computation of weights based on current parameter values. The finite-sample behavior of the proposed method has been investigated by a Monte Carlo numerical study and real data examples.
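
The weighting scheme can be sketched in the standard weighted-likelihood form of Markatou et al., on which such methods are typically based (the exact density smoothing and residual adjustment function used in the paper may differ):

```latex
% Pearson residual of x_i: smoothed data density vs smoothed model density
\delta_i = \frac{\hat{f}^{*}(x_i)}{m^{*}(x_i;\theta)} - 1
% Data-dependent weight via a residual adjustment function A(\cdot)
w_i = \min\!\left\{ 1,\; \frac{\left[ A(\delta_i) + 1 \right]^{+}}{\delta_i + 1} \right\}
% Weighted likelihood estimating equations, with u the score function
\sum_{i=1}^{n} w_i \, u(x_i;\theta) = 0
```

Observations consistent with the model have δ close to 0 and weight close to 1; gross outliers receive weights near 0 and are effectively removed from the fit.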


Author(s):  
STEFANO MERLER ◽  
BRUNO CAPRILE ◽  
CESARE FURLANELLO

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy in order to avoid overfitting in classification tasks on noisy data. The method is based on a notion of easy and hard training patterns as emerging from an analysis of the dynamical evolution of AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.
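
A hedged sketch of hardness-based pruning in this spirit: hardness is proxied here by misclassification frequency across boosting rounds rather than the paper's weight-trajectory measure, and the cutoff is a fixed fraction rather than the automatically selected threshold:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def prune_hard_examples(X, y, drop_frac=0.1, rounds=50):
    """Drop the hardest training patterns before refitting (a sketch).
    X, y are numpy arrays; hardness = how often a point is misclassified
    by the partial ensembles across boosting rounds."""
    ada = AdaBoostClassifier(n_estimators=rounds).fit(X, y)
    errors = np.zeros(len(y))
    for staged in ada.staged_predict(X):  # ensemble predictions per round
        errors += (staged != y)
    # Keep the easiest (1 - drop_frac) fraction of the training set.
    keep = np.argsort(errors)[: len(y) - int(drop_frac * len(y))]
    return X[keep], y[keep]
```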


2021 ◽  
Vol 11 (12) ◽  
pp. 3164-3173
Author(s):  
R. Indhumathi ◽  
S. Sathiya Devi

Data sharing is essential in present-day biomedical research, where large quantities of medical information are gathered and used for different objectives of analysis and study. Because such collections are large, anonymity is essential: it is important to preserve privacy and prevent leakage of patients' sensitive information. Most anonymization methods, such as generalisation, suppression and perturbation, are proposed to overcome information leakage, but they degrade the utility of the collected data; during data sanitization, utility is automatically diminished. Privacy-preserving data publishing thus faces the main drawback of maintaining a trade-off between privacy and data utility. To address this issue, an efficient algorithm called Anonymization based on Improved Bucketization (AIB) is proposed, which increases the utility of published data while maintaining privacy. The bucketization technique is used in this paper with the intervention of a clustering method. The proposed work is divided into four stages: (i) vertical and horizontal partitioning; (ii) assigning a sensitivity index to attributes in the cluster; (iii) verifying each cluster against a privacy threshold; (iv) examining for privacy breaches in the quasi-identifiers (QI). To increase the utility of published data, the threshold value is determined based on the distribution of elements in each attribute, and the anonymization method is applied only to the specific QI elements. As a result, data utility is improved. Finally, the evaluation results validated the design and demonstrated that it is effective in improving data utility.
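
A minimal sketch of bucketization-style publishing in this spirit (column names, the grouping rule, and the distinct-values threshold are illustrative assumptions, not the AIB algorithm's exact procedure):

```python
def bucketize(records, qi_cols, sensitive_col, min_distinct=3):
    """Vertically partition quasi-identifiers from the sensitive attribute,
    then group rows into buckets that each contain at least `min_distinct`
    distinct sensitive values (an l-diversity-style threshold). A sketch:
    `records` is a list of dicts."""
    buckets, current, seen = [], [], set()
    for rec in records:
        current.append(rec)
        seen.add(rec[sensitive_col])
        if len(seen) >= min_distinct:
            buckets.append(current)
            current, seen = [], set()
    if current:  # fold any leftover rows into the last complete bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    # Publish two tables linked only by bucket ID, breaking the direct
    # row-level link between quasi-identifiers and sensitive values.
    qi_table = [({c: r[c] for c in qi_cols}, bid)
                for bid, b in enumerate(buckets) for r in b]
    sens_table = [(r[sensitive_col], bid)
                  for bid, b in enumerate(buckets) for r in b]
    return qi_table, sens_table
```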

