OptStream: Releasing Time Series Privately

2019 ◽  
Vol 65 ◽  
pp. 423-456
Author(s):  
Ferdinando Fioretto ◽  
Pascal Van Hentenryck

Many applications of machine learning and optimization operate on data streams. While these datasets are fundamental to fueling decision-making algorithms, they often contain sensitive information about individuals, and their usage poses significant privacy risks. Motivated by an application in energy systems, this paper presents OptStream, a novel algorithm for releasing differentially private data streams under the w-event model of privacy. OptStream is a four-step procedure consisting of sampling, perturbation, reconstruction, and post-processing modules. First, the sampling module selects a small set of points to access in each period of interest. Then, the perturbation module adds noise to the sampled data points to guarantee privacy. Next, the reconstruction module re-assembles non-sampled data points from the perturbed sample points. Finally, the post-processing module uses convex optimization over the privacy-preserving output of the previous modules, as well as the privacy-preserving answers of additional queries on the data stream, to improve accuracy by redistributing the added noise. OptStream is evaluated on a test case involving the release of a real data stream from the largest European transmission operator. Experimental results show that OptStream may not only improve the accuracy of state-of-the-art methods by at least one order of magnitude but also support accurate load forecasting on the privacy-preserving data.
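
A minimal sketch of the sample/perturb/reconstruct/post-process pipeline described above, assuming Laplace perturbation, fixed-rate sampling, linear-interpolation reconstruction, and a simple non-negativity projection standing in for OptStream's convex post-processing (all names and parameters here are illustrative, not the authors' implementation):

```python
import numpy as np

def release_window(stream_window, sample_every=4, epsilon=1.0, sensitivity=1.0):
    """Privately release one w-length window of a data stream (a sketch)."""
    w = len(stream_window)
    # 1. Sampling: access only a subset of points in the window.
    idx = np.arange(0, w, sample_every)
    # 2. Perturbation: Laplace noise calibrated to the privacy budget.
    noisy = stream_window[idx] + np.random.laplace(0.0, sensitivity / epsilon, size=len(idx))
    # 3. Reconstruction: rebuild non-sampled points by interpolation.
    released = np.interp(np.arange(w), idx, noisy)
    # 4. Post-processing: project onto known constraints (e.g., loads >= 0).
    #    OptStream instead solves a convex optimization over extra queries.
    return np.maximum(released, 0.0)

private = release_window(np.abs(np.random.randn(32)) * 100.0)
```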


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent outlier detection methods based on the density-based local outlier factor (LOF) algorithm do not account for variations in the data over time; for example, a new cluster of data points may appear in the stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means for estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
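
For context, a naive sliding-window LOF baseline is sketched below using scikit-learn's LocalOutlierFactor. It recomputes LOF from scratch for each arriving point, which is exactly the cost that TADILOF's incremental updates and approximate LOF are designed to avoid; it is not the authors' algorithm:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sliding_lof(stream, window=200, k=20):
    """Flag each arriving point as outlier/inlier by recomputing LOF
    over the most recent `window` points (naive baseline)."""
    flags, buf = [], []
    for x in stream:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)  # time-aware: the oldest point expires
        if len(buf) >= k + 1:
            labels = LocalOutlierFactor(n_neighbors=k).fit_predict(np.asarray(buf))
            flags.append(labels[-1] == -1)  # is the newest point an outlier?
        else:
            flags.append(False)  # too few points to score reliably
    return flags
```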


2020 ◽  
Vol 11 (2) ◽  
pp. 19-36
Author(s):  
Umesh Kokate ◽  
Arviand V. Deshpande ◽  
Parikshit N. Mahalle

Evolution of data in a data stream environment generates patterns at different time instances. Cluster formation changes over time as the behaviour and membership of clusters change. Data stream clustering (DSC) allows us to investigate these changes in group behaviour. Changes in the behaviour of group members over time lead to the formation of new clusters and may make old clusters extinct; extinct clusters may also recur later. The problem is to identify and record these change patterns in evolving data streams. The knowledge obtained from these change patterns is then used for trend analysis over evolving data streams. To address this flexible clustering requirement, a density-based clustering method is proposed to dynamically cluster evolving data streams. A decay factor identifies the formation of new clusters and the diminishing of older clusters as data points arrive, indicating trends in the evolving stream, as sketched below.
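
A minimal sketch of decay-factor fading for micro-clusters (the exponential fading scheme follows the common DenStream-style convention; class and parameter names are illustrative, not this paper's exact formulation):

```python
class MicroCluster:
    """Density micro-cluster whose weight decays over time (a sketch)."""

    def __init__(self, t, decay=0.01):
        self.weight = 1.0
        self.last_update = t
        self.decay = decay

    def fade(self, t):
        # Exponential decay: clusters that stop receiving points diminish.
        self.weight *= 2.0 ** (-self.decay * (t - self.last_update))
        self.last_update = t

    def absorb(self, t):
        # A new point arrives: fade first, then reinforce the cluster.
        self.fade(t)
        self.weight += 1.0

# A cluster that stops absorbing points fades below a pruning threshold
# (extinction); a newly dense region spawns a fresh MicroCluster.
```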


2020 ◽  
Vol 10 (4) ◽  
pp. 287-298
Author(s):  
Piotr Duda ◽  
Krzysztof Przybyszewski ◽  
Lipo Wang

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of occurring concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
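
A hedged sketch of feature-importance-based drift detection in this spirit, using scikit-learn's random forest importances (the divergence measure and threshold are illustrative stand-ins, not RFFI's exact criterion):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importance_drift(X_old, y_old, X_new, y_new, threshold=0.15):
    """Flag concept drift when per-feature importances of forests trained
    on consecutive data chunks diverge (illustrative sketch)."""
    rf_old = RandomForestClassifier(n_estimators=50).fit(X_old, y_old)
    rf_new = RandomForestClassifier(n_estimators=50).fit(X_new, y_new)
    # L1 distance between the two normalized importance vectors.
    shift = np.abs(rf_old.feature_importances_ - rf_new.feature_importances_).sum()
    return shift > threshold, shift
```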


Author(s):  
LAKSHMI PRANEETHA

Nowadays, data streams are enormous and fast-changing, and their applications range from basic scientific and analytical tasks to critical business and financial ones. Useful information is abstracted from the stream and represented in the form of micro-clusters in the online phase; in the offline phase, micro-clusters are merged to form macro-clusters. The DBSTREAM technique captures the density between micro-clusters by means of a shared density graph in the online phase. The density data in this graph are then used in re-clustering to improve the formation of clusters, but DBSTREAM takes more time in handling corrupted data points. In this paper, an early pruning algorithm is applied before pre-processing of the information, and a Bloom filter is used for recognizing corrupted information. Our experiments on real-time datasets show that this approach improves the efficiency of macro-clusters by 90% and generates more micro-clusters within a short time.
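
A minimal Bloom filter for early pruning of known-corrupted records might look as follows (the filter size, hash count, and record-key format are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for recognizing known-corrupted record IDs
    before pre-processing (a sketch)."""

    def __init__(self, m=10_000, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # No false negatives; a tunable rate of false positives.
        return all(self.bits[p] for p in self._positions(item))

corrupted = BloomFilter()
corrupted.add("sensor-17:2020-06-01T12:00")
# Early pruning: skip any record the filter flags before clustering it.
```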


2020 ◽  
Vol 501 (2) ◽  
pp. 1803-1822
Author(s):  
Seunghwan Lim ◽  
Douglas Scott ◽  
Arif Babul ◽  
David J Barnes ◽  
Scott T Kay ◽  
...  

As progenitors of the most massive objects, protoclusters are key to tracing the evolution and star formation history of the Universe, and are responsible for ≳20 per cent of the cosmic star formation at z > 2. Using a combination of state-of-the-art hydrodynamical simulations and empirical models, we show that current galaxy formation models do not produce enough star formation in protoclusters to match observations. We find that the star formation rates (SFRs) predicted from the models are an order of magnitude lower than what is seen in observations, despite the relatively good agreement found for their mass-accretion histories, specifically that they lie on an evolutionary path to become Coma-like clusters at z ≃ 0. Using a well-studied protocluster core at z = 4.3 as a test case, we find that the star formation efficiency of protocluster galaxies is higher than predicted by the models. We show that a large part of the discrepancy can be attributed to a dependence of SFR on the numerical resolution of the simulations, with a roughly factor-of-3 drop in SFR when the spatial resolution decreases by a factor of 4. We also present predictions up to z ≃ 7. Compared to lower redshifts, we find that centrals (the most massive member galaxies) are more distinct from the other galaxies, while protocluster galaxies are less distinct from field galaxies. All these results suggest that, as a rare and extreme population at high z, protoclusters can help constrain galaxy formation models tuned to match the average population at z ≃ 0.


METRON ◽  
2021 ◽  
Author(s):  
Giovanni Saraceno ◽  
Claudio Agostinelli ◽  
Luca Greco

A weighted likelihood technique for robust estimation of multivariate wrapped distributions of data points scattered on a p-dimensional torus is proposed. The occurrence of outliers in the sample at hand can badly compromise inference for standard techniques such as the maximum likelihood method. Therefore, there is a need to handle such model inadequacies in the fitting process by a robust technique and an effective downweighting of observations not following the assumed model. Furthermore, employing a robust method can help in situations of hidden and unexpected substructures in the data. Here, it is suggested to build a set of data-dependent weights based on the Pearson residuals and to solve the corresponding weighted likelihood estimating equations. In particular, robust estimation is carried out by using a Classification EM algorithm whose M-step is enhanced by the computation of weights based on current parameter values. The finite-sample behavior of the proposed method has been investigated by a Monte Carlo numerical study and real data examples.
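
The weighting scheme can be sketched in the standard weighted-likelihood form of Markatou et al., on which such methods are typically based (the exact density smoothing and residual adjustment function used in the paper may differ):

```latex
% Pearson residual of x_i: smoothed data density vs smoothed model density
\delta_i = \frac{\hat{f}^{*}(x_i)}{m^{*}(x_i;\theta)} - 1
% Data-dependent weight via a residual adjustment function A(\cdot)
w_i = \min\!\left\{ 1,\; \frac{\left[ A(\delta_i) + 1 \right]^{+}}{\delta_i + 1} \right\}
% Weighted likelihood estimating equations, with u the score function
\sum_{i=1}^{n} w_i \, u(x_i;\theta) = 0
```

Observations consistent with the model have δ close to 0 and weight close to 1; gross outliers receive weights near 0 and are effectively removed from the fit.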


Author(s):  
STEFANO MERLER ◽  
BRUNO CAPRILE ◽  
CESARE FURLANELLO

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy in order to avoid overfitting in classification tasks on noisy data. The method is based on a notion of easy and hard training patterns as emerging from an analysis of the dynamical evolution of AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.
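
A hedged sketch of hardness-based pruning in this spirit: hardness is proxied here by misclassification frequency across boosting rounds rather than the paper's weight-trajectory measure, and the cutoff is a fixed fraction rather than the automatically selected threshold:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def prune_hard_examples(X, y, drop_frac=0.1, rounds=50):
    """Drop the hardest training patterns before refitting (a sketch).
    X, y are numpy arrays; hardness = how often a point is misclassified
    by the partial ensembles across boosting rounds."""
    ada = AdaBoostClassifier(n_estimators=rounds).fit(X, y)
    errors = np.zeros(len(y))
    for staged in ada.staged_predict(X):  # ensemble predictions per round
        errors += (staged != y)
    # Keep the easiest (1 - drop_frac) fraction of the training set.
    keep = np.argsort(errors)[: len(y) - int(drop_frac * len(y))]
    return X[keep], y[keep]
```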


2021 ◽  
Vol 11 (12) ◽  
pp. 3164-3173
Author(s):  
R. Indhumathi ◽  
S. Sathiya Devi

Data sharing is essential in present-day biomedical research, where large quantities of medical information are gathered and used for different objectives of analysis and study. Because such collections are large, anonymity is essential: it is important to preserve privacy and prevent leakage of patients' sensitive information. Most anonymization methods, such as generalisation, suppression and perturbation, are proposed to overcome information leakage, but they degrade the utility of the collected data; during data sanitization, utility is automatically diminished. Privacy-preserving data publishing thus faces the main drawback of maintaining a trade-off between privacy and data utility. To address this issue, an efficient algorithm called Anonymization based on Improved Bucketization (AIB) is proposed, which increases the utility of published data while maintaining privacy. The bucketization technique is used in this paper with the intervention of a clustering method. The proposed work is divided into four stages: (i) vertical and horizontal partitioning; (ii) assigning a sensitivity index to attributes in the cluster; (iii) verifying each cluster against a privacy threshold; (iv) examining for privacy breaches in the quasi-identifiers (QI). To increase the utility of published data, the threshold value is determined based on the distribution of elements in each attribute, and the anonymization method is applied only to the specific QI elements. As a result, data utility is improved. Finally, the evaluation results validated the design and demonstrated that it is effective in improving data utility.
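
A minimal sketch of bucketization-style publishing in this spirit (column names, the grouping rule, and the distinct-values threshold are illustrative assumptions, not the AIB algorithm's exact procedure):

```python
def bucketize(records, qi_cols, sensitive_col, min_distinct=3):
    """Vertically partition quasi-identifiers from the sensitive attribute,
    then group rows into buckets that each contain at least `min_distinct`
    distinct sensitive values (an l-diversity-style threshold). A sketch:
    `records` is a list of dicts."""
    buckets, current, seen = [], [], set()
    for rec in records:
        current.append(rec)
        seen.add(rec[sensitive_col])
        if len(seen) >= min_distinct:
            buckets.append(current)
            current, seen = [], set()
    if current:  # fold any leftover rows into the last complete bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    # Publish two tables linked only by bucket ID, breaking the direct
    # row-level link between quasi-identifiers and sensitive values.
    qi_table = [({c: r[c] for c in qi_cols}, bid)
                for bid, b in enumerate(buckets) for r in b]
    sens_table = [(r[sensitive_col], bid)
                  for bid, b in enumerate(buckets) for r in b]
    return qi_table, sens_table
```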

