Privacy-Preserving for Distributed Data Streams: Towards l-Diversity

Author(s):  
Mona Mohamed ◽  
Sahar Ghanem ◽  
Magdy Nagi

Privacy-preserving data publishing has been studied widely on static data. However, many recent applications generate data streams that are real-time, unbounded, rapidly changing, and distributed in nature. Recently, a few works have addressed k-anonymity and l-diversity for data streams. Their model implies that if the stream is distributed, it is collected at a central site for anonymization. In this paper, we propose a novel distributed model in which distributed streams are first anonymized by distributed (collecting) sites before merging and releasing. Our approach extends Continuously Anonymizing STreaming data via adaptive cLustEring (CASTLE) [4], a cluster-based approach that provides both k-anonymity and l-diversity for centralized data streams. The main idea is for each site to construct its local clustering model and exchange this local view with other sites, so that all sites globally construct approximately the same clustering view. The approach is heuristic in the sense that not every update to the local view is sent; instead, triggering events are selected for exchanging cluster information. Extensive experiments on a real data set are performed to study the introduced Information Loss (IL) under different settings. First, the impact of the different parameters on IL is quantified. Then k-anonymity and l-diversity are compared in terms of messaging cost and IL. Finally, the effectiveness of the proposed distributed model is studied by comparing the introduced IL to the IL of the centralized model (as a lower bound) and to that of a distributed model with no communication (as an upper bound). The experimental results show that the main contributing factor to IL is the number of attributes in the quasi-identifier (50%-75%), while the number of sites contributes about 1%, which indicates the scalability of the proposed approach. In addition, providing l-diversity is shown to introduce about a 25% increase in IL compared to k-anonymity. Moreover, a 35% reduction in IL is achieved at a messaging cost (in bytes) of about 0.3% of the data set size.
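The two privacy conditions named in this abstract can be illustrated with a minimal Python sketch. It is not CASTLE and not the authors' distributed algorithm; it only checks whether an already-generalized batch of records satisfies k-anonymity and the simple "distinct values" form of l-diversity. The record layout and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Assumed layout: each record is (generalized quasi-identifier tuple, sensitive value).
Record = Tuple[Tuple[Hashable, ...], Hashable]


def group_by_quasi_identifier(records: Iterable[Record]) -> Dict[Tuple[Hashable, ...], List[Hashable]]:
    """Collect the sensitive values of all records sharing a quasi-identifier."""
    groups: Dict[Tuple[Hashable, ...], List[Hashable]] = {}
    for quasi_identifier, sensitive in records:
        groups.setdefault(quasi_identifier, []).append(sensitive)
    return groups


def is_k_anonymous(records: Iterable[Record], k: int) -> bool:
    """True if every quasi-identifier group contains at least k records."""
    return all(len(values) >= k for values in group_by_quasi_identifier(records).values())


def is_l_diverse(records: Iterable[Record], l: int) -> bool:
    """True if every group holds at least l distinct sensitive values (distinct l-diversity)."""
    return all(len(set(values)) >= l for values in group_by_quasi_identifier(records).values())


# Hypothetical generalized records: (age range, ZIP prefix) -> diagnosis.
records = [
    (("20-29", "021**"), "flu"),
    (("20-29", "021**"), "flu"),
    (("30-39", "022**"), "flu"),
    (("30-39", "022**"), "asthma"),
]
print(is_k_anonymous(records, 2))  # both groups have two records
print(is_l_diverse(records, 2))    # the first group has only one distinct diagnosis
```

The example batch shows why l-diversity is the stricter condition: a group can be large enough for k-anonymity while still revealing the sensitive value because all of its members share it.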

2021 ◽  
Vol 10 (2) ◽  
pp. 78
Author(s):  
Songyuan Li ◽  
Hui Tian ◽  
Hong Shen ◽  
Yingpeng Sang

Publication of trajectory data that contain rich information about vehicles in the dimensions of time and space (location) enables online monitoring and supervision of vehicles in motion and offline traffic analysis for various management tasks. However, it also opens security holes for privacy breaches, as exposing individuals' private information to the public may result in attacks threatening their safety. Therefore, increased attention has recently been paid to privacy protection in trajectory data publishing. However, existing methods, such as generalization via anonymization and suppression via randomization, achieve protection by modifying the original trajectory to form a publishable trajectory, which results in significant data distortion and hence low data utility. In this work, we propose a trajectory privacy-preserving method called dynamic anonymization with bounded distortion. In our method, individual trajectories in the original trajectory set are mixed in a localized manner to form a synthetic trajectory data set with bounded distortion for publishing, which protects the privacy of location information associated with individuals in the trajectory data set and ensures a guaranteed utility of the published data both individually and collectively. Through experiments conducted on real trajectory data of Guangzhou City Taxi statistics, we evaluate the performance of our proposed method and compare it with the existing mainstream methods in terms of privacy preservation against attacks and trajectory data utilization. The results show that our proposed method achieves better performance on data utilization than the existing methods using globally static anonymization, without trading off data security against attacks.


Real-time data is generated everywhere, every day. Most of it comes from IoT sensors, GPS positions, web transactions, and social media updates. Real-time data is typically generated in a continuous fashion; such data are called data streams. Data streams are transient, and there is very little time to process each item in the stream. Performing analytics on rapidly flowing, high-velocity data is a great challenge. Another issue is the percentage of incoming data that is considered for analytics: the higher the percentage, the greater the accuracy. Considering these two issues, the proposed work aims to find a better solution by gaining insight from real-time streaming data with minimum response time and greater accuracy. This paper combines two technologies, TensorFlow and Apache Kafka. Kafka is used to handle the real-time streaming data, while TensorFlow supports analytics with deep learning algorithms. Training and testing are done on RideAustin, the Uber connected vehicle public data set. The experimental results on RideAustin show the predicted failure under each type of vehicle parameter. The comparative analysis showed a 16% improvement over the traditional machine learning algorithm.


Author(s):  
Joshua Plasse ◽  
Henrique Hoeltgebaum ◽  
Niall M. Adams

Abstract. Sequentially detecting multiple changepoints in a data stream is a challenging task. Difficulties relate to both computational and statistical aspects, and in the latter, specifying control parameters is a particular problem. Choosing control parameters typically relies on unrealistic assumptions, such as the distributions generating the data, and their parameters, being known. This is implausible in the streaming paradigm, where several changepoints will exist. Further, current literature is mostly concerned with streams of continuous-valued observations, and focuses on detecting a single changepoint. There is a dearth of literature dedicated to detecting multiple changepoints in transition matrices, which arise from a sequence of discrete states. This paper makes the following contributions: a complete framework is developed for adaptively and sequentially estimating a Markov transition matrix in the streaming data setting. A change detection method is then developed, using a novel moment matching technique, which can effectively monitor for multiple changepoints in a transition matrix. This adaptive detection and estimation procedure for transition matrices, referred to as ADEPT-M, is compared to several change detectors on synthetic data streams, and is implemented on two real-world data streams: one consisting of over nine million HTTP web requests, and the other being a well-studied electricity market data set.
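The idea of estimating a transition matrix adaptively from a stream of discrete states can be sketched as follows. This is not ADEPT-M, whose estimation and moment-matching details are not given in the abstract; it assumes a single fixed exponential forgetting factor applied to transition counts, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def estimate_transition_matrix(
    states: Sequence[int], n_states: int, forgetting: float = 0.99
) -> List[List[float]]:
    """Estimate a Markov transition matrix from a state sequence.

    Old transition counts are multiplied by `forgetting` at every step, so
    recent transitions dominate. With forgetting=1.0 this reduces to the
    ordinary count-based (maximum likelihood) estimate.
    """
    counts = [[0.0] * n_states for _ in range(n_states)]
    for prev, curr in zip(states, states[1:]):
        # Discount everything observed so far, then record the new transition.
        for row in counts:
            for j in range(n_states):
                row[j] *= forgetting
        counts[prev][curr] += 1.0

    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # Assumption: a state that was never left gets a uniform row.
            matrix.append([1.0 / n_states] * n_states)
    return matrix


# Synthetic stream with a change: state 0 first always stays at 0,
# then always moves to 1.
stream = [0] * 101 + [1, 0] * 50
static = estimate_transition_matrix(stream, 2, forgetting=1.0)
adaptive = estimate_transition_matrix(stream, 2, forgetting=0.9)
print(static[0][1], adaptive[0][1])
```

On this synthetic stream the estimate without forgetting averages over both regimes, whereas the discounted estimate tracks the behaviour after the change. A fixed forgetting factor is itself a control parameter of the kind the abstract describes as hard to choose.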


Energies ◽  
2020 ◽  
Vol 13 (4) ◽  
pp. 924 ◽  
Author(s):  
Krzysztof Gajowniczek ◽  
Marcin Bator ◽  
Tomasz Ząbkowski ◽  
Arkadiusz Orłowski ◽  
Chu Kiong Loo

Currently, thanks to the rapid development of wireless sensor networks and network traffic monitoring, the data stream is gradually becoming one of the most popular data generating processes. The data stream is different from traditional static data. Cluster analysis is an important technology for data mining, which is why many researchers pay attention to grouping streaming data. The literature contains many data stream clustering techniques; unfortunately, very few of them try to solve the problem of clustering data streams coming from multiple sources. In this article, we present an algorithm with a tree structure for grouping data streams (in the form of time series) that have similar properties and behaviors. We have evaluated our algorithm on real multivariate data streams generated by smart meter sensors, namely the Irish Commission for Energy Regulation data set. Several measures were used to analyze the various characteristics of the tree-like clustering structure (a computer science perspective), along with measures that are important from a business standpoint. The proposed method was able to cluster the flows of data and identified the customers with similar behavior during the analyzed period.


Author(s):  
Alfredo Cuzzocrea ◽  
Filippo Furfaro ◽  
Elio Masciari ◽  
Domenico Saccà

Sensor networks represent a leading case of data stream sources coming from real-life application scenarios. Sensors are non-reactive elements which are used to monitor real-life phenomena, such as live weather conditions, network traffic, etc. They are usually organized into networks where their readings are transmitted using low-level protocols. A relevant problem in dealing with data streams is that they are intrinsically multi-level and multidimensional in nature, so that they need to be analyzed by means of a multi-level and multi-resolution (analysis) model, like OLAP, beyond traditional solutions provided by primitive SQL-based DBMS interfaces. Despite this, a significant issue in dealing with OLAP is the so-called curse of dimensionality problem: when the number of dimensions of the target data cube increases, multidimensional data cannot be accessed and queried efficiently, due to their enormous size. Starting from this practical evidence, several data cube compression techniques have been proposed in recent years, with mixed success. Briefly, the main idea of these techniques is to compute compressed representations of input data cubes in order to evaluate time-consuming OLAP queries against them, thus obtaining approximate answers. As with static data, approximate query answering techniques can be applied to streaming data in order to improve OLAP analysis of such data. Unfortunately, the data cube compression computational paradigm gets worse when OLAP aggregations are computed on top of a continuously flooding multidimensional data stream. 
In order to efficiently deal with the curse of dimensionality problem and achieve high efficiency in processing and querying multidimensional data streams, thus efficiently supporting OLAP analysis of such kind of data, in this chapter we propose novel compression techniques over data stream readings that are materialized for OLAP purposes. This allows us to tame the unbounded nature of streaming data, thus dealing with bounded memory issues exposed by conventional DBMS tools. Overall, in this chapter we introduce an innovative, complex technique for efficiently supporting OLAP analysis of multidimensional data streams.


Author(s):  
Alexandre Evfimievski ◽  
Tyrone Grandison

Privacy-preserving data mining (PPDM) refers to the area of data mining that seeks to safeguard sensitive information from unsolicited or unsanctioned disclosure. Most traditional data mining techniques analyze and model the data set statistically, in aggregated form, while privacy preservation is primarily concerned with protecting against disclosure of individual data records. This domain separation points to the technical feasibility of PPDM. Historically, issues related to PPDM were first studied by the national statistical agencies interested in collecting private social and economic data, such as census and tax records, and making it available for analysis by public servants, companies, and researchers. Building accurate socioeconomic models is vital for business planning and public policy. Yet, there is no way of knowing in advance what models may be needed, nor is it feasible for the statistical agency to perform all data processing for everyone, playing the role of a trusted third party. Instead, the agency provides the data in a sanitized form that allows statistical processing and protects the privacy of individual records, solving a problem known as privacy-preserving data publishing. For a survey of work in statistical databases, see Adam and Wortmann (1989) and Willenborg and de Waal (2001).


Crisis ◽  
2018 ◽  
Vol 39 (1) ◽  
pp. 27-36 ◽  
Author(s):  
Kuan-Ying Lee ◽  
Chung-Yi Li ◽  
Kun-Chia Chang ◽  
Tsung-Hsueh Lu ◽  
Ying-Yeh Chen

Abstract. Background: We investigated the age at exposure to parental suicide and the risk of subsequent suicide completion in young people. The impact of parental and offspring sex was also examined. Method: Using a cohort study design, we linked Taiwan's Birth Registry (1978–1997) with Taiwan's Death Registry (1985–2009) and identified 40,249 children who had experienced maternal suicide (n = 14,431), paternal suicide (n = 26,887), or the suicide of both parents (n = 281). Each exposed child was matched to 10 children of the same sex and birth year whose parents were still alive. This yielded a total of 398,081 children for our non-exposed cohort. A Cox proportional hazards model was used to compare the suicide risk of the exposed and non-exposed groups. Results: Compared with the non-exposed group, offspring who were exposed to parental suicide were 3.91 times (95% confidence interval [CI] = 3.10–4.92) more likely to die by suicide after adjusting for baseline characteristics. The risk of suicide seemed to be lower in older male offspring (HR = 3.94, 95% CI = 2.57–6.06), but higher in older female offspring (HR = 5.30, 95% CI = 3.05–9.22). Stratified analyses based on parental sex revealed similar patterns as the combined analysis. Limitations: As only register-based data were used, we were not able to explore the impact of variables not contained in the data set, such as the role of mental illness. Conclusion: Our findings suggest a prominent elevation in the risk of suicide among offspring who lost their parents to suicide. The risk elevation differed according to the sex of the afflicted offspring as well as to their age at exposure.


2013 ◽  
Vol 99 (4) ◽  
pp. 40-45 ◽  
Author(s):  
Aaron Young ◽  
Philip Davignon ◽  
Margaret B. Hansen ◽  
Mark A. Eggen

ABSTRACT Recent media coverage has focused on the supply of physicians in the United States, especially with the impact of a growing physician shortage and the Affordable Care Act. State medical boards and other entities maintain data on physician licensure and discipline, as well as some biographical data describing their physician populations. However, there are gaps in the workforce information available from these sources. The Federation of State Medical Boards' (FSMB) Census of Licensed Physicians and the AMA Masterfile, for example, offer valuable information, but they provide a limited picture of the physician workforce. Furthermore, they are unable to shed light on some of the nuances in physician availability, such as how much time physicians spend providing direct patient care. In response to these gaps, policymakers and regulators have in recent years discussed the creation of a physician minimum data set (MDS), which would be gathered periodically and would provide key physician workforce information. While proponents of an MDS believe it would provide benefits to a variety of stakeholders, no effort has been made to determine whether state medical boards think it is important to collect physician workforce data and whether they currently collect workforce information from licensed physicians. To learn more, the FSMB sent surveys to the executive directors of state medical boards to determine their perceptions of collecting workforce data and their current practices regarding the collection of such data. The purpose of this article is to convey results from this effort. Survey findings indicate that the vast majority of boards view physician workforce information as valuable in the determination of health care needs within their state, and that various boards are already collecting some data elements. Analysis of the data confirms the potential benefits of a physician minimum data set (MDS) and why state medical boards are in a unique position to collect MDS information from physicians.


2019 ◽  
Vol 11 (1) ◽  
pp. 156-173
Author(s):  
Spenser Robinson ◽  
A.J. Singh

This paper shows that Leadership in Energy and Environmental Design (LEED) certified hospitality properties exhibit increased expenses and earn lower net operating income (NOI) than non-certified buildings. ENERGY STAR certified properties demonstrate lower overall expenses than non-certified buildings, with statistically neutral NOI effects. Using a custom sample of all green buildings and their competitive data set as of 2013, provided by Smith Travel Research (STR), the paper documents potential reasons for this result, including increased operational expenses, potential confusion between certified and registered LEED projects in the data, and qualitative input. The qualitative input comes from a small sample survey of five industry professionals. The paper provides one of the only analyses of operating efficiencies in LEED and ENERGY STAR hospitality properties.


2019 ◽  
Vol 33 (3) ◽  
pp. 187-202
Author(s):  
Ahmed Rachid El-Khattabi ◽  
T. William Lester

Tax increment financing (TIF) remains a popular, yet highly controversial, tool among policy makers in their efforts to promote economic development. This study conducts a comprehensive assessment of the effectiveness of Missouri’s TIF program, specifically in Kansas City and St. Louis, in creating economic opportunities. We build a time-series data set spanning 1990 through 2012 of detailed employment levels, establishment counts, and sales at the census block-group level to run a set of difference-in-differences with matching estimates for the impact of TIF at the local level. Although we analyze the impact of TIF on a wide set of indicators and across various industry sectors, we find no conclusive evidence that the TIF program in either city has a causal impact on key economic development indicators.
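The difference-in-differences logic mentioned in this abstract can be illustrated with a minimal Python sketch of the basic two-group, two-period estimator. It omits the matching step and the block-group panel structure the study describes, the function name is hypothetical, and the numbers are invented for illustration only; they are not results from the paper.

```python
from statistics import mean
from typing import Sequence


def did_estimate(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Basic 2x2 difference-in-differences estimate.

    The change in the control group is subtracted from the change in the
    treated group, so a trend shared by both groups cancels out.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented employment counts for two treated and two control areas.
effect = did_estimate(
    treated_pre=[100, 110],
    treated_post=[120, 130],
    control_pre=[90, 100],
    control_post=[105, 115],
)
print(effect)  # treated change of 20 minus control change of 15
```

The estimate is only interpretable as a causal effect under the parallel-trends assumption, that is, if treated and control areas would have followed the same trend in the absence of treatment.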

