Privacy of study participants in open-access Health and Demographic Surveillance System data: A requirements analysis for data anonymisation (Preprint)

Mapping Intimacies ◽

10.2196/preprints.34472 ◽

2021 ◽

Author(s):

Matthias Templ ◽

Chifundo Kanjala ◽

Inken Siems

Keyword(s):

Open Access ◽

Surveillance System ◽

Demographic Surveillance System ◽

Event History ◽

Time Varying ◽

Data Utility ◽

History Data ◽

Disclosure Risk ◽

Event History Data ◽

Event Order

BACKGROUND Sharing and anonymising data have become hot topics for individuals, organisations, and countries around the world. Open-access sharing of anonymised data containing sensitive information about individuals makes the most sense when the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions or limitations.

OBJECTIVE The goal of this paper is to highlight solutions and requirements for sharing longitudinal health and surveillance event history data in the form of open-access data. The challenges lie in the anonymisation of multiple event dates and of time-varying variables. A sequential approach that adds noise to the event dates is proposed. This approach maintains the event order and preserves the average time between events. Additionally, a nosy-neighbor distance-based matching approach to estimate the risk is proposed. For key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of a person (e.g. on education), and the other on achieving k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga Health and Demographic Surveillance System (HDSS) core dataset, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 event records with time-varying socio-economic variables and demographic information on individuals. The proposed anonymisation strategy lowers the risk of disclosure to acceptable levels, thus allowing sharing of the data.

METHODS Statistical disclosure control, k-anonymity, adding noise, disclosure risk measurement, event history data anonymization, longitudinal data anonymization, data utility by visual comparisons.

RESULTS An anonymized version of event history data, including longitudinal information on individuals over time, with high data utility.

CONCLUSIONS The proposed anonymisation of study participants in event history data including static and time-varying status variables, specifically applied to longitudinal health and demographic surveillance system data, led to an anonymized data set with very low disclosure risk and high data utility, ready to be shared with the public in the form of an open-access data set. Different levels of noise for event history dates were evaluated for disclosure risk and data utility. High utility was achieved even with the highest level of noise. Details matter to ensure consistency and credibility. Most importantly, the sequential noise approach presented in this paper maintains the event order. It has been shown that not only is the event order preserved, but the time between events is also well maintained in comparison to the original data. We also proposed an anonymization strategy to handle information on a person's time-varying educational and occupational status, year of death, year of birth, and number of events. We proposed an approach that preserves the data utility well but limits the number of educational and occupational levels of a person. Using distance-based neighborhood matching, we simulated an attack under a nosy-neighbor scenario and under a worst-case scenario in which the attacker has full information on the original data. The disclosure risk was shown to be very low even when the attacker's database and information are assumed to be optimal. The HDSS and medical science research communities in LMIC settings will be the primary beneficiaries of the results and methods presented in this article, but the results will be useful for anyone working on anonymising longitudinal datasets, possibly also including time-varying information and event history data, for purposes of sharing. In other words, the proposed approaches can be applied to almost any event history data and, additionally, to event history data including static and/or status variables whose entries change over time.
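
The abstract above does not spell out how the sequential noise is drawn, so the following is only a minimal sketch of one way such a scheme could work, not the authors' method. It assumes that the first event date is shifted by a bounded random offset and that each later date is rebuilt from the previous noisy date plus a perturbed gap, with the noise bounded so that a positive gap stays positive. The function name `perturb_event_dates` and the parameter `max_shift_days` are hypothetical.

```python
import random
from datetime import date, timedelta


def perturb_event_dates(dates, max_shift_days=30, rng=None):
    """Return noisy copies of chronologically sorted dates, keeping their order.

    Hypothetical illustration: the noise model (uniform, symmetric, bounded)
    is an assumption and is not taken from the paper.
    """
    rng = rng or random.Random()
    if not dates:
        return []
    # Shift the first event by a symmetric random offset.
    first_shift = rng.randint(-max_shift_days, max_shift_days)
    noisy = [dates[0] + timedelta(days=first_shift)]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        if gap < 0:
            raise ValueError("dates must be sorted in chronological order")
        if gap == 0:
            # Events recorded on the same day stay on the same day.
            new_gap = 0
        else:
            # Symmetric noise bounded by gap - 1, so the new gap is at least 1
            # day and the expected gap equals the original gap.
            bound = min(max_shift_days, gap - 1)
            new_gap = gap + rng.randint(-bound, bound)
        # Build each date from the previous noisy date to keep the sequence.
        noisy.append(noisy[-1] + timedelta(days=new_gap))
    return noisy
```

Because the noise on each gap is symmetric around zero, the average time between events is preserved in expectation, and because each gap stays at least one day, the event order cannot change. A real application would also need to respect fixed boundaries such as the start and end of the surveillance period, which this sketch ignores.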

A training manual for event history data management using Health and Demographic Surveillance System data

BMC Research Notes ◽

10.1186/s13104-017-2541-9 ◽

2017 ◽

Vol 10 (1) ◽

Cited By ~ 2

Author(s):

Philippe Bocquier ◽

Carren Ginsburg ◽

Kobus Herbst ◽

Osman Sankoh ◽

Mark A. Collinson

Keyword(s):

Data Management ◽

Surveillance System ◽

Demographic Surveillance System ◽

Event History ◽

Training Manual ◽

History Data ◽

Event History Data ◽

System Data

Methods for Semiparametric Regression Analysis of Multivariate Correlated Event-History Data

Operations Research ’91 ◽

10.1007/978-3-642-48417-9_83 ◽

1992 ◽

pp. 300-304 ◽

Cited By ~ 1

Author(s):

Leo Brecht

Keyword(s):

Regression Analysis ◽

Semiparametric Regression ◽

Event History ◽

History Data ◽

Event History Data

Event history data structures

Event History Analysis with Stata ◽

10.4324/9780429260407-2 ◽

2019 ◽

pp. 41-62

Author(s):

Hans-Peter Blossfeld ◽

Götz Rohwer ◽

Thorsten Schneider ◽

Brendan Halpin

Keyword(s):

Data Structures ◽

Event History ◽

History Data ◽

Event History Data

Analysis of Event History Data

A Practical Guide to Using Panel Data ◽

10.4135/9781473910485.n13 ◽

2017 ◽

pp. 243-265

Keyword(s):

Event History ◽

History Data ◽

Event History Data

Bayesian Smoothing and Regression for Longitudinal, Spatial and Event History Data ◽

10.1093/acprof:oso/9780199533022.003.0006 ◽

2011 ◽

pp. 415-494

Author(s):

Ludwig Fahrmeir ◽

Thomas Kneib

Keyword(s):

Event History ◽

History Data ◽

Event History Data

Escaping welfare? Social assistance dynamics in Sweden

Journal of European Social Policy ◽

10.1177/0958928711418855 ◽

2011 ◽

Vol 21 (5) ◽

pp. 486-500 ◽

Cited By ~ 30

Author(s):

Olof Bäckman ◽

Åke Bergmark

Keyword(s):

Mixture Model ◽

Unobserved Heterogeneity ◽

Social Assistance ◽

Temporal Patterns ◽

Previous Experience ◽

Event History ◽

Duration Dependence ◽

Swedish Population ◽

History Data ◽

Event History Data

The article analyses temporal patterns in social assistance receipt in Sweden in the 2000s by looking at which circumstances facilitate versus reduce the possibilities of a person ceasing to be a recipient of social assistance. The analysis is guided by the following questions: What conditions lead people to terminate periods of social assistance receipt? Which factors are central to exits with different subsequent income patterns? How do these explain the different situations of recipients prior to termination? We focus particularly on income maintenance prior to spells of social assistance. We use event history data on monthly social assistance take-up covering the total adult Swedish population for the years 2002–2004. We adopt a gamma mixture model to control for unobserved heterogeneity. The results suggest that previous experience of both employment and social assistance receipt are important determinants for all types of exits from social assistance recipiency. A negative duration dependence is found also when unobserved heterogeneity is controlled for.

A general approach to the machine handling of event history data

Social Science Information ◽

10.1177/053901885024001008 ◽

1985 ◽

Vol 24 (1) ◽

pp. 161-188 ◽

Cited By ~ 3

Author(s):

Máire Ní Bhrolcháin ◽

Ian M. Timaeus

Keyword(s):

Event History ◽

History Data ◽

Event History Data

A mixture of beta–Dirichlet processes prior for Bayesian analysis of event history data

Journal of the Korean Statistical Society ◽

10.1016/j.jkss.2012.11.001 ◽

2013 ◽

Vol 42 (3) ◽

pp. 313-321 ◽

Cited By ~ 1

Author(s):

Minwoo Chae ◽

Rafael Weißbach ◽

Kwang Hyun Cho ◽

Yongdai Kim

Keyword(s):

Bayesian Analysis ◽

Event History ◽

Dirichlet Processes ◽

History Data ◽

Event History Data

A segmented regression model for event history data: an application to the fertility patterns in Italy

Journal of Applied Statistics ◽

10.1080/02664760802552994 ◽

2009 ◽

Vol 36 (9) ◽

pp. 973-988 ◽

Cited By ~ 2

Author(s):

Vito M.R. Muggeo ◽

Massimo Attanasio ◽

Mariano Porcu

Keyword(s):

Regression Model ◽

Event History ◽

Segmented Regression ◽

History Data ◽

Event History Data ◽

Segmented Regression Model

Semiparametric analysis of complex longitudinal data

10.32469/10355/78162 ◽

2020 ◽

Author(s):

Dayu Sun

Keyword(s):

Variable Selection ◽

Longitudinal Data ◽

Recurrent Event ◽

Event History ◽

Panel Count Data ◽

Recurrent Event Data ◽

Response Variable ◽

History Data ◽

Event History Data ◽

Observation Process

Event history data consist of the longitudinal records of event occurrence times. Recurrent event data and panel count data are two common types of event history data that occur in many areas, such as medical studies and social sciences. A great deal of literature has been established for their analyses. Nevertheless, only limited research exists on variable selection for recurrent event data and panel count data. The existing methods can be seen as direct generalizations of the available penalized procedures for linear models, but may not perform as well as expected due to the complex structure of event history data. The first and second parts of this dissertation discuss simultaneous parameter estimation and variable selection for event history data. We present a new variable selection method with a new penalty function, which will be referred to as the broken adaptive ridge regression approach. In addition to establishing the oracle property, we also show that the proposed variable selection method has a clustering or grouping effect when covariates are highly correlated. Furthermore, numerical studies are performed and indicate that the method works well in practical situations and can outperform the existing methods. Applications to real data are provided. Most of the existing studies of longitudinal data assume that covariates can be observed at the same observation times as the response variable, and that the observation process is independent of the response variable, either completely or given covariates. In practice, the response variables and covariates are sometimes observed intermittently at different time points, leading to sparse asynchronous longitudinal data. The observation process may also be related to the response variable even given covariates, and sometimes both issues can occur at the same time.
Although each of the two issues has been addressed in the literature, there does not seem to be an established approach that can deal with both together. To address both issues simultaneously, the third part of this dissertation proposes a flexible semiparametric transformation conditional model and a kernel-weighted estimating equation based approach. The proposed estimators of regression parameters are shown to be consistent and asymptotically normally distributed. To assess the finite sample performance of the proposed method, an extensive simulation study is carried out and suggests that it performs well in practical situations. The approach is applied to a prospective HIV study that motivated this investigation.