An effective privacy enhanced interface to support record linkage decisions

Author(s):  
Hye-Chung Kum ◽  
Gurudev Ilangovan ◽  
Qinbo Li ◽  
Yumei Li ◽  
Eric Ragan

Introduction Privacy-enhanced technologies (PET) are those that measure and protect privacy by preventing unnecessary use of personal data without loss of functionality of the information system. In practice, implementing such a system requires fine-grained access control so that access can be granted to smaller chunks of data.
Objectives and Approach In record linkage, PET has to date mostly meant the separation of identifiers from sensitive information, so that only the necessary part is accessible. Moving beyond this current norm, we have designed a privacy-enhanced interface to support linkage that discloses only the needed information at the sub-variable level, only when needed to make good decisions, to reduce exposure of personally identifiable information (PII). The system allows access to PII at (1) the cell level (e.g., only the names of the people in question are released) or (2) the sub-cell level (e.g., only part of a name, such as a suffix or individual characters, is released).
Results In a user study (N=104) in which participants linked records in complicated situations (e.g., twins, Sr/Jr, change of last name) using the interface, we found that users given fully masked data (0% of information disclosed) were still able to reach 75% accuracy using supplemental visual markup. The markup depicts data discrepancies such as swapped first and last names, transposed characters, differing characters, and missing data. More importantly, with this effective interface there was no statistically significant difference in linkage accuracy (84%) or time taken between users with access to all data and those with access to only 30% of the data. We have released a tutorial in which users can experience the balance between information disclosure and accuracy of results on sample data.
Conclusion/Implications Privacy is a major public concern when PII is legitimately accessed to link data. Our study demonstrates that a well-designed privacy-enhanced interface can significantly reduce the exposure of PII to the people resolving ambiguous linkages, without compromising linkage quality. This research points to a new direction for PET in record linkage beyond encryption.
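A minimal sketch of the sub-cell disclosure idea (the names and masking rule are hypothetical illustrations, not the authors' implementation): reveal only the character positions where two candidate records disagree, keeping the rest masked.

```python
# Hypothetical illustration of sub-cell disclosure: mask every character of a
# name pair except positions where the two candidate records disagree, so a
# reviewer sees the discrepancy without seeing the full PII.
def mask_pair(a: str, b: str, mask: str = "*") -> tuple[str, str]:
    width = max(len(a), len(b))
    a, b = a.ljust(width), b.ljust(width)
    out_a, out_b = [], []
    for ca, cb in zip(a, b):
        if ca == cb:                 # identical position: keep masked
            out_a.append(mask)
            out_b.append(mask)
        else:                        # discrepancy: disclose both characters
            out_a.append(ca)
            out_b.append(cb)
    return "".join(out_a), "".join(out_b)

print(mask_pair("JOHNSON SR", "JOHNSON JR"))  # ('********S*', '********J*')
```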

2021 ◽  
Vol 2021 (2) ◽  
pp. 88-110
Author(s):  
Duc Bui ◽  
Kang G. Shin ◽  
Jong-Min Choi ◽  
Junbum Shin

Abstract Privacy policies are documents, required by law and regulation, that notify users of the collection, use, and sharing of their personal information by services or applications. Extracting personal data objects and the practices applied to them is one of the fundamental steps in the automated analysis of these policies, but it remains challenging because policy statements are written in complex, often vague, legal language. Prior work is limited by small or synthetic datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices with a neural model and outperforms strong rule-based baselines by a large margin. We conduct a user study on the effects of data-practice annotation, which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental results show that the annotation significantly improves users' reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score.
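A hedged sketch of the sequence-labeling formulation (the sentence and the SHARE tag are illustrative; PI-Extract's actual tag set and neural model are described in the paper): data-practice phrases become BIO-labeled spans that an entity-recognition model can learn.

```python
# Illustrative BIO encoding of a privacy-policy sentence: a data-practice
# phrase ("email address" being shared) is marked as a labeled span.
tokens = ["We", "may", "share", "your", "email", "address", "with", "advertisers", "."]
labels = ["O",  "O",   "O",     "O",    "B-SHARE", "I-SHARE", "O",  "O",          "O"]

def decode_bio(tokens, labels):
    """Convert BIO labels back into (phrase, practice) spans."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append(current)
            current = ([tok], lab[2:])
        elif lab.startswith("I-") and current:
            current[0].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), practice) for words, practice in spans]

print(decode_bio(tokens, labels))  # [('email address', 'SHARE')]
```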


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Florens Rohde ◽  
Martin Franke ◽  
Ziad Sehili ◽  
Martin Lablans ◽  
Erhard Rahm

Abstract
Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases.
Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties with respect to size and the error rate of matching records. We compare (plaintext) record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage.
Results Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters, thanks to an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality.
Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The provided blocking methods deliver order-of-magnitude runtime improvements, facilitating use in research projects with large datasets and many participants.
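For readers unfamiliar with the encoding, a minimal sketch of field-level Bloom-filter encoding for PPRL (the parameters m and k are illustrative, not Mainzelliste's actual defaults): each name is split into character bigrams, each bigram sets k bit positions via double hashing, and encoded fields are compared with the Dice coefficient.

```python
import hashlib

def bigrams(s: str) -> set[str]:
    s = f"_{s.lower()}_"                      # pad to capture word boundaries
    return {s[i:i + 2] for i in range(len(s) - 1)}

def encode(field: str, m: int = 1000, k: int = 20) -> set[int]:
    """Field-level Bloom filter, represented as the set of set bit positions."""
    bits = set()
    for g in bigrams(field):
        h1 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for i in range(k):                    # double hashing: h1 + i*h2 mod m
            bits.add((h1 + i * h2) % m)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

# Error tolerance: similar names share most bigrams, hence most bit positions.
print(dice(encode("meier"), encode("meyer")))  # high similarity despite the typo
```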


Author(s):  
Anushka Vidanage ◽  
Thilina Ranbaduge ◽  
Peter Christen ◽  
Sean Randall

Introduction Over the last decade, the demand for linking records about people across databases has increased in various domains. Privacy challenges associated with linking sensitive information led to the development of privacy-preserving record linkage techniques. The multiple dynamic match-key encoding approach recently proposed by Randall et al. (IJPDS, 2019) is one such technique, aimed at providing sufficient privacy for linkage applications while obtaining high linkage quality. However, the use of this encoding in large databases can reveal frequency information that can allow the re-identification of encoded values.
Objectives We propose a frequency-based attack to evaluate the privacy guarantees of multiple dynamic match-key encoding. We then present two improvements to this match-key encoding approach that prevent such a privacy attack.
Methods The proposed attack analyses the frequency distributions of individual match-keys in order to identify the attributes used for each match-key, where we assume the adversary has access to a plain-text database with similar characteristics to the encoded database. We employ a set of statistical correlation tests to compare the frequency distributions of match-key values between the encoded and plain-text databases. Once the attribute combinations used for match-keys are discovered, we re-identify encoded sensitive values using a frequency alignment method. Next, we propose two modifications to the match-key encoding: one that alters the original frequency distributions and another that makes the frequency distributions uniform. Both help to prevent frequency-based attacks.
Results We evaluate our privacy attack using two large real-world databases. The results show that in certain situations the attack can successfully re-identify a set of sensitive values encoded using the multiple dynamic match-key encoding approach. On the databases used in our experiments, the attack re-identifies plain-text values with precision and recall both up to 98%. Furthermore, we show that our proposed improvements make this attack considerably harder to perform, at the cost of only a small reduction in linkage quality.
Conclusions Our proposed privacy attack demonstrates weaknesses of multiple match-key encoding that should be taken into consideration when linking databases that contain sensitive personal information. Our proposed modifications ensure that the multiple dynamic match-key encoding approach can be used securely while retaining high linkage quality.
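A hedged sketch of the frequency-alignment idea (heavily simplified; the paper's attack uses statistical correlation tests across multiple match-keys): compare the sorted frequency distribution of an encoded match-key with those of candidate plain-text attribute combinations, then map encoded values to plain-text values by frequency rank.

```python
from collections import Counter

def freq_profile(values):
    """Sorted relative frequencies, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]

def profile_distance(p, q):
    """L1 distance between two frequency profiles, padded to equal length."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy data: hashed match-key values vs. a plain-text database whose attribute
# has a similar frequency distribution (the adversary's auxiliary knowledge).
encoded   = ["h1", "h1", "h1", "h2", "h2", "h3"]
plaintext = ["smith", "smith", "smith", "jones", "jones", "lee"]

if profile_distance(freq_profile(encoded), freq_profile(plaintext)) < 0.1:
    # Frequencies align: re-identify encoded values rank-by-rank.
    ranking = zip((v for v, _ in Counter(encoded).most_common()),
                  (v for v, _ in Counter(plaintext).most_common()))
    print(dict(ranking))  # {'h1': 'smith', 'h2': 'jones', 'h3': 'lee'}
```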


Author(s):  
Dinusha Vatsalan ◽  
Peter Christen ◽  
Christine M. O'Keefe ◽  
Vassilios S. Verykios

Privacy-preserving record linkage (PPRL) addresses the problem of identifying records from different databases that correspond to the same real-world entities using quasi-identifying attributes (in the absence of unique entity identifiers), while preserving the privacy of these entities. Privacy is preserved by revealing neither information that could be used to infer the actual values of records not reconciled to the same entity (non-matches), nor confidential or sensitive information (not agreed upon by the data custodians) about records that were reconciled to the same entity (matches), during or after the linkage process. The PPRL process poses three main challenges: scalability to large databases, high linkage quality in the presence of data quality errors, and sufficient privacy guarantees. While many solutions have been developed for the PPRL problem over the past two decades, the literature has so far lacked an evaluation and comparison framework for PPRL solutions with standard numerical measures covering all three properties (scalability, linkage quality, and privacy). We propose a general framework with normalized measures to practically evaluate and compare PPRL solutions in the face of linkage attack methods based on an external global dataset. We conducted experiments on several existing PPRL solutions using real-world databases with our proposed evaluation framework, and the results show that it provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.
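For illustration, hedged sketches of normalized measures for the three properties (simplified stand-ins, not the framework's exact definitions): linkage quality via the F-measure, scalability via the reduction ratio of the comparison space, and privacy via a simple disclosure-risk proxy.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 1.0 is perfect linkage quality."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def reduction_ratio(compared: int, total_pairs: int) -> float:
    """1.0 means blocking removed (almost) all candidate comparisons."""
    return 1 - compared / total_pairs

def disclosure_risk(reidentified: int, encoded_records: int) -> float:
    """Fraction of encoded records an attacker re-identified; 0.0 is ideal."""
    return reidentified / encoded_records

print(f_measure(tp=90, fp=10, fn=20),
      reduction_ratio(compared=5_000, total_pairs=1_000_000),
      disclosure_risk(reidentified=3, encoded_records=10_000))
```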


Author(s):  
Rainer Schnell ◽  
Christian Borgs

Introduction Diagnostic codes, such as ICD-10 codes, may be considered sensitive information. If such codes have to be encoded using current methods for data linkage, all hierarchical information given by the code positions is lost. We present a technique, hierarchy-preserving Bloom filters (HPBFs), that preserves the hierarchical information of the codes while protecting privacy. The new method modifies a widely used privacy-preserving record linkage (PPRL) technique based on Bloom filters for use with hierarchical codes.
Objectives and Approach Assessing the similarity of hierarchical codes requires considering the positions of two codes in a given diagnostic hierarchy. The hierarchical similarities of the original diagnostic code pairs should correspond closely to the similarity of the encoded pairs of the same codes. Furthermore, to assess the hierarchy-preserving properties of an encoding, the impact on similarity measures of differing code positions at all levels of the code hierarchy can be evaluated: a full match of codes should yield a higher similarity than partial matches. Finally, the new method is tested against ad-hoc solutions as an addition to a standard PPRL setup, using real-world mortality data from two databases with known link status.
Results In all applications for encoded ICD codes where categorical discrimination, relational similarity, or linkage quality in a PPRL setting is required, HPBFs outperform other known methods. Lower mean differences and smaller confidence intervals between clear-text code pairs and encrypted code pairs were observed, indicating better preservation of hierarchical similarities. These techniques also allow much better hierarchical discrimination for partial matches.
Conclusion The new technique yields better linkage results than all other known methods for encrypting hierarchical codes. In all tests, comparing categorical discrimination, relational similarity, and PPRL linkage quality, HPBFs outperformed the methods currently in use.
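A hedged sketch of the hierarchy-preservation idea (one plausible construction, not necessarily the exact HPBF method): represent an ICD-10 code by the set of its prefixes, so that codes sharing a chapter or category overlap, and similarity degrades gracefully with hierarchical distance. In a full PPRL setting, each prefix set would then be mapped into a Bloom filter, as in the field-level encoding sketched earlier.

```python
def prefixes(code: str) -> set[str]:
    """All prefixes of a code, e.g. 'J18.9' -> {'J', 'J1', 'J18', 'J18.', 'J18.9'}."""
    return {code[:i] for i in range(1, len(code) + 1)}

def dice(a: set[str], b: set[str]) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

# Sibling codes (same category) score high; codes from another chapter score zero.
print(dice(prefixes("J18.9"), prefixes("J18.1")))  # 0.8
print(dice(prefixes("J18.9"), prefixes("I21.0")))  # 0.0
```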


2020 ◽  
Vol 30 (Supplement_5) ◽  
Author(s):  
J Doetsch ◽  
I Lopes ◽  
R Redinha ◽  
H Barros

Abstract The use and exchange of "big data" are at the forefront of the data science agenda, where Record Linkage plays a prominent role in biomedical research. In an era of ubiquitous data exchange and big data, Record Linkage is almost inevitable, but it raises ethical and legal problems, namely personal data and privacy protection. Record Linkage refers to the merging of information to consolidate facts about an individual or an event that are not available in any separate record. This article provides an overview of ethical challenges and research opportunities in linking routine data on health and education with cohort data from very preterm (VPT) infants in Portugal. Portuguese, European and international law was reviewed on data processing, protection and privacy. A three-stage analysis was carried out: i) the interplay of the threefold levelling of law for Record Linkage at different levels; ii) the impact of data protection and privacy rights on data processing; iii) the challenges and opportunities of the data linkage process for research. A framework to discuss the process and its implications for data protection and privacy was created. The GDPR serves as the principal legal basis for the protection of personal data in Record Linkage, and explicit written consent is considered the appropriate basis for processing sensitive data. In Portugal, retrospective access to routine data is permitted if the data are anonymised; for health data, if the processing requirements declared with explicit consent are met; for education data, if the data processing rules are complied with. Routine health and education data can be linked to cohort data if the rights of the data subject and the requirements and duties of processors and controllers are respected. A strong ethical context, through the application of the GDPR in all phases of research, needs to be established to achieve Record Linkage between cohort and routinely collected records for health and education data of VPT infants in Portugal.
Key messages The GDPR is the most important legal framework for the protection of personal data; however, its uniform approach, which grants freedom to Member States, hampers Record Linkage processes among EU countries. The question remains whether the gap between data protection and privacy is adequately balanced at the three legal levels to guarantee freedom for research and the improvement of the health of data subjects.


2017 ◽  
Vol 16 (3) ◽  
pp. 269-289
Author(s):  
Marc Bourreau ◽  
Bernard Caillaud ◽  
Romain de Nijs

Abstract In this paper we propose a model in which consumer personal data have multidimensional characteristics and are used by platforms to offer ad slots with better targeting possibilities to a market of differentiated advertisers through real-time auctions. A platform controls the amount of information about consumers that it discloses to advertisers, thereby affecting the dispersion of advertisers' valuations for the slot. We first show by way of simulations that the amount of consumer-specific information optimally revealed to advertisers increases with the degree of competition in the advertising market and decreases with the cost of information disclosure, whether for a monopolistic platform, competing platforms or a welfare-maximizing platform, provided the advertising market is not highly concentrated. Second, the welfare-maximizing situation and the imperfectly competitive market situations differ in how the incremental value of information varies: social returns to consumers' data are decreasing, while private returns may be locally increasing or decreasing.
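A toy Monte Carlo sketch of the underlying mechanism (all distributions and functional forms here are assumptions, not the paper's model): in a second-price auction, disclosing more consumer information spreads advertisers' valuations, and the effect of that dispersion on expected revenue depends on how many bidders compete.

```python
import random

def expected_revenue(n_bidders: int, dispersion: float, trials: int = 50_000) -> float:
    """Average second-highest valuation; dispersion proxies disclosed information."""
    total = 0.0
    for _ in range(trials):
        # Valuation = common component + information-driven idiosyncratic part.
        bids = sorted(1.0 + random.gauss(0, dispersion) for _ in range(n_bidders))
        total += max(bids[-2], 0.0)   # the winner pays the second-highest bid
    return total / trials

# With few bidders, dispersion depresses the second price; with many bidders,
# it raises it, echoing why optimal disclosure grows with competition.
for n in (2, 5, 10):
    print(n, [round(expected_revenue(n, d), 3) for d in (0.1, 0.5, 1.0)])
```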


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Heather J. Parker ◽  
Stephen Flowerday

Purpose Social media has created a new level of interconnected communication. However, the use of online platforms brings various ways in which a user's personal data can be put at risk. This study aims to investigate what drives the disclosure of personal information online and whether an increased awareness of the value of personal information motivates users to safeguard it.
Design/methodology/approach Fourteen university students participated in a mixed-methods experiment in which responses to Likert-type scale items were combined with responses to interview questions to provide insight into the cost–benefit analysis users conduct when disclosing information online.
Findings Overall, the findings indicate that users are able to disregard their concerns due to a resigned and apathetic attitude towards privacy. Furthermore, subjective norms, reinforced by fear of missing out (FOMO), further lead users to overlook potential risks to their information in order to avoid social isolation and sanction. Conversely, an increased awareness of the personal value of information and having experienced a previous privacy violation encourage the protection of information and limited disclosure.
Originality/value This study provides insight into privacy and information disclosure on social media in South Africa. To the knowledge of the researchers, this is the first study to combine the theory of planned behaviour and the privacy calculus model, together with the antecedent factors of personal valuation of information, trust in the social media provider, and FOMO.


Author(s):  
Anna Ferrante ◽  
James Boyd ◽  
Sean Randall ◽  
Adrian Brown ◽  
James Semmens

ABSTRACT
Objectives Record linkage is a powerful technique that transforms discrete episode data into longitudinal person-based records. These records enable the construction and analysis of complex pathways of health, disease progression and service use. Achieving high linkage quality is essential for ensuring the quality and integrity of research based on linked data. The methods used to assess linkage quality depend on the volume and characteristics of the datasets involved, the processes used for linkage and the additional information available for quality assessment. This paper proposes and evaluates two methods to routinely assess linkage quality.
Approach Linkage units currently use a range of methods to measure, monitor and improve linkage quality; however, no common approach or standards exist. There is an urgent need to develop "best practices" for evaluating, reporting and benchmarking linkage quality. In assessing linkage quality, the primary interest is knowing the number of true matches and non-matches identified as links and non-links; any misclassification of matches within these groups introduces linkage errors. We present efforts to develop sharable methods to measure linkage quality in Australia: a sampling-based method to estimate both precision (accuracy) and recall (sensitivity) following record linkage, and a benchmarking method, a transparent and transportable methodology to benchmark the quality of linkages across different operational environments.
Results The sampling-based method achieved estimates of linkage quality that were very close to the actual linkage quality metrics, making it a feasible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The benchmarking method provides a systematic approach to estimating linkage quality with a set of open, shareable datasets and a set of well-defined, established performance metrics, offering an opportunity to benchmark the linkage quality of different record linkage operations. Both methods also have the potential to assess the inter-rater reliability of clerical reviews.
Conclusions Both methods produce reliable estimates of linkage quality, enabling the exchange of information within and between linkage communities. It is important that researchers can assess risk in studies using record linkage techniques. Understanding the impact of linkage quality on research outputs highlights the need for standard methods to routinely measure linkage quality. These two methods are a good start to the quality process, but it is important to identify standards and good practices in all parts of the linkage process (pre-processing, standardising activities, linkage, grouping and extracting).
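A hedged sketch of sampling-based estimation of linkage quality (simplified; the paper's sampling design and stratification may differ): clerically review random samples of declared links and non-links, then scale the observed error rates up to estimate precision and recall for the whole linkage.

```python
import random

def estimate_quality(links, non_links, review, sample_size=1_000, seed=7):
    """links/non_links are lists of record pairs; review(pair) -> True if the
    pair is a true match according to human (clerical) judgement."""
    rng = random.Random(seed)
    link_sample = rng.sample(links, min(sample_size, len(links)))
    non_link_sample = rng.sample(non_links, min(sample_size, len(non_links)))

    # Precision: fraction of sampled links confirmed as true matches.
    precision = sum(review(p) for p in link_sample) / len(link_sample)
    # Miss rate: fraction of sampled non-links that were actually matches.
    miss_rate = sum(review(p) for p in non_link_sample) / len(non_link_sample)

    # Scale sample rates up to the full linkage to estimate recall.
    est_tp = precision * len(links)
    est_fn = miss_rate * len(non_links)
    recall = est_tp / (est_tp + est_fn)
    return precision, recall
```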


2018 ◽  
Author(s):  
Jérémie Decouchant ◽  
Maria Fernandes ◽  
Marcus Völp ◽  
Francisco M Couto ◽  
Paulo Esteves-Veríssimo

Abstract Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we argue that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data are produced. We show that a previous approach for filtering short reads does not extend to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length, making it usable with any current or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome, compared with 100,000 in previous work). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56%, at a 2% mutation rate). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
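An illustrative sketch of early read filtering (the k-mer dictionary and k value are assumptions for illustration; the paper's classifier is more sophisticated): flag a raw read as privacy-sensitive if it contains any k-mer overlapping a known sensitive variant, and route flagged reads to stronger protection before alignment.

```python
# Hypothetical dictionary of k-mers around known sensitive variants.
SENSITIVE_KMERS = {"ACGTAC", "TTGACA"}
K = 6

def is_sensitive(read: str) -> bool:
    """True if any k-mer of the read matches the sensitive dictionary."""
    return any(read[i:i + K] in SENSITIVE_KMERS for i in range(len(read) - K + 1))

# Works identically for short and long reads, since it scans k-mer windows.
reads = ["GGGACGTACTT", "CCCCCCCCCCC"]
protected = [r for r in reads if is_sensitive(r)]      # stronger protection
public    = [r for r in reads if not is_sensitive(r)]  # e.g., public cloud
print(len(protected), len(public))  # 1 1
```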

