Privacy preserving record linkage meets record linkage using unencrypted data

Introduction Privacy preserving record linkage (PPRL) addresses privacy concerns by linking encrypted identifiers. It encodes identifiers using Bloom filters and matches records on the encoded data using Dice coefficient similarity. Matching on hashed identifiers can reduce linkage performance because information is lost in the encoding. Objectives and Approach We propose a technique to optimize the Bloom filter parameters and examine whether the optimal parameters increase linkage performance in terms of precision, recall, and f-measure. Consider a set of string values and calculate the similarity between any two of them using the Jaro-Winkler method. Then encode the string values using Bloom filters and calculate the similarity between any two of them using the Dice coefficient. Optimal Bloom filter parameters are those that minimize the difference between the similarities calculated using Jaro-Winkler and the similarities calculated using the Dice coefficient. Results Using publicly available data, several first name and last name datasets, each comprising 1000 unique values, were generated. The following values for the Bloom filter parameters were considered: q in q-grams (q=1,2,3), bit array length (l=50,100,200,500,1000), and number of hash functions (k=5,10,20,50). The following five setups of Bloom filters minimized the difference between the similarities calculated on encrypted data using the Dice coefficient and the similarities calculated on unencrypted data using the Jaro-Winkler method: q=1,l=1000,k=50 / q=1,l=500,k=20 / q=2,l=1000,k=50 / q=3,l=500,k=50. These setups were then used to perform data linkage over 10 synthetically generated datasets. Results show that PPRL achieved performance similar to that of data linkage over unencrypted data.
Conclusion/Implications This study showed that optimal Bloom filter parameters minimized the loss of information resulting from data encryption. Experimental findings indicated that PPRL using optimal Bloom filter parameters achieves almost the same performance as data linkage on unencrypted data in terms of precision, recall, and f-measure.
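
The encoding and comparison steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the padding character, the double-hashing scheme built from SHA-1 and SHA-256, and the function names qgrams, bloom_encode and dice are all assumptions made for the example. Only the parameters q, l and k come from the abstract.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of q-grams (padding with '_' is an assumption)."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, q=2, l=1000, k=50):
    """Return the set of bit positions set to 1 in a Bloom filter of length l.

    Each q-gram is mapped to k positions by double hashing, an assumed
    scheme: position_i = (h1 + i * h2) mod l.
    """
    bits = set()
    for gram in qgrams(value, q):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(k):
            bits.add((h1 + i * h2) % l)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets: 2|A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


same = dice(bloom_encode("smith"), bloom_encode("smith"))
close = dice(bloom_encode("smith"), bloom_encode("smyth"))
print(same, close)
```

A parameter search in the spirit of the abstract would repeat this for each combination of q, l and k and keep the combinations whose Dice similarities lie closest to the Jaro-Winkler similarities of the unencoded strings.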

A suggestion on how to include calendar dates in Bloom Filters

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.33 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Antony Stevens

Keyword(s):

Bloom Filter ◽

The Other ◽

Bloom Filters ◽

Dice Coefficient ◽

Personal Names ◽

Admissible Range ◽

Secure Protocols ◽

The Difference ◽

The One ◽

Calendar Dates

ABSTRACT Objective Bloom Filters have been used in a number of studies conducted for the Ministry of Health. They are usually recommended because they may participate in secure protocols for the exchange of data. In our case the speed of the program, once the filters have been prepared, is so high that this alone is sufficient motive for their adoption. Nevertheless, if two calendar dates differ by one character this may merit more attention than a similar difference in personal names. This became evident in a large linkage between mortality records and hospital separations where the patient had died. Higher scores were obtained when the date fields differed by only one character, but when that character represented a year there would be no reason to notice the pair. When the character difference was compatible with a difference of a few days this would be more interesting, because in studies like the one just cited it would be reasonable to admit differences of a few days or even, perhaps, weeks between the events (recording of the death of the patient). Approach How then to represent the difference between dates in a Bloom Filter? A date can be represented as a Boolean vector in which the bit for the day (or week) is set to '1'. It may instead be represented by several contiguous '1's to allow for admissible uncertainty in comparisons. The similarity between two dates can then simply be the Dice Coefficient of the corresponding vectors. Result But a vector representing a date may then be very large. It could be as much as 365 bits per year, far more than is usually used for the other fields. The number of logical word comparisons would go up and the program would become slower. Knowing that the admissible range is represented by contiguous '1's means that we can obtain the effect of constructing the Bloom Filter and calculating the Dice Coefficient more directly. Starting with the two dates we can obtain the number of bits that are shared, which will depend on the admissible range. The Dice Coefficient can then be calculated directly without the need to construct the Filter. Conclusion We are then left with the decision on how to add the result to the value obtained from the other variables, and this will depend on how much importance the date should have.
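
The direct calculation described in this abstract can be sketched as follows, under assumptions that are not stated in the source: each date is taken to set a run of 2w+1 contiguous bits centred on its day, where w (half_width below) is the admissible number of days on either side. Two such runs whose centres are g days apart share max(0, 2w+1-g) bits, so the Dice Coefficient reduces to that count divided by 2w+1. The function name date_dice is hypothetical.

```python
from datetime import date


def date_dice(d1, d2, half_width=3):
    """Dice Coefficient of two date vectors without building the vectors.

    Assumes each date sets (2 * half_width + 1) contiguous day bits
    centred on the date itself.
    """
    width = 2 * half_width + 1
    gap = abs((d1 - d2).days)
    shared = max(0, width - gap)
    # Dice = 2 * shared / (width + width) = shared / width
    return shared / width


def date_dice_explicit(d1, d2, half_width=3):
    """Same value computed from explicit sets of day ordinals, for comparison."""
    a = set(range(d1.toordinal() - half_width, d1.toordinal() + half_width + 1))
    b = set(range(d2.toordinal() - half_width, d2.toordinal() + half_width + 1))
    return 2 * len(a & b) / (len(a) + len(b))


death = date(2015, 3, 1)
recorded = date(2015, 3, 3)
print(date_dice(death, recorded), date_dice_explicit(death, recorded))
```

With this representation a one-character difference in the year yields a similarity of zero, while a difference of a few days yields a partial score, which matches the behaviour the abstract asks for.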

Secure Privacy Preserving Record Linkage of Large Databases by Modified Bloom Filter Encodings

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.29 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Large Scale ◽

Bloom Filter ◽

Privacy Preserving ◽

Error Rates ◽

Bloom Filters ◽

Data Sets ◽

Research Subjects ◽

Practical Applications ◽

Large Databases

ABSTRACT Objective In most European settings, record linkage across different institutions has to be based on personal identifiers such as names, date of birth or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates of up to 20% per identifier, so linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, but very few of them are suitable for practical applications with large databases containing millions of records, as is typical for administrative or medical databases. One proven technique for large-scale applications is PPRL based on Bloom filters. Method Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. With suitable blocking strategies, linking can be done in reasonable time. Result However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.

A blinded evaluation of privacy preserving record linkage with Bloom filters

BMC Medical Research Methodology ◽

10.1186/s12874-022-01510-2 ◽

2022 ◽

Vol 22 (1) ◽

Author(s):

Sean Randall ◽

Helen Wichmann ◽

Adrian Brown ◽

James Boyd ◽

Tom Eitelhuber ◽

...

Keyword(s):

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Morbidity Data ◽

Linkage Quality ◽

Hospital Morbidity ◽

Western Australian ◽

Direct Investigation

Abstract Background Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However, real-world evaluations are required to confirm their suitability in practice. Methods Extracts of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.

Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v4i1.1095 ◽

2019 ◽

Vol 4 (1) ◽

Author(s):

Adrian P Brown ◽

Sean M Randall ◽

James H Boyd ◽

Anna M Ferrante

Keyword(s):

Real World ◽

Record Linkage ◽

Hamming Distance ◽

Bloom Filters ◽

Limited Information ◽

Dice Coefficient ◽

Jaccard Similarity ◽

Comparison Methods ◽

Linkage Quality ◽

Partial Weight

Introduction The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique using Bloom filters, with cryptographic analyses, modifications, and hashing variations to optimise privacy, has been the focus of much research in this area. With few applications of Bloom filters within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality. Objectives In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of record linkage: Sørensen–Dice coefficient, Jaccard similarity and Hamming distance. Methods Using synthetic datasets with introduced errors to simulate a range of data quality, together with a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves, to compare the resulting quality of the approximate comparison techniques with linkages using simple cut-off similarity values and with linkages using only exact matching. Results Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen–Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets. Conclusion The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages. Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset.
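
The three comparison methods named in this abstract can be sketched on bit vectors as follows. This is an illustration only: storing each Bloom filter as a Python integer, converting the Hamming distance to a similarity by dividing by the filter length, and the function names are all assumptions made for the example, not details taken from the paper.

```python
def popcount(x):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def dice(a, b):
    """Sørensen–Dice coefficient: 2|A and B| / (|A| + |B|)."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 1.0


def jaccard(a, b):
    """Jaccard similarity: |A and B| / |A or B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def hamming_similarity(a, b, length):
    """1 minus the fraction of the `length` bit positions that differ."""
    return 1 - popcount(a ^ b) / length


a = 0b11110000
b = 0b11001100
print(dice(a, b), jaccard(a, b), hamming_similarity(a, b, length=8))
```

For these two 8-bit vectors the functions give 0.5, one third and 0.5 respectively. The Dice and Jaccard values always determine each other through J = D / (2 - D), whereas the Hamming-based score also counts positions where both filters hold a 0, so it depends on the filter length.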

Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1445 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Thilina Ranbaduge ◽

Peter Christen

Keyword(s):

National Security ◽

Record Linkage ◽

Missing Values ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Practical Applications ◽

Binary Encoding ◽

Linkage Quality

Introduction Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications. Objectives and Approach Binary encodings, such as Bloom filters, are popular for linking sensitive databases. They are now employed in real-world linkage applications. However, existing encoding techniques assume the quasi-identifying attributes used for encoding to be complete. Missing values can lead to incomplete encodings, which can result in decreased or increased similarities and therefore in false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, where pairs of records that correspond to the same voter (with name or address changes) resulted in files of 100,000 and 500,000 records containing from 0% to 50% missing values. Results We encoded between two and four of the attributes first name, last name, street, and city into three record-level binary encodings: Cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques as the proportion of missing values increased. CLK resulted in the highest decrease in precision, while TBH resulted in the highest decrease in recall compared to the other encoding techniques. Conclusion Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
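
The effect of a missing value on a record-level encoding can be illustrated with a simplified sketch in the spirit of CLK, where the q-grams of several attributes are hashed into one bit array. Everything specific here is an assumption for illustration: the field names, the example record, the bigram padding, salting each q-gram with its field name, and the SHA-1/SHA-256 double hashing. It is not the encoding used in the study.

```python
import hashlib


def bigrams(value):
    """Set of padded bigrams of a string (padding with '_' is an assumption)."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def record_encode(record, fields, l=1000, k=10):
    """Hash the bigrams of all listed fields into one set of bit positions."""
    bits = set()
    for field in fields:
        value = record.get(field)
        if not value:
            # A missing value contributes no bits to the encoding.
            continue
        for gram in bigrams(value):
            data = f"{field}:{gram}".encode("utf-8")
            h1 = int(hashlib.sha1(data).hexdigest(), 16)
            h2 = int(hashlib.sha256(data).hexdigest(), 16)
            bits.update((h1 + i * h2) % l for i in range(k))
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


FIELDS = ["first_name", "last_name", "street", "city"]
complete = {"first_name": "anna", "last_name": "miller",
            "street": "12 high street", "city": "perth"}
missing_city = dict(complete, city=None)

sim_complete = dice(record_encode(complete, FIELDS), record_encode(complete, FIELDS))
sim_missing = dice(record_encode(complete, FIELDS), record_encode(missing_city, FIELDS))
print(sim_complete, sim_missing)
```

Both records describe the same person, yet the second similarity falls below 1 purely because one attribute is absent, which is the mechanism behind the false non-matches the abstract describes.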

Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i2.640 ◽

2014 ◽

Vol 6 (2) ◽

Cited By ~ 23

Author(s):

Frank Niedermeyer ◽

Simone Steinmetzer ◽

Martin Kroll ◽

Rainer Schnell

Keyword(s):

Statistical Analysis ◽

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Small Subset ◽

Data Set ◽

Successful Attack ◽

Global Data

Bloom filter encoded identifiers are increasingly used for privacy preserving record linkage applications, because they allow for errors in encrypted identifiers. However, little research on the security of Bloom filters has been published so far. In this paper, we formalize a successful attack on Bloom filters composed of bigrams. It has previously been assumed in the literature that an attacker knows the global data set from which a sample is drawn. In contrast, we suppose that an attacker does not know this global data set. Instead, we assume the adversary knows a publicly available list of the most frequent attributes. The attack is based on subtle filtering and elementary statistical analysis of encrypted bigrams. The attack described in this paper can be used for the deciphering of a whole database instead of only a small subset of the most frequent names, as in previous research. We illustrate our proposed method with an attack on a database of encrypted surnames. Finally, we describe modifications of the Bloom filters for preventing similar attacks.

Securing Bloom Filters for Privacy-preserving Record Linkage

Proceedings of the 29th ACM International Conference on Information & Knowledge Management ◽

10.1145/3340531.3412105 ◽

2020 ◽

Author(s):

Thilina Ranbaduge ◽

Rainer Schnell

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

Bloom Filters

The Western Australian Twin Register: A Population-Based Register of Adult and Child Multiples

Twin Research and Human Genetics ◽

10.1375/twin.9.6.712 ◽

2006 ◽

Vol 9 (6) ◽

pp. 712-717 ◽

Cited By ~ 2

Author(s):

Jessica D. Y. Lee ◽

Lyle J. Palmer

Keyword(s):

Medical Research ◽

Genetic Epidemiology ◽

Record Linkage ◽

Collaborative Research ◽

Data Linkage ◽

Population Based ◽

Multiple Births ◽

Efficient Management ◽

Twin Registry ◽

Western Australian

Abstract The Western Australian Twin Register (WATR) was established in 1997 to study the health of all child multiples born in Western Australia (WA). The Register has until recently consisted of all multiples born in WA between 1980 and 1997. Using unique record linkage capacities available through the WA data linkage system, we have subsequently been able to identify all multiples born in WA since 1974. New affiliations with the Australian Twin Registry and the WA Institute for Medical Research are further enabled by the use of the WA Genetic Epidemiology Resource, a high-end bioinformatics infrastructure that allows efficient management of health datasets and facilitates collaborative research capabilities. In addition to this infrastructure, funding provided by these institutions has allowed the extension of the WATR to include a greater number of WA multiples, including those born between 1974 and 1979, and from 1998 onwards. These resources are in the process of being enabled for national and international access.

Encoding Hierarchical Classification Codes for Privacy-Preserving Record Linkage Using Bloom Filters

Machine Learning and Knowledge Discovery in Databases - Communications in Computer and Information Science ◽

10.1007/978-3-030-43887-6_12 ◽

2020 ◽

pp. 142-156

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Hierarchical Classification ◽

Privacy Preserving ◽

Bloom Filters

A Novel Hardware Security Architecture: PD-CRP(PUF Database & Challenge-Response Pair) Bloom Filter on Memristor Based PUF

10.20944/preprints202008.0598.v1 ◽

2020 ◽

Author(s):

Jungwon Lee ◽

Seoyeon Choi ◽

Dayoung Kim ◽

Yunyoung Choi ◽

Wookyung Sun

Keyword(s):

Data Transmission ◽

Hardware Security ◽

Bloom Filter ◽

Transmission Error ◽

Bloom Filters ◽

Search Performance ◽

Security Technology ◽

Security Environment ◽

Filter Size ◽

Simulation Results

Because the development of the internet of things (IoT) requires technology that transfers information between objects without human intervention, the core of IoT security will be secure authentication between devices or between devices and servers. Software-based authentication may be a security vulnerability in IoT, but hardware-based security technology can provide a strong security environment. Physical unclonable functions (PUFs) are hardware security elements suitable for lightweight applications. PUFs can generate challenge-response pairs (CRPs) that cannot be controlled or predicted, by utilizing inherent physical variations that occur in the manufacturing process. In particular, the pulse width memristive PUF (PWM-PUF) improves security performance by applying different write pulse widths and bank structures. Bloom filters (BFs) are probabilistic data structures that answer membership queries using little memory. Bloom filters can improve search performance and reduce memory usage, and are used in areas such as networking, security, big data, and IoT. In this paper, we propose a structure that applies Bloom filters based on the PWM-PUF to reduce PUF data transmission errors. The proposed structure uses two different Bloom filter types that store different information and that are located in front of and behind the PWM-PUF, improving security by removing challenges from attacker access. Simulation results show that the proposed structure decreases the data transmission error rate and reuse rate as the Bloom filter size increases. They also show that the proposed structure improves PWM-PUF security with a very small Bloom filter memory.
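
The membership-query behaviour of a Bloom filter mentioned in this abstract can be sketched generically as follows. This is a textbook-style illustration, not the PD-CRP architecture proposed in the paper: the class name, the default size and number of hash functions, the SHA-1/SHA-256 double hashing, and the example challenge string are all assumptions.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, chosen for clarity

    def _positions(self, item):
        """Map an item to num_hashes positions by double hashing."""
        data = item.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present", False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


used_challenges = BloomFilter()
used_challenges.add("challenge-0001")
print("challenge-0001" in used_challenges)
```

A larger bit array lowers the false positive rate for a fixed number of stored items, which is consistent with the abstract's observation that error and reuse rates fall as the Bloom filter size increases.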
