Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks

Abstract: The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or “beacon”) is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual’s whole genome sequence), the individual’s membership in a beacon can be inferred through repeated queries for variants present in the individual’s genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets.
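The third strategy (a per-user access budget for each individual genome) can be illustrated with a minimal sketch. This is not the paper's algorithm: the class name `BudgetedBeacon`, the flat count-based budget, and the variant tuples are all assumptions made for illustration, and a real scheme would likely weight each query by how rare the allele is.

```python
from collections import defaultdict


class BudgetedBeacon:
    """Toy beacon: each user may 'touch' each genome at most `budget` times.

    Hypothetical illustration only; names and the flat budget are assumptions.
    """

    def __init__(self, genomes, budget):
        # genomes: dict mapping individual id -> set of (chrom, pos, allele)
        self.genomes = genomes
        self.budget = budget
        # (user, individual) -> number of "yes" answers this genome contributed
        self.spent = defaultdict(int)

    def query(self, user, variant):
        """Return True if any genome with remaining budget carries the variant."""
        answer = False
        for individual, variants in self.genomes.items():
            if variant not in variants:
                continue
            if self.spent[(user, individual)] >= self.budget:
                # Budget exhausted: this genome no longer contributes
                # to answers given to this user.
                continue
            self.spent[(user, individual)] += 1
            answer = True
        return answer
```

With a budget of 1, a user who has already received a "yes" that an individual's genome contributed to gets no further signal from that genome, while a different user starts with a fresh budget.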


Thinking in multitudes: Questionnaires and composite cases in early American psychology

History of the Human Sciences ◽ 10.1177/0952695120903909 ◽ 2020 ◽ Vol 33 (3-4) ◽ pp. 160-174 ◽ Cited By ~ 1

Author(s): Jacy L. Young

Keyword(s): Scientific Reasoning ◽ Large Scale ◽ Individual Case ◽ Early American ◽ Large Scale Data ◽ American Psychology ◽ Questionnaire Research ◽ The Individual ◽ Scale Data

In the late 19th century, the questionnaire was one means of taking the case study into the multitudes. This article engages with Forrester’s idea of thinking in cases as a means of interrogating questionnaire-based research in early American psychology. Questionnaire research was explicitly framed by psychologists as a practice involving both natural historical and statistical forms of scientific reasoning. At the same time, questionnaire projects failed to successfully enact the latter aspiration in terms of synthesizing masses of collected data into a coherent whole. Difficulties in managing the scores of descriptive information questionnaires generated ensured the continuing presence of individuals in the results of this research, as the individual case was excerpted and discussed alongside a cast of others. As a consequence, questionnaire research embodied an amalgam of case, natural historical, and statistical thinking. Ultimately, large-scale data collection undertaken with questionnaires failed in its aim to construct composite exemplars or ‘types’ of particular kinds of individuals; to produce the singular from the multitudes.


Building theories of consistency and variability in children’s language development: A large-scale data approach

10.31234/osf.io/z7yrx ◽ 2021

Author(s): Angeline Tsui ◽ Virginia A. Marchman ◽ Michael C. Frank

Keyword(s): Language Learning ◽ Data Aggregation ◽ Large Scale ◽ Meta Analysis ◽ Secondary Data ◽ Small Sample ◽ Early Language ◽ Large Scale Data ◽ Small Sample Sizes ◽ Scale Data

Young children typically begin learning words during their first two years of life. At the same time, they vary substantially in their language learning. Similarities and differences in language learning call for a quantitative theory that can predict and explain which aspects of early language are consistent and which are variable. However, current developmental research practices limit our ability to build such quantitative theories because of small sample sizes and challenges related to reproducibility and replicability. In this chapter, we suggest that three approaches (meta-analysis, multi-site collaborations, and secondary data aggregation) can together address some of the limitations of current research in the developmental area. We review the strengths and limitations of each approach and end by discussing the potential impacts of combining these three approaches.


Online celebrity discourses on Facebook

The Journal of Fandom Studies ◽ 10.1386/jfs_00026_1 ◽ 2020 ◽ Vol 8 (3) ◽ pp. 305-319 ◽ Cited By ~ 1

Author(s): Dániel Hegedűs

Keyword(s): Social Media ◽ Large Scale ◽ Modern Society ◽ Large Scale Data ◽ Cognitive Patterns ◽ Social Media Platforms ◽ Altered Form ◽ The Individual ◽ Facebook Pages ◽ Scale Data

The web 2.0 phenomenon and social media – without question – have reshaped our everyday experiences. These changes that they have generated affect how we consume, communicate and present ourselves, just to name a few aspects of life, and moreover, opened up new perspectives for sociology. Though many social practices persist in a somewhat altered form, brand new types of entities have emerged on different social media platforms: one of them is the video blogger. These actors have gained great visibility through so-called micro-celebrity practices and have become potential large-scale distributors of ideas, values and knowledge. Celebrities, in this case micro-celebrities (video bloggers), may disseminate such cognitive patterns through their constructed discourse which is objectified in the online space through a peculiar digital face (a social media profile) where fans can react, share and comment according to the affordances of the digital space. Most importantly, all of these interactions are accessible for scholars to examine the fan and celebrity practices of our era. This research attempts to reconstruct these discursive interactions on the Facebook pages of ten top Hungarian video bloggers. All findings are based on a large-scale data collection using the Netvizz application. As part of the interpretation of the results, a further consideration was that celebrity discourses may be a sort of disciplinary force in (post)modern society, which normalizes the individual to some extent by providing adequate schemas of attitude, mentality and ways of consumption.


Evaluation of Large-scale Data to Detect Irregularity in Payment for Medical Services

Methods of Information in Medicine ◽ 10.3414/me15-01-0076 ◽ 2016 ◽ Vol 55 (03) ◽ pp. 284-291

Author(s): Junghyun Park ◽ Seokjoon Yoon ◽ Minki Kim

Keyword(s): Large Scale ◽ Healthcare Sector ◽ Benford's Law ◽ Individual Level ◽ Large Scale Data ◽ Level Data ◽ Depth Analysis ◽ The Individual ◽ Scale Data

Summary. Background: Sophisticated anti-fraud systems for the healthcare sector have been built based on several statistical methods. Although existing methods have been developed to detect fraud in the healthcare sector, these algorithms consume considerable time and cost, and lack a theoretical basis to handle large-scale data. Objectives: Based on mathematical theory, this study proposes a new approach to using Benford’s Law in that we closely examined the individual-level data to identify specific fees for in-depth analysis. Methods: We extended the mathematical theory to demonstrate the manner in which large-scale data conform to Benford’s Law. Then, we empirically tested its applicability using actual large-scale healthcare data from Korea’s Health Insurance Review and Assessment (HIRA) National Patient Sample (NPS). For Benford’s Law, we considered the mean absolute deviation (MAD) formula to test the large-scale data. Results: We conducted our study on 32 diseases, comprising 25 representative diseases and 7 DRG-regulated diseases. We performed an empirical test on 25 diseases, showing the applicability of Benford’s Law to large-scale data in the healthcare industry. For the seven DRG-regulated diseases, we examined the individual-level data to identify specific fees to carry out an in-depth analysis. Among the eight categories of medical costs, we considered the strength of certain irregularities based on the details of each DRG-regulated disease. Conclusions: Using the degree of abnormality, we propose priority action to be taken by government health departments and private insurance institutions to bring unnecessary medical expenses under control. However, when we detect deviations from Benford’s Law, relatively high contamination ratios are required at conventional significance levels.
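As a rough illustration of the MAD test mentioned in the Methods, the sketch below compares observed first-digit proportions with the Benford expectation log10(1 + 1/d). The abstract does not give the exact formula, so the commonly used first-digit form (mean of the nine absolute differences) is an assumption here, as are the function names.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit proportions under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '4.5e-03'.
    return int(f"{abs(x):.15e}"[0])


def benford_mad(values):
    """Mean absolute deviation between observed and Benford proportions.

    Assumed form: average of |observed - expected| over digits 1..9.
    """
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    expected = benford_expected()
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9
```

A small MAD suggests the amounts are close to Benford-distributed; larger values flag fee categories that may deserve the kind of in-depth, individual-level look the abstract describes. Any threshold for "large" would have to come from the paper or from domain practice.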


Microsatellite analysis of pooled Schistosoma mansoni DNA: an approach for studies of parasite populations

Parasitology ◽ 10.1017/s0031182005009066 ◽ 2005 ◽ Vol 132 (3) ◽ pp. 331-338 ◽ Cited By ~ 21

Author(s): L. K. SILVA ◽ S. LIU ◽ R. E. BLANTON

Keyword(s): Schistosoma Mansoni ◽ Large Scale ◽ Small Sample Size ◽ Association Studies ◽ Microsatellite Analysis ◽ Small Sample ◽ Genetic Composition ◽ The Individual ◽ Parasite Populations ◽ Pooled Samples

Human parasites are often distributed in metapopulations, which makes random sampling for genetic epidemiology difficult. The typical approach to sampling Schistosoma mansoni involves laboratory passage to obtain individual worms with small sample size and selection bias as a consequence. By contrast, the naturally pooled samples from egg output in stool or urine directly represent the genetic composition of current populations. To test whether pooled samples could be used to estimate population allele frequencies, DNA from individual cloned parasites was pooled and amplified by PCR for 7 microsatellites. By polyacrylamide gel analysis, the relative band intensities of the products from the major alleles in the pooled samples differed by 0–6% from the summed intensities of the individual clones (mean=2·1%±2·1% S.D.). The number of PCR cycles (25–40) did not influence the accuracy of the estimate. Varying the frequency of 1 allele in pooled samples from 32 to 69% likewise did not affect accuracy. Allele frequency estimates from aggregate samples such as eggs will be a better foundation for studies of parasite population dynamics as well as the basis for large-scale association studies of host and parasite characteristics.


Scale-Selective Ridge Regression for Multimodel Forecasting

Journal of Climate ◽ 10.1175/jcli-d-13-00030.1 ◽ 2013 ◽ Vol 26 (20) ◽ pp. 7957-7965 ◽ Cited By ~ 6

Author(s): Timothy DelSole ◽ Liwei Jia ◽ Michael K. Tippett

Keyword(s): Least Squares ◽ Ridge Regression ◽ Large Scale ◽ Small Sample Size ◽ Ordinary Least Squares ◽ Small Sample ◽ Spatial Gradients ◽ Smoothness Constraint ◽ Grid Points ◽ The Individual

Abstract: This paper proposes a new approach to linearly combining multimodel forecasts, called scale-selective ridge regression, which ensures that the weighting coefficients satisfy certain smoothness constraints. The smoothness constraint reflects the “prior assumption” that seasonally predictable patterns tend to be large scale. In the absence of a smoothness constraint, regression methods typically produce noisy weights and hence noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. The proposed method is equivalent to minimizing a cost function comprising the familiar mean square error plus a “penalty function” that penalizes weights with large spatial gradients. The method reduces to pointwise ridge regression for a suitable choice of constraint. The method is tested using the Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) hindcast dataset during 1960–2005. The cross-validated skill of the proposed forecast method is shown to be larger than the skill of either ordinary least squares or pointwise ridge regression, although the significance of this difference is difficult to test owing to the small sample size. The model weights derived from the method are much smoother than those obtained from ordinary least squares or pointwise ridge regression. Interestingly, regressions in which the weights are completely independent of space give comparable overall skill. The scale-selective ridge is numerically more intensive than pointwise methods since the solution requires solving equations that couple all grid points together.
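The idea of a squared-error cost plus a penalty on spatial gradients of the weights can be sketched in a deliberately reduced setting. The example below assumes a single model on a one-dimensional line of grid points and a first-difference penalty, so it is not the paper's multimodel formulation; the function name `smooth_weights` and the parameter `lam` are hypothetical. It minimizes sum over t, s of (y[t][s] - w[s] * x[t][s])^2 + lam * sum over s of (w[s+1] - w[s])^2, whose normal equations are tridiagonal and couple neighbouring grid points.

```python
def smooth_weights(x, y, lam):
    """Weights w[s] for a 1-D grid with a first-difference smoothness penalty.

    x, y: lists of T rows (times), each holding S grid-point values
    (forecasts and observations). Assumes every grid point has a
    non-zero forecast at some time. Illustrative sketch only.
    """
    n_points = len(x[0])
    # Data terms of the normal equations, one per grid point.
    a = [sum(row[s] ** 2 for row in x) for s in range(n_points)]
    b = [sum(xr[s] * yr[s] for xr, yr in zip(x, y)) for s in range(n_points)]
    # lam * D^T D for first differences: diagonal 1, 2, ..., 2, 1; off-diagonal -1.
    diag = [a[s] + lam * ((s > 0) + (s < n_points - 1)) for s in range(n_points)]
    off = -lam
    # Thomas algorithm for the symmetric tridiagonal system.
    cp = [0.0] * n_points
    dp = [0.0] * n_points
    cp[0] = off / diag[0]
    dp[0] = b[0] / diag[0]
    for s in range(1, n_points):
        denom = diag[s] - off * cp[s - 1]
        cp[s] = off / denom
        dp[s] = (b[s] - off * dp[s - 1]) / denom
    w = [0.0] * n_points
    w[-1] = dp[-1]
    for s in range(n_points - 2, -1, -1):
        w[s] = dp[s] - cp[s] * w[s + 1]
    return w
```

In this toy setting, `lam = 0` recovers independent least-squares weights at each grid point, and a very large `lam` drives the weights toward a single space-independent value, mirroring the two limiting cases the abstract contrasts.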


Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽ 10.1541/ieejias.140.480 ◽ 2020 ◽ Vol 140 (6) ◽ pp. 480-487

Author(s): Minoru Kondo

Keyword(s): Machine Learning ◽ Anomaly Detection ◽ Large Scale ◽ Learning Method ◽ Large Scale Data ◽ Scale Data


CONTRIBUTION OF DNA COPY-NUMBER VARIATION (CNV) TO CANCER SUSCEPTIBILITY AND LARGE-SCALE GENOME ALTERATIONS IN OSTEOSARCOMA (OS)

Clinical & Investigative Medicine ◽ 10.25011/cim.v31i4.4821 ◽ 2008 ◽ Vol 31 (4) ◽ pp. 19

Author(s): I Pasic ◽ A Shlien ◽ A Novokmet ◽ C Zhang ◽ U Tabori ◽ ...

Keyword(s): Genomic Instability ◽ Large Scale ◽ Structural Variation ◽ Age Of Onset ◽ Cancer Susceptibility ◽ Small Sample Size ◽ SNP Array ◽ Tumour Suppressor Gene ◽ Small Sample ◽ International HapMap Project

Introduction: OS, a common Li-Fraumeni syndrome (LFS)-associated neoplasm, is a common bone malignancy of children and adolescents. Sporadic OS is also characterized by young age of onset and high genomic instability, suggesting a genetic contribution to disease. This study examined the contribution of novel DNA structural variation elements, CNVs, to OS susceptibility. Given our finding of excessive constitutional DNA CNV in LFS patients, which often coincide with cancer-related genes, we hypothesized that constitutional CNV may also provide clues about the aetiology of LFS-related sporadic neoplasms like OS. Methods: CNV in blood DNA of 26 patients with sporadic OS was compared to that of 263 normal control samples from the International HapMap project, as well as 62 local controls. Analysis was performed on DNA hybridized to Affymetrix genome-wide human SNP array 6.0 by Partek Genomic Suite. Results: There was no detectable difference in average number of CNVs, CNV length, and total structural variation (product of average CNV number and length) between individuals with OS and controls. While this data is preliminary (small sample size), it argues against the presence of constitutional genomic instability in individuals with sporadic OS. Conclusion: We found that the majority of tumours from patients with sporadic OS show CN loss at chr3q13.31, raising the possibility that chr3q13.31 may represent a “driver” region in OS aetiology. In at least one OS tumour, which displays CN loss at chr3q13.31, we demonstrate decreased expression of a known tumour suppressor gene located at chr3q13.31. We are investigating the role of chr3q13.31 in development of OS.
