Differential privacy in the 2020 US census: what will it do? Quantifying the accuracy/privacy tradeoff

2020 ◽  
Vol 3 ◽  
pp. 1722
Author(s):  
Samantha Petti ◽  
Abraham Flaxman

Background: The 2020 US Census will use a novel approach to disclosure avoidance, called TopDown, to protect respondents’ data. This TopDown algorithm was applied to the 2018 end-to-end (E2E) test of the decennial census. The computer code used for this test as well as accompanying exposition has recently been released publicly by the Census Bureau. Methods: We used the available code and data to better understand the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data. We also developed an empirical measure of privacy loss to compare the error and privacy of the new approach to that of a (non-differentially private) simple-random-sampling approach to protecting privacy. Results: We found that the empirical privacy loss of TopDown is substantially smaller than the theoretical guarantee for all privacy loss budgets we examined. When run on the 1940 census data, TopDown with a privacy budget of 1.0 was similar in error and privacy loss to a simple random sample of 50% of the US population. When run with a privacy budget of 4.0, it was similar in error and privacy loss to a 90% sample. Conclusions: This work fits into the beginning of a discussion on how to best balance privacy and accuracy in decennial census data collection, and there is a need for continued discussion.
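The kind of comparison described in the abstract above (noise-infused counts versus counts estimated from a simple random sample) can be illustrated with a toy sketch. This is not the TopDown algorithm and does not reproduce the paper's results: it assumes a plain Laplace mechanism on a single sensitivity-1 count, and every name and number in it (`dp_count`, `sampled_count`, the population of 5,000) is hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng):
    """Noisy count for a sensitivity-1 counting query (Laplace mechanism).

    Assumption: a single query answered with the whole budget epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def sampled_count(true_count, population, fraction, rng):
    """Estimate a count from a simple random sample drawn without replacement."""
    n = max(1, round(fraction * population))
    # Treat people 0 .. true_count-1 as the group being counted.
    hits = sum(1 for person in rng.sample(range(population), n)
               if person < true_count)
    # Scale the sample count back up to the full population.
    return hits * population / n


def mean_abs_error(estimator, true_count, trials):
    """Average absolute error of a zero-argument estimator over repeated runs."""
    return sum(abs(estimator() - true_count) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(0)
    population, true_count, trials = 5_000, 150, 200  # hypothetical values
    for eps in (1.0, 4.0):
        err = mean_abs_error(lambda: dp_count(true_count, eps, rng),
                             true_count, trials)
        print(f"Laplace, epsilon={eps}: mean abs error {err:.2f}")
    for frac in (0.5, 0.9):
        err = mean_abs_error(
            lambda: sampled_count(true_count, population, frac, rng),
            true_count, trials)
        print(f"{frac:.0%} sample: mean abs error {err:.2f}")
```

In this toy setting the Laplace error depends only on epsilon while the sampling error depends on the size of the group and the sample, so the numbers are not comparable to the 50% and 90% equivalences reported above; the sketch only shows how the two kinds of error can be put on the same scale.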

2019 ◽  
Vol 3 ◽  
pp. 1722 ◽  
Author(s):  
Samantha Petti ◽  
Abraham Flaxman

Background: The 2020 US Census will use a novel approach to disclosure avoidance, called TopDown, to protect respondents’ data. This TopDown algorithm was applied to the 2018 end-to-end (E2E) test of the decennial census. The computer code used for this test as well as accompanying exposition has recently been released publicly by the Census Bureau. Methods: We used the available code and data to better understand the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data. We also developed an empirical measure of privacy loss to compare the error and privacy of the new approach to that of a simple-random-sampling approach to protecting privacy. Results: We found that the empirical privacy loss of TopDown is substantially smaller than the theoretical guarantee for all privacy loss budgets we examined. When run on the 1940 census data, TopDown with a privacy budget of 1.0 was similar in error and privacy loss to a simple random sample of 50% of the US population. When run with a privacy budget of 4.0, it was similar in error and privacy loss to a 90% sample. Conclusions: This work fits into the beginning of a discussion on how to best balance privacy and accuracy in decennial census data collection, and there is a need for continued discussion.


2021 ◽  
Vol 14 (10) ◽  
pp. 1805-1817
Author(s):  
David Pujol ◽  
Yikai Wu ◽  
Brandon Fain ◽  
Ashwin Machanavajjhala

Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to protect the privacy of individuals from whom they collect data. Differential Privacy (DP) provides a solution to release useful summary data while preserving privacy. Most DP mechanisms are designed to answer a single set of queries. In reality, there are often multiple stakeholders that use a given data release and have overlapping but not-identical queries. This introduces a novel joint optimization problem in DP where the privacy budget must be shared among different analysts. We initiate study into the problem of DP query answering across multiple analysts. To capture the competing goals and priorities of multiple analysts, we formulate three desiderata that any mechanism should satisfy in this setting - The Sharing Incentive, Non-interference, and Adaptivity - while still optimizing for overall error. We demonstrate how existing DP query answering mechanisms in the multi-analyst settings fail to satisfy at least one of the desiderata. We present novel DP algorithms that provably satisfy all our desiderata and empirically show that they incur low error on realistic tasks.
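To make the budget-sharing problem in the abstract above concrete, the following is a minimal sketch of a naive baseline: split the total privacy budget among analysts and answer each analyst's counting query separately with the Laplace mechanism. It is not one of the paper's algorithms, and the function names, analyst names, and weights are hypothetical. It relies only on sequential composition (the per-analyst budgets sum to the total) and on the fact that Laplace noise with scale b has expected absolute value b.

```python
def split_budget(total_epsilon, weights):
    """Divide a total privacy budget among analysts in proportion to weights.

    By sequential composition, answering each analyst's queries with their
    own share keeps the overall privacy loss at most total_epsilon.
    """
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight
            for name, w in weights.items()}


def expected_abs_error(epsilon, sensitivity=1.0):
    """Expected absolute error of the Laplace mechanism: scale = sensitivity / epsilon."""
    return sensitivity / epsilon


if __name__ == "__main__":
    # Hypothetical analysts with equal priority.
    shares = split_budget(1.0, {"analyst_a": 1, "analyst_b": 1,
                                "analyst_c": 1, "analyst_d": 1})
    for name, eps in shares.items():
        print(name, eps, expected_abs_error(eps))
```

With four equally weighted analysts and a total budget of 1.0, each share is 0.25, so each analyst's expected error is four times what it would be with the whole budget. Analysts with overlapping queries pay for the same answer more than once under this baseline, which is the inefficiency that joint optimization across analysts aims to reduce.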


1991 ◽  
Vol 11 (4) ◽  
pp. 357-398 ◽  
Author(s):  
Michael L. Cohen

ABSTRACT: The census is a social fact, the outcome of a process that involves the interaction of public laws and institutions and citizens' responses to an official inquiry. However, it is not a ‘hard’ fact. Reasons for inevitable defects in the census count are listed in the first section; the second section reports efforts by the US Census Bureau to identify sources of error in census coverage, and make estimates of the size of the errors. The use of census data for policy purposes, such as political representation and allocating funds, makes these defects controversial. Errors may be removed by making adjustments to the initial census count. However, because adjustment reallocates resources between groups, it has become the subject of political conflict. The paper describes the conflict between statistical practices, laws and public policy about census adjustment in the United States, and concludes by considering the extent to which causes in America are likely to be found in other countries.


Stroke ◽  
2017 ◽  
Vol 48 (suppl_1) ◽  
Author(s):  
Tzu-Ching Wu ◽  
Christy M Ankrom ◽  
Arvind B Bambhroliya ◽  
Shima Borzorgui ◽  
Sean I Savitz

Objective: Access to care is an important healthcare goal, but access to research is also important to patients. We sought to gain an understanding of the status of stroke research among the various stroke designated hospitals in the state and to identify regions and facilities that lack access to stroke research. Methods: Texas Department of State Health Services (TDSHS) designated stroke facilities (DSF) were surveyed using a standardized questionnaire via telephone/email to confirm stroke center status, presence of a dedicated stroke coordinator, use of telestroke services, and participation in stroke research. Stroke discharge data were obtained from TDSHS and stroke volumes (by ICD) were estimated for 2013 for all non-DSF. Census data were obtained from the US Census Bureau. Results: In total, 109/136 (80%) TDSHS DSF responded to the survey. Only 32/109 (29%) of the TDSHS DSF are participating in stroke research, mostly in the 4 metropolitan areas (fig 1). We identified 16 non-DSF that have 100-149 stroke discharges, and another 21 non-DSF that have ≥ 150 stroke discharges (fig 1). Over half (53%) of the DSF in the state are utilizing telestroke services. Conclusions: Most clinical stroke research conducted in Texas is in the 4 metropolitan markets. Our findings demonstrate that over 50% of Texans (~14 million) reside outside of the 4 markets and therefore may lack access to stroke research. To increase access, we identified several non-DSF in the state with substantial stroke discharges (fig 1). Partnerships between academic centers and non-DSF, through telemedicine and other relationships, should be considered to expand opportunities for participation in stroke research throughout the state.


2019 ◽  
Vol 109 ◽  
pp. 403-408 ◽  
Author(s):  
Steven Ruggles ◽  
Catherine Fitch ◽  
Diana Magnuson ◽  
Jonathan Schroeder

The Census Bureau has announced new methods for disclosure control in public use data products. The new approach, known as differential privacy, represents a radical departure from current practice. In its pure form, differential privacy techniques may make the release of useful microdata impossible and limit the utility of tabular small-area data. Adoption of differential privacy will have far-reaching consequences for research. It is likely that scientists, planners, and the public will lose the free access we have enjoyed for six decades to reliable public Census Bureau data describing US social and economic change.


2020 ◽  
Author(s):  
Mathew Hauer ◽  
Alexis R Santos-Lozada

Scientists and policy makers rely on accurate population and mortality data to inform efforts regarding the coronavirus disease 2019 (COVID-19) pandemic, with age-specific mortality rates of high importance due to the concentration of COVID-19 deaths at older ages. Population counts – the principal denominators for calculating age-specific mortality rates – will be subject to noise infusion in the United States with the 2020 Census via a disclosure avoidance system based on differential privacy. Using COVID-19 mortality curves from the CDC, we show that differential privacy will introduce substantial distortion in COVID-19 mortality rates – sometimes causing mortality rates to exceed 100% – hindering our ability to understand the pandemic. This distortion is particularly large for population groupings with fewer than 1000 persons – 40% of all county-level age-sex groupings and 60% of race groupings. The US Census Bureau should consider a larger privacy budget and data users should consider pooling data to increase population sizes to minimize differential privacy’s distortion.
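The arithmetic behind rates above 100% can be shown with a short sketch: a mortality rate is deaths divided by a population denominator, so when noise pushes a small denominator below the death count, the rate exceeds 100%. The counts below are invented for illustration and are not taken from the study; `mortality_rate_pct` and `pooled_rate_pct` are hypothetical helper names.

```python
def mortality_rate_pct(deaths, population):
    """Mortality rate as a percentage; None when the denominator is not positive."""
    if population <= 0:
        return None
    return 100.0 * deaths / population


def pooled_rate_pct(groups):
    """Rate after pooling (deaths, population) pairs into a single group."""
    total_deaths = sum(d for d, _ in groups)
    total_population = sum(p for _, p in groups)
    return mortality_rate_pct(total_deaths, total_population)


if __name__ == "__main__":
    # Hypothetical small group: 4 deaths among 6 people, denominator noised to 3.
    print(mortality_rate_pct(4, 6))   # true denominator
    print(mortality_rate_pct(4, 3))   # noisy denominator, rate above 100%
    # Hypothetical large group: the same absolute noise barely moves the rate.
    print(mortality_rate_pct(40, 6000), mortality_rate_pct(40, 5997))
    # Pooling the noisy small group with the large one dampens the distortion.
    print(pooled_rate_pct([(4, 3), (40, 5997)]))
```

The same absolute error of three persons moves the small group's rate from about 67% to about 133% but changes the large group's rate only in the fourth decimal place, which is why pooling small groups into larger ones reduces the distortion.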


2021 ◽  
Vol 7 ◽  
pp. 237802312110236
Author(s):  
Alexis R. Santos-Lozada

Descriptions of the effect of the implementation of a new disclosure avoidance system (DAS), which relies on differential privacy, emphasize the impact on our understanding of contemporary social and health dynamics. However, focusing on the overall population may obscure important changes in subpopulation indicators, such as age-specific rates, resulting from this implementation. The author provides a visualization that compares infant mortality rates calculated using 2009–2011 county-level average death counts and denominators derived from the traditional and proposed DASs. Death counts come from the National Center for Health Statistics and denominators come from the first U.S. Census Bureau demonstration products. These visualizations indicate that infant mortality rates produced using the proposed DAS are different from those produced using the traditional methods, with higher variation observed for nonmetropolitan counties and areas with smaller populations. These findings suggest that the proposed DAS will hinder our ability to understand contemporary health dynamics in the United States.


Author(s):  
Amy O'Hara ◽  
Quentin Brummet

An expanding body of data privacy research reveals that computational advances and ever-growing amounts of publicly retrievable data increase re-identification risks. Because of this, data publishers are realizing that traditional statistical disclosure limitation methods may not protect privacy. This paper discusses the use of differential privacy at the US Census Bureau to protect the published results of the 2020 census. We first discuss the legal framework under which the Census Bureau intends to use differential privacy. The Census Act in the US states that the agency must keep information confidential, avoiding “any publication whereby the data furnished by any particular establishment or individual under this title can be identified.” The fact that Census may release fewer statistics in 2020 than in 2010 is leading scholars to parse the meaning of identification and reevaluate the agency’s responsibility to balance data utility with privacy protection. We then describe technical aspects of the application of differential privacy in the U.S. Census. This data collection is enormously complex and serves a wide variety of users and uses -- 7.8 billion statistics were released using the 2010 US Census. This complexity strains the application of differential privacy to ensure appropriate geographic relationships, respect legal requirements for certain statistics to be free of noise infusion, and provide information for detailed demographic groups. We end by discussing the prospects of applying formal mathematical privacy to other information products at the Census Bureau. At present, techniques exist for applying differential privacy to descriptive statistics, histograms, and counts, but are less developed for more complex data releases including panel data, linked data, and vast person-level datasets. 
We expect the continued development of formally private methods to occur alongside discussions of what privacy means and the policy issues involved in trading off protection for accuracy.


2019 ◽  
Author(s):  
Mathew Hauer ◽  
James Byars

BACKGROUND: The Internal Revenue Service's (IRS) county-to-county migration data are an incredible resource for understanding migration in the United States. Produced annually since 1990 in conjunction with the US Census Bureau, the IRS migration data represent 95 to 98 percent of the tax-filing universe and their dependents, making the IRS migration data one of the largest sources of migration data. However, any analysis using the IRS migration data must process at least seven legacy formats of these public data across more than 2,000 data files, a serious burden for migration scholars. OBJECTIVE: To produce a single, flat data file containing complete county-to-county IRS migration flow data and to make the computer code to process the migration data freely available. METHODS: This paper uses R to process more than 2,000 IRS migration files into a single, flat data file for use in migration research. CONTRIBUTION: To encourage and facilitate the use of these data, we provide a single, standardized, flat data file containing county-to-county 1-year migration flows for the period 1990-2010 (containing 163,883 dyadic county pairs resulting in 3.2 million county pair-year observations totaling over 343 million migrants) and provide the full R script to download, process, and flatten the IRS migration data.


2021 ◽  
Vol 7 ◽  
pp. 237802312199401
Author(s):  
Mathew E. Hauer ◽  
Alexis R. Santos-Lozada

Scholars rely on accurate population and mortality data to inform efforts regarding the coronavirus disease 2019 (COVID-19) pandemic, with age-specific mortality rates of high importance because of the concentration of COVID-19 deaths at older ages. Population counts, the principal denominators for calculating age-specific mortality rates, will be subject to noise infusion in the United States with the 2020 census through a disclosure avoidance system based on differential privacy. Using empirical COVID-19 mortality curves, the authors show that differential privacy will introduce substantial distortion in COVID-19 mortality rates, sometimes causing mortality rates to exceed 100 percent, hindering our ability to understand the pandemic. This distortion is particularly large for population groupings with fewer than 1,000 persons: 40 percent of all county-level age-sex groupings and 60 percent of race groupings. The U.S. Census Bureau should consider a larger privacy budget, and data users should consider pooling data to minimize differential privacy’s distortion.

