Is Predicted Data a Viable Alternative to Real Data?

2019 ◽  
Vol 34 (2) ◽  
pp. 485-508
Author(s):  
Tomoki Fujii ◽  
Roy van der Weide

Abstract: It is costly to collect the household- and individual-level data that underlie official estimates of poverty and health. For this reason, developing countries often lack the budget to update estimates of poverty and health regularly, even though these estimates are needed most there. One way to reduce the financial burden is to substitute some of the real data with predicted data by means of double sampling, where the expensive outcome variable is collected for a subsample and its predictors for all. This study finds that double sampling yields only modest reductions in financial costs when a statistical precision constraint is imposed, across a wide range of realistic empirical settings. There are circumstances in which the gains can be more substantial, but these are the exception rather than the rule. The recommendation is to rely on real data whenever there is a need for new data and to use prediction estimators to leverage existing data.
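The double-sampling idea can be sketched in a few lines: fit the cheap predictor to the expensive outcome on the subsample, then predict for everyone. This is a hypothetical toy simulation (the sample sizes, correlation, and variable names are assumptions, not the paper's data); the variance of the regression estimator shrinks only by a factor of roughly (1 − ρ²) relative to the subsample mean, which is one intuition for why the gains are modest unless the predictor is very strong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: x (cheap predictor) observed for all N households,
# y (expensive outcome, e.g. consumption) only for a subsample of n.
N, n, rho = 10_000, 1_000, 0.7          # rho = corr(x, y), an assumption
x = rng.normal(size=N)
y = 2.0 + rho * x + np.sqrt(1 - rho**2) * rng.normal(size=N)

sub = rng.choice(N, size=n, replace=False)
xs, ys = x[sub], y[sub]

# Double-sampling (regression) estimator: fit y ~ x on the subsample,
# then average the predictions over the full sample.
b1, b0 = np.polyfit(xs, ys, 1)          # slope, intercept
mu_ds = float(np.mean(b0 + b1 * x))

# Plain subsample mean for comparison; the regression estimator's variance
# is smaller by roughly a factor of (1 - rho**2) ≈ 0.51 here.
mu_sub = float(np.mean(ys))
print(mu_ds, mu_sub)
```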

2021 ◽  
Author(s):  
Yiliang Zhang ◽  
Youshu Cheng ◽  
Yixuan Ye ◽  
Wei Jiang ◽  
Qiongshi Lu ◽  
...  

Abstract: With the increasing accessibility of individual-level data from genome-wide association studies, it is now common for researchers to have individual-level data for some traits in one specific population. For some traits, only publicly released summary-level data can be accessed, due to privacy and safety concerns. Current methods for estimating genetic correlation can only be applied when the input data for the two traits of interest are either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they must first transform the individual-level data to summary-level data and then apply summary-data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimates than summary-data-based methods. Moreover, when individual-level data are available for both traits, GENJI achieves performance comparable to individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation improves the predictive accuracy of cross-population polygenic risk scores.
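The estimation problem that motivates methods like GENJI can be illustrated with a naive summary-statistics estimator of genetic correlation. The sketch below is not GENJI's estimator; it is a hypothetical method-of-moments toy with assumed effect sizes and standard errors, showing why sampling noise must be subtracted out of the variance terms before taking the ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50_000                        # number of SNPs (illustrative)
rg_true = 0.5                     # assumed true genetic correlation

# Simulate per-SNP true effects for two traits with correlation rg_true,
# then add independent estimation noise (as in GWAS effect estimates).
cov = [[1.0, rg_true], [rg_true, 1.0]]
beta = rng.multivariate_normal([0.0, 0.0], cov, size=m)
se = 0.5                          # assumed per-SNP standard error
bhat = beta + se * rng.normal(size=(m, 2))

# Method-of-moments estimate: the cross-trait covariance of the estimates
# is unbiased for the genetic covariance (independent errors), while each
# variance must be corrected for sampling noise (se**2) to avoid attenuation.
g_cov = np.mean(bhat[:, 0] * bhat[:, 1])
g_var = bhat.var(axis=0) - se**2
rg_hat = float(g_cov / np.sqrt(g_var[0] * g_var[1]))
print(rg_hat)
```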


2015 ◽  
Vol 42 (4) ◽  
pp. 328-336 ◽  
Author(s):  
Robert Nee ◽  
Deepti S. Moon ◽  
Rahul M. Jindal ◽  
Frank P. Hurst ◽  
Christina M. Yuan ◽  
...  

Background: The impact of socioeconomic factors on arteriovenous fistula (AVF) creation in hemodialysis (HD) patients is not well understood. We assessed the association of area- and individual-level indicators of poverty and health care insurance with AVF use among incident end-stage renal disease (ESRD) patients initiated on HD. Methods: In this retrospective cohort study using the United States Renal Data System database, we identified 669,206 patients initiated on maintenance HD from January 1, 2007 through December 31, 2012. We assessed Medicare-Medicaid dual-eligibility status as an indicator of individual-level poverty and ZIP code-level median household income (MHI) data obtained from the 2010 United States Census. We conducted logistic regression with AVF use at the start of dialysis as the outcome variable. Results: The proportions of dual-eligible and non-dual-eligible patients who initiated HD with an AVF were 12.53% and 16.17%, respectively (p < 0.001). Dual eligibility was associated with a significantly lower likelihood of AVF use upon initiation of HD (adjusted odds ratio (aOR) 0.91; 95% CI 0.90-0.93). Patients in the lowest area-level MHI quintile had an aOR of 0.97 (95% CI 0.95-0.99) compared to those in higher quintiles. However, dual eligibility and area-level MHI were not significant in patients with Veterans Affairs (VA) coverage. Conclusions: Individual- and area-level measures of poverty were independently associated with a lower likelihood of AVF use at the start of HD, the only exception being patients with VA health care benefits. Efforts to improve incident AVF use may require focusing on pre-ESRD care to be successful.
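For intuition, the unadjusted odds ratio implied by the reported proportions can be recovered by hand; note that it differs from the paper's aOR of 0.91, which adjusts for covariates.

```python
# Unadjusted odds ratio implied by the reported AVF proportions:
# 12.53% for dual-eligible vs 16.17% for non-dual-eligible patients.
p_dual, p_non = 0.1253, 0.1617
odds_dual = p_dual / (1 - p_dual)
odds_non = p_non / (1 - p_non)
or_unadj = odds_dual / odds_non
print(round(or_unadj, 2))  # → 0.74, farther from 1 than the adjusted 0.91
```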


2016 ◽  
Author(s):  
Xiang Zhu ◽  
Matthew Stephens

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations, RSS performs similarly to analyses using the individual-level data, both for estimating heritability and for detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
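The likelihood described above can be written down and evaluated directly. The sketch below assumes the form β̂ ~ N(S R S⁻¹ β, S R S), with S a diagonal matrix of the univariate standard errors and R the SNP correlation (LD) matrix, matching the abstract's description; the toy numbers are illustrative, not from the height GWAS.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the RSS likelihood (assumed form): betahat ~ N(S R S^-1 beta, S R S),
# where betahat are univariate GWAS estimates, S = diag(standard errors),
# and R is the SNP correlation (LD) matrix.
def rss_loglik(beta, betahat, se, R):
    S = np.diag(se)
    Sinv = np.diag(1.0 / se)
    mean = S @ R @ Sinv @ beta
    cov = S @ R @ S
    return multivariate_normal.logpdf(betahat, mean=mean, cov=cov)

# Toy check with 3 SNPs in mild LD.
R = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
se = np.array([0.05, 0.05, 0.05])
betahat = np.array([0.11, 0.04, 0.01])

ll_near = rss_loglik(np.array([0.1, 0.0, 0.0]), betahat, se, R)
ll_far = rss_loglik(np.array([1.0, 1.0, 1.0]), betahat, se, R)
print(ll_near > ll_far)  # coefficients consistent with betahat score higher
```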


2021 ◽  
Author(s):  
Arnor Ingi Sigurdsson ◽  
David Westergaard ◽  
Ole Winther ◽  
Ole Lund ◽  
Søren Brunak ◽  
...  

Polygenic risk scores (PRSs) are expected to play a critical role in achieving precision medicine. PRS predictors are generally based on linear models using summary statistics and, more recently, individual-level data. However, these predictors generally capture only additive relationships and are limited in the types of data they can use. Here, we develop a deep learning framework (EIR) for PRS prediction which includes a model, genome-local-net (GLN), specifically designed for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data, and model explainability. GLN outperforms LASSO for a wide range of diseases, particularly autoimmune diseases, which have been studied for interaction effects. We showcase the flexibility of the framework by training one MT model to predict 338 diseases simultaneously. Furthermore, we find that incorporating measurement data improves PRS performance for virtually all (93%) diseases considered (ROC-AUC improvement up to 0.36) and that including genotype data provides better model calibration compared to measurements alone. We use the framework to analyse what our models learn and find that they learn both relevant disease variants and clinical measurements. EIR is open source and available at https://github.com/arnor-sigurdsson/EIR.


Biometrika ◽  
2021 ◽  
Author(s):  
Rui Duan ◽  
Yang Ning ◽  
Yong Chen

Abstract: In multicentre research, individual-level data are often protected against sharing across sites. To overcome this barrier to data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed. The existing distributed algorithms usually assume the data are homogeneously distributed across sites. This assumption ignores the important fact that the data collected at different sites may come from various subpopulations and environments, which can lead to heterogeneity in the distribution of the data. Ignoring this heterogeneity may lead to erroneous statistical inference. In this paper, we propose distributed algorithms which account for the heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach (Wang et al., 2017; Jordan et al., 2018) to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms maintain the same communication cost as existing communication-efficient algorithms. We establish a non-asymptotic risk bound for the proposed distributed estimator and derive its limiting distribution in the two-index asymptotic setting, which allows both the sample size per site and the number of sites to go to infinity. In addition, we show that the asymptotic variance of the estimator attains the Cramér-Rao lower bound when the number of sites grows at a slower rate than the sample size at each site. Finally, we use simulation studies and a real data application to demonstrate the validity and feasibility of the proposed methods.
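The "share only aggregates" principle, with a site-specific nuisance parameter handled locally, can be illustrated in a much simpler setting than the paper's surrogate likelihood: a common slope with site-specific intercepts, where each site transmits only two scalars and no individual records. This is a hypothetical sketch of the communication pattern, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# K sites share a common slope but each has its own intercept (a nuisance
# parameter). Centering within site profiles out the intercept, so each
# site needs to share only two scalar aggregates.
K, n, slope = 5, 500, 1.5
summaries = []
for k in range(K):
    a_k = rng.normal()                       # site-specific intercept
    x = rng.normal(size=n)
    y = a_k + slope * x + 0.5 * rng.normal(size=n)
    xc, yc = x - x.mean(), y - y.mean()      # local centering
    summaries.append((np.sum(xc * yc), np.sum(xc * xc)))

# Central analysis combines the aggregates into a pooled slope estimate.
sxy = sum(s for s, _ in summaries)
sxx = sum(t for _, t in summaries)
slope_hat = float(sxy / sxx)
print(slope_hat)
```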


Author(s):  
Michael M. Bechtel ◽  
Lukas Schmid

Abstract: Voters tend to be richer, more conservative, and more educated than non-voters. While many electoral reforms promise to increase political participation, these policy instruments may have multidimensional and differential effects that can increase or decrease the representativeness of turnout. We develop an approach that allows us to estimate these effects and assess the impact of postal voting on representational inequality in Swiss referendums using individual-level (N = 79,000) and aggregate-level data from 1981 to 2009. We find that postal voting mobilizes equally across a wide range of political and sociodemographic groups but more strongly activates high earners, those with medium education levels, and less politically interested individuals. Yet, those who vote are not less politically knowledgeable and the effects on the composition of turnout remain limited. Our results inform research on the consequences of electoral reforms meant to increase political participation in large electorates.


Epigenomics ◽  
2021 ◽  
Author(s):  
Samantha Lent ◽  
Andres Cardenas ◽  
Sheryl L Rifas-Shiman ◽  
Patrice Perron ◽  
Luigi Bouchard ◽  
...  

Aim: We evaluated five methods for detecting differentially methylated regions (DMRs): DMRcate, comb-p, seqlm, GlobalP and dmrff. Materials & methods: We used a simulation study and real data analysis to evaluate performance. Additionally, we evaluated the use of an ancestry-matched reference cohort to estimate correlations between CpG sites in cord blood. Results: Several methods had inflated Type I error, which increased at more stringent significance levels. In power simulations with 1–2 causal CpG sites with the same direction of effect, dmrff was consistently among the most powerful methods. Conclusion: This study illustrates the need for more thorough simulation studies when evaluating novel methods. More work must be done to develop methods with well-controlled Type I error that do not require individual-level data.
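One reason region-level tests mis-calibrate is the correlation between neighbouring CpG sites, which is why a reference cohort for estimating those correlations matters. A minimal Stouffer-style combination (in the spirit of, but not identical to, the compared methods) shows how accounting for an estimated correlation matrix makes the region p-value more conservative than naive combining.

```python
import numpy as np
from scipy.stats import norm

# Combine per-CpG p-values over a region with Stouffer's method, inflating
# the variance of the summed Z-scores by the pairwise inter-CpG correlations.
# Treating correlated sites as independent (corr = I) understates the variance
# and yields anti-conservative region p-values.
def stouffer_region(pvals, corr):
    z = norm.isf(np.asarray(pvals))              # one-sided Z per CpG
    m = len(z)
    var = m + 2.0 * np.sum(np.triu(corr, k=1))   # Var(sum z) under the null
    return float(norm.sf(z.sum() / np.sqrt(var)))

corr = np.array([[1.0, 0.6, 0.4],
                 [0.6, 1.0, 0.6],
                 [0.4, 0.6, 1.0]])               # assumed CpG correlations
pvals = [0.01, 0.02, 0.05]

p_naive = stouffer_region(pvals, np.eye(3))      # ignores correlation
p_adj = stouffer_region(pvals, corr)             # accounts for it
print(p_adj > p_naive)  # the adjusted test is more conservative
```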


2019 ◽  
Author(s):  
Luke R. Lloyd-Jones ◽  
Jian Zeng ◽  
Julia Sidorenko ◽  
Loïc Yengo ◽  
Gerhard Moser ◽  
...  

Abstract: The capacity to accurately predict an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. Recently, Bayesian methods for generating polygenic predictors have been successfully applied in human genomics, but they require individual-level data, access to which is often limited by privacy or logistical concerns, and they are computationally very intensive. This has motivated methodological frameworks that utilise publicly available genome-wide association study (GWAS) summary data, which for some traits now include results from more than a million individuals. In this study, we extend the established summary-statistics methodological framework to include a class of point-normal mixture prior Bayesian regression models, which have been shown to generate optimal genetic predictions and can perform heritability estimation, variant mapping and estimation of the distribution of genetic effects. In a wide range of simulations and cross-validation using 10 real quantitative traits and 1.1 million variants on 350,000 individuals from the UK Biobank (UKB), we establish that our summary-based method, SBayesR, performs similarly to methods that use individual-level data and outperforms other state-of-the-art summary-statistics methods in terms of prediction accuracy and heritability estimation, at a fraction of the computational resources. We generate polygenic predictors for body mass index and height in two independent data sets and show that, by exploiting summary statistics on 1.1 million variants from the largest GWAS meta-analysis (n ≈ 700,000), the SBayesR prediction R2 improved on average across traits by 6.8% relative to that estimated from an individual-level data BayesR analysis of data from the UKB (n ≈ 450,000).
Compared with commonly used state-of-the-art summary-based methods, SBayesR improved the prediction R2 by 4.1% relative to LDpred and by 28.7% relative to clumping and p-value thresholding. SBayesR gave comparable prediction accuracy to the recent RSS method, which has a similar model, but at a computational time that is two orders of magnitude smaller. The methodology is implemented in an efficient and user-friendly software tool called GCTB.
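Whatever method produces the joint SNP effects, applying the resulting polygenic predictor is just a weighted allele count. A toy sketch with simulated genotypes and effects (all values assumed for illustration, not SBayesR output):

```python
import numpy as np

rng = np.random.default_rng(3)

# A polygenic score is the genotype matrix (0/1/2 allele counts) times the
# per-SNP effect sizes, e.g. posterior means from a summary-based fit.
m, n = 100, 20                                 # SNPs, individuals (toy sizes)
genotypes = rng.integers(0, 3, size=(n, m))    # simulated allele counts
effects = rng.normal(0.0, 0.05, size=m)        # simulated joint effects
prs = genotypes @ effects                      # one score per individual
print(prs.shape)  # → (20,)
```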


