Heterogeneity-aware and communication-efficient distributed statistical inference

Biometrika ◽  
2021 ◽  
Author(s):  
Rui Duan ◽  
Yang Ning ◽  
Yong Chen

Abstract In multicentre research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed. The existing distributed algorithms usually assume the data are homogeneously distributed across sites. This assumption ignores the important fact that the data collected at different sites may come from various subpopulations and environments, which can lead to heterogeneity in the distribution of the data. Ignoring the heterogeneity may lead to erroneous statistical inference. In this paper, we propose distributed algorithms which account for the heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach (Wang et al., 2017; Jordan et al., 2018) to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms maintain the same communication cost as existing communication-efficient algorithms. We establish a non-asymptotic risk bound for the proposed distributed estimator and its limiting distribution in the two-index asymptotic setting, which allows both the sample size per site and the number of sites to go to infinity. In addition, we show that the asymptotic variance of the estimator attains the Cramér-Rao lower bound when the number of sites is of smaller order than the sample size at each site. Finally, we use simulation studies and a real data application to demonstrate the validity and feasibility of the proposed methods.
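
As a rough illustration of the surrogate likelihood idea that this abstract builds on, the homogeneous-case construction can be sketched as follows. The notation (K sites, local losses L_j, an initial estimate \bar\theta) is assumed here for illustration and is not taken from the paper; the site-specific nuisance parameters and the density ratio tilting that the authors add are not shown.

```latex
% Assumed notation: L_j is the average negative log-likelihood at site j,
% L = K^{-1} \sum_j L_j is the pooled loss, \bar\theta is an initial estimate.
\[
  \tilde{L}(\theta)
  \;=\;
  L_1(\theta)
  \;+\;
  \bigl\langle \nabla L(\bar\theta) - \nabla L_1(\bar\theta),\; \theta - \bar\theta \bigr\rangle ,
  \qquad
  \nabla L(\bar\theta) \;=\; \frac{1}{K} \sum_{j=1}^{K} \nabla L_j(\bar\theta).
\]
% The lead site minimises \tilde{L} using only its own data plus the
% gradient vectors \nabla L_j(\bar\theta) received from the other sites.
```

Under this sketch each site transmits a single gradient vector rather than individual-level records, which is the sense in which such methods are communication-efficient.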

2019 ◽  
Vol 34 (2) ◽  
pp. 485-508
Author(s):  
Tomoki Fujii ◽  
Roy van der Weide

Abstract It is costly to collect the household- and individual-level data that underlie official estimates of poverty and health. For this reason, developing countries often do not have the budget to update estimates of poverty and health regularly, even though these estimates are most needed there. One way to reduce the financial burden is to substitute some of the real data with predicted data by means of double sampling, where the expensive outcome variable is collected for a subsample and its predictors for all. This study finds that double sampling yields only modest reductions in financial costs when imposing a statistical precision constraint in a wide range of realistic empirical settings. There are circumstances in which the gains can be more substantial, but these denote the exception rather than the rule. The recommendation is to rely on real data whenever there is a need for new data and to use prediction estimators to leverage existing data.
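
The double sampling idea described in this abstract can be sketched in a few lines. This is a minimal illustration under assumed conditions (a single predictor, a straight-line prediction model, and a hypothetical function name), not the estimator used in the study: the expensive outcome is observed on a subsample, the cheap predictor is observed for everyone, and the mean outcome is estimated by averaging predictions over all units.

```python
def double_sampling_mean(x_all, y_sub, sub_idx):
    """Estimate the mean outcome when y is observed only on a subsample.

    x_all   : predictor values for every unit (cheap to collect)
    y_sub   : outcome values for the subsampled units (expensive to collect)
    sub_idx : positions in x_all of the subsampled units, aligned with y_sub

    Hypothetical helper: fits y = a + b * x by least squares on the
    subsample, then averages the fitted values over all units.
    """
    xs = [x_all[i] for i in sub_idx]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(y_sub) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, y_sub))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Average the predictions over the full sample, not just the subsample.
    return sum(intercept + slope * x for x in x_all) / len(x_all)


# Toy data: six units with a known predictor, outcome observed for three.
x_all = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
sub_idx = [0, 2, 5]
y_sub = [2.0 + 3.0 * x_all[i] for i in sub_idx]  # noiseless line for clarity
estimate = double_sampling_mean(x_all, y_sub, sub_idx)
```

With noisy outcomes, the precision of such an estimate depends on how well the predictor explains the outcome, which is the trade-off between cost and precision that the abstract examines.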


2021 ◽  
Author(s):  
Yiliang Zhang ◽  
Youshu Cheng ◽  
Yixuan Ye ◽  
Wei Jiang ◽  
Qiongshi Lu ◽  
...  

Abstract With the increasing accessibility of individual-level data from genome-wide association studies, it is now common for researchers to have individual-level data of some traits in one specific population. For some traits, we can only access publicly released summary-level data due to privacy and safety concerns. The current methods to estimate genetic correlation can only be applied when the input data type of the two traits of interest is either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimation than summary data-based methods. Besides, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation improves the predictability of cross-population polygenic risk scores.


Epigenomics ◽  
2021 ◽  
Author(s):  
Samantha Lent ◽  
Andres Cardenas ◽  
Sheryl L Rifas-Shiman ◽  
Patrice Perron ◽  
Luigi Bouchard ◽  
...  

Aim: We evaluated five methods for detecting differentially methylated regions (DMRs): DMRcate, comb-p, seqlm, GlobalP and dmrff. Materials & methods: We used a simulation study and real data analysis to evaluate performance. Additionally, we evaluated the use of an ancestry-matched reference cohort to estimate correlations between CpG sites in cord blood. Results: Several methods had inflated Type I error, which increased at more stringent significance levels. In power simulations with 1–2 causal CpG sites with the same direction of effect, dmrff was consistently among the most powerful methods. Conclusion: This study illustrates the need for more thorough simulation studies when evaluating novel methods. More work must be done to develop methods with well-controlled Type I error that do not require individual-level data.


2021 ◽  
Author(s):  
Hongyu Zhao ◽  
Yiliang Zhang ◽  
Youshu Cheng ◽  
Yixuan Ye ◽  
Wei Jiang ◽  
...  

Abstract With the increasing accessibility of individual-level data from genome-wide association studies, it is now common for researchers to have individual-level data of some traits in one specific population. For some traits, we can only access publicly released summary-level data due to privacy and safety concerns. The current methods to estimate genetic correlation can only be applied when the input data type of the two traits of interest is either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimation than summary data-based methods. Besides, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation improves the predictability of cross-population polygenic risk scores.


Author(s):  
Jingjing Wang ◽  
Xueying Wu ◽  
Ruoyu Wang ◽  
Dongsheng He ◽  
Dongying Li ◽  
...  

The coronavirus disease 2019 pandemic has stimulated intensive research interest in its transmission pathways and infection factors, e.g., socioeconomic and demographic characteristics, climatology, baseline health conditions or pre-existing diseases, and government policies. Meanwhile, some empirical studies suggested that built environment attributes may be associated with the transmission mechanism and infection risk of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, no review has been conducted to explore the effect of built environment characteristics on the infection risk. This research gap prevents government officials and urban planners from creating effective urban design guidelines to contain SARS-CoV-2 infections and face future pandemic challenges. This review summarizes evidence from 25 empirical studies and provides an overview of the effect of built environment on SARS-CoV-2 infection risk. Virus infection risk was positively associated with the density of commercial facilities, roads, and schools and with public transit accessibility, whereas it was negatively associated with the availability of green spaces. This review recommends several directions for future studies, namely using longitudinal research design and individual-level data, considering multilevel factors and extending to diversified geographic areas.


Foods ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 557
Author(s):  
Elena Raptou

This study investigated the relationship of behavioral factors, such as snack choices, obesity stereotypes and smoking, with adolescents’ body weight. Individual-level data for 1254 Greek youths were collected via a formal questionnaire. Snack choices seem to be gender specific, with girls showing a stronger preference for healthier snacks. Frequent consumption of high-calorie and more filling snacks was found to increase Body Mass Index (BMI) in both genders. Fruit/vegetable snacks were associated with lower body weight in females, whereas cereal/nut snacks had a negative influence on males’ BMI. The majority of participants expressed anti-fat attitudes, and more boys than girls assigned positive attributes to lean peers. The endorsement of the thin-ideal was positively associated with the BMI of both adolescent boys and girls. This study also revealed that neglecting potential endogeneity issues can lead to biased estimates of the effect of smoking. Gender may be a crucial moderator of smoking–BMI relationships. Male smokers presented a higher obesity risk, whereas female smokers were more likely to be underweight. Nutrition professionals should focus on increasing the acceptance of healthy snack options. Gender differences in the influence of weight stereotypes and smoking on BMI should be considered in order to enhance the efficacy of obesity prevention interventions.


2021 ◽  
pp. 001041402110243
Author(s):  
Carolina Plescia ◽  
Sylvia Kritzinger

Combining individual-level with event-level data across 25 European countries and three sets of European Election Studies, this study examines the effect of conflict between parties in coalition government on electoral accountability and responsibility attribution. We find that conflict increases punishment for poor economic performance precisely because it helps clarify to voters parties’ actions and responsibilities while in office. The results indicate that under conditions of conflict, the punishment is equal for all coalition partners when they share responsibility for poor economic performance. When there is no conflict within a government, the effect of poor economic evaluations on vote choice is rather low, with slightly more punishment targeted to the prime minister’s party. These findings have important implications for our understanding of electoral accountability and political representation in coalition governments.


Author(s):  
Alice R. Carter ◽  
Eleanor Sanderson ◽  
Gemma Hammerton ◽  
Rebecca C. Richmond ◽  
George Davey Smith ◽  
...  

Abstract Mediation analysis seeks to explain the pathway(s) through which an exposure affects an outcome. Traditional, non-instrumental variable methods for mediation analysis experience a number of methodological difficulties, including bias due to confounding between an exposure, mediator and outcome and measurement error. Mendelian randomisation (MR) can be used to improve causal inference for mediation analysis. We describe two approaches that can be used for conducting mediation analysis with MR: multivariable MR (MVMR) and two-step MR. We outline the approaches and provide code to demonstrate how they can be used in mediation analysis. We review issues that can affect analyses, including confounding, measurement error, weak instrument bias, interactions between exposures and mediators and analysis of multiple mediators. Description of the methods is supplemented by simulated and real data examples. Although MR relies on large sample sizes and strong assumptions, such as having strong instruments and no horizontally pleiotropic pathways, our simulations demonstrate that these methods are unaffected by confounders of the exposure or mediator and the outcome and non-differential measurement error of the exposure or mediator. Both MVMR and two-step MR can be implemented in both individual-level MR and summary data MR. MR mediation methods require different assumptions to be made, compared with non-instrumental variable mediation methods. Where these assumptions are more plausible, MR can be used to improve causal inference in mediation analysis.
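
For orientation, the effect decomposition that mediation analyses of this kind usually target can be written as below. This is a sketch in assumed notation (X exposure, M mediator, Y outcome, and the β symbols are labels chosen here), and it presumes linear effects with no exposure-mediator interaction; it is not a reproduction of the authors' formulas.

```latex
% Assumed notation: \beta_{XY} is the total effect of X on Y,
% \beta_{XM} the effect of X on M, \beta_{MY} the effect of M on Y,
% and \beta_{XY \mid M} the direct effect of X on Y given M.
\begin{align*}
  \text{two-step (product of coefficients):}\quad
    & \beta_{\text{indirect}} = \beta_{XM}\,\beta_{MY}, \\
  \text{multivariable (difference in coefficients):}\quad
    & \beta_{\text{indirect}} = \beta_{XY} - \beta_{XY \mid M}, \\
  \text{proportion mediated:}\quad
    & \beta_{\text{indirect}} \,/\, \beta_{XY}.
\end{align*}
```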


2021 ◽  
pp. 003329412110268
Author(s):  
Jaime Ballard ◽  
Adeya Richmond ◽  
Suzanne van den Hoogenhof ◽  
Lynne Borden ◽  
Daniel Francis Perkins

Background: Multilevel data can be missing at the individual level or at a nested level, such as family, classroom, or program site. Increased knowledge of higher-level missing data is necessary to develop evaluation design and statistical methods to address it. Methods: Participants included 9,514 individuals participating in 47 youth and family programs nationwide who completed multiple self-report measures before and after program participation. Data were marked as missing or not missing at the item, scale, and wave levels for both individuals and program sites. Results: Site-level missing data represented a substantial portion of missing data, ranging from 0% to 46% of missing data at pre-test and from 35% to 71% of missing data at post-test. Youth were the most likely to be missing data, although site-level data did not differ by the age of participants served. In this dataset youth had the most surveys to complete, so their missing data could be due to survey fatigue. Conclusions: Much of the missing data for individuals can be explained by the site not administering those questions or scales. These results suggest a need for statistical methods that account for site-level missing data, and for research design methods to reduce the prevalence of site-level missing data or reduce its impact. Researchers can generate buy-in with sites during the community collaboration stage, assessing problematic items for revision or removal and the need for ongoing site support, particularly at post-test. We recommend that researchers working with multilevel data report the amount and mechanism of missing data at each level.

