Deep integrative models for large-scale human genomics

Polygenic risk scores (PRSs) are expected to play a critical role in achieving precision medicine. PRS predictors are generally based on linear models using summary statistics and, more recently, individual-level data. However, these predictors generally capture only additive relationships and are limited in the types of data they can use. Here, we develop a deep learning framework (EIR) for PRS prediction that includes a model, genome-local-net (GLN), which we specifically designed for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data, and model explainability. GLN outperforms LASSO for a wide range of diseases, particularly autoimmune diseases, which have been researched for interaction effects. We showcase the flexibility of the framework by training one MT model to predict 338 diseases simultaneously. Furthermore, we find that incorporating measurement data for PRSs improves performance for virtually all (93%) diseases considered (ROC-AUC improvement up to 0.36) and that including genotype data provides better model calibration compared to measurements alone. We use the framework to analyse what our models learn and find that they learn both relevant disease variants and clinical measurements. EIR is open source and available at https://github.com/arnor-sigurdsson/EIR.
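
To make the "additive" baseline concrete, a conventional linear PRS can be sketched as a weighted sum of allele counts. This is a minimal illustration only and is not taken from EIR or GLN: the function name `additive_prs`, the toy genotype matrix, and the effect sizes are all hypothetical.

```python
import numpy as np


def additive_prs(genotypes, effect_sizes):
    """Return one score per individual as a weighted sum of allele counts.

    genotypes: (n_individuals, n_variants) array of allele counts (0, 1, 2).
    effect_sizes: (n_variants,) array of per-variant weights.
    Both inputs are hypothetical toy data, not values from the paper.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    # Purely additive: each variant contributes independently of the others,
    # so no interaction between variants can be represented.
    return genotypes @ effect_sizes


# Two individuals typed at three variants (toy values).
genotypes = np.array([[0, 1, 2],
                      [2, 0, 1]])
effect_sizes = np.array([0.10, -0.05, 0.20])
scores = additive_prs(genotypes, effect_sizes)
```

Because the score is a single matrix-vector product, any interaction between variants is invisible to it, which is the limitation the abstract attributes to linear predictors.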

Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

10.1101/042457 ◽

2016 ◽

Cited By ~ 5

Author(s):

Xiang Zhu ◽

Matthew Stephens

Keyword(s):

Multiple Regression ◽

Large Scale ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Individual Level ◽

Genome Wide ◽

Level Data ◽

Wide Range

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations, RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
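
The link between multiple-regression coefficients and univariate results can be illustrated with a standard least-squares identity. The sketch below is not the RSS likelihood itself and does not use the `rss` software. It only shows, on simulated data with assumed sizes and effect values, that for standardized covariates the univariate (marginal) estimates equal the sample correlation matrix times the joint estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4  # hypothetical sample size and number of SNP-like covariates

# Simulate correlated covariates, then standardize each column
# (mean 0, variance 1 with ddof=0, so each column has x'x = n).
X = rng.normal(size=(n, p))
X[:, 1] += 0.6 * X[:, 0]
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Simulated phenotype with assumed true effects, then centered.
y = X @ np.array([0.5, 0.0, -0.3, 0.0]) + rng.normal(size=n)
y = y - y.mean()

R = X.T @ X / n                             # sample correlation matrix
marginal = X.T @ y / n                      # one univariate regression per column
joint = np.linalg.solve(X.T @ X, X.T @ y)   # multiple-regression estimate

# Identity: marginal estimates = R times joint estimates.
identity_holds = np.allclose(marginal, R @ joint)
```

In practice the correlation matrix would come from a reference panel rather than the study sample, as the abstract notes, so the relationship holds only approximately and has to be modeled with noise.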

Integrative Data Analysis from a Unifying Research Synthesis Perspective

10.1093/oso/9780190676001.003.0020 ◽

2018 ◽

Author(s):

Eun-Young Mun ◽

Anne E. Ray

Keyword(s):

Data Analysis ◽

Large Scale ◽

Research Synthesis ◽

Alcohol Intervention ◽

Data Set ◽

Integrative Data Analysis ◽

Level Data ◽

Model Complex ◽

Wide Range ◽

Individual Participant

Integrative data analysis (IDA) is a promising new approach in psychological research and has been well received in the field of alcohol research. This chapter provides a larger unifying research synthesis framework for IDA. Major advantages of IDA of individual participant-level data include better and more flexible ways to examine subgroups, model complex relationships, deal with methodological and clinical heterogeneity, and examine infrequently occurring behaviors. However, between-study heterogeneity in measures, designs, and samples and systematic study-level missing data are significant barriers to IDA and, more broadly, to large-scale research synthesis. Based on the authors’ experience working on the Project INTEGRATE data set, which combined individual participant-level data from 24 independent college brief alcohol intervention studies, it is also recognized that IDA investigations require a wide range of expertise and considerable resources and that some minimum standards for reporting IDA studies may be needed to improve transparency and quality of evidence.

Trust in scientists in times of pandemic: Panel evidence from 12 countries

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2108576118 ◽

2021 ◽

Vol 118 (40) ◽

pp. e2108576118

Author(s):

Yann Algan ◽

Daniel Cohen ◽

Eva Davoine ◽

Martial Foucault ◽

Stefanie Stantcheva

Keyword(s):

Large Scale ◽

Social Trust ◽

Critical Role ◽

Individual Level ◽

Individual Support ◽

Compliant Behavior ◽

Paradoxical Effects ◽

The Government ◽

Nonpharmaceutical Interventions

This article analyzes the specific and critical role of trust in scientists on both the support for and compliance with nonpharmaceutical interventions (NPIs) during the COVID-19 pandemic. We exploit large-scale, longitudinal, and representative surveys for 12 countries over the period from March to December 2020, and we complement the analysis with experimental data. We find that trust in scientists is the key driving force behind individual support for and compliance with NPIs and for favorable attitudes toward vaccination. The effect of trust in government is more ambiguous and tends to diminish support for and compliance with NPIs in countries where the recommendations from scientists and the government were not aligned. Trust in others also has seemingly paradoxical effects: in countries where social trust is high, the support for NPIs is low due to higher expectations that others will voluntarily social distance. Our individual-level longitudinal data also allow us to evaluate the effects of within-person changes in trust over the pandemic: we show that trust levels and, in particular, trust in scientists have changed dramatically for individuals and within countries, with important subsequent effects on compliant behavior and support for NPIs. Such findings point out the challenging but critical need to maintain trust in scientists during a lasting pandemic that strains citizens and governments.

Is Predicted Data a Viable Alternative to Real Data?

The World Bank Economic Review ◽

10.1093/wber/lhz007 ◽

2019 ◽

Vol 34 (2) ◽

pp. 485-508

Author(s):

Tomoki Fujii ◽

Roy van der Weide

Keyword(s):

Financial Burden ◽

Real Data ◽

Outcome Variable ◽

Double Sampling ◽

Individual Level ◽

Financial Costs ◽

Level Data ◽

Statistical Precision ◽

Wide Range ◽

Poverty And Health

It is costly to collect the household- and individual-level data that underlie official estimates of poverty and health. For this reason, developing countries often do not have the budget to update estimates of poverty and health regularly, even though these estimates are most needed there. One way to reduce the financial burden is to substitute some of the real data with predicted data by means of double sampling, where the expensive outcome variable is collected for a subsample and its predictors for all. This study finds that double sampling yields only modest reductions in financial costs when imposing a statistical precision constraint in a wide range of realistic empirical settings. There are circumstances in which the gains can be more substantial, but these denote the exception rather than the rule. The recommendation is to rely on real data whenever there is a need for new data and to use prediction estimators to leverage existing data.

Apples and Oranges? The Problem of Equivalence in Comparative Research

Political Analysis ◽

10.1093/pan/mpr028 ◽

2011 ◽

Vol 19 (4) ◽

pp. 471-487 ◽

Cited By ~ 44

Author(s):

Daniel Stegmueller

Keyword(s):

Item Response ◽

Comparative Research ◽

Large Scale ◽

Structural Parameters ◽

Simultaneous Estimation ◽

Response Behavior ◽

Item Response Model ◽

Individual Level ◽

Level Data ◽

Cross National

Researchers in comparative research are increasingly relying on individual-level data to test theories involving unobservable constructs like attitudes and preferences. Estimation is carried out using large-scale cross-national survey data providing responses from individuals living in widely varying contexts. This strategy rests on the assumption of equivalence, that is, that no systematic distortion exists in the response behavior of individuals from different countries. However, this assumption is frequently violated, with rather grave consequences for comparability and interpretation. I present a multilevel mixture ordinal item response model with item bias effects that is able to establish equivalence. It corrects for systematic measurement error induced by unobserved country heterogeneity, and it allows for the simultaneous estimation of structural parameters of interest.

Spontaneous Collective Action: Peripheral Mobilization During the Arab Spring

American Political Science Review ◽

10.1017/s0003055416000769 ◽

2017 ◽

Vol 111 (2) ◽

pp. 379-403 ◽

Cited By ~ 63

Author(s):

ZACHARY C. STEINERT-THRELKELD

Keyword(s):

Social Network ◽

Collective Action ◽

Information Diffusion ◽

Large Scale ◽

Authoritarian Regimes ◽

Individual Level ◽

The Core ◽

Wide Range ◽

Models Of Disease ◽

The Arab Spring

Who is responsible for protest mobilization? Models of disease and information diffusion suggest that those central to a social network (the core) should have a greater ability to mobilize others than those who are less well-connected. To the contrary, this article argues that those not central to a network (the periphery) can generate collective action, especially in the context of large-scale protests in authoritarian regimes. To show that those in the core of a social network have no effect on levels of protest, this article develops a dataset of daily protests across 16 countries in the Middle East and North Africa over 14 months from 2010 through 2011. It combines that dataset with geocoded, individual-level communication from the same period and measures the number of connections of each person. Those on the periphery are shown to be responsible for changing levels of protest, with some evidence suggesting that the core’s mobilization efforts lead to fewer protests. These results have implications for a wide range of social choices that rely on interdependent decision making.

Four dimensions characterize comprehensive trait judgments of faces

10.31234/osf.io/87nex ◽

2019 ◽

Cited By ~ 1

Author(s):

Chujun Lin ◽

Umit Keles ◽

Ralph Adolphs

Keyword(s):

Large Scale ◽

Real Life ◽

Three Dimensions ◽

Prior Work ◽

Four Dimensions ◽

Individual Level ◽

Level Data ◽

Trait Words ◽

Psychological Dimensions

People readily attribute many traits to faces: some look beautiful, some competent, some aggressive [1]. These snap judgments have important consequences in real life, ranging from success in political elections to decisions in courtroom sentencing [2,3]. Modern psychological theories argue that the hundreds of different words people use to describe others from their faces are well captured by only two or three dimensions, such as valence and dominance [4], a highly influential framework that has been the basis for numerous studies in social and developmental psychology [5–10], social neuroscience [11,12], and in engineering applications [13,14]. However, all prior work has used only a small number of words (12 to 18) to derive underlying dimensions, limiting conclusions to date. Here we employed deep neural networks to select a comprehensive set of 100 words that are representative of the trait words people use to describe faces, and to select a set of 100 faces. In two large-scale, preregistered studies we asked participants to rate the 100 faces on the 100 words (obtaining 2,850,000 ratings from 1,710 participants), and discovered a novel set of four psychological dimensions that best explain trait judgments of faces: warmth, competence, femininity, and youth. We reproduced these four dimensions across different regions around the world, in both aggregated and individual-level data. These results provide a new and most comprehensive characterization of face judgments, and reconcile prior work on face perception with work in social cognition [15] and personality psychology [16].

Estimating genetic correlation jointly using individual-level and summary-level GWAS data

10.1101/2021.08.18.456908 ◽

2021 ◽

Author(s):

Yiliang Zhang ◽

Youshu Cheng ◽

Yixuan Ye ◽

Wei Jiang ◽

Qiongshi Lu ◽

...

Keyword(s):

Genetic Correlation ◽

Association Studies ◽

Real Data ◽

Efficient Estimation ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Individual Level ◽

Correlation Estimation ◽

Level Data ◽

Summary Data

With the increasing accessibility of individual-level data from genome-wide association studies, it is now common for researchers to have individual-level data of some traits in one specific population. For some traits, we can only access publicly released summary-level data due to privacy and safety concerns. The current methods to estimate genetic correlation can only be applied when the input data type of the two traits of interest is either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimation than summary data-based methods. In addition, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation improves the predictive performance of cross-population polygenic risk scores.

Impact of increasing vegetarian availability on meal selection and sales in cafeterias

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1907207116 ◽

2019 ◽

Vol 116 (42) ◽

pp. 20923-20929 ◽

Cited By ~ 19

Author(s):

Emma E. Garnett ◽

Andrew Balmford ◽

Chris Sandbrook ◽

Mark A. Pilling ◽

Theresa M. Marteau

Keyword(s):

Large Scale ◽

Population Level ◽

Field Studies ◽

Individual Level ◽

Rebound Effects ◽

Percentage Points ◽

Physical Environments ◽

Meat Meal ◽

Level Data ◽

The Impact

Shifting people in higher income countries toward more plant-based diets would protect the natural environment and improve population health. Research in other domains suggests altering the physical environments in which people make decisions (“nudging”) holds promise for achieving socially desirable behavior change. Here, we examine the impact of attempting to nudge meal selection by increasing the proportion of vegetarian meals offered in a year-long large-scale series of observational and experimental field studies. Anonymized individual-level data from 94,644 meals purchased in 2017 were collected from 3 cafeterias at an English university. Doubling the proportion of vegetarian meals available from 25 to 50% (e.g., from 1 in 4 to 2 in 4 options) increased vegetarian meal sales (and decreased meat meal sales) by 14.9 and 14.5 percentage points in the observational study (2 cafeterias) and by 7.8 percentage points in the experimental study (1 cafeteria), equivalent to proportional increases in vegetarian meal sales of 61.8%, 78.8%, and 40.8%, respectively. Linking sales data to participants’ previous meal purchases revealed that the largest effects were found in the quartile of diners with the lowest prior levels of vegetarian meal selection. Moreover, serving more vegetarian options had little impact on overall sales and did not lead to detectable rebound effects: Vegetarian sales were not lower at other mealtimes. These results provide robust evidence to support the potential for simple changes to catering practices to make an important contribution to achieving more sustainable diets at the population level.
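
The abstract reports each effect twice, as a percentage-point change and as a proportional increase. As a rough check on how the two relate, the sketch below back-calculates the baseline vegetarian share implied by each pair of figures. The baselines are derived here and are not reported in the abstract, and the labels `observational A`, `observational B`, and `experimental` are hypothetical names for the three cafeterias.

```python
def implied_baseline(pp_increase, proportional_increase_pct):
    """Baseline share (in %) implied by a percentage-point increase
    and the matching proportional increase (both given in percent units).

    proportional increase = pp_increase / baseline, so
    baseline = pp_increase / (proportional_increase_pct / 100).
    """
    return pp_increase / (proportional_increase_pct / 100.0)


# (percentage-point increase, proportional increase in %) from the abstract.
reported = {
    "observational A": (14.9, 61.8),
    "observational B": (14.5, 78.8),
    "experimental": (7.8, 40.8),
}

baselines = {
    label: implied_baseline(pp, prop) for label, (pp, prop) in reported.items()
}

for label, share in baselines.items():
    print(f"{label}: implied baseline vegetarian share of about {share:.1f}%")
```

The same percentage-point gain corresponds to a larger proportional increase when the starting share is lower, which is why the 14.5-point change maps to a bigger proportional figure than the 14.9-point change.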

The Impact of Natural Disasters on Dietary Intake

American Journal of Health Behavior ◽

10.5993/ajhb.44.1.4 ◽

2020 ◽

Vol 44 (1) ◽

pp. 26-39

Author(s):

Mengmeng Ji ◽

Ruopeng An ◽

Yingjie Qiu ◽

Chenghua Guan

Keyword(s):

Natural Disasters ◽

Linear Models ◽

Vegetable Consumption ◽

Behavioral Risk ◽

Individual Level ◽

Behavioral Risk Factor Surveillance ◽

Level Data ◽

Consumption Frequency ◽

Advance Research ◽

The Impact

Objectives: In this study, we explored the potential impact of disasters on individuals' fruit and vegetable consumption. Methods: Individual-level data (N = 351,229) from the Behavioral Risk Factor Surveillance System (BRFSS) 2011 survey were merged with county-level disaster declaration data from the Federal Emergency Management Agency (FEMA) based on disaster duration, interview month, and residential county. Multilevel mixed-effects generalized linear models were fitted to examine the impact of different types of disasters on self-reported daily consumption frequencies of fruit, 100% pure fruit juice, beans, green vegetables, orange vegetables, other vegetables, and vegetables overall, adjusting for individual covariates. Results: No associations between disasters and daily fruit and overall vegetable consumption frequency were identified at either national or state levels. Only floods were consistently associated with reduced consumption of orange vegetables. Conclusions: This study did not identify an association between natural disasters and daily overall fruit/vegetable consumption frequency at national or state levels, whereas disasters were found to slightly alter the consumption of one vegetable subgroup (orange vegetables). Longitudinal studies with validated and detailed measures of diet and disaster are warranted to advance research in this field.
