Efficient approaches for large scale GWAS studies with genotype uncertainty

Mapping Intimacies ◽

10.1101/786384 ◽

2019 ◽

Author(s):

Emil Jørsboe ◽

Anders Albrechtsen

Keyword(s):

Population Structure ◽

Allele Frequency ◽

Statistical Power ◽

Large Scale ◽

Association Studies ◽

Genetic Data ◽

Data Sets ◽

Sequencing Data ◽

The Individual ◽

Genotype Probabilities

Abstract

Introduction: Association studies using genetic data from SNP-chip based imputation or low depth sequencing data provide a cost efficient design for large scale studies. However, these approaches provide genetic data with uncertainty of the observed genotypes. Here we explore association methods that can be applied to data where the genotype is not directly observed. We investigate how using different priors when estimating genotype probabilities affects the association results in different scenarios such as studies with population structure and varying depth sequencing data. We also suggest a method (ANGSD-asso) that is computationally feasible for analysing large scale low depth sequencing data sets, such as those generated by non-invasive prenatal testing (NIPT) with low-pass sequencing.

Methods: ANGSD-asso’s EM model works by modelling the unobserved genotype as a latent variable in a generalised linear model framework. The software is implemented in C/C++ and can be run multi-threaded, enabling the analysis of big data sets. ANGSD-asso is based on genotype probabilities, which can be estimated in various ways, such as using the sample allele frequency as a prior, using the individual allele frequencies as a prior or using haplotype frequencies from haplotype imputation. Using simulations of sequencing data we explore how genotype probability based methods compare to using genetic dosages in large association studies with genotype uncertainty.

Results & Discussion: Our simulations show that in a structured population using the individual allele frequency prior has better power than the sample allele frequency. If there is a correlation between genotype uncertainty and phenotype, then the individual allele frequency prior also helps control the false positive rate. In the absence of population structure the sample allele frequency prior and the individual allele frequency prior perform similarly. In scenarios with sequencing depth and phenotype correlation ANGSD-asso’s EM model has better statistical power and less bias compared to using dosages. Lastly, when adding additional covariates to the linear model, ANGSD-asso’s EM model has more statistical power and provides less biased effect sizes than other methods that accommodate genotype uncertainty, while also being much faster. This makes it possible to properly account for genotype uncertainty in large scale association studies.
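The role of the prior in the abstract above can be illustrated with a short sketch. It assumes genotype likelihoods are already available per individual and that the prior follows Hardy-Weinberg proportions for a given allele frequency; the function names `genotype_posteriors` and `dosage` are hypothetical and are not taken from ANGSD-asso. Passing one frequency for the whole sample corresponds to a sample allele frequency prior, while passing one frequency per individual corresponds to an individual allele frequency prior.

```python
import numpy as np

def genotype_posteriors(gl, freq):
    """Posterior genotype probabilities from genotype likelihoods.

    gl   : (n, 3) array, P(data | g) for g = 0, 1, 2 copies of the allele.
    freq : scalar (sample allele frequency) or (n,) array (one frequency
           per individual). Hardy-Weinberg proportions are an assumption
           of this sketch.
    """
    gl = np.asarray(gl, dtype=float)
    f = np.broadcast_to(np.asarray(freq, dtype=float), (gl.shape[0],))
    # Hardy-Weinberg prior for g = 0, 1, 2
    prior = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=1)
    post = gl * prior
    return post / post.sum(axis=1, keepdims=True)

def dosage(post):
    """Expected allele count under the posterior: P(g=1) + 2 * P(g=2)."""
    return np.asarray(post, dtype=float) @ np.array([0.0, 1.0, 2.0])
```

With uninformative likelihoods the posterior collapses to the prior, so the dosage is simply twice the allele frequency. This is one way to see why the choice of prior matters most when sequencing depth is low.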

Efficient approaches for large-scale GWAS with genotype uncertainty

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab385 ◽

2021 ◽

Author(s):

Emil Jørsboe ◽

Anders Albrechtsen

Keyword(s):

Linear Model ◽

Allele Frequency ◽

Latent Variable ◽

Statistical Power ◽

Large Scale ◽

Association Studies ◽

Genetic Data ◽

Model Framework ◽

The Individual ◽

Genotype Probabilities

Abstract Association studies using genetic data from SNP-chip-based imputation or low-depth sequencing data provide a cost-efficient design for large-scale association studies. We explore methods for performing association studies applicable to such genetic data and investigate how using different priors when estimating genotype probabilities affects the association results. Our proposed method, ANGSD-asso’s latent model, models the unobserved genotype as a latent variable in a generalized linear model framework. The software is implemented in C/C++ and can be run multi-threaded. ANGSD-asso is based on genotype probabilities, which can be estimated using either the sample allele frequency or the individual allele frequencies as a prior. We explore through simulations how genotype probability-based methods compare with using genetic dosages. Our simulations show that in a structured population using the individual allele frequency prior has better power than the sample allele frequency. In scenarios with sequencing depth and phenotype correlation ANGSD-asso’s latent model has higher statistical power and less bias than using dosages. When additional covariates are added to the linear model, ANGSD-asso’s latent model has higher statistical power and less bias than other methods that accommodate genotype uncertainty, while also being much faster. This is shown with imputed data from UK Biobank and simulations.
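The idea of treating the genotype as a latent variable can be sketched with a small EM loop. This is not ANGSD-asso's implementation. It assumes the simplest case of a quantitative trait with Gaussian noise, no covariates and a single variant, and the function name `em_latent_genotype` is hypothetical. The E-step reweights each individual's genotype probabilities by how well each genotype explains the phenotype, and the M-step is a weighted least-squares fit over all (individual, genotype) pairs.

```python
import numpy as np

def em_latent_genotype(y, post, n_iter=50):
    """EM for y_i ~ Normal(a + b * g_i, sigma^2) with g_i unobserved.

    y    : (n,) phenotype values.
    post : (n, 3) genotype probabilities for g = 0, 1, 2, used as the
           per-individual prior on the latent genotype.
    Returns (a, b, sigma): intercept, effect size, residual SD.
    """
    y = np.asarray(y, dtype=float)
    post = np.asarray(post, dtype=float)
    g = np.array([0.0, 1.0, 2.0])

    # Initialise from an ordinary dosage regression.
    d = post @ g
    X = np.column_stack([np.ones_like(d), d])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma = np.std(y - a - b * d) + 1e-8

    for _ in range(n_iter):
        # E-step: w[i, k] proportional to post[i, k] * N(y_i | a + b*g_k, sigma^2)
        resid = y[:, None] - (a + b * g[None, :])
        logw = np.log(post + 1e-300) - 0.5 * (resid / sigma) ** 2
        logw -= logw.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)

        # M-step: weighted least squares over (individual, genotype) pairs.
        sw = w.sum()
        mg = (w * g[None, :]).sum() / sw
        my = (w * y[:, None]).sum() / sw
        cov = (w * (g[None, :] - mg) * (y[:, None] - my)).sum()
        var = (w * (g[None, :] - mg) ** 2).sum()
        b = cov / var
        a = my - b * mg
        resid = y[:, None] - (a + b * g[None, :])
        sigma = np.sqrt((w * resid ** 2).sum() / sw)
    return a, b, sigma
```

When the genotype probabilities are certain (one-hot rows), the weights never move and the estimate coincides with ordinary least squares on the true genotypes. The sketch does not guard against degenerate inputs such as a variant with no genotype variation.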

Assessing Study Reproducibility through M2RI: A Novel Approach for Large-scale High-throughput Association Studies

10.1101/2020.08.18.253740 ◽

2020 ◽

Author(s):

Zeyu Jiao ◽

Yinglei Lai ◽

Jujiao Kang ◽

Weikang Gong ◽

Liang Ma ◽

...

Keyword(s):

Sample Size ◽

Rna Sequencing ◽

High Throughput ◽

Large Scale ◽

Association Studies ◽

Structural Mri ◽

Data Sets ◽

Sequencing Data ◽

Novel Approach ◽

Magnetic Resonance Imaging Mri

Abstract High-throughput technologies, such as magnetic resonance imaging (MRI) and DNA/RNA sequencing (DNA-seq/RNA-seq), have been increasingly used in large-scale association studies. With these technologies, important biomedical research findings have been generated. The reproducibility of these findings, especially from structural MRI (sMRI) and functional MRI (fMRI) association studies, has recently been questioned. There is an urgent demand for a reliable overall reproducibility assessment for large-scale high-throughput association studies. It is also desirable to understand the relationship between study reproducibility and sample size in an experimental design. In this study, we developed a novel approach: the mixture model reproducibility index (M2RI) for assessing study reproducibility of large-scale association studies. With M2RI, we performed study reproducibility analysis for several recent large sMRI/fMRI data sets. The advantages of our approach were clearly demonstrated, and the sample size requirements for different phenotypes were also clearly demonstrated, especially when compared to the Dice coefficient (DC). We applied M2RI to compare two MRI or RNA sequencing data sets. The reproducibility assessment results were consistent with our expectations. In summary, M2RI is a novel and useful approach for assessing study reproducibility, calculating sample sizes and evaluating the similarity between two closely related studies.
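For reference, the Dice coefficient that M2RI is compared against can be computed as follows. This is a minimal sketch for two sets of significant features (for example voxels or genes) from two studies. The function name `dice_coefficient` and the convention of returning 1.0 when both sets are empty are assumptions of this example, not details from the paper.

```python
def dice_coefficient(a, b):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty result sets count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Because the coefficient depends only on which features pass a significance threshold, it changes with the threshold and with sample size.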

CONE: Community Oriented Network Estimation Is a Versatile Framework for Inferring Population Structure in Large-Scale Sequencing Data

G3 Genes|Genome|Genetics ◽

10.1534/g3.117.300131 ◽

2017 ◽

Vol 7 (10) ◽

pp. 3359-3377 ◽

Cited By ~ 5

Author(s):

Markku O. Kuismin ◽

Jon Ahlinder ◽

Mikko J. Sillanpää

Keyword(s):

Population Structure ◽

Large Scale ◽

Sequencing Data ◽

Network Estimation

ASCOT identifies key regulators of neuronal subtype-specific splicing

Nature Communications ◽

10.1038/s41467-019-14020-5 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 2

Author(s):

Jonathan P. Ling ◽

Christopher Wilks ◽

Rone Charles ◽

Patrick J. Leavey ◽

Devlina Ghosh ◽

...

Keyword(s):

Rna Splicing ◽

Large Scale ◽

Splice Variants ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Cell Type ◽

Sequencing Data ◽

Large Scale Analysis ◽

Cell Type Specific ◽

Public Archives

Abstract Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here, we present ASCOT, a resource that uses annotation-free methods to rapidly analyze and visualize splice variants across tens of thousands of bulk and single-cell data sets in the public archive. To demonstrate the utility of ASCOT, we identify novel cell type-specific alternative exons across the nervous system and leverage ENCODE and GTEx data sets to study the unique splicing of photoreceptors. We find that PTBP1 knockdown and MSI1 and PCBP2 overexpression are sufficient to activate many photoreceptor-specific exons in HepG2 liver cancer cells. This work demonstrates how large-scale analysis of public RNA-Seq data sets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events.

Rational Political Man: A Synthesis of Economic and Social-Psychological Perspectives

American Political Science Review ◽

10.1017/s0003055400263223 ◽

1969 ◽

Vol 63 (4) ◽

pp. 1106-1119 ◽

Cited By ~ 35

Author(s):

Michael J. Shapiro

Keyword(s):

Voting Behavior ◽

Large Scale ◽

Data Gathering ◽

Original Data ◽

Data Sets ◽

Party Affiliation ◽

Theoretical Frameworks ◽

Psychological Variables ◽

The Individual ◽

Group Memberships

In recent years the welter of data accumulated on American voting behavior has been continually reanalyzed by social scientists interested in building theories of electoral choice. Most of the original data-gathering enterprises were guided by general theoretical frameworks which, for the most part, were not developed to a point where the ensuing analyses addressed themselves unambiguously to the overall conceptions by which they were guided. As a result much of our knowledge about voting behavior is in the form of generalizations about what social and psychological variables account for voting choices while we lack conceptual frameworks which systematically interrelate these generalizations and provide comprehensive and parsimonious explanation. If any one unifying conception has emerged from the original large scale studies it is that the average voter is irrational. This inference has been derived from a variety of empirical relationships coupled with varying conceptions of rationality.

The more recent reanalyses of these data sets have been characterized by a theoretical sophistication that was lacking heretofore. One of these, a theory of the calculus of voting, has applied some formal rigor to the question of the rationality of the decision to vote, selected empirical equivalents of theoretical entities from survey data on national elections, and conducted a successful test of the theory. Unlike traditional approaches to the rationality question which infer the degree of rationality from quantities of information possessed or from correlates of decisions (background, party affiliation, group memberships, etc.), this investigation conceived of rationality in terms of the kind of calculus employed by the individual in deciding among alternatives (in this case whether or not to vote).

A comparative analysis of cell-type adjustment methods for epigenome-wide association studies based on simulated and real data sets

Briefings in Bioinformatics ◽

10.1093/bib/bby068 ◽

2018 ◽

Vol 20 (6) ◽

pp. 2055-2065 ◽

Cited By ~ 1

Author(s):

Johannes Brägelmann ◽

Justo Lorenzo Bermejo

Keyword(s):

Statistical Power ◽

Type I Error ◽

Association Studies ◽

Real Data ◽

Error Rates ◽

Data Sets ◽

Type I ◽

Cell Type ◽

Type I Error Rates

Abstract Technological advances and reduced costs of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type specific, statistical methods have been developed for adjustment of cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher resulted in the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed simulation findings with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA showed the best comparable statistical power, quality of estimated methylation differences and runtime.

Comment on ‘Large-Scale Cognitive GWAS Meta-Analysis Reveals Tissue-Specific Neural Expression and Potential Nootropic Drug Targets’ by Lam et al.

Twin Research and Human Genetics ◽

10.1017/thg.2018.12 ◽

2018 ◽

Vol 21 (2) ◽

pp. 84-88 ◽

Cited By ~ 6

Author(s):

W. David Hill

Keyword(s):

Genetic Information ◽

Drug Targets ◽

Large Scale ◽

Association Studies ◽

Meta Analysis ◽

Genetic Correlations ◽

Data Sets ◽

Genome Wide Association Studies ◽

Nootropic Drug ◽

Genome Wide

Intelligence and educational attainment are strongly genetically correlated. This relationship can be exploited by Multi-Trait Analysis of GWAS (MTAG) to add power to Genome-wide Association Studies (GWAS) of intelligence. MTAG allows the user to meta-analyze GWASs of different phenotypes, based on their genetic correlations, to identify associations specific to the trait of choice. An MTAG analysis using GWAS data sets on intelligence and education was conducted by Lam et al. (2017). Lam et al. (2017) reported 70 loci that they described as ‘trait specific’ to intelligence. This article examines whether the analysis conducted by Lam et al. (2017) has resulted in genetic information about a phenotype that is more similar to education than intelligence.

Optimal Genomic Control in Large-scale Genetic Associations for Binary Diseases

10.21203/rs.3.rs-318017/v2 ◽

2021 ◽

Author(s):

Runqing Yang ◽

Yuxin Song ◽

Li Jiang ◽

Zhiyu Hao ◽

Runqing Yang

Keyword(s):

Multiple Testing ◽

Statistical Power ◽

Large Scale ◽

Association Studies ◽

Joint Analysis ◽

Genome Wide Association Studies ◽

Genetic Associations ◽

Genomic Heritability ◽

Large Scale Data ◽

Genome Wide

Abstract Complex computation and approximate solutions hinder the application of generalized linear mixed models (GLMM) to genome-wide association studies. We extended GRAMMAR to handle binary diseases by treating genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then controlled polygenic effects by regulating genomic heritability downward. Using simulations and case analyses, we showed that, when optimizing GRAMMAR, polygenic effects and genomic controls could be evaluated using fewer sampled markers, which greatly simplified GLMM-based association analysis in large-scale data. In addition, joint analysis of quantitative trait nucleotide (QTN) candidates chosen by multiple testing offered significantly improved statistical power to detect QTNs over existing methods.

Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data

10.1101/390195 ◽

2018 ◽

Cited By ~ 4

Author(s):

Janko Tackmann ◽

João Frederico Matias Rodrigues ◽

Christian von Mering

Keyword(s):

Graphical Models ◽

Large Scale ◽

Study Data ◽

Microbial Interactions ◽

Data Sets ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Data Set ◽

Seamless Integration

Abstract The recent explosion of metagenomic sequencing data opens the door towards the modeling of microbial ecosystems in unprecedented detail. In particular, co-occurrence based prediction of ecological interactions could strongly benefit from this development. However, current methods fall short on several fronts: univariate tools do not distinguish between direct and indirect interactions, resulting in excessive false positives, while approaches with better resolution are so far computationally highly limited. Furthermore, confounding variables typical for cross-study data sets are rarely addressed. We present FlashWeave, a new approach based on a flexible Probabilistic Graphical Models framework to infer highly resolved direct microbial interactions from massive heterogeneous microbial abundance data sets with seamless integration of metadata. On a variety of benchmarks, FlashWeave outperforms state-of-the-art methods by several orders of magnitude in terms of speed while generally providing increased accuracy. We apply FlashWeave to a cross-study data set of 69 818 publicly available human gut samples, resulting in one of the largest and most diverse models of microbial interactions in the human gut to date.

Visualizing spatial population structure with estimated effective migration surfaces

10.1101/011809 ◽

2014 ◽

Cited By ~ 5

Author(s):

Desislava Petkova ◽

John Novembre ◽

Matthew Stephens

Keyword(s):

Population Structure ◽

Gene Flow ◽

Genetic Similarity ◽

Large Scale ◽

Genetic Model ◽

Isolation By Distance ◽

Genetic Data ◽

Migration Rates ◽

Spatial Population ◽

Spatial Population Structure

Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" - a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of "effective migration" to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.