Optimizing the identification of causal variants across varying genetic architectures in crops

Abstract

Background: Association studies use statistical links between genetic markers and variation in a phenotype’s value across many individuals to identify genes controlling variation in the target phenotype. However, this approach, particularly when conducted on a genome-wide scale (GWAS), has limited power to identify the genes responsible for variation in traits controlled by complex genetic architectures.

Results: Here we employ simulation studies utilizing real-world genotype datasets from association populations in four species with distinct minor allele frequency distributions, population structures, and patterns of linkage disequilibrium to evaluate the impact of variation in both heritability and trait complexity on conventional mixed linear model-based GWAS and on two new approaches specifically developed for complex traits. Mixed linear model-based GWAS rapidly loses power for more complex traits. FarmCPU, a method based on multi-locus mixed linear models, provides the greatest statistical power for moderately complex traits. A Bayesian approach adopted from genomic prediction provides the greatest statistical power to identify causal genetic loci for extremely complex traits.

Conclusions: Using estimates of the complexity of the genetic architecture of target traits can guide the selection of appropriate statistical methods and improve the overall accuracy and power of GWAS.

GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction

10.1101/2020.11.29.403170 ◽

2020 ◽

Author(s):

Jiabo Wang ◽

Zhiwu Zhang

Keyword(s):

Linear Model ◽

Statistical Power ◽

Mixed Model ◽

Genome Wide Association Study ◽

Linear Models ◽

Genomic Data ◽

Mixed Linear Model ◽

Genomic Research ◽

Multiple Loci ◽

Genomic Association

Abstract: Genome-Wide Association Study (GWAS) and Genomic Prediction/Selection (GP/GS) are the two essential enterprises in genomic research. Due to the great magnitude and complexity of genomic data, analytical methods and their associated software packages are frequently advanced. GAPIT is a widely used Genomic Association and Prediction Integrated Tool. The first version was released to the public in 2012 with the implementation of the general linear model (GLM), mixed linear model (MLM), compressed MLM, and genomic Best Linear Unbiased Prediction (gBLUP). The second version was released in 2016 with several new implementations, including Enriched Compressed MLM and Settlement of mixed linear models Under Progressively Exclusive Relationship (SUPER). All of these GWAS methods are based on single-locus tests. For the first time, the current release, GAPIT version 3, implements three multiple-locus test methods: Multiple Loci Mixed Model (MLMM), Fixed and random model Circulating Probability Unification (FarmCPU), and Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK). Additionally, two GP/GS methods were implemented: compressed BLUP, based on Compressed MLM, and SUPER BLUP, based on SUPER. These new implementations not only boost statistical power for GWAS and prediction accuracy for GP/GS, but also improve computing speed and increase the capacity to analyze big genomic data. Here, we document the current upgrade of GAPIT by describing the selection of the recently developed methods, their implementation, and potential impact. All documents, including source code, user manual, demo data, and tutorials, are freely available at the GAPIT website (http://zzlab.net/GAPIT).

Mixed linear model approach adapted for genome-wide association studies

Nature Genetics ◽

10.1038/ng.546 ◽

2010 ◽

Vol 42 (4) ◽

pp. 355-360 ◽

Cited By ~ 1007

Author(s):

Zhiwu Zhang ◽

Elhan Ersoz ◽

Chao-Qiang Lai ◽

Rory J Todhunter ◽

Hemant K Tiwari ◽

...

Keyword(s):

Linear Model ◽

Association Studies ◽

Mixed Linear Model ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Model Approach

A simulation study of impacts of error structure on modeling stock-recruitment data using generalized linear models

Canadian Journal of Fisheries and Aquatic Sciences ◽

10.1139/f03-149 ◽

2004 ◽

Vol 61 (1) ◽

pp. 122-133 ◽

Cited By ~ 20

Author(s):

Yan Jiao ◽

Yong Chen ◽

David Schneider ◽

Joe Wroblewski

Keyword(s):

Linear Model ◽

Generalized Linear Model ◽

Linear Models ◽

Least Squares Method ◽

Estimation Of Parameters ◽

Estimation Errors ◽

Error Structure ◽

Stock Recruitment ◽

Data Points ◽

The Impact

Stock-recruitment (SR) models are commonly fitted to SR data with a least-squares method. Errors in modeling are usually assumed to be normal or lognormal, regardless of whether such an assumption is realistic. A Monte Carlo simulation approach was used to evaluate the impact of the assumed error structure on SR modeling. The generalized linear model, which can readily deal with different error structures, was used in estimating parameters. This study suggests that the quality of SR parameter estimation, measured by estimation errors, can be influenced by the realism of the error structure assumed in an estimation, the number of SR data points, and the number of outliers in modeling. A small number of SR data points and the presence of outliers in SR data could increase the difficulty of identifying an appropriate error structure in modeling, which might lead to large biases in the SR parameter estimation. This study shows that generalized linear model methods can help identify an appropriate error distribution in SR modeling, leading to an improved estimation of parameters even when there are outliers and the number of SR data points is small. We recommend that the generalized linear model be used for quantifying stock-recruitment relationships.

Recent advances in genetic predisposition to clinical acute lung injury

AJP Lung Cellular and Molecular Physiology ◽

10.1152/ajplung.90269.2008 ◽

2009 ◽

Vol 296 (5) ◽

pp. L713-L725 ◽

Cited By ~ 79

Author(s):

Li Gao ◽

Kathleen C. Barnes

Keyword(s):

Acute Lung Injury ◽

Lung Injury ◽

Candidate Genes ◽

Genetic Variants ◽

Complex Traits ◽

Association Studies ◽

Genetic Association Studies ◽

Special Focus ◽

Disease Manifestation ◽

The Impact

It has been well established that acute lung injury (ALI), and the more severe presentation of acute respiratory distress syndrome (ARDS), constitute complex traits characterized by a multigenic and multifactorial etiology. Identification and validation of genetic variants contributing to disease susceptibility and severity has been hampered by the profound heterogeneity of the clinical phenotype and the role of environmental factors, which includes treatment, on outcome. The critical nature of ALI and ARDS, compounded by the impact of phenotypic heterogeneity, has rendered the amassing of sufficiently powered studies especially challenging. Nevertheless, progress has been made in the identification of genetic variants in select candidate genes, which has enhanced our understanding of the specific pathways involved in disease manifestation. Identification of novel candidate genes for which genetic association studies have confirmed a role in disease has been greatly aided by the powerful tool of high-throughput expression profiling. This article will review these studies to date, summarizing candidate genes associated with ALI and ARDS, acknowledging those that have been replicated in independent populations, with a special focus on the specific pathways for which candidate genes identified so far can be clustered.

Genome-wide association and genomic selection in animal breeding

This article is one of a selection of papers from the conference “Exploiting Genome-wide Association in Oilseed Brassicas: a model for genetic improvement of major OECD crops for sustainable farming”.

Genome ◽

10.1139/g10-076 ◽

2010 ◽

Vol 53 (11) ◽

pp. 876-883 ◽

Cited By ~ 135

Author(s):

Ben Hayes ◽

Mike Goddard

Keyword(s):

Genomic Selection ◽

Complex Traits ◽

Association Studies ◽

Genome Wide Association ◽

Relationship Matrix ◽

Genome Wide Association Studies ◽

Simple Method ◽

Breeding Values ◽

Genome Wide ◽

A Genome

Results from genome-wide association studies in livestock and humans have led to the conclusion that the effects of individual quantitative trait loci (QTL) on complex traits, such as yield, are likely to be small; therefore, a large number of QTL are necessary to explain genetic variation in these traits. Given this genetic architecture, gains from marker-assisted selection (MAS) programs using only a small number of DNA markers to trace a limited number of QTL are likely to be small. This has led to the development of an alternative technology for using the available dense single nucleotide polymorphism (SNP) information, called genomic selection. Genomic selection uses a genome-wide panel of dense markers so that all QTL are likely to be in linkage disequilibrium with at least one SNP. The genomic breeding values are predicted to be the sum of the effects of these SNPs across the entire genome. In dairy cattle breeding, the accuracy of genomic estimated breeding values (GEBV) that can be achieved and the fact that these are available early in life have led to rapid adoption of the technology. Here, we discuss the design of experiments necessary to achieve accurate prediction of GEBV in future generations in terms of the number of markers necessary and the size of the reference population where marker effects are estimated. We also present a simple method for implementing genomic selection using a genomic relationship matrix. Future challenges discussed include using whole genome sequence data to improve the accuracy of genomic selection and management of inbreeding through genomic relationships.
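
The two computations named in this abstract, a breeding value as a sum of SNP effects and a genomic relationship matrix, can be sketched in a few lines of Python. This is an illustration only and is not taken from the paper: the function names, the 0/1/2 allele-dosage coding, and the centering and scaling used for the matrix (a widely used construction, G = ZZ' / (2 Σ p(1 − p))) are assumptions made for this example.

```python
from typing import List

Genotypes = List[List[int]]  # rows = individuals, columns = SNPs, values = 0/1/2


def gebv(genotypes: Genotypes, effects: List[float]) -> List[float]:
    """Breeding value per individual: sum over SNPs of dosage * estimated effect."""
    return [sum(g * b for g, b in zip(row, effects)) for row in genotypes]


def genomic_relationship_matrix(genotypes: Genotypes) -> List[List[float]]:
    """Assumed construction: center dosages by 2p, scale by 2 * sum(p * (1 - p))."""
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Frequency of the counted allele at each SNP.
    freqs = [sum(row[j] for row in genotypes) / (2 * n_ind) for j in range(n_snp)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all SNPs are monomorphic; the matrix is undefined")
    # Centered dosage matrix Z.
    z = [[row[j] - 2 * freqs[j] for j in range(n_snp)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / scale for k in range(n_ind)]
        for i in range(n_ind)
    ]


if __name__ == "__main__":
    toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]  # hypothetical toy genotypes
    print(gebv(toy, [0.5, -0.2, 0.1]))
    print(genomic_relationship_matrix(toy))
```

In practice the SNP effects would be estimated in a reference population, as the abstract notes; here they are simply supplied as hypothetical numbers.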

Gamete simulation improves polygenic transmission disequilibrium analysis

10.1101/2020.10.26.355602 ◽

2020 ◽

Author(s):

Jiawen Chen ◽

Jing You ◽

Zijie Zhao ◽

Zheng Ni ◽

Kunling Huang ◽

...

Keyword(s):

Complex Traits ◽

Statistical Power ◽

Association Studies ◽

Autism Spectrum ◽

Genetic Maps ◽

Risk Scores ◽

Parental Genotype ◽

Genome Wide Association Studies ◽

Transmission Disequilibrium ◽

Polygenic Risk

Abstract: Polygenic risk scores (PRS) derived from summary statistics of genome-wide association studies (GWAS) have enjoyed great popularity in human genetics research. Applied to population cohorts, PRS can effectively stratify individuals by risk group and have promising applications in early diagnosis and clinical intervention. However, our understanding of within-family polygenic risk is incomplete, in part because the small number of samples per family significantly limits power. Here, to address this challenge, we introduce ORIGAMI, a computational framework that uses parental genotype data to simulate offspring genomes. ORIGAMI uses state-of-the-art genetic maps to simulate realistic recombination events on phased parental genomes and allows quantifying the prospective PRS variability within each family. We quantify and showcase the substantially reduced yet highly heterogeneous PRS variation within families for numerous complex traits. Further, we incorporate within-family PRS variability to improve the polygenic transmission disequilibrium test (pTDT). Through simulations, we demonstrate that modeling within-family risk substantially improves the statistical power of pTDT. Applied to 7,805 trios of autism spectrum disorder (ASD) probands and healthy parents, we successfully replicated previously reported over-transmission of ASD, educational attainment, and schizophrenia risk, and identified multiple novel traits with significant transmission disequilibrium. These results provided novel etiologic insights into the shared genetic basis of various complex traits and ASD.
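
The idea of simulating offspring from phased parental genomes to quantify within-family PRS variability can be illustrated with a toy Python sketch. This is not ORIGAMI's implementation: the function names, the per-interval crossover probabilities standing in for a genetic map, and the additive PRS (sum of weight times allele count) are all assumptions made for this example.

```python
import random
from statistics import pstdev
from typing import List, Sequence, Tuple

Haplotype = Sequence[int]              # 0/1 alleles along one phased chromosome
Parent = Tuple[Haplotype, Haplotype]   # the two phased haplotypes of one parent


def simulate_gamete(hap_a: Haplotype, hap_b: Haplotype,
                    recomb_probs: Sequence[float], rng: random.Random) -> List[int]:
    """Copy alleles from one haplotype, switching to the other at each crossover.

    recomb_probs[i] is the assumed crossover probability between markers i and i+1.
    """
    if not hap_a or len(hap_a) != len(hap_b) or len(recomb_probs) != len(hap_a) - 1:
        raise ValueError("haplotypes and recombination map have inconsistent lengths")
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete = [current[0]]
    for i, r in enumerate(recomb_probs):
        if rng.random() < r:           # crossover in the interval after marker i
            current, other = other, current
        gamete.append(current[i + 1])
    return gamete


def offspring_prs(mother: Parent, father: Parent, recomb_probs: Sequence[float],
                  weights: Sequence[float], rng: random.Random) -> float:
    """PRS of one simulated child: sum of weight * (maternal + paternal allele)."""
    mat = simulate_gamete(mother[0], mother[1], recomb_probs, rng)
    pat = simulate_gamete(father[0], father[1], recomb_probs, rng)
    return sum(w * (m + p) for w, m, p in zip(weights, mat, pat))


def within_family_prs_sd(mother: Parent, father: Parent,
                         recomb_probs: Sequence[float], weights: Sequence[float],
                         n_offspring: int = 1000, seed: int = 0) -> float:
    """Standard deviation of PRS across simulated offspring of one couple."""
    rng = random.Random(seed)
    scores = [offspring_prs(mother, father, recomb_probs, weights, rng)
              for _ in range(n_offspring)]
    return pstdev(scores)
```

A couple in which both parents are homozygous at every scored marker yields a within-family standard deviation of zero, since every simulated child inherits the same alleles; heterozygous parents produce a spread that depends on the weights and on the recombination probabilities.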

On modeling and analyzing barley malt data in different years

Biometrical Letters ◽

10.2478/bile-2019-0004 ◽

2019 ◽

Vol 56 (1) ◽

pp. 45-57

Author(s):

Iwona Mejza ◽

Katarzyna Ambroży-Deręgowska ◽

Jan Bocianowski ◽

Józef Błażewicz ◽

Marek Liszewski ◽

...

Keyword(s):

Linear Model ◽

Random Effects ◽

Fixed Effects ◽

Linear Models ◽

Model Fitting ◽

Mixed Linear Model ◽

Barley Grain ◽

Barley Malt ◽

Starting Point ◽

Quality Coefficient

Summary: The main purpose of this study was the model fitting of data deriving from a three-year experiment with barley malt. Two linear models were considered: a fixed linear model with fixed effects of years and other factors, and a mixed linear model with random effects of years and fixed effects of other factors. Two cultivars of brewing barley, Sebastian and Mauritia, six methods of nitrogen fertilization and four germination times were analyzed. Three quantitative traits were observed: practical extractivity of the malt, malting productivity, and a quality coefficient Q. The starting point for the statistical analyses was the available experimental material, which consisted of barley grain samples destined for malting. The analyses were performed over a series of years with respect to fixed or random effects of years. Due to the strong differentiation of the years of the study and some significant interactions of factors with years, annual analyses were also carried out.

Across-cohort QC analyses of genome-wide association study summary statistics from complex traits

10.1101/033787 ◽

2015 ◽

Author(s):

Guo-Bo Chen ◽

Sang Hong Lee ◽

Matthew R Robinson ◽

Maciej Trzaskowski ◽

Zhi-Xiang Zhu ◽

...

Keyword(s):

Complex Traits ◽

Statistical Power ◽

Association Studies ◽

False Negative ◽

Genome Wide Association ◽

Effect Sizes ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Unknown Sample ◽

Genome Wide

Genome-wide association studies (GWASs) have been successful in discovering replicable SNP-trait associations for many quantitative traits and common diseases in humans. Typically the effect sizes of SNP alleles are very small and this has led to large genome-wide association meta-analyses (GWAMA) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study we propose a new set of metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose a pair of methods for examining the concordance between demographic information and summary statistics. In method I, we use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. In method II, we conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. In addition, we propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. Finally, to quantify unknown sample overlap across all pairs of cohorts we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data, and does not breach individual privacy.

Neurofuzzy Approach to Fault Detection of Nonlinear Systems

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.1999.p0524 ◽

1999 ◽

Vol 3 (6) ◽

pp. 524-531

Author(s):

Jinglu Hu ◽

Kotaro Hirasawa ◽

Kousuke Kumamaru

Keyword(s):

Nonlinear Systems ◽

Fault Detection ◽

Linear Systems ◽

Linear Model ◽

Basis Function ◽

Linear Models ◽

Detection Method ◽

Local Model ◽

Model Based ◽

Robust Fault Detection

This paper proposes a neurofuzzy approach to fault detection in nonlinear systems. The system diagnosed is described by using a neurofuzzy model called LimNet that consists of a linear model and multiple local linear models with interpolation of a "fuzzy basis function". Fault detection is considered in two cases: when faults occur in the linear model part, a KDI-based robust fault detection is applied, where the multi-local-model part is treated as error due to nonlinear undermodeling; when faults occur in the multi-local-model part, a multi-model based fault detection method is developed, in which the identified LimNet is interpreted as several local ARMAX models, and KDI is used as an index to discriminate between each local model and its reference. This paper mainly concentrates on multi-model based fault detection.

CALDERA: Finding all significant de Bruijn subgraphs for bacterial GWAS

10.1101/2021.11.05.467462 ◽

2021 ◽

Author(s):

Hector Roux de Bezieux ◽

Leandro Lima ◽

Fanny Perraudeau ◽

Arnaud Mary ◽

Sandrine Dudoit ◽

...

Keyword(s):

Statistical Power ◽

Association Studies ◽

Bacterial Species ◽

De Bruijn Graph ◽

Testable Hypothesis ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

A Genome ◽

De Bruijn ◽

Connected Subgraphs

Genome wide association studies (GWAS), aiming to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single nucleotide polymorphisms to mobile genetic elements. Since many bacterial species include genes that are not shared among all strains, this approach avoids the reliance on a common reference genome. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects when trying to detect its association with a phenotype through k-mer based GWAS. Here we propose to overcome this by testing covariates built from closed connected subgraphs of the De Bruijn graph (DBG) defined over genomic k-mers. These covariates are able to capture polymorphic genes as a single entity, improving k-mer based GWAS in terms of power and interpretability. As the number of subgraphs is exponential in the number of nodes in the DBG, a method naively testing all possible subgraphs would result in very low statistical power due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has successfully been used to address both problems in similar contexts. We leverage this concept to test all closed connected subgraphs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. We illustrate this on both real and simulated datasets and also demonstrate how considering subgraphs leads to a more powerful and interpretable method. Our method integrates with existing visual tools to facilitate interpretation. We also provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_Recomb.
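
To make the k-mer and De Bruijn graph vocabulary of this abstract concrete, here is a minimal Python sketch. It is not CALDERA's code: the function names and toy sequences are hypothetical, the graph is an uncompacted node-centric one (an edge joins two k-mers whenever they overlap by k − 1 characters), and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_patterns(genomes: Dict[str, str], k: int) -> Dict[str, Tuple[int, ...]]:
    """For each k-mer, a 0/1 vector over strains (sorted by name): the GWAS covariate."""
    names = sorted(genomes)
    per_strain = {name: kmers(genomes[name], k) for name in names}
    all_kmers = set().union(*per_strain.values())
    return {km: tuple(int(km in per_strain[n]) for n in names) for km in all_kmers}


def build_dbg(genomes: Dict[str, str], k: int) -> Dict[str, List[str]]:
    """Node-centric De Bruijn graph: u -> v when the last k-1 chars of u start v."""
    nodes = set().union(*(kmers(g, k) for g in genomes.values()))
    by_prefix: Dict[str, Set[str]] = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


if __name__ == "__main__":
    # Two hypothetical strains differing by a single substitution (G vs C).
    strains = {"s1": "ACGTAC", "s2": "ACCTAC"}
    graph = build_dbg(strains, k=3)
    patterns = presence_patterns(strains, k=3)
    for node in sorted(graph):
        print(node, patterns[node], "->", graph[node])
```

In this toy example the single substitution splits into several strain-specific k-mers, each carrying the same presence pattern, which hints at why grouping neighbouring k-mers into connected subgraphs can capture a polymorphic region as one covariate.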
