Association mapping from sequencing reads using k-mers

Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome limits the scope of association studies, and also precludes mapping associations outside of the reference. We present an alignment-free method for association studies of categorical phenotypes based on counting k-mers in whole-genome sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. An analysis of the 1000 Genomes data shows that sequences identified by our method largely agree with results obtained using the standard approach. However, unlike standard GWAS, our method identifies associations with structural variations and sites not present in the reference genome. We also demonstrate that population stratification can be inferred from k-mers. Finally, application to an E. coli dataset on ampicillin resistance validates the approach.
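
As a rough illustration of the first two steps the abstract describes (counting k-mers in reads, then testing each k-mer against a two-group phenotype), here is a minimal Python sketch. The function names, the toy reads, and the use of a Poisson likelihood ratio statistic are assumptions made for this example and are not taken from the authors' software. The sketch also leaves out reverse-complement handling, multiple-testing correction, and the local assembly step.

```python
from collections import Counter
from math import log


def count_kmers(reads, k):
    """Count every length-k substring across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poisson_lrt(case_count, control_count, case_total, control_total):
    """Likelihood ratio statistic comparing one shared Poisson rate
    against separate rates for cases and controls (an assumed test)."""
    pooled = case_count + control_count
    total = case_total + control_total
    stat = 0.0
    for observed, group_total in ((case_count, case_total),
                                  (control_count, control_total)):
        # Expected count for this group under the shared-rate null.
        expected = group_total * pooled / total
        if observed > 0:  # 0 * log(0) is taken as 0
            stat += observed * log(observed / expected)
    return 2.0 * stat


# Toy data: hypothetical reads, not from any real dataset.
case_reads = ["ACGTACGT", "ACGTTTGA"]
control_reads = ["ACGAACGA", "TTGAACGA"]
k = 4
case_kmers = count_kmers(case_reads, k)
control_kmers = count_kmers(control_reads, k)
case_total = sum(case_kmers.values())
control_total = sum(control_kmers.values())

# Score every k-mer seen in either group.
scores = {
    kmer: poisson_lrt(case_kmers[kmer], control_kmers[kmer],
                      case_total, control_total)
    for kmer in set(case_kmers) | set(control_kmers)
}
for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(kmer, round(score, 3))
```

Under the null hypothesis of a shared rate, a statistic of this form is approximately chi-square distributed with one degree of freedom, so values above about 3.84 would correspond to p < 0.05 before any correction for the very large number of k-mers tested.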

A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS)

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0054 ◽

2018 ◽

Vol 17 (6) ◽

Cited By ~ 2

Author(s):

Huaqing Zhao ◽

Nandita Mitra ◽

Peter A. Kanetsky ◽

Katherine L. Nathanson ◽

Timothy R. Rebbeck

Keyword(s):

Principal Components ◽

Population Stratification ◽

Propensity Scores ◽

Association Studies ◽

Germ Cell Tumors ◽

Gwas Data ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Testicular Germ Cell ◽

Genome Wide

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.

Accounting for Population Stratification in Practice: A Comparison of the Main Strategies Dedicated to Genome-Wide Association Studies

PLoS ONE ◽

10.1371/journal.pone.0028845 ◽

2011 ◽

Vol 6 (12) ◽

pp. e28845 ◽

Cited By ~ 35

Author(s):

Matthieu Bouaziz ◽

Christophe Ambroise ◽

Mickael Guedj

Keyword(s):

Population Stratification ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

A mixed model reduces spurious genetic associations produced by population stratification in genome-wide association studies

Genomics ◽

10.1016/j.ygeno.2015.01.006 ◽

2015 ◽

Vol 105 (4) ◽

pp. 191-196 ◽

Cited By ~ 18

Author(s):

Jimin Shin ◽

Chaeyoung Lee

Keyword(s):

Population Stratification ◽

Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genetic Associations ◽

Genome Wide

Robust methods for population stratification in genome wide association studies

BMC Bioinformatics ◽

10.1186/1471-2105-14-132 ◽

2013 ◽

Vol 14 (1) ◽

Cited By ~ 22

Author(s):

Li Liu ◽

Donghui Zhang ◽

Hong Liu ◽

Christopher Arendt

Keyword(s):

Population Stratification ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Robust Methods ◽

Genome Wide

Novel genetic matching methods for handling population stratification in genome-wide association studies

BMC Bioinformatics ◽

10.1186/s12859-015-0521-4 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 6

Author(s):

André Lacour ◽

Vitalia Schüller ◽

Dmitriy Drichel ◽

Christine Herold ◽

Frank Jessen ◽

...

Keyword(s):

Population Stratification ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Matching Methods ◽

Genetic Matching ◽

Genome Wide

A novel canis lupus familiaris reference genome improves variant resolution for use in breed-specific GWAS

Life Science Alliance ◽

10.26508/lsa.202000902 ◽

2021 ◽

Vol 4 (4) ◽

pp. e202000902 ◽

Cited By ~ 1

Author(s):

Robert A Player ◽

Ellen R Forsyth ◽

Kathleen J Verratti ◽

David W Mohr ◽

Alan F Scott ◽

...

Keyword(s):

Canis Lupus ◽

Reference Genome ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Canis Lupus Familiaris ◽

High Molecular Weight Dna ◽

Short Read ◽

Genome Wide ◽

The Cost

Reference genome fidelity is critically important for genome-wide association studies, yet most vary widely from the study population. A typical whole genome sequencing approach implies short-read technologies resulting in fragmented assemblies with regions of ambiguity. Further information is lost by economic necessity when genotyping populations, as lower resolution technologies such as genotyping arrays are commonly used. Here, we present a phased reference genome for Canis lupus familiaris using high molecular weight DNA-sequencing technologies. We tested wet laboratory and bioinformatic approaches to demonstrate a minimum workflow to generate the 2.4 gigabase genome for a Labrador Retriever. The de novo assembly required eight Oxford Nanopore R9.4 flowcells (∼23X depth) and running a 10X Genomics library on the equivalent of one lane of an Illumina NovaSeq S1 flowcell (∼88X depth), bringing the cost of generating a nearly complete reference genome to less than $10K (USD). Mapping of short-read data from 10 Labrador Retrievers against this reference resulted in 1% more aligned reads versus the current reference (CanFam3.1, P < 0.001), and a 15% reduction of variant calls, increasing the chance of identifying true, low-effect size variants in genome-wide association studies. We believe that by incorporating the cost to produce a full genome assembly into any large-scale genotyping project, an investigator can improve study power, decrease costs, and optimize the overall scientific value of their study.

Genes, Psychology and Population History

10.31234/osf.io/kgz2t ◽

2021 ◽

Author(s):

Ken Richardson

Keyword(s):

Social Policy ◽

Genetic Markers ◽

Complex Traits ◽

Population Stratification ◽

Association Studies ◽

Population History ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Historical Dynamics

Genome wide association studies (GWAS) are being increasingly used to identify genetic markers of variation in complex traits such as intelligence and education. However, GWAS are compromised by population stratification (PS) leading to spurious associations, and attempts to correct for them statistically are also proving to be inadequate. This suggests the need for a deeper understanding of the sources of such PS and how its roots in complex social and historical dynamics can seriously mislead interpretations from GWAS/PGS to social policy.

Deep mixed model for marginal epistasis detection and population stratification correction in genome-wide association studies

BMC Bioinformatics ◽

10.1186/s12859-019-3300-9 ◽

2019 ◽

Vol 20 (S23) ◽

Cited By ~ 1

Author(s):

Haohan Wang ◽

Tianwei Yue ◽

Jingkang Yang ◽

Wei Wu ◽

Eric P. Xing

Keyword(s):

Neural Network ◽

Alzheimer's Disease ◽

Complex Traits ◽

Population Stratification ◽

Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Background: Genome-wide association studies (GWAS) have contributed to unraveling associations between genetic variants in the human genome and complex traits for more than a decade. While many methods have been developed as follow-ups to detect interactions between SNPs, epistasis has yet to be modeled and discovered more thoroughly. Results: In this paper, following the previous study of detecting marginal epistasis signals, and motivated by the universal approximation power of deep learning, we propose a neural network method that can potentially model arbitrary interactions between SNPs in genetic association studies as an extension to the mixed models in correcting confounding factors. Our method, namely Deep Mixed Model, consists of two components: 1) a confounding factor correction component, which is a large-kernel convolutional neural network that focuses on calibrating the residual phenotypes by removing factors such as population stratification, and 2) a fixed-effect estimation component, which mainly consists of a Long Short-Term Memory (LSTM) model that estimates the association effect size of SNPs with the residual phenotype. Conclusions: After validating the performance of our method using simulation experiments, we further apply it to Alzheimer’s disease data sets. Our results help gain some exploratory understanding of the genetic architecture of Alzheimer’s disease.

A faster implementation of association mapping from k-mers

10.1101/2020.04.14.040675 ◽

2020 ◽

Author(s):

Zakaria Mehrab ◽

Jaiaid Mobin ◽

Ibrahim Asadullah Tahmid ◽

Atif Rahman

Keyword(s):

Association Mapping ◽

Reference Genome ◽

Association Studies ◽

Free Association ◽

Mapping Method ◽

Genome Wide Association Studies ◽

E Coli ◽

Alignment Free ◽

Genome Wide ◽

Specific Sequences

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarrays or by aligning whole genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which limits their application to organisms with no or incomplete reference genomes. This limitation can be removed using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an implementation of an alignment-free association mapping method [1] that improves its execution time and flexibility. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improvement in performance over the original implementation while maintaining accuracy in results. Finally, we demonstrate that the method can be applied to find sex-specific sequences.

Genome-wide association studies for yield-related traits in soft red winter wheat grown in Virginia

10.1101/471656 ◽

2018 ◽

Author(s):

Brian P. Ward ◽

Gina Brown-Guedira ◽

Frederic L. Kolb ◽

David A. Van Sanford ◽

Priyanka Tyagi ◽

...

Keyword(s):

Winter Wheat ◽

Grain Yield ◽

Reference Genome ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Soft Red Winter Wheat ◽

Locus Model ◽

Genome Wide ◽

Red Winter Wheat

Grain yield is a trait of paramount importance in the breeding of all cereals. In wheat (Triticum aestivum L.), yield has steadily increased since the Green Revolution, though the current rate of increase is not forecasted to keep pace with demand due to growing world population and affluence. While several genome-wide association studies (GWAS) on yield and related component traits have been performed in wheat, the previous lack of a reference genome has made comparisons between studies difficult. In this study, a GWAS for yield and yield-related traits was carried out on a population of 324 soft red winter wheat lines across a total of four rain-fed environments in the state of Virginia using single-nucleotide polymorphism (SNP) marker data generated by a genotyping-by-sequencing (GBS) protocol. Two separate mixed linear models were used to identify significant marker-trait associations (MTAs). The first was a single-locus model utilizing a leave-one-chromosome-out approach to estimating kinship. The second was a sub-setting kinship multi-locus method (FarmCPU). The single-locus model identified nine significant MTAs for various yield-related traits, while the FarmCPU model identified 74 significant MTAs. The availability of the wheat reference genome allowed for the description of MTAs in terms of both genetic and physical positions, and enabled more extensive post-GWAS characterization of significant MTAs. The results indicate promising avenues for increasing grain yield by exploiting variation in traits relating to the number of grains per unit area, as well as phenological traits influencing grain-filling duration of genotypes.
