Association mapping from sequencing reads using k-mers

eLife ◽  
2018 ◽  
Vol 7 ◽  
Author(s):  
Atif Rahman ◽  
Ingileif Hallgrímsdóttir ◽  
Michael Eisen ◽  
Lior Pachter

Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome limits the scope of association studies, and also precludes mapping associations outside of the reference. We present an alignment-free method for association studies of categorical phenotypes based on counting k-mers in whole-genome sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. An analysis of the 1000 Genomes data shows that sequences identified by our method largely agree with results obtained using the standard approach. However, unlike standard GWAS, our method identifies associations with structural variations and sites not present in the reference genome. We also demonstrate that population stratification can be inferred from k-mers. Finally, application to an E. coli dataset on ampicillin resistance validates the approach.
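The first two steps described in the abstract above (counting k-mers in reads, then testing each k-mer against a categorical phenotype) can be illustrated with a minimal sketch. The abstract does not name the statistical test, so the Poisson likelihood-ratio statistic below is an assumption made for illustration, as are the function names `count_kmers`, `poisson_lr_statistic`, and `rank_kmers` and the toy reads. Local assembly of significant k-mers and any correction for population structure are not shown.

```python
from collections import Counter
from math import log


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poisson_lr_statistic(k1, k2, n1, n2):
    """Likelihood-ratio statistic comparing one shared Poisson rate
    against separate rates for the two groups (assumed test).

    k1, k2: counts of the k-mer in cases and controls.
    n1, n2: total k-mer counts in cases and controls (used as exposure).
    """
    pooled_rate = (k1 + k2) / (n1 + n2)
    stat = 0.0
    for k, n in ((k1, n1), (k2, n2)):
        if k > 0:  # a zero count contributes nothing (0 * log 0 = 0)
            stat += k * log(k / (pooled_rate * n))
    return 2.0 * stat


def rank_kmers(case_reads, control_reads, k):
    """Score every observed k-mer and sort by decreasing statistic."""
    case_counts = count_kmers(case_reads, k)
    control_counts = count_kmers(control_reads, k)
    n1 = sum(case_counts.values())
    n2 = sum(control_counts.values())
    scores = {
        kmer: poisson_lr_statistic(case_counts[kmer], control_counts[kmer], n1, n2)
        for kmer in set(case_counts) | set(control_counts)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy reads: the two groups differ at one base; the second read is shared.
cases = ["ACGTACGT", "TTTTGGGG"]
controls = ["ACGAACGA", "TTTTGGGG"]
ranked = rank_kmers(cases, controls, k=4)
```

In this toy input, k-mers that overlap the differing base receive positive scores, while k-mers from the shared read score zero. A real analysis would also need multiple-testing correction across the very large number of k-mers.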

Author(s):  
Huaqing Zhao ◽  
Nandita Mitra ◽  
Peter A. Kanetsky ◽  
Katherine L. Nathanson ◽  
Timothy R. Rebbeck

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.
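To make the role of principal components in the abstract above concrete, the following sketch computes per-sample scores on the leading PC of a small genotype matrix using power iteration. It is only an illustration of how a PC can capture population structure: it does not implement Tracy-Widom selection, propensity scores, or the PCAPS method itself. The function name `leading_pc_scores` and the toy genotypes (allele counts 0, 1, 2) are hypothetical.

```python
import random


def leading_pc_scores(genotypes, iterations=200, seed=0):
    """Return unit-length per-sample scores on the leading principal
    component of a samples x SNPs matrix, found by power iteration."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each SNP column so the PC reflects variation, not mean dosage.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample similarity (Gram) matrix, averaged over SNPs.
    gram = [
        [sum(a * b for a, b in zip(ri, rj)) / m for rj in centered]
        for ri in centered
    ]
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Multiply by the Gram matrix, then renormalize to unit length.
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return vec


# Toy data: samples 0-2 and samples 3-5 come from two subpopulations
# with different allele frequencies at the first four SNPs.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
pc1 = leading_pc_scores(genotypes)
```

In this toy example the sign of the score separates the two subpopulations, which is why such scores are commonly included as covariates in an association model. The overall sign of a principal component is arbitrary, so only relative signs are meaningful.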


2015 ◽  
Vol 16 (1) ◽  
Author(s):  
André Lacour ◽  
Vitalia Schüller ◽  
Dmitriy Drichel ◽  
Christine Herold ◽  
Frank Jessen ◽  
...  

2021 ◽  
Vol 4 (4) ◽  
pp. e202000902 ◽  
Author(s):  
Robert A Player ◽  
Ellen R Forsyth ◽  
Kathleen J Verratti ◽  
David W Mohr ◽  
Alan F Scott ◽  
...  

Reference genome fidelity is critically important for genome-wide association studies, yet most reference genomes differ widely from the study population. A typical whole genome sequencing approach implies short-read technologies resulting in fragmented assemblies with regions of ambiguity. Further information is lost by economic necessity when genotyping populations, as lower resolution technologies such as genotyping arrays are commonly used. Here, we present a phased reference genome for Canis lupus familiaris using high molecular weight DNA-sequencing technologies. We tested wet laboratory and bioinformatic approaches to demonstrate a minimum workflow to generate the 2.4 gigabase genome for a Labrador Retriever. The de novo assembly required eight Oxford Nanopore R9.4 flowcells (∼23X depth) and running a 10X Genomics library on the equivalent of one lane of an Illumina NovaSeq S1 flowcell (∼88X depth), bringing the cost of generating a nearly complete reference genome to less than $10K (USD). Mapping of short-read data from 10 Labrador Retrievers against this reference resulted in 1% more aligned reads versus the current reference (CanFam3.1, P < 0.001), and a 15% reduction of variant calls, increasing the chance of identifying true, low-effect size variants in genome-wide association studies. We believe that by incorporating the cost to produce a full genome assembly into any large-scale genotyping project, an investigator can improve study power, decrease costs, and optimize the overall scientific value of their study.


2021 ◽  
Author(s):  
Ken Richardson

Genome wide association studies (GWAS) are being increasingly used to identify genetic markers of variation in complex traits such as intelligence and education. However, GWAS are compromised by population stratification (PS) leading to spurious associations, and attempts to correct for them statistically are also proving to be inadequate. This suggests the need for a deeper understanding of the sources of such PS and how its roots in complex social and historical dynamics can seriously mislead interpretations from GWAS/PGS to social policy.


2019 ◽  
Vol 20 (S23) ◽  
Author(s):  
Haohan Wang ◽  
Tianwei Yue ◽  
Jingkang Yang ◽  
Wei Wu ◽  
Eric P. Xing

Background: Genome-wide Association Studies (GWAS) have contributed to unraveling associations between genetic variants in the human genome and complex traits for more than a decade. While many follow-up methods have been developed to detect interactions between SNPs, epistasis is yet to be modeled and discovered more thoroughly. Results: In this paper, following the previous study of detecting marginal epistasis signals, and motivated by the universal approximation power of deep learning, we propose a neural network method that can potentially model arbitrary interactions between SNPs in genetic association studies as an extension to the mixed models in correcting confounding factors. Our method, namely Deep Mixed Model, consists of two components: 1) a confounding factor correction component, which is a large-kernel convolutional neural network that focuses on calibrating the residual phenotypes by removing factors such as population stratification, and 2) a fixed-effect estimation component, which mainly consists of a Long Short-Term Memory (LSTM) model that estimates the association effect size of SNPs with the residual phenotype. Conclusions: After validating the performance of our method using simulation experiments, we further apply it to Alzheimer’s disease data sets. Our results help gain some exploratory understanding of the genetic architecture of Alzheimer’s disease.


2020 ◽  
Author(s):  
Zakaria Mehrab ◽  
Jaiaid Mobin ◽  
Ibrahim Asadullah Tahmid ◽  
Atif Rahman

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarrays or by aligning whole genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which limits their application to organisms with no or incomplete reference genomes. This caveat can be removed using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an implementation of an alignment-free association mapping method [1] to improve its execution time and flexibility. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improvement in performance over the original implementation while maintaining accuracy in results. Finally, we demonstrate that the method can be applied to find sex-specific sequences.


2018 ◽  
Author(s):  
Brian P. Ward ◽  
Gina Brown-Guedira ◽  
Frederic L. Kolb ◽  
David A. Van Sanford ◽  
Priyanka Tyagi ◽  
...  

Grain yield is a trait of paramount importance in the breeding of all cereals. In wheat (Triticum aestivum L.), yield has steadily increased since the Green Revolution, though the current rate of increase is not forecasted to keep pace with demand due to growing world population and affluence. While several genome-wide association studies (GWAS) on yield and related component traits have been performed in wheat, the previous lack of a reference genome has made comparisons between studies difficult. In this study, a GWAS for yield and yield-related traits was carried out on a population of 324 soft red winter wheat lines across a total of four rain-fed environments in the state of Virginia using single-nucleotide polymorphism (SNP) marker data generated by a genotyping-by-sequencing (GBS) protocol. Two separate mixed linear models were used to identify significant marker-trait associations (MTAs). The first was a single-locus model utilizing a leave-one-chromosome-out approach to estimating kinship. The second was a sub-setting kinship multi-locus method (FarmCPU). The single-locus model identified nine significant MTAs for various yield-related traits, while the FarmCPU model identified 74 significant MTAs. The availability of the wheat reference genome allowed for the description of MTAs in terms of both genetic and physical positions, and enabled more extensive post-GWAS characterization of significant MTAs. The results indicate promising avenues for increasing grain yield by exploiting variation in traits relating to the number of grains per unit area, as well as phenological traits influencing grain-filling duration of genotypes.

