A faster implementation of association mapping from k-mers

2020 ◽  
Author(s):  
Zakaria Mehrab ◽  
Jaiaid Mobin ◽  
Ibrahim Asadullah Tahmid ◽  
Atif Rahman

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarrays or by aligning whole-genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which limits their application to organisms with no or incomplete reference genomes. This limitation can be removed using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an implementation of an alignment-free association mapping method [1] that improves its execution time and flexibility. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improved performance over the original implementation while maintaining accuracy of results. Finally, we demonstrate that the method can be applied to find sex-specific sequences.

PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0245058
Author(s):  
Zakaria Mehrab ◽  
Jaiaid Mobin ◽  
Ibrahim Asadullah Tahmid ◽  
Atif Rahman

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarrays or by aligning whole-genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which hinders their application to organisms with no or incomplete reference genomes. This limitation can be removed by using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an improved implementation of an alignment-free association mapping method. The new implementation is faster and includes additional features that make it more flexible than the original implementation. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improved execution time over the original implementation while maintaining accuracy of results. We also demonstrate that the method can be applied to find sex-specific sequences.


eLife ◽  
2018 ◽  
Vol 7 ◽  
Author(s):  
Atif Rahman ◽  
Ingileif Hallgrímsdóttir ◽  
Michael Eisen ◽  
Lior Pachter

Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome limits the scope of association studies, and also precludes mapping associations outside of the reference. We present an alignment-free method for association studies of categorical phenotypes based on counting k-mers in whole-genome sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. An analysis of the 1000 Genomes data shows that sequences identified by our method largely agree with results obtained using the standard approach. However, unlike standard GWAS, our method identifies associations with structural variations and sites not present in the reference genome. We also demonstrate that population stratification can be inferred from k-mers. Finally, application to an E. coli dataset on ampicillin resistance validates the approach.
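The first two steps named in this abstract (counting k-mers in reads, then testing each k-mer against a categorical phenotype) can be illustrated with a minimal sketch. The abstract does not say which statistical test is used, so the Poisson likelihood ratio test below is an assumption made for illustration. The function names (`count_kmers`, `poisson_lrt_pvalue`, `associated_kmers`) are hypothetical and are not taken from the authors' software. Local assembly and multiple-testing correction are left out.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poisson_lrt_pvalue(k_case, k_ctrl, n_case, n_ctrl):
    """P-value for a k-mer's rate differing between two groups.

    Assumption (not from the abstract): counts are Poisson with rate
    proportional to the total k-mer count of each group, and the
    statistic is compared with a chi-square distribution with 1 df.
    """
    total = k_case + k_ctrl
    if total == 0:
        return 1.0
    pooled_rate = total / (n_case + n_ctrl)
    stat = 0.0
    for observed, n in ((k_case, n_case), (k_ctrl, n_ctrl)):
        if observed > 0:  # a zero count contributes nothing to the statistic
            stat += 2.0 * observed * math.log((observed / n) / pooled_rate)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def associated_kmers(case_reads, ctrl_reads, k, alpha):
    """Return {k-mer: p-value} for k-mers with p-value below alpha."""
    case_counts = count_kmers(case_reads, k)
    ctrl_counts = count_kmers(ctrl_reads, k)
    n_case = sum(case_counts.values())
    n_ctrl = sum(ctrl_counts.values())
    hits = {}
    for kmer in set(case_counts) | set(ctrl_counts):
        p = poisson_lrt_pvalue(case_counts[kmer], ctrl_counts[kmer],
                               n_case, n_ctrl)
        if p < alpha:
            hits[kmer] = p
    return hits
```

In real data the number of distinct k-mers is very large, so a dedicated k-mer counter and a correction for multiple testing would be needed. The sketch shows only the shape of the computation.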


2013 ◽  
Vol 64 (3) ◽  
pp. 244 ◽  
Author(s):  
L. W. Pembleton ◽  
J. Wang ◽  
N. O. I. Cogan ◽  
J. E. Pryce ◽  
G. Ye ◽  
...  

Due to the complex genetic architecture of perennial ryegrass, based on an obligate outbreeding reproductive habit, association-mapping approaches to genetic dissection offer the potential for effective identification of genetic marker–trait linkages. Associations with genes for agronomic characters, such as components of herbage nutritive quality, may then be utilised for accelerated cultivar improvement using advanced molecular breeding practices. The objective of the present study was to evaluate the presence of such associations for a broad range of candidate genes involved in pathways of cell wall biosynthesis and carbohydrate metabolism. An association-mapping panel composed from a broad range of non-domesticated and varietal sources was assembled and assessed for genome-wide sequence polymorphism. Removal of significant population structure yielded a diverse meta-population (220 genotypes) suitable for association studies. The meta-population was established with replication as a spaced-plant field trial. All plants were genotyped with a cohort of candidate gene-derived single nucleotide polymorphism (SNP) markers. Herbage samples were harvested at both vegetative and reproductive stages and were measured for a range of herbage quality traits using near infrared reflectance spectroscopy. Significant associations were identified for ~50% of the genes, accounting for small but significant components of phenotypic variance. The identities of genes with associated SNPs were largely consistent with detailed knowledge of ryegrass biology, and they are interpreted in terms of known biochemical and physiological processes. Magnitudes of effect of observed marker–trait gene associations were small, indicating that future activities should focus on genome-wide association studies in order to identify the majority of causal mutations for complex traits such as forage quality.


2017 ◽  
Author(s):  
Haohan Wang ◽  
Xiang Liu ◽  
Yunpeng Xiao ◽  
Ming Xu ◽  
Eric P. Xing

Genome-wide association studies have presented a promising way to understand the association between human genomes and complex traits. Many simple polymorphic loci have been shown to explain a significant fraction of phenotypic variability. However, challenges remain in the non-triviality of explaining complex traits associated with multifactorial genetic loci, especially considering the confounding factors caused by population structure, family structure, and cryptic relatedness. In this paper, we propose a Squared-LMM (LMM2) model, aiming to jointly correct population and genetic confounding factors. We offer two strategies of utilizing LMM2 for association mapping: 1) It serves as an extension of univariate LMM, which could effectively correct population structure, but considers each SNP in isolation. 2) It is integrated with the multivariate regression model to discover association relationships between complex traits and multifactorial genetic loci. We refer to this second model as sparse Squared-LMM (sLMM2). Further, we extend LMM2/sLMM2 by raising the power of our squared model to the LMMn/sLMMn model. We demonstrate the practical use of our model with synthetic phenotypic variants generated from genetic loci of Arabidopsis thaliana. The experiment shows that our method achieves a more accurate and significant prediction on the association relationship between traits and loci. We also evaluate our models on collected phenotypes and genotypes with the number of candidate genes that the models could discover. The results suggest the potential and promising usage of our method in genome-wide association studies.


2018 ◽  
Author(s):  
Ping Zeng ◽  
Xinjie Hao ◽  
Xiang Zhou

Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci are associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits with genome-wide SNPs together. Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association pattern using mixture modeling, and has the potential to reveal causal relationships between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially noninformative set. To enable scalable inference of iMAP to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP both in terms of high association mapping power and in terms of accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. iMAP is freely available at www.xzlab.org/software.html.


2021 ◽  
Vol 4 (4) ◽  
pp. e202000902 ◽  
Author(s):  
Robert A Player ◽  
Ellen R Forsyth ◽  
Kathleen J Verratti ◽  
David W Mohr ◽  
Alan F Scott ◽  
...  

Reference genome fidelity is critically important for genome-wide association studies, yet most vary widely from the study population. A typical whole genome sequencing approach implies short-read technologies resulting in fragmented assemblies with regions of ambiguity. Further information is lost by economic necessity when genotyping populations, as lower resolution technologies such as genotyping arrays are commonly used. Here, we present a phased reference genome for Canis lupus familiaris using high molecular weight DNA-sequencing technologies. We tested wet laboratory and bioinformatic approaches to demonstrate a minimum workflow to generate the 2.4 gigabase genome for a Labrador Retriever. The de novo assembly required eight Oxford Nanopore R9.4 flowcells (∼23X depth) and running a 10X Genomics library on the equivalent of one lane of an Illumina NovaSeq S1 flowcell (∼88X depth), bringing the cost of generating a nearly complete reference genome to less than $10K (USD). Mapping of short-read data from 10 Labrador Retrievers against this reference resulted in 1% more aligned reads versus the current reference (CanFam3.1, P < 0.001), and a 15% reduction of variant calls, increasing the chance of identifying true, low-effect size variants in genome-wide association studies. We believe that by incorporating the cost to produce a full genome assembly into any large-scale genotyping project, an investigator can improve study power, decrease costs, and optimize the overall scientific value of their study.


2021 ◽  
Author(s):  
Wenmin Zhang ◽  
Hamed S Najafabadi ◽  
Yue Li

Identifying causal variants from genome-wide association studies (GWASs) is challenging due to widespread linkage disequilibrium (LD). Functional annotations of the genome may help prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. However, classical fine-mapping methods have a high computational cost, particularly when the underlying genetic architecture and LD patterns are complex. Here, we propose a novel approach, SparsePro, to efficiently conduct functionally informed statistical fine-mapping. Our method enjoys two major innovations: First, by creating a sparse low-dimensional projection of the high-dimensional genotype, we enable a linear search of causal variants instead of an exponential search of causal configurations used in existing methods; Second, we adopt a probabilistic framework with a highly efficient variational expectation-maximization algorithm to integrate statistical associations and functional priors. We evaluate SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved more accurate and well-calibrated posterior inference with greatly reduced computation time. We demonstrate the utility of SparsePro by investigating the genetic architecture of five functional biomarkers of vital organs. We identify potential causal variants contributing to the genetically encoded coordination mechanisms between vital organs and pinpoint target genes with potential pleiotropic effects. In summary, we have developed an efficient genome-wide fine-mapping method with the ability to integrate functional annotations. Our method may have wide utility in understanding the genetics of complex traits as well as in increasing the yield of functional follow-up studies of GWASs.


2018 ◽  
Author(s):  
Brian P. Ward ◽  
Gina Brown-Guedira ◽  
Frederic L. Kolb ◽  
David A. Van Sanford ◽  
Priyanka Tyagi ◽  
...  

Grain yield is a trait of paramount importance in the breeding of all cereals. In wheat (Triticum aestivum L.), yield has steadily increased since the Green Revolution, though the current rate of increase is not forecasted to keep pace with demand due to growing world population and affluence. While several genome-wide association studies (GWAS) on yield and related component traits have been performed in wheat, the previous lack of a reference genome has made comparisons between studies difficult. In this study, a GWAS for yield and yield-related traits was carried out on a population of 324 soft red winter wheat lines across a total of four rain-fed environments in the state of Virginia using single-nucleotide polymorphism (SNP) marker data generated by a genotyping-by-sequencing (GBS) protocol. Two separate mixed linear models were used to identify significant marker-trait associations (MTAs). The first was a single-locus model utilizing a leave-one-chromosome-out approach to estimating kinship. The second was a sub-setting kinship multi-locus method (FarmCPU). The single-locus model identified nine significant MTAs for various yield-related traits, while the FarmCPU model identified 74 significant MTAs. The availability of the wheat reference genome allowed for the description of MTAs in terms of both genetic and physical positions, and enabled more extensive post-GWAS characterization of significant MTAs. The results indicate promising avenues for increasing grain yield by exploiting variation in traits relating to the number of grains per unit area, as well as phenological traits influencing grain-filling duration of genotypes.


Nature ◽  
2019 ◽  
Vol 576 (7785) ◽  
pp. 106-111 ◽  
Author(s):  

The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world’s population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.


2012 ◽  
Vol 5 (1) ◽  
Author(s):  
Chad C Brown ◽  
Tammy M Havener ◽  
Marisa Wong Medina ◽  
Ronald M Krauss ◽  
Howard L McLeod ◽  
...  
