BIGwas: Single-command quality control and association testing for multi-cohort and biobank-scale GWAS/PheWAS data

GigaScience ◽  
2021 ◽  
Vol 10 (6) ◽  
Author(s):  
Jan Christian Kässens ◽  
Lars Wienbrandt ◽  
David Ellinghaus

Abstract Background: Genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS) involving 1 million GWAS samples from dozens of population-based biobanks present a considerable computational challenge and are carried out by large scientific groups under great expenditure of time and personnel. Automating these processes requires highly efficient and scalable methods and software, but so far there is no workflow solution to easily process 1 million GWAS samples. Results: Here we present BIGwas, a portable, fully automated quality control and association testing pipeline for large-scale binary and quantitative trait GWAS data provided by biobank resources. By using Nextflow workflow and Singularity software container technology, BIGwas performs resource-efficient and reproducible analyses on a local computer or any high-performance compute (HPC) system with just 1 command, with no need to manually install a software execution environment or various software packages. For a single-command GWAS analysis with 974,818 individuals and 92 million genetic markers, BIGwas takes ∼16 days on a small HPC system with only 7 compute nodes to perform a complete GWAS QC and association analysis protocol. Our dynamic parallelization approach enables shorter runtimes for large HPCs. Conclusions: Researchers without extensive bioinformatics knowledge and with few computer resources can use BIGwas to perform multi-cohort GWAS with 1 million GWAS samples and, if desired, use it to build their own (genome-wide) PheWAS resource. BIGwas is freely available for download from http://github.com/ikmb/gwas-qc and http://github.com/ikmb/gwas-assoc.

2018 ◽  
Vol 35 (14) ◽  
pp. 2512-2514 ◽  
Author(s):  
Bongsong Kim ◽  
Xinbin Dai ◽  
Wenchao Zhang ◽  
Zhaohong Zhuang ◽  
Darlene L Sanchez ◽  
...  

Abstract Summary: We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as those found in plant science investigations, and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 243-244
Author(s):  
Brittany N Diehl ◽  
Andres A Pech-Cervantes ◽  
Thomas H Terrill ◽  
Ibukun M Ogunade ◽  
Owen Rae ◽  
...  

Abstract Florida Native sheep is an indigenous breed from Florida and expresses superior parasite resistance. Previous candidate and genome-wide association studies with Florida Native sheep have identified single nucleotide polymorphisms with additive and non-additive effects associated with parasite resistance. However, the role of other potential DNA variants, such as copy number variants (CNVs), in controlling this complex trait has not been evaluated. The objective of the present study was to investigate the importance of CNVs for resistance to natural Haemonchus contortus infections in Florida Native sheep. A total of 200 sheep were evaluated in the present study. Phenotypic records included fecal egg count (FEC, eggs/gram), FAMACHA score, and packed cell volume (PCV, %). Sheep were genotyped using the GGP Ovine 50K SNP chip. Copy number analysis with the univariate method was used to identify CNVs. A total of 170 animals with CNVs and phenotypic data were used for the association testing. Association tests were carried out using single linear regression and Principal Component Analysis (PCA) correction to identify CNVs associated with FEC, FAMACHA, and PCV. To confirm our results, a second association test using the correlation-trend test with PCA correction was performed. CNVs were considered significant when their adjusted p-value was < 0.05 after FDR correction. A deletion CNV in chromosome 21 was associated with FEC. This DNA variant was located in intron 2 of RAB3IL gene and overlapped a QTL associated with changes in eosinophil number. Our study demonstrated for the first time that CNVs could potentially be involved in parasite resistance in this heritage sheep breed.
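
The significance rule in the abstract above (adjusted p-value < 0.05 after FDR correction) can be illustrated with a small sketch. The abstract does not say which FDR procedure was used; the Benjamini-Hochberg step-up procedure is assumed here because it is the most common choice, and the function name and p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the study's "FDR correction" is taken to be BH; the
    abstract does not confirm this.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone non-decreasing in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five candidate CNVs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(p_values)

# Indices of CNVs passing the adjusted p-value < 0.05 threshold.
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(significant)
```

In this toy input the third and fourth tests have raw p-values below 0.05 but fall just above the threshold after adjustment, which is the kind of case the correction is meant to filter out.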


2012 ◽  
Vol 215 (1) ◽  
pp. 17-28 ◽  
Author(s):  
Georg Homuth ◽  
Alexander Teumer ◽  
Uwe Völker ◽  
Matthias Nauck

The metabolome, defined as the reflection of metabolic dynamics derived from parameters measured primarily in easily accessible body fluids such as serum, plasma, and urine, can be considered as the omics data pool that is closest to the phenotype because it integrates genetic influences as well as nongenetic factors. Metabolic traits can be related to genetic polymorphisms in genome-wide association studies, enabling the identification of underlying genetic factors, as well as to specific phenotypes, resulting in the identification of metabolome signatures primarily caused by nongenetic factors. Similarly, correlation of metabolome data with transcriptional and/or proteome profiles of blood cells also produces valuable data by revealing associations between metabolic changes and mRNA and protein levels. In recent years, the progress in correlating genetic variation and metabolome profiles has been most impressive. This review therefore summarizes the most important of these studies and gives an outlook on future developments.


2018 ◽  
Author(s):  
Doug Speed ◽  
David J Balding

LD Score Regression (LDSC) has been widely applied to the results of genome-wide association studies. However, its estimates of SNP heritability are derived from an unrealistic model in which each SNP is expected to contribute equal heritability. As a consequence, LDSC tends to over-estimate confounding bias, under-estimate the total phenotypic variation explained by SNPs, and provide misleading estimates of the heritability enrichment of SNP categories. Therefore, we present SumHer, software for estimating SNP heritability from summary statistics using more realistic heritability models. After demonstrating its superiority over LDSC, we apply SumHer to the results of 24 large-scale association studies (average sample size 121 000). First we show that these studies have tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci has been under-reported by about 20%. Next we estimate enrichment for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further twelve categories with above 2-fold enrichment. By contrast, our analysis using SumHer finds that conserved regions are only 1.6-fold (SD 0.06) enriched, and that no category has enrichment above 1.7-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (1) ◽  
pp. e1009315
Author(s):  
Ardalan Naseri ◽  
Junjie Shi ◽  
Xihong Lin ◽  
Shaojie Zhang ◽  
Degui Zhi

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admix events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
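
As a rough illustration of how ϕ and π0 relate to detected IBD segments, the sketch below uses the textbook identities for non-inbred pairs: ϕ = k1/4 + k2/2 and π0 = 1 − k1 − k2, where k1 and k2 are the genome fractions shared IBD1 and IBD2. This is not RAFFI's estimator, which additionally adjusts for phasing and genotyping quality; the function name, the genome length, and the segment totals are assumptions chosen for illustration.

```python
def kinship_from_ibd(ibd1_length, ibd2_length, genome_length):
    """Estimate (phi, pi0) from total IBD1 and IBD2 segment lengths.

    All lengths must be in the same unit (e.g. cM). Uses the standard
    identities for non-inbred pairs, with no quality adjustment.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    k1 = ibd1_length / genome_length   # fraction sharing one haplotype IBD
    k2 = ibd2_length / genome_length   # fraction sharing both haplotypes IBD
    pi0 = 1.0 - k1 - k2                # fraction sharing nothing IBD
    phi = 0.25 * k1 + 0.5 * k2         # kinship coefficient
    return phi, pi0


# Hypothetical pair resembling full siblings: half the genome IBD1,
# a quarter IBD2, on an assumed 3400 cM genome.
phi, pi0 = kinship_from_ibd(ibd1_length=1700.0, ibd2_length=850.0,
                            genome_length=3400.0)
print(phi, pi0)
```

Because both quantities are simple sums over segment lengths, they can be accumulated per pair as segments are reported, which avoids a separate all-pairs genotype comparison.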


2020 ◽  
Author(s):  
Celine Charon ◽  
Rodrigue Allodji ◽  
Vincent Meyer ◽  
Jean-François Deleuze

Abstract Quality control methods for genome-wide association studies and fine mapping are commonly used for imputation; however, they result in the loss of many single nucleotide polymorphisms (SNPs). To investigate the consequences of filtration on imputation, we studied the direct effects on the number of markers, their allele frequencies, imputation quality scores and post-filtration events. We pre-phased 1,031 genotyped individuals from diverse ethnicities and compared the imputed variants to 1,089 NCBI recorded individuals for additional validation. Without variant pre-filtration based on quality control (QC), we observed no impairment in the imputation of SNPs that failed QC, whereas with pre-filtration there was an overall loss of information. Significant differences between frequencies with and without pre-filtration were found only in the range of very rare (5E-04-1E-03) and rare variants (1E-03-5E-03) (p < 1E-04). Increasing the post-filtration imputation quality score from 0.3 to 0.8 reduced the number of single nucleotide variants (SNVs) <0.001 2.5-fold with or without QC pre-filtration and halved the number of very rare variants (5E-04). As a result, to maintain confidence and enough SNVs, we propose here a 2-step post-filtration approach to increase the number of very rare and rare variants compared to conservative post-filtration methods.


2020 ◽  
Vol 26 (5) ◽  
pp. 490-500
Author(s):  
A. O. Konradi

The article reviews monogenic forms of hypertension, data on the role of heredity in essential hypertension and candidate genes, as well as genome-wide association studies. The modern approach to the role of genetics is driven by the implementation of new technologies and their productivity. The high speed and performance of new technologies such as genome-wide association studies provide data for a better understanding of genetic markers of hypertension. The major research goal today is to reveal the molecular pathways of blood pressure regulation, which can help to move from a population-level to an individual-level understanding of pathogenesis and treatment targets.


Thorax ◽  
2021 ◽  
pp. thoraxjnl-2020-215742
Author(s):  
Sanghun Lee ◽  
Jessica Lasky-Su ◽  
Sungho Won ◽  
Cecelia Laurie ◽  
Juan Carlos Celedón ◽  
...  

Most genome-wide association studies of obesity and body mass index (BMI) have so far assumed an additive mode of inheritance in their analysis, although association testing supports a recessive effect for some of the established loci, for example, rs1421085 in FTO. In two whole-genome sequencing (WGS) studies of children with asthma and their parents (892 Costa Rican trios and 286 North American trios), we discovered an association between a locus (rs9292139) in LOC102724122 and BMI that reaches genome-wide significance under a recessive model in the combined analysis. As the association does not achieve significance under an additive model, our finding illustrates the benefits of the recessive model in WGS analyses.
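
The difference between the additive and recessive models mentioned above comes down to how a genotype is coded before regression. The sketch below shows the conventional codings; the function name and example genotypes are illustrative and are not taken from the study.

```python
def encode_genotype(effect_allele_count, model):
    """Code a biallelic genotype (0, 1 or 2 effect alleles) for regression.

    additive:  0, 1, 2  (each copy of the effect allele adds one unit)
    recessive: 0, 0, 1  (only homozygous carriers differ from the rest)
    """
    if effect_allele_count not in (0, 1, 2):
        raise ValueError("effect_allele_count must be 0, 1 or 2")
    if model == "additive":
        return effect_allele_count
    if model == "recessive":
        return 1 if effect_allele_count == 2 else 0
    raise ValueError(f"unknown model: {model}")


# Hypothetical genotypes for five individuals at one locus.
genotypes = [0, 1, 2, 1, 2]
additive = [encode_genotype(g, "additive") for g in genotypes]
recessive = [encode_genotype(g, "recessive") for g in genotypes]
print(additive)
print(recessive)
```

Under the recessive coding, heterozygotes are grouped with non-carriers, so an effect confined to homozygous carriers is not diluted the way it is when a linear per-allele trend is fitted.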

