Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

2017 ◽  
Author(s):  
Florian Privé ◽  
Hugues Aschard ◽  
Michael G.B. Blum

Abstract Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses. Specialized software has been developed for every part of the analysis pipeline to handle large genomic data. However, combining all these tools into a single data analysis pipeline can be technically difficult. Results: Here we present two R packages, bigstatsr and bigsnpr, that allow management and analysis of large-scale genomic data to be performed within a single comprehensive framework. To address large data size, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementations of existing methods. In particular, the packages implement a fast derivation of Principal Component Analysis, functions to remove SNPs in Linkage Disequilibrium, and algorithms to learn Polygenic Risk Scores on millions of SNPs. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. Availability: https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/ Contact: [email protected] & [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
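
The memory-mapping idea mentioned in the Results can be illustrated outside R. What follows is a minimal Python sketch using only the standard library; it is not the bigstatsr implementation. The on-disk layout (raw little-endian float64 values stored column by column, with no header) and the helper names `write_matrix`, `read_column` and `demo` are assumptions made purely for illustration.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.Struct("<d")  # little-endian 8-byte float (assumed layout)


def write_matrix(path, columns):
    """Write a matrix to disk column by column as raw float64 values."""
    with open(path, "wb") as fh:
        for col in columns:
            for value in col:
                fh.write(DOUBLE.pack(value))


def read_column(path, n_rows, j):
    """Read column j through a memory map, slicing only that column's bytes."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = j * n_rows * DOUBLE.size
            stop = start + n_rows * DOUBLE.size
            return list(struct.unpack("<%dd" % n_rows, mm[start:stop]))


def demo():
    """Store a toy 3 x 3 genotype matrix on disk and read back its second column."""
    columns = [[0.0, 1.0, 2.0], [2.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_matrix(path, columns)
        return read_column(path, n_rows=3, j=1)
    finally:
        os.remove(path)
```

Because the operating system pages in only the bytes that are sliced, the cost of reading one column does not depend on the total size of the file, which is the property that makes on-disk matrices practical for datasets larger than RAM.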

2020 ◽  
Vol 46 (Supplement_1) ◽  
pp. S4-S5
Author(s):  
Hengyi Cao ◽  
Hang Zhou ◽  
Tyrone Cannon

Abstract Background: The relationships between schizophrenia polygenic risk and connectome-wide brain connectivity remain unclear. In particular, it is unknown whether and how schizophrenia polygenic risk influences functional connectivity both at the connectome level and in a state-independent way. In this study, we used multi-paradigm fMRI data from two independent cohorts to investigate these questions. Methods: The discovery sample included 623 healthy Caucasian participants (age 28.86 ± 3.63 years, 302 males) acquired from the Human Connectome Project (HCP). All subjects completed fMRI scans for a battery of eight paradigms and had imputed genetic data available. Following the procedures in our prior studies, we constructed whole-brain connectivity matrices for each paradigm in each individual. From these derived matrices, we computed cross-paradigm connectivity (CPC) using principal component analysis. These CPC matrices quantify shared connectivity patterns across all paradigms for each individual and thus represent the state-independent “trait” network architecture of each subject. The polygenic risk scores (PRSs) for each subject were calculated based on the genome-wide association study (GWAS) results from the Psychiatric Genomics Consortium. The scores were calculated as the sum of genome-wide risk alleles for each individual, weighted by the corresponding odds ratios for schizophrenia. We report our main findings based on the GWAS-significant threshold (P = 5×10⁻⁸). In addition, to test the robustness of our findings, we also calculated PRSs with a set of other thresholds ranging from 5×10⁻⁷ to 5×10⁻². A network-based statistic (NBS) analysis was performed to associate PRSs with CPC matrices, with age, sex, and head motion included as covariates. Significance was determined by 10,000 permutations of the original sample.
The validation sample included 44 patients with schizophrenia, 43 patients with bipolar disorder, 34 patients with attention deficit hyperactivity disorder, and 77 healthy controls drawn from the Consortium for Neuropsychiatric Phenomics (CNP). All subjects completed a battery of seven fMRI paradigms. We used this sample to examine 1) whether the identified connectomic findings were specifically detected in patients with schizophrenia; and 2) whether these findings could be related to behavioral deficits in patients with schizophrenia. Results: In the HCP sample, the NBS analysis revealed a significant association (P_FWE < 0.05) between schizophrenia PRS and a large-scale network involving a total of 69 edges connecting 54 nodes. These nodes were predominantly distributed in the brain’s visual system, default-mode system, and frontoparietal system. Specifically, higher PRSs were associated with lower connectivity for all connections in the identified network (R = -0.37). The results were significant across all paradigms (R < -0.13, P < 0.001) and remained robust across multiple PRS thresholds (R < -0.10, P < 0.02). In the CNP sample, the connectivity of the detected network differed significantly between groups (P = 0.005), a difference driven in particular by decreased connectivity in patients with schizophrenia compared with healthy controls (P_Bonferroni = 0.03). The connectivity of the identified network was significantly correlated with both performance IQ (R = 0.28, P = 0.002) and verbal IQ (R = 0.29, P = 0.001). Discussion: These findings provide the first evidence for state-independent connectome-wide associations of schizophrenia polygenic risk at the systems level and suggest that disrupted integration of sensori-cognitive information may be a hallmark of genetic effects on the brain that contributes to the pathogenesis of schizophrenia.
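
The scoring rule described in the Methods (a weighted sum of risk-allele counts, restricted to variants below a P-value threshold) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' pipeline: the function name `polygenic_risk_score` and the toy inputs are hypothetical, and the weights are taken as given. The abstract says odds ratios, while many pipelines use log odds ratios, so the function deliberately leaves that choice to the caller.

```python
def polygenic_risk_score(dosages, weights, p_values, p_threshold=5e-8):
    """Weighted sum of risk-allele counts over variants with P < p_threshold.

    dosages:  risk-allele counts (0, 1 or 2) for one individual, one per variant
    weights:  per-variant effect sizes from GWAS summary statistics
              (odds ratios or log odds ratios; the choice is left to the caller)
    p_values: per-variant GWAS P values used for thresholding
    """
    if not (len(dosages) == len(weights) == len(p_values)):
        raise ValueError("inputs must have one entry per variant")
    return sum(
        d * w
        for d, w, p in zip(dosages, weights, p_values)
        if p < p_threshold
    )


# Toy example with four hypothetical variants.
dosages = [2, 1, 0, 1]
weights = [0.5, 0.25, 1.0, 0.125]
p_values = [1e-9, 1e-8, 1e-10, 1e-3]

strict = polygenic_risk_score(dosages, weights, p_values)  # 2*0.5 + 1*0.25 = 1.25
lenient = polygenic_risk_score(dosages, weights, p_values, p_threshold=5e-2)  # 1.375
```

Re-running the same function over a range of `p_threshold` values mirrors the robustness check in the abstract, where scores were recomputed at thresholds from 5×10⁻⁷ to 5×10⁻².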


2020 ◽  
Vol 5 ◽  
pp. 206
Author(s):  
Mathilde Boecker ◽  
Alvina G. Lai

Over the past three decades, the number of people globally with diabetes mellitus has more than doubled. It is estimated that by 2030, 439 million people will be suffering from the disease, 90-95% of whom will have type 2 diabetes (T2D). In 2017, 5 million deaths globally were attributable to T2D, placing it in the top 10 global causes of death. Because T2D is a result of both genetic and environmental factors, identification of individuals with high genetic risk can help direct early interventions to prevent progression to more serious complications. Genome-wide association studies have identified ~400 variants associated with T2D that can be used to calculate polygenic risk scores (PRSs). Although PRSs are not currently more accurate than clinical predictors and do not yet predict risk with equal accuracy across all ethnic populations, they have several potential clinical uses. Here, we discuss potential uses of PRSs for predicting T2D and for informing and optimising interventions. We also touch on possible health inequality risks of PRSs and the feasibility of large-scale implementation of PRSs in clinical practice. Before PRSs can be used as a therapeutic tool, it is important that further polygenic risk models are derived using non-European genome-wide association studies to ensure that risk prediction is accurate for all ethnic groups. Furthermore, it is essential that the ethical, social and legal implications of PRSs are considered before their implementation in any context.


2018 ◽  
Vol 35 (14) ◽  
pp. 2512-2514 ◽  
Author(s):  
Bongsong Kim ◽  
Xinbin Dai ◽  
Wenchao Zhang ◽  
Zhaohong Zhuang ◽  
Darlene L Sanchez ◽  
...  

Abstract Summary: We present GWASpro, a high-performance web server for the analysis of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data coupled with complex replicated experimental designs, such as those found in plant science investigations, and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, through which complex experimental designs that may include replications, treatments, locations and times can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information: Supplementary data are available at Bioinformatics online.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xiujin Li ◽  
Hailiang Song ◽  
Zhe Zhang ◽  
Yunmao Huang ◽  
Qin Zhang ◽  
...  

Abstract Background: With the emphasis on analysing genotype-by-environment interactions within the framework of genomic selection and genome-wide association analysis, there is an increasing demand for reliable tools that can be used to simulate large-scale genomic data in order to assess related approaches. Results: We proposed a theory to simulate large-scale genomic data on genotype-by-environment interactions and added this new function to our developed tool GPOPSIM. Additionally, the simulation of a threshold trait with large-scale genomic data was also added. The validation of the simulated data indicated that GPOPSIM2.0 is an efficient tool for mimicking the phenotypic data of quantitative traits, threshold traits, and genetically correlated traits with large-scale genomic data while taking genotype-by-environment interactions into account. Conclusions: This tool is useful for assessing methods for genotype-by-environment interactions and threshold traits.
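
The abstract does not describe how GPOPSIM simulates a threshold trait. As a generic illustration only, the sketch below uses the textbook liability-threshold model: each individual's liability is a genetic value plus environmental noise, and the individuals with the highest liabilities are labelled as cases. The function name `simulate_threshold_trait`, its parameters and the toy inputs are all hypothetical and are not taken from GPOPSIM.

```python
import random
import statistics


def simulate_threshold_trait(genetic_values, heritability, prevalence, seed=0):
    """Simulate a binary trait under a liability-threshold model (generic sketch).

    Returns (liabilities, status), where status[i] is 1 for a case and 0 otherwise.
    """
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be in [0, 1]")
    rng = random.Random(seed)
    var_g = statistics.pvariance(genetic_values)
    # Environmental variance chosen so that var_g / (var_g + var_e) equals heritability.
    sd_e = (var_g * (1.0 - heritability) / heritability) ** 0.5
    liabilities = [g + rng.gauss(0.0, sd_e) for g in genetic_values]
    # Label the top `prevalence` fraction of liabilities as cases.
    n_cases = round(prevalence * len(liabilities))
    ranked = sorted(range(len(liabilities)), key=liabilities.__getitem__, reverse=True)
    status = [0] * len(liabilities)
    for i in ranked[:n_cases]:
        status[i] = 1
    return liabilities, status


# Toy usage: 200 hypothetical genetic values, heritability 0.5, prevalence 10%.
toy_rng = random.Random(1)
toy_genetic_values = [toy_rng.gauss(0.0, 1.0) for _ in range(200)]
toy_liabilities, toy_status = simulate_threshold_trait(toy_genetic_values, 0.5, 0.1)
```

Assigning case status by rank rather than by a fixed cut-off keeps the realised prevalence exact in small samples, a design choice made here for simplicity.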


2018 ◽  
Author(s):  
Tom G. Richardson ◽  
Sean Harrison ◽  
Gibran Hemani ◽  
George Davey Smith

Abstract The age of large-scale genome-wide association studies (GWAS) has provided us with an unprecedented opportunity to evaluate the genetic liability of complex disease using polygenic risk scores (PRS). In this study, we have analysed 162 PRS (P < 5×10⁻⁵) derived from GWAS and 551 heritable traits from the UK Biobank study (N=334,398). Findings can be investigated using a web application (http://mrcieu.mrsoftware.org/PRS_atlas/), which we envisage will help uncover both known and novel mechanisms which contribute towards disease susceptibility. To demonstrate this, we have investigated the results from a phenome-wide evaluation of schizophrenia genetic liability. Amongst the findings were inverse associations with measures of cognitive function, for which extensive follow-up analyses using Mendelian randomization (MR) provided evidence of a causal relationship. We have also investigated the effect of multiple risk factors on disease using mediation and multivariable MR frameworks. Our atlas provides a resource for future endeavours seeking to unravel the causal determinants of complex disease.


2018 ◽  
Author(s):  
Roman Teo Oliynyk

Abstract Background: Genome-wide association studies and other computational biology techniques are gradually discovering the causal gene variants that contribute to late-onset human diseases. After more than a decade of genome-wide association study efforts, these can account for only a fraction of the heritability implied by familial studies, the so-called “missing heritability” problem. Methods: Computer simulations of polygenic late-onset diseases in an aging population have quantified the risk allele frequency decrease at older ages caused by individuals with higher polygenic risk scores becoming ill proportionately earlier. This effect is most prominent for diseases characterized by high cumulative incidence and high heritability, examples of which include Alzheimer’s disease, coronary artery disease, cerebral stroke, and type 2 diabetes. Results: The incidence rate for late-onset diseases grows exponentially for decades after early onset ages, guaranteeing that the cohorts used for genome-wide association studies overrepresent older individuals with lower polygenic risk scores, whose disease cases are disproportionately due to environmental causes such as old age itself. This mechanism explains the decline in clinical predictive power with age and the lower discovery power of familial studies of heritability and genome-wide association studies. It also explains the relatively constant-with-age heritability found for late-onset diseases of lower prevalence, exemplified by cancers. Conclusions: For late-onset polygenic diseases showing high cumulative incidence together with high initial heritability, rather than using relatively old age-matched cohorts, study cohorts combining the youngest possible cases with the oldest possible controls may significantly improve the discovery power of genome-wide association studies.


2019 ◽  
Author(s):  
Ying Sheng ◽  
Chiung-Yu Huang ◽  
Siarhei Lobach ◽  
Lydia Zablotska ◽  
Iryna Lobach ◽  
...  

ABSTRACT Large-scale genome-wide analysis scans provide massive volumes of genetic variants on large numbers of cases and controls that can be used to estimate genetic effects. Yet, the sets of non-genetic variables available in public databases are often brief. It is known that omitting a continuous variable from a logistic regression model can result in biased estimates of odds ratios (OR) (e.g., Gail et al (1984), Neuhaus et al (1993), Hauck et al (1991), Zeger et al (1988)). We are interested in assessing what information is needed to recover the bias in the OR estimate of genotype due to omitting a continuous variable in settings where the actual values of the omitted variable are not available. We derive two estimating procedures that can recover the degree of bias based on a conditional density of the omitted variable or on knowledge of the distribution of the omitted variable. Importantly, our derivations show that omitting a continuous variable can result in either under- or over-estimation of the genetic effects. We performed extensive simulation studies to examine bias, variability, false positive rate, and power in the model that omits a continuous variable. We show the application to two genome-wide studies of Alzheimer’s disease. Data Availability Statement: The data that support the findings of this study are openly available in the Database of Genotypes and Phenotypes at [https://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000372.v1.p1], reference number [phs000372.v1.p1] and at the Alzheimer’s Disease Neuroimaging Initiative http://adni.loni.usc.edu/.
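
The bias from omitting a covariate in logistic regression can be seen numerically without fitting any model. The sketch below is an illustration under simplifying assumptions and is not one of the two estimating procedures derived in the paper: the true model is logit P(Y=1) = beta0 + beta_g·G + beta_x·X with X standard normal and independent of the genotype G, and the function `marginal_odds_ratio` (a hypothetical name) computes the odds ratio for G that results once X is averaged out on a grid.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_odds_ratio(beta0, beta_g, beta_x, n_grid=2001, half_width=8.0):
    """Odds ratio for G = 1 versus G = 0 after averaging over X ~ N(0, 1).

    Assumes X is independent of G. The integral over X is approximated on an
    evenly spaced grid with weights proportional to the normal density.
    """
    step = 2.0 * half_width / (n_grid - 1)
    xs = [-half_width + i * step for i in range(n_grid)]
    ws = [math.exp(-0.5 * x * x) for x in xs]
    total = sum(ws)

    def mean_risk(g):
        # Average disease risk at genotype g, integrating out the omitted X.
        return sum(w * expit(beta0 + beta_g * g + beta_x * x) for w, x in zip(ws, xs)) / total

    p1, p0 = mean_risk(1), mean_risk(0)
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Conditional OR for G is 2 in both calls; only the effect of X differs.
or_without_x_effect = marginal_odds_ratio(-1.0, math.log(2.0), 0.0)
or_with_x_effect = marginal_odds_ratio(-1.0, math.log(2.0), 2.0)
```

In this independent-covariate setting the second value falls between 1 and 2, so the odds ratio is attenuated toward the null even though X does not confound G. The over-estimation case mentioned in the abstract requires dependence between the omitted variable and genotype, which this sketch does not model.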


2014 ◽  
Vol 205 (2) ◽  
pp. 113-119 ◽  
Author(s):  
Wouter J. Peyrot ◽  
Yuri Milaneschi ◽  
Abdel Abdellaoui ◽  
Patrick F. Sullivan ◽  
Jouke J. Hottenga ◽  
...  

Background: Research on gene×environment interaction in major depressive disorder (MDD) has thus far primarily focused on candidate genes, although genetic effects are known to be polygenic. Aims: To test whether the effect of polygenic risk scores on MDD is moderated by childhood trauma. Method: The study sample consisted of 1645 participants with a DSM-IV diagnosis of MDD and 340 screened controls from The Netherlands. Chronic or remitted episodes (severe MDD) were present in 956 participants. The occurrence of childhood trauma was assessed with the Childhood Trauma Interview and the polygenic risk scores were based on genome-wide meta-analysis results from the Psychiatric Genomics Consortium. Results: The polygenic risk scores and childhood trauma independently affected MDD risk, and evidence was found for interaction as departure from both multiplicativity and additivity, indicating that the effect of polygenic risk scores on depression is increased in the presence of childhood trauma. The interaction effects were similar in predicting all MDD risk and severe MDD risk, and explained a proportion of variation in MDD risk comparable to the polygenic risk scores themselves. Conclusions: The interaction effect found between polygenic risk scores and childhood trauma implies that (1) studies on direct genetic effect on MDD gain power by focusing on individuals exposed to childhood trauma, and that (2) individuals with both high polygenic risk scores and exposure to childhood trauma are particularly at risk for developing MDD.
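
The two interaction scales named in the Results can be made concrete with standard epidemiological summaries. The abstract does not state which statistics the authors computed, so the sketch below is only an illustration: it uses the relative excess risk due to interaction (RERI) for departure from additivity and the ratio of relative risks for departure from multiplicativity. The function name `interaction_measures` and the example numbers are hypothetical.

```python
def interaction_measures(rr_gene, rr_env, rr_both):
    """Summarise interaction between two binary exposures on two scales.

    rr_gene: relative risk with only the genetic exposure (e.g. high PRS)
    rr_env:  relative risk with only the environmental exposure
    rr_both: relative risk with both exposures
    All relative risks are versus the doubly unexposed group.

    Returns (reri, ratio):
      reri  = rr_both - rr_gene - rr_env + 1   (0 means exact additivity)
      ratio = rr_both / (rr_gene * rr_env)     (1 means exact multiplicativity)
    """
    reri = rr_both - rr_gene - rr_env + 1.0
    ratio = rr_both / (rr_gene * rr_env)
    return reri, ratio


# Hypothetical numbers: the joint effect exceeds both the additive and the
# multiplicative expectation, i.e. departure from both scales.
reri, ratio = interaction_measures(2.0, 4.0, 16.0)  # (11.0, 2.0)
```

A positive RERI together with a ratio above 1 corresponds to the pattern the abstract describes, where the effect of the polygenic score is larger in the presence of the environmental exposure on both scales.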

