Identifying non-identical-by-descent rare variants in population-scale whole genome sequencing data

2020 ◽  
Author(s):  
Kelsey E. Johnson ◽  
Benjamin F. Voight

Abstract: The site frequency spectrum in human populations is not accurately modeled by an infinite sites model, which assumes that all mutations are unique. Despite the pervasiveness of recurrent mutations, we lack computational methods to identify these events at specific sites in population sequencing data. Rare alleles that are identical-by-descent (IBD) are expected to segregate on a long, shared haplotype background that descends from a common ancestor. However, alleles introduced by recurrent mutation or by non-crossover gene conversions are identical-by-state and will have a shorter expected shared haplotype background. We hypothesized that the expected difference in shared haplotype background length can distinguish IBD and non-IBD variants in population sequencing data without pedigree information. We implemented a Bayesian hierarchical model and used Gibbs sampling to estimate the posterior probability of IBD state for rare variants, using simulations to demonstrate that our approach accurately distinguishes rare IBD and non-IBD variants. Applying our method to whole genome sequencing data from 3,621 individuals in the UK10K consortium, we found that non-IBD variants correlated with higher local mutation rates and genomic features like replication timing. Using a heuristic to categorize non-IBD variants as gene conversions or recurrent mutations, we found that potential gene conversions had expected properties such as enriched local GC content. By identifying recurrent mutations, we can better understand the spectrum of recent mutations in human populations, a source of genetic variation driving evolution and a key factor in understanding recent demographic history.
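The idea of separating IBD from non-IBD alleles by shared haplotype length can be illustrated with a toy Gibbs sampler. The sketch below is not the authors' model: it assumes, purely for illustration, that shared haplotype lengths follow a two-component exponential mixture (short for non-IBD, long for IBD) with a flat Beta prior on the IBD fraction and weak Gamma priors on the rates. The function name `gibbs_ibd` and all parameter values are hypothetical.

```python
import math
import random


def gibbs_ibd(lengths, n_iter=600, burn_in=200, seed=0):
    """Toy Gibbs sampler: posterior probability that each allele is IBD.

    Assumed model (illustrative only, not the published one):
      length | non-IBD ~ Exponential(rate lam0)   (short haplotypes)
      length | IBD     ~ Exponential(rate lam1)   (long haplotypes, lam1 < lam0)
      P(IBD) = pi, with pi ~ Beta(1, 1) and each rate ~ Gamma(shape=a, rate=b).
    """
    rng = random.Random(seed)
    n = len(lengths)
    total = sum(lengths)
    a, b = 1.0, 1.0  # hypothetical weak Gamma prior on both rates

    # Initialise labels by a median split: longer than the median -> IBD.
    cutoff = sorted(lengths)[n // 2]
    z = [1 if x > cutoff else 0 for x in lengths]
    ibd_counts = [0] * n

    for it in range(n_iter):
        n1 = sum(z)
        n0 = n - n1
        s1 = sum(x for x, zi in zip(lengths, z) if zi == 1)
        s0 = total - s1

        # Conjugate updates for the mixing proportion and the two rates.
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        pi = rng.betavariate(1 + n1, 1 + n0)
        lam1 = rng.gammavariate(a + n1, 1.0 / (b + s1))
        lam0 = rng.gammavariate(a + n0, 1.0 / (b + s0))

        # Ordering constraint against label switching: IBD has the smaller rate.
        if lam1 > lam0:
            lam0, lam1 = lam1, lam0
            pi = 1.0 - pi
        pi = min(max(pi, 1e-12), 1.0 - 1e-12)

        # Resample every label from its conditional, working in log-odds.
        base = math.log(pi / (1.0 - pi)) + math.log(lam1 / lam0)
        for i, x in enumerate(lengths):
            log_odds = max(-700.0, min(700.0, base + (lam0 - lam1) * x))
            p_ibd = 1.0 / (1.0 + math.exp(-log_odds))
            z[i] = 1 if rng.random() < p_ibd else 0

        if it >= burn_in:
            for i in range(n):
                ibd_counts[i] += z[i]

    kept = n_iter - burn_in
    return [c / kept for c in ibd_counts]


if __name__ == "__main__":
    sim = random.Random(1)
    # Simulated lengths in arbitrary units: 100 short (non-IBD), 100 long (IBD).
    data = [sim.expovariate(5.0) for _ in range(100)]
    data += [sim.expovariate(0.2) for _ in range(100)]
    posterior = gibbs_ibd(data)
    print("mean P(IBD), simulated non-IBD:", sum(posterior[:100]) / 100)
    print("mean P(IBD), simulated IBD:", sum(posterior[100:]) / 100)
```

Averaging the sampled labels after burn-in gives a per-variant posterior probability of IBD, which is the kind of quantity the abstract describes estimating. A realistic model would also need to account for allele frequency, recombination rate, and demography, none of which this sketch attempts.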

2021 ◽  
Author(s):  
KE Joyce ◽  
E Onabanjo ◽  
S Brownlow ◽  
F Nur ◽  
KO Olupona ◽  
...  

Abstract: Possession of a clinical or molecular disease label alters the context in which life-course events operate, but rarely explains the phenotypic variability observed by clinicians. Whole genome sequencing of unselected endothelial vasculopathy patients demonstrated more than a third had rare, likely deleterious variants in clinically-relevant genes unrelated to their vasculopathy (1 in 10 within platelet genes; 1 in 8 within coagulation genes; and 1 in 4 within erythrocyte hemolytic genes). High erythrocyte membrane variant rates paralleled genomic damage and prevalence indices in the general population. In blinded analyses, patients with greater hemorrhagic severity that had been attributed solely to their vasculopathy had more deleterious variants in platelet (Spearman ρ=0.25, p=0.008) and coagulation (Spearman ρ=0.21, p=0.024) genes. We conclude that rare diseases can provide insights for medicine beyond their primary pathophysiology, and propose a framework based on rare variants to inform interpretative approaches to accelerate clinical impact from whole genome sequencing.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zihuai He ◽  
Linxi Liu ◽  
Chen Wang ◽  
Yann Le Guen ◽  
Justin Lee ◽  
...  

Abstract: The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer’s Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.
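The abstract names the knockoff framework without giving its details. As a rough illustration of the general idea only, the sketch below applies the standard single-knockoff "knockoff+" selection rule to a vector of feature statistics W, where a large positive W_j means the original variant looks more associated than its synthetic knockoff copy. This is not the multiple-knockoff procedure the authors describe; the function name `knockoff_plus_select` and the example statistics are hypothetical.

```python
def knockoff_plus_select(w, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    w[j] is assumed to be a knockoff feature statistic, for example
    |T_j| - |T_j_knockoff| for some association score T (an assumption here).
    The threshold is the smallest t among the nonzero |w[j]| such that
        (1 + #{j : w[j] <= -t}) / max(1, #{j : w[j] >= t}) <= q.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_negative = sum(1 for x in w if x <= -t)  # estimates false discoveries
        n_positive = sum(1 for x in w if x >= t)   # discoveries at threshold t
        if (1 + n_negative) / max(1, n_positive) <= q:
            return [j for j, x in enumerate(w) if x >= t]
    return []  # no threshold meets the target level


if __name__ == "__main__":
    # Hypothetical statistics: ten strong signals followed by four near-null ones.
    w = [5, 4, 3.5, 3, 2.5, 2, 1.5, 1.2, 1.1, 1.05, -0.5, 0.3, -0.2, 0.1]
    print(knockoff_plus_select(w, q=0.2))
```

The count of large negative statistics acts as a stand-in for the number of false positives among the large positive ones, because null variants are equally likely to favor the original or the knockoff. That symmetry is what lets the rule control the false discovery rate without p-values.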

