Significant abundance of cis configurations of mutations in diploid human genomes

2017 ◽  
Author(s):  
Margret R. Hoehe ◽  
Ralf Herwig ◽  
Qing Mao ◽  
Brock A. Peters ◽  
Radoje Drmanac ◽  
...  

Abstract To fully understand human genetic variation, one must assess the specific distribution of variants between the two chromosomal homologues of genes, and any functional units of interest, as the phase of variants can significantly impact gene function and phenotype. To this end, we have systematically analyzed 18,121 autosomal protein-coding genes in 1,092 statistically phased genomes from the 1000 Genomes Project, and an unprecedented number of 184 experimentally phased genomes from the Personal Genome Project. Here we show that mutations predicted to functionally alter the protein, and coding variants as a whole, are not randomly distributed between the two homologues of a gene, but do occur significantly more frequently in cis- than trans-configurations, with cis/trans ratios of ∼60:40. Significant cis-abundance was observed in virtually all individual genomes in all populations. Nearly all variable genes exhibited either cis or trans configurations of protein-altering mutations in significant excess, allowing distinction of cis- and trans-abundant genes. These common patterns of phase were largely constituted by a shared, global set of phase-sensitive genes. We show significant enrichment of this global set with gene sets indicating its involvement in adaptation and evolution. Moreover, cis- and trans-abundant genes were found functionally distinguishable, and exhibited strikingly different distributional patterns of protein-altering mutations. This work establishes common patterns of phase as key characteristics of diploid human exomes and provides evidence for their potential functional significance. Thus, it highlights the importance of phase for the interpretation of protein-coding genetic variation, challenging the current conceptual and functional interpretation of autosomal genes.
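To make the cis/trans distinction in the abstract above concrete, the following is a minimal sketch, not the authors' pipeline. It assumes biallelic variants with VCF-style phased genotype strings (`"1|0"`, `"0|1"`), and it assumes a gene counts as cis when all of its heterozygous mutations sit on one homologue and as trans when they are split across both. The function names `classify_gene` and `cis_fraction` are hypothetical.

```python
from collections import Counter


def classify_gene(phased_genotypes):
    """Return 'cis', 'trans', or None for one gene in one individual.

    phased_genotypes: list of strings such as "1|0" or "0|1"
    (assumed biallelic; "1" marks the mutant allele).
    """
    hap1 = hap2 = 0
    for gt in phased_genotypes:
        left, right = gt.split("|")
        if left == right:
            continue  # homozygous sites carry no phase information
        if left == "1":
            hap1 += 1
        else:
            hap2 += 1
    if hap1 + hap2 < 2:
        return None  # phase is undefined with fewer than two het mutations
    return "cis" if hap1 == 0 or hap2 == 0 else "trans"


def cis_fraction(genes):
    """Fraction of phase-informative genes that are in cis.

    genes: dict mapping a gene name to its list of phased genotypes.
    """
    counts = Counter(classify_gene(gts) for gts in genes.values())
    informative = counts["cis"] + counts["trans"]
    return counts["cis"] / informative if informative else None
```

Under these assumptions, a cis fraction near 0.6 across the genes of one genome would correspond to the ∼60:40 cis/trans ratio reported in the abstract.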

2020 ◽  
Vol 37 (9) ◽  
pp. 2531-2548
Author(s):  
Gerrald A Lodewijk ◽  
Diana P Fernandes ◽  
Iraklis Vretzakis ◽  
Jeanne E Savage ◽  
Frank M J Jacobs

Abstract Ever since the availability of genomes from Neanderthals, Denisovans, and ancient humans, the field of evolutionary genomics has been searching for protein-coding variants that may hold clues to how our species evolved over the last ∼600,000 years. In this study, we identify such variants in the human-specific NOTCH2NL gene family, which were recently identified as possible contributors to the evolutionary expansion of the human brain. We find evidence for the existence of unique protein-coding NOTCH2NL variants in Neanderthals and Denisovans which could affect their ability to activate Notch signaling. Furthermore, in the Neanderthal and Denisovan genomes, we find unusual NOTCH2NL configurations, not found in any of the modern human genomes analyzed. Finally, genetic analysis of archaic and modern humans reveals ongoing adaptive evolution of modern human NOTCH2NL genes, identifying three structural variants acting complementarily to drive our genome to produce a lower dosage of NOTCH2NL protein. Because copy-number variations of the 1q21.1 locus, encompassing NOTCH2NL genes, are associated with severe neurological disorders, this seemingly contradictory drive toward low levels of NOTCH2NL protein indicates that the optimal dosage of NOTCH2NL may not yet have been settled in the human population.


2018 ◽  
Author(s):  
Gabriel E. Hoffman ◽  
Eric E. Schadt ◽  
Panos Roussos

Abstract Identifying causal variants underlying disease risk and adoption of personalized medicine are currently limited by the challenge of interpreting the functional consequences of genetic variants. Predicting the functional effects of disease-associated protein-coding variants is increasingly routine. Yet the vast majority of risk variants are non-coding, and predicting the functional consequence and prioritizing variants for functional validation remains a major challenge. Here we develop a deep learning model to accurately predict locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, we generate a score of the predicted epigenetic consequences for 438 million variants. These impact scores are assay-specific, are predictive of allele-specific transcription factor binding and are enriched for variants associated with gene expression and disease risk. Nucleotide-level functional consequence scores for non-coding variants can refine the mechanism of known causal variants, identify novel risk variants and prioritize downstream experiments.
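The reference-versus-alternative scoring idea in the abstract above can be sketched as follows. This is an illustration only: the abstract does not give the scoring formula, so the score is assumed here to be a simple difference between the predicted signals for the two alleles. The predictor `toy_gc_model` is a hypothetical stand-in (GC fraction) for the deep learning model, and `apply_variant` and `impact_score` are likewise hypothetical names.

```python
def apply_variant(seq, pos, ref, alt):
    """Return seq with ref replaced by alt at 0-based position pos."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]


def impact_score(predict, seq, pos, ref, alt):
    """Assumed score: predicted signal for alt minus predicted signal for ref."""
    return predict(apply_variant(seq, pos, ref, alt)) - predict(seq)


def toy_gc_model(seq):
    """Hypothetical stand-in predictor: fraction of G/C bases in the window."""
    return sum(base in "GC" for base in seq) / len(seq)
```

Any callable that maps a DNA string to a number can be passed as `predict`, so a trained model for a single assay could be substituted for the toy predictor, giving one score per assay.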


2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


Author(s):  
Xiaoming Jia ◽  
Fernando S. Goes ◽  
Adam E. Locke ◽  
Duncan Palmer ◽  
Weiqing Wang ◽  
...  

Abstract Bipolar disorder (BD) is a serious mental illness with substantial common variant heritability. However, the role of rare coding variation in BD is not well established. We examined the protein-coding (exonic) sequences of 3,987 unrelated individuals with BD and 5,322 controls of predominantly European ancestry across four cohorts from the Bipolar Sequencing Consortium (BSC). We assessed the burden of rare, protein-altering, single nucleotide variants classified as pathogenic or likely pathogenic (P-LP) both exome-wide and within several groups of genes with phenotypic or biologic plausibility in BD. While we observed an increased burden of rare coding P-LP variants within 165 genes identified as BD GWAS regions in 3,987 BD cases (meta-analysis OR = 1.9, 95% CI = 1.3–2.8, one-sided p = 6.0 × 10⁻⁴), this enrichment did not replicate in an additional 9,929 BD cases and 14,018 controls (OR = 0.9, one-sided p = 0.70). Although BD shares common variant heritability with schizophrenia (SCZ), in the BSC sample we did not observe a significant enrichment of P-LP variants in SCZ GWAS genes, in two classes of neuronal synaptic genes (RBFOX2 and FMRP) associated with SCZ or in loss-of-function intolerant genes. In this study, the largest analysis of exonic variation in BD, individuals with BD do not carry a replicable enrichment of rare P-LP variants across the exome or in any of several groups of genes with biologic plausibility. Moreover, despite a strong shared susceptibility between BD and SCZ through common genetic variation, we do not observe an association between BD risk and rare P-LP coding variants in genes known to modulate risk for SCZ.


2020 ◽  
Author(s):  
Henry J Martell ◽  
Darren K Griffin ◽  
Mark N Wass

Abstract The availability of thousands of individual genomes provides many opportunities to understand genetic variation and the relationship to phenotype, particularly disease. However, this remains challenging as it is often difficult to identify if a non-synonymous variant alters protein structure and function. Many computational methods have been developed but they typically interpret individual variants in isolation, despite the possibility of variant-variant interactions. Here, we combine the genetic variation data present in the 1000 Genomes Project with protein structural data to identify variant-variant interactions within individual human genomes. We find more than 4,000 combinations of variants that are located close together in 3D structure and more than 1,200 in protein-protein interfaces. Many variant combinations include amino acid changes that are compensatory, such as maintaining charges or functional groups, thus supporting that these are coevolutionary events. This highlights the need for variant interpretation and precision medicine to consider the gestalt effects of variants.


2018 ◽  
Author(s):  
Justin M. Zook ◽  
Jennifer McDaniel ◽  
Hemang Parikh ◽  
Haynes Heaton ◽  
Sean A. Irvine ◽  
...  

Abstract Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent with few restrictions on availability of samples and data is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and elucidate strengths and weaknesses of a method.


2021 ◽  
Author(s):  
Lambert Moyon ◽  
Camille Berthelot ◽  
Alexandra Louis ◽  
Nga Thi Thuy Nguyen ◽  
Hugues Roest Crollius

Whole genome sequencing (WGS) is increasingly used to diagnose medical conditions of genetic origin. While both coding and non-coding DNA variants contribute to a wide range of diseases, most patients who receive a WGS-based diagnosis today harbour a protein-coding mutation. Functional interpretation and prioritization of non-coding variants represents a persistent challenge, and disease-causing non-coding variants remain largely unidentified. Depending on the disease, WGS fails to identify a candidate variant in 20-80% of patients, severely limiting the usefulness of sequencing for personalised medicine. Here we present FINSURF, a machine-learning approach to predict the functional impact of non-coding variants in regulatory regions. FINSURF outperforms state-of-the-art methods, owing to control optimisation during training. In addition to ranking candidate variants, FINSURF also delivers diagnostic information on functional consequences of mutations. We applied FINSURF to a diverse set of 30 diseases with described causative non-coding mutations, and correctly identified the disease-causative non-coding variant within the ten top hits in 22 cases. FINSURF is implemented as an online server as well as custom browser tracks, and provides a quick and efficient solution to prioritize candidate non-coding variants in realistic clinical settings.


2019 ◽  
Author(s):  
Matthew R Hildebrandt ◽  
Miriam S Reuter ◽  
Wei Wei ◽  
Naeimeh Tayebi ◽  
Jiajie Liu ◽  
...  

Summary Induced Pluripotent Stem Cells (iPSC) derived from healthy individuals are important controls for disease modeling studies. To create a resource of genetically annotated iPSCs, we reprogrammed footprint-free lines from four volunteers in the Personal Genome Project Canada (PGPC). Multilineage directed differentiation efficiently produced functional cortical neurons, cardiomyocytes and hepatocytes. Pilot users further demonstrated line versatility by generating kidney organoids, T-lymphocytes and sensory neurons. A frameshift knockout was introduced into MYBPC3 and these cardiomyocytes exhibited the expected hypertrophic phenotype. Whole genome sequencing (WGS) based annotation of PGPC lines revealed on average 20 coding variants. Importantly, nearly all annotated PGPC and HipSci lines harboured at least one pre-existing or acquired variant with cardiac, neurological or other disease associations. Overall, PGPC lines were efficiently differentiated by multiple users into cell types found in six tissues for disease modeling, and clinical annotation highlighted variant-preferred lines for use as unaffected controls in specific disease settings.


2019 ◽  
Author(s):  
Chen-Shan Chin ◽  
Justin Wagner ◽  
Qiandong Zeng ◽  
Erik Garrison ◽  
Shilpa Garg ◽  
...  

Abstract We develop the first human benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle/Personal Genome Project Ashkenazi son (HG002). As a proof-of-principle, we focus on a medically important, highly variable, 5 million base-pair region - the Major Histocompatibility Complex (MHC). Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct base-level accurate, phased de novo assemblies from the reads. We assemble a single haplotig (haplotype-specific contig) for each haplotype, and align reads back to each assembled haplotig to identify two regions of lower confidence. We align the haplotigs to the reference, call phased small and structural variants, and define the first small variant benchmark for the MHC, covering 21,496 small variants in 4.58 million base-pairs (92% of the MHC). The assembly-based benchmark is 99.95% concordant with a draft mapping-based benchmark from the same long and linked reads within both benchmark regions, but covers 50% more variants outside the mapping-based benchmark regions. The haplotigs and variant calls are completely concordant with phased clinical HLA types for HG002. This benchmark reliably identifies false positives and false negatives from mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks. These methods demonstrate a path towards future diploid assembly-based benchmarks for other complex regions of the genome.


2020 ◽  
Author(s):  
Sophia Cameron-Christie ◽  
Alex Mackay ◽  
Quanli Wang ◽  
Henric Olsson ◽  
Bastian Angermann ◽  
...  

Abstract Introduction: Asthma risk is a complex interplay between genetic susceptibility and environment. Despite many significantly-associated common variants, the contribution of rarer variants with potentially greater effect sizes has not been as extensively studied. We present an exome-based study adopting 24,576 cases and 120,530 controls to assess the contribution of rare protein-coding variants to the risk of early-onset or all-comer asthma. Methods: We performed case-control analyses on three genetic units: variant-, gene- and pathway-level, using sequence data from the Scandinavian Asthma Genetic Study and UK Biobank participants with asthma. Cases were defined as all-comer asthma (n=24,576) and early-onset asthma (n=5,962). Controls were 120,530 UK Biobank participants without reported history of respiratory illness. Results: Variant-level analyses identified statistically significant variants at moderate-to-common allele frequency, including protein-truncating variants in FLG and IL33. Asthma risk was significantly increased not only by individual, common FLG protein-truncating variants, but also among the collection of rare-to-private FLG protein-truncating variants (p=6.8×10⁻⁷). This signal was driven by early-onset asthma and did not correlate with circulating eosinophil levels. In contrast, a single splice variant in IL33 was significantly protective (p=8.0×10⁻¹⁰), while the collection of remaining IL33 protein-truncating variants showed no class effect (p=0.54). A pathway-based analysis identified that protein-truncating variants in loss-of-function intolerant genes were significantly enriched among individuals with asthma. Conclusions: Access to the full allele frequency spectrum of protein-coding variants provides additional clarity about the potential mechanisms of action for FLG and IL33. Beyond these two significant drivers, we detected a significant enrichment of protein-truncating variants in loss-of-function intolerant genes.

