Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Dimitrios Vitsios ◽  
Ryan S. Dhindsa ◽  
Lawrence Middleton ◽  
Ayal B. Gussow ◽  
Slavé Petrovski

Abstract: Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.
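The abstract does not spell out how the sliding-window score is computed. As a rough illustration only, the sketch below assumes a residual-based formulation in the spirit of the original gene-level RVIS: count all variants and common variants per window, regress common on all, and use the standardized residual as the window score. The function names, window size, step and frequency threshold are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def window_counts(variants, start, end, window, step, common_maf):
    """Count all and common variants per window.

    variants: list of (position, minor_allele_frequency) tuples.
    window, step and common_maf are illustrative parameters, not the
    values used by the authors.
    """
    rows = []
    for w_start in range(start, end - window + 1, step):
        mafs = [maf for pos, maf in variants if w_start <= pos < w_start + window]
        n_common = sum(1 for maf in mafs if maf >= common_maf)
        rows.append((w_start, len(mafs), n_common))
    return rows

def residual_scores(rows):
    """Regress common-variant counts on all-variant counts (ordinary least
    squares) and return each window's residual divided by the residual
    standard deviation. This is an assumed simplification."""
    xs = [n_all for _, n_all, _ in rows]
    ys = [n_common for _, _, n_common in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = pstdev(residuals)
    return [(w_start, r / sd) for (w_start, _, _), r in zip(rows, residuals)]
```

Under this assumed convention, a window with fewer common variants than the regression predicts gets a negative score, which would be read as greater intolerance to variation.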

Author(s):  
Yichuan Liu ◽  
Hui-Qi Qu ◽  
Frank D. Mentch ◽  
Jingchun Qu ◽  
Xiao Chang ◽  
...  

Abstract: Mental disorders present a global health concern, and their diagnosis can be challenging. The diagnosis is even harder for patients who have more than one type of mental disorder, especially for toddlers who are not able to complete questionnaires or standardized rating scales for diagnosis. In the past decade, multiple genomic association signals have been reported for mental disorders, some of which present attractive drug targets. Concurrently, machine learning algorithms, especially deep learning algorithms, have been successful in the diagnosis and/or labeling of complex diseases, such as attention deficit hyperactivity disorder (ADHD) or cancer. In this study, we focused on eight common mental disorders, including ADHD, depression, anxiety, autism, intellectual disabilities, speech/language disorder, developmental delays, and oppositional defiant disorder, in African Americans, an ethnic minority population. Blood-derived whole genome sequencing data from 4179 individuals were generated, including 1384 patients with the diagnosis of at least one mental disorder. The burden of genomic variants in coding/non-coding regions was applied as feature vectors in the deep learning algorithm. Our model showed ~65% accuracy in differentiating patients from controls. Ability to label patients with multiple disorders was similarly successful, with a Hamming loss score less than 0.3, while exact diagnostic matches are around 10%. Genes in genomic regions with the highest weights showed enrichment of biological pathways involved in immune responses, antigen/nucleic acid binding, chemokine signaling pathway, and G-protein receptor activities.
Notably, variants in non-coding regions (e.g., ncRNA, intronic, and intergenic) performed as well as variants in coding regions; however, unlike coding region variants, variants in non-coding regions do not show genomic hotspots and have much narrower standard deviations, indicating that they probably serve as alternative markers.
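The two multi-label metrics quoted above measure different things: Hamming loss is the fraction of individual disorder labels predicted wrongly, while the exact-match rate counts only patients whose full set of labels is correct. The sketch below illustrates both on made-up data with eight labels per patient; the function names and example values are hypothetical and do not reproduce the study's model or results.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that differ between truth and prediction."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total

def exact_match_ratio(y_true, y_pred):
    """Fraction of patients whose whole label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)

# Hypothetical data: 3 patients, 8 disorder labels each (1 = diagnosed).
y_true = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
y_pred = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
]

print(hamming_loss(y_true, y_pred))
print(exact_match_ratio(y_true, y_pred))
```

In this toy case only 2 of 24 labels are wrong, yet just one of three patients is matched exactly, which shows how a low Hamming loss can coexist with a low exact-match rate.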


Author(s):  
Johanna L. Jones ◽  
Mark A. Corbett ◽  
Elise Yeaman ◽  
Duran Zhao ◽  
Jozef Gecz ◽  
...  

Abstract: Inherited paediatric cataract is a rare Mendelian disease that results in visual impairment or blindness due to a clouding of the eye’s crystalline lens. Here we report an Australian family with isolated paediatric cataract, which we had previously mapped to Xq24. Linkage at Xq24–25 (LOD = 2.53) was confirmed, and the region refined with a denser marker map. In addition, two autosomal regions with suggestive evidence of linkage were observed. A segregating 127 kb deletion (chrX:g.118373226_118500408del) in the Xq24–25 linkage region was identified from whole-genome sequencing data. This deletion completely removed a commonly deleted long non-coding RNA gene LOC101928336 and truncated the protein coding progesterone receptor membrane component 1 (PGRMC1) gene following exon 1. A literature search revealed a report of two unrelated males with non-syndromic intellectual disability, as well as congenital cataract, who had contiguous gene deletions that accounted for their intellectual disability but also disrupted the PGRMC1 gene. A morpholino-induced pgrmc1 knockdown in a zebrafish model produced significant cataract formation, supporting a role for PGRMC1 in lens development and cataract formation. We hypothesise that the loss of PGRMC1 causes cataract through disrupted PGRMC1-CYP51A1 protein–protein interactions and altered cholesterol biosynthesis. The cause of paediatric cataract in this family is the truncating deletion of PGRMC1, which we report as a novel cataract gene.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhongbo Chen ◽  
David Zhang ◽  
Regina H. Reynolds ◽  
Emil K. Gustavsson ◽  
...  

Abstract: Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases have high CNCR density, including APOE, highlighting an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript to be more abundant in Alzheimer’s disease with more severe tau and amyloid pathological burden. Thus, we demonstrate potential association of human-lineage-specific sequences in brain development and neurological disease.


2021 ◽  
Vol 11 (2) ◽  
pp. 131
Author(s):  
Laura B. Scheinfeldt ◽  
Andrew Brangan ◽  
Dara M. Kusic ◽  
Sudhir Kumar ◽  
Neda Gharani

Pharmacogenomics holds the promise of personalized drug efficacy optimization and drug toxicity minimization. Much of the research conducted to date, however, suffers from an ascertainment bias towards European participants. Here, we leverage publicly available, whole genome sequencing data collected from global populations, evolutionary characteristics, and annotated protein features to construct a new in silico machine learning pharmacogenetic identification method called XGB-PGX. When applied to pharmacogenetic data, XGB-PGX outperformed all existing prediction methods and identified over 2000 new pharmacogenetic variants. While there are modest pharmacogenetic allele frequency distribution differences across global population samples, the most striking distinction is between the relatively rare putatively neutral pharmacogene variants and the relatively common established and newly predicted functional pharmacogenetic variants. Our findings therefore support a focus on individual patient pharmacogenetic testing rather than on clinical presumptions about patient race, ethnicity, or ancestral geographic residence. We further encourage more attention be given to the impact of common variation on drug response and propose a new ‘common treatment, common variant’ perspective for pharmacogenetic prediction that is distinct from the types of variation that underlie complex and Mendelian disease. XGB-PGX has identified many new pharmacovariants that are present across all global communities; however, communities that have been underrepresented in genomic research are likely to benefit the most from XGB-PGX’s in silico predictions.


2020 ◽  
Author(s):  
Sihao Xiao ◽  
Zhentian Kai ◽  
David Brown ◽  
Claire L Shovlin ◽  

SUMMARY: Whole genome sequencing (WGS) is championed by the UK National Health Service (NHS) to identify genetic variants that cause particular diseases. The full potential of WGS has yet to be realised as early data analytic steps prioritise protein-coding genes, and effectively ignore the less well annotated non-coding genome which is rich in transcribed and critical regulatory regions. To address this, we developed a filter, which we call GROFFFY, and validated it in WGS data from hereditary haemorrhagic telangiectasia patients within the 100,000 Genomes Project. Before filter application, the mean number of DNA variants compared to human reference sequence GRCh38 was 4,867,167 (range 4,786,039-5,070,340), and one-third lay within intergenic areas. GROFFFY removed a mean of 2,812,015 variants per DNA. In combination with allele frequency and other filters, GROFFFY enabled a 99.56% reduction in variant number. The proportion of intergenic variants was maintained, and no pathogenic variants in disease genes were lost. We conclude that the filter applied to NHS diagnostic samples in the 100,000 Genomes pipeline offers an efficient method to prioritise intergenic, intronic and coding gDNA variants. Reducing the overwhelming number of variants while retaining functional genome variation of importance to patients enhances the near-term value of WGS in clinical diagnostics.
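To put the quoted figures in proportion, the short calculation below works only from the means stated in the summary. Applying a percentage to a mean count is an approximation (the authors may have averaged per-sample reductions), so the derived numbers are indicative only; the variable names are hypothetical.

```python
# Figures quoted in the summary (means across samples).
mean_variants_before = 4_867_167
mean_removed_by_grofffy = 2_812_015
combined_reduction = 0.9956  # GROFFFY plus allele-frequency and other filters

# Share of variants removed by GROFFFY alone, and what is left afterwards.
fraction_removed = mean_removed_by_grofffy / mean_variants_before
remaining_after_grofffy = mean_variants_before - mean_removed_by_grofffy

# Approximate number of variants left after the combined filters.
remaining_after_all = mean_variants_before * (1 - combined_reduction)

print(f"Removed by GROFFFY alone: {fraction_removed:.1%}")
print(f"Remaining after GROFFFY: {remaining_after_grofffy:,}")
print(f"Remaining after all filters (approx.): {remaining_after_all:,.0f}")
```

On these means, GROFFFY alone removes a little under 58% of variants, and the combined filters leave on the order of 21,000 variants per sample.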


NAR Cancer ◽  
2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Chie Kikutake ◽  
Minako Yoshihara ◽  
Mikita Suyama

Abstract: Cancer-related mutations have been mainly identified in protein-coding regions. Recent studies have demonstrated that mutations in non-coding regions of the genome could also be a risk factor for cancer. However, the non-coding regions comprise 98% of the total length of the human genome and contain a huge number of mutations, making it difficult to interpret their impacts on pathogenesis of cancer. To comprehensively identify cancer-related non-coding mutations, we focused on recurrent mutations in non-coding regions using somatic mutation data from COSMIC and whole-genome sequencing data from The Cancer Genome Atlas (TCGA). We identified 21 574 recurrent mutations in non-coding regions that were shared by at least two different samples from both COSMIC and TCGA databases. Among them, 580 candidate cancer-related non-coding recurrent mutations were identified based on epigenomic and chromatin structure datasets. One such mutation was located in an RREB1 binding site that is thought to interact with the TEAD1 promoter. Our results suggest that mutations may disrupt the binding of RREB1 to the candidate enhancer region and increase TEAD1 expression levels. Our findings demonstrate that non-coding recurrent mutations and coding mutations may contribute to the pathogenesis of cancer.
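The recurrence criterion can be sketched as a simple grouping step. The example below assumes one possible reading of the abstract, namely that a mutation counts as recurrent when it is seen in at least two distinct samples in each of the two databases; the record layout, function name and example records are hypothetical and are not the authors' pipeline.

```python
from collections import defaultdict

def recurrent_mutations(records, sources=("COSMIC", "TCGA"), min_samples=2):
    """Return mutations seen in at least `min_samples` distinct samples
    in every listed source.

    records: iterable of (source, sample_id, chrom, pos, ref, alt).
    The "in every source" rule is an assumed interpretation.
    """
    samples_by_mutation = defaultdict(lambda: defaultdict(set))
    for source, sample_id, chrom, pos, ref, alt in records:
        samples_by_mutation[(chrom, pos, ref, alt)][source].add(sample_id)

    return sorted(
        mutation
        for mutation, by_source in samples_by_mutation.items()
        if all(len(by_source.get(s, ())) >= min_samples for s in sources)
    )

# Hypothetical records for illustration only.
records = [
    ("COSMIC", "c1", "chr1", 100, "A", "G"),
    ("COSMIC", "c2", "chr1", 100, "A", "G"),
    ("TCGA", "t1", "chr1", 100, "A", "G"),
    ("TCGA", "t2", "chr1", 100, "A", "G"),
    ("COSMIC", "c1", "chr2", 200, "C", "T"),
    ("COSMIC", "c3", "chr2", 200, "C", "T"),
    ("TCGA", "t1", "chr2", 200, "C", "T"),
]

print(recurrent_mutations(records))
```

Here only the chr1 mutation qualifies, because the chr2 mutation appears in a single TCGA sample. Filtering the result to non-coding positions and overlaying epigenomic annotations would be separate steps not shown here.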


Author(s):  
Christoph Schaniel ◽  
Priyanka Dhanan ◽  
Bin Hu ◽  
Yuguang Xiong ◽  
Teeya Raghunandan ◽  
...  

Abstract: A library of well-characterized human induced pluripotent stem cell (hiPSC) lines from clinically healthy human subjects could serve as a powerful resource of normal controls for in vitro human development, disease modeling, genotype-phenotype association studies, and drug response evaluation. We report generation and extensive characterization of a gender-balanced, racially/ethnically diverse library of hiPSC lines from forty clinically healthy human individuals who range in age from 22-61. The hiPSCs match the karyotype and short tandem repeat identity of their parental fibroblasts, and have a transcription profile characteristic of pluripotent stem cells. We provide whole genome sequencing data for one hiPSC clone from each individual, ancestry determination, and analysis of Mendelian disease genes and risks. We document similar physiology of cardiomyocytes differentiated from multiple independent hiPSC clones derived from two individuals. This extensive characterization makes this hiPSC library a unique and valuable resource for many studies on human biology.


2020 ◽  
Author(s):  
Zhongbo Chen ◽  
David Zhang ◽  
Regina H. Reynolds ◽  
Emil K. Gustavsson ◽  
Sonia García Ruiz ◽  
...  

ABSTRACT: Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases have high CNCR density, including APOE, highlighting an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript/s to be more abundant in Alzheimer’s disease with more severe tau and amyloid pathological burden. Thus, we demonstrate the importance of human-lineage-specific sequences in brain development and neurological disease. We release our annotation through vizER (https://snca.atica.um.es/browser/app/vizER).


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Chun-Yu Wei ◽  
Jenn-Hwai Yang ◽  
Erh-Chan Yeh ◽  
Ming-Fang Tsai ◽  
Hsiao-Jung Kao ◽  
...  

Abstract: Personalized medical care focuses on prediction of disease risk and response to medications. To build the risk models, access to both large-scale genomic resources and human genetic studies is required. The Taiwan Biobank (TWB) has generated high-coverage, whole-genome sequencing data from 1492 individuals and genome-wide SNP data from 103,106 individuals of Han Chinese ancestry using custom SNP arrays. Principal components analysis of the genotyping data showed that the full range of Han Chinese genetic variation was found in the cohort. The arrays also include thousands of known functional variants, allowing for simultaneous ascertainment of Mendelian disease-causing mutations and variants that affect drug metabolism. We found that 21.2% of the population are mutation carriers of autosomal recessive diseases, 3.1% have mutations in cancer-predisposing genes, and 87.3% carry variants that affect drug response. We highlight how TWB data provide insight into both population history and disease burden, while showing how widespread genetic testing can be used to improve clinical care.

