Scoring of pathogenic non-coding variants in Mendelian diseases through supervised learning on ancient, recent and ongoing purifying selection signals in human

2018
Author(s): Barthélémy Caron, Yufei Luo, Antonio Rausell

Abstract The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates, ~8-70% depending on the disease type. Whole genome sequencing of the unresolved cases makes it possible to test the hypothesis that causal variants could lie in non-coding regions with damaging regulatory consequences. The large number of rare and singleton variants found in each individual genome requires computational filtering and scoring strategies to gain power in downstream statistical genetics tests. However, state-of-the-art methods estimating the functional relevance of non-coding genomic regions have been mostly characterized on sets of variants largely composed of trait-associated polymorphisms associated with common diseases, and they show modest accuracy and strong positional biases. In this work we first curated a collection of n=737 high-confidence pathogenic non-coding single-nucleotide variants in proximal cis-regulatory genomic regions associated with monogenic Mendelian diseases. We then systematically evaluated the ability of a comprehensive set of natural selection features to predict causal variants, with features extracted at three genomic levels: the affected position, the flanking region and the associated gene. In addition to inter-species conservation, a comprehensive set of recent and ongoing purifying selection signals in human was explored, making it possible to capture potential constraints associated with recently acquired regulatory elements in the human lineage. A supervised learning approach using gradient tree boosting on such features reached a high predictive performance, characterized by an area under the ROC curve = 0.84 and an area under the Precision-Recall curve = 0.47. These figures represent a relative improvement of >10% and >34%, respectively, over the performance of current state-of-the-art methods for prioritizing non-coding variants. Performance was consistent under multiple configurations of the sets of variants used for learning and for independent testing. The supervised learning design allowed the assessment of newly seen non-coding variants, overcoming gene and positional bias. The scores produced by the approach allow a more consistent weighting and aggregation of candidate pathogenic variants from diverse non-coding regions within and across genes in the context of statistical tests for rare variant association analysis.
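
As a reading aid for the reported figures, the short sketch below back-calculates the baseline scores implied by the stated relative improvements. It assumes that "relative improvement" means (new - old) / old, which the abstract does not spell out, and the function name `implied_baseline_ceiling` is purely illustrative rather than part of the authors' method.

```python
def implied_baseline_ceiling(score: float, relative_gain: float) -> float:
    """Upper bound on a baseline score, given that `score` improves on it
    by more than `relative_gain`.

    Assumption: relative gain is defined as (score - baseline) / baseline,
    so baseline < score / (1 + relative_gain).
    """
    return score / (1.0 + relative_gain)


# Figures quoted in the abstract: AUROC 0.84 (>10% gain), AUPRC 0.47 (>34% gain).
auroc_ceiling = implied_baseline_ceiling(0.84, 0.10)
auprc_ceiling = implied_baseline_ceiling(0.47, 0.34)

print(f"Implied baseline AUROC below {auroc_ceiling:.3f}")
print(f"Implied baseline AUPRC below {auprc_ceiling:.3f}")
```

Under this assumed definition, the best competing method would score below roughly 0.76 in AUROC and below roughly 0.35 in AUPRC.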


2013, Vol 368 (1632), pp. 20130021
Author(s): Nathan Harmston, Anja Barešić, Boris Lenhard

Regions of several dozen to several hundred base pairs of extreme conservation have been found in non-coding regions in all metazoan genomes. The distribution of these elements within and across genomes has suggested that many have roles as transcriptional regulatory elements in multi-cellular organization, differentiation and development. Currently, there is no known mechanism or function that would account for this level of conservation at the observed evolutionary distances. Previous studies have found that, while these regions are under strong purifying selection, and not mutational coldspots, deletion of entire regions in mice does not necessarily lead to identifiable changes in phenotype during development. These opposing findings lead to several questions regarding their functional importance and why they are under strong selection in the first place. In this perspective, we discuss the methods and techniques used in identifying and dissecting these regions, their observed patterns of conservation, and review the current hypotheses on their functional significance.



2020
Author(s): Nina Baumgarten, Florian Schmidt, Martin Wegner, Marie Hebel, Manuel Kaulich, ...

Abstract Genome-wide CRISPR screens are becoming more widespread and allow the simultaneous interrogation of thousands of genomic regions. Although recent progress has been made in the analysis of CRISPR screens, it is still an open problem how to interpret CRISPR mutations in non-coding regions of the genome. Most of the tools concentrate on the interpretation of mutations introduced in gene coding regions. We introduce a computational pipeline that uses epigenomic information about regulatory elements for the interpretation of CRISPR mutations in non-coding regions. We illustrate our approach on the analysis of a genome-wide CRISPR screen in hTERT-RPE-1 cells and reveal novel regulatory elements that mediate chemoresistance against doxorubicin in these cells. We infer links to established and to novel chemoresistance genes. Our approach is general and can be applied to any cell type and with different CRISPR enzymes.



2021, Vol 33 (2), pp. 157-165
Author(s): Xuanzong Guo, Uwe Ohler, Ferah Yildirim

Abstract Genetic variants associated with human diseases are often located outside the protein coding regions of the genome. Identification and functional characterization of the regulatory elements in the non-coding genome is therefore of crucial importance for understanding the consequences of genetic variation and the mechanisms of disease. The past decade has seen rapid progress in high-throughput analysis and mapping of chromatin accessibility, looping, structure, and occupancy by transcription factors, as well as epigenetic modifications, all of which contribute to the proper execution of regulatory functions in the non-coding genome. Here, we review the current technologies for the definition and functional validation of non-coding regulatory regions in the genome.



2020, Vol 48 (W1), pp. W193-W199
Author(s): Nina Baumgarten, Dennis Hecker, Sivarajan Karunanithi, Florian Schmidt, Markus List, ...

Abstract A current challenge in genomics is to interpret non-coding regions and their role in transcriptional regulation of possibly distant target genes. Genome-wide association studies show that a large fraction of genomic variants lie in those non-coding regions, but their mechanisms of gene regulation are often unknown. An additional challenge is to reliably identify the target genes of the regulatory regions, which is an essential step in understanding their impact on gene expression. Here we present the EpiRegio web server, a resource of regulatory elements (REMs). REMs are genomic regions that exhibit variations in their chromatin accessibility profile associated with changes in expression of their target genes. EpiRegio incorporates both epigenomic and gene expression data for various human primary cell types and tissues, providing an integrated view of REMs in the genome. Our web server allows the analysis of genes and their associated REMs, including the REM’s activity and its estimated cell type-specific contribution to its target gene’s expression. Further, it is possible to explore genomic regions for their regulatory potential and to investigate overlapping REMs, thereby dissecting regions of high epigenomic complexity. EpiRegio allows programmatic access through a REST API and is freely available at https://epiregio.de/.



Author(s): Roberto Lozano, Gregory T Booth, Bilan Yonis Omar, Bo Li, Edward S Buckler, ...

Abstract Control of gene expression is fundamental at every level of cell function. Promoter-proximal pausing and divergent transcription at promoters and enhancers, which are prominent features in animals, have only been studied in a handful of research experiments in plants. PRO-Seq analysis in cassava (Manihot esculenta) identified peaks of transcriptionally engaged RNA polymerase at both the 5′ and 3′ end of genes, consistent with paused or slowly moving Polymerase. In addition, we identified divergent transcription at intergenic sites. A full genome search for bi-directional transcription using an algorithm for enhancer detection developed in mammals (dREG) identified many intergenic regulatory element (IRE) candidates. These sites showed distinct patterns of methylation and nucleotide conservation based on genomic evolutionary rate profiling (GERP). SNPs within these IRE candidates explained significantly more variation in fitness and root composition than SNPs in chromosomal segments randomly ascertained from the same intergenic distribution, strongly suggesting a functional importance of these sites. Maize GRO-Seq data showed RNA polymerase occupancy at IREs consistent with patterns in cassava. Furthermore, these IREs in maize significantly overlapped with sites previously identified on the basis of open chromatin, histone marks, and methylation, and were enriched for reported eQTL. Our results suggest that bidirectional transcription can identify intergenic genomic regions in plants that play an important role in transcription regulation and whose identification has the potential to aid crop improvement.



2015
Author(s): Melissa Ann Wilson Sayres, Pooja Narang

Natural selection reduces neutral population genetic diversity near coding regions of the genome because recombination has not had time to unlink selected alleles from nearby neutral regions. For ten sub-species of great apes, including human, we show that long-term selection affects estimates of divergence on the X differently from the autosomes. Divergence increases with increasing distance from genes on both the X chromosome and autosomes, but increases faster on the X chromosome than autosomes, resulting in increasing ratios of X/A divergence in putatively neutral regions. Similarly, divergence is reduced more on the X chromosome in neutral regions near conserved regulatory elements than on the autosomes. Consequently estimates of male mutation bias, which rely on comparing neutral divergence between the X and autosomes, are twice as high in neutral regions near genes versus far from genes. Our results suggest filters for putatively neutral genomic regions differ between the X and autosomes.



2021, Vol 0 (0)
Author(s): Nina Baumgarten, Florian Schmidt, Martin Wegner, Marie Hebel, Manuel Kaulich, ...

Abstract Genome-wide CRISPR screens are becoming more widespread and allow the simultaneous interrogation of thousands of genomic regions. Although recent progress has been made in the analysis of CRISPR screens, it is still an open problem how to interpret CRISPR mutations in non-coding regions of the genome. Most of the tools concentrate on the interpretation of mutations introduced in gene coding regions. We introduce a computational pipeline that uses epigenomic information about regulatory elements for the interpretation of CRISPR mutations in non-coding regions. We illustrate our analysis protocol on a genome-wide CRISPR screen in hTERT-RPE1 cells and reveal novel regulatory elements that mediate chemoresistance against doxorubicin in these cells. We infer links to established and to novel chemoresistance genes. Our analysis protocol is general and can be applied to any cell type and with different CRISPR enzymes.



2019, Vol 9 (1)
Author(s): Do Hyeon Cha, Heon Yung Gee, Raul Cachau, Jong Mun Choi, Daeui Park, ...

Abstract Differentiating between inherited renal hypouricemia and transient hypouricemic status is challenging. Here, we aimed to describe the genetic background of hypouricemia patients using whole-exome sequencing (WES) and assess the feasibility of genetic diagnosis using two founder variants in primary screening. We selected all cases (N = 31) with extreme hypouricemia (<1.3 mg/dl) from a Korean urban cohort of 179,381 subjects without underlying conditions. WES and corresponding downstream analyses were performed for the discovery of rare causal variants for hypouricemia. Two known recessive variants within SLC22A12 (p.Trp258*, p.Arg90His) were identified in 24 out of 31 subjects (77.4%). In an independent cohort, we identified 50 individuals with hypouricemia and genotyped the p.Trp258* and p.Arg90His variants; 47 of the 50 (94%) hypouricemia cases were explained by only two mutations. Four novel coding variants in SLC22A12, p.Asn136Lys, p.Thr225Lys, p.Arg284Gln, and p.Glu429Lys, were additionally identified. In silico studies predict these as pathogenic variants. This is the first study to show the value of genetic diagnostic screening for hypouricemia in the clinical setting. Screening of just two ethnic-specific variants (p.Trp258* and p.Arg90His) identified 87.7% (71/81) of Korean patients with monogenic hypouricemia. Early genetic identification of constitutive hypouricemia may prevent acute kidney injury by avoidance of dehydration and excessive exercise.
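
The percentages quoted in this abstract can be re-derived from the stated counts. The sketch below is only an arithmetic check. It assumes that the combined figure of 71/81 is the sum of the discovery cohort (24/31) and the independent cohort (47/50), which is consistent with the numbers given but not stated explicitly; the helper name `pct` is illustrative.

```python
from fractions import Fraction

# Counts as reported in the abstract.
discovery = Fraction(24, 31)    # carriers among extreme-hypouricemia cases
replication = Fraction(47, 50)  # carriers in the independent cohort

# Assumption: the pooled figure is the sum of the two cohorts.
combined = Fraction(24 + 47, 31 + 50)


def pct(x: Fraction) -> float:
    """Express a fraction as a percentage rounded to one decimal place."""
    return round(float(x) * 100, 1)


print(pct(discovery), pct(replication), pct(combined))
```

Rounded to one decimal place, the three fractions reproduce the 77.4%, 94% and 87.7% reported in the text.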



2018, Vol 7 (4), pp. 603-622
Author(s): Leonardo Gutiérrez-Gómez, Jean-Charles Delvenne

Abstract Several social, medical, engineering and biological challenges rely on discovering the functionality of networks from their structure and node metadata, when it is available. For example, in chemoinformatics one might want to detect whether a molecule is toxic based on structure and atomic types, or discover the research field of a scientific collaboration network. Existing techniques rely on counting or measuring structural patterns that are known to show large variations from network to network, such as the number of triangles, or the assortativity of node metadata. We introduce the concept of multi-hop assortativity, which captures the similarity of the nodes situated at the extremities of a randomly selected path of a given length. We show that multi-hop assortativity unifies various existing concepts and offers a versatile family of ‘fingerprints’ to characterize networks. These fingerprints in turn make it possible to recover the functionalities of a network with the help of the machine learning toolbox. Our method is evaluated empirically on established social and chemoinformatic network benchmarks. Results reveal that our assortativity-based features are competitive, providing highly accurate results and often outperforming state-of-the-art methods for the network classification task.
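
To make the idea of multi-hop assortativity concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes an undirected graph without isolated nodes, a single numeric attribute per node, and a path sampled as a t-step random walk started from the stationary (degree-proportional) distribution; the score is the Pearson correlation of the attribute at the two ends of the walk. The function name `multi_hop_assortativity` and the toy graphs are hypothetical.

```python
def multi_hop_assortativity(adj, x, t):
    """Correlation of attribute x between the two ends of a t-step random walk.

    adj: symmetric 0/1 adjacency matrix (list of lists), no isolated nodes.
    x:   one numeric attribute per node (must not be constant).
    t:   walk length; t = 1 reduces to ordinary edge-based assortativity.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    total = sum(deg)                      # equals twice the number of edges
    pi = [d / total for d in deg]         # stationary distribution of the walk

    # Row-stochastic transition matrix P = D^-1 A.
    P = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]

    # y = P^t x: expected attribute at the far end of a t-step walk from each node.
    y = list(x)
    for _ in range(t):
        y = [sum(P[i][j] * y[j] for j in range(n)) for i in range(n)]

    mean = sum(p * xi for p, xi in zip(pi, x))
    var = sum(p * xi * xi for p, xi in zip(pi, x)) - mean ** 2
    cov = sum(p * xi * yi for p, xi, yi in zip(pi, x, y)) - mean ** 2
    return cov / var


# Toy example: a 4-cycle whose node attributes alternate around the ring.
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]
labels = [0, 1, 0, 1]
fingerprint = [multi_hop_assortativity(cycle, labels, t) for t in (1, 2, 3)]
print(fingerprint)
```

On this toy cycle, neighbours always differ and two-hop neighbours always match, so the score alternates in sign with the walk length. A vector of such scores over several values of t is the kind of fingerprint that could then be passed to a standard classifier.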


