Scoring of pathogenic non-coding variants in Mendelian diseases through supervised learning on ancient, recent and ongoing purifying selection signals in human

Mapping Intimacies ◽

10.1101/363903 ◽

2018 ◽

Author(s):

Barthélémy Caron ◽

Yufei Luo ◽

Antonio Rausell

Keyword(s):

Supervised Learning ◽

State Of The Art ◽

Purifying Selection ◽

Regulatory Elements ◽

Coding Regions ◽

Mendelian Diseases ◽

Art Methods ◽

Causal Variants ◽

Genomic Regions ◽

Coding Variants

Abstract The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates, ~8-70% depending on the disease type. Whole genome sequencing of the unresolved cases allows addressing the hypothesis that causal variants could lie in non-coding regions with damaging regulatory consequences. The large amount of rare and singleton variants found in each individual genome requires computational filtering and scoring strategies to gain power in downstream statistical genetics tests. However, state-of-the-art methods estimating the functional relevance of non-coding genomic regions have been mostly characterized on sets of variants largely composed of trait-associated polymorphisms and associated with common diseases, yet with modest accuracy and strong positional biases. In this work we first curated a collection of n=737 high-confidence pathogenic non-coding single-nucleotide variants in proximal cis-regulatory genomic regions associated with monogenic Mendelian diseases. We then systematically evaluated the ability to predict causal variants of a comprehensive set of natural selection features extracted at three genomic levels: the affected position, the flanking region and the associated gene. In addition to inter-species conservation, a comprehensive set of recent and ongoing purifying selection signals in human was explored, allowing the capture of potential constraints associated with recently acquired regulatory elements in the human lineage. A supervised learning approach using gradient tree boosting on such features reached a high predictive performance characterized by an area under the ROC curve = 0.84 and an area under the Precision-Recall curve = 0.47. These figures represent a relative improvement of >10% and >34% respectively over the performance of current state-of-the-art methods for prioritizing non-coding variants.
Performance was consistent under multiple configurations of the sets of variants used for learning and for independent testing. The supervised learning design allowed the assessment of newly seen non-coding variants overcoming gene and positional bias. The scores produced by the approach allow a more consistent weighting and aggregation of candidate pathogenic variants from diverse non-coding regions within and across genes in the context of statistical tests for rare variant association analysis.
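The two performance measures quoted in the abstract above (area under the ROC curve and area under the Precision-Recall curve) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels and the toy scores are hypothetical, and the Precision-Recall area is approximated here by average precision, which is one common estimator among several.

```python
def roc_auc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision observed at each positive's rank.

    Assumes no tied scores; ties would need an explicit convention.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical toy data: 1 = pathogenic variant, 0 = background variant.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(f"AUROC = {roc_auc(labels, scores):.3f}")
print(f"average precision = {average_precision(labels, scores):.3f}")
```

Unlike the AUROC, the Precision-Recall area depends on the proportion of positives in the test set, so it is most informative when pathogenic variants are rare relative to background variants.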

NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans

Genome Biology ◽

10.1186/s13059-019-1634-2 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 21

Author(s):

Barthélémy Caron ◽

Yufei Luo ◽

Antonio Rausell

Keyword(s):

Supervised Learning ◽

Purifying Selection ◽

Mendelian Diseases ◽

Coding Variants

The mystery of extreme non-coding conservation

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2013.0021 ◽

2013 ◽

Vol 368 (1632) ◽

pp. 20130021 ◽

Cited By ~ 52

Author(s):

Nathan Harmston ◽

Anja Barešić ◽

Boris Lenhard

Keyword(s):

Purifying Selection ◽

Regulatory Elements ◽

Cellular Organization ◽

Functional Importance ◽

Base Pairs ◽

Transcriptional Regulatory Elements ◽

Coding Regions ◽

Transcriptional Regulatory ◽

Methods And Techniques ◽

Extreme Conservation

Regions of several dozen to several hundred base pairs of extreme conservation have been found in non-coding regions in all metazoan genomes. The distribution of these elements within and across genomes has suggested that many have roles as transcriptional regulatory elements in multi-cellular organization, differentiation and development. Currently, there is no known mechanism or function that would account for this level of conservation at the observed evolutionary distances. Previous studies have found that, while these regions are under strong purifying selection, and not mutational coldspots, deletion of entire regions in mice does not necessarily lead to identifiable changes in phenotype during development. These opposing findings lead to several questions regarding their functional importance and why they are under strong selection in the first place. In this perspective, we discuss the methods and techniques used in identifying and dissecting these regions, their observed patterns of conservation, and review the current hypotheses on their functional significance.

Computational prediction of CRISPR-impaired non-coding regulatory regions

10.1101/2020.12.22.423923 ◽

2020 ◽

Author(s):

Nina Baumgarten ◽

Florian Schmidt ◽

Martin Wegner ◽

Marie Hebel ◽

Manuel Kaulich ◽

...

Keyword(s):

Computational Prediction ◽

Regulatory Elements ◽

Cell Type ◽

Coding Regions ◽

Gene Coding ◽

Genome Wide ◽

A Genome ◽

Crispr Screen ◽

Genomic Regions ◽

Made In

Abstract Genome-wide CRISPR screens are becoming more widespread and allow the simultaneous interrogation of thousands of genomic regions. Although recent progress has been made in the analysis of CRISPR screens, it is still an open problem how to interpret CRISPR mutations in non-coding regions of the genome. Most of the tools concentrate on the interpretation of mutations introduced in gene coding regions. We introduce a computational pipeline that uses epigenomic information about regulatory elements for the interpretation of CRISPR mutations in non-coding regions. We illustrate our approach on the analysis of a genome-wide CRISPR screen in hTERT-RPE-1 cells and reveal novel regulatory elements that mediate chemoresistance against doxorubicin in these cells. We infer links to established and to novel chemoresistance genes. Our approach is general and can be applied on any cell type and with different CRISPR enzymes.

How to find genomic regions relevant for gene regulation

Medizinische Genetik ◽

10.1515/medgen-2021-2074 ◽

2021 ◽

Vol 33 (2) ◽

pp. 157-165

Author(s):

Xuanzong Guo ◽

Uwe Ohler ◽

Ferah Yildirim

Keyword(s):

Functional Characterization ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

High Throughput Analysis ◽

Protein Coding ◽

Coding Regions ◽

The Past ◽

Genomic Regions ◽

Regulatory Functions

Abstract Genetic variants associated with human diseases are often located outside the protein coding regions of the genome. Identification and functional characterization of the regulatory elements in the non-coding genome is therefore of crucial importance for understanding the consequences of genetic variation and the mechanisms of disease. The past decade has seen rapid progress in high-throughput analysis and mapping of chromatin accessibility, looping, structure, and occupancy by transcription factors, as well as epigenetic modifications, all of which contribute to the proper execution of regulatory functions in the non-coding genome. Here, we review the current technologies for the definition and functional validation of non-coding regulatory regions in the genome.

EpiRegio: analysis and retrieval of regulatory elements linked to genes

Nucleic Acids Research ◽

10.1093/nar/gkaa382 ◽

2020 ◽

Vol 48 (W1) ◽

pp. W193-W199 ◽

Cited By ~ 4

Author(s):

Nina Baumgarten ◽

Dennis Hecker ◽

Sivarajan Karunanithi ◽

Florian Schmidt ◽

Markus List ◽

...

Keyword(s):

Gene Expression ◽

Target Genes ◽

Association Studies ◽

Web Server ◽

Cell Types ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Genome Wide Association Studies ◽

Coding Regions ◽

Genomic Regions

Abstract A current challenge in genomics is to interpret non-coding regions and their role in transcriptional regulation of possibly distant target genes. Genome-wide association studies show that a large part of genomic variants are found in those non-coding regions, but their mechanisms of gene regulation are often unknown. An additional challenge is to reliably identify the target genes of the regulatory regions, which is an essential step in understanding their impact on gene expression. Here we present the EpiRegio web server, a resource of regulatory elements (REMs). REMs are genomic regions that exhibit variations in their chromatin accessibility profile associated with changes in expression of their target genes. EpiRegio incorporates both epigenomic and gene expression data for various human primary cell types and tissues, providing an integrated view of REMs in the genome. Our web server allows the analysis of genes and their associated REMs, including the REM’s activity and its estimated cell type-specific contribution to its target gene’s expression. Further, it is possible to explore genomic regions for their regulatory potential, investigate overlapping REMs and by that the dissection of regions of large epigenomic complexity. EpiRegio allows programmatic access through a REST API and is freely available at https://epiregio.de/.

RNA polymerase mapping in plants identifies intergenic regulatory elements enriched in causal variants

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab273 ◽

2021 ◽

Author(s):

Roberto Lozano ◽

Gregory T Booth ◽

Bilan Yonis Omar ◽

Bo Li ◽

Edward S Buckler ◽

...

Keyword(s):

Rna Polymerase ◽

Cell Function ◽

Regulatory Element ◽

Crop Improvement ◽

Regulatory Elements ◽

Open Chromatin ◽

Control Of Gene Expression ◽

Causal Variants ◽

Divergent Transcription ◽

Genomic Regions

Abstract Control of gene expression is fundamental at every level of cell function. Promoter-proximal pausing and divergent transcription at promoters and enhancers, which are prominent features in animals, have only been studied in a handful of research experiments in plants. PRO-Seq analysis in cassava (Manihot esculenta) identified peaks of transcriptionally engaged RNA polymerase at both the 5′ and 3′ end of genes, consistent with paused or slowly moving Polymerase. In addition, we identified divergent transcription at intergenic sites. A full genome search for bi-directional transcription using an algorithm for enhancer detection developed in mammals (dREG) identified many intergenic regulatory element (IRE) candidates. These sites showed distinct patterns of methylation and nucleotide conservation based on genomic evolutionary rate profiling (GERP). SNPs within these IRE candidates explained significantly more variation in fitness and root composition than SNPs in chromosomal segments randomly ascertained from the same intergenic distribution, strongly suggesting a functional importance of these sites. Maize GRO-Seq data showed RNA polymerase occupancy at IREs consistent with patterns in cassava. Furthermore, these IREs in maize significantly overlapped with sites previously identified on the basis of open chromatin, histone marks, and methylation, and were enriched for reported eQTL. Our results suggest that bidirectional transcription can identify intergenic genomic regions in plants that play an important role in transcription regulation and whose identification has the potential to aid crop improvement.

Long-term natural selection affects patterns of neutral divergence on the X chromosome more than the autosomes

10.1101/023234 ◽

2015 ◽

Author(s):

Melissa Ann Wilson Sayres ◽

Pooja Narang

Keyword(s):

Natural Selection ◽

X Chromosome ◽

Great Apes ◽

Regulatory Elements ◽

Mutation Bias ◽

Term Selection ◽

Coding Regions ◽

Male Mutation Bias ◽

Genomic Regions

Natural selection reduces neutral population genetic diversity near coding regions of the genome because recombination has not had time to unlink selected alleles from nearby neutral regions. For ten sub-species of great apes, including human, we show that long-term selection affects estimates of divergence on the X differently from the autosomes. Divergence increases with increasing distance from genes on both the X chromosome and autosomes, but increases faster on the X chromosome than autosomes, resulting in increasing ratios of X/A divergence in putatively neutral regions. Similarly, divergence is reduced more on the X chromosome in neutral regions near conserved regulatory elements than on the autosomes. Consequently estimates of male mutation bias, which rely on comparing neutral divergence between the X and autosomes, are twice as high in neutral regions near genes versus far from genes. Our results suggest filters for putatively neutral genomic regions differ between the X and autosomes.

Computational prediction of CRISPR-impaired non-coding regulatory regions

Biological Chemistry ◽

10.1515/hsz-2020-0392 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Nina Baumgarten ◽

Florian Schmidt ◽

Martin Wegner ◽

Marie Hebel ◽

Manuel Kaulich ◽

...

Keyword(s):

Computational Prediction ◽

Regulatory Elements ◽

Coding Regions ◽

Gene Coding ◽

Genome Wide ◽

A Genome ◽

Crispr Screen ◽

Genomic Regions ◽

Analysis Protocol ◽

Made In

Abstract Genome-wide CRISPR screens are becoming more widespread and allow the simultaneous interrogation of thousands of genomic regions. Although recent progress has been made in the analysis of CRISPR screens, it is still an open problem how to interpret CRISPR mutations in non-coding regions of the genome. Most of the tools concentrate on the interpretation of mutations introduced in gene coding regions. We introduce a computational pipeline that uses epigenomic information about regulatory elements for the interpretation of CRISPR mutations in non-coding regions. We illustrate our analysis protocol on the analysis of a genome-wide CRISPR screen in hTERT-RPE1 cells and reveal novel regulatory elements that mediate chemoresistance against doxorubicin in these cells. We infer links to established and to novel chemoresistance genes. Our analysis protocol is general and can be applied on any cell type and with different CRISPR enzymes.

Contribution of SLC22A12 on hypouricemia and its clinical significance for screening purposes

Scientific Reports ◽

10.1038/s41598-019-50798-6 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Do Hyeon Cha ◽

Heon Yung Gee ◽

Raul Cachau ◽

Jong Mun Choi ◽

Daeui Park ◽

...

Keyword(s):

Genetic Diagnosis ◽

Kidney Injury ◽

Genetic Identification ◽

Diagnostic Screening ◽

Renal Hypouricemia ◽

Pathogenic Variants ◽

Whole Exome ◽

Causal Variants ◽

Coding Variants ◽

Independent Cohort

Abstract Differentiating between inherited renal hypouricemia and transient hypouricemic status is challenging. Here, we aimed to describe the genetic background of hypouricemia patients using whole-exome sequencing (WES) and assess the feasibility for genetic diagnosis using two founder variants in primary screening. We selected all cases (N = 31) with extreme hypouricemia (<1.3 mg/dl) from a Korean urban cohort of 179,381 subjects without underlying conditions. WES and corresponding downstream analyses were performed for the discovery of rare causal variants for hypouricemia. Two known recessive variants within SLC22A12 (p.Trp258*, p.Arg90His) were identified in 24 out of 31 subjects (77.4%). In an independent cohort, we identified 50 individuals with hypouricemia and genotyped the p.Trp258* and p.Arg90His variants; 47 of the 50 (94%) hypouricemia cases were explained by only two mutations. Four novel coding variants in SLC22A12, p.Asn136Lys, p.Thr225Lys, p.Arg284Gln, and p.Glu429Lys, were additionally identified. In silico studies predict these as pathogenic variants. This is the first study to show the value of genetic diagnostic screening for hypouricemia in the clinical setting. Screening of just two ethnic-specific variants (p.Trp258* and p.Arg90His) identified 87.7% (71/81) of Korean patients with monogenic hypouricemia. Early genetic identification of constitutive hypouricemia may prevent acute kidney injury by avoidance of dehydration and excessive exercise.
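The percentages in the abstract above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the abstract; the helper name `percent` is hypothetical, and reading the 71/81 figure as the two cohorts pooled together is an inference from the numbers (24 + 47 = 71 and 31 + 50 = 81), not something the abstract states explicitly.

```python
def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# Counts taken from the abstract.
discovery_explained, discovery_total = 24, 31      # WES discovery cohort
replication_explained, replication_total = 47, 50  # independent cohort

# Assumption: the pooled figure is the sum of both cohorts.
pooled_explained = discovery_explained + replication_explained
pooled_total = discovery_total + replication_total

print(percent(discovery_explained, discovery_total))      # discovery cohort
print(percent(replication_explained, replication_total))  # independent cohort
print(pooled_explained, pooled_total, percent(pooled_explained, pooled_total))
```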

Multi-hop assortativities for network classification

Journal of Complex Networks ◽

10.1093/comnet/cny034 ◽

2018 ◽

Vol 7 (4) ◽

pp. 603-622 ◽

Cited By ~ 1

Author(s):

Leonardo Gutiérrez-Gómez ◽

Jean-Charles Delvenne

Keyword(s):

Machine Learning ◽

Scientific Collaboration ◽

State Of The Art ◽

Medical Engineering ◽

Research Field ◽

Classification Task ◽

Collaboration Network ◽

Structural Patterns ◽

Art Methods

Abstract Several social, medical, engineering and biological challenges rely on discovering the functionality of networks from their structure and node metadata, when it is available. For example, in chemoinformatics one might want to detect whether a molecule is toxic based on structure and atomic types, or discover the research field of a scientific collaboration network. Existing techniques rely on counting or measuring structural patterns that are known to show large variations from network to network, such as the number of triangles, or the assortativity of node metadata. We introduce the concept of multi-hop assortativity, which captures the similarity of the nodes situated at the extremities of a randomly selected path of a given length. We show that multi-hop assortativity unifies various existing concepts and offers a versatile family of ‘fingerprints’ to characterize networks. These fingerprints in turn allow the functionalities of a network to be recovered with the help of the machine learning toolbox. Our method is evaluated empirically on established social and chemoinformatic network benchmarks. Results reveal that our assortativity-based features are competitive, providing highly accurate results and often outperforming state-of-the-art methods for the network classification task.