Integrating long-range regulatory interactions to predict gene expression using graph convolutional neural networks

2020 ◽  
Author(s):  
Jeremy Bigness ◽  
Xavi Loinaz ◽  
Shalin Patel ◽  
Erica Larschan ◽  
Ritambhara Singh

Long-range spatial interactions among genomic regions are critical for regulating gene expression and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors on gene expression, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying biological system. This prevents the field from obtaining a more comprehensive understanding of gene regulation and from fully leveraging the structural information present in the data sets. Here, we propose a graph convolutional neural network (GCNN) framework to integrate measurements probing spatial genomic organization and measurements of local regulatory factors, specifically histone modifications, to predict gene expression. This formulation enables the model to incorporate crucial information about long-range interactions via a natural encoding of spatial interaction relationships into a graph representation. Furthermore, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions that contribute to a gene's predicted expression. We apply our GCNN model to datasets for GM12878 (lymphoblastoid) and K562 (myelogenous leukemia) cell lines and demonstrate its state-of-the-art prediction performance. We also obtain importance scores corresponding to the histone mark features and interacting regions for some exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal datasets.
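To make the graph encoding described above concrete, here is a minimal sketch of one graph-convolution step in Python with NumPy. It assumes nodes are genomic regions, edges are spatial contacts between regions, and node features are histone-modification signals. The propagation rule is the generic symmetric-normalized form ReLU(D^-1/2 (A + I) D^-1/2 X W), which may differ from the layer the authors actually use; the function name `gcn_layer` and all toy values are hypothetical.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One generic graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    This is an illustrative layer, not necessarily the one used in the paper.
    """
    # Add self-loops so each region keeps its own histone-mark signal.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Symmetric degree normalization of the contact graph.
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate neighbor features, project them, apply ReLU.
    return np.maximum(norm_adj @ features @ weights, 0.0)


# Toy graph (hypothetical): 4 genomic regions; regions 0-1 and 1-2 are in
# spatial contact, region 3 has no long-range contacts.
adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]],
    dtype=float,
)
rng = np.random.default_rng(0)
features = rng.random((4, 5))      # 5 hypothetical histone-mark signals per region
weights = rng.normal(size=(5, 3))  # stand-in for learned parameters
hidden = gcn_layer(adjacency, features, weights)
```

In this sketch each row of `hidden` mixes a region's own histone-mark features with those of the regions it contacts, which is how long-range information can reach a gene's representation. A region with no contacts, such as region 3 here, is transformed using only its own features.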

2019 ◽  
Vol 2019 ◽  
pp. 1-12
Author(s):  
Livia Eiselleova ◽  
Viktor Lukjanov ◽  
Simon Farkas ◽  
David Svoboda ◽  
Karel Stepka ◽  
...  

The eukaryotic nucleus is a highly complex structure that carries out multiple functions primarily needed for gene expression, and among them, transcription seems to be the most fundamental. Diverse approaches have demonstrated that transcription takes place at discrete sites known as transcription factories, wherein RNA polymerase II (RNAP II) is attached to the factory and immobilized while transcribing DNA. It has been proposed that transcription factories promote chromatin loop formation, creating long-range interactions in which relatively distant genes can be transcribed simultaneously. In this study, we examined long-range interactions between the POU5F1 gene and genes previously identified as being POU5F1 enhancer-interacting, namely, CDYL, TLE2, RARG, and MSX1 (all involved in transcriptional regulation), in human pluripotent stem cells (hPSCs) and their early differentiated counterparts. RUNX1, which is expressed during hematopoietic differentiation and is not associated with pluripotency, was used as a control gene. To reveal how these long-range interactions between POU5F1 and the selected genes change with the onset of differentiation and upon RNAP II inhibition, we performed three-dimensional fluorescence in situ hybridization (3D-FISH) followed by computational simulation analysis. Our analysis showed that the numbers of long-range interactions between specific genes decrease during differentiation, suggesting that the transcription of monitored genes is associated with pluripotency. In addition, we showed that upon inhibition of RNAP II, long-range associations do not disintegrate and remain constant. We also analyzed the distance distributions of these genes in the context of their positions in the nucleus and revealed that they tend to have similar patterns resembling a normal distribution. Furthermore, we compared data created in vitro and in silico to assess the biological relevance of our results.


2000 ◽  
Vol 113 (14) ◽  
pp. 2527-2533 ◽  
Author(s):  
G.S. Stein ◽  
A.J. van Wijnen ◽  
J.L. Stein ◽  
J.B. Lian ◽  
M. Montecino ◽  
...  

The subnuclear organization of nucleic acids and cognate regulatory factors suggests that there are functional interrelationships between nuclear structure and gene expression. Nuclear proteins that are localized in discrete domains within the nucleus include the leukemia-associated acute myelogenous leukemia (AML) and promyelocytic leukemia (PML) factors, the SC-35 RNA-processing factors, nucleolar proteins and components of both transcriptional and DNA replication complexes. Mechanisms that control the spatial distribution of transcription factors within the three-dimensional context of the nucleus may involve the sorting of regulatory information, as well as contribute to the assembly and activity of sites that support gene expression. Molecular, cellular, genetic and biochemical approaches have identified distinct protein segments, termed intranuclear-targeting signals, that are responsible for directing regulatory factors to specific subnuclear sites. Gene rearrangements that remove or alter intranuclear-targeting signals are prevalent in leukemias and have been linked to altered localization of regulatory factors within the nucleus. These modifications in the intranuclear targeting of transcription factors might abrogate fidelity of gene expression in tumor cells by influencing the spatial organization and/or assembly of machineries involved in the synthesis and processing of gene transcripts.


2017 ◽  
Author(s):  
Yan Kai ◽  
Jaclyn Andricovich ◽  
Zhouhao Zeng ◽  
Jun Zhu ◽  
Alexandros Tzatsos ◽  
...  

The CCCTC-binding zinc finger protein (CTCF)-mediated network of long-range chromatin interactions is important for genome organization and function. Although this network has been considered largely invariant, we found that it exhibits extensive cell-type-specific interactions that contribute to cell identity. Here we present Lollipop, a machine-learning framework that predicts CTCF-mediated long-range interactions using genomic and epigenomic features. Using ChIA-PET data as a benchmark, we demonstrated that Lollipop accurately predicts CTCF-mediated chromatin interactions both within and across cell types, and outperforms other methods based only on CTCF motif orientation. Predictions were confirmed computationally and experimentally by Chromatin Conformation Capture (3C). Moreover, our approach reveals novel determinants of CTCF-mediated chromatin wiring, such as gene expression within the loops. Our study contributes to a better understanding of the underlying principles of CTCF-mediated chromatin interactions and their impact on gene expression.


2019 ◽  
Author(s):  
Davide Chicco ◽  
Haixin Sarah Bi ◽  
Jüri Reimand ◽  
Michael M. Hoffman

Transforming data from genome-scale assays into knowledge of affected molecular functions and pathways is a key challenge in biomedical research. Using vocabularies of functional terms and databases annotating genes with these terms, pathway enrichment methods can identify terms enriched in a gene list. With data that can refer to intergenic regions, however, one must first connect the regions to the terms, which are usually annotated only to genes. To make these connections, existing pathway enrichment approaches apply unwarranted assumptions such as annotating non-coding regions with the terms from adjacent genes. We developed a computational method that instead links genomic regions to annotations using data on long-range chromatin interactions. Our method, Biological Enrichment of Hidden Sequence Targets (BEHST), finds Gene Ontology (GO) terms enriched in genomic regions more precisely and accurately than existing methods. We demonstrate BEHST’s ability to retrieve more pertinent and less ambiguous GO terms associated with results of in vivo mouse enhancer screens or enhancer RNA assays for multiple tissue types. BEHST will accelerate the discovery of affected pathways mediated through long-range interactions that explain non-coding hits in genome-wide association study (GWAS) or genome editing screens. BEHST is free software with a command-line interface for Linux or macOS and a web interface (http://behst.hoffmanlab.org/).
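The idea of linking a non-coding region to genes through a chromatin interaction, rather than through adjacency, can be illustrated with a small Python sketch. This is not BEHST's implementation or interface. The names `Interval`, `overlaps`, and `link_regions_to_genes`, the half-open coordinate convention, and all toy coordinates and gene names are assumptions made for illustration only.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def link_regions_to_genes(queries, interactions, gene_promoters):
    """Map each query region to genes whose promoter window overlaps the
    partner anchor of any interaction that the query region touches."""
    links = {}
    for query in queries:
        genes = set()
        for left, right in interactions:
            # An interaction is undirected, so test both orientations.
            for anchor, partner in ((left, right), (right, left)):
                if overlaps(query, anchor):
                    genes.update(
                        gene
                        for gene, promoter in gene_promoters.items()
                        if overlaps(partner, promoter)
                    )
        links[query] = genes
    return links


# Hypothetical toy data: one long-range contact on chr1.
queries = [Interval("chr1", 1000, 1500), Interval("chr1", 9000, 9100)]
interactions = [(Interval("chr1", 900, 1600), Interval("chr1", 50000, 51000))]
gene_promoters = {
    "GENE_A": Interval("chr1", 50500, 50700),  # sits in the distal anchor
    "GENE_B": Interval("chr1", 2000, 2200),    # adjacent, but not in contact
}
links = link_regions_to_genes(queries, interactions, gene_promoters)
```

In this toy case the first query region is linked to the distal `GENE_A` rather than the nearby `GENE_B`, and the second query region, which touches no interaction anchor, is linked to no gene. The resulting gene sets would then be passed to an ordinary term-enrichment step.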


2021 ◽  
Author(s):  
Žiga Avsec ◽  
Vikram Agarwal ◽  
Daniel Visentin ◽  
Joseph R. Ledsam ◽  
Agnieszka Grabska-Barwinska ◽  
...  

The next phase of genome biology research requires understanding how DNA sequence encodes phenotypes, from the molecular to organismal levels. How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequence through the use of a new deep learning architecture called Enformer that is able to integrate long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Notably, Enformer outperformed the best team on the critical assessment of genome interpretation (CAGI5) challenge for noncoding variant interpretation with no additional training. Furthermore, Enformer learned to predict promoter-enhancer interactions directly from DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of growing human disease associations to cell-type-specific gene regulatory mechanisms and provide a framework to interpret cis-regulatory evolution. To foster these downstream applications, we have made the pre-trained Enformer model openly available, and provide pre-computed effect predictions for all common variants in the 1000 Genomes dataset.

One-sentence summary: Improved noncoding variant effect prediction and candidate enhancer prioritization from a more accurate sequence to expression model driven by extended long-range interaction modelling.


2021 ◽  
Vol 18 (10) ◽  
pp. 1196-1203 ◽  
Author(s):  
Žiga Avsec ◽  
Vikram Agarwal ◽  
Daniel Visentin ◽  
Joseph R. Ledsam ◽  
Agnieszka Grabska-Barwinska ◽  
...  

How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer–promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.

