Integrating long-range regulatory interactions to predict gene expression using graph convolutional neural networks

Long-range spatial interactions among genomic regions are critical for regulating gene expression and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors on gene expression, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying biological system. This prevents the field from obtaining a more comprehensive understanding of gene regulation and from fully leveraging the structural information present in the data sets. Here, we propose a graph convolutional neural network (GCNN) framework to integrate measurements probing spatial genomic organization and measurements of local regulatory factors, specifically histone modifications, to predict gene expression. This formulation enables the model to incorporate crucial information about long-range interactions via a natural encoding of spatial interaction relationships into a graph representation. Furthermore, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions that contribute to a gene's predicted expression. We apply our GCNN model to datasets for GM12878 (lymphoblastoid) and K562 (myelogenous leukemia) cell lines and demonstrate its state-of-the-art prediction performance. We also obtain importance scores corresponding to the histone mark features and interacting regions for some exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal datasets.
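
To illustrate the idea of encoding spatial interactions as a graph, the following is a minimal sketch of a single graph convolution step over genomic bins. It is not the authors' architecture: the mean-aggregation rule, the function names, and the toy adjacency matrix, histone-mark features, and weights are all assumptions chosen for illustration.

```python
# Toy graph convolution over genomic bins (illustrative assumption, not the
# published model). Nodes are genomic bins, edges are chromatin contacts,
# and node features are histone-mark signals.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalize_with_self_loops(adj):
    """Add a self-loop to every node, then scale each row to sum to 1."""
    n = len(adj)
    normalized = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized

def graph_conv(adj, feats, weights):
    """One layer: average each node with its neighbors, project, apply ReLU."""
    a_hat = row_normalize_with_self_loops(adj)
    aggregated = matmul(a_hat, feats)
    projected = matmul(aggregated, weights)
    return [[max(0.0, value) for value in row] for row in projected]

# Hypothetical example: bin 0 contacts bin 1 and bin 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
feats = [[1.0, 0.0],   # two hypothetical histone-mark signals per bin
         [0.0, 2.0],
         [4.0, 0.0]]
weights = [[1.0], [1.0]]  # made-up projection to one hidden unit

hidden = graph_conv(adj, feats, weights)
```

In this sketch, the hidden value for bin 0 mixes its own histone signal with that of both contacting bins, which is how information from a distal interacting region can reach a gene's representation.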

The Role of RNA Polymerase II Contiguity and Long-Range Interactions in the Regulation of Gene Expression in Human Pluripotent Stem Cells

Stem Cells International ◽

10.1155/2019/1375807 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12

Author(s):

Livia Eiselleova ◽

Viktor Lukjanov ◽

Simon Farkas ◽

David Svoboda ◽

Karel Stepka ◽

...

Keyword(s):

Gene Expression ◽

Stem Cells ◽

RNA Polymerase ◽

RNA Polymerase II ◽

Long Range ◽

Pluripotent Stem Cells ◽

Human Pluripotent Stem Cells ◽

Transcription Factories ◽

Long Range Interactions ◽

RNAP II

The eukaryotic nucleus is a highly complex structure that carries out multiple functions primarily needed for gene expression, and among them, transcription seems to be the most fundamental. Diverse approaches have demonstrated that transcription takes place at discrete sites known as transcription factories, wherein RNA polymerase II (RNAP II) is attached to the factory and immobilized while transcribing DNA. It has been proposed that transcription factories promote chromatin loop formation, creating long-range interactions in which relatively distant genes can be transcribed simultaneously. In this study, we examined long-range interactions between the POU5F1 gene and genes previously identified as being POU5F1 enhancer-interacting, namely, CDYL, TLE2, RARG, and MSX1 (all involved in transcriptional regulation), in human pluripotent stem cells (hPSCs) and their early differentiated counterparts. RUNX1, which is expressed during hematopoietic differentiation and is not associated with pluripotency, was used as a control gene. To reveal how these long-range interactions between POU5F1 and the selected genes change with the onset of differentiation and upon RNAP II inhibition, we performed three-dimensional fluorescence in situ hybridization (3D-FISH) followed by computational simulation analysis. Our analysis showed that the numbers of long-range interactions between specific genes decrease during differentiation, suggesting that the transcription of monitored genes is associated with pluripotency. In addition, we showed that upon inhibition of RNAP II, long-range associations do not disintegrate and remain constant. We also analyzed the distance distributions of these genes in the context of their positions in the nucleus and revealed that they tend to have similar patterns resembling a normal distribution. Furthermore, we compared data generated in vitro and in silico to assess the biological relevance of our results.

Intranuclear trafficking of transcription factors: implications for biological control

Journal of Cell Science ◽

10.1242/jcs.113.14.2527 ◽

2000 ◽

Vol 113 (14) ◽

pp. 2527-2533 ◽

Cited By ~ 1

Author(s):

G.S. Stein ◽

A.J. van Wijnen ◽

J.L. Stein ◽

J.B. Lian ◽

M. Montecino ◽

...

Keyword(s):

Gene Expression ◽

Transcription Factors ◽

Spatial Organization ◽

Three Dimensional ◽

Myelogenous Leukemia ◽

Regulatory Factors ◽

Acute Myelogenous ◽

Gene Transcripts ◽

Targeting Signals ◽

Discrete Domains

The subnuclear organization of nucleic acids and cognate regulatory factors suggests that there are functional interrelationships between nuclear structure and gene expression. Nuclear proteins that are localized in discrete domains within the nucleus include the leukemia-associated acute myelogenous leukemia (AML) and promyelocytic leukemia (PML) factors, the SC-35 RNA-processing factors, nucleolar proteins and components of both transcriptional and DNA replication complexes. Mechanisms that control the spatial distribution of transcription factors within the three-dimensional context of the nucleus may involve the sorting of regulatory information, as well as contribute to the assembly and activity of sites that support gene expression. Molecular, cellular, genetic and biochemical approaches have identified distinct protein segments, termed intranuclear-targeting signals, that are responsible for directing regulatory factors to specific subnuclear sites. Gene rearrangements that remove or alter intranuclear-targeting signals are prevalent in leukemias and have been linked to altered localization of regulatory factors within the nucleus. These modifications in the intranuclear targeting of transcription factors might abrogate fidelity of gene expression in tumor cells by influencing the spatial organization and/or assembly of machineries involved in the synthesis and processing of gene transcripts.

Long-Range Interactions in Neuronal Gene Expression: Evidence from Gene Targeting in the GABAA Receptor β2–α6–α1–γ2 Subunit Gene Cluster

Molecular and Cellular Neuroscience ◽

10.1006/mcne.2000.0856 ◽

2000 ◽

Vol 16 (1) ◽

pp. 34-41 ◽

Cited By ~ 48

Author(s):

M. Uusi-Oukari ◽

J. Heikkilä ◽

S.T. Sinkkonen ◽

R. Mäkelä ◽

B. Hauer ◽

...

Keyword(s):

Gene Expression ◽

Gene Cluster ◽

Gene Targeting ◽

GABAA Receptor ◽

Long Range ◽

Subunit Gene ◽

Neuronal Gene ◽

Long Range Interactions ◽

Expression Evidence

Long-Range Interactions in Riboswitch Control of Gene Expression

Annual Review of Biophysics ◽

10.1146/annurev-biophys-070816-034042 ◽

2017 ◽

Vol 46 (1) ◽

pp. 455-481 ◽

Cited By ~ 33

Author(s):

Christopher P. Jones ◽

Adrian R. Ferré-D'Amaré

Keyword(s):

Gene Expression ◽

Long Range ◽

Control Of Gene Expression ◽

Long Range Interactions

Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features

10.1101/215871 ◽

2017 ◽

Cited By ~ 2

Author(s):

Yan Kai ◽

Jaclyn Andricovich ◽

Zhouhao Zeng ◽

Jun Zhu ◽

Alexandros Tzatsos ◽

...

Keyword(s):

Gene Expression ◽

Long Range ◽

Zinc Finger Protein ◽

Cell Types ◽

Learning Framework ◽

Chromatin Interactions ◽

Long Range Interactions ◽

Extensive Cell ◽

Cell Type Specific ◽

And Function

The CCCTC-binding zinc finger protein (CTCF)-mediated network of long-range chromatin interactions is important for genome organization and function. Although this network has been considered largely invariant, we found that it exhibits extensive cell-type-specific interactions that contribute to cell identity. Here we present Lollipop, a machine-learning framework that predicts CTCF-mediated long-range interactions using genomic and epigenomic features. Using ChIA-PET data as a benchmark, we demonstrated that Lollipop accurately predicts CTCF-mediated chromatin interactions both within and across cell types, and outperforms other methods based only on CTCF motif orientation. Predictions were confirmed computationally and experimentally by Chromatin Conformation Capture (3C). Moreover, our approach reveals novel determinants of CTCF-mediated chromatin wiring, such as gene expression within the loops. Our study contributes to a better understanding of the underlying principles of CTCF-mediated chromatin interactions and their impact on gene expression.

BEHST: genomic set enrichment analysis enhanced through integration of chromatin long-range interactions

10.1101/168427 ◽

2019 ◽

Cited By ~ 4

Author(s):

Davide Chicco ◽

Haixin Sarah Bi ◽

Jüri Reimand ◽

Michael M. Hoffman

Keyword(s):

Long Range ◽

Genome Wide Association Study ◽

Gene List ◽

Enrichment Analysis ◽

Computational Method ◽

Pathway Enrichment ◽

Long Range Interactions ◽

Genomic Regions ◽

GO Terms

Transforming data from genome-scale assays into knowledge of affected molecular functions and pathways is a key challenge in biomedical research. Using vocabularies of functional terms and databases annotating genes with these terms, pathway enrichment methods can identify terms enriched in a gene list. With data that can refer to intergenic regions, however, one must first connect the regions to the terms, which are usually annotated only to genes. To make these connections, existing pathway enrichment approaches apply unwarranted assumptions such as annotating non-coding regions with the terms from adjacent genes. We developed a computational method that instead links genomic regions to annotations using data on long-range chromatin interactions. Our method, Biological Enrichment of Hidden Sequence Targets (BEHST), finds Gene Ontology (GO) terms enriched in genomic regions more precisely and accurately than existing methods. We demonstrate BEHST’s ability to retrieve more pertinent and less ambiguous GO terms associated with results of in vivo mouse enhancer screens or enhancer RNA assays for multiple tissue types. BEHST will accelerate the discovery of affected pathways mediated through long-range interactions that explain non-coding hits in genome-wide association study (GWAS) or genome editing screens. BEHST is free software with a command-line interface for Linux or macOS and a web interface (http://behst.hoffmanlab.org/).
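
The general idea of linking a non-coding region to genes through chromatin interactions, rather than through adjacency, can be sketched as follows. This is a toy illustration under stated assumptions and not BEHST's actual algorithm or interface: the function names, the half-open interval convention, and all coordinates and gene names are hypothetical.

```python
# Toy region-to-gene linking through interaction anchors (hypothetical
# sketch; not BEHST's implementation). Intervals are (chrom, start, end),
# half-open.

def overlaps(a, b):
    """Return True if two (chrom, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def genes_linked_by_interactions(region, interactions, tss):
    """Collect genes whose TSS lies in the partner anchor of any
    interaction whose other anchor overlaps the query region."""
    linked = set()
    for left, right in interactions:
        for anchor, partner in ((left, right), (right, left)):
            if not overlaps(region, anchor):
                continue
            for gene, (chrom, pos) in tss.items():
                if overlaps((chrom, pos, pos + 1), partner):
                    linked.add(gene)
    return linked

# Made-up data: one interaction joins a distal region to a far-away promoter.
region = ("chr1", 1000, 1500)
interactions = [(("chr1", 900, 1600), ("chr1", 500000, 501000))]
tss = {"GENE_A": ("chr1", 2000), "GENE_B": ("chr1", 500500)}

linked = genes_linked_by_interactions(region, interactions, tss)
```

Here the nearby GENE_A is not linked, while the distal GENE_B is, because only GENE_B's transcription start site falls in the partner anchor. Annotations of the linked genes could then feed a standard enrichment test.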

Nuclear organization and gene expression: homologous pairing and long-range interactions

Current Opinion in Cell Biology ◽

10.1016/s0955-0674(97)80012-9 ◽

1997 ◽

Vol 9 (3) ◽

pp. 388-395 ◽

Cited By ~ 50

Author(s):

Steven Henikoff

Keyword(s):

Gene Expression ◽

Long Range ◽

Nuclear Organization ◽

Homologous Pairing ◽

Long Range Interactions

Effective gene expression prediction from sequence by integrating long-range interactions

10.1101/2021.04.07.438649 ◽

2021 ◽

Author(s):

Žiga Avsec ◽

Vikram Agarwal ◽

Daniel Visentin ◽

Joseph R. Ledsam ◽

Agnieszka Grabska-Barwinska ◽

...

Keyword(s):

Gene Expression ◽

Long Range ◽

DNA Sequence ◽

Human Genetics ◽

Saturation Mutagenesis ◽

Specific Gene ◽

Range Interaction ◽

Long Range Interactions ◽

Noncoding Variant ◽

Variant Effect

The next phase of genome biology research requires understanding how DNA sequence encodes phenotypes, from the molecular to organismal levels. How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequence through the use of a new deep learning architecture called Enformer that is able to integrate long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Notably, Enformer outperformed the best team on the critical assessment of genome interpretation (CAGI5) challenge for noncoding variant interpretation with no additional training. Furthermore, Enformer learned to predict promoter-enhancer interactions directly from DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of growing human disease associations to cell-type-specific gene regulatory mechanisms and provide a framework to interpret cis-regulatory evolution. To foster these downstream applications, we have made the pre-trained Enformer model openly available, and provide pre-computed effect predictions for all common variants in the 1000 Genomes dataset.

One-sentence summary: Improved noncoding variant effect prediction and candidate enhancer prioritization from a more accurate sequence to expression model driven by extended long-range interaction modelling.

Effective gene expression prediction from sequence by integrating long-range interactions

Nature Methods ◽

10.1038/s41592-021-01252-x ◽

2021 ◽

Vol 18 (10) ◽

pp. 1196-1203 ◽

Cited By ~ 1

Author(s):

Žiga Avsec ◽

Vikram Agarwal ◽

Daniel Visentin ◽

Joseph R. Ledsam ◽

Agnieszka Grabska-Barwinska ◽

...

Keyword(s):

Gene Expression ◽

Long Range ◽

DNA Sequences ◽

Human Genetics ◽

Cell Types ◽

Saturation Mutagenesis ◽

Noncoding DNA ◽

Long Range Interactions ◽

Disease Associations ◽

Reporter Assays

How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer–promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.

Theoretical characterization of long-range interactions in the Ne+ (2P) + H2(1σ+g) charge-transfer states

Molecular Physics ◽

10.1080/00268979809482206 ◽

1998 ◽

Vol 93 (2) ◽

pp. 229-240 ◽

Cited By ~ 5

Author(s):

M.F. Falcetta ◽

P.E. Siska

Keyword(s):

Charge Transfer ◽

Long Range ◽

Long Range Interactions ◽

Theoretical Characterization
