Effective gene expression prediction from sequence by integrating long-range interactions

2021 ◽  
Vol 18 (10) ◽  
pp. 1196-1203 ◽  
Author(s):  
Žiga Avsec ◽  
Vikram Agarwal ◽  
Daniel Visentin ◽  
Joseph R. Ledsam ◽  
Agnieszka Grabska-Barwinska ◽  
...  

Abstract: How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer–promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.

2021 ◽  
Author(s):  
Žiga Avsec ◽  
Vikram Agarwal ◽  
Daniel Visentin ◽  
Joseph R. Ledsam ◽  
Agnieszka Grabska-Barwinska ◽  
...  

Abstract: The next phase of genome biology research requires understanding how DNA sequence encodes phenotypes, from the molecular to organismal levels. How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequence through the use of a new deep learning architecture called Enformer that is able to integrate long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Notably, Enformer outperformed the best team on the Critical Assessment of Genome Interpretation (CAGI5) challenge for noncoding variant interpretation with no additional training. Furthermore, Enformer learned to predict promoter-enhancer interactions directly from DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of growing human disease associations to cell-type-specific gene regulatory mechanisms and provide a framework to interpret cis-regulatory evolution. To foster these downstream applications, we have made the pre-trained Enformer model openly available, and provide pre-computed effect predictions for all common variants in the 1000 Genomes dataset.

One-sentence summary: Improved noncoding variant effect prediction and candidate enhancer prioritization from a more accurate sequence to expression model driven by extended long-range interaction modelling.


2017 ◽  
Author(s):  
Yan Kai ◽  
Jaclyn Andricovich ◽  
Zhouhao Zeng ◽  
Jun Zhu ◽  
Alexandros Tzatsos ◽  
...  

Abstract: The CCCTC-binding zinc finger protein (CTCF)-mediated network of long-range chromatin interactions is important for genome organization and function. Although this network has been considered largely invariant, we found that it exhibits extensive cell-type-specific interactions that contribute to cell identity. Here we present Lollipop, a machine-learning framework that predicts CTCF-mediated long-range interactions using genomic and epigenomic features. Using ChIA-PET data as a benchmark, we demonstrated that Lollipop accurately predicts CTCF-mediated chromatin interactions both within and across cell types, and outperforms other methods based only on CTCF motif orientation. Predictions were confirmed computationally and experimentally by Chromatin Conformation Capture (3C). Moreover, our approach reveals novel determinants of CTCF-mediated chromatin wiring, such as gene expression within the loops. Our study contributes to a better understanding of the underlying principles of CTCF-mediated chromatin interactions and their impact on gene expression.


1995 ◽  
Vol 51 (5) ◽  
pp. 5084-5091 ◽  
Author(s):  
S. V. Buldyrev ◽  
A. L. Goldberger ◽  
S. Havlin ◽  
R. N. Mantegna ◽  
M. E. Matsa ◽  
...  

2019 ◽  
Vol 2019 ◽  
pp. 1-12
Author(s):  
Livia Eiselleova ◽  
Viktor Lukjanov ◽  
Simon Farkas ◽  
David Svoboda ◽  
Karel Stepka ◽  
...  

The eukaryotic nucleus is a highly complex structure that carries out multiple functions primarily needed for gene expression, and among them, transcription seems to be the most fundamental. Diverse approaches have demonstrated that transcription takes place at discrete sites known as transcription factories, wherein RNA polymerase II (RNAP II) is attached to the factory and immobilized while transcribing DNA. It has been proposed that transcription factories promote chromatin loop formation, creating long-range interactions in which relatively distant genes can be transcribed simultaneously. In this study, we examined long-range interactions between the POU5F1 gene and genes previously identified as being POU5F1 enhancer-interacting, namely, CDYL, TLE2, RARG, and MSX1 (all involved in transcriptional regulation), in human pluripotent stem cells (hPSCs) and their early differentiated counterparts. As a control gene, RUNX1 was used, which is expressed during hematopoietic differentiation and not associated with pluripotency. To reveal how these long-range interactions between POU5F1 and the selected genes change with the onset of differentiation and upon RNAP II inhibition, we performed three-dimensional fluorescence in situ hybridization (3D-FISH) followed by computational simulation analysis. Our analysis showed that the numbers of long-range interactions between specific genes decrease during differentiation, suggesting that the transcription of monitored genes is associated with pluripotency. In addition, we showed that upon inhibition of RNAP II, long-range associations do not disintegrate and remain constant. We also analyzed the distance distributions of these genes in the context of their positions in the nucleus and revealed that they tend to have similar patterns resembling a normal distribution. Furthermore, we compared data created in vitro and in silico to assess the biological relevance of our results.


2014 ◽  
Vol 35 (1) ◽  
pp. 224-237 ◽  
Author(s):  
Zhijun Qiu ◽  
Carolyn Song ◽  
Navid Malakouti ◽  
Daniel Murray ◽  
Aymen Hariz ◽  
...  

Gene expression frequently requires chromatin-remodeling complexes, and it is assumed that these complexes have common gene targets across cell types. Contrary to this belief, we show by genome-wide expression profiling that Bptf, an essential and unique subunit of the nucleosome-remodeling factor (NURF), predominantly regulates the expression of a unique set of genes between diverse cell types. Coincident with its functions in gene expression, we observed that Bptf is also important for regulating nucleosome occupancy at nucleosome-free regions (NFRs), many of which are located at sites occupied by the multivalent factors Ctcf and cohesin. NURF function at Ctcf binding sites could be direct, because Bptf occupies Ctcf binding sites in vivo and has physical interactions with CTCF and the cohesin subunit SA2. Assays of several Ctcf binding sites using reporter assays showed that their regulatory activity requires Bptf in two different cell types. Focused studies at H2-K1 showed that Bptf regulates the ability of Klf4 to bind near an upstream Ctcf site, possibly influencing gene expression. In combination, these studies demonstrate that gene expression as regulated by NURF occurs partly through physical and functional interactions with the ubiquitous and multivalent factors Ctcf and cohesin.


2008 ◽  
Vol 205 (4) ◽  
pp. 747-750 ◽  
Author(s):  
Adam Williams ◽  
Richard A. Flavell

The spatial organization of the genome is thought to play an important part in the coordination of gene regulation. New techniques have been used to identify specific long-range interactions between distal DNA sequences, revealing an ever-increasing complexity to nuclear organization. CCCTC-binding factor (CTCF) is a versatile zinc finger protein with diverse regulatory functions. New data now help define how CTCF mediates both long-range intrachromosomal and interchromosomal interactions, and highlight CTCF as an important factor in determining the three-dimensional structure of the genome.


2020 ◽  
Author(s):  
Jeremy Bigness ◽  
Xavi Loinaz ◽  
Shalin Patel ◽  
Erica Larschan ◽  
Ritambhara Singh

Long-range spatial interactions among genomic regions are critical for regulating gene expression and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors on gene expression, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying biological system. This prevents the field from obtaining a more comprehensive understanding of gene regulation and from fully leveraging the structural information present in the data sets. Here, we propose a graph convolutional neural network (GCNN) framework to integrate measurements probing spatial genomic organization and measurements of local regulatory factors, specifically histone modifications, to predict gene expression. This formulation enables the model to incorporate crucial information about long-range interactions via a natural encoding of spatial interaction relationships into a graph representation. Furthermore, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions that contribute to a gene's predicted expression. We apply our GCNN model to datasets for GM12878 (lymphoblastoid) and K562 (myelogenous leukemia) cell lines and demonstrate its state-of-the-art prediction performance. We also obtain importance scores corresponding to the histone mark features and interacting regions for some exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal datasets.


2019 ◽  
Author(s):  
Chen-Hao Chen ◽  
Rongbin Zheng ◽  
Jingyu Fan ◽  
Myles Brown ◽  
Jun S. Liu ◽  
...  

Abstract: To characterize the genomic distances over which transcription factors (TFs) influence gene expression, we examined thousands of TF and histone modification ChIP-seq datasets and thousands of gene expression profiles. A model integrating these data revealed two classes of TF: one with short-range regulatory influence, the other with long-range regulatory influence. The two TF classes also had distinct chromatin-binding preferences and auto-regulatory properties. The regulatory range of a single TF bound within different topologically associating domains (TADs) depended on intrinsic TAD properties such as local gene density and G/C content, but also on the TAD chromatin state in specific cell types. Our results provide evidence that most TFs belong to one of these two functional classes, and that the regulatory range of long-range TFs is chromatin-state dependent. Thus, consideration of TF type, distance-to-target, and chromatin context is likely important in identifying TF regulatory targets and interpreting GWAS and eQTL SNPs.


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0244864
Author(s):  
Carlos Mora-Martinez

Large amounts of effort have been invested in trying to understand how a single genome is able to specify the identity of hundreds of cell types. Inspired by some aspects of Caenorhabditis elegans biology, we implemented an in silico evolutionary strategy to produce gene regulatory networks (GRNs) that drive cell-specific gene expression patterns, mimicking the process of terminal cell differentiation. Dynamics of the gene regulatory networks are governed by a thermodynamic model of gene expression, which uses DNA sequences and transcription factor degenerate position weight matrices as input. In a version of the model, we included chromatin accessibility. Experimentally, it has been determined that cell-specific and broadly expressed genes are regulated differently. In our in silico evolved GRNs, broadly expressed genes are regulated very redundantly and the architecture of their cis-regulatory modules is different, in accordance with what has been found in C. elegans and also in other systems. Finally, we found differences in topological positions in GRNs between these two classes of genes, which help to explain why broadly expressed genes are so resilient to mutations. Overall, our results offer an explanatory hypothesis on why broadly expressed genes are regulated so redundantly compared to cell-specific genes, which can be extrapolated to phenomena such as ChIP-seq HOT regions.


Development ◽  
1997 ◽  
Vol 124 (12) ◽  
pp. 2325-2334 ◽  
Author(s):  
R. Dosch ◽  
V. Gawantka ◽  
H. Delius ◽  
C. Blumenstock ◽  
C. Niehrs

The marginal zone is a ring of tissue that gives rise to a characteristic dorsoventral pattern of mesoderm in amphibian embryos. Bmp-4 is thought to play an important role in specifying ventral mesodermal fate. Here we show (1) that different doses of Bmp-4 are sufficient to pattern four distinct mesodermal cell types and to pattern gene expression in the early gastrula marginal zone into three domains, (2) that there is a graded requirement for a Bmp signal in mesodermal patterning, and (3) that Bmp-4 has long-range activity which can become graded in the marginal zone by the antagonizing action of noggin. The results argue that Bmp-4 acts as a morphogen in dorsoventral patterning of mesoderm.

