Predicting Enhancer-Promoter Interaction from Genomic Sequence with Deep Neural Networks

Abstract: In the human genome, distal enhancers are involved in regulating target genes through proximal promoters by forming enhancer-promoter interactions. Although recently developed high-throughput experimental approaches have allowed us to recognize potential enhancer-promoter interactions genome-wide, it is still largely unclear to what extent the sequence-level information encoded in our genome helps guide such interactions. Here we report a new computational method (named “SPEID”) that uses deep learning models to predict enhancer-promoter interactions from sequence-based features only, given the locations of putative enhancers and promoters in a particular cell type. Our results across six different cell types demonstrate that SPEID is effective in predicting enhancer-promoter interactions as compared to state-of-the-art methods that only use information from a single cell type. As a proof-of-principle, we also applied SPEID to identify somatic non-coding mutations in melanoma samples that may have reduced enhancer-promoter interactions in tumor genomes. This work demonstrates that deep learning models can help reveal that sequence-based features alone are sufficient to reliably predict enhancer-promoter interactions genome-wide.
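
The abstract above does not say how sequences are presented to the model. A common choice for sequence-only deep learning, assumed here purely for illustration, is one-hot encoding of each base. The function name `one_hot` and the A, C, G, T column order are hypothetical and not taken from the paper.

```python
def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows in A, C, G, T order.

    Ambiguous bases such as N become all-zero rows (an assumption of this sketch).
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


# An enhancer and a promoter would each be encoded separately before being
# passed to a model; the sequences below are made up.
enhancer_matrix = one_hot("ACGTN")
promoter_matrix = one_hot("ttga")
```

Each sequence becomes a length-by-4 matrix, which is the kind of input a convolutional or recurrent layer can consume directly.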

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.1101/2020.05.13.093997 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Dna Sequences ◽

Cell Types ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

Cell Type Specific ◽

Different Cell Types

Abstract: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
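
The k-mer fold-change features described above can be sketched as follows. This is one possible reading of the abstract, not the SeqEnhDL implementation: the function names, the pseudocount, the log2 scale, and the value 0.0 for unseen k-mers are all assumptions made for illustration.

```python
from collections import Counter
import math


def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_log2_fold_changes(enhancers, background, k, pseudocount=1.0):
    """log2 ratio of smoothed k-mer frequencies, enhancers vs. background."""
    fg = kmer_counts(enhancers, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = set(fg) | set(bg)
    n = len(vocab)
    return {
        kmer: math.log2(
            ((fg[kmer] + pseudocount) / (fg_total + pseudocount * n))
            / ((bg[kmer] + pseudocount) / (bg_total + pseudocount * n))
        )
        for kmer in vocab
    }


def sequential_features(seq, fold_changes, k):
    """Fold change of the k-mer starting at each position; 0.0 if unseen."""
    seq = seq.upper()
    return [fold_changes.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]


# Toy sequences, made up for the example.
fc = kmer_log2_fold_changes(["ACGTACGT"], ["AAAAAAAA"], k=2)
features = sequential_features("ACAA", fc, k=2)
```

Repeating the computation for each k in the abstract (5, 7, 9 and 11) would give one feature track per k, which could then be fed to an MLP, CNN or RNN.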

Systematic evaluation of cell-type deconvolution pipelines for sequencing-based bulk DNA methylomes

10.1101/2021.11.29.470374 ◽

2021 ◽

Author(s):

Yunhee Jeong ◽

Reka Toth ◽

Marlene Ganslmeier ◽

Kersten Breuer ◽

Christoph Plass ◽

...

Keyword(s):

Cell Types ◽

Systematic Evaluation ◽

Cell Type ◽

Factors Affecting ◽

Genome Wide ◽

Cell Type Composition ◽

Type Composition ◽

Level Information ◽

Genomic Regions ◽

The Impact

DNA methylation sequencing is becoming increasingly popular, yielding genome-wide methylome data at single-base-pair resolution through novel cost- and labor-optimized protocols. Because of its intrinsic read-level information, it has tremendous potential for cell-type heterogeneity analysis, particularly in tumors. Although diverse deconvolution methods have been developed to infer cell-type composition from bulk sequencing-based methylomes, they have not yet been systematically evaluated. Here, we thoroughly review and evaluate five previously published deconvolution methods: Bayesian epiallele detection (BED), PRISM, csmFinder + coMethy, ClubCpG and MethylPurify, together with two array-based methods, MeDeCom and Houseman, as a comparison group. Sequencing-based deconvolution methods consist of two main steps: informative region selection and cell-type composition estimation. Accordingly, we assessed the performance of each step individually and demonstrated the impact of the former step on the performance of the latter. In conclusion, we identify the method with the highest accuracy in different samples and infer the factors affecting cell-type deconvolution performance, which differ according to the number of cell types in the mixture. Whereas selecting genomic regions similar to DMRs generally increased performance in bi-component mixtures, the uniformity of the cell-type distribution was highly correlated with performance in five-cell-type bulk analyses.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.21203/rs.3.rs-94396/v1 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Cell Types ◽

Regulatory Elements ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

A Genome ◽

Cell Type Specific

Abstract: Objective: Computational identification of cell type-specific regulatory elements on a genome-wide scale is very challenging. Results: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

BMC Research Notes ◽

10.1186/s13104-021-05518-7 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Cell Types ◽

Regulatory Elements ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

A Genome ◽

Cell Type Specific

Abstract: Objective: To address the challenge of computational identification of cell type-specific regulatory elements on a genome-wide scale. Results: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, positional k-mer (k = 5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences across each nucleotide position were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers (including gkm-SVM and DanQ) in distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL can directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified based on their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.

Fully-automated cell-type identification with specific markers extracted from single-cell transcriptomic data

10.1101/812131 ◽

2019 ◽

Author(s):

Aleksandr Ianevski ◽

Anil K Giri ◽

Tero Aittokallio

Keyword(s):

Single Cell ◽

Cell Types ◽

Computational Method ◽

Marker Genes ◽

Cell Type ◽

Cell Clusters ◽

Mouse Tissues ◽

Single Cell Type ◽

The Individual ◽

Selection Of

Abstract: Single-cell transcriptomics enables systematic charting of the cellular composition of complex tissues. Identification of cell populations often relies on unsupervised clustering of cells based on the similarity of their scRNA-seq profiles, followed by manual annotation of cell clusters using established marker genes. However, manual selection of marker genes for cell-type annotation is a laborious and error-prone task, since the selected markers must be specific both to the individual cell clusters and to the various cell types. Here, we developed a computational method, termed ScType, which enables data-driven selection of marker genes based solely on given scRNA-seq data. Using a compendium of 7 scRNA-seq datasets from various human and mouse tissues, we demonstrate how ScType enables unbiased, accurate and fully-automated single-cell type annotation by guaranteeing the specificity of marker genes both across cell clusters and cell types. The widely-applicable method is implemented as an interactive web-tool (https://sctype.fimm.fi), connected with a comprehensive database of specific markers.

Comprehensive enhancer-target gene assignments improve gene set level interpretation of genome-wide regulatory data

10.1101/2020.10.22.351049 ◽

2020 ◽

Author(s):

Tingting Qin ◽

Christopher Lee ◽

Raymond Cavalcante ◽

Peter Orchard ◽

Heming Yao ◽

...

Keyword(s):

Target Genes ◽

Cell Types ◽

Regulatory Elements ◽

Systematic Evaluation ◽

Cell Type ◽

Gene Set ◽

Go Annotation ◽

Genome Wide ◽

Comparable Performance ◽

Tf Gene

Abstract: Revealing the gene targets of distal regulatory elements is challenging yet critical for interpreting regulome data. Experiment-derived enhancer-gene links are restricted to a small set of enhancers and/or cell types, while the accuracy of genome-wide approaches remains elusive due to the lack of a systematic evaluation. We combined multiple spatial and in silico approaches for defining enhancer locations and linking them to their target genes aggregated across >500 cell types, generating 1,860 human genome-wide distal Enhancer to Target gene Definitions (EnTDefs). To evaluate performance, we used gene set enrichment testing on 87 independent ENCODE ChIP-seq datasets of 34 transcription factors (TFs) and assessed concordance of results with known TF Gene Ontology (GO) annotations, assuming that greater concordance with TF-GO annotation signifies better enrichment results and thus more accurate enhancer-to-gene assignments. Notably, the top ranked 741 (40%) EnTDefs significantly outperformed the common, naïve approach of linking distal regions to the nearest genes (FDR < 0.05), and the top 10 ranked EnTDefs performed well when applied to ChIP-seq data of other cell types. These general EnTDefs also showed comparable performance to EnTDefs generated using cell-type-specific data. Our findings illustrate the power of our approach to provide genome-wide interpretation regardless of cell type.
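
For reference, the naïve baseline mentioned above (linking a distal region to its nearest gene) can be sketched in a few lines. The function name, the use of the region midpoint, the use of transcription start sites (TSSs) as gene positions, and the gene names are assumptions of this sketch; the abstract does not specify them.

```python
from bisect import bisect_left


def nearest_gene(region_mid, tss_positions, gene_names):
    """Return the gene whose TSS is closest to region_mid on one chromosome.

    tss_positions must be sorted ascending and aligned with gene_names.
    On an exact tie, the gene with the smaller TSS coordinate is returned.
    """
    i = bisect_left(tss_positions, region_mid)
    # Only the TSS just below and the TSS at or above the midpoint can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - region_mid))
    return gene_names[best]


# Hypothetical TSS coordinates and gene names on a single chromosome.
tss = [1000, 5000, 9000]
genes = ["geneA", "geneB", "geneC"]
linked = nearest_gene(5200, tss, genes)
```

The binary search keeps each lookup logarithmic in the number of genes, which matters when linking many thousands of regions.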

EMeth: An EM algorithm for cell type decomposition based on DNA methylation data

Scientific Reports ◽

10.1038/s41598-021-84864-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hanyu Zhang ◽

Ruoyi Cai ◽

James Dai ◽

Wei Sun

Keyword(s):

Dna Methylation ◽

Tumor Cells ◽

T Regulatory Cells ◽

Simulated Data ◽

Cell Types ◽

Computational Method ◽

Methylation Data ◽

Cell Type ◽

A Cell ◽

Type Decomposition

Abstract: We introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on the existing reference-based methods by detecting the CpGs whose DNA methylation is inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by the case of studying methylation in tumor cells, where bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and the tumor cell proportion can be estimated from copy number data. We demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods using simulated data and in silico mixtures. Applications in cancer studies show that the proportions of T regulatory cells estimated by DNA methylation have expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.
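
The idea of reference-based deconvolution with reduced weight on inconsistent CpGs can be illustrated with a deliberately simple two-cell-type case. This is not the EMeth algorithm: it is a weighted least-squares sketch in which the function name, the hand-set weights, and all methylation values are assumptions made for illustration.

```python
def estimate_proportion(bulk, ref_a, ref_b, weights=None):
    """Weighted least-squares estimate of p in bulk = p*ref_a + (1-p)*ref_b.

    Each list holds one methylation value per CpG. A weight of 0 removes a
    CpG from the fit; the result is clipped to the interval [0, 1].
    """
    if weights is None:
        weights = [1.0] * len(bulk)
    num = 0.0
    den = 0.0
    for y, a, b, w in zip(bulk, ref_a, ref_b, weights):
        d = a - b
        num += w * (y - b) * d
        den += w * d * d
    if den == 0.0:
        raise ValueError("reference profiles do not differ at any weighted CpG")
    return min(1.0, max(0.0, num / den))


# Made-up reference profiles and a 30% / 70% mixture of the two cell types.
ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.7, 0.3, 0.6]
bulk = [0.3 * a + 0.7 * b for a, b in zip(ref_a, ref_b)]
bulk[0] = 0.95  # one CpG that no longer fits the mixture model

naive = estimate_proportion(bulk, ref_a, ref_b)
robust = estimate_proportion(bulk, ref_a, ref_b, weights=[0.0, 1.0, 1.0, 1.0])
```

With the inconsistent CpG given zero weight, the estimate returns to the true mixing proportion, whereas the unweighted fit is pulled away from it. A method in the spirit of the abstract would infer such weights from the data rather than set them by hand.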

HCR-FlowFISH: A flexible CRISPR screening method to identify cis-regulatory elements and their target genes

10.1101/2020.05.11.078675 ◽

2020 ◽

Author(s):

SK Reilly ◽

SJ Gosai ◽

A Gutierrez ◽

JC Ulirsch ◽

M Kanai ◽

...

Keyword(s):

Gene Expression ◽

Target Genes ◽

Screening Method ◽

Cell Types ◽

Regulatory Elements ◽

Hybridization Chain Reaction ◽

Genome Wide ◽

Wide Range ◽

Causal Variants ◽

Endogenous Loci

Abstract: CRISPR screens for cis-regulatory elements (CREs) have shown unprecedented power to endogenously characterize the non-coding genome. To characterize CREs we developed HCR-FlowFISH (Hybridization Chain Reaction Fluorescent In-Situ Hybridization coupled with Flow Cytometry), which directly quantifies native transcripts within their endogenous loci following CRISPR perturbations of regulatory elements, eliminating the need for restrictive phenotypic assays such as growth or transcript-tagging. HCR-FlowFISH accurately quantifies gene expression across a wide range of transcript levels and cell types. We also developed CASA (CRISPR Activity Screen Analysis), a hierarchical Bayesian model to identify and quantify CRE activity. Using >270,000 perturbations, we identified CREs for GATA1, HDAC6, ERP29, LMO2, MEF2C, CD164, NMU, FEN1 and the FADS gene cluster. Our methods detect subtle gene expression changes and identify CREs regulating multiple genes, sometimes at different magnitudes and directions. We demonstrate the power of HCR-FlowFISH to parse genome-wide association signals by nominating causal variants and target genes.

A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification

10.1101/247114 ◽

2018 ◽

Cited By ~ 1

Author(s):

Douglas Abrams ◽

Parveen Kumar ◽

R. Krishna Murthy Karuturi ◽

Joshy George

Keyword(s):

Experimental Design ◽

Single Cell ◽

Single Cells ◽

Cell Types ◽

Cell Number ◽

Fold Change ◽

Computational Method ◽

Marker Genes ◽

Cell Type ◽

Estimate Sample Size

Abstract: Background: The advent of single cell RNA sequencing (scRNA-seq) enabled researchers to study transcriptomic activity within individual cells and identify inherent cell types in the sample. Although numerous computational tools have been developed to analyze single cell transcriptomes, there are no published studies or analytical packages available to guide experimental design and to devise a suitable analysis procedure for cell type identification. Results: We have developed an empirical methodology to address this important gap in single cell experimental design and analysis, implemented as an easy-to-use tool called SCEED (Single Cell Empirical Experimental Design and analysis). With SCEED, users can choose a variety of combinations of tools for analysis, conduct performance analysis of analytical procedures and choose the best procedure, and estimate the sample size (number of cells to be profiled) required for a given analytical procedure at varying levels of cell type rarity and other experimental parameters. Using SCEED, we examined 3 single cell algorithms using 48 simulated single cell datasets that were generated for varying numbers of cell types and their proportions, number of genes expressed per cell, number of marker genes and their fold change, and number of single cells successfully profiled in the experiment. Conclusions: Based on our study, we found that when marker genes are expressed at a fold change of 4 or more relative to the rest of the genes, either the Seurat or the Simlr algorithm can be used to analyze a single cell dataset for any number of single cells isolated (a minimum of 1000 single cells was tested). However, when marker genes are expected to be upregulated by a fold change of at most 2, the choice of single cell algorithm depends on the number of single cells isolated and the proportion of the rare cell type to be identified. In conclusion, our work allows the assessment of various single cell methods and also aids in examining the single cell experimental design.

A compendium of uniformly processed human gene expression and splicing quantitative trait loci

Nature Genetics ◽

10.1038/s41588-021-00924-w ◽

2021 ◽

Vol 53 (9) ◽

pp. 1290-1299

Author(s):

Nurlan Kerimov ◽

James D. Hayhurst ◽

Kateryna Peikova ◽

Jonathan R. Manning ◽

Peter Walter ◽

...

Keyword(s):

Gene Expression ◽

Quantitative Trait ◽

Target Genes ◽

Genome Wide Association Study ◽

Cell Types ◽

Summary Statistics ◽

Genome Wide ◽

Cell Type Specific ◽

Trait Locus ◽

Complex Human Traits

Abstract: Many gene expression quantitative trait locus (eQTL) studies have published their summary statistics, which can be used to gain insight into complex human traits by downstream analyses, such as fine mapping and co-localization. However, technical differences between these datasets are a barrier to their widespread use. Consequently, target genes for most genome-wide association study (GWAS) signals have still not been identified. Here, we present the eQTL Catalogue (https://www.ebi.ac.uk/eqtl), a resource of quality-controlled, uniformly re-computed gene expression and splicing QTLs from 21 studies. We find that, for matching cell types and tissues, the eQTL effect sizes are highly reproducible between studies. Although most QTLs were shared between most bulk tissues, we identified a greater diversity of cell-type-specific QTLs from purified cell types, a subset of which also manifested as new disease co-localizations. Our summary statistics are freely available to enable the systematic interpretation of human GWAS associations across many cell types and tissues.
