DENIES: A deep learning based two-layer predictor for enhancing the identification of enhancers and their strength with DNA shape information

2021 ◽  
Author(s):  
Li Ye ◽  
Chunquan Li ◽  
Jiquan Ma

The identification of enhancers has always been an important task in bioinformatics owing to their major role in regulating gene expression. For this reason, many computational algorithms devoted to enhancer identification have been put forward over the years, ranging from statistics and machine learning to the increasingly popular deep learning. To boost the performance of these methods, more features tend to be extracted from the DNA sequences alone and integrated to develop an ensemble classifier. Nevertheless, the sequence-derived features used in previous studies can hardly provide the 3D structure information of DNA sequences, which is regarded as an important factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Given that, we here propose DENIES, a deep learning based two-layer predictor for enhancing the identification of enhancers and their strength. Besides two common sequence-derived features (i.e. one-hot and k-mer), it introduces DNA shape for describing the 3D structures of DNA sequences. The results of a performance comparison with a series of state-of-the-art methods conducted on the same datasets demonstrate the effectiveness and robustness of our method. The code implementation of our predictor is freely available at https://github.com/hlju-liye/DENIES.
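The two sequence-derived features named in the abstract, one-hot and k-mer, can be illustrated with a minimal sketch. This is not the DENIES implementation: the function names, the A/C/G/T column order, the default k = 3, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"  # assumed column order for the one-hot encoding


def one_hot(seq):
    """Encode a DNA sequence as one 4-element 0/1 row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def kmer_frequencies(seq, k=3):
    """Return the normalized count of every possible k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]
```

The one-hot matrix keeps positional information, while the k-mer vector has a fixed length of 4^k regardless of sequence length, which is one reason the two are often used together.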

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Kanchan Jha ◽  
Sriparna Saha

Proteins are the primary building blocks of living organisms. They interact with other proteins and are thereby involved in various biological processes. Knowledge of protein–protein interactions (PPIs) helps in predicting, and hence understanding, the functionality of proteins and the causes and growth of diseases, and in designing new drugs. However, there is a vast gap between the available protein sequences and the identified protein–protein interactions. To bridge this gap, researchers have proposed several computational methods to reveal the interactions between proteins. These methods rely only on sequence-based information of proteins. With the advancement of technology, other types of protein information have become available, such as 3D structure information. Deep learning techniques have also been adopted successfully in various domains, including bioinformatics. The current work therefore focuses on using different modalities, namely 3D structures and sequence-based information of proteins, together with deep learning algorithms to predict PPIs. The proposed approach is divided into several phases. We first obtain several representations of proteins using their 3D coordinate information and three attributes of amino acids, the building blocks of proteins: hydropathy index, isoelectric point, and charge. A pre-trained ResNet50 model, a type of convolutional neural network, is used to extract features from these representations. Autocovariance and conjoint triad, two widely used sequence-based methods for encoding proteins, are used here as a second modality. A stacked autoencoder is used to obtain a compact form of the sequence-based information. Finally, the features obtained from the different modalities are concatenated in pairs and fed into the classifier to predict labels for protein pairs.
We experimented on the human PPIs dataset and the Saccharomyces cerevisiae PPIs dataset and compared our results with state-of-the-art deep-learning-based classifiers. The results achieved by the proposed method are superior to those obtained by the existing methods. Extensive experiments on different datasets indicate that our approach to learning and combining features from two different modalities is useful in PPI prediction.


2020 ◽  
Vol 117 (41) ◽  
pp. 25655-25666 ◽  
Author(s):  
Alexandra Maslova ◽  
Ricardo N. Ramirez ◽  
Ke Ma ◽  
Hugo Schmutz ◽  
Chendi Wang ◽  
...  

Although we know many sequence-specific transcription factors (TFs), how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine immune cell differentiation is beyond our grasp. Leveraging a granular atlas of chromatin accessibility across 81 immune cell types, we asked if a convolutional neural network (CNN) could learn to infer cell type-specific chromatin accessibility solely from regulatory DNA sequences. With a tailored architecture and an ensemble approach to CNN parameter interpretation, we show that our trained network (“AI-TAC”) does so by rediscovering ab initio the binding motifs for known regulators and some unknown ones. Motifs whose importance is learned virtually as functionally important overlap strikingly well with positions determined by chromatin immunoprecipitation for several TFs. AI-TAC establishes a hierarchy of TFs and their interactions that drives lineage specification and also identifies stage-specific interactions, like Pax5/Ebf1 vs. Pax5/Prdm1, or the role of different NF-κB dimers in different cell types. AI-TAC assigns Spi1/Cebp and Pax5/Ebf1 as the drivers necessary for myeloid and B lineage fates, respectively, but no factors seemed as dominantly required for T cell differentiation, which may represent a fall-back pathway. Mouse-trained AI-TAC can parse human DNA, revealing a strikingly similar ranking of influential TFs and providing additional support that AI-TAC is a generalizable regulatory sequence decoder. Thus, deep learning can reveal the regulatory syntax predictive of the full differentiative complexity of the immune system.


2021 ◽  
Vol 27 (S1) ◽  
pp. 464-465
Author(s):  
Ramon Manzorro ◽  
Matan Leibovich ◽  
Joshua Vincent ◽  
Sreyas Mohan ◽  
David Matteson ◽  
...  

2017 ◽  
Author(s):  
Sandra Gusewski ◽  
Rainer Melzer ◽  
Florian Rümpler ◽  
Christian Gafert ◽  
Günter Theißen

SEPALLATA3 of Arabidopsis thaliana is a MADS-domain transcription factor and a central player in flower development. MADS-domain proteins bind as dimers to AT-rich sequences termed ‘CArG-boxes’, which share the consensus 5’-CC(A/T)6GG-3’. Since only a fraction of the abundant CArG-boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG-box sequences into genuine recognition sites. Here, we investigated to what extent the shape of the DNA contributes to the DNA-binding specificity of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to a variety of DNA probes which all contain the CArG-box motif but differ in their DNA shape characteristics. We found that binding affinity correlates well with certain DNA shape features associated with ‘A-tracts’. Analysis of SEPALLATA3 proteins with single amino acid substitutions in the DNA-binding MADS-domain further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis-regulatory elements by plant MADS-domain transcription factors depends heavily on shape readout mechanisms and that the absence of a critical arginine residue in the MADS-domain impairs binding specificity.
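The CArG-box consensus 5’-CC(A/T)6GG-3’ quoted above maps directly onto a regular expression, which gives a rough way to list candidate sites in a sequence. The sketch below is only an illustration of the consensus: the function name is hypothetical, and, as the abstract stresses, a consensus match alone does not indicate that SEPALLATA3 actually binds the site.

```python
import re

# CC, then six A/T bases, then GG. The lookahead lets overlapping
# candidate sites be reported as separate matches.
CARG_BOX = re.compile(r"(?=(CC[AT]{6}GG))")


def find_carg_boxes(seq):
    """Return (0-based start, matched site) for every consensus CArG-box in seq."""
    return [(m.start(), m.group(1)) for m in CARG_BOX.finditer(seq.upper())]
```

Because the reverse complement of CC(A/T)6GG again has the form CC(A/T)6GG, scanning one strand is enough to find every consensus match on either strand.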


Author(s):  
Alex M. Tseng ◽  
Avanti Shrikumar ◽  
Anshul Kundaje

Deep learning models can accurately map genomic DNA sequences to associated functional molecular readouts such as protein–DNA binding data. Base-resolution importance (i.e. “attribution”) scores inferred from these models can highlight predictive sequence motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors trained on genome-wide binary or continuous molecular profiles. We show that our attribution prior dramatically improves models’ stability, interpretability, and performance on held-out data, especially when training data is severely limited. Our attribution prior also allows models to identify biologically meaningful sequence motifs more sensitively and precisely within individual regulatory elements. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. This work represents an important advancement in improving the reliability of deep learning models for deciphering the regulatory code of the genome.
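The idea of penalizing high-frequency components of the attribution spectrum can be sketched in a few lines. This is a simplified illustration, not the authors' formulation: the hard frequency cutoff, the `cutoff_fraction` parameter, and the normalization by total magnitude are assumptions, and a real training setup would compute the penalty inside a differentiable framework and add it to the loss.

```python
import cmath
import math


def dft_magnitudes(scores):
    """Magnitudes of the positive-frequency DFT components (k = 1 .. n // 2).

    The k = 0 component (the mean of the scores) is skipped.
    """
    n = len(scores)
    mags = []
    for k in range(1, n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(scores)
        )
        mags.append(abs(total))
    return mags


def high_frequency_penalty(scores, cutoff_fraction=0.25):
    """Fraction of spectral magnitude lying above an assumed frequency cutoff.

    Values near 0 mean smooth attributions; values near 1 mean the
    attributions are dominated by rapid position-to-position oscillation.
    """
    mags = dft_magnitudes(scores)
    total = sum(mags)
    if total == 0:
        return 0.0
    cutoff = max(1, int(len(mags) * cutoff_fraction))
    return sum(mags[cutoff:]) / total
```

A slowly varying attribution track yields a penalty close to 0, whereas scores that alternate in sign from one base to the next yield a penalty close to 1.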


2020 ◽  
Vol 26 ◽  
Author(s):  
Xiaoping Min ◽  
Fengqing Lu ◽  
Chunyan Li

Enhancer-promoter interactions (EPIs) in the human genome are of great significance to transcriptional regulation, which tightly controls gene expression. Identifying EPIs can help us better decipher gene regulation and understand disease mechanisms. However, experimental methods for identifying EPIs are constrained by funding, time and manpower, while computational methods using DNA sequences and genomic features are viable alternatives. Deep learning methods have shown promising prospects in classification, and efforts have been made to use them to identify EPIs. In this survey, we focus specifically on sequence-based deep learning methods and conduct a comprehensive review of the literature on them. We first briefly introduce existing sequence-based frameworks for EPI prediction and their technical details. After that, we elaborate on the datasets, pre-processing methods and evaluation strategies. Finally, we discuss the challenges these methods face and suggest several future opportunities.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2013
Author(s):  
Edian F. Franco ◽  
Pratip Rana ◽  
Aline Cruz ◽  
Víctor V. Calderón ◽  
Vasco Azevedo ◽  
...  

A heterogeneous disease such as cancer is activated through multiple pathways and different perturbations. Depending upon the activated pathway(s), patient survival varies significantly and patients respond differently to various drugs. Therefore, cancer subtype detection using genomics-level data is a significant research problem. Subtype detection is often a complex problem and in most cases needs multi-omics data fusion to achieve accurate subtyping. Different data fusion and subtyping approaches have been proposed over the years, such as kernel-based fusion, matrix factorization, and deep learning autoencoders. In this paper, we compared the performance of different deep learning autoencoders for cancer subtype detection. We performed cancer subtype detection on four different cancer types from The Cancer Genome Atlas (TCGA) datasets using four autoencoder implementations. We also predicted the optimal number of subtypes in a cancer type using the silhouette score and found that the detected subtypes exhibit significant differences in survival profiles. Furthermore, we compared the effect of feature selection and similarity measures on subtype detection. For further evaluation, we used the Glioblastoma multiforme (GBM) dataset and identified the differentially expressed genes in each of the subtypes. The results obtained are consistent with other genomic studies and can be corroborated with the involved pathways and biological functions. This shows that the results from the autoencoders, obtained through the interaction of different datatypes of cancer, can be used for the prediction and characterization of patient subgroups and survival profiles.


Author(s):  
Eun Ji Jeong ◽  
Donghyuk Choi ◽  
Dong Woo Lee

Conventional cell-counting software uses contour or watershed segmentation and focuses on identifying two-dimensional (2D) cells attached to the bottom of plastic plates. Recently developed software tools have been useful for the quality control of 2D cell-based assays by measuring initial seed cell numbers. These algorithms do not, however, work quantitatively in three-dimensional (3D) cell-based assays using extracellular matrix (ECM), because cells are aggregated and overlapped in the 3D structure of ECMs such as Matrigel, collagen, and alginate. Such overlapped and aggregated cells make it difficult to segment cells and to count them accurately. It is important, however, to determine the number of cells in order to standardize experiments and ensure the reproducibility of 3D cell-based assays. In this study, we apply a 3D cell-counting method using U-net deep learning to high-density aggregated cells in ECM to identify initial seed cell numbers. The proposed method showed a 10% counting error in high-density aggregated cells, while the contour and watershed segmentations showed 30% and 40% counting errors, respectively. Thus, the proposed method can reduce the seed cell-counting error in 3D cell-based assays by providing researchers with a more accurate cell count, thereby enabling quality control in 3D cell-based assays.

