An Information Gain-based Method for Evaluating the Classification Power of Features Towards Identifying Enhancers

2020 ◽  
Vol 15 (6) ◽  
pp. 574-580
Author(s):  
Tianjiao Zhang ◽  
Rongjie Wang ◽  
Qinghua Jiang ◽  
Yadong Wang

Background: Enhancers are cis-regulatory elements on DNA sequences that enhance gene expression. Because most enhancers are located far from transcription start sites, they are difficult to identify. As with other regulatory elements, the regions around enhancers contain a variety of features that can help in enhancer recognition. Objective: The classification power of features differs significantly, and the performance of existing methods that use one or a few features to identify enhancers varies greatly. Evaluating the classification power of each feature can therefore improve enhancer prediction. Methods: We present an evaluation method based on Information Gain (IG) that captures the change in entropy of enhancer recognition given each feature. To validate the performance of our method, experiments using the Single Feature Prediction Accuracy (SFPA) were conducted on each feature. Results: The average IG values of the sequence, transcriptional, and epigenetic features are 0.068, 0.213, and 0.299, respectively. Through SFPA, the average AUC values of the sequence, transcriptional, and epigenetic features are 0.534, 0.605, and 0.647, respectively. The verification results are consistent with our evaluation results. Conclusion: This IG-based method can effectively evaluate the classification power of features for identifying enhancers. Compared with sequence features, epigenetic features are more effective for recognizing enhancers.
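
The entropy change that the abstract describes can be illustrated with a minimal sketch. It assumes the textbook definition of information gain, IG = H(Y) - H(Y | X), applied to an already discretized feature. The function names and the toy data are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of p(v) * H(labels | feature == v)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = enhancer, 0 = non-enhancer.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# A feature that separates the classes perfectly ...
informative = ["high", "high", "high", "high", "low", "low", "low", "low"]
# ... and one that is independent of the class.
uninformative = ["high", "low", "high", "low", "high", "low", "high", "low"]

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

In this toy case the perfectly separating feature recovers the full one bit of label entropy, while the independent feature yields no gain. Continuous features, such as signal intensities, would need to be binned before this function is applied.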

2021 ◽  
Vol 12 ◽  
Author(s):  
Xin Zeng ◽  
Sung-Joon Park ◽  
Kenta Nakai

Promoters and enhancers are well-known regulatory elements modulating gene expression. As confirmed by high-throughput sequencing technologies, these regulatory elements are bidirectionally transcribed. That is, promoters produce stable mRNA in the sense direction and unstable RNA in the antisense direction, while enhancers transcribe unstable RNA in both directions. Although it is thought that enhancers and promoters share a similar architecture of transcription start sites (TSSs), how the transcriptional machinery distinctly uses these genomic regions as promoters or enhancers remains unclear. To address this issue, we developed a deep learning (DL) method by utilizing a convolutional neural network (CNN) and the saliency algorithm. In comparison with other classifiers, our CNN presented higher predictive performance, suggesting the overarching importance of the high-order sequence features, captured by the CNN. Moreover, our method revealed that there are substantial sequence differences between the enhancers and promoters. Remarkably, the 20–120 bp downstream regions from the center of bidirectional TSSs seemed to contribute to the RNA stability. These regions in promoters tend to have a larger number of guanines and cytosines compared to those in enhancers, and this feature contributed to the classification of the regulatory elements. Our CNN-based method can capture the complex TSS architectures. We found that the genomic regions around TSSs for promoters and enhancers contribute to RNA stability and show GC-biased characteristics as a critical determinant for promoter TSSs.
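
As a rough illustration of the GC bias mentioned above, the following sketch computes the GC fraction of a window 20 to 120 bp downstream of a given center position. The helper names, the forward-strand-only handling, and the toy sequence are assumptions made for illustration; the sketch is unrelated to the authors' CNN or saliency analysis.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def downstream_window(sequence, center, start=20, end=120):
    """Forward-strand slice [center + start, center + end).

    The slice is silently truncated if it runs past the end of the sequence.
    """
    return sequence[center + start:center + end]


# Hypothetical toy sequence: 30 bp of AT repeats followed by 100 bp of GC repeats.
toy = "AT" * 15 + "GC" * 50
window = downstream_window(toy, center=10)
window_gc = gc_fraction(window)
```

Comparing `window_gc` between sets of promoter and enhancer sequences would give a simple summary statistic of the kind the abstract refers to. For the antisense direction, the window would have to be taken on the reverse complement.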


2020 ◽  
Author(s):  
Christopher Terranova ◽  
Kristina M. Stemler ◽  
Praveen Barrodia ◽  
Sabrina L. Jeter-Jones ◽  
Zhongqi Ge ◽  
...  

Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.


1991 ◽  
Vol 11 (3) ◽  
pp. 1488-1499 ◽  
Author(s):  
H J Roth ◽  
G C Das ◽  
J Piatigorsky

Expression of the chicken beta B1-crystallin gene was examined. Northern (RNA) blot and primer extension analyses showed that while abundant in the lens, the beta B1 mRNA is absent from the liver, brain, heart, skeletal muscle, and fibroblasts of the chicken embryo, suggesting lens specificity. Promoter fragments ranging from 434 to 126 bp of 5'-flanking sequence (plus 30 bp of exon 1) of the beta B1 gene fused to the bacterial chloramphenicol acetyltransferase gene functioned much more efficiently in transfected embryonic chicken lens epithelial cells than in transfected primary muscle fibroblasts or HeLa cells. Transient expression of recombinant plasmids in cultured lens cells, DNase I footprinting, in vitro transcription in a HeLa cell extract, and gel mobility shift assays were used to identify putative functional promoter elements of the beta B1-crystallin gene. Sequence analysis revealed a number of potential regulatory elements between positions -126 and -53 of the beta B1 promoter, including two Sp1 sites, two octamer binding sequence-like sites (OL-1 and OL-2), and two polyomavirus enhancer-like sites (PL-1 and PL-2). Deletion and site-specific mutation experiments established the functional importance of PL-1 (-116 to -102), PL-2 (-90 to -76), and OL-2 (-75 to -68). DNase I footprinting using a lens or a HeLa cell nuclear extract and gel mobility shifts using a lens nuclear extract indicated the presence of putative lens transcription factors binding to these DNA sequences. Competition experiments provided evidence that PL-1 and PL-2 recognize the same or very similar factors, while OL-2 recognizes a different factor. Our data suggest that the same or closely related transcription factors found in many tissues are used for expression of the chicken beta B1-crystallin gene in the lens.


2022 ◽  
Author(s):  
Edward J Banigan ◽  
Wen Tang ◽  
Aafke A van den Berg ◽  
Roman R Stocsits ◽  
Gordana Wutz ◽  
...  

Cohesin organizes mammalian interphase chromosomes by reeling chromatin fibers into dynamic loops (Banigan and Mirny, 2020; Davidson et al., 2019; Kim et al., 2019; Yatskevich et al., 2019). "Loop extrusion" is obstructed when cohesin encounters a properly oriented CTCF protein (Busslinger et al., 2017; de Wit et al., 2015; Fudenberg et al., 2016; Nora et al., 2017; Sanborn et al., 2015; Wutz et al., 2017), and recent work indicates that other factors, such as the replicative helicase MCM (Dequeker et al., 2020), can also act as barriers to loop extrusion. It has been proposed that transcription relocalizes (Busslinger et al., 2017; Glynn et al., 2004; Lengronne et al., 2004) or interferes with cohesin (Heinz et al., 2018; Jeppsson et al., 2020; Valton et al., 2021; S. Zhang et al., 2021), and that active transcription start sites function as cohesin loading sites (Busslinger et al., 2017; Kagey et al., 2010; Zhu et al., 2021; Zuin et al., 2014), but how these effects, and transcription in general, shape chromatin is unknown. To determine whether transcription can modulate loop extrusion, we studied cells in which the primary extrusion barriers could be removed by CTCF depletion and cohesin's residence time and abundance on chromatin could be increased by Wapl knockout. We found evidence that transcription directly interacts with loop extrusion through a novel "moving barrier" mechanism, but not by loading cohesin at active promoters. Hi-C experiments showed intricate, cohesin-dependent genomic contact patterns near actively transcribed genes, and in CTCF-Wapl double knockout (DKO) cells (Busslinger et al., 2017), genomic contacts were enriched between sites of transcription-driven cohesin localization ("cohesin islands"). Similar patterns also emerged in polymer simulations in which transcribing RNA polymerases (RNAPs) acted as "moving barriers" by impeding, slowing, or pushing loop-extruding cohesins. 
The model predicts that cohesin does not load preferentially at promoters and instead accumulates at TSSs due to the barrier function of RNAPs. We tested this prediction by new ChIP-seq experiments, which revealed that the "cohesin loader" Nipbl (Ciosk et al., 2000) co-localizes with cohesin, but, unlike in previous reports (Busslinger et al., 2017; Kagey et al., 2010; Zhu et al., 2021; Zuin et al., 2014), Nipbl did not accumulate at active promoters. We propose that RNAP acts as a new type of barrier to loop extrusion that, unlike CTCF, is not stationary in its precise genomic position, but is itself dynamically translocating and relocalizes cohesin along DNA. In this way, loop extrusion could enable translocating RNAPs to maintain contacts with distal regulatory elements, allowing transcriptional activity to shape genomic functional organization.


1990 ◽  
Vol 10 (6) ◽  
pp. 2475-2484
Author(s):  
A M Curatola ◽  
C Basilico

Expression of the K-fgf/hst proto-oncogene appears to be restricted to cells in the early stages of development, such as embryonal carcinoma (EC) cells. When EC cells are induced to differentiate, K-fgf expression is drastically repressed. To identify cis-acting DNA elements responsible for this type of regulation, we constructed a plasmid in which cat gene expression was driven by about 1 kilobase of upstream K-fgf human DNA sequences, including the putative promoter, and transfected it into undifferentiated F9 EC cells or HeLa cells as prototypes of cells which express or do not express, respectively, the K-fgf proto-oncogene. This plasmid was essentially inactive in both cell types, and the addition of more than 8 kilobases of DNA sequences upstream of the K-fgf promoter did not lead to any increase in chloramphenicol acetyltransferase (CAT) expression. On the other hand, when we inserted in this plasmid DNA sequences which are 3' of the human K-fgf coding sequences, we could detect a significant stimulation of CAT activity. Analysis of these sequences led to the identification of enhancerlike DNA elements which are part of the 3' noncoding region of K-fgf exon 3 and promote CAT expression only in undifferentiated mouse F9 or human NT2/D1 EC cells, but not in HeLa, 3T3, or differentiated F9 cells, therefore mimicking the physiological expression of the K-fgf proto-oncogene. Similar elements are also present in the 3' region of the murine K-fgf proto-oncogene, in a region showing high homology to the human K-fgf sequences. These regulatory elements can promote CAT expression from heterologous promoters in an EC-specific manner, suggesting that they interact with a specific cellular transacting protein(s) whose expression is developmentally regulated.


2019 ◽  
Vol 70 (15) ◽  
pp. 3867-3879 ◽  
Author(s):  
Anneke Frerichs ◽  
Julia Engelhorn ◽  
Janine Altmüller ◽  
Jose Gutierrez-Marcos ◽  
Wolfgang Werr

Fluorescence-activated cell sorting (FACS) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) were combined to analyse the chromatin state of lateral organ founder cells (LOFCs) in the peripheral zone of the Arabidopsis apetala1-1 cauliflower-1 double mutant inflorescence meristem. On a genome-wide level, we observed a striking correlation between transposase hypersensitive sites (THSs) detected by ATAC-seq and DNase I hypersensitive sites (DHSs). The mostly expanded DHSs were often substructured into several individual THSs, which correlated with phylogenetically conserved DNA sequences or enhancer elements. Comparing chromatin accessibility with available RNA-seq data, changes in THS configuration were reflected by gene activation or repression, and chromatin regions acquired or lost transposase accessibility in direct correlation with gene expression levels in LOFCs. This was most pronounced immediately upstream of the transcription start, where genome-wide THSs were abundant in a complementary pattern to established H3K4me3 activation or H3K27me3 repression marks. At this resolution, the combined application of FACS/ATAC-seq is widely applicable to detect chromatin changes during cell-type specification and facilitates the detection of regulatory elements in plant promoters.


Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software testing using software defect prediction aims to detect as many defects as possible before the software is released. This plays an important role in ensuring quality and reliability. Software defect prediction can be modeled as a classification problem that assigns software modules to one of two classes, defective or non-defective, using classification algorithms. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques (Farthest First Clusterer, K-Means, and Make-Density Clusterer) and three feature selection methods (Chi-Square, Clustering Variation, and Information Gain) were applied to software defect datasets from the NASA repository. The best software defect prediction model was Farthest First with Information Gain feature selection, with an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that clustering techniques used as classifiers gave good predictive performance and that feature selection methods further enhanced their performance. This indicates that classification via clustering can give results competitive with standard classification methods, with the advantage that no model has to be trained on a labeled dataset, so it can be used on unlabeled datasets.
Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction
Vol. 26, No 1, June, 2019
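
One common way to score clusters as class predictions is to map each cluster to the majority class of its members and then compute precision and recall. The sketch below assumes that approach; the abstract does not state how clusters were mapped to classes or how the reported metrics were averaged. The function names, the class names, and the toy cluster assignments are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_classes(cluster_ids, labels):
    """Assign each cluster the most common class label among its members."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        members[cid].append(label)
    return {cid: Counter(lbls).most_common(1)[0][0] for cid, lbls in members.items()}


def precision_recall(predicted, actual, positive="defective"):
    """Precision and recall for the given positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical output of some clustering algorithm for six software modules.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["defective", "defective", "clean", "clean", "clean", "defective"]

mapping = map_clusters_to_classes(cluster_ids, labels)
predicted = [mapping[cid] for cid in cluster_ids]
precision, recall = precision_recall(predicted, labels)
```

The mapping and scoring steps in this sketch do use labels; only the clustering step that would produce `cluster_ids` is unsupervised.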


1987 ◽  
Vol 7 (5) ◽  
pp. 1807-1814 ◽  
Author(s):  
A B Chepelinsky ◽  
B Sommer ◽  
J Piatigorsky

Previous experiments have indicated that 5' flanking DNA sequences (nucleotides -366 to +46) are capable of regulating the lens-specific transcription of the murine alpha A-crystallin gene. Here we have analyzed these 5' regulatory sequences by transfecting explanted embryonic chicken lens epithelia with different alpha A-crystallin-CAT (chloramphenicol acetyltransferase) hybrid genes (alpha A-crystallin promoter sequences fused to the bacterial CAT gene in the pSVO-CAT expression vector). The results indicated the presence of a proximal (-88 to +46) and a distal (-111 to -88) domain which must interact for promoter function. Deletion experiments showed that the sequence between -88 and -60 was essential for function of the proximal domain in the explanted epithelia. A synthetic oligonucleotide containing the sequence between -111 and -84 activated the proximal domain when placed in either orientation 57 base pairs upstream from position -88 of the alpha A-crystallin-CAT hybrid gene.


1987 ◽  
Vol 7 (12) ◽  
pp. 4377-4389 ◽  
Author(s):  
P F Bouvagnet ◽  
E E Strehler ◽  
G E White ◽  
M A Strehler-Page ◽  
B Nadal-Ginard ◽  
...  

To identify the DNA sequences that regulate the expression of the sarcomeric myosin heavy-chain (MHC) genes in muscle cells, a series of deletion constructs of the rat embryonic MHC gene was assayed for transient expression after introduction into myogenic and nonmyogenic cells. The sequences in 1.4 kilobases of 5'-flanking DNA were found to be sufficient to direct expression of the MHC gene constructs in a tissue-specific manner (i.e., in differentiated muscle cells but not in undifferentiated muscle and nonmuscle cells). Three main distinct regulatory domains have been identified: (i) the upstream sequences from positions -1413 to -174, which determine the level of expression of the MHC gene and are constituted of three positive regulatory elements and two negative ones; (ii) a muscle-specific regulatory element from positions -173 to -142, which restricts the expression of the MHC gene to muscle cells; and (iii) the promoter region, downstream from position -102, which directs transcription initiation. Introduction of the simian virus 40 enhancer into constructs where subportions of or all of the upstream sequences are deleted (up to position -173) strongly increases the level of expression of such truncated constructs but without changing their muscle specificity. These upstream sequences, which can be substituted for by the simian virus 40 enhancer, function in an orientation-, position-, and promoter-dependent fashion. The muscle-specific element is also promoter specific but does not support efficient expression of the MHC gene. The MHC promoter in itself is not muscle specific. These results underline the importance of the concerted action of multiple regulatory elements that are likely to represent targets for DNA-binding-regulatory proteins.

