The identification of DNA binding regions of the σ54 factor using artificial neural network

Abstract: Transcription of many bacterial genes is regulated by alternative RNA polymerase sigma factors such as sigma 54 (σ54). A single essential σ factor promotes transcription of thousands of genes, and many alternative σ factors promote transcription of multiple specialized genes required for coping with stress or for development. Bacterial genomes have two families of sigma factors, sigma 70 (σ70) and sigma 54 (σ54). σ54 uses a more complex mechanism involving specialized enhancer-binding proteins and DNA melting, and is well known for its role in the regulation of nitrogen metabolism in proteobacteria. Identifying these regulatory elements is the main step toward understanding metabolic networks. In this study, we propose a supervised pattern recognition model based on an artificial neural network (ANN) to identify Transcription Factor Binding Sites (TFBSs) for σ54. This approach detects σ54 TFBSs with sensitivity higher than 98% in recently published data. False positives are reduced by the addition of the ANN and feature extraction, which increase the specificity of the program. We also propose a free, fast and friendly tool for σ54 recognition and a database of σ54-related genes, available for consultation. S54Finder can analyze anything from short DNA sequences to complete genomes and is available online. The software was used to determine σ54 TFBSs in the complete bacterial genomes database from NCBI, and the results are available for comparison. S54Finder identifies σ54-regulated genes for a large set of genomes, allowing evolutionary and conservation studies of the regulation system across organisms.
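As a rough illustration of what scanning a sequence for candidate σ54 sites involves, the sketch below uses a plain regular expression. It is not S54Finder's neural network model. The pattern is an assumption: a simplified form of the commonly cited σ54 promoter consensus, with a conserved GG in the -24 element and a conserved GC in the -12 element. The function name and example sequence are hypothetical.

```python
import re

# Hypothetical, simplified consensus (an assumption, not taken from the abstract):
# TGGCAC (-24 element, conserved GG), a 5-nt spacer, then TTGC (-12 element, conserved GC).
# The lookahead makes overlapping candidates visible.
SIGMA54_PATTERN = re.compile(r"(?=(TGGCAC[ACGT]{5}TTGC))")


def scan_sigma54_candidates(sequence):
    """Return (start, site) pairs for every candidate site on the given strand."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in SIGMA54_PATTERN.finditer(sequence)]


# Made-up example sequence containing one candidate site.
example = "AAATGGCACGATTTTTGCATCC"
candidates = scan_sigma54_candidates(example)
```

An exact-match scan like this would likely miss degenerate sites and report spurious ones, which is the kind of trade-off a trained classifier is meant to improve.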

DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest

10.1101/224527 ◽

2017 ◽

Cited By ~ 1

Author(s):

Balachandran Manavalan ◽

Tae Hwan Shin ◽

Gwang Lee

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Dna Sequences ◽

Feature Selection Method ◽

Regulatory Elements ◽

Dnase I ◽

Support Vector ◽

Large Set ◽

Dnase I Hypersensitive Sites ◽

Hypersensitive Sites

Abstract: DNase I hypersensitive sites (DHSs) are genomic regions that provide important information regarding the presence of transcriptional regulatory elements and the state of chromatin. Therefore, identifying DHSs in uncharacterized DNA sequences is crucial for understanding their biological functions and mechanisms. Although many experimental methods have been proposed to identify DHSs, they have proven to be expensive for genome-wide application. Therefore, it is necessary to develop computational methods for DHS prediction. In this study, we proposed a support vector machine (SVM)-based method for predicting DHSs, called DHSpred (DNase I Hypersensitive Site predictor in human DNA sequences), which was trained with 174 optimal features. The optimal combination of features was identified from a large set that included nucleotide composition and di- and trinucleotide physicochemical properties, using a random forest algorithm. DHSpred achieved a Matthews correlation coefficient and accuracy of 0.660 and 0.871, respectively, which were 3% higher than those of control SVM predictors trained with non-optimized features, indicating the efficiency of the feature selection method. Furthermore, the performance of DHSpred was superior to that of state-of-the-art predictors. An online prediction server has been developed to assist the scientific community, and is freely available at: http://www.thegleelab.org/DHSpred.html.
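The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below uses the standard definitions. The counts are made up for illustration and are not DHSpred's results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 90, 80, 20, 10
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated than accuracy when the positive and negative classes are unbalanced.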

DeepCAPE: a deep convolutional neural network for the accurate prediction of enhancers

10.1101/398115 ◽

2018 ◽

Cited By ~ 3

Author(s):

Shengquan Chen ◽

Mingxin Gan ◽

Hairong Lv ◽

Rui Jiang

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Cell Line ◽

Cell Lines ◽

Dna Sequences ◽

Disease Status ◽

Regulatory Elements ◽

Chromatin Accessibility ◽

Deep Convolutional Neural Network ◽

Typical Cell

Abstract: The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, though having successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines under a variety of disease statuses. Existing computational methods, though capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by the above understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We propose DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. We demonstrate that our model not only consistently outperforms existing methods in the classification of enhancers against background sequences, but also accurately predicts enhancers across different cell lines. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers.
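One way to feed both a DNA sequence and per-base accessibility signal to a convolutional network is to one-hot encode the bases and append the signal as an extra channel. The sketch below shows that encoding only. The abstract does not say how DeepCAPE combines the two inputs, so the channel layout and the function name are assumptions.

```python
BASES = "ACGT"


def encode_with_accessibility(sequence, accessibility):
    """Return one row per position: four one-hot base channels plus one signal channel.

    Bases other than A, C, G, T (for example N) get an all-zero one-hot part.
    """
    if len(sequence) != len(accessibility):
        raise ValueError("sequence and accessibility must have the same length")
    rows = []
    for base, signal in zip(sequence.upper(), accessibility):
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        rows.append(one_hot + [float(signal)])
    return rows


# Made-up three-base example with hypothetical DNase-seq signal values.
encoded = encode_with_accessibility("AcN", [0.5, 2, 0])
```

The result is a length-by-5 matrix, which can be passed to a 1D convolution as a five-channel input.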

Deep exploration networks for rapid engineering of functional DNA sequences

10.1101/864363 ◽

2019 ◽

Cited By ~ 2

Author(s):

Johannes Linder ◽

Nicholas Bogard ◽

Alexander B. Rosenberg ◽

Georg Seelig

Keyword(s):

Neural Network ◽

Dna Sequences ◽

Network Models ◽

Regulatory Elements ◽

Differential Splicing ◽

Neural Network Models ◽

Gradient Ascent ◽

Functional Dna ◽

The Cost ◽

Deep Exploration

Engineering gene sequences with defined functional properties is a major goal of synthetic biology. Deep neural network models, together with gradient ascent-style optimization, show promise for sequence generation. The generated sequences can however get stuck in local minima, have low diversity and their fitness depends heavily on initialization. Here, we develop deep exploration networks (DENs), a type of generative model tailor-made for searching a sequence space to minimize the cost of a neural network fitness predictor. By making the network compete with itself to control sequence diversity during training, we obtain generators capable of sampling hundreds of thousands of high-fitness sequences. We demonstrate the power of DENs in the context of engineering RNA isoforms, including polyadenylation and cell type-specific differential splicing. Using DENs, we engineered polyadenylation signals with more than 10-fold higher selection odds than the best gradient ascent-generated patterns and identified splice regulatory elements predicted to result in highly differential splicing between cell lines.

Adsorption Isotherm Predictions for Multiple Molecules in MOFs Using the Same Deep Learning Model

10.26434/chemrxiv.9894224.v1 ◽

2019 ◽

Author(s):

Ryther Anderson ◽

Achay Biong ◽

Diego Gómez-Gualdrón

Keyword(s):

Neural Network ◽

Machine Learning ◽

Molecular Simulation ◽

Large Scale ◽

Learning Model ◽

Operating Conditions ◽

Small Subset ◽

Screening Methods ◽

Large Set ◽

Metal Organic

Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time-consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented only by their geometry and force field parameters, and used this neural network to predict the loadings of real adsorbates. We focused on predicting room temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations, namely argon, krypton, xenon, methane, ethane, and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties are outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF descriptors (e.g. geometric properties and chemical moieties) and adsorbate descriptors (e.g. force field parameters and geometry). Our results illustrate a new philosophy of training that opens the path towards development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract: Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up- and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.

Biological Impact of a Large-Scale Genomic Inversion That Grossly Disrupts the Relative Positions of the Origin and Terminus Loci of the Streptococcus pyogenes Chromosome

Journal of Bacteriology ◽

10.1128/jb.00090-19 ◽

2019 ◽

Vol 201 (17) ◽

Cited By ~ 1

Author(s):

Dragutin J. Savic ◽

Scott V. Nguyen ◽

Kimberly McCullor ◽

W. Michael McShan

Keyword(s):

Dna Sequences ◽

Parental Strain ◽

Large Scale ◽

Galleria Mellonella ◽

Acute Infection ◽

Relative Length ◽

Published Data ◽

Rich Medium ◽

Content Type

ABSTRACT: A large-scale genomic inversion encompassing 0.79 Mb of the 1.816-Mb-long Streptococcus pyogenes serotype M49 strain NZ131 chromosome spontaneously occurs in a minor subpopulation of cells, and in this report genetic selection was used to obtain a stable lineage with this chromosomal rearrangement. This inversion, which drastically displaces the ori site relative to the terminus, changes the relative length of the replication arms so that one replichore is approximately 0.41 Mb while the other is about 1.40 Mb in length. Genomic reversion to the original chromosome constellation is not observed in PCR-monitored analyses after 180 generations of growth in rich medium. Compared to the parental strain, the inversion surprisingly demonstrates a nearly identical growth pattern in the first phase of the exponential phase, but differences do occur when resources in the medium become limited. When cultured separately in rich medium during prolonged stationary phase or in an experimental acute infection animal model (Galleria mellonella), the parental strain and the invertant have equivalent survival rates. However, when they are coincubated together, both in vitro and in vivo, the survival of the invertant declines relative to the level for the parental strain. The accompanying aspect of the study suggests that inversions taking place near oriC always happen to secure the linkage of oriC to DNA sequences responsible for chromosome partition. The biological relevance of large-scale inversions is also discussed.

IMPORTANCE: Based on our previous work, we created to our knowledge the largest asymmetric inversion, covering 43.5% of the S. pyogenes genome. In spite of a drastic replacement of origin of replication and the unbalanced size of replichores (1.4 Mb versus 0.41 Mb), the invertant, when not challenged with its progenitor, showed impressive vitality for growth in vitro and in pathogenesis assays. The mutant supports the existing idea that slightly deleterious mutations can provide the setting for secondary adaptive changes. Furthermore, comparative analysis of the mutant with previously published data strongly indicates that even large genomic rearrangements survive provided that the integrity of the oriC and the chromosome partition cluster is preserved.
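The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative.

```python
# Values quoted in the abstract, in megabases (Mb).
chromosome_mb = 1.816
inversion_mb = 0.79
short_arm_mb = 0.41
long_arm_mb = 1.40

# Share of the chromosome covered by the inversion (quoted as 43.5%).
inverted_fraction = inversion_mb / chromosome_mb

# The two replichores should add up to roughly the full chromosome length.
arm_total_mb = short_arm_mb + long_arm_mb

# Degree of imbalance between the two replication arms.
arm_ratio = long_arm_mb / short_arm_mb
```

The inverted fraction rounds to 43.5%, the two arms sum to 1.81 Mb (within rounding of 1.816 Mb), and the long arm is roughly 3.4 times the length of the short one.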

Tandem Repeats in Bacillus: Unique Features and Taxonomic Distribution

International Journal of Molecular Sciences ◽

10.3390/ijms22105373 ◽

2021 ◽

Vol 22 (10) ◽

pp. 5373

Author(s):

Juan A. Subirana ◽

Xavier Messeguer

Keyword(s):

Tandem Repeats ◽

Bacterial Species ◽

Individual Species ◽

Large Set ◽

Bacterial Genomes ◽

Rna Molecules ◽

Variable Sequence ◽

Repeat Size ◽

Genomic Studies ◽

Dna Tandem Repeats

Little is known about DNA tandem repeats across prokaryotes. We have recently described an enigmatic group of tandem repeats in bacterial genomes with a constant repeat size but variable sequence. These findings strongly suggest that tandem repeat size in some bacteria is under strong selective constraints. Here, we extend these studies and describe tandem repeats in a large set of Bacillus. Some species have very few repeats, while other species have a large number. Most tandem repeats have repeats with a constant size (either 52 or 20–21 nt), but a variable sequence. We characterize in detail these intriguing tandem repeats. Individual species have several families of tandem repeats with the same repeat length and different sequence. This result is in strong contrast with eukaryotes, where tandem repeats of many sizes are found in any species. We discuss the possibility that they are transcribed as small RNA molecules. They may also be involved in the stabilization of the nucleoid through interaction with proteins. We also show that the distribution of tandem repeats in different species has a taxonomic significance. The data we present for all tandem repeats and their families in these bacterial species will be useful for further genomic studies.

Chicken beta B1-crystallin gene expression: presence of conserved functional polyomavirus enhancer-like and octamer binding-like promoter elements found in non-lens genes.

Molecular and Cellular Biology ◽

10.1128/mcb.11.3.1488 ◽

1991 ◽

Vol 11 (3) ◽

pp. 1488-1499 ◽

Cited By ~ 34

Author(s):

H J Roth ◽

G C Das ◽

J Piatigorsky

Keyword(s):

Transcription Factors ◽

Hela Cell ◽

Dna Sequences ◽

Nuclear Extract ◽

Regulatory Elements ◽

Dnase I ◽

Flanking Sequence ◽

Mobility Shift ◽

Promoter Elements ◽

Dnase I Footprinting

Expression of the chicken beta B1-crystallin gene was examined. Northern (RNA) blot and primer extension analyses showed that while abundant in the lens, the beta B1 mRNA is absent from the liver, brain, heart, skeletal muscle, and fibroblasts of the chicken embryo, suggesting lens specificity. Promoter fragments ranging from 434 to 126 bp of 5'-flanking sequence (plus 30 bp of exon 1) of the beta B1 gene fused to the bacterial chloramphenicol acetyltransferase gene functioned much more efficiently in transfected embryonic chicken lens epithelial cells than in transfected primary muscle fibroblasts or HeLa cells. Transient expression of recombinant plasmids in cultured lens cells, DNase I footprinting, in vitro transcription in a HeLa cell extract, and gel mobility shift assays were used to identify putative functional promoter elements of the beta B1-crystallin gene. Sequence analysis revealed a number of potential regulatory elements between positions -126 and -53 of the beta B1 promoter, including two Sp1 sites, two octamer binding sequence-like sites (OL-1 and OL-2), and two polyomavirus enhancer-like sites (PL-1 and PL-2). Deletion and site-specific mutation experiments established the functional importance of PL-1 (-116 to -102), PL-2 (-90 to -76), and OL-2 (-75 to -68). DNase I footprinting using a lens or a HeLa cell nuclear extract and gel mobility shifts using a lens nuclear extract indicated the presence of putative lens transcription factors binding to these DNA sequences. Competition experiments provided evidence that PL-1 and PL-2 recognize the same or very similar factors, while OL-2 recognizes a different factor. Our data suggest that the same or closely related transcription factors found in many tissues are used for expression of the chicken beta B1-crystallin gene in the lens.

Nucleosome Positioning with Set of Key Positions and Nucleosome Affinity

The Open Biomedical Engineering Journal ◽

10.2174/1874120701408010166 ◽

2014 ◽

Vol 8 (1) ◽

pp. 166-170 ◽

Cited By ~ 1

Author(s):

Jia Wang ◽

Shuai Liu ◽

Weina Fu

Keyword(s):

Neural Network ◽

Experimental Data ◽

Dna Sequence ◽

Dna Sequences ◽

Nucleosome Positioning ◽

Experimental Results ◽

Negative Effects ◽

Sequence Structure ◽

Precise Positioning ◽

Histone Octamer

The formation and precise positioning of nucleosomes in chromatin play a very important role in the study of life processes. Many researchers have found that the location at which a DNA sequence fragment wraps around a histone octamer in the genome is not random but regular, and that this positioning is closely related to the specific sequence of the core DNA. In this paper, we analyze the relation between the affinity and the sequence structure of core DNA, and extract a set of key positions at which the nucleotides probably play the main role in binding. First, we simplified and formatted the experimental affinity data. Then, to find the key positions in the wrapping, we used a neural network to analyze the positive and negative effects of each position in core DNA sequences on nucleosome formation, obtaining a weight for every position that describes this effect. Finally, based on the positions with high weights, we analyzed why the chosen positions are key positions, and used them to construct a model for nucleosome positioning prediction. Experimental results show the effectiveness of our method.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

10.1101/2020.05.13.093997 ◽

2020 ◽

Author(s):

Yupeng Wang ◽

Rosario B. Jaime-Lara ◽

Abhrarup Roy ◽

Ying Sun ◽

Xinyue Liu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Dna Sequences ◽

Cell Types ◽

Learning Models ◽

Cell Type ◽

Coding Sequences ◽

Sequence Features ◽

Cell Type Specific ◽

Different Cell Types

Abstract: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers including gkm-SVM and DanQ, with regard to distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified according to their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
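The k-mer fold-change features described above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the SeqEnhDL implementation: the pseudocount smoothing, the neutral value of 1.0 for k-mers seen in neither set, and all function names are hypothetical choices.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts, sum(counts.values())


def fold_change_table(foreground, background, k, pseudocount=1.0):
    """Map each observed k-mer to its smoothed foreground/background frequency ratio."""
    fg_counts, fg_total = kmer_counts(foreground, k)
    bg_counts, bg_total = kmer_counts(background, k)
    n_kmers = 4 ** k  # number of possible DNA k-mers, used for smoothing
    table = {}
    for kmer in set(fg_counts) | set(bg_counts):
        fg = (fg_counts[kmer] + pseudocount) / (fg_total + pseudocount * n_kmers)
        bg = (bg_counts[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        table[kmer] = fg / bg
    return table


def sequential_features(sequence, table, k):
    """Fold changes of the k-mers in the order they occur along the sequence.

    k-mers absent from the table are treated as neutral (1.0), an assumption.
    """
    seq = sequence.upper()
    return [table.get(seq[i:i + k], 1.0) for i in range(len(seq) - k + 1)]


# Toy data, for illustration only.
table = fold_change_table(["ACAC"], ["GGGG"], k=2)
features = sequential_features("ACGG", table, k=2)
```

Keeping the features in sequence order, rather than as an unordered k-mer count vector, preserves positional information that order-sensitive models such as a CNN or RNN could use.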
