A novel method SEProm for prokaryotic promoter prediction based on DNA structure and energetics

Abstract Motivation Despite conservation in general architecture of promoters and protein–DNA interaction interface of RNA polymerases among various prokaryotes, identification of promoter regions in the whole genome sequences remains a daunting challenge. The available tools for promoter prediction do not seem to address the problem satisfactorily, apparently because the biochemical nature of promoter signals is yet to be understood fully. Using 28 structural and 3 energetic parameters, we found that prokaryotic promoter regions have a unique structural and energy state, quite distinct from that of coding regions and the information for this signature state is in-built in their sequences. We developed a novel promoter prediction tool from these 31 parameters using various statistical techniques. Results Here, we introduce SEProm, a novel tool that is developed by studying and utilizing the in-built structural and energy information of DNA sequences, which is applicable to all prokaryotes including archaea. Compared with the five most recent, diverse and best available tools, SEProm performs much better, predicting promoters with an ‘F-value’ of 82.04 and ‘Precision’ of 81.08. The next best ‘F-value’ was obtained with PromPredict (72.14) followed by BProm (68.37). In terms of ‘Precision’, the next best value was observed for Pepper (75.39) followed by PromPredict (72.01). SEProm maintained the lead even when comparison was done on two test organisms (not involved in training for SEProm). Availability and implementation The software is freely available with easy-to-follow instructions (www.scfbio-iitd.res.in/software/TSS_Predict.jsp). Supplementary information Supplementary data are available at Bioinformatics online.
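
The abstract reports an F-value and a precision for SEProm but no recall. As a rough illustration only, and assuming the F-value is the standard F1 score (the harmonic mean of precision and recall, given in percent), which the abstract does not state, the implied recall can be back-calculated. The function names below are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no positive recall is consistent with these values")
    return f1 * precision / denom


# Values reported for SEProm in the abstract (percent).
# Treating the 'F-value' as F1 is an assumption, not something the abstract confirms.
recall = implied_recall(82.04, 81.08)
print(f"implied recall: {recall:.2f}")
```

Under that assumption the recall comes out slightly above the reported precision, which is what an F-value lying above the precision implies.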

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. Availability and implementation The source code, pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.

FastSK: fast sequence analysis with gapped string kernels

Bioinformatics ◽

10.1093/bioinformatics/btaa817 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i857-i865

Author(s):

Derrick Blakely ◽

Eamon Collins ◽

Ritambhara Singh ◽

Andrew Norton ◽

Jack Lanchantin ◽

...

Keyword(s):

Sequence Analysis ◽

Dna Sequences ◽

English Language ◽

Computation Time ◽

Entity Recognition ◽

Supplementary Information ◽

Support Vector ◽

Homology Detection ◽

Scalable Algorithm ◽

String Kernels

Abstract Motivation Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary information Supplementary data are available at Bioinformatics online.
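
The decomposition described in the abstract (independent counting operations over the possible mismatch positions, followed by a Monte Carlo approximation) can be illustrated with a small sketch. This is not FastSK's implementation or its exact kernel weighting. It is a minimal toy that assumes a window length `g` with `m` ignored positions, counts matching pairs for each choice of ignored positions, and then estimates the same sum from randomly sampled position sets. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def gmers(seq, g):
    """All overlapping windows of length g."""
    return [seq[i:i + g] for i in range(len(seq) - g + 1)]


def match_count(x_gmers, y_gmers, dropped):
    """Count pairs of g-mers that agree at every position not in `dropped`."""
    def mask(window):
        return "".join(c for i, c in enumerate(window) if i not in dropped)

    cx = Counter(mask(w) for w in x_gmers)
    cy = Counter(mask(w) for w in y_gmers)
    return sum(n * cy[key] for key, n in cx.items())


def gapped_kernel_exact(x, y, g, m):
    """Sum one independent counting pass per set of m ignored positions."""
    xs, ys = gmers(x, g), gmers(y, g)
    return sum(match_count(xs, ys, set(d)) for d in combinations(range(g), m))


def gapped_kernel_mc(x, y, g, m, n_samples, seed=0):
    """Monte Carlo estimate: sample position sets uniformly, then rescale."""
    rng = random.Random(seed)
    xs, ys = gmers(x, g), gmers(y, g)
    total = 0
    for _ in range(n_samples):
        dropped = set(rng.sample(range(g), m))
        total += match_count(xs, ys, dropped)
    return comb(g, m) * total / n_samples


print(gapped_kernel_exact("ACGTACGT", "ACGTTCGT", g=5, m=2))
print(gapped_kernel_mc("ACGTACGT", "ACGTTCGT", g=5, m=2, n_samples=200))
```

The exact version makes one pass per position set, so its cost grows with the number of combinations. The sampled version fixes the number of passes in advance, which is the kind of trade-off the abstract attributes to the Monte Carlo approximation.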

Structural defects of a Pax8 mutant that give rise to congenital hypothyroidism

Biochemical Journal ◽

10.1042/bj3410089 ◽

1999 ◽

Vol 341 (1) ◽

pp. 89-93 ◽

Cited By ~ 5

Author(s):

Gianluca TELL ◽

Lucia PELLIZZARI ◽

Gennaro ESPOSITO ◽

Carlo PUCILLO ◽

Paolo Emidio MACCHIA ◽

...

Keyword(s):

Dna Sequence ◽

Congenital Hypothyroidism ◽

Dna Sequences ◽

Dna Interaction ◽

Structural Defects ◽

Molecular Defect ◽

Wild Type ◽

Coding Region ◽

Induced Fit ◽

Helical Content

Pax proteins are transcriptional regulators that play important roles during embryogenesis. These proteins recognize specific DNA sequences via a conserved element: the paired domain (Prd domain). The low level of organized secondary structure, in the free state, is a general feature of Prd domains; however, these proteins undergo a dramatic gain in α-helical content upon interaction with DNA (‘induced fit’). Pax8 is expressed in the developing thyroid, kidney and several areas of the central nervous system. In humans, mutations of the Pax8 gene, which are mapped to the coding region of the Prd domain, give rise to congenital hypothyroidism. Here, we have investigated the molecular defects caused by a mutation in which leucine at position 62 is replaced by an arginine. Leu62 is conserved among Prd domains, and contributes towards the packing together of helices 1 and 3. The binding affinity of the Leu62Arg mutant for a specific DNA sequence (the C sequence of thyroglobulin promoter) is decreased 60-fold with respect to the wild-type Pax8 Prd domain. However, the affinities with which the wild-type and the mutant proteins bind to a non-specific DNA sequence are very similar. CD spectra demonstrate that, in the absence of DNA, both wild-type Pax8 and the Leu62Arg mutant possess a low α-helical content; however, in the Leu62Arg mutant, the gain in α-helical content upon interaction with DNA is greatly reduced with respect to the wild-type protein. Thus the molecular defect of the Leu62Arg mutant causes a reduced capability for induced fit upon DNA interaction.

Real-time kinetic studies of Mycobacterium tuberculosis LexA-DNA interaction

Bioscience Reports ◽

10.1042/bcj20210434 ◽

2021 ◽

Author(s):

Chitral Chatterjee ◽

Soneya Majumdar ◽

Sachin Deshpande ◽

Deepak Pant ◽

Saravanan Matheshwaran

Keyword(s):

Amino Acids ◽

Mycobacterium Tuberculosis ◽

Dna Binding ◽

Kinetic Parameters ◽

Dna Sequences ◽

Kinetic Studies ◽

Dna Interaction ◽

Damage Repair ◽

Inducible Genes ◽

Kinetics Of

The transcriptional repressor LexA regulates the “SOS” response, an indispensable bacterial DNA damage repair machinery. Compared to its E. coli ortholog, LexA from Mycobacterium tuberculosis (Mtb) possesses a unique N-terminal extension of 24 additional amino acids in its DNA binding domain (DBD) and an 18-amino-acid insertion at its hinge region that connects the DBD to the C-terminal dimerization/autoproteolysis domain. Despite the importance of LexA in “SOS” regulation, Mtb LexA remains poorly characterized and the functional importance of its additional amino acids remains elusive. In addition, the lack of data on kinetic parameters of Mtb LexA-DNA interaction prompted us to perform kinetic analyses of Mtb LexA and its deletion variants using Bio-layer Interferometry (BLI). Mtb LexA is seen to bind to different “SOS” boxes, DNA sequences present in the operator regions of damage-inducible genes, with comparable nanomolar affinity. Deletion of the 18 amino acids from the linker region is found to affect DNA binding, unlike deletion of the N-terminal stretch of 24 extra amino acids. The conserved RKG motif has been found to be critical for DNA binding. Overall, this study provides insights into the kinetics of the interaction between Mtb LexA and its target “SOS” boxes. The kinetic parameters obtained for DNA binding of Mtb LexA would be instrumental to clearly understand the mechanism of “SOS” regulation and activation in Mtb.
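
The kinetic parameters mentioned above relate to affinity in a simple way under a 1:1 binding model, where the equilibrium dissociation constant is the ratio of the dissociation and association rate constants. The abstract does not say which binding model was fitted, so the sketch below assumes a 1:1 model. The rate constants are hypothetical placeholders chosen only to land in the nanomolar range, not values from the study.

```python
import math


def dissociation_constant(k_on: float, k_off: float) -> float:
    """K_D = k_off / k_on for a 1:1 binding model (units: M)."""
    return k_off / k_on


def association_response(t, conc, k_on, k_off, r_max):
    """Signal during the association phase of a 1:1 interaction.

    R(t) = R_eq * (1 - exp(-k_obs * t)), with k_obs = k_on * conc + k_off
    and R_eq = r_max * conc / (conc + K_D).
    """
    k_obs = k_on * conc + k_off
    r_eq = r_max * conc / (conc + k_off / k_on)
    return r_eq * (1.0 - math.exp(-k_obs * t))


# Hypothetical rate constants, for illustration only.
k_on = 1.0e5    # 1/(M*s)
k_off = 1.0e-3  # 1/s

kd = dissociation_constant(k_on, k_off)
print(f"K_D = {kd * 1e9:.1f} nM")

# At an analyte concentration equal to K_D, the plateau is half of r_max.
print(association_response(t=1.0e6, conc=kd, k_on=k_on, k_off=k_off, r_max=1.0))
```

In this model, two binding sites with comparable affinity can still differ in how fast complexes form and fall apart, which is why rate constants carry information that an equilibrium constant alone does not.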

DNA mismatch repair deficient tumors exhibit length variability of repetitive DNA sequences in diverse promoter regions

European Journal of Cancer ◽

10.1016/s0959-8049(97)84419-x ◽

1997 ◽

Vol 33 ◽

pp. S11

Author(s):

C. Sutter ◽

J. Gebert ◽

P. Bischoff ◽

D. Kube ◽

C. Herfarth ◽

...

Keyword(s):

Mismatch Repair ◽

Repetitive Dna ◽

Dna Sequences ◽

Dna Mismatch Repair ◽

Repetitive Dna Sequences ◽

Promoter Regions ◽

Dna Mismatch

Detailed studies of the binding mechanism of the Sinorhizobium meliloti transcriptional activator ExpG to DNA

Microbiology ◽

10.1099/mic.0.27442-0 ◽

2005 ◽

Vol 151 (1) ◽

pp. 259-268 ◽

Cited By ~ 33

Author(s):

Birgit Baumgarth ◽

Frank Wilco Bartels ◽

Dario Anselmetti ◽

Anke Becker ◽

Robert Ros

Keyword(s):

Binding Site ◽

Single Molecule ◽

Sinorhizobium Meliloti ◽

Specific Binding ◽

Dna Interaction ◽

Dynamic Force ◽

Sequence Motifs ◽

Promoter Regions ◽

Dissociation Kinetics ◽

Direct Binding

The exopolysaccharide galactoglucan promotes the establishment of symbiosis between the nitrogen-fixing Gram-negative soil bacterium Sinorhizobium meliloti 2011 and its host plant alfalfa. The transcriptional regulator ExpG activates expression of galactoglucan biosynthesis genes by direct binding to the expA1, expG/expD1 and expE1 promoter regions. ExpG is a member of the MarR family of regulatory proteins. Analysis of target sequences of an ExpG(His)6 fusion protein in the exp promoter regions resulted in the identification of a binding site composed of a conserved palindromic region and two associated sequence motifs. Association and dissociation kinetics of the specific binding of ExpG(His)6 to this binding site were characterized by standard biochemical methods and by single-molecule spectroscopy based on the atomic force microscope (AFM). Dynamic force spectroscopy indicated a distinct difference in the kinetics between the wild-type binding sequence and two mutated binding sites, leading to a closer understanding of the ExpG–DNA interaction.

Identification of a cell-specific DNA-binding activity that interacts with a transcriptional activator of genes expressed in the acinar pancreas

Molecular and Cellular Biology ◽

10.1128/mcb.9.6.2464-2476.1989 ◽

1989 ◽

Vol 9 (6) ◽

pp. 2464-2476

Author(s):

M Cockell ◽

B J Stevenson ◽

M Strubin ◽

O Hagenbüchle ◽

P K Wellauer

Keyword(s):

Dna Binding ◽

Dna Sequences ◽

Dna Interaction ◽

Binding Activity ◽

Alpha Amylase ◽

Dna Binding Activity ◽

A Cell ◽

Flanking Regions

Footprint analysis of the 5'-flanking regions of the alpha-amylase 2, elastase 2, and trypsin a genes, which are expressed in the acinar pancreas, showed multiple sites of protein-DNA interaction for each gene. Competition experiments demonstrated that a region from each 5'-flanking region interacted with the same cell-specific DNA-binding activity. We show by in vitro binding assays that this DNA-binding activity also recognizes a sequence within the 5'-flanking regions of elastase 1, chymotrypsinogen B, carboxypeptidase A, and trypsin d genes. Methylation interference and protection studies showed that the DNA-binding activity recognized a bipartite motif, the subelements of which were separated by integral helical turns of DNA. The alpha-amylase 2 cognate sequence was found to enhance in vivo transcription of its own promoter in a cell-specific manner, which identified the DNA-binding activity as a transcription factor (PTF 1). The observation that PTF 1 bound to DNA sequences that have been defined as transcriptional enhancers by others suggests that this factor is involved in the coordinate expression of genes transcribed in the acinar pancreas.

The DNA walk and its demonstration of deterministic chaos—relevance to genomic alterations in lung cancer

Bioinformatics ◽

10.1093/bioinformatics/bty1021 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2738-2748 ◽

Cited By ~ 1

Author(s):

Blake Hewelt ◽

Haiqing Li ◽

Mohit Kumar Jolly ◽

Prakash Kulkarni ◽

Isa Mambetsariev ◽

...

Keyword(s):

Lung Cancer ◽

Open Source ◽

Fractal Analysis ◽

Dna Sequences ◽

Chaotic Behavior ◽

Supplementary Information ◽

Wild Type ◽

Genomic Alterations ◽

Turtle Graphics ◽

Dna Walk

Abstract Motivation Advancements in cancer genetics have facilitated the development of therapies with actionable mutations. Although mutated genes have been studied extensively, their chaotic behavior has not been appreciated. Thus, in contrast to naïve DNA, mutated DNA sequences can display characteristics of unpredictability and sensitivity to the initial conditions that may be dictated by the environment, expression patterns and presence of other genomic alterations. Employing a DNA walk as a form of 2D analysis of the nucleotide sequence, we demonstrate that chaotic behavior in the sequence of a mutated gene can be predicted. Results Using fractal analysis for these DNA walks, we have determined the complexity and nucleotide variance of commonly observed mutated genes in non-small cell lung cancer, and their wild-type counterparts. DNA walks for wild-type genes demonstrate varying levels of chaos, with BRAF, NTRK1 and MET exhibiting greater levels of chaos than KRAS, paxillin and EGFR. Analyzing changes in chaotic properties, such as changes in periodicity and linearity, reveals that while deletion mutations indicate a notable disruption in fractal ‘self-similarity’, fusion mutations demonstrate bifurcations between the two genes. Our results suggest that the fractals generated by DNA walks can yield important insights into potential consequences of these mutated genes. Availability and implementation Introduction to Turtle graphics in Python is an open source article on learning to develop a script for Turtle graphics in Python, freely available on the web at https://docs.python.org/2/library/turtle.html. cDNA sequences were obtained through NCBI RefSeq database, an open source database that contains information on a large array of genes, such as their nucleotide and amino acid sequences, freely available at https://www.ncbi.nlm.nih.gov/refseq/. FracLac plugin for Fractal analysis in ImageJ is an open source plugin for the ImageJ program to perform fractal analysis, free to download at https://imagej.nih.gov/ij/plugins/fraclac/FLHelp/Introduction.html. Supplementary information Supplementary data are available at Bioinformatics online.
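
A 2D DNA walk of the kind the abstract describes can be sketched in a few lines: each nucleotide moves a cursor one unit step, and the visited points trace a path that can then be drawn or analyzed. The base-to-direction mapping below is an assumption made for illustration, since the abstract does not give the authors' mapping. The sketch returns coordinates rather than drawing them, so it runs without a display.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping of bases to unit steps; the paper's own mapping is not
# stated in the abstract.
STEPS: Dict[str, Tuple[int, int]] = {
    "A": (0, 1),   # up
    "T": (0, -1),  # down
    "G": (1, 0),   # right
    "C": (-1, 0),  # left
}


def dna_walk(seq: str, steps: Dict[str, Tuple[int, int]] = STEPS) -> List[Tuple[int, int]]:
    """Return the list of (x, y) points visited, starting at the origin."""
    x = y = 0
    path = [(x, y)]
    for base in seq.upper():
        if base not in steps:
            continue  # skip ambiguous symbols such as N
        dx, dy = steps[base]
        x += dx
        y += dy
        path.append((x, y))
    return path


path = dna_walk("ATGGCA")
print(path)
```

The resulting list of points could be passed to a plotting or turtle-style drawing routine, and a mutated sequence would yield a different path from the same starting point.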

Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

Dna Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment, which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Supplementary information Supplementary data are available at Bioinformatics online.
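
The classification scheme outlined in the abstract can be illustrated with a toy fixed-order Markov model: train one model per reference genome from counts of each base following each length-k context, then assign a fragment to the model with the highest log-likelihood. This is a minimal sketch, not the authors' C++ implementation. The add-one pseudocount, the class and function names, and the toy sequences are all assumptions, and the order used in the example is far below the 9th order and higher studied in the paper.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class MarkovModel:
    """Fixed-order Markov model over DNA with additive smoothing."""

    def __init__(self, order: int, pseudocount: float = 1.0):
        self.order = order
        self.pseudocount = pseudocount
        # counts[context][next_base] = number of times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, seq: str) -> None:
        k = self.order
        for i in range(k, len(seq)):
            self.counts[seq[i - k:i]][seq[i]] += 1.0

    def log_likelihood(self, seq: str) -> float:
        k = self.order
        total = 0.0
        for i in range(k, len(seq)):
            ctx = self.counts.get(seq[i - k:i], {})  # unseen context -> uniform
            num = ctx.get(seq[i], 0.0) + self.pseudocount
            den = sum(ctx.values()) + self.pseudocount * len(ALPHABET)
            total += math.log(num / den)
        return total


def classify(fragment: str, models: dict) -> str:
    """Return the name of the model that scores the fragment highest."""
    return max(models, key=lambda name: models[name].log_likelihood(fragment))


# Toy "genomes" with an order-2 model, purely for illustration.
models = {"at": MarkovModel(order=2), "gc": MarkovModel(order=2)}
models["at"].train("AT" * 50)
models["gc"].train("GC" * 50)

print(classify("ATATATAT", models))
print(classify("GCGCGC", models))
```

With four bases, an order-k model has 4^k possible contexts, so higher orders need far more training sequence before the counts become reliable. That is the data limitation the abstract mentions.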

Reprogrammable Recognition Codes in Bicoid Homeodomain-DNA Interaction

Molecular and Cellular Biology ◽

10.1128/mcb.20.20.7673-7684.2000 ◽

2000 ◽

Vol 20 (20) ◽

pp. 7673-7684 ◽

Cited By ~ 32

Author(s):

Vrushank Dave ◽

Chen Zhao ◽

Fan Yang ◽

Chang-Shung Tung ◽

Jun Ma

Keyword(s):

Dna Sequences ◽

Human Disease ◽

Dna Recognition ◽

Dna Interaction ◽

Morphogenetic Protein ◽

Different Types ◽

Disease Protein

ABSTRACT We describe experiments to determine how the homeodomain of the Drosophila morphogenetic protein Bicoid recognizes different types of DNA sequences found in natural enhancers. Our chemical footprint analyses reveal that the Bicoid homeodomain makes both shared and distinct contacts with a consensus site A1 (TAATCC) and a nonconsensus site X1 (TAAGCT). In particular, the guanine of X1 at position 4 (TAAGCT) is protected by Bicoid homeodomain. We provide further evidence suggesting that the unique arginine at position 54 (Arg 54) of the Bicoid homeodomain enables the protein to recognize X1 by specifically interacting with this position 4 guanine. We also describe experiments to analyze the contribution of artificially introduced Arg 54 to DNA recognition by other Bicoid-related homeodomains, including that from the human disease protein Pitx2. Our experiments demonstrate that the role of Arg 54 varies depending on the exact homeodomain framework and DNA sequences. Together, our results suggest that Bicoid and its related homeodomains utilize distinct recognition codes to interact with different DNA sequences, underscoring the need to study DNA recognition by Bicoid-class homeodomains in an individualized manner.
