Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss

Genetics ◽  
2020 ◽  
Vol 216 (2) ◽  
pp. 353-358
Author(s):  
Mengchi Wang ◽  
David Wang ◽  
Kai Zhang ◽  
Vu Ngo ◽  
Shicai Fan ◽  
...  

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification.
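The idea of choosing a wildcard set per position by minimizing a divergence can be sketched in a few lines. This is only an illustrative, per-column approximation and not the authors' Motto implementation: the function names (`jsd`, `best_wildcard`, `consensus`), the uniform-over-subset candidate distributions, and the toy PWM are all assumptions made for this example.

```python
import math
from itertools import combinations

ALPHABET = "ACGT"  # assumed nucleotide alphabet; column order is A, C, G, T


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_wildcard(column):
    """Return the letter subset whose uniform distribution is closest
    (in JSD) to the given PWM column. Hypothetical helper for illustration."""
    best = None
    for k in range(1, len(ALPHABET) + 1):
        for subset in combinations(range(len(ALPHABET)), k):
            q = [1 / k if i in subset else 0.0 for i in range(len(ALPHABET))]
            d = jsd(column, q)
            if best is None or d < best[0]:
                best = (d, subset)
    return "".join(ALPHABET[i] for i in best[1])


def consensus(pwm):
    """Convert a PWM (list of probability columns) to a wildcard-style string."""
    parts = []
    for column in pwm:
        letters = best_wildcard(column)
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return "".join(parts)


# Toy PWM (made up): one fixed position, one two-letter position, one free position.
toy_pwm = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
print(consensus(toy_pwm))
```

Because the resulting string uses bracketed character classes, it can be passed directly to a regular-expression search (for example `re.finditer`) to locate candidate matches in a sequence.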

2019 ◽  
Author(s):  
Mengchi Wang ◽  
David Wang ◽  
Kai Zhang ◽  
Vu Ngo ◽  
Shicai Fan ◽  
...  

ABSTRACT Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, representing motifs by wildcard-style consensus sequences is compact and sufficient for interpreting the motif information and searching for motif matches. Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized alphabets. Here we show that this representation provides a simple and efficient way to identify the binding sites of 1156 common TFs in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves 0.81 area under the precision-recall curve, significantly (p-value < 0.01) outperforming all existing methods, including maximal positional weight, Douglas and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification. AVAILABILITY Motto is freely available at http://wanglab.ucsd.edu/star/motto.


2013 ◽  
Vol 11 (01) ◽  
pp. 1340004 ◽  
Author(s):  
IVAN KULAKOVSKIY ◽  
VICTOR LEVITSKY ◽  
DMITRY OSHCHEPKOV ◽  
LEONID BRYZGALOV ◽  
ILYA VORONTSOV ◽  
...  

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) has become a method of choice to locate DNA segments bound by different regulatory proteins. ChIP-Seq produces extremely valuable information to study transcriptional regulation. The wet-lab workflow is often supported by downstream computational analysis, including construction of models of nucleotide sequences of transcription factor binding sites (TFBSs) in DNA, which can be used to detect binding sites in ChIP-Seq data at single base pair resolution. The most popular TFBS model is the positional weight matrix (PWM) with statistically independent positional weights of nucleotides in different columns; such PWMs are constructed from a gapless multiple local alignment of sequences containing experimentally identified TFBSs. Modern high-throughput techniques, including ChIP-Seq, provide enough data for careful training of advanced models containing more parameters than a PWM. Yet, many suggested multiparametric models provide only incremental improvement of TFBS recognition quality compared to traditional PWMs trained on ChIP-Seq data. We present a novel computational tool, diChIPMunk, that constructs TFBS models as optimal dinucleotide PWMs, thus accounting for correlations between neighboring nucleotides in input sequences. diChIPMunk utilizes many advantages of ChIPMunk, its ancestor algorithm, accounting for ChIP-Seq base coverage profiles ("peak shape") and using the effective subsampling-based core procedure which allows processing of large datasets. We demonstrate that diPWMs constructed by diChIPMunk outperform traditional PWMs constructed by ChIPMunk from the same ChIP-Seq data. Software website: http://autosome.ru/dichipmunk/
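How a dinucleotide PWM scores a candidate site can be illustrated with a short sketch. It assumes the common formulation in which a site of length L is scored as the sum of L-1 weights, one per pair of adjacent nucleotides. The data layout (a list of dicts keyed by dinucleotide), the function names, and the toy weights are hypothetical and are not taken from diChIPMunk.

```python
from itertools import product

# All 16 dinucleotides, "AA" through "TT".
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def score_dipwm(dipwm, site):
    """Sum the weight of each overlapping dinucleotide in the site.

    dipwm: list of dicts, one per adjacent-position pair, mapping a
           dinucleotide to its weight (assumed layout for this sketch).
    site:  string of length len(dipwm) + 1.
    """
    if len(site) != len(dipwm) + 1:
        raise ValueError("site length must be number of diPWM columns + 1")
    return sum(column[site[i:i + 2]] for i, column in enumerate(dipwm))


def best_hit(dipwm, sequence):
    """Slide the model along the sequence; return (best score, start index)."""
    width = len(dipwm) + 1
    return max(
        (score_dipwm(dipwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy model (made-up weights) that prefers the 3-mer "GAT".
toy_dipwm = [{d: -1.0 for d in DINUCS} for _ in range(2)]
toy_dipwm[0]["GA"] = 2.0
toy_dipwm[1]["AT"] = 2.0

print(score_dipwm(toy_dipwm, "GAT"))
print(best_hit(toy_dipwm, "CCGATCC"))
```

Unlike a mononucleotide PWM, each column here depends on two adjacent bases, which is how correlations between neighboring positions enter the score.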


Genome ◽  
1989 ◽  
Vol 31 (2) ◽  
pp. 503-509 ◽  
Author(s):  
Veronica C. Blasquez ◽  
Ann O. Sperry ◽  
Peter N. Cockerill ◽  
William T. Garrard

We have recently identified an evolutionarily conserved class of sequences that organize chromosomal loops in the interphase nucleus, which we have termed "matrix association regions" (MARs). MARs are about 200 bp long, AT-rich, contain topoisomerase II consensus sequences and other AT-rich sequence motifs, often reside near cis-acting regulatory sequences, and their binding sites are abundant (> 10 000 per mammalian nucleus). Here we demonstrate that the interactions between the mouse κ immunoglobulin gene MAR and topoisomerase II or the "nuclear matrix" occur between multiple and sometimes overlapping binding sites. Interestingly, the sites most susceptible to topoisomerase II cleavage are localized near the breakpoints of a previously described illegitimate recombination event. The presence of multiple binding sites within single MARs may allow DNA and RNA polymerase passage without disrupting primary loop organization. Key words: MARs, chromatin loops, topoisomerase II, nuclear matrix.


1988 ◽  
Vol 8 (6) ◽  
pp. 2275-2279 ◽  
Author(s):  
M E Cerdan ◽  
R S Zitomer

In Saccharomyces cerevisiae, the two genes, CYC1 and CYC7, that encode the isoforms of cytochrome c are expressed at different levels. Oxygen regulation is mediated by the expression of the CYP1 gene, and the CYP1 protein interacts with both CYC1 upstream activation sequence 1 (UAS1) and CYC7 UASo. In this study, the homology between the CYP1-binding sites of both genes was investigated. The most noticeable difference between the CYC1 and CYC7 UASs is the presence of GC base pairs at the same positions in a repeated sequence in CYC7 compared with CG base pairs in CYC1. Directed mutagenesis changing these GC residues to CG residues in CYC7 led to CYC1-like expression of CYC7 both in a CYP1 wild-type strain and in a strain carrying the semidominant mutation CYP1-16 which reverses the oxygen-dependent expression of the two genes. Our results strongly support the hypothesis that the CYP1-binding sites in CYC1 and CYC7 are related forms of the same sequence and that the CYP1-16 protein has altered specificity for the variant forms of the consensus sequences in both genes.


2021 ◽  
Vol 25 (1) ◽  
pp. 7-17
Author(s):  
A. V. Tsukanov ◽  
V. G. Levitsky ◽  
T. I. Merkulova

The most popular model for the search of ChIP-seq data for transcription factor binding sites (TFBS) is the positional weight matrix (PWM). However, this model does not take into account dependencies between nucleotide occurrences in different site positions. Two recently proposed models, BaMM and InMoDe, do take such dependencies into account. However, these models have mostly been applied only to compare their recognition accuracy with that of PWMs; no analysis of the co-prediction and relative positioning of hits from different models within peaks has yet been performed. To close this gap, we propose a pipeline called MultiDeNA. The pipeline comprises model training, assessment of recognition accuracy, scanning of ChIP-seq peaks, and classification of peaks based on the scan results. We applied our pipeline to 22 ChIP-seq datasets of TF FOXA2 and considered PWM, dinucleotide PWM (diPWM), BaMM and InMoDe models. The combination of these four models allowed a significant increase in the fraction of recognized peaks compared to that for the PWM model alone: the increase was 26.3 %. The BaMM model provided the main contribution to the recognition of sites. Although most predicted peaks contained TFBSs from different models at coinciding positions, the median fractions of peaks containing predictions from a single model only were 1.08, 0.49, 4.15 and 1.73 % for PWM, diPWM, BaMM and InMoDe, respectively. Thus, FOXA2 binding sites were not fully described by any single model, which indicates their heterogeneity. We assume that the BaMM model is the most successful in describing the structure of the FOXA2 binding sites in the ChIP-seq datasets under study.
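The peak-level bookkeeping described above (the fraction of peaks recognized by any model versus the fraction recognized by one model only) amounts to simple set arithmetic. The sketch below is a hypothetical illustration, not part of MultiDeNA: the function name, the input layout (model name mapped to a set of peak identifiers), and the toy numbers are all assumptions.

```python
def recognition_summary(hits, n_peaks):
    """Summarize peak recognition across models.

    hits:    dict mapping a model name to the set of peak IDs in which
             that model predicts at least one site (assumed layout).
    n_peaks: total number of peaks in the dataset.
    """
    recognized_by_any = set().union(*hits.values())
    summary = {"any_model": len(recognized_by_any) / n_peaks}
    for name, peaks in hits.items():
        # Peaks found by every other model, pooled together.
        others = set().union(
            *(p for other, p in hits.items() if other != name)
        )
        # Peaks that only this model recognizes.
        summary[f"only_{name}"] = len(peaks - others) / n_peaks
    return summary


# Toy input (made-up peak IDs, not data from the study).
toy_hits = {"PWM": {1, 2, 3}, "BaMM": {2, 3, 4, 5}}
print(recognition_summary(toy_hits, n_peaks=10))
```

Comparing `any_model` with the fraction for the PWM alone gives the kind of gain in recognized peaks that the abstract reports, while the `only_*` entries correspond to peaks explained by a single model.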


1999 ◽  
Vol 181 (6) ◽  
pp. 1934-1938 ◽  
Author(s):  
Chung-Dar Lu ◽  
Ahmed T. Abdelal

ABSTRACT The ast operon, encoding enzymes of the arginine succinyltransferase (AST) pathway, was cloned from Salmonella typhimurium, and the nucleotide sequence for the upstream flanking region was determined. The control region contains several regulatory consensus sequences, including binding sites for NtrC, cyclic AMP receptor protein (CRP), and ArgR. The results of DNase I footprintings and gel retardation experiments confirm binding of these regulatory proteins to the identified sites. Exogenous arginine induced AST under nitrogen-limiting conditions, and this induction was abolished in an argR derivative. AST was also induced under carbon starvation conditions; this induction required functional CRP as well as functional ArgR. The combined data are consistent with the hypothesis that binding of one or more ArgR molecules to a region between the upstream binding sites for NtrC and CRP and two putative promoters plays a pivotal role in modulating expression of the ast operon in response to nitrogen or carbon limitation.


1995 ◽  
Vol 42 (2) ◽  
pp. 183-189 ◽  
Author(s):  
J Szopa

The nuclear matrices of plant cell nuclei display intrinsic nuclease activity which consists in nicking supercoiled DNA. A cDNA encoding a 32 kDa endonuclease has been cloned and sequenced. The nucleotide and deduced amino-acid sequences show high homology to known 14-3-3 protein sequences from other sources. The amino-acid sequence shows agreement with consensus sequences for potential phosphorylation by protein kinase A and C and for calcium, lipid and membrane-binding sites. The nucleotide-binding site is also present within the conserved part of the sequence. By Northern blot analysis, the differential expression of the corresponding mRNA was detected; it was the strongest in sink tissues. The endonuclease activity found on DNA-polyacrylamide gel electrophoresis coincided with mRNA content and was the highest in tuber.


2020 ◽  
Vol 36 (11) ◽  
pp. 3573-3575
Author(s):  
Henry Pratt ◽  
Zhiping Weng

Summary: Sequence logos were introduced nearly 30 years ago as a human-readable format for representing consensus sequences, and they remain widely used. As new experimental and computational techniques have developed, logos have been extended: extra symbols represent covalent modifications to nucleotides, logos with multiple letters at each position illustrate models with multi-nucleotide features, and symbols extending below the x-axis may represent a binding energy penalty for a residue or a negative weight output from a neural network. Web-based visualization tools for genomic data are increasingly taking advantage of modern web technology to offer dynamic, interactive figures to users, but support for sequence logos remains limited. Here, we present LogoJS, a Javascript package for rendering customizable, interactive, vector-graphic sequence logos and embedding them in web applications. LogoJS supports all the aforementioned logo extensions and is bundled with a companion web application for creating and sharing logos. Availability and implementation: LogoJS is implemented both in plain Javascript and ReactJS, a popular user-interface framework. The web application is hosted at logojs.wenglab.org. All major browsers and operating systems are supported. The package and application are open-source; code is available at GitHub. Supplementary information: Supplementary data are available at Bioinformatics online.
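For readers unfamiliar with how letter heights in a classic sequence logo are derived, the following sketch computes them for one position using the standard information-content formulation (maximum entropy minus observed entropy, in bits), without the small-sample correction. It is written in Python for illustration only and does not use or represent the LogoJS API; the function name and the example columns are assumptions.

```python
import math


def column_heights(probs):
    """Letter heights (in bits) for one logo position.

    probs: dict mapping each symbol to its probability at this position.
    The column's information content is log2(alphabet size) minus the
    Shannon entropy; each letter gets a share proportional to its probability.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    information = math.log2(len(probs)) - entropy
    return {symbol: p * information for symbol, p in probs.items()}


# A fully conserved position: "A" reaches the 2-bit maximum for DNA.
print(column_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))

# A uniform position carries no information, so every letter has height 0.
print(column_heights({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```

Extensions such as symbols below the x-axis use other per-letter quantities (for example energies or signed weights) in place of these non-negative heights.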

