Large-scale sequence similarity analysis reveals the scope of sequence and function divergence in PilZ domain proteins

2020 ◽  
Author(s):  
Qing Wei Cheang ◽  
Shuo Sheng ◽  
Linghui Xu ◽  
Zhao-Xun Liang

Abstract: PilZ domain-containing proteins constitute a superfamily of widely distributed bacterial signalling proteins. Although studies have established the canonical PilZ domain as an adaptor protein domain evolved to specifically bind the second messenger c-di-GMP, mounting evidence suggests that the PilZ domain has undergone enormous divergent evolution to generate a superfamily of proteins characterized by a wide range of c-di-GMP-binding affinities, binding partners and cellular functions. The divergent evolution has even generated families of non-canonical PilZ domains that completely lack c-di-GMP binding ability. In this study, we performed a large-scale sequence analysis of more than 28,000 single- and di-domain PilZ proteins using the sequence similarity networking tool originally created to analyse functionally diverse enzyme superfamilies. The sequence similarity networks (SSNs) generated by the analysis feature a large number of putative isofunctional protein clusters and thus provide an unprecedented panoramic view of the sequence-function relationship and function diversification in PilZ proteins. Some of the protein clusters in the networks are unexplored clusters that contain proteins with completely unknown biological function, whereas others contain one, two or a few functionally known proteins, enabling us to infer the cellular function of uncharacterized homologs or orthologs. With the ultimate goal of elucidating the diverse roles played by PilZ proteins in bacterial signal transduction, the work described here will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help to prioritize functionally unknown PilZ proteins for future studies.
Importance: Although the PilZ domain is best known as the protein domain evolved specifically for the binding of the second messenger c-di-GMP, divergent evolution has generated a superfamily of PilZ proteins with a diversity of ligand- or protein-binding properties and cellular functions. We analysed the sequences of more than 28,000 PilZ proteins using the sequence similarity networking (SSN) tool to yield a global view of the sequence-function relationship and function diversification in PilZ proteins. The results will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help us prioritize PilZ proteins for future studies.
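The general idea behind a sequence similarity network can be illustrated with a minimal sketch: proteins are nodes, an edge joins two proteins whose pairwise similarity score meets a threshold, and each connected component is a candidate cluster. This is not the tool used in the study. The function name `ssn_clusters`, the protein identifiers, the scores and the threshold are all hypothetical, and the scores are assumed to come from some pairwise alignment step that is not shown.

```python
def ssn_clusters(nodes, scored_pairs, threshold):
    """Group nodes into connected components of a thresholded similarity graph.

    nodes        -- iterable of protein identifiers (hypothetical)
    scored_pairs -- iterable of (id_a, id_b, score) tuples (hypothetical scores)
    threshold    -- minimum score for two proteins to be joined by an edge
    """
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    # Largest clusters first; ties broken by member names for determinism.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


if __name__ == "__main__":
    proteins = ["p1", "p2", "p3", "p4", "p5"]
    pairs = [("p1", "p2", 80), ("p2", "p3", 75), ("p3", "p4", 20), ("p4", "p5", 90)]
    print(ssn_clusters(proteins, pairs, threshold=50))
```

Raising the threshold splits the network into smaller, more stringent clusters; lowering it merges them, which is why the choice of threshold shapes how "isofunctional" a cluster is likely to be.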

2019 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1006 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
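The sketching idea mentioned above can be illustrated with a minimal bottom-k MinHash example. It does not use or reproduce the sourmash API; the function names, the choice of SHA-1 as the hash, and the values of `k` and `size` are assumptions made only for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_sketch(seq, k=5, size=50):
    """Keep only the `size` smallest k-mer hashes as a compressed signature."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return hashes[:size]


def estimate_jaccard(sketch_a, sketch_b, size=50):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTAGGCATCGATCG"
    b = "ACGTTGCATGGCTTAGGCATCGTTCG"
    print(estimate_jaccard(bottom_sketch(a), bottom_sketch(b)))
```

Because only a fixed number of hashes is kept per sequence, the comparison cost depends on the sketch size and not on the length of the original sequences, which is what makes this family of methods fast and memory-light.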


2020 ◽  
Vol 36 (12) ◽  
pp. 3749-3757 ◽  
Author(s):  
Wei Zheng ◽  
Xiaogen Zhou ◽  
Qiqige Wuyun ◽  
Robin Pearce ◽  
Yang Li ◽  
...  

Abstract Motivation Protein domains are subunits that can fold and function independently. Correct domain boundary assignment is thus a critical step toward accurate protein structure and function analyses. There is, however, no efficient algorithm available for accurate domain prediction from sequence. The problem is particularly challenging for proteins with discontinuous domains, which consist of domain segments that are separated along the sequence. Results We developed a new algorithm, FUpred, which predicts protein domain boundaries utilizing contact maps created by deep residual neural networks coupled with coevolutionary precision matrices. The core idea of the algorithm is to retrieve domain boundary locations by maximizing the number of intra-domain contacts, while minimizing the number of inter-domain contacts from the contact maps. FUpred was tested on a large-scale dataset consisting of 2549 proteins and generated correct single- and multi-domain classifications with a Matthews correlation coefficient of 0.799, which was 19.1% (or 5.3%) higher than the best machine learning (or threading)-based method. For proteins with discontinuous domains, the domain boundary detection and normalized domain overlapping scores of FUpred were 0.788 and 0.521, respectively, which were 17.3% and 23.8% higher than the best control method. The results demonstrate a new avenue to accurately detect domain composition from sequence alone, especially for discontinuous, multi-domain proteins. Availability and implementation https://zhanglab.ccmb.med.umich.edu/FUpred. Supplementary information Supplementary data are available at Bioinformatics online.
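The core idea stated in the abstract (few contacts between domains, many within them) can be sketched for the simplest case of one continuous boundary. This is not FUpred's scoring function: the inter-domain contact density used below, the function name and the `min_len` parameter are simplifying assumptions for illustration, and the contact map is assumed to be a symmetric 0/1 matrix supplied by some upstream predictor.

```python
def best_two_domain_split(contacts, min_len=2):
    """Find the split position with the lowest inter-domain contact density.

    contacts -- symmetric square list of lists with 1 where two residues
                are in contact and 0 otherwise (assumed input)
    min_len  -- smallest allowed domain length (hypothetical parameter)

    Returns (position, density); residues [0, position) form the first
    domain and residues [position, n) the second.
    """
    n = len(contacts)
    best_pos, best_density = None, float("inf")
    for pos in range(min_len, n - min_len + 1):
        # Count contacts that cross the candidate boundary.
        inter = sum(contacts[i][j] for i in range(pos) for j in range(pos, n))
        # Normalise by the number of possible cross-boundary pairs so that
        # very unequal splits are not favoured simply for being small.
        density = inter / (pos * (n - pos))
        if density < best_density:
            best_pos, best_density = pos, density
    return best_pos, best_density


if __name__ == "__main__":
    n = 8
    # Toy map: two fully connected blocks of four residues, one linker contact.
    cm = [[1 if (i < 4) == (j < 4) and i != j else 0 for j in range(n)] for i in range(n)]
    cm[3][4] = cm[4][3] = 1
    print(best_two_domain_split(cm))
```

Discontinuous domains, which the abstract highlights as the hard case, would need a search over pairs of boundaries and are not covered by this sketch.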


F1000Research ◽  
2017 ◽  
Vol 5 ◽  
pp. 1987 ◽  
Author(s):  
Jasper J. Koehorst ◽  
Edoardo Saccenti ◽  
Peter J. Schaap ◽  
Vitor A. P. Martins dos Santos ◽  
Maria Suarez-Diez

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems; the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition and the high computational cost for finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries, and it is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.


Author(s):  
Manoj Kumar ◽  
Paul Carr ◽  
Simon Turner

Abstract: S-acylation is the addition of a fatty acid to a cysteine residue of a protein. While this modification may profoundly alter protein behaviour, its effects on the function of plant proteins remain poorly characterised, largely as a result of the lack of basic information regarding which proteins are S-acylated and where in the proteins the modification occurs. To address this gap in our knowledge, we have performed a comprehensive analysis of plant protein S-acylation from 6 separate tissues. In our highest confidence group, we identified 5185 cysteines modified by S-acylation, which were located in 4891 unique peptides from 2643 different proteins. This represents around 9% of the entire Arabidopsis proteome and suggests an important role for S-acylation in many essential cellular functions including trafficking, signalling and metabolism. To illustrate the potential of this dataset, we focus on cellulose synthesis and confirm for the first time the S-acylation of all proteins known to be involved in cellulose synthesis and trafficking of the cellulose synthase complex. In the secondary cell walls, cellulose synthesis requires three different catalytic subunits (CESA4, CESA7 and CESA8) that all exhibit striking sequence similarity. While all three proteins have been widely predicted to possess a RING-type zinc finger at their N-terminus, for CESA4 and CESA8 we find evidence for S-acylation of cysteines in this region that is incompatible with any role in coordinating metal ions. We show that while CESA7 may possess a RING-type domain, the same region of CESA4 and CESA8 appears to have evolved a very different structure. Together, the data suggest that this study represents an atlas of S-acylation in Arabidopsis that will facilitate the broader study of this elusive post-translational modification in plants, and they demonstrate the importance of undertaking further work in this area.


2020 ◽  
Author(s):  
Matthew Merski ◽  
Krzysztof Młynarczyk ◽  
Jan Ludwiczak ◽  
Jakub Skrzeczkowski ◽  
Stanisław Dunin-Horkawicz ◽  
...  

Abstract Background: Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences leads to difficulties in identifying whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional “dot plot” protein sequence self-similarity analysis tended to be conserved in sets of related repeat proteins and that this conservation could be quantitated using a Jaccard metric. Results: Comparison of these dot plots obviated the issues due to sequence similarity for analysis of repeat proteins. A high Jaccard similarity score was suggestive of a conserved relationship between closely related repeat proteins. The dot plot patterns decayed quickly in the absence of selective pressure, with an expected loss of 50% of Jaccard similarity due to a loss of 8.2% sequence identity. To perform method testing, we assembled a standard set of 79 repeat proteins representing all the subgroups in RepeatsDB. Comparison of known repeat and non-repeat proteins from the PDB suggested that the information content in dot plots could be used to identify repeat proteins from pure sequence with no requirement for structural information. Analysis of the UniRef90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large majority (82.9%) of clusters containing between 5 and 200 members were of a single functional type. Conclusions: Dot plot analysis of repeat proteins attempts to obviate issues that arise due to the sequence degeneracy of repeat proteins. These results show that this kind of analysis can efficiently be applied to analyze repeat proteins on a large scale.
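A minimal sketch of the two ingredients named above, a self-similarity dot plot and a Jaccard comparison between two such plots, is given here. It is not the authors' implementation: the exact-match window, the function names and the toy sequences are assumptions for illustration, and a real analysis would likely use a substitution-matrix score instead of exact identity.

```python
def dot_plot(seq, window=3):
    """Return the self-similarity dot plot of seq as a set of (i, j) pairs.

    A dot is placed at (i, j), i < j, when the windows starting at i and j
    are identical (exact matching is a simplifying assumption).
    """
    n = len(seq) - window + 1
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if seq[i:i + window] == seq[j:j + window]
    }


def jaccard(dots_a, dots_b):
    """Jaccard similarity between two dot plots: |A & B| / |A | B|."""
    union = dots_a | dots_b
    if not union:
        return 0.0
    return len(dots_a & dots_b) / len(union)


if __name__ == "__main__":
    repeat = "GAPGAPGAP"   # toy perfect repeat (hypothetical sequence)
    mutated = "GAPGAPGAW"  # same repeat with one substitution
    print(jaccard(dot_plot(repeat), dot_plot(mutated)))
```

In this toy case a single substitution removes two of the five dots, which gives a feel for how the dot plot pattern can decay faster than sequence identity.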


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1987 ◽  
Author(s):  
Jasper J. Koehorst ◽  
Edoardo Saccenti ◽  
Peter J. Schaap ◽  
Vitor A. P. Martins dos Santos ◽  
Maria Suarez-Diez

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large-scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, and not quadratically, with the number of genomes, it is suitable for large-scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
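The linear scaling mentioned above can be illustrated with a minimal sketch in which proteins are grouped by their ordered list of domains using a dictionary, so each protein is visited once and no all-against-all comparison is needed. The function name, protein identifiers and domain labels are hypothetical, and the domain annotations are assumed to have been produced beforehand by a separate domain-detection step.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group protein identifiers by domain architecture.

    proteins -- dict mapping a protein id to its ordered list of domain
                labels (hypothetical annotations)

    Each protein is processed once, so the cost grows linearly with the
    number of proteins; pairwise sequence comparison grows quadratically.
    """
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


if __name__ == "__main__":
    annotated = {
        "genomeA_001": ["domX", "domY"],
        "genomeB_042": ["domX", "domY"],
        "genomeB_107": ["domY", "domX"],  # same domains, different order
        "genomeC_003": ["domZ"],
    }
    for architecture, members in group_by_architecture(annotated).items():
        print(architecture, members)
```

Treating the architecture as an ordered tuple keeps proteins with the same domains in a different order in separate groups; whether order should matter is a design choice that this sketch does not settle.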


Author(s):  
Daniel A Nissley ◽  
Anna Carbery ◽  
Mark Chonofsky ◽  
Charlotte M Deane

Abstract Motivation Protein synthesis is a non-equilibrium process, meaning that the speed of translation can influence the ability of proteins to fold and function. Assuming that structurally similar proteins fold by similar pathways, the profile of translation speed along an mRNA should be evolutionarily conserved between related proteins to direct correct folding and downstream function. The only evidence to date for such conservation of translation speed between homologous proteins has used codon rarity as a proxy for translation speed. There are, however, many other factors including mRNA structure and the chemistry of the amino acids in the A- and P-sites of the ribosome that influence the speed of amino acid addition. Results Ribosome profiling experiments provide a signal directly proportional to the underlying translation times at the level of individual codons. We compared ribosome occupancy profiles (extracted from five different large-scale yeast ribosome profiling studies) between related protein domains to more directly test if their translation schedule was conserved. Our analysis reveals that the ribosome occupancy profiles of paralogous domains tend to be significantly more similar to one another than to profiles of non-paralogous domains. This trend does not depend on domain length, structural classes, amino acid composition or sequence similarity. Our results indicate that entire ribosome occupancy profiles and not just rare codon locations are conserved between even distantly related domains in yeast, providing support for the hypothesis that translation schedule is conserved between structurally related domains to retain folding pathways and facilitate efficient folding. Availability and implementation Python3 code is available on GitHub at https://github.com/DanNissley/Compare-ribosome-occupancy. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Hiral Sanghavi ◽  
Sharmistha Majumdar

The THAP (Thanatos-associated protein) domain is a DNA-binding domain which binds DNA via a zinc-coordinating C2CH motif. Although THAP domains share a conserved structural fold, they bind different DNA sequences in different THAP proteins, which in turn perform distinct cellular functions. In this study, we investigate (using multiple sequence alignment, in silico motif and secondary structure prediction) THAP domain conservation within the homologs of the human THAP (hTHAP) protein family. We report that there is significant variation in sequence and predicted secondary structure elements across hTHAP homologs. Interestingly, we report that the THAP domain can be either longer or shorter than the conventional 90 residues and that the amino-terminal C2CH motif within the THAP domain serves as a hotspot for insertion or deletion. Our results lay the foundation for future studies which will further our understanding of the evolution of the THAP domain and the regulation of its function.

