Self-Analysis of Repeat Proteins Reveals Evolutionarily Conserved Patterns

2020 ◽  
Author(s):  
Matthew Merski ◽  
Krzysztof Młynarczyk ◽  
Jan Ludwiczak ◽  
Jakub Skrzeczkowski ◽  
Stanisław Dunin-Horkawicz ◽  
...  

Abstract Background: Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences leads to difficulties in identifying whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional “dot plot” protein sequence self-similarity analysis tended to be conserved in sets of related repeat proteins and that this conservation could be quantitated using a Jaccard metric. Results: Comparison of these dot plots obviated the issues due to sequence similarity for analysis of repeat proteins. A high Jaccard similarity score was suggestive of a conserved relationship between closely related repeat proteins. The dot plot patterns decayed quickly in the absence of selective pressure, with an expected loss of 50% of Jaccard similarity due to a loss of 8.2% sequence identity. To perform method testing, we assembled a standard set of 79 repeat proteins representing all the subgroups in RepeatsDB. Comparison of known repeat and non-repeat proteins from the PDB suggested that the information content in dot plots could be used to identify repeat proteins from pure sequence with no requirement for structural information. Analysis of the UniRef90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large proportion (82.9%) of clusters containing between 5 and 200 members were of a single functional type. Conclusions: Dot plot analysis of repeat proteins attempts to obviate issues that arise due to the sequence degeneracy of repeat proteins. These results show that this kind of analysis can efficiently be applied to analyze repeat proteins on a large scale.
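The idea of comparing self-similarity dot plots with a Jaccard metric can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact-match word comparison, the `window` parameter, the function names `self_dot_plot` and `jaccard`, and the restriction to equal-length toy sequences are all assumptions made for illustration.

```python
def self_dot_plot(seq, window=3):
    """Return the set of (i, j) cells with i < j where the words of
    length `window` starting at positions i and j are identical.
    Exact word matching is an assumed, simplified scoring rule."""
    n = len(seq) - window + 1
    dots = set()
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i:i + window] == seq[j:j + window]:
                dots.add((i, j))
    return dots


def jaccard(a, b):
    """Jaccard similarity of two dot sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0  # two empty plots are treated as identical
    return len(a & b) / len(a | b)


# Two toy sequences with no residues in common but the same repeat layout.
plot_a = self_dot_plot("ABCABCABC")
plot_b = self_dot_plot("XYZXYZXYZ")
# A copy of the first sequence with its last residue mutated.
plot_c = self_dot_plot("ABCABCABD")

print(jaccard(plot_a, plot_b))  # same pattern despite different residues
print(jaccard(plot_a, plot_c))  # one substitution removes part of the pattern
```

In this toy case the first two plots coincide although the sequences share no residues, while a single substitution in the third sequence removes two of the five dots, which mirrors in miniature the sensitivity of the pattern to sequence change described in the abstract.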

2020 ◽  
Author(s):  
Matthew Merski ◽  
Krzysztof Młynarczyk ◽  
Jan Ludwiczak ◽  
Jakub Skrzeczkowski ◽  
Stanisław Dunin-Horkawicz ◽  
...  

Abstract Background Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences leads to difficulties in identifying whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional “dot plot” protein sequence self-similarity analysis tended to be conserved in sets of related repeat proteins and that this conservation could be quantitated using a Jaccard metric. Results Comparison of these dot plots obviated the issues due to sequence similarity for analysis of repeat proteins. A high Jaccard similarity score was suggestive of a conserved relationship between closely related repeat proteins. The dot plot patterns decay quickly in the absence of selective pressure, with an expected loss of 50% of Jaccard similarity due to a loss of 8.2% sequence identity. We assembled a standard set of 79 repeat proteins representing all the subgroups in RepeatsDB for method testing. Comparison of known repeat and non-repeat proteins from the PDB suggested that the information content in dot plots could be used to identify repeat proteins from pure sequence without needing structural information. Analysis of the UniRef90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large proportion (82.9%) of clusters containing between 5 and 200 members were of a single functional type. Conclusions Dot plot analysis of repeat proteins attempts to obviate issues that arise due to the sequence degeneracy of repeat proteins. These results show that this kind of analysis can efficiently be applied to analyze repeat proteins on a large scale.


2019 ◽  
Author(s):  
Matthew Merski ◽  
Krzysztof Młynarczyk ◽  
Jan Ludwiczak ◽  
Jakub Skrzeczkowski ◽  
Stanisław Dunin-Horkawicz ◽  
...  

Abstract Background Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences makes it difficult to identify whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional dot plot protein self-analysis tended to be conserved in sets of related repeat proteins and that this conservation could be quantified using a standard Jaccard metric. Results Use of these plots obviated the issues due to sequence similarity for analysis of these proteins. The dot plot patterns decay quickly in the absence of selective pressure, with an expected loss of 50% of Jaccard similarity due to a loss of 8.2% sequence identity. Comparison of repeat and non-repeat proteins in the PDB suggested that the information content in dot plots could be used to identify repeat proteins from pure sequence. Analysis of the UniProt90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large proportion (82.9%) of clusters containing between 5 and 200 members were of a single functional type. Conclusions Dot plot analysis of repeat proteins obviates the issues that arise from sequence degeneracy. These results show that this kind of analysis can efficiently be applied to analyze repeat proteins on a large scale.


2020 ◽  
Author(s):  
Qing Wei Cheang ◽  
Shuo Sheng ◽  
Linghui Xu ◽  
Zhao-Xun Liang

Abstract PilZ domain-containing proteins constitute a superfamily of widely distributed bacterial signalling proteins. Although studies have established the canonical PilZ domain as an adaptor protein domain evolved to specifically bind the second messenger c-di-GMP, mounting evidence suggests that the PilZ domain has undergone enormous divergent evolution to generate a superfamily of proteins that are characterized by a wide range of c-di-GMP-binding affinities, binding partners and cellular functions. The divergent evolution has even generated families of non-canonical PilZ domains that completely lack c-di-GMP binding ability. In this study, we performed a large-scale sequence analysis on more than 28,000 single- and di-domain PilZ proteins using the sequence similarity networking tool created originally to analyse functionally diverse enzyme superfamilies. The sequence similarity networks (SSN) generated by the analysis feature a large number of putative isofunctional protein clusters and thus provide an unprecedented panoramic view of the sequence-function relationship and function diversification in PilZ proteins. Some of the protein clusters in the networks are unexplored clusters that contain proteins with completely unknown biological function, whereas others contain one, two or a few functionally known proteins, enabling us to infer the cellular function of uncharacterized homologs or orthologs. With the ultimate goal of elucidating the diverse roles played by PilZ proteins in bacterial signal transduction, the work described here will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help to prioritize functionally unknown PilZ proteins for future studies. Importance Although the PilZ domain is best known as the protein domain evolved specifically for the binding of the second messenger c-di-GMP, divergent evolution has generated a superfamily of PilZ proteins with a diversity of ligand or protein-binding properties and cellular functions. We analysed the sequences of more than 28,000 PilZ proteins using the sequence similarity networking (SSN) tool to yield a global view of the sequence-function relationship and function diversification in PilZ proteins. The results will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help us prioritize PilZ proteins for future studies.
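The general idea behind a sequence similarity network, connecting sequences whose pairwise similarity passes a threshold and reading clusters off as connected components, can be sketched in Python as follows. This is a toy illustration only: the position-wise `identity` score, the `threshold` value, and the function names are assumptions, and the tool used in the study relies on its own alignment-based scoring rather than this simplification.

```python
from itertools import combinations


def identity(a, b):
    """Toy similarity: fraction of identical positions over the shorter
    sequence, without alignment (an assumed stand-in for a real score)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def similarity_clusters(seqs, threshold=0.6):
    """Link every pair scoring at or above `threshold` and return the
    connected components as sorted lists of sequence names."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy sequences; the names and residues are made up.
toy = {"p1": "MKTAYIAK", "p2": "MKTAYIAR", "p3": "GGSLLPWQ"}
print(similarity_clusters(toy))
```

Raising the threshold splits the network into more, smaller clusters, and lowering it merges them, which is why the choice of cutoff matters when clusters are interpreted as putatively isofunctional groups.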


eLife ◽  
2015 ◽  
Vol 4 ◽  
Author(s):  
Sergey Ovchinnikov ◽  
Lisa Kinch ◽  
Hahnbeom Park ◽  
Yuxing Liao ◽  
Jimin Pei ◽  
...  

The prediction of the structures of proteins without detectable sequence similarity to any protein of known structure remains an outstanding scientific challenge. Here we report significant progress in this area. We first describe de novo blind structure predictions of unprecedented accuracy we made for two proteins in large families in the recent CASP11 blind test of protein structure prediction methods by incorporating residue–residue co-evolution information in the Rosetta structure prediction program. We then describe the use of this method to generate structure models for 58 of the 121 large protein families in prokaryotes for which three-dimensional structures are not available. These models, which are posted online for public access, provide structural information for the over 400,000 proteins belonging to the 58 families and suggest hypotheses about mechanism for the subset for which the function is known, and hypotheses about function for the remainder.


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2111
Author(s):  
Bo-Wei Zhao ◽  
Zhu-Hong You ◽  
Lun Hu ◽  
Zhen-Hao Guo ◽  
Lei Wang ◽  
...  

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates almost instantly. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture the local and global structural information of the graph. Specifically, the first-order neighbor information of nodes is aggregated by a graph convolutional network (GCN), while the high-order neighbor information of nodes is learned by the graph embedding method DeepWalk. Finally, the two kinds of features are fed into a random forest classifier to train and predict potential DTIs. The results show that our method obtained an area under the receiver operating characteristic curve (AUROC) of 0.9455 and an area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. Moreover, we compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. The proposed model is also expected to bring new inspiration and provide novel perspectives to relevant researchers.


Genome ◽  
2004 ◽  
Vol 47 (1) ◽  
pp. 141-155 ◽  
Author(s):  
H H Yan ◽  
J Mudge ◽  
D-J Kim ◽  
R C Shoemaker ◽  
D R Cook ◽  
...  

To gain insight into genomic relationships between soybean (Glycine max) and Medicago truncatula, eight groups of bacterial artificial chromosome (BAC) contigs, together spanning 2.60 million base pairs (Mb) in G. max and 1.56 Mb in M. truncatula, were compared through high-resolution physical mapping combined with sequence and hybridization analysis of low-copy BAC ends. Cross-hybridization among G. max and M. truncatula contigs uncovered microsynteny in six of the contig groups and extensive microsynteny in three. Between G. max homoeologous (within genome duplicate) contigs, 85% of coding and 75% of noncoding sequences were conserved at the level of cross-hybridization. By contrast, only 29% of sequences were conserved between G. max and M. truncatula, and some kilobase-scale rearrangements were also observed. Detailed restriction maps were constructed for 11 contigs from the three highly microsyntenic groups, and these maps suggested that sequence order was highly conserved between G. max duplicates and generally conserved between G. max and M. truncatula. One instance of homoeologous BAC contigs in M. truncatula was also observed and examined in detail. A sequence similarity search against the Arabidopsis thaliana genome sequence identified up to three microsyntenic regions in A. thaliana for each of two of the legume BAC contig groups. Together, these results confirm previous predictions of one recent genome-wide duplication in G. max and suggest that M. truncatula also experienced ancient large-scale genome duplications. Key words: Glycine max, Medicago truncatula, Arabidopsis thaliana, conserved microsynteny, genome duplication.


Molecules ◽  
2021 ◽  
Vol 26 (11) ◽  
pp. 3228
Author(s):  
Xiaotong Li ◽  
Minghong Jian ◽  
Yanhong Sun ◽  
Qunyan Zhu ◽  
Zhenxin Wang

To improve their bioapplications, inorganic nanoparticles (NPs) are usually functionalized with specific biomolecules. Peptides with short amino acid sequences have attracted great attention in NP functionalization since they are easy to synthesize on a large scale with an automatic synthesizer and can integrate various functionalities, including specific biorecognition and therapeutic function, into one sequence. Conjugation of peptides with NPs can generate novel theranostic/drug delivery nanosystems with active tumor targeting ability and efficient nanosensing platforms for sensitive detection of various analytes, such as heavy metallic ions and biomarkers. Numerous studies demonstrate that applications of peptide–NP bioconjugates can help to achieve the precise diagnosis and therapy of diseases. In particular, peptide–NP bioconjugates show tremendous potential for the development of effective anti-tumor nanomedicines. This review provides an overview of how the properties of peptide-functionalized NPs affect the precise diagnosis and therapy of cancers by summarizing recent publications on the applications of peptide–NP bioconjugates for the detection of biomarkers (antigens and enzymes) and carcinogens (e.g., heavy metallic ions), drug delivery, and imaging-guided therapy. The current challenges and future prospects of the subject are also discussed.


2005 ◽  
Vol 391 (2) ◽  
pp. 409-415 ◽  
Author(s):  
Anna Kärkönen ◽  
Alain Murigneux ◽  
Jean-Pierre Martinant ◽  
Elodie Pepey ◽  
Christophe Tatout ◽  
...  

UDPGDH (UDP-D-glucose dehydrogenase) oxidizes UDP-Glc (UDP-D-glucose) to UDP-GlcA (UDP-D-glucuronate), the precursor of UDP-D-xylose and UDP-L-arabinose, major cell wall polysaccharide precursors. Maize (Zea mays L.) has at least two putative UDPGDH genes (A and B), according to sequence similarity to a soya bean UDPGDH gene. The predicted maize amino acid sequences have 95% similarity to that of soya bean. Maize mutants with a Mu-element insertion in UDPGDH-A or UDPGDH-B were isolated (udpgdh-A1 and udpgdh-B1 respectively) and studied for changes in wall polysaccharide biosynthesis. The udpgdh-A1 and udpgdh-B1 homozygotes showed no visible phenotype but exhibited 90 and 60–70% less UDPGDH activity respectively than wild-types in a radiochemical assay with 30 μM UDP-glucose. Ethanol dehydrogenase (ADH) activity varied independently of UDPGDH activity, supporting the hypothesis that ADH and UDPGDH activities are due to different enzymes in maize. When extracts from wild-types and udpgdh-A1 homozygotes were assayed with increasing concentrations of UDP-Glc, at least two isoforms of UDPGDH were detected, having Km values of approx. 380 and 950 μM for UDP-Glc. Leaf and stem non-cellulosic polysaccharides had lower Ara/Gal and Xyl/Gal ratios in udpgdh-A1 homozygotes than in wild-types, whereas udpgdh-B1 homozygotes exhibited more variability among individual plants, suggesting that UDPGDH-A activity has a more important role than UDPGDH-B in UDP-GlcA synthesis. The fact that mutation of a UDPGDH gene interferes with polysaccharide synthesis suggests a greater importance for the sugar nucleotide oxidation pathway than for the myo-inositol pathway in UDP-GlcA biosynthesis during post-germinative growth of maize.


2021 ◽  
Vol 15 (3) ◽  
pp. 1-31
Author(s):  
Haida Zhang ◽  
Zengfeng Huang ◽  
Xuemin Lin ◽  
Zhe Lin ◽  
Wenjie Zhang ◽  
...  

Driven by many real applications, we study the problem of seeded graph matching. Given two graphs G1 = (V1, E1) and G2 = (V2, E2), and a small set S of pre-matched node pairs [u, v] where u ∈ V1 and v ∈ V2, the problem is to identify a matching between V1 and V2 growing from S, such that each pair in the matching corresponds to the same underlying entity. Recent studies on efficient and effective seeded graph matching have drawn a great deal of attention, and many popular methods are largely based on exploring the similarity between local structures to identify matching pairs. While these recent techniques work provably well on random graphs, their accuracy is low over many real networks. In this work, we propose to utilize higher-order neighboring information to improve the matching accuracy and efficiency. As a result, a new framework of seeded graph matching is proposed, which employs Personalized PageRank (PPR) to quantify the matching score of each node pair. To further boost the matching accuracy, we propose a novel postponing strategy, which postpones the selection of pairs that have competitors with similar matching scores. We show that the postponing strategy indeed significantly improves the matching accuracy. To improve the scalability of matching large graphs, we also propose efficient approximation techniques based on algorithms for computing PPR heavy hitters. Our comprehensive experimental studies on large-scale real datasets demonstrate that, compared with state-of-the-art approaches, our framework not only increases both precision and recall by a significant margin but also achieves a speed-up of more than one order of magnitude.
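Personalized PageRank itself can be sketched with a short power iteration in Python. The sketch below only computes PPR scores from a single seed node on one graph; it does not implement the matching framework, the postponing strategy, or the heavy-hitter approximations described in the abstract. The function name, the restart probability `alpha`, the iteration count, and the handling of nodes without neighbours are assumptions.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Power iteration for Personalized PageRank from a single seed.

    adj   : dict mapping each node to a list of its out-neighbours
    seed  : node that receives all restart probability
    alpha : restart probability (assumed value)
    """
    scores = {v: 0.0 for v in adj}
    scores[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha  # total mass is 1, so the restart share is alpha
        for v, mass in scores.items():
            neighbours = adj[v]
            if neighbours:
                share = (1.0 - alpha) * mass / len(neighbours)
                for w in neighbours:
                    nxt[w] += share
            else:
                # Assumed convention: a dead-end node returns its mass to the seed.
                nxt[seed] += (1.0 - alpha) * mass
        scores = nxt
    return scores


# Undirected path a - b - c, written as a symmetric adjacency dict.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
ppr = personalized_pagerank(path, seed="a")
print(ppr)
```

On this path graph the score of the far end `c` stays below that of the seed `a`, which illustrates how PPR encodes proximity to the seed beyond immediate neighbours, the kind of higher-order information the abstract refers to.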

