Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Elena Tea Russo ◽  
Alessandro Laio ◽  
Marco Punta

Abstract

Background
The identification of protein families is of great practical importance for in silico protein annotation and underlies several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is time-consuming and may suffer from a bias introduced by the hand-curation itself, which is often guided by the available experimental evidence.

Results
We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment presented here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single-domain or multi-domain Pfam family architectures, we also observed some inconsistencies. We inspected these using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis, we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach is unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.

Conclusions
The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm has potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
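
For readers unfamiliar with Density Peak Clustering, the general idea can be sketched in a few lines of Python. This is a minimal illustration of the generic algorithm, not the authors' implementation: the abstract does not say how distances are derived from the pairwise alignments, so the precomputed distance matrix `dist`, the cutoff `d_c`, the fixed `n_clusters`, and the function name are all assumptions made for this example. The sketch also always treats the densest item as a center, a simplification chosen here for robustness.

```python
from typing import List


def density_peak_clustering(
    dist: List[List[float]], d_c: float, n_clusters: int
) -> List[int]:
    """Label items 0..n_clusters-1 given a symmetric distance matrix.

    Hypothetical helper for illustration; assumes n_clusters >= 1.
    """
    n = len(dist)

    # Local density: number of other items closer than the cutoff d_c.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

    # Visit items from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta[i]: distance to the nearest item earlier in the density order.
    # parent[i]: that item. The densest item gets its largest distance.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j

    # Centers: the densest item, plus the items with the largest
    # rho * delta among the rest (assumption: n_clusters is given).
    rest = sorted(order[1:], key=lambda i: -(rho[i] * delta[i]))
    centers = [order[0]] + rest[: n_clusters - 1]

    labels = [-1] * n
    for label, center in enumerate(centers):
        labels[center] = label

    # Each remaining item inherits the label of its denser nearest
    # neighbour, which is already labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Toy usage: six points on a line forming two well-separated groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_labels = density_peak_clustering(toy_dist, d_c=0.15, n_clusters=2)
```

In practice the number of clusters is usually not fixed in advance; centers are instead chosen by inspecting which items combine high density with a large distance from any denser item.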

2020 ◽  
Vol 49 (D1) ◽  
pp. D412-D419 ◽  
Author(s):  
Jaina Mistry ◽  
Sara Chuguransky ◽  
Lowri Williams ◽  
Matloob Qureshi ◽  
Gustavo A Salazar ◽  
...  

Abstract The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.


2020 ◽  
Author(s):  
Elena Tea Russo ◽  
Alessandro Laio ◽  
Marco Punta

As the UniProt database approaches the 200 million entry mark, the vast majority of the proteins it contains lack any experimental validation of their functions. In this context, the identification of homologous relationships between proteins remains the single most widely applicable tool for generating functional and structural hypotheses in silico. Although many databases exist that classify proteins and protein domains into homologous families, large sections of the sequence space remain unassigned. We introduce DPCfam, a new unsupervised procedure that uses sequence alignments and Density Peak Clustering to automatically classify homologous protein regions. Here, we present a proof-of-principle experiment based on the analysis of two clans from the Pfam protein family database. Our tests indicate that DPCfam's automatically generated clusters are generally evolutionarily accurate, corresponding to one or more Pfam families, and that they cover a significant fraction of known homologs. Overall, DPCfam shows potential both for assisting manual annotation efforts (domain discovery, detection of classification inconsistencies, improvement of family coverage and boosting of clan membership) and as a stand-alone tool for unsupervised classification of sparsely annotated protein datasets such as those from environmental metagenomics studies (domain discovery, analysis of domain diversity). The algorithm implementation used in this paper is available at https://gitlab.com/ETRu/dpcfam (it requires Python 3 and a C++ compiler, and runs on Linux systems); data are available at https://zenodo.org/record/3934399


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Shudong Wang ◽  
Yigang He ◽  
Baiqiang Yin ◽  
Wenbo Zeng ◽  
Ying Deng ◽  
...  

2018 ◽  
Vol 83 ◽  
pp. 33-39 ◽  
Author(s):  
Feng Wang ◽  
Jing-yi Zhou ◽  
Yu Tian ◽  
Yu Wang ◽  
Ping Zhang ◽  
...  

2019 ◽  
Vol 1229 ◽  
pp. 012024 ◽  
Author(s):  
Fan Hong ◽  
Yang Jing ◽  
Hou Cun-cun ◽  
Zhang Ke-zhen ◽  
Yao Ruo-xia

Author(s):  
Xinzheng Niu ◽  
Yunhong Zheng ◽  
Philippe Fournier-Viger ◽  
Bing Wang

Author(s):  
Zafaryab Rasool ◽  
Rui Zhou ◽  
Lu Chen ◽  
Chengfei Liu ◽  
Jiajie Xu
