pfam family
Recently Published Documents



2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Elena Tea Russo ◽  
Alessandro Laio ◽  
Marco Punta

Abstract
Background: The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the most well known protein family database, built in many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated from the hand-curation itself, which is often guided by the available experimental evidence.
Results: We introduce a procedure that aims to identify automatically putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically-generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
Conclusions: The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
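
For readers unfamiliar with Density Peak Clustering, the following is a minimal sketch of the generic algorithm on a precomputed distance matrix. It is not the authors' procedure: the abstract does not say how alignments are turned into distances or how cluster centers are chosen, so the function name `density_peak_clusters`, the cutoff `d_c`, the fixed number of centers `n_centers`, and the toy one-dimensional data are all assumptions made for illustration.

```python
def density_peak_clusters(dist, d_c, n_centers):
    """Generic density peak clustering on a symmetric distance matrix.

    dist      : list of lists, dist[i][j] = distance between items i and j
    d_c       : cutoff distance used to count neighbours (assumed parameter)
    n_centers : number of cluster centers to keep, must be >= 1 (assumed parameter)
    Returns one cluster label per item.
    """
    n = len(dist)
    # Local density: number of other items closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Items sorted by decreasing density; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n           # distance to the nearest denser item
    nearest_higher = [-1] * n   # index of that denser item
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest item has no denser neighbour: use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j

    # Centers combine high density with a large distance to any denser item.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Every other item inherits the label of its nearest denser item.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels


# Toy data (hypothetical): two well-separated groups on a line.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]
labels = density_peak_clusters(dist, d_c=1.5, n_centers=2)
```

On this toy input the first three points end up in one cluster and the last three in the other. In a sequence setting the distances would have to be derived from alignment scores, a step this sketch leaves out.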



2020 ◽  
Vol 49 (D1) ◽  
pp. D412-D419 ◽  
Author(s):  
Jaina Mistry ◽  
Sara Chuguransky ◽  
Lowri Williams ◽  
Matloob Qureshi ◽  
Gustavo A Salazar ◽  
...  

Abstract The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.



2019 ◽  
Vol 24 ◽  
pp. e925
Author(s):  
Yoan Bouzin ◽  
Benjamin Thomas Viart ◽  
María Moriel-Carretero ◽  
Sofia Kossida

Python Function uncover (PyFuncover) is a new bioinformatic tool able to search for proteins with a specific function in a full proteome. The pipeline, coded in Python, uses BLAST alignment and the sequences from a PFAM family as the search seed. We tested PyFuncover using the fatty acid-binding family (FABP) Lipocalin_7 from PFAM (version 32, 2019) against the Homo sapiens NCBI proteome. After applying the scoring function to all the BLAST results, the data were classified and submitted to a GO-TERM analysis using bioDBnet. Analyses showed that all families of FABPs were ranked within the top scores. Included within this category were also families able to bind to hydrophobic molecules similar to fatty acids such as the retinol acid transporter and the cellular retinoic acid-binding protein.



2018 ◽  
Author(s):  
R. Tetley ◽  
P. Guardado-Calvo ◽  
J. Fedry ◽  
F. Rey ◽  
F. Cazals

Abstract We present a sequence-structure based method characterizing a set of functionally related proteins exhibiting low sequence identity and loose structural conservation. Given a (small) set of structures, our method consists of three main steps. First, pairwise structural alignments are combined with multi-scale geometric analysis to produce structural motifs, i.e. regions structurally more conserved than the whole structures. Second, the sub-sequences of the motifs are used to build profile hidden Markov models (HMM) biased towards the structurally conserved regions. Third, these HMM are used to retrieve from UniProtKB proteins harboring signatures compatible with the function studied, in a bootstrap fashion. We apply these hybrid HMM to investigate two questions related to class II fusion proteins, an especially challenging class since known structures exhibit low sequence identity (less than 15%) and loose structural similarity (of the order of 15 Å in lRMSD). In a first step, we compare the performances of our hybrid HMM against those of sequence based HMM. Using various learning sets, we show that both classes of HMM retrieve unique species. The number of unique species reported by both classes of methods are comparable, stressing the novelty brought by our hybrid models. In a second step, we use our models to identify 17 plausible HAP2-GCS1 candidate sequences in 10 different Drosophila melanogaster species. These models are not identified by the Pfam family HAP2-GCS1 (PF10699), stressing the ability of our structural motifs to capture signals more subtle than whole Pfam domains. In a more general setting, our method should be of interest for all cases of functional families with low sequence identity and loose structural conservation. Our software tools are available from the FunChaT package of the Structural Bioinformatics Library (http://sbl.inria.fr).



2017 ◽  
Vol 114 (13) ◽  
pp. E2662-E2671 ◽  
Author(s):  
Guido Uguzzoni ◽  
Shalini John Lovis ◽  
Francesco Oteri ◽  
Alexander Schug ◽  
Hendrik Szurmant ◽  
...  

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.



2016 ◽  
Vol 80 (2) ◽  
pp. 429-450 ◽  
Author(s):  
John E. Cronan

SUMMARY Although the structure of lipoic acid and its role in bacterial metabolism were clear over 50 years ago, it is only in the past decade that the pathways of biosynthesis of this universally conserved cofactor have become understood. Unlike most cofactors, lipoic acid must be covalently bound to its cognate enzyme proteins (the 2-oxoacid dehydrogenases and the glycine cleavage system) in order to function in central metabolism. Indeed, the cofactor is assembled on its cognate proteins rather than being assembled and subsequently attached as in the typical pathway, like that of biotin attachment. The first lipoate biosynthetic pathway determined was that of Escherichia coli, which utilizes two enzymes to form the active lipoylated protein from a fatty acid biosynthetic intermediate. Recently, a more complex pathway requiring four proteins was discovered in Bacillus subtilis, which is probably an evolutionary relic. This pathway requires the H protein of the glycine cleavage system of single-carbon metabolism to form active (lipoyl) 2-oxoacid dehydrogenases. The bacterial pathways inform the lipoate pathways of eukaryotic organisms. Plants use the E. coli pathway, whereas mammals and fungi probably use the B. subtilis pathway. The lipoate metabolism enzymes (except those of sulfur insertion) are members of PFAM family PF03099 (the cofactor transferase family). Although these enzymes share some sequence similarity, they catalyze three markedly distinct enzyme reactions, making the usual assignment of function based on alignments prone to frequent mistaken annotations. This state of affairs has possibly clouded the interpretation of one of the disorders of human lipoate metabolism.



2014 ◽  
Vol 23 (10) ◽  
pp. 1380-1391 ◽  
Author(s):  
Abhinav Kumar ◽  
Marco Punta ◽  
Herbert L. Axelrod ◽  
Debanu Das ◽  
Carol L. Farr ◽  
...  


PLoS ONE ◽  
2014 ◽  
Vol 9 (4) ◽  
pp. e94981 ◽  
Author(s):  
Patricia Lassaux ◽  
Oscar Conchillo-Solé ◽  
Babu A. Manjasetty ◽  
Daniel Yero ◽  
Lucia Perletti ◽  
...  


2013 ◽  
Vol 69 (11) ◽  
pp. 2186-2193 ◽  
Author(s):  
Jaina Mistry ◽  
Edda Kloppmann ◽  
Burkhard Rost ◽  
Marco Punta

High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.




