Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures

Algorithms ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 28
Author(s):  
Damianos P. Melidis ◽  
Wolfgang Nejdl

Protein sequence embeddings have been shown to improve the prediction of biological properties of unseen proteins. However, biological metadata do not exist for individual amino acids, so the quality of each learned embedding vector cannot be measured separately. Consequently, current sequence embeddings cannot be intrinsically evaluated in a quantitative manner on how much biological information they capture. We address this drawback with our approach, dom2vec, which learns vector representations for protein domains rather than for individual amino acids, because biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain: its structure, enzymatic function, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
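As a rough illustration of what a quantitative intrinsic evaluation of domain vectors could look like, the sketch below checks how often a domain's nearest neighbor (by cosine similarity) carries the same metadata label. This is not the evaluation protocol of the paper, which the abstract does not specify; the domain names, vectors, labels, and the function `nearest_neighbor_agreement` are all hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor_agreement(
    vectors: Dict[str, List[float]], labels: Dict[str, str]
) -> float:
    """Fraction of domains whose nearest neighbor shares their label."""
    hits = 0
    for name, vec in vectors.items():
        # Nearest neighbor among all other domains (hypothetical metric choice).
        best = max(
            (other for other in vectors if other != name),
            key=lambda other: cosine(vec, vectors[other]),
        )
        hits += labels[best] == labels[name]
    return hits / len(vectors)


# Toy, made-up embeddings and structural-class labels (assumptions).
vectors = {
    "DOM_A": [1.0, 0.1],
    "DOM_B": [0.9, 0.2],
    "DOM_C": [0.1, 1.0],
    "DOM_D": [0.2, 0.9],
}
labels = {"DOM_A": "alpha", "DOM_B": "alpha", "DOM_C": "beta", "DOM_D": "beta"}

score = nearest_neighbor_agreement(vectors, labels)
print(score)
```

A check like this is possible only because each domain has its own label, which is the point the abstract makes about amino acids lacking such metadata.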

2020 ◽  
Author(s):  
Damianos P. Melidis ◽  
Brandon Malone ◽  
Wolfgang Nejdl

Abstract Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches aim to map words to a low-dimensional vector space in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where words are amino acids and sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, biological metadata do not exist for individual amino acids, so the quality of each learned embedding vector cannot be measured separately. Results: Here, we present dom2vec, an approach for learning protein domain embeddings using word2vec on InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains, related to each domain separately. Therefore, we present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of domains. These are the structure, enzymatic function, and molecular function of a given domain. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
Conclusions: We report that the application of word2vec on InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated against its available structure and function metadata. Second, the produced embeddings can outperform sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings are able to capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks.
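To make the "domains as words, architectures as sentences" idea concrete, here is a minimal sketch of how skip-gram (target, context) training pairs could be generated from domain architectures before a word2vec-style model is fitted. It covers only pair generation, not the embedding training itself; the domain identifiers, the window size, and the helper `skipgram_pairs` are assumptions and are not taken from the paper.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(
    architecture: List[str], window: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) domain pairs within `window` positions."""
    for i, target in enumerate(architecture):
        lo = max(0, i - window)
        hi = min(len(architecture), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, architecture[j]


# Hypothetical domain architectures: each protein is an ordered list of
# domain identifiers (placeholders, not real InterPro accessions).
architectures = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_B", "DOM_D"],
]

pairs = [
    pair
    for arch in architectures
    for pair in skipgram_pairs(arch, window=1)
]
for target, context in pairs:
    print(target, context)
```

In practice these pairs would be fed to a word2vec implementation; the window size controls how much of the surrounding architecture counts as context for a domain.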


1997 ◽  
Vol 75 (6) ◽  
pp. 687-696 ◽  
Author(s):  
Tamo Fukamizo ◽  
Ryszard Brzezinski

Novel information on the structure and function of chitosanase, which hydrolyzes the beta-1,4-glycosidic linkage of chitosan, has accumulated in recent years. The cloning of the chitosanase gene from Streptomyces sp. strain N174 and the establishment of an efficient expression system using Streptomyces lividans TK24 have contributed to these advances. Amino acid sequence comparisons of the chitosanases that have been sequenced to date revealed a significant homology in the N-terminal module. From energy minimization based on the X-ray crystal structure of Streptomyces sp. strain N174 chitosanase, the substrate binding cleft of this enzyme was estimated to be composed of six monosaccharide binding subsites. The hydrolytic reaction takes place at the center of the binding cleft with an inverting mechanism. Site-directed mutagenesis of the carboxylic amino acid residues that are conserved revealed that Glu-22 and Asp-40 are the catalytic residues. The tryptophan residues in the chitosanase do not participate directly in the substrate binding but stabilize the protein structure by interacting with hydrophobic and carboxylic side chains of the other amino acid residues. Structural and functional similarities were found between chitosanase, barley chitinase, bacteriophage T4 lysozyme, and goose egg white lysozyme, even though these proteins share no sequence similarities. This information can be helpful for the design of new chitinolytic enzymes that can be applied to carbohydrate engineering, biological control of phytopathogens, and other fields including chitinous polysaccharide degradation. Key words: chitosanase, amino acid sequence, overexpression system, reaction mechanism, site-directed mutagenesis.


2010 ◽  
Vol 2010 ◽  
pp. 1-10 ◽  
Author(s):  
J. Santiago Mejia ◽  
Erik N. Arthun ◽  
Richard G. Titus

One approach to identify epitopes that could be used in the design of vaccines to control several arthropod-borne diseases simultaneously is to look for common structural features in the secretome of the pathogens that cause them. Using a novel bioinformatics technique, cysteine-abundance and distribution analysis, we found that many different proteins secreted by several arthropod-borne pathogens, including Plasmodium falciparum, Borrelia burgdorferi, and eight species of Proteobacteria, are devoid of cysteine residues. The identification of three cysteine-abundance and distribution patterns in several families of proteins secreted by pathogenic and nonpathogenic Proteobacteria, and not found when the amino acid analyzed was tryptophan, provides evidence of forces restricting the content of cysteine residues in microbial proteins during evolution. We discuss these findings in the context of protein structure and function, antigenicity and immunogenicity, and host-parasite relationships.


1996 ◽  
Vol 135 (3) ◽  
pp. 673-687 ◽  
Author(s):  
A J Kreuz ◽  
A Simcox ◽  
D Maughan

Drosophila indirect flight muscle (IFM) contains two different types of tropomyosin: a standard 284-amino acid muscle tropomyosin, Ifm-TmI, encoded by the TmI gene, and two tropomyosins of more than 400 amino acids, TnH-33 and TnH-34, encoded by TmII. The two IFM-specific TnH isoforms are unique tropomyosins with a COOH-terminal extension of approximately 200 residues which is hydrophobic and rich in prolines. Previous analysis of a hypomorphic TmI mutant, Ifm(3)3, demonstrated that Ifm-TmI is necessary for proper myofibrillar assembly, but no null TmI mutant or TmII mutant which affects the TnH isoforms has been reported. In the current report, we show that four flightless mutants (Warmke et al., 1989) are alleles of TmI, and characterize a deficiency which deletes both TmI and TmII. We find that haploidy of TmI causes myofibrillar disruptions and flightless behavior, but that haploidy of TmII causes neither. Single fiber mechanics demonstrates that power output is much lower in the TmI haploid line (32% of wild-type) than in the TmII haploid line (73% of wild-type). In myofibers nearly depleted of Ifm-TmI, net power output is virtually abolished (< 1% of wild-type) despite the presence of an organized fibrillar core (approximately 20% of wild-type). The results suggest Ifm-TmI (the standard tropomyosin) plays a key role in fiber structure, power production, and flight, with reduced Ifm-TmI expression producing corresponding changes of IFM structure and function. In contrast, reduced expression of the TnH isoforms has an unexpectedly mild effect on IFM structure and function.


2008 ◽  
Vol 52 (4) ◽  
pp. 216-223 ◽  
Author(s):  
Takuya Yano ◽  
Eri Nobusawa ◽  
Alexander Nagy ◽  
Setsuko Nakajima ◽  
Katsuhisa Nakajima

1971 ◽  
Vol 123 (1) ◽  
pp. 57-67 ◽  
Author(s):  
P. R. Carnegie

Myelin from the central nervous system contains an unusual basic protein, which can induce experimental autoimmune encephalomyelitis. The basic protein from human brain was digested with trypsin and other enzymes and the sequence of the 170 amino acids was determined. The localization of the encephalitogenic determinants was described. Possible roles for the protein in the structure and function of myelin are discussed.


2003 ◽  
Vol 77 (22) ◽  
pp. 12310-12318 ◽  
Author(s):  
Kevin J. Kunstman ◽  
Bridget Puffer ◽  
Bette T. Korber ◽  
Carla Kuiken ◽  
Una R. Smith ◽  
...  

ABSTRACT A chemokine receptor from the seven-transmembrane-domain G-protein-coupled receptor superfamily is an essential coreceptor for the cellular entry of human immunodeficiency virus type 1 (HIV-1) and simian immunodeficiency virus (SIV) strains. To investigate nonhuman primate CC-chemokine receptor 5 (CCR5) homologue structure and function, we amplified CCR5 DNA sequences from peripheral blood cells obtained from 24 representative species and subspecies of the primate suborders Prosimii (family Lemuridae) and Anthropoidea (families Cebidae, Callitrichidae, Cercopithecidae, Hylobatidae, and Pongidae) by PCR with primers flanking the coding region of the gene. Full-length CCR5 was inserted into pCDNA3.1, and multiple clones were sequenced to permit discrimination of both alleles. Compared to the human CCR5 sequence, the CCR5 sequences of the Lemuridae, Cebidae, and Cercopithecidae shared 87, 91 to 92, and 96 to 99% amino acid sequence homology, respectively. Amino acid substitutions tended to cluster in the amino and carboxy termini, the first transmembrane domain, and the second extracellular loop, with a pattern of species-specific changes that characterized CCR5 homologues from primates within a given family. At variance with humans, all primate species examined from the suborder Anthropoidea had amino acid substitutions at positions 13 (N to D) and 129 (V to I); the former change is critical for CD4-independent binding of SIV to CCR5. Within the Cebidae, Cercopithecidae, and Pongidae (including humans), CCR5 nucleotide similarities were 95.2 to 97.4, 98.0 to 99.5, and 98.3 to 99.3%, respectively. Despite this low genetic diversity, the phylogeny of the selected primate CCR5 homologue sequences agrees with present primate systematics, apart from some intermingling of species of the Cebidae and Cercopithecidae. 
Constructed HOS.CD4 cell lines expressing the entire CCR5 homologue protein from each of the Anthropoidea species and subspecies were tested for their ability to support HIV-1 and SIV entry and membrane fusion. Other than that of Cercopithecus pygerythrus, all CCR5 homologues tested were able to support both SIV and HIV-1 entry. Our results suggest that the shared structure and function of primate CCR5 homologue proteins would not impede the movement of primate immunodeficiency viruses between species.

