Prediction of Protein Subcellular Localization Based on Primary Sequence Data

Author(s):  
Mert Özarar ◽  
Volkan Atalay ◽  
Rengül Çetin Atalay
2020 ◽  
Vol 21 (7) ◽  
pp. 546-557
Author(s):  
Rahul Semwal ◽  
Pritish Kumar Varadwaj

Aims: To develop a tool that can annotate subcellular localization of human proteins. Background: With the progression of high-throughput human proteomics projects, an enormous amount of protein sequence data has been discovered in the recent past. All these raw sequence data require precise mapping and annotation for their respective biological roles and functional attributes. The functional characteristics of protein molecules are highly dependent on their subcellular localization/compartment. Therefore, a fully automated and reliable protein subcellular localization prediction system would be very useful for current proteomic research. Objective: To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision. Methods: In this study, we used the PSI-CD-HIT homology criterion and utilized the sequence-based features of protein sequences to develop a powerful subcellular localization predictive model. The dataset used to train the HumDLoc model was extracted from a reliable data source, the UniProt Knowledgebase, which helps the model generalize to unseen data. Results: The proposed model, HumDLoc, was compared with two of the most widely used techniques, CELLO and DeepLoc, and with other machine learning-based tools. The results demonstrated promising predictive performance of the HumDLoc model on various evaluation metrics: accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC score (≥0.86), area under the ROC curve (0.98), and area under the precision-recall curve (0.93). Conclusion: HumDLoc was able to outperform several alternative tools in correctly predicting the subcellular localization of human proteins. HumDLoc is hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.
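The accuracy, precision, recall, and MCC figures quoted in the abstract above follow standard definitions. As a minimal sketch (not the HumDLoc evaluation code), the function below computes them for one class from confusion-matrix counts; the function name `binary_metrics` and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and MCC from confusion-matrix counts.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives for one class (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    print(binary_metrics(tp=45, fp=5, tn=45, fn=5))
```

For a multi-class problem such as subcellular localization, one common approach is to compute these per compartment in a one-vs-rest fashion; whether the authors did so is not stated in the abstract.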


2013 ◽  
Vol 88 (2) ◽  
pp. 219-229
Author(s):  
A. Chaudhary ◽  
N. Singh ◽  
H.S. Singh

Abstract: Nematodes of the family Thelastomatidae are parasitic in the alimentary tract of many arthropods, including Periplaneta americana L. In Meerut, Uttar Pradesh, India, two nematode species belonging to this family, Hammerschmidtiella indicus and Thelastoma icemi, have been reported. In the present study, the molecular phylogeny of these two nematode species was derived using small subunit (18S) sequence and secondary-structure analyses. The small subunit sequence analyses were carried out to explore the validation and systematics of these species. Phylogenetic analyses of the primary sequence data were performed using neighbour-joining and maximum-parsimony approaches. In contrast, the inferred secondary structures for the two species, using free-energy modelling, showed structural identities. In addition, motif sequences were found to be a promising tool for nematode species identification. The study provides molecular characterization based on primary sequence data of the 18S ribosomal DNA region of the nematodes, along with secondary-structure data and motif sequences for inferences at higher taxonomic levels.


1991 ◽  
Vol 275 (2) ◽  
pp. 529-534 ◽  
Author(s):  
I B Wilson ◽  
Y Gavel ◽  
G von Heijne

To study the sequence requirements for addition of O-linked N-acetylgalactosamine to proteins, amino acid distributions around 174 O-glycosylation sites were compared with distributions around non-glycosylated sites. In comparison with non-glycosylated serine and threonine residues, the most prominent feature in the vicinity of O-glycosylated sites is a significantly increased frequency of proline residues, especially at positions -1 and +3 relative to the glycosylated residues. Alanine, serine and threonine are also significantly increased. The high serine and threonine content of O-glycosylated regions is due to the presence of clusters of several closely spaced glycosylated hydroxy amino acids in many O-glycosylated proteins. Such clusters can be predicted from the primary sequence in some cases, but there is no apparent possibility of predicting isolated O-glycosylation sites from primary sequence data.
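The comparison described above, residue frequencies at fixed offsets from a modified serine or threonine, can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the function name `residue_frequency_at_offset`, the 0-based site indexing, and the toy sequences are all hypothetical.

```python
def residue_frequency_at_offset(sequences_and_sites, residue, offset):
    """Fraction of sites that have `residue` at position site + offset.

    sequences_and_sites: iterable of (sequence, [site_index, ...]) pairs,
    where each site_index is a 0-based position of a Ser/Thr of interest.
    Sites whose offset position falls outside the sequence are skipped.
    """
    hits = 0
    total = 0
    for seq, sites in sequences_and_sites:
        for site in sites:
            pos = site + offset
            if 0 <= pos < len(seq):
                total += 1
                if seq[pos] == residue:
                    hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_sites = [("APTAAP", [2]), ("GPSGGP", [2]), ("AATAAA", [2])]
    for off in (-1, 3):
        freq = residue_frequency_at_offset(toy_sites, "P", off)
        print(f"P frequency at offset {off:+d}: {freq:.2f}")
```

Running the same function on a set of non-glycosylated serine and threonine positions would give the background frequency against which an enrichment, such as the proline excess at -1 and +3, could be judged.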


2002 ◽  
Vol 79 (1) ◽  
pp. 1-9 ◽  
Author(s):  
GUGS LUSHAI ◽  
HUGH D. LOXDALE

Empirical evidence for intraclonal genetic variation is described here for clonal systems using a variety of molecular techniques and implicating a diversity of mechanisms. However, clonal systems are still generally perceived as having strict genetic fidelity. As concepts of genetic variability move from primary sequence data to include epigenetic and structural influences on genetic expression, the ability to detect changes in the genome at short intervals allows precedence to be given to inherent biological variation that is often analytically ignored. Therefore, the advent of powerful molecular techniques, like genome mapping, means that our concepts of genetic fidelity within eukaryotic clones and the whole philosophy of the ‘clone’ need to be re-evaluated and re-defined to replace old unproven dogma in this aspect of science.


Molecules ◽  
2019 ◽  
Vol 24 (5) ◽  
pp. 919 ◽  
Author(s):  
Bo Li ◽  
Lijun Cai ◽  
Bo Liao ◽  
Xiangzheng Fu ◽  
Pingping Bing ◽  
...  

The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation and protein-protein interactions. With the advances of high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely used protein sequence representation techniques, such as amino acid composition and Chou’s pseudo amino acid composition (PseAAC), have difficulty extracting adequate information about the interactions between residues and the position distribution of each residue. Therefore, it is still urgent to develop novel sequence representations. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distributions of the residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is simply represented by a 5-dimensional feature vector, while other popular methods like PseAAC and dipeptide composition adopt features of hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without using machine learning-based classifiers, a simple model based on the feature vector can achieve prediction accuracies of 0.8825 and 0.7736 for the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method to combine the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and use a support vector machine as the classifier to predict protein subcellular localization. This novel model achieves prediction accuracies of 0.927 and 0.871 for the CL317 and ZW225 datasets, respectively, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
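To make the dimensionality contrast concrete, here is a minimal sketch of the two conventional representations the abstract mentions, amino acid composition (20 values) and dipeptide composition (400 values). It does not implement GCGR or NSI, whose definitions are not given in the abstract; the function names and the example sequence are assumptions for illustration.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """20-dimensional vector of residue frequencies (non-standard letters ignored)."""
    seq = seq.upper()
    n = sum(1 for aa in seq if aa in AMINO_ACIDS)
    return [seq.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector of adjacent-pair frequencies."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
            n += 1
    return [counts[dp] / n if n else 0.0 for dp in DIPEPTIDES]


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence fragment
    print(len(amino_acid_composition(example)), len(dipeptide_composition(example)))
```

Vectors like these could be concatenated with other features and passed to a classifier such as a support vector machine; the jackknife test mentioned in the abstract corresponds to leave-one-out evaluation, where each protein is predicted by a model trained on all the others.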

