The Prediction of Calpain Cleavage Sites with the mRMR and IFS Approaches

Calpains are an important family of Ca2+-dependent cysteine proteases that catalyze the limited proteolysis of many specific substrates. They play crucial roles in basic physiological and pathological processes, and identifying calpain cleavage sites may facilitate understanding of their molecular mechanisms and biological function. Traditional experimental approaches to identifying these sites are accurate but labor-intensive and time-consuming. Computational methods have therefore received increasing attention in recent years because of their convenience and speed. In this study, we develop a new predictor based on the support vector machine (SVM) combined with the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). To represent calpain cleavage sites, we consider features describing physicochemical/biochemical properties, sequence conservation, residue disorder, secondary structure, and solvent accessibility. Experimental results show that our predictor, with an average prediction accuracy of 79.49%, sensitivity of 62.31%, and specificity of 88.12%, performs better than several other state-of-the-art predictors. Since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors, we have provided a web-server for the method presented in this paper.
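
The incremental feature selection step can be illustrated with a minimal sketch. It assumes a feature ranking is already available (for example from mRMR, which is not implemented here) and swaps in a leave-one-out nearest-centroid classifier for the SVM so that the example needs only the standard library. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def loo_nearest_centroid_accuracy(X: List[Sequence[float]], y: List[int],
                                  cols: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier on the columns
    in `cols`. This is a stand-in for the SVM mentioned in the abstract."""
    correct = 0
    for i in range(len(X)):
        # Per-class sums over the chosen columns, leaving sample i out.
        sums, counts = {}, {}
        for j in range(len(X)):
            if j == i:
                continue
            acc = sums.setdefault(y[j], [0.0] * len(cols))
            for pos, c in enumerate(cols):
                acc[pos] += X[j][c]
            counts[y[j]] = counts.get(y[j], 0) + 1

        def sq_dist(label: int) -> float:
            return sum((X[i][c] - sums[label][pos] / counts[label]) ** 2
                       for pos, c in enumerate(cols))

        predicted = min(sums, key=sq_dist)
        correct += int(predicted == y[i])
    return correct / len(X)


def incremental_feature_selection(X: List[Sequence[float]], y: List[int],
                                  ranked: List[int]) -> Tuple[List[int], float]:
    """Evaluate the top-1, top-2, ... ranked features and keep the smallest
    prefix that reaches the best score."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked) + 1):
        score = loo_nearest_centroid_accuracy(X, y, ranked[:k])
        if score > best_score:  # strict: ties keep the smaller subset
            best_k, best_score = k, score
    return ranked[:best_k], best_score


# Toy data (hypothetical): column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0], [1.0, 2.0], [1.1, 5.0], [0.9, 1.0]]
y = [0, 0, 0, 1, 1, 1]
selected, score = incremental_feature_selection(X, y, ranked=[0, 1])
print(selected, score)
```

Keeping the smallest prefix on ties is one possible design choice; it favors compact feature sets when several subset sizes score equally.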

PVPred-SCM: Improved Prediction and Analysis of Phage Virion Proteins Using a Scoring Card Method

Cells ◽ 10.3390/cells9020353 ◽ 2020 ◽ Vol 9 (2) ◽ pp. 353 ◽ Cited By ~ 12

Author(s): Phasit Charoenkwan ◽ Sakawrat Kanthawong ◽ Nalini Schaduangrat ◽ Janchai Yana ◽ Watshara Shoombuatong

Keyword(s): Molecular Mechanisms ◽ Propensity Scores ◽ State Of The Art ◽ Support Vector ◽ Dipeptide Composition ◽ Biophysical Properties ◽ Validation Test ◽ Novel Method ◽ User Friendly ◽ Scoring Card Method

Although existing methods have been successful in predicting phage (or bacteriophage) virion proteins (PVPs) using various types of protein features and complex classifiers, such as support vector machine and naïve Bayes, these methods do not allow interpretability. However, the characterization and analysis of PVPs might be of great significance to understanding the molecular mechanisms of bacteriophage genetics and the development of antibacterial drugs. Hence, we herein propose a novel method (PVPred-SCM) based on the scoring card method (SCM) in conjunction with dipeptide composition to identify and characterize PVPs. In PVPred-SCM, the propensity scores of 400 dipeptides were calculated using the statistical discrimination approach. A rigorous independent validation test showed that PVPred-SCM utilizing only dipeptide composition yielded an accuracy of 77.56%, indicating that PVPred-SCM performed well relative to the state-of-the-art method utilizing a number of protein features. Furthermore, the propensity scores of dipeptides were used to provide insights into the biochemical and biophysical properties of PVPs. Upon comparison, it was found that PVPred-SCM was superior to the existing methods in terms of simplicity, interpretability, and implementation. Finally, to facilitate high-throughput prediction of PVPs, we provide a user-friendly web-server for estimating the likelihood that query sequences are PVPs. It is anticipated that PVPred-SCM will become a useful tool, or at least a complement to existing methods, for predicting and analyzing PVPs.
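
How a scoring card combines dipeptide composition with propensity scores can be sketched as follows. This is a minimal illustration under stated assumptions: the score is taken to be the composition-weighted sum of dipeptide propensities, and the propensity values, the threshold, and the function names are hypothetical. It does not reproduce how PVPred-SCM derives its scores.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq: str) -> Dict[str, float]:
    """Fraction of each dipeptide among the overlapping dipeptides of `seq`.
    Dipeptides containing non-standard letters are skipped."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        dp = seq[i:i + 2]
        if dp in counts:
            counts[dp] += 1
            total += 1
    return {dp: (c / total if total else 0.0) for dp, c in counts.items()}


def scm_score(seq: str, propensity: Dict[str, float]) -> float:
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = dipeptide_composition(seq)
    return sum(comp[dp] * propensity.get(dp, 0.0) for dp in DIPEPTIDES)


def is_predicted_positive(seq: str, propensity: Dict[str, float],
                          threshold: float) -> bool:
    """Call a sequence positive when its score exceeds the threshold."""
    return scm_score(seq, propensity) > threshold


# Hypothetical propensities for two dipeptides; all others default to 0.
toy_propensity = {"AK": 800.0, "KA": 200.0}
toy_score = scm_score("AKAK", toy_propensity)  # dipeptides: AK, KA, AK
print(round(toy_score, 2))
```

Because each dipeptide carries a single readable score, the contribution of every dipeptide to a prediction can be inspected directly, which is the interpretability argument made in the abstract.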

Computational identification of multiple lysine PTM sites by analyzing the instance hardness and feature importance

Scientific Reports ◽ 10.1038/s41598-021-98458-y ◽ 2021 ◽ Vol 11 (1)

Author(s): Sabit Ahmed ◽ Afrida Rahman ◽ Md. Al Mehedi Hasan ◽ Shamim Ahmad ◽ S. M. Shovan

Keyword(s): Cell Biology ◽ Molecular Mechanisms ◽ Feature Representation ◽ Computational Method ◽ Post Translational Modifications ◽ Redundant Data ◽ Feature Selection Approach ◽ Instance Hardness ◽ User Friendly ◽ Better Than

Identification of post-translational modifications (PTM) is significant in the study of computational proteomics, cell biology, pathogenesis, and drug development due to its role in many bio-molecular mechanisms. Though there are several computational tools to identify individual PTMs, only three predictors have been established to predict multiple PTMs at the same lysine residue. Furthermore, detailed analysis and assessment on dataset balancing and the significance of different feature encoding techniques for a suitable multi-PTM prediction model are still lacking. This study introduces a computational method named 'iMul-kSite' for predicting acetylation, crotonylation, methylation, succinylation, and glutarylation, from an unrecognized peptide sample with one, multiple, or no modifications. After successfully eliminating the redundant data samples from the majority class by analyzing the hardness of the sequence-coupling information, feature representation has been optimized by adopting the combination of ANOVA F-Test and incremental feature selection approach. The proposed predictor predicts multi-label PTM sites with 92.83% accuracy using the top 100 features. It has also achieved a 93.36% aiming rate and 96.23% coverage rate, which are much better than the existing state-of-the-art predictors on the validation test. This performance indicates that 'iMul-kSite' can be used as a supportive tool for further K-PTM study. For the convenience of the experimental scientists, 'iMul-kSite' has been deployed as a user-friendly web-server at http://103.99.176.239/iMul-kSite.
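
The aiming rate, coverage rate, and accuracy quoted above are multi-label metrics. The sketch below assumes the set-based definitions commonly used for multi-label predictors: per-sample overlap divided by the predicted set size (aiming), by the true set size (coverage), and by the size of the union (accuracy), each averaged over samples. The handling of samples with no modification is an assumption of this sketch, and the abstract itself does not spell out any of these formulas.

```python
from typing import List, Set, Tuple


def multilabel_metrics(true_sets: List[Set[str]],
                       pred_sets: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (aiming, coverage, accuracy) averaged over all samples.

    Assumed convention for empty sets: a sample with no true labels and no
    predicted labels counts as a perfect match (1.0) for every metric; an
    empty set on only one side scores 0.0 for the affected ratio."""
    n = len(true_sets)
    aiming = coverage = accuracy = 0.0
    for truth, pred in zip(true_sets, pred_sets):
        inter = len(truth & pred)
        union = len(truth | pred)
        aiming += inter / len(pred) if pred else (0.0 if truth else 1.0)
        coverage += inter / len(truth) if truth else (0.0 if pred else 1.0)
        accuracy += inter / union if union else 1.0
    return aiming / n, coverage / n, accuracy / n


# Hypothetical lysine sites: one fully correct, one partially covered,
# one unmodified site that is correctly left unlabeled.
truth = [{"acetylation"}, {"acetylation", "succinylation"}, set()]
predicted = [{"acetylation"}, {"acetylation"}, set()]
aiming, coverage, accuracy = multilabel_metrics(truth, predicted)
print(aiming, coverage, accuracy)
```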

Prediction of Neddylation Sites Using the Composition of k-spaced Amino Acid Pairs and Fuzzy SVM

Current Bioinformatics ◽ 10.2174/1574893614666191114123453 ◽ 2020 ◽ Vol 15 (7) ◽ pp. 725-731

Author(s): Zhe Ju ◽ Shi-Yun Wang

Keyword(s): Amino Acid ◽ Molecular Mechanisms ◽ Feature Selection Method ◽ Class Imbalance ◽ Support Vector ◽ Post Translational Modification ◽ Fuzzy Support Vector Machine ◽ Accurate Identification ◽ Isopeptide Bonds ◽ User Friendly

Introduction: Neddylation is the process by which the ubiquitin-like protein NEDD8 is attached to substrate lysines via isopeptide bonds. As a highly dynamic and reversible post-translational modification, lysine neddylation has been found to be involved in various biological processes and closely associated with many diseases. Objective: The accurate identification of neddylation sites is necessary to elucidate the underlying molecular mechanisms of neddylation. As traditional experimental methods are often expensive and time-consuming, it is imperative to design computational methods to identify neddylation sites. Methods: In this study, a novel predictor named CKSAAP_NeddSite is developed to detect neddylation sites. An effective feature encoding technique, the composition of k-spaced amino acid pairs, is used to encode neddylation sites. The F-score feature selection method is adopted to remove redundant features. Moreover, a fuzzy support vector machine algorithm is employed to overcome the class imbalance and noise problems. Results: As illustrated by 10-fold cross-validation, CKSAAP_NeddSite achieves an AUC of 0.9848. Independent tests also show that CKSAAP_NeddSite significantly outperforms existing neddylation site predictors. Therefore, CKSAAP_NeddSite can be a useful bioinformatics tool for the prediction of neddylation sites. Feature analysis shows that some residues around neddylation sites may play an important role in the prediction. Conclusion: The results of analysis and prediction could offer useful information for elucidating the molecular mechanisms of neddylation. A user-friendly web-server for CKSAAP_NeddSite is established at 123.206.31.171/CKSAAP_NeddSite.
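
The composition of k-spaced amino acid pairs can be sketched as below. For each gap k, the encoding counts how often residue a is followed by residue b with exactly k residues in between, then normalizes by the number of positions. The window, the maximum gap `k_max`, and the restriction to the 20 standard residues are assumptions of this sketch, not parameters reported in the abstract.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(window: str, k_max: int = 2) -> List[float]:
    """Encode a peptide window as 400 * (k_max + 1) pair frequencies.

    For gap k, the pair (window[i], window[i + k + 1]) is counted at every
    valid position i, and counts are divided by the number of positions."""
    features: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skip pairs with non-standard letters
                counts[pair] += 1
        features.extend(counts[p] / n_positions if n_positions > 0 else 0.0
                        for p in PAIRS)
    return features


# Hypothetical 5-residue window, gaps k = 0 and k = 1.
encoded = cksaap("AKAKA", k_max=1)
print(len(encoded), encoded[PAIRS.index("AK")])
```

The resulting vector grows by 400 entries per gap value, which is why a feature selection step such as the F-score filter mentioned above is typically applied afterwards.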

A Computational Method for the Identification of Endolysins and Autolysins

Protein and Peptide Letters ◽ 10.2174/0929866526666191002104735 ◽ 2020 ◽ Vol 27 (4) ◽ pp. 329-336 ◽ Cited By ~ 1

Author(s): Lei Xu ◽ Guangmin Liang ◽ Baowen Chen ◽ Xu Tan ◽ Huaikun Xiang ◽ ...

Keyword(s): Support Vector Machine ◽ Cell Wall ◽ Experimental Results ◽ Computational Method ◽ Lytic Enzyme ◽ Support Vector ◽ Lytic Enzymes ◽ Data Set ◽ Optimal Feature ◽ Better Than

Background: Cell lytic enzymes are highly evolved proteins that can destroy the cell structure and kill bacteria. Unlike antibiotics, cell lytic enzymes do not cause a serious problem of drug resistance in pathogenic bacteria, whereas with antibiotics the problem of drug resistance is becoming more serious. The study of cell wall lytic enzymes therefore aims at finding an efficient way to cure bacterial infections, and cell lytic enzymes are a good choice for that purpose. Cell lytic enzymes include endolysins and autolysins, which differ in the purpose for which the cell wall is broken. Identifying the type of a cell lytic enzyme is meaningful for the study of cell wall enzymes. Objective: Our motivation in this article is to predict the type of cell lytic enzyme. Because cell lytic enzymes help kill bacteria, studying their type is meaningful. However, detecting the type by experimental methods is time-consuming. We therefore propose an efficient computational method for predicting the type of cell lytic enzyme. Method: We propose a computational method for distinguishing endolysins from autolysins. First, a data set containing 27 endolysins and 41 autolysins is built. Each protein is then represented by its tripeptide composition, and the features with larger confidence degree are selected. Finally, a support vector machine classifier is trained on the labeled vectors and used to predict the type of cell lytic enzyme. Results: The experimental results show that the overall accuracy reaches 97.06% when 44 features are selected. Compared with Ding's method, our method improves the overall accuracy by nearly 4.5% in relative terms ((97.06 - 92.9)/92.9). The performance of the proposed method is stable when the number of selected features is between 40 and 70. The overall accuracy of the tripeptide optimal feature set is 94.12%, and the overall accuracy of Chou's amphiphilic PseAAC method is 76.2%; using the tripeptide optimal feature set thus improves the overall accuracy by nearly 18 percentage points. Conclusion: The paper proposes an efficient method for identifying endolysins and autolysins, using a support vector machine to predict the type of cell lytic enzyme. The experimental results show that the overall accuracy of the proposed method is 94.12%, which is better than some existing methods. The selected 44 features can improve the overall accuracy of identifying the type of cell lytic enzyme, and the support vector machine performs better than other classifiers when using the selected feature set on the benchmark data set.
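
The arithmetic behind the quoted figures can be checked with a short worked example. The counts of 66 and 64 correct predictions are an inference of this sketch: they are the values that reproduce 97.06% and 94.12% on a 68-protein data set, and the abstract does not state them.

```python
# Data set size given in the abstract.
n_endolysin, n_autolysin = 27, 41
n_total = n_endolysin + n_autolysin  # 68 proteins


def accuracy_pct(correct: int, total: int) -> float:
    """Overall accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


# Assumed counts: 66/68 and 64/68 match the two reported accuracies.
acc_44_features = accuracy_pct(66, n_total)
acc_optimal_set = accuracy_pct(64, n_total)

# Relative improvement over the 92.9% baseline, in percent.
relative_gain = (97.06 - 92.9) / 92.9 * 100

# Absolute improvement over the 76.2% baseline, in percentage points.
absolute_gain = 94.12 - 76.2

print(acc_44_features, acc_optimal_set)
print(round(relative_gain, 2), round(absolute_gain, 2))
```

The "nearly 4.5%" figure is a relative gain, while the "nearly 18%" figure is a difference in percentage points, so the two are not directly comparable.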

pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset

Current Pharmaceutical Design ◽ 10.2174/1381612824666181119145030 ◽ 2019 ◽ Vol 24 (34) ◽ pp. 4013-4022 ◽ Cited By ~ 28

Author(s): Xiang Cheng ◽ Xuan Xiao ◽ Kuo-Chen Chou

Keyword(s): Subcellular Localization ◽ Basic Research ◽ Training Dataset ◽ Sequence Information ◽ Plant Proteins ◽ Protein Subcellular Localization ◽ Computational Tools ◽ Validation Tests ◽ User Friendly ◽ Better Than

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mPlant was trained by an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. To overcome such biased consequence, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor in identifying the subcellular localization of plant proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.
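
Balancing a skewed training set can be illustrated with plain random oversampling, in which minority-class samples are duplicated until every class matches the largest one. This is a swapped-in, generic technique for illustration only: the abstract does not say how pLoc_bal-mPlant balances its data, and the sketch treats each protein as having a single location label, which ignores the multiplex proteins discussed above. All names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple


def oversample_to_balance(samples: List[str], labels: List[Hashable],
                          seed: int = 0) -> Tuple[List[str], List[Hashable]]:
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(items) for items in by_label.values())

    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical skewed set: ten proteins in one location, one in another.
proteins = [f"P{i:02d}" for i in range(1, 12)]
locations = ["chloroplast"] * 10 + ["vacuole"]
balanced_proteins, balanced_locations = oversample_to_balance(proteins, locations)
print(Counter(balanced_locations))
```

Duplicating samples must happen inside each cross-validation training fold, not before splitting, otherwise copies of the same protein can leak into the test fold.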

pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset

Medicinal Chemistry ◽ 10.2174/1573406415666181218102517 ◽ 2019 ◽ Vol 15 (5) ◽ pp. 472-485 ◽ Cited By ~ 21

Author(s): Kuo-Chen Chou ◽ Xiang Cheng ◽ Xuan Xiao

Keyword(s): Drug Development ◽ Subcellular Localization ◽ Basic Research ◽ The Other ◽ Training Dataset ◽ Sequence Information ◽ Eukaryotic Proteins ◽ Validation Tests ◽ User Friendly ◽ Better Than

Background/Objective: Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is high demand for powerful bioinformatics tools that can timely and effectively identify their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained by an extremely skewed dataset where some subset was about 200 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/. Conclusion: It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.

NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽ 10.3390/s21072503 ◽ 2021 ◽ Vol 21 (7) ◽ pp. 2503

Author(s): Taro Suzuki ◽ Yoshiharu Amano

Keyword(s): Machine Learning ◽ Satellite System ◽ Training Data ◽ Support Vector ◽ Positioning Errors ◽ Automated Method ◽ Global Navigation Satellite ◽ Better Than ◽ Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, and features of this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. We also propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that NN was better than SVM, and 97.7% of NLOS signals were correctly discriminated.
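
Why the correlator shape carries information about NLOS reception can be shown with an idealized model. The sketch assumes the textbook triangular code autocorrelation (one chip wide on each side), models a reflected-only signal as a delayed, attenuated copy, and extracts two hand-picked shape features. The tap spacing, amplitudes, delay, and features are all hypothetical and are not the features used in the paper.

```python
from typing import List, Tuple


def ideal_correlation(tau_chips: float) -> float:
    """Idealized triangular code autocorrelation, with tau in chips."""
    return max(0.0, 1.0 - abs(tau_chips))


def correlator_outputs(taps: List[float], direct_amp: float = 1.0,
                       reflected_amp: float = 0.0,
                       extra_delay: float = 0.0) -> List[float]:
    """Multi-correlator outputs for a direct path plus one delayed reflection."""
    return [direct_amp * ideal_correlation(t)
            + reflected_amp * ideal_correlation(t - extra_delay)
            for t in taps]


def shape_features(taps: List[float], outputs: List[float]) -> Tuple[float, float]:
    """Return (tap offset of the peak, late-minus-early asymmetry in [-1, 1])."""
    peak_index = max(range(len(taps)), key=lambda i: outputs[i])
    early = sum(o for t, o in zip(taps, outputs) if t < 0)
    late = sum(o for t, o in zip(taps, outputs) if t > 0)
    total = early + late
    asymmetry = (late - early) / total if total else 0.0
    return taps[peak_index], asymmetry


taps = [i / 10 for i in range(-10, 11)]  # -1.0 ... 1.0 chips in 0.1 steps
los_features = shape_features(taps, correlator_outputs(taps))
nlos_features = shape_features(
    taps, correlator_outputs(taps, direct_amp=0.0, reflected_amp=0.5,
                             extra_delay=0.3))
print(los_features, nlos_features)
```

In this toy model the direct-path shape is symmetric about zero offset, while the reflected-only shape is shifted late and skewed, which is the kind of distortion a classifier could learn from.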

Zonation of Landslide Susceptibility in Ruijin, Jiangxi, China

International Journal of Environmental Research and Public Health ◽ 10.3390/ijerph18115906 ◽ 2021 ◽ Vol 18 (11) ◽ pp. 5906

Author(s): Xiaoting Zhou ◽ Weicheng Wu ◽ Ziyu Lin ◽ Guiliang Zhang ◽ Renxiang Chen ◽ ...

Keyword(s): Environmental Factors ◽ Landslide Susceptibility ◽ Urban Areas ◽ Support Vector ◽ Susceptibility Map ◽ Human Society ◽ Learning Approaches ◽ Prevention Measures ◽ Landslide Occurrence ◽ Better Than

Landslides are one of the major geohazards threatening human society. The objective of this study was to conduct a landslide hazard susceptibility assessment for Ruijin, Jiangxi, China, and to provide technical support to the local government for implementing disaster reduction and prevention measures. Machine learning approaches, e.g., random forests (RFs) and support vector machines (SVMs), were employed, and multiple geo-environmental factors such as land cover, NDVI, landform, rainfall, lithology, and proximity to faults, roads, and rivers were utilized. For categorical factors, three processing approaches were proposed: simple numerical labeling (SNL), weight assignment (WA)-based, and frequency ratio (FR)-based. Then 19 geo-environmental factors were converted to raster format to constitute three 19-band datasets, i.e., DS1, DS2, and DS3, one from each of the three processes. Then, 155 observed landslides that occurred in the past decades were vectorized, among which 70% were randomly selected to compose a training set (TS1) and the remaining 30% to form a validation set (VS1). A number of non-landslide (no-risk) samples distributed over the whole study area were identified in low slope (<1–3°) zones such as urban areas and croplands, and added to TS1 and VS1 in the same ratio. For comparison, we used the FR approach to identify the no-risk samples in both flat and non-flat areas, and merged them with the field-observed landslides to constitute another pair of training and validation sets (TS2 and VS2) using the same ratio of 7:3. The RF algorithm was applied to model the probability of landslide occurrence using DS1, DS2, and DS3 as predictive variables and TS1 and TS2 for training to obtain the SNL-based, WA-based, and FR-based RF models, respectively. Verified against VS1 and VS2, the three models have similar overall accuracy (OA) and Kappa coefficient (KC), which are 89.61%, 91.47%, and 94.54%, and 0.7926, 0.8299, and 0.8908, respectively. All of them are much better than the three models obtained by the SVM algorithm, with OA of 81.79%, 82.86%, and 83%, and KC of 0.6337, 0.655, and 0.660. New case verification with the recent 26 landslide events of 2017–2020 revealed that the landslide susceptibility map from WA-based RF modeling was able to properly identify the high and very high susceptibility zones where 23 new landslides had occurred, and performed better than the SNL-based and FR-based RF modeling, though the latter has a slightly higher OA and KC. Hence, we concluded that all three RF models achieve reasonable risk prediction, but WA-based and FR-based RF modeling deserve a recommendation for application elsewhere. The results of this study may serve as a reference for the local authorities in prevention and early warning of landslide hazards.
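
Overall accuracy and the Kappa coefficient are both derived from a confusion matrix, and the relationship between them can be sketched as follows. The function implements the standard Cohen's kappa formula; the confusion matrix is hypothetical and is not taken from the study.

```python
from typing import List, Tuple


def overall_accuracy_and_kappa(confusion: List[List[int]]) -> Tuple[float, float]:
    """Compute overall accuracy and Cohen's kappa from a square confusion
    matrix whose rows are reference classes and columns are predictions."""
    size = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical validation result: 50 landslide and 50 no-risk samples.
toy_confusion = [[45, 5],
                 [5, 45]]
oa, kc = overall_accuracy_and_kappa(toy_confusion)
print(oa, kc)
```

When the two classes are equally frequent in both the reference and the predictions, chance agreement is 0.5 and kappa reduces to 2·OA − 1. The reported pair of 94.54% and 0.8908 fits that pattern, although the abstract does not give the class counts.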

Deep Learning Methods for Classification of Certain Abnormalities in Echocardiography

Electronics ◽ 10.3390/electronics10040495 ◽ 2021 ◽ Vol 10 (4) ◽ pp. 495

Author(s): Imayanmosha Wahlang ◽ Arnab Kumar Maji ◽ Goutam Saha ◽ Prasun Chakrabarti ◽ Michal Jasinski ◽ ...

Keyword(s): Deep Learning ◽ Short Term Memory ◽ Support Vector ◽ Variational Autoencoder ◽ Different Types ◽ Static Images ◽ Long Short Term Memory ◽ 2D And 3D ◽ Better Than

This article experiments with deep learning methodologies in echocardiogram (echo), a promising and vigorously researched technique in the preponderance field. The paper involves two kinds of classification in the echo. Firstly, classification into normal (absence of abnormalities) or abnormal (presence of abnormalities) is done using 2D echo images, 3D Doppler images, and videographic images. Secondly, different types of regurgitation, namely Mitral Regurgitation (MR), Aortic Regurgitation (AR), Tricuspid Regurgitation (TR), and a combination of the three, are classified using videographic echo images. Two deep-learning methodologies are used for these purposes: a Recurrent Neural Network (RNN) based methodology (Long Short Term Memory (LSTM)) and an Autoencoder based methodology (Variational AutoEncoder (VAE)). The use of videographic images distinguishes this work from the existing work using SVM (Support Vector Machine), and the application of deep-learning methodologies is the first of many in this particular field. It was found that deep-learning methodologies perform better than the SVM methodology in normal or abnormal classification. Overall, VAE performs better on 2D and 3D Doppler images (static images), while LSTM performs better on videographic images.

Velody 2—Resilient High-Capacity MIDI Steganography for Organ and Harpsichord Music

Applied Sciences ◽ 10.3390/app11010039 ◽ 2020 ◽ Vol 11 (1) ◽ pp. 39

Author(s): Eric Järpe ◽ Mattias Weckstén

Keyword(s): Mean Absolute Error ◽ Signal To Noise Ratio ◽ High Capacity ◽ Absolute Error ◽ Music Technology ◽ Alternative Methods ◽ File Size ◽ User Friendly ◽ Different Levels ◽ Better Than

A new method for musical steganography for the MIDI format is presented. The MIDI standard is a user-friendly music technology protocol that is frequently deployed by composers of different levels of ambition. There is to the author’s knowledge no fully implemented and rigorously specified, publicly available method for MIDI steganography. The goal of this study, however, is to investigate how a novel MIDI steganography algorithm can be implemented by manipulation of the velocity attribute subject to restrictions of capacity and security. Many of today’s MIDI steganography methods—less rigorously described in the literature—fail to be resilient to steganalysis. Traces (such as artefacts in the MIDI code which would not occur by the mere generation of MIDI music: MIDI file size inflation, radical changes in mean absolute error or peak signal-to-noise ratio of certain kinds of MIDI events or even audible effects in the stego MIDI file) that could catch the eye of a scrutinizing steganalyst are side-effects of many current methods described in the literature. This steganalysis resilience is an imperative property of the steganography method. However, by restricting the carrier MIDI files to classical organ and harpsichord pieces, the problem of velocities following the mood of the music can be avoided. The proposed method, called Velody 2, is found to be on par with or better than the cutting edge alternative methods regarding capacity and inflation while still possessing a better resilience against steganalysis. An audibility test was conducted to check that there are no signs of audible traces in the stego MIDI files.
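
The general idea of hiding data in the velocity attribute can be illustrated with plain least-significant-bit (LSB) embedding. This is explicitly not the Velody 2 algorithm, whose details are not given in the abstract; it is a generic, swapped-in technique operating on a bare list of velocity values, with hypothetical function names. Velocities below 2 are skipped because a note-on with velocity 0 acts as a note-off, and clearing the low bit of velocity 1 would produce 0.

```python
from typing import List


def embed_bits(velocities: List[int], bits: List[int]) -> List[int]:
    """Write one message bit into the least significant bit of each usable
    velocity (value >= 2), leaving all other velocities untouched."""
    carriers = [i for i, v in enumerate(velocities) if v >= 2]
    if len(bits) > len(carriers):
        raise ValueError("message too long for this carrier")
    stego = list(velocities)
    for index, bit in zip(carriers, bits):
        stego[index] = (stego[index] & ~1) | bit
    return stego


def extract_bits(velocities: List[int], n_bits: int) -> List[int]:
    """Read back the first n_bits least significant bits of usable velocities."""
    return [v & 1 for v in velocities if v >= 2][:n_bits]


def mean_absolute_error(original: List[int], modified: List[int]) -> float:
    """Average absolute change per velocity value."""
    return sum(abs(a - b) for a, b in zip(original, modified)) / len(original)


# Hypothetical note-on velocities from a carrier file and a 4-bit message.
carrier = [64, 0, 1, 100, 127, 90]
message = [1, 0, 1, 1]
stego = embed_bits(carrier, message)
print(stego, extract_bits(stego, len(message)))
print(mean_absolute_error(carrier, stego))
```

Each velocity changes by at most 1, which keeps the mean absolute error low, but plain LSB embedding leaves statistical traces and makes no claim to the steganalysis resilience discussed above.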
