mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation

Balachandran Manavalan; Shaherin Basith; Tae Hwan Shin; Leyi Wei; Gwang Lee

doi:10.1093/bioinformatics/bty1047

Bioinformatics ◽

10.1093/bioinformatics/bty1047 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2757-2765 ◽

Cited By ~ 63

Author(s):

Balachandran Manavalan ◽

Shaherin Basith ◽

Tae Hwan Shin ◽

Leyi Wei ◽

Gwang Lee

Keyword(s):

Nearest Neighbor ◽

Feature Representation ◽

Superior Performance ◽

Supplementary Information ◽

Gradient Boosting ◽

Support Vector ◽

Pharmaceutical Drugs ◽

K Nearest Neighbor ◽

Feature Descriptors ◽

Predicted Probability

Abstract. Motivation: Cardiovascular disease is the primary cause of death globally, accounting for approximately 17.7 million deaths per year. One of the risk factors linked with cardiovascular diseases and other complications is hypertension. Naturally derived bioactive peptides with antihypertensive activities serve as promising alternatives to pharmaceutical drugs. So far, there has been no comprehensive analysis or assessment of diverse features, nor an implementation of various machine-learning (ML) algorithms, for antihypertensive peptide (AHTP) model construction. Results: In this study, we utilized six different ML algorithms, namely AdaBoost, extremely randomized tree (ERT), gradient boosting (GB), k-nearest neighbor, random forest (RF) and support vector machine (SVM), with 51 feature descriptors derived from eight different feature encodings for the prediction of AHTPs. Because ERT-based models performed consistently better than the other algorithms regardless of feature descriptor, we treated them as baseline predictors, whose predicted probabilities of AHTPs were then used as input features separately for four different ML algorithms (ERT, GB, RF and SVM), and we developed the corresponding meta-predictors using a two-step feature selection protocol. Subsequently, the integration of the four meta-predictors through an ensemble learning approach improved the balanced prediction performance and model robustness on the independent dataset. Upon comparison with existing methods, mAHTPred showed superior performance, with an overall improvement of approximately 6–7% on both the benchmarking and independent datasets. Availability and implementation: The user-friendly online prediction tool, mAHTPred, is freely accessible at http://thegleelab.org/mAHTPred. Supplementary information: Supplementary data are available at Bioinformatics online.
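
The two-layer design described in the abstract (baseline ERT models whose predicted probabilities feed four meta-predictors, which are then combined) can be illustrated with a minimal scikit-learn sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic data, the three 10-column "descriptors" (the paper uses 51), the hyperparameters, the use of out-of-fold probabilities, and the plain averaging of the four meta-predictors. The two-step feature selection protocol is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for encoded peptides (assumption): 30 columns split into
# three hypothetical feature descriptors of 10 columns each.
X, y = make_classification(n_samples=300, n_features=30, n_informative=12,
                           random_state=0)
descriptors = [slice(0, 10), slice(10, 20), slice(20, 30)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def baseline_probabilities(X_fit, y_fit, X_new=None):
    """Train one ERT per descriptor and return P(positive) as one column each."""
    columns = []
    for cols in descriptors:
        ert = ExtraTreesClassifier(n_estimators=100, random_state=0)
        if X_new is None:
            # Out-of-fold probabilities for the training set, so the
            # meta-predictors never see a probability fitted on its own label.
            proba = cross_val_predict(ert, X_fit[:, cols], y_fit, cv=5,
                                      method="predict_proba")[:, 1]
        else:
            proba = ert.fit(X_fit[:, cols], y_fit).predict_proba(
                X_new[:, cols])[:, 1]
        columns.append(proba)
    return np.column_stack(columns)


# Layer 1: baseline ERT probabilities become the new feature vectors.
meta_train = baseline_probabilities(X_train, y_train)
meta_test = baseline_probabilities(X_train, y_train, X_test)

# Layer 2: four meta-predictors trained on the probability features.
meta_models = [
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]
meta_probas = [m.fit(meta_train, y_train).predict_proba(meta_test)[:, 1]
               for m in meta_models]

# Ensemble step (assumed here to be a simple average of the four outputs).
ensemble_proba = np.mean(meta_probas, axis=0)
ensemble_pred = (ensemble_proba >= 0.5).astype(int)
print("held-out accuracy:", (ensemble_pred == y_test).mean())
```

Using out-of-fold probabilities for the training rows is a common way to limit leakage in stacked models; whether mAHTPred does the same is not stated in the abstract.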

Bacterial Immunogenicity Prediction by Machine Learning Methods

Vaccines ◽

10.3390/vaccines8040709 ◽

2020 ◽

Vol 8 (4) ◽

pp. 709

Author(s):

Ivan Dimitrov ◽

Nevena Zaharieva ◽

Irini Doytchinova

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Predictive Ability ◽

Initial Step ◽

Majority Voting ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Test Set ◽

Extreme Gradient Boosting

The identification of protective immunogens is the most important and vigorous initial step in the long-lasting and expensive process of vaccine design and development. Machine learning (ML) methods are very effective in data mining and in the analysis of big data such as microbial proteomes. They are able to significantly reduce the experimental work needed to discover novel vaccine candidates. Here, we applied six supervised ML methods (partial least squares-based discriminant analysis, k nearest neighbor (kNN), random forest (RF), support vector machine (SVM), random subspace method (RSM), and extreme gradient boosting (xgboost)) to a set of 317 known bacterial immunogens and 317 bacterial non-immunogens and derived models for immunogenicity prediction. The models were validated by internal cross-validation in 10 groups from the training set and by an external test set. All of them showed good predictive ability, but the xgboost model displayed the most prominent ability to identify immunogens, recognizing 84% of the known immunogens in the test set. The combined RSM-kNN model was the best at recognizing non-immunogens, identifying 92% of them in the test set. The three best performing ML models (xgboost, RSM-kNN, and RF) were implemented in the new version of the server VaxiJen, and the prediction of bacterial immunogens is now based on majority voting.
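
As a rough illustration of the majority-voting step, the following sketch combines the binary calls of three models. The per-model outputs are invented for the example, and the function is a generic hard-voting rule, not VaxiJen's actual code.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most models predicted."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical calls for three proteins (1 = immunogen, 0 = non-immunogen).
model_outputs = {
    "xgboost": [1, 0, 1],
    "RSM-kNN": [1, 0, 0],
    "RF":      [0, 0, 1],
}

# zip(*...) regroups the per-model lists into one tuple of votes per protein.
consensus = [majority_vote(votes) for votes in zip(*model_outputs.values())]
print(consensus)  # [1, 0, 1]
```

With three voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of binary models.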

Identifying Modes of Driving Railway Trains from GPS Trajectory Data: An Ensemble Classifier-Based Approach

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi7080308 ◽

2018 ◽

Vol 7 (8) ◽

pp. 308 ◽

Cited By ~ 4

Author(s):

Han Zheng ◽

Zanyang Cui ◽

Xingchen Zhang

Keyword(s):

Nearest Neighbor ◽

Capacity Utilization ◽

Real Data ◽

Parameter Tuning ◽

Integrated Approach ◽

Ensemble Classifier ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Trajectory Data

Recognizing Modes of Driving Railway Trains (MDRT) can help to solve railway freight transportation problems in driver behavior research, auto-driving system design and capacity utilization optimization. Previous studies have focused on analyses and applications of MDRT, but there is currently no approach to automatically and effectively identify MDRT in the context of big data. In this study, we propose an integrated approach including data preprocessing, feature extraction, classifiers modeling, training and parameter tuning, and model evaluation to infer MDRT using GPS data. The highlights of this study are as follows: First, we propose methods for extracting Driving Segmented Standard Deviation Features (DSSDF) combined with classical features for the purpose of improving identification performances. Second, we find the most suitable classifier for identifying MDRT based on a comparison of performances of K-Nearest Neighbor, Support Vector Machines, AdaBoost, Random Forest, Gradient Boosting Decision Tree, and XGBoost. From the real-data experiment, we conclude that: (i) The ensemble classifier XGBoost produces the best performance with an accuracy of 92.70%; (ii) The group of DSSDF plays an important role in identifying MDRT with an accuracy improvement of 11.2% (using XGBoost). The proposed approach has been applied in capacity utilization optimization and new driver training for the Baoshen Railway.

Stress Classification of ECG-Derived HRV Features Extracted from Wearable Devices

10.20944/preprints202103.0644.v1 ◽

2021 ◽

Author(s):

Kayisan Mary Dalmeida ◽

Giovanni Luca Masala

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Wearable Devices ◽

Mental Wellbeing ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Automobile Crashes ◽

Machine Learning Model ◽

Stress Classification

Stress has been identified as one of the major causes of automobile crashes, which lead to high rates of fatalities and injuries each year. Stress can be measured via physiological measurements, and this study focuses on features that can be extracted by common wearable devices, mainly heart rate variability (HRV). The study aims to develop a predictive model that accurately classifies stress levels from ECG-derived HRV features obtained from automobile drivers, testing different machine learning methodologies such as K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Multilayer Perceptron (MLP), Random Forest (RF) and Gradient Boosting (GB). Moreover, the models with the highest predictive power will be used as a reference for the development of a machine learning model to classify stress from HRV features derived from HRV measurements obtained from wearable devices. We demonstrate that MLP was the best stress classifier, achieving a recall of 80%. The proposed method can also be used in any application in which it is important to monitor stress levels, e.g. physical rehabilitation, anxiety relief or mental wellbeing.

Machine learning-based patient classification system for adults with stroke: A systematic review

Chronic Illness ◽

10.1177/17423953211067435 ◽

2021 ◽

pp. 174239532110674

Author(s):

Suebsarn Ruksakulpiwat ◽

Witchuda Thongking ◽

Wendie Zhou ◽

Chitchanok Benjasirisan ◽

Lalipat Phianhasin ◽

...

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Classification System ◽

Nearest Neighbor ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Optimal Outcomes ◽

And Gender ◽

Meta Analyses

Objective: To evaluate the existing evidence of a machine learning-based classification system that stratifies patients with stroke. Methods: The authors carried out a systematic review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations for a review article. PubMed, MEDLINE, Web of Science, and CINAHL Plus Full Text were searched from January 2015 to February 2021. Results: There are twelve studies included in this systematic review. Fifteen algorithms were used in the included studies. The most common forms of machine learning (ML) used to classify stroke patients were the support vector machine (SVM) (n = 8 studies), followed by random forest (RF) (n = 7 studies), decision tree (DT) (n = 4 studies), gradient boosting (GB) (n = 4 studies), neural networks (NNs) (n = 3 studies), deep learning (n = 2 studies), and k-nearest neighbor (k-NN) (n = 2 studies), respectively. Forty-four features of inputs were used in the included studies, and age and gender are the most common features in the ML model. Discussion: There is no single algorithm that performed better or worse than all others at classifying patients with stroke, in part because different input data require different algorithms to achieve optimal outcomes.

Application of Artificial Intelligence and Machine Learning Techniques in Classifying Extent of Dementia Across Alzheimer's Image Data

International Journal of Quantitative Structure-Property Relationships ◽

10.4018/ijqspr.2021040103 ◽

2021 ◽

Vol 6 (2) ◽

pp. 29-46

Author(s):

Robin Ghosh ◽

Anirudh Reddy Cingreddy ◽

Venkata Melapu ◽

Sravanthi Joginipelli ◽

Supratik Kar

Keyword(s):

Neural Network ◽

Machine Learning ◽

Nearest Neighbor ◽

Image Data ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Mild Dementia ◽

Extreme Gradient Boosting

Alzheimer's disease (AD) is one of the most common forms of dementia and the sixth-leading cause of death in older adults. The presented study illustrates applications of deep learning (DL) and associated methods for multiclass image detection, which could have a broader impact on identifying dementia stages and may guide therapy in the future. The studied datasets contain around 6,400 magnetic resonance imaging (MRI) images, each assigned to one of four Alzheimer's severity classes: mild dementia, very mild dementia, non-dementia, and moderate dementia. These four image classes were used to classify the dementia stage of each patient by applying the convolutional neural network (CNN) algorithm. Employing the CNN-based in silico model, the authors classified and predicted the different AD stages with around 97.19% accuracy. In addition, machine learning (ML) techniques such as extreme gradient boosting (XGB), support vector machine (SVM), k-nearest neighbor (KNN), and artificial neural network (ANN) offered accuracies of 96.62%, 96.56%, 94.62%, and 89.88%, respectively.

Integrating synthetic minority oversampling and gradient boosting decision tree for bogie fault diagnosis in rail vehicles

Proceedings of the Institution of Mechanical Engineers Part F Journal of Rail and Rapid Transit ◽

10.1177/0954409718795089 ◽

2018 ◽

Vol 233 (3) ◽

pp. 312-325 ◽

Cited By ~ 11

Author(s):

Linlin Kou ◽

Yong Qin ◽

Xunjun Zhao ◽

Yong Fu

Keyword(s):

Support Vector Machine ◽

Fault Diagnosis ◽

Fault Detection ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor

Bogies are critical components of a rail vehicle, which are important for the safe operation of rail transit. In this study, the authors analyzed the real vibration data of the bogies of a railway vehicle obtained from a Chinese subway company under four different operating conditions. The authors selected 15 feature indexes – that ranged from time-domain, energy, and entropy – as well as their correlations. The adaptive synthetic sampling approach–gradient boosting decision tree (ADASYN–GBDT) method is proposed for the bogie fault diagnosis. A comparison between ADASYN–GBDT and the three commonly used classifiers (K-nearest neighbor, support vector machine, and Gaussian naïve Bayes), combined with random forest as the feature selection, was done under different test data sizes. A confusion matrix was used to evaluate those classifiers. In K-nearest neighbor, support vector machine, and Gaussian naïve Bayes, the optimal features should be selected first, while the proposed method of this study does not need to select the optimal features. K-nearest neighbor, support vector machine, and Gaussian naïve Bayes produced inaccurate results in multi-class identification. It can be seen that the lowest false detection rates of the proposed ADASYN–GBDT model are 92.95% and 87.81% when proportion of the test dataset is 0.4 and 0.9, respectively. In addition, the ADASYN–GBDT model has the ability to correctly identify a fault, which makes it more practical and suitable for use in railway operations. The entire process (training and testing) was finished in 2.4231 s and the detection procedure took 0.0027 s on average. The results show that the proposed ADASYN–GBDT method satisfied the requirements of real-time performance and accuracy for online fault detection. It might therefore aid in the fault detection of bogies.

IgA Nephropathy Prediction in Children with Machine Learning Algorithms

Future Internet ◽

10.3390/fi12120230 ◽

2020 ◽

Vol 12 (12) ◽

pp. 230

Author(s):

Ping Zhang ◽

Rongqin Wang ◽

Nianfeng Shi

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Learning Algorithms ◽

Immunoglobulin A ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Chi Square ◽

Extreme Gradient Boosting

Immunoglobulin A nephropathy (IgAN) is the most common primary glomerular disease worldwide and a major cause of renal failure. IgAN prediction in children with machine learning algorithms has rarely been studied. We retrospectively analyzed electronic medical records from the Nanjing Eastern War Zone Hospital and chose eXtreme Gradient Boosting (XGBoost), random forest (RF), CatBoost, support vector machines (SVM), k-nearest neighbor (KNN), and extreme learning machine (ELM) models to predict the probability that a patient would or would not reach end-stage renal disease (ESRD) within five years. We used the chi-square test to select the 16 most relevant features as model input, and designed a decision-making system (DMS) for IgAN prediction in children based on XGBoost and the Django framework. The receiver operating characteristic (ROC) curve was used to evaluate the performance of the models, and XGBoost performed best. The AUC value, accuracy, precision, recall, and F1-score of XGBoost were 85.11%, 78.60%, 75.96%, 76.70%, and 76.33%, respectively. The XGBoost model is useful for physicians and pediatric patients in providing predictions regarding IgAN. As an advantage, a DMS can be designed based on the XGBoost model to assist a physician in effectively treating IgAN in children and preventing deterioration.
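
The reported precision, recall, and F1-score can be checked against each other, since F1 is the harmonic mean of precision and recall. The short calculation below uses only the figures quoted in the abstract.

```python
# Precision and recall as reported for XGBoost in the abstract.
precision, recall = 0.7596, 0.7670

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")  # 76.33%
```

The result agrees with the 76.33% F1-score stated in the abstract.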

Classification of Parkinson’s disease and essential tremor based on balance and gait characteristics from wearable motion sensors via machine learning techniques: a data-driven approach

Journal of NeuroEngineering and Rehabilitation ◽

10.1186/s12984-020-00756-5 ◽

2020 ◽

Vol 17 (1) ◽

Author(s):

Sanghee Moon ◽

Hyun-Je Song ◽

Vibhash D. Sharma ◽

Kelly E. Lyons ◽

Rajesh Pahwa ◽

...

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Motion Sensors ◽

Learning Models ◽

K Nearest Neighbor ◽

Gait Characteristics ◽

Machine Learning Models

Abstract. Background: Parkinson’s disease (PD) and essential tremor (ET) are movement disorders that can have similar clinical characteristics, including tremor and gait difficulty. These disorders can be misdiagnosed, leading to delay in appropriate treatment. The aim of the study was to determine whether balance and gait variables obtained with wearable inertial motion sensors can be utilized to differentiate between PD and ET using machine learning. Additionally, we compared classification performances of several machine learning models. Methods: This retrospective study included balance and gait variables collected during the instrumented stand and walk test from people with PD (n = 524) and with ET (n = 43). The performance of several machine learning techniques, including neural networks, support vector machine, k-nearest neighbor, decision tree, random forest, and gradient boosting, was compared with a dummy model or logistic regression using F1-scores. Results: Machine learning models classified PD and ET based on balance and gait characteristics better than the dummy model (F1-score = 0.48) or logistic regression (F1-score = 0.53). The highest F1-score was 0.61 for the neural network, followed by 0.59 for gradient boosting, 0.56 for random forest, 0.55 for support vector machine, 0.53 for decision tree, and 0.49 for k-nearest neighbor. Conclusions: This study demonstrated the utility of machine learning models to classify different movement disorders based on balance and gait characteristics collected from wearable sensors. Future studies using a well-balanced data set are needed to confirm the potential clinical utility of machine learning models to discern between PD and ET.
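
To see why class imbalance matters for the dummy baseline, consider a small worked calculation. It assumes, without confirmation from the abstract, that the dummy model always predicts the majority class (PD) and that F1 is macro-averaged over the two classes; it also uses the full class counts rather than any train/test split the authors may have used.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), taken as 0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


n_pd, n_et = 524, 43  # class sizes reported in the abstract

# Assumed dummy: label every participant as PD.
f1_pd = f1(tp=n_pd, fp=n_et, fn=0)   # all PD found, all ET mislabeled as PD
f1_et = f1(tp=0, fp=0, fn=n_et)      # no ET case is ever predicted
macro_f1 = (f1_pd + f1_et) / 2

print(round(f1_pd, 3), f1_et, round(macro_f1, 2))  # 0.961 0.0 0.48
```

Under these assumptions the macro-averaged score comes out at about 0.48, which matches the dummy-model figure in the abstract, although the authors' exact evaluation protocol may differ.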

A hybrid evolutionary learning classification for robot ground pattern recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202940 ◽

2021 ◽

pp. 1-15

Author(s):

Jiankai Zuo ◽

Yaying Zhang

Keyword(s):

Nearest Neighbor ◽

Fitness Function ◽

Gradient Boosting ◽

Support Vector ◽

Evolutionary Learning ◽

Ensemble Classifiers ◽

Improved Genetic Algorithm ◽

K Nearest Neighbor ◽

Obvious Effect ◽

Extreme Gradient Boosting

In the field of intelligent robot engineering, whether for humanoid, bionic or vehicle robots, the driving forms of standing, moving and walking, and the discrimination of the environment in which the robots are located, have always been the focus and difficulty of research. To address such problems, Naive Bayes Classifier (NBC), Support Vector Machine (SVM), k-Nearest-Neighbor (KNN), Decision Tree (DT), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) were introduced to conduct experiments. The six individual classifiers have an obvious effect on a particular type of ground, but the overall performance is poor. Therefore, the paper proposes a “Novel Hybrid Evolutionary Learning” method (NHEL), which combines the single classifiers by means of weighted voting and adopts an improved genetic algorithm (GA) to obtain the optimal weights. According to the fitness function and evolution times, this paper designs adaptively changing crossover and mutation rates and applies the conjugate gradient (CG) method to enhance GA. By making full use of the global search capabilities of GA and the fast local search ability of CG, the convergence speed is accelerated and the search precision is improved. The experimental results show that the performance of the proposed model is significantly better than that of individual machine learning and ensemble classifiers.

Adaptive Sparse Representation of Continuous Input for Tsetlin Machines Based on Stochastic Searching on the Line

Electronics ◽

10.3390/electronics10172107 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2107

Author(s):

Kuruge Darshana Abeyrathna ◽

Ole-Christoffer Granmo ◽

Morten Goodwin

Keyword(s):

Nearest Neighbor ◽

Additive Models ◽

Feature Representation ◽

Support Vector ◽

K Nearest Neighbor ◽

Novel Approach ◽

Vector Machines ◽

Real Value ◽

Continuous Input ◽

Static Threshold

This paper introduces a novel approach to representing continuous inputs in Tsetlin Machines (TMs). Instead of using one Tsetlin Automaton (TA) for every unique threshold found when Booleanizing continuous input, we employ two Stochastic Searching on the Line (SSL) automata to learn discriminative lower and upper bounds. The two resulting Boolean features are adapted to the rest of the clause by equipping each clause with its own team of SSLs, which update the bounds during the learning process. Two standard TAs finally decide whether to include the resulting features as part of the clause. In this way, only four automata altogether represent one continuous feature (instead of potentially hundreds of them). We evaluate the performance of the new scheme empirically using five datasets, along with a study of interpretability. On average, TMs with SSL feature representation use 4.3 times fewer literals than the TM with static threshold-based features. Furthermore, in terms of average memory usage and F1-Score, our approach outperforms simple Multi-Layered Artificial Neural Networks, Decision Trees, Support Vector Machines, K-Nearest Neighbor, Random Forest, Gradient Boosted Trees (XGBoost), and Explainable Boosting Machines (EBMs), as well as the standard and real-value weighted TMs. Our approach further outperforms Neural Additive Models on Fraud Detection and StructureBoost on CA-58 in terms of the Area Under Curve while performing competitively on COMPAS.
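
The difference in representation size can be illustrated with a small sketch. The comparison direction of the literals (`value <= threshold`, `value > lower`) and the bound values are assumptions chosen for illustration; in the paper the bounds are learned by SSL automata per clause, which is not modeled here.

```python
def static_threshold_features(value, thresholds):
    """Static Booleanization: one Boolean feature per candidate threshold."""
    return [value <= t for t in thresholds]


def bounded_features(value, lower, upper):
    """Two Boolean features from a lower and an upper bound (assumed fixed here)."""
    return [value > lower, value <= upper]


# Hypothetical continuous column; every unique value is a candidate threshold.
column = [1.2, 3.4, 3.4, 5.0, 7.7]
thresholds = sorted(set(column))

static = static_threshold_features(3.4, thresholds)
bounded = bounded_features(3.4, lower=2.0, upper=6.0)

print(len(static), static)    # 4 [False, True, True, True]
print(len(bounded), bounded)  # 2 [True, True]
```

The static scheme grows with the number of unique values in the column, whereas the bounded scheme always yields two Boolean features per continuous input, which is the source of the literal savings reported in the abstract.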
