Protein pKa prediction by tree-based machine learning

Author(s):  
Ada Y. Chen ◽  
Juyong Lee ◽  
Ana Damjanovic ◽  
Bernard R. Brooks

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69, with surface and buried RMSE values of 0.56 and 0.78, respectively, when considering six residue types (Asp, Glu, His, Lys, Cys and Tyr), and 0.63 when considering Asp, Glu, His and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
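The abstract above reports RMSE overall and separately for surface and buried residues. A minimal sketch of how such per-group RMSE values can be computed follows; the function names and the toy pKa values are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def rmse(errors):
    """Root-mean-square of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_by_group(records):
    """Compute RMSE per group and overall.

    records: iterable of (group, experimental_pka, predicted_pka) tuples.
    """
    grouped = defaultdict(list)
    for group, observed, predicted in records:
        grouped[group].append(predicted - observed)
    result = {group: rmse(errs) for group, errs in grouped.items()}
    # The overall RMSE pools every error, so it is not the mean of group RMSEs.
    result["overall"] = rmse([e for errs in grouped.values() for e in errs])
    return result


# Hypothetical toy values for illustration only, not data from the paper.
toy = [
    ("surface", 4.0, 4.5),
    ("surface", 6.5, 6.0),
    ("buried", 7.0, 8.0),
    ("buried", 10.0, 9.0),
]
scores = rmse_by_group(toy)
print(scores)
```

Because the overall value pools all squared errors, it depends on how many residues fall in each group as well as on the per-group RMSE values.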

2021 ◽  
Author(s):  
Vitaliy Degtyarev ◽  
Konstantinos Daniel Tsavdaridis

Large web openings introduce complex structural behaviors and additional failure modes of steel cellular beams, which must be considered in the design using laborious calculations (e.g., exercising SCI P355). This paper presents seven machine learning (ML) models, including decision tree (DT), random forest (RF), k-nearest neighbor (KNN), gradient boosting regressor (GBR), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and gradient boosting with categorical features support (CatBoost), for predicting the elastic buckling and ultimate loads of steel cellular beams. Large datasets of finite element (FE) simulation results, validated against experimental data, were used to develop the models. The ML models were fine-tuned via an extensive hyperparameter search to obtain their best performance. The elastic buckling and ultimate loads predicted by the optimized ML models demonstrated excellent agreement with the numerical data. The accuracy of the ultimate load predictions by the ML models exceeded the accuracy provided by the existing design provisions for steel cellular beams published in SCI P355 and AISC Design Guide 31. The relative feature importance and feature dependence of the models were evaluated and discussed in the paper. An interactive Python-based notebook and a user-friendly web application for predicting the elastic buckling and ultimate loads of steel cellular beams using the developed optimized ML models were created and made publicly available. The web application deployed to the cloud allows for making predictions in any web browser on any device, including mobile. The source code of the application available on GitHub allows running the application locally and independently from the cloud service.


2021 ◽  
pp. 0958305X2110449
Author(s):  
Irfan Ullah ◽  
Kai Liu ◽  
Toshiyuki Yamamoto ◽  
Rabia Emhamed Al Mamlook ◽  
Arshad Jamal

The rapid growth of the transportation sector and its emissions is attracting the attention of policymakers seeking environmental sustainability. It is therefore important to understand the driving factors of transport emissions. Electric vehicles play a key role amid rising transport emissions, paving the way towards a low-carbon economy and a sustainable environment. Successful deployment of electric vehicles relies heavily on energy consumption models that can predict energy consumption efficiently and reliably. Improving electric vehicles’ energy consumption efficiency will significantly help to alleviate driver anxiety and provide an essential framework for operation, planning, and management of the charging infrastructure. To tackle the challenge of predicting electric vehicles’ energy consumption, this study employs two advanced machine learning models, extreme gradient boosting and light gradient boosting machine, and compares them with two traditional models, multiple linear regression and artificial neural network. The electric vehicle energy consumption data used in the analysis were collected in Aichi Prefecture, Japan. Three metrics were used to evaluate the prediction models: coefficient of determination (R2), root mean square error, and mean absolute error. The results show that extreme gradient boosting and light gradient boosting machine gave better and more robust results than multiple linear regression and artificial neural network: they yielded higher R2 values and lower mean absolute error and root mean square error, and so proved more accurate. Of the two, the light gradient boosting machine outperformed the extreme gradient boosting model. A detailed feature importance analysis was carried out to demonstrate the impact and relative influence of the input variables on the prediction of electric vehicles’ energy consumption. The results imply that an advanced machine learning model can enhance the prediction performance of electric vehicles’ energy consumption.
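The three evaluation metrics named in the abstract above (R2, root mean square error, mean absolute error) can be sketched with their standard definitions as follows. The function name and the toy numbers are assumptions for illustration, not the study's code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for paired observed/predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical energy-consumption values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

A model is preferred when R2 is higher and both error metrics are lower, which is the comparison the abstract describes.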


2021 ◽  
Vol 9 (4) ◽  
pp. 376 ◽  
Author(s):  
Yunfei Yang ◽  
Haiwen Tu ◽  
Lei Song ◽  
Lin Chen ◽  
De Xie ◽  
...  

Resistance is one of the important performance indicators of ships. In this paper, a prediction method based on the Radial Basis Function neural network (RBFNN) is proposed to predict the resistance of a 13500 twenty-foot equivalent unit (13500TEU) container ship at different drafts. Prediction at a draft within the known range is called interpolation prediction; otherwise, it is extrapolation prediction. First, ship features are extracted to predict the resistance Rt. The resistance prediction results show that the performance of the RBFNN is significantly better than that of the other four machine learning models: backpropagation neural network (BPNN), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). Then, the ship data are made dimensionless, and the models mentioned above are used to predict the total resistance coefficient Ct of the container ship. The prediction results show that the RBFNN prediction model still performs well. Good results can be obtained by RBFNN in interpolation prediction, even when using only some of the dimensionless features. Finally, the accuracy of the prediction method based on RBFNN is greatly improved compared with the modified admiralty coefficient.
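The abstract above does not say how the resistance Rt is made dimensionless. One common convention in ship hydrodynamics is Ct = Rt / (0.5 · ρ · S · V²), and the sketch below assumes that form. The parameter names and the numbers are hypothetical, not values from the paper.

```python
def total_resistance_coefficient(rt_newton, density_kg_m3, wetted_area_m2, speed_m_s):
    """Ct = Rt / (0.5 * rho * S * V^2), assuming the usual convention.

    rt_newton:      total resistance Rt in newtons
    density_kg_m3:  water density rho
    wetted_area_m2: wetted surface area S
    speed_m_s:      ship speed V
    """
    dynamic_term = 0.5 * density_kg_m3 * wetted_area_m2 * speed_m_s ** 2
    return rt_newton / dynamic_term


# Hypothetical inputs chosen for easy arithmetic, not ship data from the paper.
ct = total_resistance_coefficient(
    rt_newton=1_025_000.0,
    density_kg_m3=1025.0,
    wetted_area_m2=10_000.0,
    speed_m_s=10.0,
)
print(ct)
```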


2019 ◽  
Author(s):  
Kasper Van Mens ◽  
Joran Lokkerbol ◽  
Richard Janssen ◽  
Robert de Lange ◽  
Bea Tiemens

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine learning algorithms to predict during treatment which patients will not benefit from brief mental health treatment and present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (ppv) of 0.71 (0.61 – 0.77) with a sensitivity of 0.35 (0.29 – 0.41) and area under the curve of 0.78. A trade-off can be made between ppv and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the ppv increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the ppv decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
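The trade-off between ppv and sensitivity described above can be illustrated with a short sketch that recomputes both quantities at several cut-off probabilities. The labels and probabilities are hypothetical toy values, not the study's data, and the function name is an assumption.

```python
def ppv_and_sensitivity(y_true, y_prob, cutoff):
    """Positive predictive value and sensitivity at a probability cut-off."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity


# Hypothetical toy data: 1 = positive class, 0 = negative class.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.55, 0.3, 0.6, 0.4, 0.2, 0.1]

strict = ppv_and_sensitivity(labels, probs, 0.65)   # fewer, more certain positives
default = ppv_and_sensitivity(labels, probs, 0.5)
lenient = ppv_and_sensitivity(labels, probs, 0.25)  # more, less certain positives
print(strict, default, lenient)
```

In this toy data, raising the cut-off increases ppv and lowers sensitivity, and lowering it does the reverse, the same direction of change the abstract reports.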


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Jong Ho Kim ◽  
Haewon Kim ◽  
Ji Su Jang ◽  
Sung Mi Hwang ◽  
So Young Lim ◽  
...  

Abstract Background Predicting a difficult airway is challenging in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy using machine learning, with neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods Variables for prediction of difficult laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 and 4 by the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with equal distribution of difficult laryngoscopy. Five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine) were trained on the training set. The prediction models were validated on the test set. Results The random forest model performed best (area under receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86], area under precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions Machine learning can predict difficult laryngoscopy through a combination of several predictors including neck circumference and thyromental height. The performance of the model can be improved with more data, a new variable and combination of models.
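The stratified 80%/20% split mentioned in the Methods above keeps the proportion of difficult cases equal in both sets. A minimal sketch of one way to do this is shown below; the function name, the seed and the toy class balance are assumptions, and the study's actual procedure may differ.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same test fraction."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # randomize within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical toy labels: 10 difficult cases (1) among 100 patients.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Splitting within each class matters most when the positive class is rare, because a plain random split could leave the test set with too few positive cases to evaluate on.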


2021 ◽  
Author(s):  
Seong Hwan Kim ◽  
Eun-Tae Jeon ◽  
Sungwook Yu ◽  
Kyungmi O ◽  
Chi Kyung Kim ◽  
...  

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726 - 0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.


Machine learning algorithms are an effective way to obtain results from Big Data, and many applications can produce such outcomes. In particular, the Random Forest (RF), Gradient Boosting Machine (GBM), and Decision Tree (DT) algorithms in Python can give higher accuracy when classifying the various parameters of airline passenger satisfaction levels. Airline passenger records provide a large volume of data that can be interpreted through different satisfaction parameters. An algorithm has to classify these data accurately. Some methods may provide less precision, and conventional techniques risk discarding or losing information. RF and GBM are therefore used to handle the complexity of the data and improve accuracy. The aim of this study is to identify an algorithm suitable for classifying the satisfaction level of airline passengers using data analytics in Python, given known outputs. Optimizing and implementing the independent variables through training and testing for accuracy in Python determined the variation between the parameters and identified RF and GBM as better algorithms than the other classification algorithms.


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are ineffective on encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
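The area under the curve values quoted above summarize how well a classifier ranks malicious samples above benign ones. A minimal sketch of the pairwise (rank-based) definition of ROC AUC follows; the function name and the toy scores are hypothetical and unrelated to the paper's dataset.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n_pos * n_neg) form is meant for illustration, not large datasets.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 1.0 means every positive sample scores above every negative one, and 0.5 corresponds to random ranking.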


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
B. A Omodunbi

Diabetes mellitus is a health disorder that occurs when the blood sugar level becomes extremely high due to body resistance in producing the required amount of insulin. The ailment is among the major causes of death in Nigeria and the world at large. This study was carried out to detect diabetes mellitus by developing a hybrid model that comprises two machine learning models, namely Light Gradient Boosting Machine (LGBM) and K-Nearest Neighbor (KNN). This research is aimed at developing a machine learning model for detecting the occurrence of diabetes in patients. The performance metrics employed in evaluating the findings of this study are the Receiver Operating Characteristic (ROC) curve, five-fold cross-validation, precision, and accuracy score. The proposed system had an accuracy of 91%, and the area under the Receiver Operating Characteristic curve was 93%. The experimental result shows that the prediction accuracy of the hybrid model is better than traditional machine learning
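Five-fold cross-validation, mentioned in the abstract above, partitions the samples into five folds and uses each fold once for validation while training on the other four. The sketch below generates such index splits; it assumes contiguous folds without shuffling, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (training_indices, validation_indices) for each of k folds.

    Folds are contiguous and unshuffled (an assumption of this sketch).
    The first n_samples % k folds get one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size

    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(12, k=5))
for training, validation in splits:
    print(validation)
```

Averaging a metric over the five validation folds gives an estimate that uses every sample for validation exactly once.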


2021 ◽  
pp. 1-29
Author(s):  
Fikrewold H. Bitew ◽  
Corey S. Sparks ◽  
Samuel H. Nyarko

Abstract Objective: Child undernutrition is a global public health problem with serious implications. This study estimates predictive algorithms for the determinants of childhood stunting using various machine learning (ML) algorithms. Design: This study draws on data from the Ethiopian Demographic and Health Survey of 2016. Five machine learning algorithms, including eXtreme gradient boosting (xgbTree), k-nearest neighbors (K-NN), random forest (RF), neural network (NNet), and the generalized linear models (GLM), were considered to predict the socio-demographic risk factors for undernutrition in Ethiopia. Setting: Households in Ethiopia. Participants: A total of 9,471 children below five years of age. Results: The descriptive results show substantial regional variations in child stunting, wasting, and underweight in Ethiopia. Also, among the five ML algorithms, the xgbTree algorithm shows a better prediction ability than the generalized linear mixed algorithm. The best predicting algorithm (xgbTree) shows diverse important predictors of undernutrition across the three outcomes, which include time to water source, anemia history, child age greater than 30 months, small birth size, and maternal underweight, among others. Conclusions: The xgbTree algorithm was a reasonably superior ML algorithm for predicting childhood undernutrition in Ethiopia compared to the other ML algorithms considered in this study. The findings support improvement in access to water supply, food security, and fertility regulation, among others, in the quest to considerably improve childhood nutrition in Ethiopia.

