A Novel Machine Learning Strategy for the Prediction of Antihypertensive Peptides Derived from Food with High Efficiency

Strategies to screen antihypertensive peptides with high throughput and rapid speed will doubtlessly contribute to the treatment of hypertension. Food-derived antihypertensive peptides can reduce blood pressure without side effects. In the present study, a novel model based on the eXtreme Gradient Boosting (XGBoost) algorithm was developed and compared with the dominating machine learning models. To further reflect on the reliability of the method in a real situation, the optimized XGBoost model was utilized to predict the antihypertensive degree of the k-mer peptides cutting from six key proteins in bovine milk, and the peptide–protein docking technology was introduced to verify the findings. The results showed that the XGBoost model achieved outstanding performance, with an accuracy of 86.50% and area under the receiver operating characteristic curve of 94.11%, which were better than the other models. Using the XGBoost model, the prediction of antihypertensive peptides derived from milk protein was consistent with the peptide–protein docking results, and was more efficient. Our results indicate that using the XGBoost algorithm as a novel auxiliary tool is feasible to screen for antihypertensive peptides derived from food, with high throughput and high efficiency.

Download Full-text

A Novel Machine Learning Strategy for Prediction of Antihypertensive Peptides Derived from Food with High Efficiency

10.1101/2020.08.12.248955 ◽

2020 ◽

Author(s):

Liyang Wang ◽

Dantong Niu ◽

Xiaoya Wang ◽

Qun Shen ◽

Yong Xue

Keyword(s):

Machine Learning ◽

High Throughput ◽

High Efficiency ◽

Characteristic Curve ◽

Bovine Milk ◽

Structural Features ◽

Protein Docking ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Antihypertensive Peptides

AbstractStrategies to screen antihypertensive peptides with high throughput and rapid speed will be doubtlessly contributed to the treatment of hypertension. The food-derived antihypertensive peptides can reduce blood pressure without side effects. In present study, a novel model based on Extreme Gradient Boosting (XGBoost) algorithm was developed using the primary structural features of the food-derived peptides, and its performance in the prediction of antihypertensive peptides was compared with the dominating machine learning models. To further reflect the reliability of the method in real situation, the optimized XGBoost model was utilized to predict the antihypertensive degree of k-mer peptides cutting from 6 key proteins in bovine milk and the peptide-protein docking technology was introduced to verify the findings. The results showed that the XGBoost model achieved outstanding performance with the accuracy of 0.9841 and the area under the receiver operating characteristic curve of 0.9428, which were better than the other models. Using the XGBoost model, the prediction of antihypertensive peptides derived from milk protein was consistent with the peptide-protein docking results, and was more efficient. Our results indicate that using XGBoost algorithm as a novel auxiliary tool is feasible for screening antihypertensive peptide derived from food with high throughput and high efficiency.

Download Full-text

An Interpretable Early Dynamic Sequential Predictor for Sepsis-Induced Coagulopathy Progression in the Real-World Using Machine Learning

Frontiers in Medicine ◽

10.3389/fmed.2021.775047 ◽

2021 ◽

Vol 8 ◽

Author(s):

Ruixia Cui ◽

Wenbo Hua ◽

Kai Qu ◽

Heran Yang ◽

Yingmu Tong ◽

...

Keyword(s):

Machine Learning ◽

Real World ◽

Time Series Data ◽

Time Window ◽

Medical Center ◽

Characteristic Curve ◽

Series Data ◽

Gradient Boosting ◽

Early Management ◽

Extreme Gradient Boosting

Sepsis-associated coagulation dysfunction greatly increases the mortality of sepsis. Irregular clinical time-series data remains a major challenge for AI medical applications. To early detect and manage sepsis-induced coagulopathy (SIC) and sepsis-associated disseminated intravascular coagulation (DIC), we developed an interpretable real-time sequential warning model toward real-world irregular data. Eight machine learning models including novel algorithms were devised to detect SIC and sepsis-associated DIC 8n (1 ≤ n ≤ 6) hours prior to its onset. Models were developed on Xi'an Jiaotong University Medical College (XJTUMC) and verified on Beth Israel Deaconess Medical Center (BIDMC). A total of 12,154 SIC and 7,878 International Society on Thrombosis and Haemostasis (ISTH) overt-DIC labels were annotated according to the SIC and ISTH overt-DIC scoring systems in train set. The area under the receiver operating characteristic curve (AUROC) were used as model evaluation metrics. The eXtreme Gradient Boosting (XGBoost) model can predict SIC and sepsis-associated DIC events up to 48 h earlier with an AUROC of 0.929 and 0.910, respectively, and even reached 0.973 and 0.955 at 8 h earlier, achieving the highest performance to date. The novel ODE-RNN model achieved continuous prediction at arbitrary time points, and with an AUROC of 0.962 and 0.936 for SIC and DIC predicted 8 h earlier, respectively. In conclusion, our model can predict the sepsis-associated SIC and DIC onset up to 48 h in advance, which helps maximize the time window for early management by physicians.

Download Full-text

The prediction of asymptomatic carotid atherosclerosis with electronic health records: a comparative study of six machine learning models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01480-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jiaxin Fan ◽

Mengying Chen ◽

Jian Luo ◽

Shusen Yang ◽

Jinming Shi ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Carotid Atherosclerosis ◽

Characteristic Curve ◽

Gradient Boosting ◽

Learning Models ◽

Health Records ◽

Extreme Gradient Boosting ◽

Electronic Health ◽

Machine Learning Models

Abstract Background Screening carotid B-mode ultrasonography is a frequently used method to detect subjects with carotid atherosclerosis (CAS). Due to the asymptomatic progression of most CAS patients, early identification is challenging for clinicians, and it may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and a potential for prediction in the medical field. The combined use of machine learning and the electronic health records of patients could provide clinicians with a more convenient and precise method to identify asymptomatic CAS. Methods Retrospective cohort study using routine clinical data of medical check-up subjects from April 19, 2010 to November 15, 2019. Six machine learning models (logistic regression [LR], random forest [RF], decision tree [DT], eXtreme Gradient Boosting [XGB], Gaussian Naïve Bayes [GNB], and K-Nearest Neighbour [KNN]) were used to predict asymptomatic CAS and compared their predictability in terms of the area under the receiver operating characteristic curve (AUCROC), accuracy (ACC), and F1 score (F1). Results Of the 18,441 subjects, 6553 were diagnosed with asymptomatic CAS. Compared to DT (AUCROC 0.628, ACC 65.4%, and F1 52.5%), the other five models improved prediction: KNN + 7.6% (0.704, 68.8%, and 50.9%, respectively), GNB + 12.5% (0.753, 67.0%, and 46.8%, respectively), XGB + 16.0% (0.788, 73.4%, and 55.7%, respectively), RF + 16.6% (0.794, 74.5%, and 56.8%, respectively) and LR + 18.1% (0.809, 74.7%, and 59.9%, respectively). The highest achieving model, LR predicted 1045/1966 cases (sensitivity 53.2%) and 3088/3566 non-cases (specificity 86.6%). A tenfold cross-validation scheme further verified the predictive ability of the LR. Conclusions Among machine learning models, LR showed optimal performance in predicting asymptomatic CAS. Our findings set the stage for an early automatic alarming system, allowing a more precise allocation of CAS prevention measures to individuals probably to benefit most.

Download Full-text

A Comparative Analysis of Machine Learning Models for Prediction of Insurance Uptake in Kenya

10.20944/preprints202010.0186.v1 ◽

2020 ◽

Author(s):

Nelson Yego ◽

Juma Kasozi ◽

Joseph Nkrunziza

Keyword(s):

Machine Learning ◽

Random Forest ◽

Characteristic Curve ◽

Confusion Matrix ◽

Gradient Boosting ◽

Support Vector ◽

Sampled Data ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

The role of insurance in financial inclusion as well as in economic growth is immense. However, low uptake seems to impede the growth of the sector hence the need for a model that robustly predicts uptake of insurance among potential clients. In this research, we compared the performances of eight (8) machine learning models in predicting the uptake of insurance. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines and Extreme Gradient boosting. The data used in the classification was from the 2016 Kenya FinAccess Household Survey. Comparison of performance was done for both upsampled and downsampled data due to data imbalance. For upsampled data, Random Forest classifier showed highest accuracy and precision compared to other classifiers but for down sampled data, gradient boosting was optimal. It is noteworthy that for both upsampled and downsampled data, tree-based classifiers were more robust than others in insurance uptake prediction. However, in spite of hyper-parameter optimization, the area under receiver operating characteristic curve remained highest for Random Forest as compared to other tree-based models. Also, the confusion matrix for Random Forest showed least false positives, and highest true positives hence could be construed as the most robust model for predicting the insurance uptake. Finally, the most important feature in predicting uptake was having a bank product hence bancassurance could be said to be a plausible channel of distribution of insurance products.

Download Full-text

MRI Radiomic Features to Predict IDH1 Mutation Status in Gliomas: A Machine Learning Approach using Gradient Tree Boosting

International Journal of Molecular Sciences ◽

10.3390/ijms21218004 ◽

2020 ◽

Vol 21 (21) ◽

pp. 8004

Author(s):

Yu Sakai ◽

Chen Yang ◽

Shingo Kihira ◽

Nadejda Tsankova ◽

Fahad Khan ◽

...

Keyword(s):

Machine Learning ◽

Characteristic Curve ◽

Area Under The Curve ◽

Prognostic Indicator ◽

Idh1 Mutation ◽

Gradient Boosting ◽

Isocitrate Dehydrogenase 1 ◽

Test Set ◽

Mutation Status ◽

Extreme Gradient Boosting

In patients with gliomas, isocitrate dehydrogenase 1 (IDH1) mutation status has been studied as a prognostic indicator. Recent advances in machine learning (ML) have demonstrated promise in utilizing radiomic features to study disease processes in the brain. We investigate whether ML analysis of multiparametric radiomic features from preoperative Magnetic Resonance Imaging (MRI) can predict IDH1 mutation status in patients with glioma. This retrospective study included patients with glioma with known IDH1 status and preoperative MRI. Radiomic features were extracted from Fluid-Attenuated Inversion Recovery (FLAIR) and Diffusion-Weighted-Imaging (DWI). The dataset was split into training, validation, and testing sets by stratified sampling. Synthetic Minority Oversampling Technique (SMOTE) was applied to the training sets. eXtreme Gradient Boosting (XGBoost) classifiers were trained, and the hyperparameters were tuned. Receiver operating characteristic curve (ROC), accuracy, and f1-scores were collected. A total of 100 patients (age: 55 ± 15, M/F 60/40); with IDH1 mutant (n = 22) and IDH1 wildtype (n = 78) were included. The best performance was seen with a DWI-trained XGBoost model, which achieved ROC with Area Under the Curve (AUC) of 0.97, accuracy of 0.90, and f1-score of 0.75 on the test set. The FLAIR-trained XGBoost model achieved ROC with AUC of 0.95, accuracy of 0.90, f1-score of 0.75 on the test set. A model that was trained on combined FLAIR-DWI radiomic features did not provide incremental accuracy. The results show that a XGBoost classifier using multiparametric radiomic features derived from preoperative MRI can predict IDH1 mutation status with > 90% accuracy.

Download Full-text

Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population

npj Digital Medicine ◽

10.1038/s41746-020-00331-1 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Andrew Ward ◽

Ashish Sarraju ◽

Sukyung Chung ◽

Jiang Li ◽

Robert Harrington ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Risk Prediction ◽

Disease Risk ◽

Characteristic Curve ◽

Cardiovascular Disease Risk ◽

Gradient Boosting ◽

Atherosclerotic Cardiovascular Disease ◽

Extreme Gradient Boosting ◽

Ascvd Risk

Abstract The pooled cohort equations (PCE) predict atherosclerotic cardiovascular disease (ASCVD) risk in patients with characteristics within prespecified ranges and has uncertain performance among Asians or Hispanics. It is unknown if machine learning (ML) models can improve ASCVD risk prediction across broader diverse, real-world populations. We developed ML models for ASCVD risk prediction for multi-ethnic patients using an electronic health record (EHR) database from Northern California. Our cohort included patients aged 18 years or older with no prior CVD and not on statins at baseline (n = 262,923), stratified by PCE-eligible (n = 131,721) or PCE-ineligible patients based on missing or out-of-range variables. We trained ML models [logistic regression with L2 penalty and L1 lasso penalty, random forest, gradient boosting machine (GBM), extreme gradient boosting] and determined 5-year ASCVD risk prediction, including with and without incorporation of additional EHR variables, and in Asian and Hispanic subgroups. A total of 4309 patients had ASCVD events, with 2077 in PCE-ineligible patients. GBM performance in the full cohort, including PCE-ineligible patients (area under receiver-operating characteristic curve (AUC) 0.835, 95% confidence interval (CI): 0.825–0.846), was significantly better than that of the PCE in the PCE-eligible cohort (AUC 0.775, 95% CI: 0.755–0.794). Among patients aged 40–79, GBM performed similarly before (AUC 0.784, 95% CI: 0.759–0.808) and after (AUC 0.790, 95% CI: 0.765–0.814) incorporating additional EHR data. Overall, ML models achieved comparable or improved performance compared to the PCE while allowing risk discrimination in a larger group of patients including PCE-ineligible patients. EHR-trained ML models may help bridge important gaps in ASCVD risk prediction.

Download Full-text

Machine learning to predict distal caries in mandibular second molars associated with impacted third molars

Scientific Reports ◽

10.1038/s41598-021-95024-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung-Hwi Hur ◽

Eun-Young Lee ◽

Min-Kyung Kim ◽

Somi Kim ◽

Ji-Yeon Kang ◽

...

Keyword(s):

Machine Learning ◽

Decision Making ◽

Clinical Decision Making ◽

Prediction Models ◽

Contact Point ◽

Characteristic Curve ◽

Gradient Boosting ◽

Support Vector ◽

Third Molars ◽

Extreme Gradient Boosting

AbstractImpacted mandibular third molars (M3M) are associated with the occurrence of distal caries on the adjacent mandibular second molars (DCM2M). In this study, we aimed to develop and validate five machine learning (ML) models designed to predict the occurrence of DCM2Ms due to the proximity with M3Ms and determine the relative importance of predictive variables for DCM2Ms that are important for clinical decision making. A total of 2642 mandibular second molars adjacent to M3Ms were analyzed and DCM2Ms were identified in 322 cases (12.2%). The models were trained using logistic regression, random forest, support vector machine, artificial neural network, and extreme gradient boosting ML methods and were subsequently validated using testing datasets. The performance of the ML models was significantly superior to that of single predictors. The area under the receiver operating characteristic curve of the machine learning models ranged from 0.88 to 0.89. Six features (sex, age, contact point at the cementoenamel junction, angulation of M3Ms, Winter's classification, and Pell and Gregory classification) were identified as relevant predictors. These prediction models could be used to detect patients at a high risk of developing DCM2M and ultimately contribute to caries prevention and treatment decision-making for impacted M3Ms.

Download Full-text

Machine Learning-Based Cardiovascular Disease Prediction Model: A Cohort Study on the Korean National Health Insurance Service Health Screening Database

Diagnostics ◽

10.3390/diagnostics11060943 ◽

2021 ◽

Vol 11 (6) ◽

pp. 943

Author(s):

Joung Ouk (Ryan) Kim ◽

Yong-Suk Jeong ◽

Jin Ho Kim ◽

Jong-Weon Lee ◽

Dougho Park ◽

...

Keyword(s):

Machine Learning ◽

Health Insurance ◽

Prediction Model ◽

National Health Insurance ◽

National Health ◽

Prediction Models ◽

Characteristic Curve ◽

Health Screening ◽

Gradient Boosting ◽

Extreme Gradient Boosting

Background: This study proposes a cardiovascular diseases (CVD) prediction model using machine learning (ML) algorithms based on the National Health Insurance Service-Health Screening datasets. Methods: We extracted 4699 patients aged over 45 as the CVD group, diagnosed according to the international classification of diseases system (I20–I25). In addition, 4699 random subjects without CVD diagnosis were enrolled as a non-CVD group. Both groups were matched by age and gender. Various ML algorithms were applied to perform CVD prediction; then, the performances of all the prediction models were compared. Results: The extreme gradient boosting, gradient boosting, and random forest algorithms exhibited the best average prediction accuracy (area under receiver operating characteristic curve (AUROC): 0.812, 0.812, and 0.811, respectively) among all algorithms validated in this study. Based on AUROC, the ML algorithms improved the CVD prediction performance, compared to previously proposed prediction models. Preexisting CVD history was the most important factor contributing to the accuracy of the prediction model, followed by total cholesterol, low-density lipoprotein cholesterol, waist-height ratio, and body mass index. Conclusions: Our results indicate that the proposed health screening dataset-based CVD prediction model using ML algorithms is readily applicable, produces validated results and outperforms the previous CVD prediction models.

Download Full-text

Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab092 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Maitena Tellaetxe-Abete ◽

Borja Calvo ◽

Charles Lawrie

Keyword(s):

Machine Learning ◽

Characteristic Curve ◽

R Package ◽

Classification Performance ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Molecular Testing ◽

Sequencing Data ◽

Genomic Context ◽

Extreme Gradient Boosting

Abstract Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. In order to address this issue, we designed a machine learning-based algorithm to identify these artefacts using data from >1 600 000 variants from 27 paired FFPE and fresh-frozen breast cancer samples. Using these data, we assembled a series of variant features and evaluated the classification performance of five machine learning algorithms. Using leave-one-sample-out cross-validation, we found that XGBoost (extreme gradient boosting) and random forest obtained AUC (area under the receiver operating characteristic curve) values >0.86. Performance was further tested using two independent datasets that resulted in AUC values of 0.96, whereas a comparison with previously published tools resulted in a maximum AUC value of 0.92. The most discriminating features were read pair orientation bias, genomic context and variant allele frequency. In summary, our results show a promising future for the use of these samples in molecular testing. We built the algorithm into an R package called Ideafix (DEAmination FIXing) that is freely available at https://github.com/mmaitenat/ideafix.

Download Full-text

Clinically applicable rapid susceptibility testing of multi-drug resistant Staphylococcus aureus by mass spectrometry and extreme gradient boosting machine

10.1101/2021.10.05.463151 ◽

2021 ◽

Author(s):

Zhuo Wang ◽

Hsin-Yao Wang ◽

Yuxuan Pang ◽

Chia-Ru Chung ◽

Jorng-Tzong Horng ◽

...

Keyword(s):

Machine Learning ◽

Mass Spectrometry ◽

Staphylococcus Aureus ◽

Characteristic Curve ◽

Gradient Boosting ◽

Learning Technology ◽

Drug Resistant ◽

Antibiotic Susceptibility Test ◽

Extreme Gradient Boosting ◽

Severe Infections

Multi drug resistant Staphylococcus aureus is one of the major causes of severe infections. Due to the delays of conventional antibiotic susceptibility test (AST), most cases were prescribed by experience with a lower recovery rate. Linking a 7 year study of over 20,000 Staphylococcus aureus infected patients, we incorporated mass spectrometry and machine learning technology to predict the susceptibilities of patients for 4 different antibiotics that can enable early antibiotic decisions. The predictive models were externally validated in an independent patient cohort, resulting in an area under the receiver operating characteristic curve of 0.94 , 0.90, 50 0.86, 0.91 and an area under the precision recall curve of 0.93, 0.87, 0.87, 0.81 for oxacillin (OXA), clindamycin (CLI), erythromycin (ERY) and trimethoprim sulfamethoxazole (SXT), respectively. Moreover, our pipeline provides AST 24-36 h faster than standard workflows, reduction of inappropriate antibiotic usage with preclinical prediction, and demonstrates the potential of combining mass spectrometry with machine learning (ML) to assist early and accurate prescription. Therapies to individual patients could be tailored in the process of precision medicine.

Download Full-text