Robust Performance of Potentially Functional SNPs in Machine Learning Models for the Prediction of Atorvastatin-Induced Myalgia

Background: Statins can cause muscle symptoms, resulting in poor adherence to therapy and increased cardiovascular risk. We hypothesize that combinations of potentially functional single nucleotide polymorphisms (pfSNPs), rather than individual SNPs, better predict myalgia in patients on atorvastatin. This study assesses the value of pfSNPs and employs six machine learning algorithms to identify the combination of SNPs that best predicts myalgia. Methods: Whole genome sequencing of 183 Chinese, Malay and Indian patients from Singapore was conducted to identify genetic variants associated with atorvastatin-induced myalgia. To adjust for confounding factors, demographic and clinical characteristics were also examined for their association with myalgia. The top factor, sex, was then used as a covariate in the whole genome association analyses. Variants that were highly associated with myalgia in this and previous studies were extracted, assessed for potential functionality (pfSNPs) and incorporated into six machine learning models. The predictive performance of different combinations of models and inputs was compared using the average cross-validation area under the ROC curve (AUC). The minimum combination of SNPs that achieved maximum sensitivity and specificity, as determined by AUC, and that predicted atorvastatin-induced myalgia in most, if not all, of the six machine learning models was then determined. Results: Through whole genome association analyses using sex as a covariate, a larger proportion of pfSNPs than of non-pfSNPs were found to be highly associated with myalgia. Although none of the individual SNPs achieved genome-wide significance in univariate analyses, machine learning models identified a combination of 15 SNPs that predict myalgia with good predictive performance (AUC >0.9). SNPs within genes identified in this study significantly outperformed SNPs within genes previously reported to be associated with myalgia. pfSNPs were found to be more robust in predicting myalgia, outperforming non-pfSNPs in the majority of machine learning models tested. Conclusion: Combinations of pfSNPs that were consistently identified by different machine learning models as having high predictive performance have good potential to be clinically useful for predicting atorvastatin-induced myalgia once validated against an independent cohort of patients.
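The abstract above compares models by their average cross-validation AUC. As a rough illustration only, the sketch below computes AUC through the rank-sum (Mann-Whitney) identity and averages it over held-out folds. The function names and the fold data are hypothetical and are not taken from the study.

```python
from statistics import mean


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count case/control pairs where the case scores higher; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_cv_auc(folds):
    """Average AUC over held-out folds; each fold is (labels, scores)."""
    return mean(auc(y, s) for y, s in folds)


# Made-up held-out predictions for two folds (1 = myalgia, 0 = no myalgia).
folds = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]),
    ([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1]),
]
print(mean_cv_auc(folds))  # 0.875
```

Averaging per-fold AUCs is one common convention; pooling all held-out predictions before computing a single AUC is another, and the abstract does not say which was used.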


Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽

10.2174/2213275912666191102162959 ◽

2020 ◽

Vol 28 (2) ◽

pp. 253-265 ◽

Cited By ~ 3

Author(s):

Gabriela Bitencourt-Ferreira ◽

Amauri Duarte da Silva ◽

Walter Filgueira de Azevedo

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Cyclin Dependent Kinase ◽

Learning Models ◽

Learning Techniques ◽

Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.
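The abstract above judges scoring functions by how well their predictions correlate with experimental binding data, without naming the correlation statistic. A minimal sketch, assuming Pearson's r as the measure; `pearson_r` and the affinity values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between predicted and experimental values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up example: experimental vs. predicted affinities (e.g. as pKi).
experimental = [5.1, 6.3, 7.0, 7.8, 8.4]
predicted = [5.4, 6.0, 7.2, 7.5, 8.6]
print(round(pearson_r(experimental, predicted), 3))
```

A rank-based measure such as Spearman's coefficient is a common alternative when only the ordering of ligands matters.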


Whole-genome association analyses for lifetime reproductive traits in the pig

Journal of Animal Science ◽

10.2527/jas.2010-3236 ◽

2011 ◽

Vol 89 (4) ◽

pp. 988-995 ◽

Cited By ~ 54

Author(s):

S. K. Onteru ◽

B. Fan ◽

M. T. Nikkilä ◽

D. J. Garrick ◽

K. J. Stalder ◽

...

Keyword(s):

Reproductive Traits ◽

Whole Genome ◽

Association Analyses ◽

Genome Association ◽

Whole Genome Association


Temporal and Spatial Autocorrelation as Determinants of Regional AOD-PM2.5 Model Performance in the Middle East

Remote Sensing ◽

10.3390/rs13183790 ◽

2021 ◽

Vol 13 (18) ◽

pp. 3790

Author(s):

Khang Chau ◽

Meredith Franklin ◽

Huikyo Lee ◽

Michael Garay ◽

Olga Kalashnikova

Keyword(s):

Machine Learning ◽

Middle East ◽

United Arab Emirates ◽

Atmospheric Correction ◽

Predictive Performance ◽

Variable Importance ◽

Learning Models ◽

Median Test ◽

Temporal And Spatial ◽

Machine Learning Models

Exposure to fine particulate matter (PM2.5) air pollution has been shown in numerous studies to be associated with detrimental health effects. However, the ability to conduct epidemiological assessments can be limited due to challenges in generating reliable PM2.5 estimates, particularly in parts of the world such as the Middle East where measurements are scarce and extreme meteorological events such as sandstorms are frequent. In order to supplement exposure modeling efforts under such conditions, satellite-retrieved aerosol optical depth (AOD) has proven to be useful due to its global coverage. By using AODs from the Multiangle Implementation of Atmospheric Correction (MAIAC) of the MODerate Resolution Imaging Spectroradiometer (MODIS) and the Multiangle Imaging Spectroradiometer (MISR) combined with meteorological and assimilated aerosol information from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), we constructed machine learning models to predict PM2.5 in the area surrounding the Persian Gulf, including Kuwait, Bahrain, and the United Arab Emirates (U.A.E.). Our models showed regional differences in predictive performance, with better results in the U.A.E. (median test R² = 0.66) than Kuwait (median test R² = 0.51). Variable importance also differed by region, where satellite-retrieved AOD variables were more important for predicting PM2.5 in Kuwait than in the U.A.E. Divergent trends in the temporal and spatial autocorrelations of PM2.5 and AOD in the two regions offered possible explanations for differences in predictive performance and variable importance. In a test of model transferability, we found that models trained in one region and applied to another did not predict PM2.5 well, even if the transferred model had better performance. Overall, the results of our study suggest that models developed over large geographic areas could generate PM2.5 estimates with greater uncertainty than could be obtained by taking a regional modeling approach. Furthermore, development of methods to better incorporate spatial and temporal autocorrelations in machine learning models warrants further examination.
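The abstract above attributes regional differences to temporal and spatial autocorrelation but does not say how autocorrelation was estimated. As an assumption-laden sketch, the standard sample autocorrelation at a given time lag could be computed as follows; the function name and the series are hypothetical, and spatial autocorrelation is not covered.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a time series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Covariance between the series and a copy of itself shifted by `lag`.
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom


# Made-up daily PM2.5 values (ug/m3): a slowly varying series tends to
# give a positive lag-1 value, a rapidly alternating one a negative value.
smooth = [40.0, 42.0, 45.0, 47.0, 50.0, 52.0]
print(autocorrelation(smooth, lag=1))
```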


A predictive performance comparison of machine learning models for judicial cases

2017 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci.2017.8285436 ◽

2017 ◽

Cited By ~ 4

Author(s):

Zhenyu Liu ◽

Huanhuan Chen

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

Performance Comparison ◽

Learning Models ◽

Machine Learning Models


Development of Combined Heavy Rain Damage Prediction Models with Machine Learning

Water ◽

10.3390/w11122516 ◽

2019 ◽

Vol 11 (12) ◽

pp. 2516 ◽

Cited By ~ 1

Author(s):

Changhyun Choi ◽

Jeonghwan Kim ◽

Jungwook Kim ◽

Hung Soo Kim

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Prediction Model ◽

Prediction Models ◽

Predictive Performance ◽

Heavy Rain ◽

Learning Models ◽

Damage Prediction ◽

Natural Disaster Management ◽

Machine Learning Models

Adequate forecasting and preparation for heavy rain can minimize loss of life and property damage. Some studies have been conducted on the heavy rain damage prediction model (HDPM); however, most of their models are limited to linear regression, which explains only the linear relation between rainfall data and damage. This study develops the combined heavy rain damage prediction model (CHDPM), in which a residual prediction model (RPM) is added to the HDPM. The predictive performance of the CHDPM was found to be 4–14% higher than that of the HDPM. This confirms that combining the HDPM with a machine learning RPM improves predictive performance by compensating for the linearity of the HDPM. The results of this study can be used as basic data for natural disaster management.
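The abstract above describes adding a residual prediction model to a linear damage model. The sketch below illustrates that general idea only: a least-squares line plays the role of the linear model, and a k-nearest-neighbour average of training residuals stands in for the residual model. The abstract does not say which machine learning model was used, so the k-NN choice, the function names, and the data are all assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def knn_residual(x_train, residuals, x_new, k=2):
    """Stand-in residual model: mean residual of the k nearest points."""
    nearest = sorted(zip(x_train, residuals),
                     key=lambda p: abs(p[0] - x_new))[:k]
    return sum(r for _, r in nearest) / len(nearest)


def combined_predict(x_train, y_train, x_new, k=2):
    """Linear prediction plus a predicted residual correction."""
    slope, intercept = fit_line(x_train, y_train)
    residuals = [b - (slope * a + intercept)
                 for a, b in zip(x_train, y_train)]
    base = slope * x_new + intercept
    return base + knn_residual(x_train, residuals, x_new, k)


# Made-up rainfall (mm) and damage values with a nonlinear relation.
rain = [0.0, 1.0, 2.0, 3.0]
damage = [0.0, 1.0, 4.0, 9.0]
print(combined_predict(rain, damage, 2.5))
```

When the data are exactly linear, all residuals are zero and the combined prediction reduces to the linear one, so the correction only matters where the line misfits.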


A Bayesian latent class analysis for whole-genome association analyses: an illustration using the GAW15 simulated rheumatoid arthritis dense scan data

BMC Proceedings ◽

10.1186/1753-6561-1-s1-s112 ◽

2007 ◽

Vol 1 (S1) ◽

Cited By ~ 7

Author(s):

Fredrick R Schumacher ◽

Peter Kraft

Keyword(s):

Rheumatoid Arthritis ◽

Latent Class Analysis ◽

Latent Class ◽

Whole Genome ◽

Class Analysis ◽

Association Analyses ◽

Genome Association ◽

Whole Genome Association ◽

Scan Data ◽

Bayesian Latent Class Analysis


Machine Learning to Predict 10-year Cardiovascular Mortality from the Electrocardiogram: Analysis of the Third National Health and Nutrition Examination Survey (NHANES III)

10.1101/2021.09.09.21263327 ◽

2021 ◽

Author(s):

Chang H Kim ◽

Sadeer Al-Kindi ◽

Yasir Tarabichi ◽

Suril Gohel ◽

Riddhi Vyas ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Mortality ◽

Predictive Performance ◽

NHANES III ◽

Nutrition Examination Survey ◽

Learning Models ◽

The Third ◽

Health And Nutrition ◽

Machine Learning Models ◽

ECG Data

Background: The value of the electrocardiogram (ECG) for predicting long-term cardiovascular outcomes is not well defined. Machine learning methods are well suited for analysis of highly correlated data such as that from the ECG. Methods: Using demographic, clinical, and 12-lead ECG data from the Third National Health and Nutrition Examination Survey (NHANES III), machine learning models were trained to predict 10-year cardiovascular mortality in ambulatory U.S. adults. Predictive performance of each model was assessed using area under receiver operating characteristic curve (AUROC), area under precision-recall curve (AUPRC), sensitivity, and specificity. These were compared to the 2013 American College of Cardiology/American Heart Association Pooled Cohort Equations (PCE). Results: 7,067 study participants (mean age: 59.2 ± 13.4 years, female: 52.5%, white: 73.9%, black: 23.3%) were included. At 10 years of follow-up, 338 (4.8%) had died from cardiac causes. Compared to the PCE (AUROC: 0.668, AUPRC: 0.125, sensitivity: 0.492, specificity: 0.859), machine learning models only required demographic and ECG data to achieve comparable performance: logistic regression (AUROC: 0.754, AUPRC: 0.141, sensitivity: 0.747, specificity: 0.759), neural network (AUROC: 0.764, AUPRC: 0.149, sensitivity: 0.722, specificity: 0.787), and ensemble model (AUROC: 0.695, AUPRC: 0.166, sensitivity: 0.468, specificity: 0.912). Additional clinical data did not improve the predictive performance of machine learning models. In variable importance analysis, important ECG features clustered in inferior and lateral leads. Conclusions: Machine learning can be applied to demographic and ECG data to predict 10-year cardiovascular mortality in ambulatory adults, with potentially important implications for primary prevention.
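The abstract above reports sensitivity and specificity alongside AUROC and AUPRC. For readers unfamiliar with the two threshold-based metrics, here is a minimal sketch of how they follow from predicted risks and observed outcomes. The 0.5 threshold, the function name, and the data are hypothetical; the abstract does not state which operating point was used.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Return (sensitivity, specificity) at a given risk threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if y == 1:
            tp += flagged       # event correctly flagged as high risk
            fn += not flagged   # event missed
        else:
            fp += flagged       # non-event wrongly flagged
            tn += not flagged   # non-event correctly left unflagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one event and one non-event")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up outcomes (1 = died of cardiovascular causes) and predicted risks.
outcomes = [1, 1, 0, 0, 0]
risks = [0.9, 0.4, 0.6, 0.2, 0.1]
print(sensitivity_specificity(outcomes, risks))
```

Lowering the threshold trades specificity for sensitivity, which is why the abstract's models can differ on these two numbers while having similar AUROC.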


Empirical asset pricing via machine learning: evidence from the European stock market

Journal of Asset Management ◽

10.1057/s41260-021-00237-x ◽

2021 ◽

Author(s):

Wolfgang Drobetz ◽

Tizian Otto

Keyword(s):

Machine Learning ◽

Stock Returns ◽

Network Architecture ◽

Risk Measures ◽

Predictive Performance ◽

Support Vector ◽

Learning Models ◽

Learning Methods ◽

Machine Learning Methods ◽

Machine Learning Models

Abstract: This paper evaluates the predictive performance of machine learning methods in forecasting European stock returns. Compared to a linear benchmark model, interactions and nonlinear effects help improve the predictive performance. But machine learning models must be adequately trained and tuned to overcome the high dimensionality problem and to avoid overfitting. Across all machine learning methods, the most important predictors are based on price trends and fundamental signals from valuation ratios. However, the models exhibit substantial variation in statistical predictive performance that translates into pronounced differences in economic profitability. The return and risk measures of long-only trading strategies indicate that machine learning models produce sizeable gains relative to our benchmark. Neural networks perform best, even after accounting for transaction costs. A classification-based portfolio formation, utilizing a support vector machine that avoids estimating stock-level expected returns, performs even better than the neural network architecture.


Interpreting tree ensemble machine learning models with endoR

10.1101/2022.01.03.474763 ◽

2022 ◽

Author(s):

Albane Ruaud ◽

Niklas A Pfister ◽

Ruth E Ley ◽

Nicholas D Youngblut

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

R Package ◽

Metagenomic Data ◽

Learning Models ◽

Model Interpretation ◽

Ensemble Machine Learning ◽

Fermenting Bacteria ◽

Microbiome Data ◽

Machine Learning Models

Background: Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they only yield limited insights into how microbial taxa or genomic content may be associated. Results: We developed endoR, a method to interpret a fitted tree ensemble model. First, endoR simplifies the fitted model into a decision ensemble from which it then extracts information on the importance of individual features and their pairwise interactions and also visualizes these data as an interpretable network. Both the network and importance scores derived from endoR provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed the performance of endoR on both simulated and real metagenomic data. We found that endoR infers true associations with accuracy greater than or comparable to that of other commonly used approaches while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we utilized endoR to gain insights into components of the microbiome that predict the presence of human gut methanogens, as these hydrogen-consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria. Conclusion: Our method accurately captures how tree ensembles use features and interactions between them to predict a response. As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems. An implementation of endoR is available as an open-source R-package on GitHub (https://github.com/leylabmpi/endoR).


Machine learning predictive models of LDL-C in the population of eastern India and its comparison with directly measured and calculated LDL-C

Annals of Clinical Biochemistry: International Journal of Laboratory Medicine ◽

10.1177/00045632211046805 ◽

2021 ◽

pp. 000456322110468

Author(s):

Anudeep P P ◽

Suchitra Kumari ◽

Aishvarya S Rajasimman ◽

Saurav Nayak ◽

Pooja Priyadarsini

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Random Forests ◽

Predictive Performance ◽

Support Vector ◽

Learning Models ◽

Complex Interactions ◽

Clinical Biochemistry Laboratory ◽

Study Laboratory ◽

Machine Learning Models

Background: LDL-C is a strong risk factor for cardiovascular disorders. The formulas used to calculate LDL-C show varying performance in different populations. Machine learning models can capture complex interactions between variables and may predict outcomes more accurately. The current study evaluated the predictive performance of three machine learning models, namely random forests, XGBoost, and support vector regression (SVR), in predicting LDL-C from total cholesterol, triglyceride, and HDL-C, in comparison with a linear regression model and some existing formulas for LDL-C calculation, in an eastern Indian population. Methods: A total of 13,391 lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar during 2019–2021 were included in the study. Laboratory results were collected from the laboratory database. 70% of the data were assigned to the training set and used to develop the three machine learning models and the linear regression formula. These models were then validated on the remaining 30% of the data (test set). Model performance was evaluated against the six best existing LDL-C formulas. Results: LDL-C predicted by the XGBoost and random forests models showed a strong correlation with directly estimated LDL-C (r = 0.98). These two machine learning models outperformed the six existing, commonly used LDL-C formulas, such as Friedewald, in the study population. They also outperformed the other methods when compared across triglyceride strata. Conclusion: Machine learning models such as XGBoost and random forests can be used to predict LDL-C more accurately than conventional linear regression LDL-C formulas.
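The abstract above names the Friedewald formula as one of the conventional comparators and describes a 70/30 train/test split, but gives neither in detail. The sketch below shows the widely cited form of the Friedewald estimate (all values in mg/dL, generally considered unreliable at triglycerides of 400 mg/dL or above) and one simple way to make such a split. The function names, the seed, and the sample values are hypothetical, and the study's own splitting procedure is not specified.

```python
import random


def friedewald_ldl(total_chol, hdl, triglycerides):
    """Friedewald estimate of LDL-C in mg/dL: TC - HDL-C - TG / 5."""
    if triglycerides >= 400:
        raise ValueError("Friedewald estimate is unreliable at TG >= 400 mg/dL")
    return total_chol - hdl - triglycerides / 5.0


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed, then cut into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up lipid profile: TC 200, HDL-C 50, TG 150 (all mg/dL).
print(friedewald_ldl(200, 50, 150))  # 120.0

train, test = split_train_test(range(10))
print(len(train), len(test))  # 7 3
```

A fixed formula like this is the baseline that learned models are compared against: it needs no training data, whereas the machine learning models are fitted on the training portion and scored on the held-out portion.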
