Addressing Measurement Error in Random Forests using Quantitative Bias Analysis

Author(s):  
Tammy Jiang ◽  
Jaimie L Gradus ◽  
Timothy L Lash ◽  
Matthew P Fox

Abstract Although variables are often measured with error, the impact of measurement error on machine learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on random forest model performance and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the United States National Comorbidity Survey Replication (2001 - 2003). Second, we simulated datasets in which we know the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the datasets. Our findings show that measurement error in the data used to construct random forests can distort model performance and variable importance measures, and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
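As a rough illustration of the kind of correction that quantitative bias analysis performs for a misclassified binary variable, the sketch below back-calculates an expected true count from an observed count using assumed sensitivity and specificity. It is a simple summary-level calculation under the assumption of nondifferential misclassification, not the procedure used in the study; the function names and the numbers in the example are hypothetical.

```python
def observed_positive_count(true_positive, total, sensitivity, specificity):
    """Expected number classified positive when the classification is imperfect."""
    true_negative = total - true_positive
    return sensitivity * true_positive + (1.0 - specificity) * true_negative


def corrected_positive_count(observed_positive, total, sensitivity, specificity):
    """Back-calculate the expected true number of positives from an observed count.

    Inverts the relationship in observed_positive_count. It is only defined
    when sensitivity + specificity > 1, i.e. the classification is better
    than chance.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed_positive - (1.0 - specificity) * total) / denominator


# Hypothetical example: 1000 records, 200 truly positive,
# measured with sensitivity 0.8 and specificity 0.9.
observed = observed_positive_count(200, 1000, 0.8, 0.9)
recovered = corrected_positive_count(observed, 1000, 0.8, 0.9)
```

In practice the bias parameters are themselves uncertain, so analyses of this kind are often repeated over a range or distribution of plausible sensitivity and specificity values.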

Minerals ◽  
2020 ◽  
Vol 10 (5) ◽  
pp. 420
Author(s):  
Chris Aldrich

Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to some key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, and this can lead to significant complications in assessing the importance of these variables. Shapley regression is seen as the only axiomatic approach to deal with this problem but has almost exclusively been used with linear models to date. In this paper, the approach is extended to random forests, and the results are compared with some of the empirical variable importance measures widely used with these models, i.e., permutation and Gini variable importance measures. Four case studies are considered, of which two are based on simulated data and two on real world data from the mineral process industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures that were considered. Moreover, the results obtained with the Gini variable importance measure were as reliable as or better than those obtained with the permutation measure of the random forest.
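The Shapley idea behind this kind of variable importance can be sketched as follows: each predictor receives the weighted average of its marginal contribution to a model score (for example R²) over all subsets of the other predictors. The code is a generic exact Shapley computation and not the paper's implementation; the `value` callable and the toy scores for `x1` and `x2` are hypothetical stand-ins for refitting a model on each predictor subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature for a subset-valued score.

    `value` maps a frozenset of feature names to a number, such as the
    R^2 of a model fitted on that subset. The cost grows as 2^n, so this
    is only practical for a small number of features.
    """
    n = len(features)
    phi = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for k in range(n):
            # Weight of subsets of size k in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


# Hypothetical scores for two correlated predictors: together they
# explain little more than either one alone.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.6,
    frozenset({"x2"}): 0.5,
    frozenset({"x1", "x2"}): 0.7,
}
toy_shapley = shapley_values(["x1", "x2"], toy_scores.__getitem__)
```

The values sum to the score of the full model, which is what lets the total explained variation be split among correlated predictors.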


2018 ◽  
Author(s):  
Chetna Kumari ◽  
Naidu Subbarao ◽  
Muhammad Abulaish

Abstract Autophagy (in Greek: self-eating) is the cellular process for delivery of heterogenic intracellular material to lysosomal digestion. Protein kinases are integral to the autophagy process, and when dysregulated or mutated cause several human diseases. Atg1, the first autophagy-related protein identified, is a serine/threonine protein kinase (STPK). mTOR (mammalian Target of Rapamycin), AMPK (AMP-activated protein kinase), Akt, MAPK (mitogen-activated protein kinase) and PKC (protein kinase C) are other STPKs which regulate various components/steps of autophagy, and are often deregulated in cancer. MAPKs have three subfamilies – ERKs, p38, and JNKs. JNKs (c-Jun N-terminal Kinases) have three isoforms in mammals – JNK1, JNK2, and JNK3, each with distinct cellular locations and functions. JNK1 plays a role in starvation-induced activation of autophagy, and the context-specific role of autophagy in tumorigenesis makes JNK1 a challenging anticancer drug target. Since JNKs are closely related to other members of the MAPK family (p38, MAP kinase and the ERKs), it is difficult to design JNK-selective inhibitors. Designing JNK isoform-selective inhibitors is even more challenging, as the ATP-binding sites among all JNKs are highly conserved. Although limited information is available on computational approaches to predict JNK1 inhibitors, it is difficult to find literature exploring machine learning techniques to predict JNK inhibitors. This study aims to apply machine learning to predict JNK1 inhibitors regulating autophagy in cancer using Random Forest (RF). Here, the RF algorithm is used for two purposes: to select and rank the molecular descriptors calculated using PaDEL descriptor software, and as a classifier. The descriptors are prioritized by calculating Variable Importance Measures (VIMs) using functions based on mean square error (IncMSE) and node purity (IncNodePurity) of RF. 
The classification models based on a set of 22 prioritized descriptors show an accuracy of 86.36%, a precision of 88.27% and an AUC (Area Under ROC curve) of 0.8914. We conclude that machine learning-based compound classification using Random Forest is one of the ligand-based approaches that can be adopted for virtual screening of large compound libraries of JNK1 bioactives. Author Summary Out of the three isoforms of JNKs (c-Jun N-terminal Kinases) in humans (each with distinct cellular locations and functions), JNK1 plays a role in starvation-induced activation of autophagy. The role of JNK1 in autophagy modulation and the dual role of autophagy in tumor cells make JNK1 a promising anticancer drug target. Since JNKs are closely related to other members of the MAPK (Mitogen-Activated Protein Kinases) family, it is difficult to design JNK-selective inhibitors. Designing JNK isoform-selective inhibitors is even more challenging, as the ATP-binding sites among all JNKs are highly conserved. The Random Forest classifier usually outperforms several other machine learning algorithms for classification and prediction tasks in diverse areas of research. In this work, we have used the Random Forest algorithm for two purposes: (i) calculating variable importance measures to rank and select molecular features, and (ii) predicting JNK1 inhibitors regulating autophagy in cancer. We have used PaDEL-calculated molecular features of the JNK1 bioactivity dataset from the ChEMBL database to build classification models using the random forest classifier. Our results show that by optimally selecting features from the top 10% based on variable importance measures, the classification accuracy is high, and the classification model proposed in this study can be integrated into a drug design pipeline to virtually screen compound libraries for predicting JNK1 inhibitors.


2019 ◽  
Vol 12 (3) ◽  
pp. 1209-1225 ◽  
Author(s):  
Christoph A. Keller ◽  
Mat J. Evans

Abstract. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change. We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (NOx = NO + NO2). We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (O3), errors from using the random forests (compared to the reference simulation) grow slowly and after 5 days the normalized mean bias (NMB), root mean square error (RMSE) and R2 are 4.2 %, 35 % and 0.9, respectively; after 30 days the errors increase to 13 %, 67 % and 0.75, respectively. The biases become largest in remote areas such as the tropical Pacific where errors in the chemistry can accumulate with little balancing influence from emissions or deposition. Over polluted regions the model error is less than 10 % and has significant fidelity in following the time series of the full model. 
Modelled NOx shows similar features, with the most significant errors occurring in remote locations far from recent emissions. For other species such as inorganic bromine species and short-lived nitrogen species, errors become large, with NMB, RMSE and R2 reaching >2100 %, >400 % and <0.1, respectively. This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.
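The three skill scores quoted in this abstract can be computed along the following lines. The abstract does not state the exact definitions, so the sketch makes assumptions: NMB is the summed error divided by the summed reference, RMSE is normalized by the reference mean (which would explain why it is reported in percent), and R2 is the squared Pearson correlation. The sample values are hypothetical.

```python
from math import sqrt


def normalized_mean_bias(predicted, reference):
    """Sum of (prediction - reference) divided by the sum of the reference."""
    return sum(p - r for p, r in zip(predicted, reference)) / sum(reference)


def normalized_rmse(predicted, reference):
    """Root mean square error divided by the mean of the reference (assumed normalization)."""
    n = len(reference)
    rmse = sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)
    return rmse / (sum(reference) / n)


def r_squared(predicted, reference):
    """Squared Pearson correlation between prediction and reference."""
    n = len(reference)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov * cov / (var_p * var_r)


# Hypothetical concentrations: the emulator runs 10 % high everywhere.
reference_run = [10.0, 20.0, 30.0, 40.0]
emulator_run = [11.0, 22.0, 33.0, 44.0]
nmb = normalized_mean_bias(emulator_run, reference_run)
nrmse = normalized_rmse(emulator_run, reference_run)
r2 = r_squared(emulator_run, reference_run)
```

A uniformly biased emulator like this one still has a squared correlation of 1, which is why bias and error scores are reported alongside R2.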


2020 ◽  
Vol 34 (10) ◽  
pp. 13971-13972
Author(s):  
Yang Qi ◽  
Farseev Aleksandr ◽  
Filchenkov Andrey

Nowadays, social networks play a crucial role in everyday human life and are no longer purely associated with spending spare time. In fact, instant communication with friends and colleagues has become an essential component of our daily interaction, giving rise to multiple new types of social networks. By participating in such networks, individuals generate a multitude of data points that describe their activities from different perspectives and can be further used for applications such as personalized recommendation or user profiling. However, the impact of different social media networks on machine learning model performance has not yet been studied comprehensively. In particular, the literature on modeling multi-modal data from multiple social networks is relatively sparse, which inspired us to take a deeper dive into the topic in this preliminary study. Specifically, in this work, we study the performance of different machine learning models when trained on multi-modal data from different social networks. Our initial experimental results reveal that the choice of social network impacts performance and that proper selection of the data source is crucial.


2020 ◽  
Author(s):  
Ki-Jin Ryu ◽  
Kyong Wook Yi ◽  
Yong Jin Kim ◽  
Jung Ho Shin ◽  
Jun Young Hur ◽  
...  

Abstract Background To analyze the determinants of women’s vasomotor symptoms (VMS) using machine learning. Methods Data came from Korea University Anam Hospital in Seoul, Korea, with 3298 women, aged 40–80 years, who attended their general health check from January 2010 to December 2012. Five machine learning methods were applied and compared for the prediction of VMS, measured by a Menopause Rating Scale. Variable importance, the effect of a variable on model performance, was used for identifying major determinants of VMS. Results In terms of the mean squared error, the random forest (0.9326) was much better than linear regression (12.4856) and artificial neural networks with one, two and three hidden layers (1.5576, 1.5184 and 1.5833, respectively). Based on variable importance from the random forest, the most important determinants of VMS were age, menopause age, thyroid stimulating hormone, monocyte and triglyceride, as well as gamma glutamyl transferase, blood urea nitrogen, cancer antigen 19-9, C-reactive protein and low-density-lipoprotein cholesterol. Indeed, the following determinants ranked within the top 20 in terms of variable importance: cancer antigen 125, total cholesterol, insulin, free thyroxine, forced vital capacity, alanine aminotransferase, forced expiratory volume in one second, height, homeostatic model assessment for insulin resistance and carcinoembryonic antigen. Conclusions Machine learning provides an invaluable decision support system for the prediction of VMS. For preventing VMS, preventive measures would be needed regarding the thyroid function, the lipid profile, the liver function, inflammation markers, insulin resistance, the monocyte, cancer antigens and the lung function.
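One common way to turn "the effect of a variable on model performance" into a number is permutation importance: shuffle one predictor at a time and record how much the mean squared error increases. The sketch below shows that idea in plain Python. It is not claimed to be the importance measure used in the study, and `toy_predict` and the toy data are hypothetical placeholders for a fitted model and a real dataset.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    `predict` takes one row (a list of feature values) and returns a number.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mean_squared_error(targets, [predict(r) for r in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the target depends only on the first feature.
toy_rows = [[float(i), float(i % 3)] for i in range(20)]
toy_targets = [2.0 * r[0] for r in toy_rows]


def toy_predict(row):
    return 2.0 * row[0]


toy_importances = permutation_importance(toy_predict, toy_rows, toy_targets)
```

Shuffling the second feature leaves the error unchanged because the toy model ignores it, while shuffling the first feature degrades the predictions.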


Author(s):  
Benjamin A Goldstein ◽  
Eric C Polley ◽  
Farren B. S. Briggs

The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.


Author(s):  
Paul Gustafson

Abstract The article by Jiang et al. (Am J Epidemiol.) extends quantitative bias analysis from the realm of statistical models to the realm of machine learning algorithms. Given the rooting of statistical models in the spirit of explanation and the rooting of machine learning algorithms in the spirit of prediction, this extension is thought-provoking indeed. Some such thoughts are expounded here.


2021 ◽  
Vol 5 (CHI PLAY) ◽  
pp. 1-29
Author(s):  
Alessandro Canossa ◽  
Dmitry Salimov ◽  
Ahmad Azadvar ◽  
Casper Harteveld ◽  
Georgios Yannakakis

Is it possible to detect toxicity in games just by observing in-game behavior? If so, what are the behavioral factors that will help machine learning to discover the unknown relationship between gameplay and toxic behavior? In this initial study, we examine whether it is possible to predict toxicity in the MOBA game For Honor by observing in-game behavior for players that have been labeled as toxic (i.e. players that have been sanctioned by Ubisoft community managers). We test our hypothesis of detecting toxicity through gameplay with a dataset of almost 1,800 sanctioned players, and comparing these sanctioned players with unsanctioned players. Sanctioned players are defined by their toxic action type (offensive behavior vs. unfair advantage) and degree of severity (warned vs. banned). Our findings, based on supervised learning with random forests, suggest that it is not only possible to behaviorally distinguish sanctioned from unsanctioned players based on selected features of gameplay; it is also possible to predict both the sanction severity (warned vs. banned) and the sanction type (offensive behavior vs. unfair advantage). In particular, all random forest models predict toxicity, its severity, and type, with an accuracy of at least 82%, on average, on unseen players. This research shows that observing in-game behavior can support the work of community managers in moderating and possibly containing the burden of toxic behavior.


SLEEP ◽  
2020 ◽  
Vol 43 (Supplement_1) ◽  
pp. A148-A148
Author(s):  
O J Veatch ◽  
D R Mazzotti

Abstract Introduction Transitions to and from daylight saving time (DST) are natural experiments of circadian disruption and are associated with negative health consequences. Yet, the majority of the United States and several other countries still adopt these changes. Large observational studies focused on understanding the impact of DST transitions on sleep are difficult to conduct. Social media platforms, like Twitter, are powerful sources of human behavior data. We used machine learning to identify tweets reporting sleep complaints (TRSC) during the week of the standard time (ST)-DST transition. Next, we evaluated the circadian patterns of TRSC and compared their prevalence before and after the transition. Methods Using data publicly available via the Twitter API, we collected 500 tweets with evidence of sleep complaints, and manually annotated each tweet to validate true sleep complaints. Next, we calculated term frequency-inverse document frequency of each word in each tweet and trained a random forest to classify TRSC using a 3-fold cross-validation design. The trained model was then used to annotate a collection of tweets captured between Oct. 30, 2019-Nov. 6, 2019, overlapping with the DST-ST transition, which occurred on Nov. 3, 2019. Results Random forest demonstrated good performance in classifying TRSC (AUC[95%CI]=0.85[0.82-0.89]). This model was applied to 3,738,383 tweets collected around the DST-ST transition, and identified 11,044 TRSC. Posting of these tweets had a circadian pattern, with peak during nighttime. We found a higher frequency of TRSC after the DST-ST transition (0.33% vs. 0.27%, p<0.00001), corresponding to a ~20% increase in the odds of reporting sleep complaints (OR[95%CI]=1.21[1.16-1.25]). 
Conclusion Using machine learning and Twitter data, we identified tweets reporting sleep complaints, described their circadian patterns and demonstrated that the prevalence of these types of tweets is significantly increased after the transition from DST to ST. These results demonstrate the applicability of social media data mining for public health in sleep medicine. Support NIH (K01LM012870); AASM Foundation (194-SR-18)
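An odds ratio with a confidence interval, as reported in the Results, can be derived from a 2x2 table of counts. The sketch below uses the standard Wald interval on the log odds ratio. The abstract gives only proportions and the overall totals, so the counts in the example are hypothetical and are not the study's data; the function name is likewise an illustrative choice.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(events_after, total_after, events_before, total_before, z=1.96):
    """Odds ratio (after vs. before) with a Wald confidence interval.

    With z = 1.96 the interval is approximately a 95 % interval.
    """
    a = events_after
    b = total_after - events_after
    c = events_before
    d = total_before - events_before
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * standard_error)
    upper = exp(log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts: 20 of 100 tweets flagged after, 10 of 100 before.
example_or, example_lower, example_upper = odds_ratio_with_ci(20, 100, 10, 100)
```

With millions of tweets per period, as in the study, the same calculation would give a much narrower interval than this small example.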


2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Kerry E. Poppenberg ◽  
Vincent M. Tutino ◽  
Lu Li ◽  
Muhammad Waqas ◽  
Armond June ◽  
...  

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. 
In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
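The area under the ROC curve used to assess these models equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes AUC directly from that definition. It is a generic illustration rather than the authors' evaluation code, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 for cases and 0 for controls. Ties count as half a win.
    The pairwise loop is quadratic, which is fine for small test cohorts.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for two cases and two controls.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = roc_auc(example_labels, example_scores)
```

Because AUC depends only on the ranking of scores, it is unaffected by the decision threshold that determines accuracy.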

