On machine learning, ROC analysis, and statistical tests of significance

Author(s):  
M.A. Maloof
Author(s):  
Dhamanpreet Kaur ◽  
Matthew Sobiesk ◽  
Shubham Patil ◽  
Jin Liu ◽  
Puran Bhagat ◽  
...  

Abstract Objective This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data. Materials and Methods We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data. Results Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules. Discussion Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools. Conclusion We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
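
As an illustration of the general approach described in this abstract, the sketch below learns a Bayesian network structure from a tabular patient dataset and forward-samples synthetic records from it. This is a minimal sketch assuming the pgmpy library; the CSV file name is a hypothetical placeholder, not the authors' pipeline.

```python
# Minimal sketch: learn a Bayesian network from discretized patient data and
# sample synthetic records. Assumes pgmpy; "heart_disease.csv" is hypothetical.
import pandas as pd
from pgmpy.estimators import BayesianEstimator, BicScore, HillClimbSearch
from pgmpy.models import BayesianNetwork
from pgmpy.sampling import BayesianModelSampling

real = pd.read_csv("heart_disease.csv")  # hypothetical discretized dataset

# Score-based structure learning, then parameter fitting on the learned DAG.
dag = HillClimbSearch(real).estimate(scoring_method=BicScore(real))
model = BayesianNetwork(dag.edges())
model.fit(real, estimator=BayesianEstimator, prior_type="BDeu")

# Forward-sample as many synthetic records as there are real ones.
synthetic = BayesianModelSampling(model).forward_sample(size=len(real))
synthetic.to_csv("synthetic_heart_disease.csv", index=False)
```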


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
S W Rha ◽  
B G Choi ◽  
S Y Choi ◽  
J K Byun ◽  
J A Cha ◽  
...  

Abstract Background Chest pain is a major symptom of coronary artery disease (CAD), which can lead to acute coronary syndrome and sudden cardiac death. Accurate diagnosis of CAD in patients who experience chest pain is crucial to provide appropriate treatment and optimize clinical outcomes. Objective This study aimed to develop a machine learning model that can predict and diagnose CAD in patients complaining of chest pain, based on a large real-world prospective registry database and computing power. Method A total of 10,177 subjects with typical or atypical chest pain who underwent coronary angiography at the cardiovascular center of our University Hospital, South Korea between November 2004 and May 2014 were evaluated in this study. The diagnostic prediction models for CAD were generated with the classification application of MATLAB R2017a. The performance of each model generated by machine learning was evaluated by the area under the curve (AUC) of receiver-operating characteristic (ROC) analysis. Results Diagnostic prediction models of CAD were generated according to the user's level of access to information, from the general public to clinicians (Models 1–4). The AUCs of the models ranged from 0.78 to 0.96, and their prediction accuracy ranged from 70.4% to 88.9%. The performance of the diagnostic prediction models improved as the input information increased. Figure 1: study flow chart. Conclusion A diagnostic prediction model of CAD using machine learning and the registry database was developed. Further studies are needed to verify our results.
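
The study's models were built with MATLAB's classification tools; purely for illustration, here is a hedged Python sketch of the same evaluation idea: train a classifier on registry features and report the AUC of the ROC. The file and column names are hypothetical.

```python
# Hedged sketch of AUC-based evaluation; not the study's MATLAB pipeline.
# "chest_pain_registry.csv" and "cad" are hypothetical names.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

registry = pd.read_csv("chest_pain_registry.csv")
X, y = registry.drop(columns=["cad"]), registry["cad"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)

# Diagnostic performance is summarized by the area under the ROC curve.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2f}")
```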


2012 ◽  
Vol 3 (2) ◽  
Author(s):  
Jeevan Prakash ◽  
Aruna Nigam ◽  
Pikee Saxena ◽  
Anmita S Acharya

Author(s):  
RUCHIKA MALHOTRA ◽  
ANKITA JAIN BANSAL

For various reasons, such as ever-increasing customer demands, changes in the environment, or detection of a bug, changes are incorporated into software. This results in multiple versions and the evolving nature of software. Identifying the parts of a software system that are more prone to change than others is an important activity. Identifying change-prone classes helps developers take focused and timely preventive action on classes with similar characteristics in future releases. In this paper, we have studied the relationship between various object-oriented (OO) metrics and change proneness. We collected a set of OO metrics and change data for each class that appeared in two versions of the open source software 'Java TreeView', i.e., version 1.1.6 and version 1.0.3. In addition, we built various models that can be used to identify change-prone classes, using machine learning and statistical techniques, and then compared their performance. The results were analyzed using the Area Under the Curve (AUC) obtained from Receiver Operating Characteristic (ROC) analysis. The results show that the models built using both machine learning and statistical methods perform well at predicting change-prone classes. Based on the results, it is reasonable to claim that quality models are significantly related to OO metrics and can therefore be used by researchers for early prediction of change-prone classes.
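
To make the comparison concrete, the sketch below cross-validates one statistical model (logistic regression) and one machine learning model (random forest) on class-level OO metrics, scoring both by ROC AUC. This is a hedged illustration; the metrics file and column names are hypothetical, not the paper's exact setup.

```python
# Hedged illustration: compare a statistical and an ML model on OO metrics
# using ROC AUC. "treeview_metrics.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("treeview_metrics.csv")
X = data[["wmc", "cbo", "rfc", "lcom", "dit", "noc"]]  # CK-style OO metrics
y = data["changed"]  # 1 if the class changed between the two versions

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.2f}")
```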


2020 ◽  
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R.M. Atif Azad

Abstract Background Computer Aided Diagnostics (CAD) can support medical practitioners to make critical decisions about their patients' disease conditions. Practitioners require access to the chain of reasoning behind CAD to build trust in the CAD advice and to supplement their own expertise. Yet, CAD systems might be based on black box machine learning (ML) models and high dimensional data sources (electronic health records, MRI scans, cardiotocograms, etc). These foundations make interpretation and explanation of the CAD advice very challenging. This challenge is recognised throughout the machine learning research community. eXplainable Artificial Intelligence (XAI) is emerging as one of the most important research areas of recent years because it addresses the interpretability and trust concerns of critical decision makers, including those in clinical and medical practice. Methods In this work, we focus on AdaBoost, a black box ML model that has been widely adopted in the CAD literature. We address the challenge of explaining AdaBoost classification with a novel algorithm that extracts simple, logical rules from AdaBoost models. Our algorithm, Adaptive-Weighted High Importance Path Snippets (Ada-WHIPS), makes use of AdaBoost's adaptive classifier weights. Using a novel formulation, Ada-WHIPS uniquely redistributes the weights among individual decision nodes of the internal decision trees (DT) of the AdaBoost model. Then, a simple heuristic search of the weighted nodes finds a single rule that dominates the model's decision. We compare the explanations generated by our novel approach with the state of the art in an experimental study. We evaluate the derived explanations with simple statistical tests of well-known quality measures, precision and coverage, and a novel measure, stability, that is better suited to the XAI setting.
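
Ada-WHIPS itself is the authors' contribution; the sketch below only shows the raw material such a method works with in scikit-learn: each AdaBoost base tree is exposed together with its adaptive classifier weight, so split nodes can be enumerated and weighted. The dataset is a stand-in, not the paper's data.

```python
# Sketch of the inputs a rule-extraction method like Ada-WHIPS operates on:
# AdaBoost base trees and their adaptive weights (scikit-learn internals).
# This is not the Ada-WHIPS algorithm itself.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in medical dataset
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair every internal split node with the weight of the tree containing it.
for tree, weight in zip(ada.estimators_, ada.estimator_weights_):
    t = tree.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:  # internal (splitting) node
            print(f"x[{t.feature[node]}] <= {t.threshold[node]:.3f}  "
                  f"(tree weight {weight:.3f})")
```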


2019 ◽  
Vol 8 (4) ◽  
pp. 9155-9158

Classification is a machine learning task that consists of predicting the class membership of unclassified examples, whose labels are not known, from the properties of training examples whose labels were known. Classification tasks span a huge assortment of domains and real-world purposes, such as medical diagnosis, bioinformatics, financial engineering and image recognition, among others, where domain experts can use the learned model to support their decisions. All the classification approaches considered in this paper were evaluated in an appropriate experimental framework in the R programming language, with the major emphasis on the k-nearest neighbor method, support vector machines and decision trees over a large number of data sets with varied dimensionality, comparing their performance against other state-of-the-art methods. The experimental results obtained were verified by statistical tests, which support the better performance of these methods. In this paper we have surveyed various data mining classification techniques and compared them using diverse datasets from the "University of California, Irvine (UCI) Machine Learning Repository", with accuracy calculations on the Iris dataset.
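
The experiments described here were run in R; purely as a hedged re-expression of the same comparison, the Python sketch below cross-validates k-nearest neighbors, a support vector machine, and a decision tree on the UCI Iris dataset.

```python
# Hedged Python re-expression of the comparison (the paper used R):
# kNN, SVM, and a decision tree, 10-fold cross-validated on UCI Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
classifiers = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```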


2021 ◽  
Author(s):  
Sang Min Nam ◽  
Thomas A Peterson ◽  
Kyoung Yul Seo ◽  
Hyun Wook Han ◽  
Jee In Kang

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.
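
The feature-selection step can be illustrated as follows: train a full XGBoost model, rank features by importance, and refit a sparse model on the top-ranked subset (18 factors in the study). This is a hedged sketch; the survey file and target column names are hypothetical, and the study's TPOT hyperparameter search and survey-weighted regression are omitted.

```python
# Hedged sketch of XGBoost feature selection; "knhanes.csv" and its target
# column are hypothetical, and TPOT tuning is omitted.
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

survey = pd.read_csv("knhanes.csv")
X, y = survey.drop(columns=["current_depression"]), survey["current_depression"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

full = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss").fit(X_tr, y_tr)
print("full model AUC:", roc_auc_score(y_te, full.predict_proba(X_te)[:, 1]))

# Keep the top-ranked features (18 in the study) for a sparse model.
top = (pd.Series(full.feature_importances_, index=X.columns)
         .sort_values(ascending=False).head(18).index)
sparse = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss").fit(X_tr[top], y_tr)
print("sparse model AUC:", roc_auc_score(y_te, sparse.predict_proba(X_te[top])[:, 1]))
```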


Sensors ◽  
2020 ◽  
Vol 20 (6) ◽  
pp. 1692 ◽  
Author(s):  
Iván Silva ◽  
José Eugenio Naranjo

Identifying driving styles using classification models with in-vehicle data can provide automated feedback to drivers on their driving behavior, particularly if they are driving safely. Although several classification models have been developed for this purpose, there is no consensus on which classifier performs better at identifying driving styles. Therefore, more research is needed to evaluate classification models by comparing performance metrics. In this paper, a data-driven machine-learning methodology for classifying driving styles is introduced. This methodology is grounded in well-established machine-learning (ML) methods and literature related to driving-styles research. The methodology is illustrated through a study involving data collected from 50 drivers from two different cities in a naturalistic setting. Five features were extracted from the raw data. Fifteen experts were involved in the data labeling to derive the ground truth of the dataset. The dataset fed five different models (Support Vector Machines (SVM), Artificial Neural Networks (ANN), fuzzy logic, k-Nearest Neighbor (kNN), and Random Forests (RF)). These models were evaluated in terms of a set of performance metrics and statistical tests. The experimental results from performance metrics showed that SVM outperformed the other four models, achieving an average accuracy of 0.96, F1-Score of 0.9595, Area Under the Curve (AUC) of 0.9730, and Kappa of 0.9375. In addition, Wilcoxon tests indicated that ANN predicts differently to the other four models. These promising results demonstrate that the proposed methodology may support researchers in making informed decisions about which ML model performs better for driving-styles classification.
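
The evaluation protocol (several classifiers scored on common cross-validation folds, then compared with Wilcoxon tests) can be sketched as follows. This is a hedged illustration: the feature file is hypothetical, the fuzzy-logic model is omitted, and default hyperparameters stand in for the study's settings.

```python
# Hedged sketch: score several classifiers on the same CV folds, then compare
# two of them with a Wilcoxon signed-rank test. Data file is hypothetical.
import pandas as pd
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

trips = pd.read_csv("driving_features.csv")  # hypothetical labeled features
X, y = trips.drop(columns=["style"]), trips["style"]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {"SVM": SVC(), "ANN": MLPClassifier(max_iter=2000),
          "kNN": KNeighborsClassifier(), "RF": RandomForestClassifier(random_state=0)}
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="f1_macro")
          for name, m in models.items()}

# Paired per-fold comparison, e.g. ANN versus SVM.
stat, p = wilcoxon(scores["ANN"], scores["SVM"])
print(f"ANN vs SVM: Wilcoxon p = {p:.3f}")
```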

