A Variable Ranking Method for Machine Learning Models with Correlated Features: In-Silico Validation and Application for Diabetes Prediction

2021 ◽  
Vol 11 (16) ◽  
pp. 7740
Author(s):  
Martina Vettoretti ◽  
Barbara Di Camillo

When building a model to predict a clinical outcome with machine learning techniques, developers are often interested in ranking the features according to their predictive ability. A commonly used approach to obtain a robust variable ranking is to apply recursive feature elimination (RFE) on multiple resamplings of the training set and then aggregate the ranking results using the Borda count method. However, the presence of highly correlated features in the training set can degrade the ranking performance. In this work, we propose a variant of the RFE and Borda count method that takes the correlation between variables into account during the ranking procedure, in order to improve ranking performance in the presence of highly correlated features. The proposed algorithm is tested on simulated datasets in which the true variable importance is known and compared to the standard RFE-Borda count method. According to the root mean square error between the estimated rank and the true (i.e., simulated) feature importance, the proposed algorithm outperforms the standard RFE-Borda count method. Finally, the proposed algorithm is applied to a case study on the development of a predictive model of type 2 diabetes onset.
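The aggregation step mentioned in the abstract above can be illustrated with a minimal sketch of a plain Borda count. It shows only the standard aggregation, not the authors' correlation-aware variant, and it assumes each resampling has already produced a full ranking (for example from RFE). The function name `borda_aggregate` and the feature names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate several rankings (best feature first) into one by Borda count."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            # The top-ranked feature earns n - 1 points, the last one earns 0.
            scores[feature] += n - 1 - position
    # A higher total score means a better aggregate rank; ties are broken by name.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings, as might be produced by RFE on three resamplings.
resampled_rankings = [
    ["glucose", "bmi", "age", "hdl"],
    ["glucose", "age", "bmi", "hdl"],
    ["bmi", "glucose", "age", "hdl"],
]
aggregate = borda_aggregate(resampled_rankings)
print(aggregate)
```

Two highly correlated features can split the points between them across resamplings, which is one way to picture the degradation the abstract describes.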

2019 ◽  
Vol 12 (1) ◽  
pp. 41-48 ◽  
Author(s):  
Nivedhitha Mahendran ◽  
Durai Raj Vincent

Background: Major Depressive Disorder (MDD) is, in simple terms, a psychiatric disorder that may be indicated by mood disturbances persisting for more than a few weeks. It is considered a serious threat to psychophysiology and, when left undiagnosed, may even lead to the death of the patient, so an effective predictive model is important. MDD is often described as a comorbid medical condition (a medical condition that co-occurs with another), which makes it difficult for physicians to recognize that a patient is depressed; timely diagnosis of MDD may help to avoid other comorbidities. Machine learning is a branch of artificial intelligence that enables a system to learn from past data and use that experience to improve future results without being explicitly programmed. With high-dimensional feature sets, prediction accuracy is comparatively low. To remove redundant and irrelevant features from the data and improve accuracy, relevant features must be selected using effective feature selection methods. Objective: This study aims to develop a predictive model for diagnosing Major Depressive Disorder among IT professionals by reducing the feature dimension using feature selection techniques, and to evaluate the result with three machine learning classifiers: Naïve Bayes, Support Vector Machines and Decision Tree. Method: We used a Random Forest-based Recursive Feature Elimination technique to reduce the feature dimensions. Results: The results show a considerable increase in prediction accuracy after applying the feature selection technique. Conclusion: The results imply that the classification algorithms perform better after the feature dimensions are reduced.


2021 ◽  
Vol 27 (1) ◽  
pp. 146045822198939
Author(s):  
Noratikah Nordin ◽  
Zurinahni Zainol ◽  
Mohd Halim Mohd Noor ◽  
Chan Lai Fong

Current suicide risk assessments for predicting suicide attempts are time-consuming, of low predictive value and inadequately reliable. This paper aims to develop a predictive model for suicide attempts among patients with depression using machine learning algorithms, and presents a comparative study of single and ensemble predictive models for differentiating depressed patients with suicide attempts from non-attempters. We trained eight different machine learning algorithms on a dataset of 75 patients diagnosed with a depressive disorder. Recursive feature elimination with three-fold cross-validation was used to reduce the features. The ensemble predictive models outperformed the single predictive models. Voting and bagging achieved the highest accuracy, 92%, among the machine learning algorithms. Our findings indicate that history of suicide attempt, religion, race, suicide ideation and severity of clinical depression are useful factors for predicting suicide attempts.
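As a rough illustration of the voting ensembles mentioned in this abstract, the sketch below combines the class labels predicted by several models through hard (majority) voting. It is an assumption-laden toy example: the abstract does not say which voting scheme was used, and the function name `majority_vote` and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(model_predictions):
    """Combine per-model label lists into one label per sample by hard voting."""
    combined = []
    # zip(*...) regroups the predictions so that each tuple holds all votes for one sample.
    for votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical binary predictions from three single models on four patients.
predictions = [
    [1, 0, 1, 0],  # model A
    [1, 1, 0, 0],  # model B
    [0, 1, 1, 0],  # model C
]
ensemble = majority_vote(predictions)
print(ensemble)
```

With an odd number of models and binary labels no ties can occur, which keeps the example unambiguous.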


Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract: With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that, on a significantly larger simulated representative training set, the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.


Computers ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 157
Author(s):  
Daniel Santos ◽  
José Saias ◽  
Paulo Quaresma ◽  
Vítor Beires Nogueira

Traffic accidents are one of the most important concerns of the world, since they result in numerous casualties, injuries, and fatalities each year, as well as significant economic losses. Many factors are responsible for causing road accidents. If these factors can be better understood and predicted, it might be possible to take measures to mitigate the damage and its severity. The purpose of this work is to identify these factors using accident data from 2016 to 2019 from the district of Setúbal, Portugal. This work aims at developing models that can select a set of influential factors that may be used to classify the severity of an accident, supporting an analysis of the accident data. In addition, this study proposes a predictive model for future road accidents based on past data. Various machine learning approaches are used to create these models. Supervised machine learning methods such as decision trees (DT), random forests (RF), logistic regression (LR), and naive Bayes (NB) are used, as well as unsupervised machine learning techniques including DBSCAN and hierarchical clustering. Results show that a rule-based model using the C5.0 algorithm is capable of accurately detecting the most relevant factors describing the severity of a road accident. Further, the results of the predictive model suggest that the RF model could be a useful tool for forecasting accident hotspots.


2017 ◽  
Vol 2017 ◽  
pp. 1-21 ◽  
Author(s):  
Carlos Fernández ◽  
David Fernández-Llorca ◽  
Miguel A. Sotelo

A hybrid vision-map system is presented to solve the road detection problem in urban scenarios. The standardized use of machine learning techniques in classification problems has been merged with digital navigation map information to increase system robustness. The objective of this paper is to create a new environment perception method to detect the road in urban environments, fusing stereo vision with digital maps by detecting road appearance and road limits such as lane markings or curbs. Deep learning approaches make the system hard-coupled to the training set. Even though our approach is based on machine learning techniques, the features are calculated from different sources (GPS, map, curbs, etc.), making our system less dependent on the training set.


Author(s):  
Chunsheng Yang ◽  
Yanni Zou ◽  
Jie Liu ◽  
Kyle R Mulligan

In the past decades, machine learning techniques and algorithms, particularly classifiers, have been widely applied to various real-world applications such as PHM. In developing high-performance classifiers or machine learning-based models, i.e., predictive models for PHM, the evaluation of the predictive model remains a challenge. Generic methods such as accuracy may not fully meet the needs of model evaluation for prognostic applications. This paper addresses this issue from the point of view of PHM systems. Generic methods are first reviewed, outlining their limitations or deficiencies with respect to PHM. Then, two approaches developed for evaluating predictive models are presented, with emphasis on the specificities and requirements of PHM. A real prognostic application is studied to demonstrate the usefulness of the two proposed methods for predictive model evaluation. We argue that predictive models for PHM must be evaluated using not only generic methods but also domain-oriented approaches in order to deploy the models in real-world applications.


2020 ◽  
Vol 499 (4) ◽  
pp. 6009-6017
Author(s):  
Y-L Mong ◽  
K Ackley ◽  
D K Galloway ◽  
T Killestein ◽  
J Lyman ◽  
...  

ABSTRACT: The amount of observational data produced by time-domain astronomy is exponentially increasing. Human inspection alone is not an effective way to identify genuine transients from the data. An automatic real-bogus classifier is needed and machine learning techniques are commonly used to achieve this goal. Building a training set with a sufficiently large number of verified transients is challenging, due to the requirement of human verification. We present an approach for creating a training set by using all detections in the science images to be the sample of real detections and all detections in the difference images, which are generated by the process of difference imaging to detect transients, to be the samples of bogus detections. This strategy effectively minimizes the labour involved in the data labelling for supervised machine learning methods. We demonstrate the utility of the training set by using it to train several classifiers utilizing as the feature representation the normalized pixel values in 21 × 21 pixel stamps centred at the detection position, observed with the Gravitational-wave Optical Transient Observer (GOTO) prototype. The real-bogus classifier trained with this strategy can provide up to 95 per cent prediction accuracy on the real detections at a false alarm rate of 1 per cent.
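The closing figure in this abstract (accuracy on real detections at a fixed false alarm rate) depends on where the classifier's score threshold is placed. The following minimal sketch shows one way such an operating point might be computed from classifier scores; it is not the authors' evaluation code. The function name `recall_at_false_alarm_rate` and the scores are hypothetical, and the example uses a 20 per cent limit only because the toy sample is tiny.

```python
def recall_at_false_alarm_rate(real_scores, bogus_scores, max_far):
    """Return (threshold, recall) for the lowest threshold whose false alarm
    rate on the bogus samples does not exceed max_far."""
    # Candidate thresholds: every observed score, tried from lowest to highest.
    for threshold in sorted(set(real_scores) | set(bogus_scores)):
        # A bogus detection scoring at or above the threshold is a false alarm.
        false_alarms = sum(s >= threshold for s in bogus_scores)
        if false_alarms / len(bogus_scores) <= max_far:
            recall = sum(s >= threshold for s in real_scores) / len(real_scores)
            return threshold, recall
    # No observed score satisfies the limit: reject everything.
    return float("inf"), 0.0


# Hypothetical classifier scores (higher means "more likely real").
real_scores = [0.25, 0.7, 0.8, 0.95]
bogus_scores = [0.05, 0.1, 0.2, 0.3, 0.9]
threshold, recall = recall_at_false_alarm_rate(real_scores, bogus_scores, max_far=0.2)
print(threshold, recall)
```

Because the false alarm rate can only fall as the threshold rises, the first threshold that satisfies the limit is also the one that keeps the most real detections.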


Circulation ◽  
2018 ◽  
Vol 138 (Suppl_2) ◽  
Author(s):  
Tomohisa Seki ◽  
Tomoyoshi Tamura ◽  
Masaru Suzuki

Introduction and Objective: Early prognostication for cardiogenic out-of-hospital cardiac arrest (OHCA) patients remains challenging. Recently, advanced machine learning techniques have been employed for clinical diagnosis and prognostication for various conditions. Therefore, in this study, we attempted to establish a prognostication model for cardiogenic OHCA using an advanced machine learning technique. Methods and Results: Data from a prospective multi-center cohort study of OHCA patients transported by ambulance to 67 medical institutions in the Kanto area of Japan between January 2012 and March 2013 were used in this study. Data for cardiogenic OHCA patients aged ≥18 years were retrieved, and patients were grouped according to the time of the ambulance call (training set: between January 1, 2012 and December 12, 2012; test set: between January 1, 2013 and March 31, 2013). From among 421 variables observed during the period between the ambulance call and initial in-hospital treatment of cardiogenic OHCA, either 38 prehospital factors or 56 prehospital and initial in-hospital factors were used for prognostication. Prognostication models for 1-year survival were established with the random forest method, an advanced machine learning method that aggregates a series of decision trees for classification and regression. After 10-fold internal cross-validation in the training set, the prognostication models were validated using the test set. The area under the receiver operating characteristic curve (AUC) was used to evaluate the prediction performance of the models. Prognostication models for 1-year survival trained with 38 variables or 56 variables showed AUC values of 0.93±0.01 and 0.95±0.01, respectively. Conclusions: Prognostication models trained with an advanced machine learning technique showed favorable prediction capability for 1-year survival of cardiogenic OHCA. These results indicate that an advanced machine learning technique can be applied to establish an early prognostication model for cardiogenic OHCA.
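The AUC used as the performance measure in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is a minimal illustration, not the study's evaluation code, and the function name `roc_auc`, the labels and the scores are all hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count one half), i.e. the normalized Mann-Whitney U statistic."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    # Compare every positive score against every negative score.
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = survived) and predicted survival scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the sample size, which is fine for an illustration; library implementations typically use a sort-based computation instead.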

