Minimum Relevant Features to Obtain Explainable Systems for Predicting Cardiovascular Disease Using the Statlog Data Set

Author(s):  
Roberto Porto ◽  
Jose M. Molina ◽  
Antonio Berlanga ◽  
Miguel A. Patricio

Learning systems have largely focused on creating models capable of obtaining the best results on error metrics. Recently, the focus has shifted toward interpreting and explaining their results. The need for interpretation is greater when these models are used to support decision making, and in some areas, such as medicine, it becomes an indispensable requirement. This paper focuses on the prediction of cardiovascular disease by analyzing the well-known Statlog (Heart) data set from the UCI Machine Learning Repository. The study analyzes the trade-off between making predictions easier to interpret, by reducing the number of features that explain the classification of health status, and the resulting cost in accuracy, across a large set of classification techniques and performance metrics. It demonstrates that it is possible to build explainable and reliable models that maintain good predictive performance.

2021 ◽  
Vol 11 (3) ◽  
pp. 1285
Author(s):  
Roberto Porto ◽  
José M. Molina ◽  
Antonio Berlanga ◽  
Miguel A. Patricio

Learning systems have been focused on creating models capable of obtaining the best results in error metrics. Recently, the focus has shifted to improvement in the interpretation and explanation of the results. The need for interpretation is greater when these models are used to support decision making. In some areas, this becomes an indispensable requirement, such as in medicine. The goal of this study was to define a simple process to construct a system that could be easily interpreted based on two principles: (1) reduction of attributes without degrading the performance of the prediction systems and (2) selecting a technique to interpret the final prediction system. To describe this process, we selected a problem, predicting cardiovascular disease, by analyzing the well-known Statlog (Heart) data set from the UCI Machine Learning Repository. We analyzed the cost of making predictions easier to interpret by reducing the number of features that explain the classification of health status versus the cost in accuracy. We performed an analysis on a large set of classification techniques and performance metrics, demonstrating that it is possible to construct explainable and reliable models that provide high-quality predictive performance.
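For readers who want to experiment with the attribute-reduction idea, the following is a minimal sketch assuming scikit-learn, a local CSV copy of the Statlog (Heart) data (the file and column names are placeholders), and a single illustrative classifier; it is not the authors' exact pipeline.

```python
# Hypothetical sketch of the attribute-reduction idea: rank features, retrain a
# classifier on progressively smaller subsets, and record the accuracy trade-off.
# File name, target column, and model choice are assumptions, not the paper's setup.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("statlog_heart.csv")          # assumed local copy of the Statlog (Heart) data
X, y = data.drop(columns="presence"), data["presence"]

for k in range(X.shape[1], 0, -1):               # shrink the feature set one attribute at a time
    model = make_pipeline(
        StandardScaler(),
        SelectKBest(mutual_info_classif, k=k),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{k:2d} features -> mean CV accuracy {acc:.3f}")
```

Plotting the printed accuracies against the number of retained features makes the interpretability-versus-accuracy trade-off explicit.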


2017 ◽  
Vol 30 (4) ◽  
pp. 538-564 ◽  
Author(s):  
Grant Duwe

This study examines the development and validation of the Minnesota Sex Offender Screening Tool–4 (MnSOST-4) on a dataset consisting of 5,745 sex offenders released from Minnesota prisons between 2003 and 2012. Bootstrap resampling was used to select predictors, and k-fold and split-sample methods were used to internally validate the MnSOST-4. Using sex offense reconviction within 4 years of release from prison as the failure criterion, the data showed that 130 (2.3%) offenders in the overall sample were recidivists. Multiple classification methods and performance metrics were used to develop the MnSOST-4 and evaluate its predictive performance on the test set. The results from the regularized logistic regression algorithm showed that the MnSOST-4 performed well in predicting sexual recidivism in the test set, achieving an area under the curve (AUC) of 0.835. Additional analyses on the test set revealed that the MnSOST-4 outperformed the Minnesota Sex Offender Screening Tool–3 (MnSOST-3), Minnesota Sex Offender Screening Tool–Revised (MnSOST-R), and Static-99 in predicting sexual reoffending.
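The abstract's protocol (regularized logistic regression, k-fold internal validation, held-out AUC) can be illustrated with the hedged sketch below; the MnSOST-4 data are not public, so synthetic data with a similarly low base rate stand in, and scikit-learn is assumed.

```python
# Illustrative sketch (not the MnSOST-4 itself): fit an L2-regularized logistic
# regression with k-fold cross-validation and report AUC on a held-out split,
# using synthetic data with a low base rate similar to the 2.3% failure rate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5745, n_features=20, weights=[0.977], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegressionCV(Cs=10, cv=10, scoring="roc_auc", max_iter=5000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"held-out AUC: {auc:.3f}")
```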


2004 ◽  
Vol 43 (05) ◽  
pp. 439-444 ◽  
Author(s):  
Michael Schimek

Objectives: A typical bioinformatics task in microarray analysis is the classification of biological samples into two alternative categories. A procedure is needed which, based on the expression levels measured, allows us to compute the probability that a new sample belongs to a certain class. Methods: For the purpose of classification, the statistical approach of binary regression is considered. High dimensionality combined with small sample sizes makes this a challenging task. Standard logit or probit regression fails because of ill-conditioning and poor predictive performance. The concepts of frequentist and Bayesian penalization for binary regression are introduced, and a Bayesian interpretation of the penalized log-likelihood is given. Finally, the role of cross-validation for regularization and feature selection is discussed. Results: Penalization makes classical binary regression a suitable tool for microarray analysis. We illustrate penalized logit and Bayesian probit regression on a well-known data set and compare the results obtained, also with respect to published results from decision trees. Conclusions: The frequentist and Bayesian penalization concepts work equally well on the example data, although some method-specific differences can be observed. Moreover, the Bayesian approach yields a quantification (posterior probabilities) of the bias due to the constraining assumptions.
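A minimal sketch of the frequentist side of this approach, assuming scikit-learn and a simulated p ≫ n "expression matrix" (the Bayesian probit variant would require a dedicated sampler and is omitted):

```python
# Minimal sketch of penalized (ridge) logistic regression for a p >> n setting,
# with the penalty strength chosen by cross-validation; the simulated "expression
# matrix" stands in for a microarray data set and is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n, p = 40, 2000                                  # few samples, many genes
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegressionCV(Cs=20, cv=5, penalty="l2", max_iter=5000).fit(X, y)
x_new = rng.normal(size=(1, p))                  # a "new sample" to be classified
print("P(class 1 | x_new) =", clf.predict_proba(x_new)[0, 1])
```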


Author(s):  
Kenneth Biggio ◽  
Rachel Backes ◽  
Jennifer Crawford

The thermal performance of parabolic trough concentrating solar collectors depends on both the structural and optical characteristics of the design. In order to reduce the cost of energy, advanced concentrating structures must significantly reduce the cost of collectors while maintaining good optical performance. This paper discusses a Finite Element Ray Tracer (FERT) that has been developed specifically to support the commercial design process, tying the whole support structure directly to its optical effects. Consequently, the optical performance metrics go beyond the typical reflector slope error RMS or average intercept factor to present the designer with a spatially resolved analysis of localized performance. By incorporating this analytical method into the structural design process, collector cost and performance can be balanced efficiently and rapidly, allowing for an accelerated design period. At times, this insight has driven better, albeit unexpected, design decisions. The paper presents an overview of the development process that Abengoa R&D uses to take advantage of its analytical optical analysis capability throughout all phases of a project, as well as a review of its implementation. A selection of case studies is also presented to illustrate how FERT enables the designer to identify local areas of concern, diagnose the cause, and quickly develop possible redesign strategies. Finally, the significance of various parameters within the ray tracer is discussed.
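The idea of spatially resolved optical metrics can be illustrated with a toy 2-D Monte Carlo ray trace; this is not FERT, and the geometry, slope-error level, and absorber radius below are assumed values chosen only for illustration.

```python
# Toy 2-D Monte Carlo ray trace of a parabolic trough cross-section, reporting a
# spatially resolved intercept factor per aperture bin rather than a single RMS number.
import numpy as np

focal_len = 1.71        # m, assumed focal length
aperture = 5.77         # m, assumed aperture width
tube_radius = 0.035     # m, assumed absorber tube radius
sigma_slope = 3e-3      # rad, assumed RMS slope error
rng = np.random.default_rng(1)

n_bins, rays_per_bin = 20, 5000
edges = np.linspace(-aperture / 2, aperture / 2, n_bins + 1)
for lo, hi in zip(edges[:-1], edges[1:]):
    x = rng.uniform(lo, hi, rays_per_bin)                    # hit points on the mirror
    y = x**2 / (4 * focal_len)                                # parabola height
    theta = np.arctan(x / (2 * focal_len))                    # ideal surface angle
    theta += rng.normal(0.0, sigma_slope, rays_per_bin)       # local slope error
    rx, ry = -np.sin(2 * theta), np.cos(2 * theta)            # reflected ray direction
    # perpendicular miss distance of the reflected ray from the focal line
    miss = np.abs((0 - x) * ry - (focal_len - y) * rx)
    gamma = np.mean(miss < tube_radius)                       # local intercept factor
    print(f"aperture [{lo:+.2f}, {hi:+.2f}] m: intercept factor {gamma:.3f}")
```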


Author(s):  
Sanjiban Sekhar Roy ◽  
Pulkit Kulshrestha ◽  
Pijush Samui

Drought is a condition of land in which groundwater faces a severe shortage. This condition affects the survival of plants and animals, and it can severely impact ecosystems and agricultural productivity; hence, the economy is also affected. This paper proposes the Deep Belief Network (DBN) learning technique, one of the state-of-the-art machine learning algorithms, for the classification of drought and non-drought images. In addition, k-nearest neighbour (kNN) and random forest learning methods are applied to the classification of the same drought images. The performance of the DBN is compared with that of kNN and random forest. The data set is split into 80:20, 70:30 and 60:40 train/test ratios. Finally, the effectiveness of the three proposed models is measured by various performance metrics.
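A hedged sketch of the comparison protocol follows; the drought imagery is not available here, so random feature vectors stand in for flattened images, and a stack of scikit-learn BernoulliRBMs feeding logistic regression serves only as a rough stand-in for a true DBN.

```python
# Sketch of the evaluation protocol: three models, three train/test splits,
# several performance metrics. Data and the "DBN-like" model are stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.random((600, 256))                       # stand-in for flattened drought / non-drought images
y = (X[:, :8].mean(axis=1) > 0.5).astype(int)    # synthetic labels, purely illustrative

models = {
    "DBN-like": Pipeline([("rbm1", BernoulliRBM(n_components=128, random_state=0)),
                          ("rbm2", BernoulliRBM(n_components=64, random_state=0)),
                          ("clf", LogisticRegression(max_iter=1000))]),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for test_frac in (0.2, 0.3, 0.4):                # the 80:20, 70:30 and 60:40 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac, random_state=0)
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(f"{int((1 - test_frac) * 100)}:{int(test_frac * 100)} {name:12s} "
              f"acc={accuracy_score(y_te, pred):.3f} f1={f1_score(y_te, pred):.3f}")
```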


Author(s):  
Dorian Ruiz Alonso ◽  
Claudia Zepeda Cortés ◽  
Hilda Castillo Zacatelco ◽  
José Luis Carballido Carranza

In this work, we propose the extension of a methodology for the multi-label classification of feedback according to the Hattie and Timperley feedback model, incorporating a hyperparameter tuning stage. We analyze whether adding a hyperparameter tuning stage before running the support vector machine, random forest and multi-label k-nearest neighbors algorithms improves the performance metrics of the multi-label classifiers. These classifiers automatically assign the feedback that a teacher gives to activities submitted by students in online courses on the Blackboard platform to the task, process, regulation, praise and other levels proposed in the Hattie and Timperley feedback model. The grid search strategy is used to tune the hyperparameters of each algorithm. The results show that tuning the hyperparameters improves the performance metrics for the data set used.
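A minimal sketch of the tuning stage, assuming scikit-learn and synthetic multi-label data in place of the feedback corpus (a plain k-nearest-neighbors classifier stands in for ML-kNN):

```python
# Grid search over hyperparameters for three multi-label classifiers; the five
# Hattie-Timperley levels are modelled as five binary labels on synthetic data.
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, Y = make_multilabel_classification(n_samples=500, n_features=100, n_classes=5, random_state=0)

searches = {
    "SVM": GridSearchCV(OneVsRestClassifier(SVC()),
                        {"estimator__C": [0.1, 1, 10], "estimator__kernel": ["linear", "rbf"]},
                        scoring="f1_micro", cv=5),
    "RandomForest": GridSearchCV(RandomForestClassifier(random_state=0),
                                 {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
                                 scoring="f1_micro", cv=5),
    "ML-kNN (approx.)": GridSearchCV(KNeighborsClassifier(),
                                     {"n_neighbors": [3, 5, 7, 9]},
                                     scoring="f1_micro", cv=5),
}

for name, search in searches.items():
    search.fit(X, Y)
    print(f"{name}: best micro-F1 {search.best_score_:.3f} with {search.best_params_}")
```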


Author(s):  
Władysław Homenda ◽  
Agnieszka Jastrzȩbska ◽  
Witold Pedrycz ◽  
Fusheng Yu

In this paper, we look closely at the issue of contaminated data sets, where apart from legitimate (proper) patterns we encounter erroneous patterns. In a typical scenario, the classification of a contaminated data set is always negatively influenced by garbage patterns (referred to as foreign patterns). Ideally, we would like to remove them from the data set entirely. The paper is devoted to a comparison and analysis of three different models capable of performing classification of proper patterns with rejection of foreign patterns. It should be stressed that the studied models are constructed using proper patterns only, and no knowledge about the characteristics of foreign patterns is needed. The methods are illustrated with a case study of handwritten digit recognition, but the proposed approach itself is formulated in a general manner and can therefore be applied to different problems. We distinguish three structures: global, local, and embedded, all capable of eliminating foreign patterns while classifying proper patterns at the same time. A comparison of the proposed models shows that the embedded structure provides the best results, but at the cost of relatively high model complexity, while the local architecture provides satisfying results and is at the same time relatively simple.
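The "global" variant of such an architecture can be sketched roughly as follows, assuming scikit-learn, with the digits data set as the proper patterns and random-noise images standing in for foreign patterns:

```python
# Hedged sketch of a "global" classify-with-rejection architecture: a single one-class
# model, trained only on proper patterns, rejects foreign inputs before an ordinary
# classifier labels the accepted ones.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, random_state=0)

rejector = OneClassSVM(nu=0.05, gamma="scale").fit(X_tr)        # trained on proper patterns only
classifier = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

foreign = np.random.default_rng(0).uniform(0, 16, size=(200, X_tr.shape[1]))  # stand-in garbage
for name, X_eval in [("proper", X_te), ("foreign", foreign)]:
    accepted = rejector.predict(X_eval) == 1
    print(f"{name}: accepted {accepted.mean():.1%} of patterns")
    if accepted.any() and name == "proper":
        acc = (classifier.predict(X_eval[accepted]) == y_te[accepted]).mean()
        print(f"  accuracy on accepted proper patterns: {acc:.3f}")
```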


2022 ◽  
Vol 13 (2) ◽  
pp. 0-0

Pulmonary disease is widespread worldwide, including persistent obstruction of the lungs, pneumonia, asthma, tuberculosis (TB), and other conditions. It is essential to diagnose lung disease promptly, and machine learning models have been developed for this purpose. Many deep learning technologies, including CNNs and capsule networks, are used for lung disease prediction. A basic CNN handles rotated, inclined, or otherwise irregularly oriented images poorly. Therefore, by integrating a spatial transformer network (STN) with a CNN, we propose a new hybrid deep learning architecture named STNCNN. The new model is implemented on the NIH chest X-ray image dataset from the Kaggle repository. STNCNN achieves an accuracy of 69% on the entire dataset, while the accuracy values of vanilla grey, vanilla RGB, and hybrid CNN are 67.8%, 69.5%, and 63.8%, respectively. When the sample data set is used, STNCNN takes much less time to train at the cost of slightly less reliable validation. The proposed STNCNN system therefore simplifies the diagnosis of lung disease for both specialists and physicians.
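A minimal PyTorch sketch of the STN-plus-CNN idea is given below; the image size, channel counts, and number of classes are assumptions, not the architecture reported in the paper.

```python
# Minimal sketch: a localization network predicts an affine transform, the input is
# resampled with it, and a small CNN classifies the warped image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNCNN(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        # localization network: predicts the 6 parameters of a 2x3 affine matrix
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.loc_fc = nn.Sequential(nn.Linear(10 * 12 * 12, 32), nn.ReLU(), nn.Linear(32, 6))
        self.loc_fc[-1].weight.data.zero_()                       # start from the identity transform
        self.loc_fc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # classification CNN applied to the spatially transformed image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 16 * 16, n_classes),
        )

    def forward(self, x):                                          # x: (batch, 1, 64, 64)
        theta = self.loc_fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)            # spatially transformed input
        return self.cnn(x)

logits = STNCNN()(torch.randn(4, 1, 64, 64))                       # smoke test on random "X-rays"
print(logits.shape)                                                # torch.Size([4, 15])
```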


1996 ◽  
Vol 30 (6) ◽  
pp. 824-833 ◽  
Author(s):  
Gordon Parker ◽  
Dusan Hadzi-Pavlovic ◽  
Kay Wilhelm ◽  
Marie-Paule Austin ◽  
Catherine Mason ◽  
...  

Objective: We seek to improve the definition and classification of the personality disorders (PDs) and derive a large database for addressing this objective. Method: The paper describes the rationale for the development of a large set of descriptors of the PDs (including all DSM-IV and ICD-10 descriptors, but enriched by an additional 109 items), the design of parallel self-report (SR) and corroborative witness (CW) measures, sample recruitment (of 863 patients with a priori evidence of personality disorder or disturbance) and preliminary descriptive data. Results: Analyses (particularly those comparing ratings on molar PD descriptions with putative PD dimensions) argue for acceptable reliability of the data set, while both the size of the sample and the representation of all PD dimensions of interest argue for the adequacy of the database. Conclusions: We consider in some detail current limitations to the definition and classification of the PDs, and foreshadow the analytic techniques that will be used to address the key objectives of allowing the PDs to be modelled more clearly and, ideally, measured with greater precision and validity.


2018 ◽  
Vol 4 ◽  
pp. e156 ◽  
Author(s):  
Lucía Santamaría ◽  
Helena Mihaljević

The increased interest in analyzing and explaining gender inequalities in tech, media, and academia highlights the need for accurate inference methods to predict a person’s gender from their name. Several such services exist that provide access to large databases of names, often enriched with information from social media profiles, culture-specific rules, and insights from sociolinguistics. We compare and benchmark five name-to-gender inference services by applying them to the classification of a test data set consisting of 7,076 manually labeled names. The compiled names are analyzed and characterized according to their geographical and cultural origin. We define a series of performance metrics to quantify various types of classification errors, and define a parameter tuning procedure to search for optimal values of the services’ free parameters. Finally, we perform benchmarks of all services under study regarding several scenarios where a particular metric is to be optimized.
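The benchmarking step can be sketched as below; the metric names and formulas are illustrative approximations of confusion-matrix-based error rates (including an "unknown" response), not the exact definitions used in the study.

```python
# Given manually labelled names and a service's predictions (including "unknown"),
# count (true, predicted) pairs and compute error rates that treat misclassifications
# and non-classifications separately.
from collections import Counter

def evaluate(true_labels, predicted_labels):
    counts = Counter(zip(true_labels, predicted_labels))      # (true, predicted) pairs
    c = lambda t, p: counts[(t, p)]
    classified = c("f", "f") + c("f", "m") + c("m", "f") + c("m", "m")
    unknown = c("f", "u") + c("m", "u")
    total = classified + unknown
    return {
        "error_without_unknown": (c("f", "m") + c("m", "f")) / classified,
        "unknown_rate": unknown / total,
        "error_including_unknown": (c("f", "m") + c("m", "f") + unknown) / total,
    }

# toy example: 'f' = female, 'm' = male, 'u' = service could not decide
truth = ["f", "f", "m", "m", "m", "f"]
preds = ["f", "m", "m", "u", "m", "f"]
print(evaluate(truth, preds))
```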

