A comparison of two dissimilarity functions for mixed-type predictor variables in the δ-machine

Author(s):  
Beibei Yuan ◽  
Willem Heiser ◽  
Mark de Rooij

Abstract: The δ-machine is a statistical learning tool for classification based on dissimilarities or distances between the profiles of the observations and the profiles of a representation set; it was proposed by Yuan et al. (J Classif 36(3): 442–470, 2019). So far, the δ-machine has been restricted to continuous predictor variables. In this article, we extend the δ-machine to handle continuous, ordinal, nominal, and binary predictor variables. We use a tailored dissimilarity function for mixed-type variables defined by Gower, which has the properties of a Manhattan distance. In a similar vein, we develop a Euclidean dissimilarity function for mixed-type variables. In simulation studies, we compare the performance of the two dissimilarity functions, and we compare the predictive performance of the δ-machine with that of logistic regression models. We generated data according to two population distributions, varying the type of predictor variables, the distribution of categorical variables, and the number of predictor variables. We investigated the performance of the δ-machine using the two dissimilarity functions and different types of representation set. The simulation studies showed that the adjusted Euclidean dissimilarity function performed better than the adjusted Gower dissimilarity function; that the δ-machine outperformed logistic regression; and that, for constructing the representation set, K-medoids clustering yielded fewer active exemplars than K-means clustering while maintaining accuracy. We also applied the δ-machine to an empirical example, discussed its interpretation in detail, and compared its classification performance with that of five other classification methods. The results showed that the δ-machine strikes a good balance between accuracy and interpretability.
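As a rough illustration of the two kinds of dissimilarity contrasted in this abstract, the sketch below computes the classical Gower dissimilarity (a mean of per-variable differences, Manhattan-like) and a Euclidean-style analogue that squares the same per-variable differences before summing. The abstract does not give the authors' "adjusted" definitions, so both functions, their names, the `kinds` and `ranges` arguments, and the example profiles are assumptions made for illustration only.

```python
import math

def per_variable_differences(x, y, kinds, ranges):
    """Per-variable differences in [0, 1] between two profiles.

    kinds[j] is "num" for numeric variables (continuous, or ordinal coded
    as ranks) and "cat" for nominal or binary variables. ranges[j] is the
    observed range of numeric variable j and is ignored for "cat".
    """
    diffs = []
    for xj, yj, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            # Range-normalised absolute difference; 0 if the variable is constant.
            diffs.append(abs(xj - yj) / rng if rng else 0.0)
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            diffs.append(0.0 if xj == yj else 1.0)
    return diffs

def gower_dissimilarity(x, y, kinds, ranges):
    """Classical Gower dissimilarity: mean of the per-variable differences."""
    diffs = per_variable_differences(x, y, kinds, ranges)
    return sum(diffs) / len(diffs)

def euclidean_style_dissimilarity(x, y, kinds, ranges):
    """Hypothetical Euclidean analogue: root of the summed squared differences."""
    diffs = per_variable_differences(x, y, kinds, ranges)
    return math.sqrt(sum(d * d for d in diffs))

# Hypothetical profiles: height (continuous), rating (ordinal), colour (nominal), flag (binary).
kinds = ("num", "num", "cat", "cat")
ranges = (0.5, 4, None, None)
a = (1.70, 3, "red", 1)
b = (1.80, 1, "blue", 1)

print(gower_dissimilarity(a, b, kinds, ranges))            # about 0.425
print(euclidean_style_dissimilarity(a, b, kinds, ranges))  # about 1.136
```

In a δ-machine-style workflow, dissimilarities of this kind between each observation and each member of the representation set would serve as the predictors of the classifier; that step is not shown here.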

2003 ◽  
Vol 93 (4) ◽  
pp. 428-435 ◽  
Author(s):  
E. D. De Wolf ◽  
L. V. Madden ◽  
P. E. Lipps

Logistic regression models for wheat Fusarium head blight were developed using information collected at 50 location-years, including four states, representing three different U.S. wheat-production regions. Non-parametric correlation analysis and stepwise logistic regression analysis identified combinations of temperature, relative humidity, and rainfall or durations of specified weather conditions, for 7 days prior to anthesis, and 10 days beginning at crop anthesis, as potential predictor variables. Prediction accuracy of developed logistic regression models ranged from 62 to 85%. Models suitable for application as a disease warning system were identified based on model prediction accuracy, sensitivity, specificity, and availability of weather variables at crop anthesis. Four of the identified models correctly classified 84% of the 50 location-years. A fifth model that used only pre-anthesis weather conditions correctly classified 70% of the location-years. The most useful predictor variables were the duration (h) of precipitation 7 days prior to anthesis, duration (h) that temperature was between 15 and 30°C 7 days prior to anthesis, and the duration (h) that temperature was between 15 and 30°C and relative humidity was greater than or equal to 90%. When model performance was evaluated with an independent validation set (n = 9), prediction accuracy was only 6% lower than the accuracy for the original data sets. These results indicate that narrow time periods around crop anthesis can be used to predict Fusarium head blight epidemics.


2021 ◽  
Vol 29 (1) ◽  
Author(s):  
Hezlin Aryani Abd Rahman ◽  
Yap Bee Wah ◽  
Ong Seng Huat

Logistic regression is often used to classify a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data will bias the parameter estimates and the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled IR values from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean squared error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000, and 2500–3500, the estimates were biased for IR below 30%, 10%, 5%, and 2%, respectively. Results also showed that parameter estimates were biased at an IR of 1% for all sample sizes. An application using a real dataset supported the simulation results.
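The simulation design described in this abstract can be sketched roughly as follows. This is not the authors' code: the function names, the parameter values, and the use of a single binary covariate are assumptions. With one binary covariate the slope of a logistic regression equals the log odds ratio of the 2×2 table, and the sketch adds 0.5 to every cell (the Haldane-Anscombe correction) so that the estimate stays finite when a cell is empty, which makes it an approximation of the maximum likelihood estimate rather than the estimate itself.

```python
import math
import random

def simulate_slope_estimates(n, beta0, beta1, reps, seed=0):
    """Simulate `reps` datasets of size n and estimate the slope in each.

    The outcome follows P(y = 1 | x) = 1 / (1 + exp(-(beta0 + beta1 * x)))
    with a binary covariate x. A more negative beta0 gives fewer cases
    with y = 1, i.e. a lower imbalance ratio.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        counts = [[0, 0], [0, 0]]  # counts[x][y]
        for _ in range(n):
            x = rng.randint(0, 1)
            p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
            y = 1 if rng.random() < p else 0
            counts[x][y] += 1
        # Corrected log odds ratio (0.5 added to each cell).
        a = counts[1][1] + 0.5
        b = counts[1][0] + 0.5
        c = counts[0][1] + 0.5
        d = counts[0][0] + 0.5
        estimates.append(math.log((a * d) / (b * c)))
    return estimates

def mse(estimates, truth):
    """Mean squared error of the estimates around the true parameter."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

# Hypothetical settings: a strongly imbalanced outcome at two sample sizes.
for n in (100, 1000):
    est = simulate_slope_estimates(n=n, beta0=-3.0, beta1=0.5, reps=200, seed=1)
    print(n, mse(est, truth=0.5))
```

Repeating the loop over a grid of `beta0` values and sample sizes would give a table of MSE by imbalance ratio and sample size, which is the kind of comparison the abstract reports.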


Stats ◽  
2021 ◽  
Vol 4 (3) ◽  
pp. 665-681
Author(s):  
Luca Insolia ◽  
Ana Kenney ◽  
Martina Calovi ◽  
Francesca Chiaromonte

High-dimensional classification studies have become widespread across various domains. The large dimensionality, coupled with the possible presence of data contamination, motivates the use of robust, sparse estimation methods to improve model interpretability and ensure the majority of observations agree with the underlying parametric model. In this study, we propose a robust and sparse estimator for logistic regression models, which simultaneously tackles the presence of outliers and/or irrelevant features. Specifically, we propose the use of L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. We use our proposal to investigate the main drivers of honey bee (Apis mellifera) loss through the annual winter loss survey data collected by the Pennsylvania State Beekeepers Association. Previous studies mainly focused on predictive performance; our approach, however, produces a more interpretable classification model and provides evidence for several outlying observations within the survey data. We compare our proposal with existing heuristic methods and non-robust procedures, demonstrating its effectiveness. In addition to the application to honey bee loss, we present a simulation study where our proposal outperforms other methods across most performance measures and settings.


2021 ◽  
Vol 7 ◽  
Author(s):  
Yi-Ting Hwang ◽  
Hui-Ling Lee ◽  
Cheng-Hui Lu ◽  
Po-Cheng Chang ◽  
Hung-Ta Wo ◽  
...  

Aims: Curved M-mode images of global strain (GS) and strain rate (GSR) provide sufficiently detailed spatiotemporal information of deformation mechanics. This study investigated whether a deep convolutional neural network (CNN) could accurately classify these images in patients with atrial fibrillation (AF) who underwent radiofrequency catheter ablation (RFCA) with different outcomes. Methods and Results: We retrospectively evaluated 606 consecutive patients who underwent RFCA for drug-refractory AF. Patients were divided into AF-free (n = 443) and AF-recurrent (n = 163) groups. Transthoracic echocardiography was performed within 24 h after RFCA. Left atrial curved M-mode speckle-tracking images were acquired from 163 randomly selected patients in the AF-free group and the 163 patients in the AF-recurrent group as the dataset for deep CNN modeling. We used the ReLU activation function and ran the CNN model 32 times to evaluate the stability of the hyperparameters. Logistic regression models with the left atrial dimension, emptying fraction, and peak systolic GS as predictor variables were used for comparison. Images from the apical 2-chamber (2-C) and 4-chamber (4-C) views had distinct features, leading to different CNN performance between settings; of them, the “4-C GS+4-C GSR” setting provided the highest performance index values. All four predictor variables used for logistic regression modeling were significant; however, none of them, individually or in any combined form, could outperform the optimal CNN model. Conclusion: The novel approach using deep CNNs for learning features of left atrial curved M-mode speckle-tracking images seems to be optimal for classifying outcome status after AF ablation.


2022 ◽  
Vol 11 (2) ◽  
pp. 313
Author(s):  
Mohsen Mazidi ◽  
Richard J. Webb ◽  
Gregory Y. H. Lip ◽  
Andre P. Kengne ◽  
Maciej Banach ◽  
...  

Low-density lipoprotein cholesterol (LDL-C) and apolipoprotein B (ApoB) are established markers of atherosclerotic cardiovascular disease (ASCVD), but when concentrations are discordant ApoB is the superior predictor. Chronic kidney disease (CKD) is associated with ASCVD, yet the independent role of atherogenic lipoproteins is contentious. Four groups were created based upon high and low levels of ApoB and LDL-C. Continuous and categorical variables were compared across groups, as were adjusted markers of CKD. Logistic regression analysis assessed association(s) with CKD based on the groups. Subjects were categorised by LDL-C and ApoB, using cut-off values of >160 mg/dL and >130 mg/dL, respectively. Those with low LDL-C and high ApoB, compared to those with high LDL-C and high ApoB, had significantly higher body mass index (30.7 vs. 30.1 kg/m²) and waist circumference (106.1 vs. 102.7 cm) and the highest fasting blood glucose (117.5 vs. 112.7 mg/dL), insulin (16.6 vs. 13.1 μU/mL) and homeostatic model assessment of insulin resistance (5.3 vs. 3.7) profiles (all p < 0.001). This group, compared to those with high LDL-C and high ApoB, also had the highest levels of urine albumin (2.3 vs. 2.2 mg/L), log albumin-creatinine ratio (2.2 vs. 2.1 mg/g) and serum uric acid (6.1 vs. 5.6 mg/dL) and the lowest estimated glomerular filtration rate (81.3 vs. 88.4 mL/min/1.73 m²) (all p < 0.001). In expanded logistic regression models, using the low LDL-C and low ApoB group as a reference, those with low LDL-C and high ApoB had the strongest association with CKD, odds ratio (95% CI) 1.12 (1.08–1.16). Discordantly high levels of ApoB are independently associated with increased likelihood of CKD. ApoB remains associated with metabolic dysfunction, regardless of LDL-C.
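The four-group categorisation described in this abstract can be expressed as a small helper. The sketch below uses the stated cut-offs (LDL-C above 160 mg/dL and ApoB above 130 mg/dL count as high); the function names, the group labels, and the example values are assumptions for illustration and do not come from the study.

```python
LDL_C_CUTOFF_MG_DL = 160.0  # "high" LDL-C is strictly above this value
APOB_CUTOFF_MG_DL = 130.0   # "high" ApoB is strictly above this value

def lipid_group(ldl_c, apob):
    """Return one of the four LDL-C / ApoB groups for values in mg/dL."""
    ldl_level = "high" if ldl_c > LDL_C_CUTOFF_MG_DL else "low"
    apob_level = "high" if apob > APOB_CUTOFF_MG_DL else "low"
    return f"{ldl_level} LDL-C / {apob_level} ApoB"

def is_discordant(ldl_c, apob):
    """True when exactly one of the two markers is above its cut-off."""
    return (ldl_c > LDL_C_CUTOFF_MG_DL) != (apob > APOB_CUTOFF_MG_DL)

# Hypothetical subject with low LDL-C but high ApoB.
print(lipid_group(120.0, 140.0))    # low LDL-C / high ApoB
print(is_discordant(120.0, 140.0))  # True
```

The resulting group label could then enter a logistic regression as a categorical predictor with "low LDL-C / low ApoB" as the reference level, in line with the models the abstract describes.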


2004 ◽  
Vol 50 (1) ◽  
pp. 125-129 ◽  
Author(s):  
G. Brion ◽  
S. Lingeriddy ◽  
T.R. Neelakantan ◽  
M. Wang ◽  
R. Girones ◽  
...  

A database was examined using artificial neural network (ANN) models to investigate the efficacy of predicting PCR-identified Norwalk-like virus presence and absence in shellfish. The relative importance of variables in the model and the predictive power obtained by application of ANN modelling methods were compared with previously developed logistic regression models. In addition, two country-specific datasets were analysed separately with ANN models to determine if the relative importance of the input variables was similar for geographically diverse regions. The results of this analysis found that ANN models predicted Norwalk-like virus presence and absence in shellfish with equivalent, and better, precision than logistic regression models. For overall classification performance, ANN modelling had a rate of 93%, vs 75% for the logistic regression. ANN models were able to illuminate the site-specific relationships between indicators and pathogens.


Agronomy ◽  
2018 ◽  
Vol 8 (9) ◽  
pp. 176 ◽  
Author(s):  
Manuel Díaz-Pérez ◽  
Ángel Carreño-Ortega ◽  
Marta Gómez-Galán ◽  
Ángel-Jesús Callejón-Ferre

The purpose of this study was to demonstrate interest in applying simple and multiple logistic regression analyses to the marketability probability of commercial tomato (Solanum lycopersicum L.) cultivars when the tomatoes are harvested as loose fruit. A fruit’s firmness and commercial quality (softening or over-ripe fruit, cracking, cold damage, and rotting) were determined at 0, 7, 14, and 21 days of storage. The storage test simulated typical conditions from harvest to purchase-consumption by the consumer. The combined simple and multiple analyses of the primary continuous and categorical variables with the greatest influence on the commercial quality of postharvest fruit allowed for a more detailed understanding of the behavior of different tomato cultivars and identified the cultivars with greater marketability probability. The odds ratios allowed us to determine the increase or decrease in the marketability probability when we substituted one cultivar with a reference one. Thus, for example, the marketability probability was approximately 2.59 times greater for ‘Santyplum’ than for ‘Angelle’. Overall, of the studied cultivars, ‘Santyplum’, followed by ‘Dolchettini’, showed greater marketability probability than ‘Angelle’ and ‘Genio’. In conclusion, the logistic regression model is useful for studying and identifying tomato cultivars with good postharvest marketability characteristics.
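To make the reported odds ratio concrete, the sketch below shows how an odds ratio relates to a logistic regression coefficient and how it shifts a probability. Strictly speaking, an odds ratio multiplies the odds rather than the probability itself, so the change in probability depends on the baseline. The helper names and the baseline probabilities used here are hypothetical; only the value 2.59 comes from the abstract.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)

def apply_odds_ratio(p_reference, odds_ratio):
    """Probability for the comparison group, given the reference
    probability and the odds ratio relative to the reference."""
    return prob_from_odds(odds(p_reference) * odds_ratio)

odds_ratio = 2.59                   # 'Santyplum' relative to 'Angelle'
coefficient = math.log(odds_ratio)  # the matching logistic regression coefficient
print(coefficient)

# Hypothetical baseline marketability probabilities for the reference cultivar.
for p_ref in (0.2, 0.5, 0.8):
    print(p_ref, apply_odds_ratio(p_ref, odds_ratio))
```

For a baseline of 0.5 the odds are 1, so the comparison probability is 2.59 / 3.59, roughly 0.72; at other baselines the same odds ratio gives a different change in probability.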


2013 ◽  
Vol 103 (9) ◽  
pp. 906-919 ◽  
Author(s):  
D. A. Shah ◽  
J. E. Molineros ◽  
P. A. Paul ◽  
K. T. Willyerd ◽  
L. V. Madden ◽  
...  

Our objective was to identify weather-based variables in pre- and post-anthesis time windows for predicting major Fusarium head blight (FHB) epidemics (defined as FHB severity ≥ 10%) in the United States. A binary indicator of major epidemics for 527 unique observations (31% of which were major epidemics) was linked to 380 predictor variables summarizing temperature, relative humidity, and rainfall in 5-, 7-, 10-, 14-, or 15-day-long windows either pre- or post-anthesis. Logistic regression models were built with a training data set (70% of the 527 observations) using the leaps-and-bounds algorithm, coupled with bootstrap variable and model selection methods. Misclassification rates were estimated on the training and remaining (test) data. The predictive performance of models with indicator variables for cultivar resistance, wheat type (spring or winter), and corn residue presence was improved by adding up to four weather-based predictors. Because weather variables were intercorrelated, no single model or subset of predictor variables was best based on accuracy, model fit, and complexity. Weather-based predictors in the 15 final empirical models selected were all derivatives of relative humidity or temperature, except for one rainfall-based predictor, suggesting that relative humidity was better at characterizing moisture effects on FHB than other variables. The average test misclassification rate of the final models was 19% lower than that of models currently used in a national FHB prediction system.

