Weighted logistic regression for large-scale imbalanced and rare events data

2014 ◽  
Vol 59 ◽  
pp. 142-148 ◽  
Author(s):  
Maher Maalouf ◽  
Mohammad Siddiqi
2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Marjan Faghih ◽  
Zahra Bagheri ◽  
Dejan Stevanovic ◽  
Seyyed Mohammad Taghi Ayatollahi ◽  
Peyman Jafari

The logistic regression (LR) model for assessing differential item functioning (DIF) relies heavily on asymptotic sampling distributions. For rare events data, however, maximum likelihood estimates may be biased and the asymptotic distributions may be unreliable. This study compares regular maximum likelihood (ML) estimation with two bias-correction methods, weighted logistic regression (WLR) and Firth's penalized maximum likelihood (PML), for assessing DIF in imbalanced or rare events data. The power and type I error rate of the LR model for detecting DIF were investigated under different combinations of sample size, moderate and severe magnitudes of uniform DIF (DIF = 0.4 and 0.8), sample size ratio, number of items, and degree of imbalance (τ). Under severe imbalance (τ = 0.069), the power of PML and ML was lower than that of WLR by approximately 30% and 24%, respectively, under DIF = 0.4, and by 27% and 23% under DIF = 0.8. The results indicate that WLR outperforms both ML and PML estimation when logistic regression is used to evaluate DIF for imbalanced or rare events data.
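The abstract does not spell out how the WLR weights are built. As a minimal sketch, assuming the commonly used rare-events weighting in which events receive weight τ/ȳ and nonevents (1 − τ)/(1 − ȳ) (τ taken here as the population event fraction, ȳ as the sample event fraction), a weighted logistic fit can be written with NumPy alone. The function names `rare_event_weights` and `fit_weighted_logit` are hypothetical, and the simulated data are purely illustrative.

```python
import numpy as np


def rare_event_weights(y, tau):
    """Per-observation weights: tau/ybar for events, (1-tau)/(1-ybar) for nonevents.

    Assumption: tau is the population event fraction, ybar the sample fraction.
    """
    ybar = y.mean()
    return np.where(y == 1, tau / ybar, (1.0 - tau) / (1.0 - ybar))


def fit_weighted_logit(X, y, weights, n_iter=50, tol=1e-10):
    """Maximize sum_i w_i * [y_i*log(p_i) + (1-y_i)*log(1-p_i)] by Newton-Raphson.

    Returns the coefficient vector with the intercept first.
    """
    X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (weights * (y - p))  # weighted score vector
        info = (X * (weights * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(info, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


if __name__ == "__main__":
    # Illustrative simulation only: a rare outcome, then a balanced subsample.
    rng = np.random.default_rng(0)
    n_pop = 200_000
    x = rng.normal(size=n_pop)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-4.0 + x))))
    tau = y.mean()  # population event fraction, known here by construction
    events = np.flatnonzero(y == 1)
    nonevents = rng.choice(np.flatnonzero(y == 0), size=events.size, replace=False)
    keep = np.concatenate([events, nonevents])
    w = rare_event_weights(y[keep], tau)
    print(fit_weighted_logit(x[keep], y[keep], w))  # [intercept, slope]
```

With these weights, an intercept-only fit recovers logit(τ) regardless of how heavily events were oversampled, which is a quick sanity check on the weighting.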


2001 ◽  
Vol 9 (2) ◽  
pp. 137-163 ◽  
Author(s):  
Gary King ◽  
Langche Zeng

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
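The abstract does not give formulas. One correction usually associated with sampling all events and only a fraction of nonevents is prior correction of the intercept, sketched below. The notation is an assumption made here: τ is the population fraction of events, ȳ is the fraction of events in the sample, and β̂₀ is the intercept estimated by ordinary logistic regression on the sample.

```latex
\hat{\beta}_0^{\,\text{corrected}}
  = \hat{\beta}_0
  - \ln\left[ \left( \frac{1-\tau}{\tau} \right)
              \left( \frac{\bar{y}}{1-\bar{y}} \right) \right]
```

Under this form the slope estimates are left unchanged. When events are oversampled (ȳ > τ), the logarithm is positive, so the corrected intercept, and with it the predicted event probabilities, move downward.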


2021 ◽  
Vol 10 (5) ◽  
pp. 933
Author(s):  
Byung Woo Cho ◽  
Du Seong Kim ◽  
Hyuck Min Kwon ◽  
Ick Hwan Yang ◽  
Woo-Suk Lee ◽  
...  

Few studies have reported the relationship between knee pain and hypercholesterolemia in the elderly population with osteoarthritis (OA), independent of other variables. The aim of this study was to examine the association between knee pain and metabolic diseases, including hypercholesterolemia, using a large-scale cohort. A cross-sectional study was conducted using data from the Korea National Health and Nutrition Examination Survey (KNHANES-V, VI-1; 2010–2013). Among the subjects aged ≥60 years, 7438 subjects (weighted number estimate = 35,524,307) who responded to the knee pain item and underwent plain knee radiography were enrolled. Using multivariable ordinal logistic regression analysis, variables affecting knee pain were identified, and the odds ratio (OR) was calculated. Of the 35,524,307 subjects, 10,630,836 (29.9%) experienced knee pain. Overall, 20,290,421 subjects (56.3%) had radiographic OA, and 8,119,372 (40.0%) of them complained of knee pain. Multivariable ordinal logistic regression analysis showed that among the metabolic diseases, only hypercholesterolemia was positively associated with knee pain in the OA group (OR 1.24; 95% confidence interval 1.02–1.52, p = 0.033). No metabolic disease was associated with knee pain in the non-OA group. This large-scale study revealed that in the elderly, hypercholesterolemia was positively associated with knee pain independent of body mass index and other metabolic diseases in the OA group, but not in the non-OA group. These results will help in understanding the nature of arthritic pain and may support the need to explore longitudinal associations.


2021 ◽  
Vol 42 (Supplement_1) ◽  
pp. S33-S34
Author(s):  
Morgan A Taylor ◽  
Randy D Kearns ◽  
Jeffrey E Carter ◽  
Mark H Ebell ◽  
Curt A Harris

Introduction: A nuclear disaster would generate an unprecedented volume of thermal burn patients from the explosion and subsequent mass fires (Figure 1). Prediction models characterizing outcomes for these patients may better equip healthcare providers and other responders to manage large-scale nuclear events. Logistic regression models have traditionally been employed to develop prediction scores for mortality of all burn patients. However, other healthcare disciplines have increasingly transitioned to machine learning (ML) models, which are automatically generated and continually improved, potentially increasing predictive accuracy. Preliminary research suggests ML models can predict burn patient mortality more accurately than commonly used prediction scores. The purpose of this study is to examine the efficacy of various ML methods in assessing thermal burn patient mortality and length of stay in burn centers.
Methods: This retrospective study identified patients with fire/flame burn etiologies in the National Burn Repository from 2009 to 2018. Patients were randomly partitioned into a 67%/33% split for training and validation. A random forest (RF) model and an artificial neural network (ANN) were then constructed for each outcome, mortality and length of stay. These models were then compared with logistic regression models and previously developed prediction tools with similar outcomes using a combination of classification and regression metrics.
Results: During the study period, 82,404 burn patients with a thermal etiology were identified. The ANN models will likely tend to overfit the data, which can be addressed by ending model training early or adding regularization parameters. Further exploration of the advantages and limitations of these models is forthcoming as metric analyses become available.
Conclusions: In this proof-of-concept study, we anticipate that at least one ML model will predict the targeted outcomes of thermal burn patient mortality and length of stay as judged by the fidelity with which it matches the logistic regression analysis. These advancements can then help disaster preparedness programs consider resource limitations during catastrophic incidents resulting in burn injuries.
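The abstract states only that patients were randomly partitioned 67%/33% into training and validation sets. One way such a partition could be drawn is sketched below; the function name `split_indices`, the seed, and the rounding rule are assumptions for illustration, not details from the study.

```python
import numpy as np


def split_indices(n_rows, train_frac=0.67, seed=0):
    """Return (train_idx, val_idx): a random, non-overlapping partition of range(n_rows)."""
    rng = np.random.default_rng(seed)  # fixed seed so the split is reproducible
    shuffled = rng.permutation(n_rows)
    n_train = int(round(train_frac * n_rows))
    return shuffled[:n_train], shuffled[n_train:]


# 82,404 is the patient count reported in the abstract; the indices would be
# used to select rows of the (not shown) patient table.
train_idx, val_idx = split_indices(82_404)
```

Shuffling row indices rather than the records themselves keeps the split independent of how the patient table is stored.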

