Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate

Pertanika Journal of Science and Technology ◽

10.47836/pjst.28.4.02 ◽

2020 ◽

Vol 28 (4) ◽

Author(s):

Hezlin Aryani Abd Rahman ◽

Yap Bee Wah ◽

Ong Seng Huat

Keyword(s):

Logistic Regression ◽

Sample Size ◽

Imbalanced Data ◽

Binary Logistic Regression ◽

Predictive Performance ◽

Parameter Estimate ◽

Classification Performance ◽

Parameter Estimates ◽

Binary Dependent Variable ◽

Simulation Results

Logistic regression is often used to classify a binary dependent variable using continuous or categorical covariates. Imbalanced data, in which one category of the binary dependent variable has far fewer cases than the other, can bias the parameter estimates and affect the classification performance of the logistic regression model. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled IR values from 1% to 50% and for various sample sizes, and were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000-2000 and 2500-3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. The parameter estimates were also biased at an IR of 1% for all sample sizes. An application using a real dataset supported the simulation results.
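The kind of simulation described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the covariate is taken to be a single balanced binary variable, the true slope `beta1 = 1.0` is arbitrary, the imbalance ratio is controlled only in expectation through the intercept, and the function names are hypothetical. For one binary covariate the maximum-likelihood slope equals the sample log odds ratio, so no fitting library is needed.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def intercept_for_rate(target_rate, beta1, p_x=0.5):
    """Bisection search for the intercept that gives the target marginal event rate."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        rate = (1 - p_x) * sigmoid(mid) + p_x * sigmoid(mid + beta1)
        if rate < target_rate:
            lo = mid  # rate too low: the intercept must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


def slope_mle(xs, ys):
    """Slope MLE for one binary covariate: the sample log odds ratio.

    Returns None when a cell of the 2x2 table is empty (no finite MLE).
    """
    n = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        n[x][y] += 1
    if min(n[0][0], n[0][1], n[1][0], n[1][1]) == 0:
        return None
    return math.log(n[1][1] * n[0][0] / (n[1][0] * n[0][1]))


def simulate_mse(n, ir, beta1=1.0, reps=500, seed=0):
    """Return (MSE of the slope estimate, number of replicates with a finite estimate)."""
    rng = random.Random(seed)
    b0 = intercept_for_rate(ir, beta1)
    sq_errors = []
    for _ in range(reps):
        xs = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
        ys = [1 if rng.random() < sigmoid(b0 + beta1 * x) else 0 for x in xs]
        est = slope_mle(xs, ys)
        if est is not None:  # skip replicates where the MLE does not exist
            sq_errors.append((est - beta1) ** 2)
    mse = sum(sq_errors) / len(sq_errors) if sq_errors else float("nan")
    return mse, len(sq_errors)
```

Calling `simulate_mse(100, 0.05)` and `simulate_mse(2000, 0.05)` side by side gives one way to compare the MSE at a fixed IR across sample sizes. The second return value matters at low IR, because replicates with an empty cell are dropped rather than estimated.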

Least Likely Observations in Regression Models for Categorical Outcomes

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x0200200306 ◽

2002 ◽

Vol 2 (3) ◽

pp. 296-300 ◽

Cited By ~ 1

Author(s):

Jeremy Freese

Keyword(s):

Logistic Regression ◽

Regression Models ◽

Binary Logistic Regression ◽

Positive Outcome ◽

Ordinary Least Squares ◽

Parameter Estimates ◽

Least Squares Regression ◽

Binary Logistic Regression Model ◽

Ordered Logistic Regression ◽

Negative Outcome

This article presents a method and program for identifying poorly fitting observations for maximum-likelihood regression models for categorical dependent variables. After estimating a model, the program leastlikely will list the observations that have the lowest predicted probabilities of observing the value of the outcome category that was actually observed. For example, when run after estimating a binary logistic regression model, leastlikely will list the observations with a positive outcome that had the lowest predicted probabilities of a positive outcome and the observations with a negative outcome that had the lowest predicted probabilities of a negative outcome. These can be considered the observations in which the outcome is most surprising given the values of the independent variables and the parameter estimates and, like observations with large residuals in ordinary least squares regression, may warrant individual inspection. Use of the program is illustrated with examples using binary and ordered logistic regression.
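The idea behind this listing can be illustrated outside Stata. The Python sketch below is not the leastlikely program itself; `least_likely` and its arguments are hypothetical names, and it assumes a binary outcome coded 0/1 together with each observation's predicted probability of a positive outcome from an already fitted model.

```python
def least_likely(outcomes, p_positive, k=3):
    """For each outcome class, list the k observations whose observed
    outcome had the lowest predicted probability.

    Returns {class: [(observation index, probability of observed outcome), ...]}.
    """
    scored = []
    for i, (y, p) in enumerate(zip(outcomes, p_positive)):
        # Probability the model assigned to the outcome that actually occurred.
        p_observed = p if y == 1 else 1.0 - p
        scored.append((p_observed, i, y))

    result = {}
    for cls in (0, 1):
        rows = sorted(r for r in scored if r[2] == cls)  # most surprising first
        result[cls] = [(i, p) for p, i, _ in rows[:k]]
    return result
```

For example, with outcomes `[1, 1, 0, 0, 1]` and predicted probabilities `[0.9, 0.2, 0.1, 0.8, 0.6]`, the most surprising positive outcome is observation 1 and the most surprising negative outcome is observation 3.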

Sample Size and Test Length for Item Parameter Estimate and Exam Parameter Estimate

Al-Khwarizmi Jurnal Pendidikan Matematika dan Ilmu Pengetahuan Alam ◽

10.24256/jpmipa.v9i1.2384 ◽

2021 ◽

Vol 9 (1) ◽

pp. 69-78

Author(s):

Riswan Riswan

Keyword(s):

Item Response Theory ◽

Sample Size ◽

Item Response ◽

Parameter Estimate ◽

Test Theory ◽

Item Parameter ◽

Parameter Estimates ◽

Test Length ◽

Response Theory ◽

The Stability

Item Response Theory (IRT) models contain one or more parameters. These parameters are unknown, so they must be estimated. This paper aims to determine (1) the effect of sample size (N) on the stability of item parameter estimates; (2) the effect of test length (n) on the stability of examinee parameter estimates; (3) the effect of the model on the stability of item and examinee parameter estimates; (4) the effect of sample size and test length on the stability of item and examinee parameter estimates; and (5) the effect of sample size, test length, and model on the stability of item and examinee parameter estimates. This is a simulation study in which the latent trait (θ) is sampled from a standard normal population, θ ~ N(0, 1), with specified sample sizes (N) and test lengths (n), under the 1PL, 2PL and 3PL models using WinGen. Item analysis was carried out using both the classical test theory and the modern test theory (IRT) approaches, and the data were analyzed in R with the ltm package. The results showed that the larger the sample size (N), the more stable the estimated parameter, and that the greater the test length (n), the more stable the estimated examinee parameter (θ).
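For readers unfamiliar with the 1PL, 2PL and 3PL models named above, the standard logistic item response function can be sketched as follows. This is a generic textbook form, not code from the paper; the function and parameter names are illustrative (`a` discrimination, `b` difficulty, `c` guessing), and the response simulator assumes θ ~ N(0, 1) as in the abstract.

```python
import math
import random


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With c = 0 this is the 2PL model; with c = 0 and a fixed
    discrimination a it is the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(n_examinees, items, seed=0):
    """Simulate 0/1 responses; `items` is a list of (a, b, c) tuples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # latent trait drawn from N(0, 1)
        data.append([1 if rng.random() < irt_probability(theta, b, a, c) else 0
                     for a, b, c in items])
    return data
```

Varying `n_examinees` and the length of `items` corresponds to the sample size (N) and test length (n) factors of the study.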

Revisiting Socio-economic impact of Villagization, In the Case Assosa Zone, Ethiopia

International Journal for Innovation Education and Research ◽

10.31686/ijier.vol9.iss1.2775 ◽

2021 ◽

Vol 9 (1) ◽

pp. 01-11

Author(s):

Tadele Tesfaye Labiso

Keyword(s):

Logistic Regression ◽

Data Analysis ◽

Qualitative Methods ◽

Sample Size ◽

Social Services ◽

Research Method ◽

Binary Logistic Regression ◽

Survey Study ◽

Exploratory Research ◽

Socio Economic Impact

The overall objective of this study is to explore the practice and challenges of villagization in selected woredas of the Assosa zone, Benishangul-Gumuz regional state. To achieve the goals of the survey study, a mixed research method was employed. A sample of 168 households was determined using $S = \chi^2 N P(1-P) / [d^2 (N-1) + \chi^2 P(1-P)]$. The research employed an exploratory research design on the challenges and implementation of the program, and it applied mainly qualitative methods. Depending on the type of data gathered and the instrument used, both quantitative and qualitative techniques of data analysis were employed, including binary logistic regression in SPSS. The only good thing about life before villagization was farming, since people had fertile lands. But when villagization was implemented, the lives of the villagers improved because they started to have better access to social services. The study showed that villagization was implemented voluntarily and based on the consent of the local people. It is possible to conclude that villagization has significantly improved the lives of the villagers by bringing positive changes that did not exist before.
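The sample size formula quoted in this abstract can be evaluated directly. The sketch below assumes the denominator is d²(N-1) + X²P(1-P) as a whole, which matches the usual finite-population form of this formula. The default values (chi-square of 3.841 for 1 degree of freedom at the 95% level, P = 0.5, d = 0.05) are conventional choices and are assumptions here, since the abstract does not state the values or the population size it used.

```python
def sample_size(population, chi2=3.841, p=0.5, d=0.05):
    """Required sample size S for a finite population of size N.

    S = chi2 * N * P * (1 - P) / (d^2 * (N - 1) + chi2 * P * (1 - P))
    The default chi2, p and d are conventional values, not taken from the study.
    """
    numerator = chi2 * population * p * (1 - p)
    denominator = d ** 2 * (population - 1) + chi2 * p * (1 - p)
    return numerator / denominator
```

With these defaults, a population of 1000 gives a required sample of about 278 after rounding.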

COVARIATES AND SAMPLE SIZE EFFECTS ON PARAMETER ESTIMATION FOR BINARY LOGISTIC REGRESSION MODEL

Malaysian Journal of Science ◽

10.22452/mjs.vol35no1.7 ◽

2016 ◽

Vol 35 (1) ◽

pp. 44-62

Author(s):

Hamzah Abdul Hamid ◽

Bee Wah Yap ◽

Xian-Jin Xie

Keyword(s):

Logistic Regression ◽

Parameter Estimation ◽

Regression Model ◽

Sample Size ◽

Size Effects ◽

Logistic Regression Model ◽

Binary Logistic Regression ◽

Binary Logistic Regression Model

Empirical distributions of parameter estimates in binary logistic regression using bootstrap

International Journal of Mathematical Analysis ◽

10.12988/ijma.2014.4394 ◽

2014 ◽

Vol 8 ◽

pp. 721-726 ◽

Cited By ~ 1

Author(s):

Anwar Fitrianto ◽

Ng Mei Cing

Keyword(s):

Logistic Regression ◽

Binary Logistic Regression ◽

Parameter Estimates ◽

Empirical Distributions

Analyzing data from memory tasks - comparison of ANOVA, logistic regression and mixed logit model

Psihologija ◽

10.2298/psi170615023p ◽

2018 ◽

Vol 51 (4) ◽

pp. 469-488

Author(s):

Milica Popovic-Stijacic ◽

Ljiljana Mihic ◽

Dusica Filipovic-Djurdjevic

Keyword(s):

Logistic Regression ◽

Confidence Intervals ◽

Logit Model ◽

Linear Models ◽

Mixed Logit ◽

Binary Logistic Regression ◽

Standard Errors ◽

Parameter Estimates ◽

Mixed Logit Model ◽

Mixed Logit Models

We compared three statistical analyses of binary outcomes. Because applying ANOVA to proportions violates at least two classical assumptions of linear models, two alternatives are described: binary logistic regression and the mixed logit model. First, we compared the effects obtained by the three methods on the same data from a previous memory study. All three methods gave similar results: the effects of the tasks and the number of sensory modalities were observed, but not their interaction. Second, using bootstrap estimates of the parameters, the efficacy of each method was explored. As predicted, the bootstrap parameter estimates of the ANOVA had large bias and standard errors, and consequently wide confidence intervals. In contrast, the bootstrap parameter estimates of the binary logistic regression and the mixed logit models were similar: both had low bias and standard errors and narrow confidence intervals.
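The bootstrap quantities mentioned here (bias, standard error, confidence interval) can be illustrated with a deliberately simplified case. The sketch assumes an intercept-only logistic model, whose maximum-likelihood estimate is the logit of the observed proportion; this is a stand-in chosen to keep the example self-contained, not the models fitted in the paper. The function names and the percentile interval are likewise illustrative choices.

```python
import math
import random
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def bootstrap_logit(ys, reps=1000, seed=0):
    """Nonparametric bootstrap of the intercept of an intercept-only logistic model.

    Assumes `ys` is a list of 0/1 outcomes containing both values.
    """
    rng = random.Random(seed)
    n = len(ys)
    estimate = logit(sum(ys) / n)
    boots = []
    for _ in range(reps):
        # Resample n observations with replacement and count the positives.
        k = sum(ys[rng.randrange(n)] for _ in range(n))
        if 0 < k < n:  # skip resamples where the logit is not finite
            boots.append(logit(k / n))
    boots.sort()
    return {
        "estimate": estimate,
        "bias": statistics.mean(boots) - estimate,
        "se": statistics.stdev(boots),
        # Approximate 95% percentile interval from the sorted bootstrap estimates.
        "ci95": (boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots)) - 1]),
    }
```

A method with low bias and small standard error, as reported for the logistic and mixed logit models, would show a `bias` near zero and a narrow `ci95` in output of this kind.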

Diagnosing Multicollinearity of Logistic Regression Model

Asian Journal of Probability and Statistics ◽

10.9734/ajpas/2019/v5i230132 ◽

2019 ◽

pp. 1-9 ◽

Cited By ~ 6

Author(s):

N. A. M. R. Senaviratna ◽

T. M. J. A. Cooray

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Sample Size ◽

Logistic Regression Model ◽

Secondary Data ◽

Binary Logistic Regression ◽

Condition Index ◽

Binary Logistic Regression Model ◽

Correlated Variables ◽

Explanatory Variables

One of the key problems that arises in a binary logistic regression model is that the explanatory variables being considered are highly correlated among themselves. Multicollinearity causes unstable estimates and inaccurate variances, which affect confidence intervals and hypothesis tests. The aim of this study was to discuss some diagnostic measures for detecting multicollinearity, namely tolerance, Variance Inflation Factor (VIF), condition index and variance proportions. The adapted diagnostics are illustrated with data from a study of road accidents. The secondary data used in this study, covering 2014 to 2016, were acquired from the Traffic Police headquarters in Colombo, Sri Lanka. The response variable is accident severity, which has two levels: grievous and non-grievous. Multicollinearity is identified by the correlation matrix, tolerance and VIF values and confirmed by the condition index and variance proportions. The solutions available for logistic regression include increasing the sample size, dropping one of the correlated variables and combining variables into an index. It is concluded that, without increasing the sample size, omitting one of the correlated variables can reduce multicollinearity considerably.
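Two of the diagnostics named above, tolerance and VIF, are simple to compute in the special case of exactly two explanatory variables, where the R² from regressing one on the other equals their squared Pearson correlation. The sketch below covers only that case and uses hypothetical function names; with more predictors, each variable would need to be regressed on all the others.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def tolerance_and_vif(x1, x2):
    """Tolerance (1 - R^2) and VIF (1 / tolerance) for a two-predictor model."""
    r_squared = pearson_r(x1, x2) ** 2
    tolerance = 1.0 - r_squared
    # Perfectly collinear predictors have zero tolerance and an infinite VIF.
    vif = float("inf") if tolerance == 0 else 1.0 / tolerance
    return tolerance, vif
```

For `x1 = [1, 2, 3, 4, 5]` and `x2 = [2, 1, 4, 3, 5]` the correlation is 0.8, so the tolerance is 0.36 and the VIF is about 2.78.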

Environmental Justice: Segregation, Noise Pollution and Health Disparities near the Hartsfield-Jackson Airport Area in Atlanta

The Review of Black Political Economy ◽

10.1177/00346446211065176 ◽

2021 ◽

pp. 003464462110651

Author(s):

Mona Ray

Keyword(s):

Logistic Regression ◽

Hearing Impairment ◽

Residential Segregation ◽

Sleep Disturbances ◽

Noise Exposure ◽

Binary Logistic Regression ◽

Noise Pollution ◽

Binary Logistic Regression Analysis ◽

Parameter Estimates ◽

Black And White

The inquiry in this paper has two parts: (1) an examination of potential disparities in exposure to airport noise pollution between Blacks (non-Hispanic) and Whites (non-Hispanic) around the Atlanta Hartsfield-Jackson airport (AHJA) area, and (2) a binary logistic regression analysis studying factors contributing to these disparities. The proposed model is that the difference in noise exposure, measured by the Net Exposure Difference score, is a function of the degree of Black-White residential segregation; differences in poverty rates between Blacks and Whites; some socio-economic and demographic variables; and four health indicators: noise annoyance (NA), sleep disturbance (SD), hearing impairment (HI), and cardiovascular disorder (CVD). A stratified random sampling method and a telephone survey using a 43-question questionnaire among adult households around the AHJA area produced 237 observations on Black and White households over a period of 2 years. Parameter estimates reveal disparities in exposure to aircraft noise between Black and White households within a 10-mile radius of the airport, indicating environmental injustice. The odds ratios from the binary logistic regression suggest that residential segregation, difference in poverty rates, race, education, and health conditions such as hearing impairment and sleep disturbances have a statistically significant association with this disparity in noise exposure.

Sample size for binary logistic prediction models: Beyond events per variable criteria

Statistical Methods in Medical Research ◽

10.1177/0962280218784726 ◽

2018 ◽

Vol 28 (8) ◽

pp. 2455-2474 ◽

Cited By ~ 56

Author(s):

Maarten van Smeden ◽

Karel GM Moons ◽

Joris AH de Groot ◽

Gary S Collins ◽

Douglas G Altman ◽

...

Keyword(s):

Sample Size ◽

Prediction Models ◽

Model Development ◽

Binary Logistic Regression ◽

Predictive Performance ◽

Total Sample ◽

Sample Size Determination ◽

Minimal Sample ◽

Out Of Sample ◽

Before And After

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.
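For reference, the EPV criterion discussed in this abstract can be written out as a small calculation. The sketch uses the common definition in which the number of events is the size of the smaller outcome group; that definition and the function names are assumptions for illustration, not taken from the paper.

```python
def events_per_variable(n_total, events_fraction, n_predictors):
    """EPV: size of the smaller outcome group divided by the number of candidate predictors."""
    minority_fraction = min(events_fraction, 1.0 - events_fraction)
    return n_total * minority_fraction / n_predictors


def min_sample_for_epv(target_epv, events_fraction, n_predictors):
    """Total sample size at which the EPV reaches `target_epv`."""
    minority_fraction = min(events_fraction, 1.0 - events_fraction)
    return target_epv * n_predictors / minority_fraction
```

Under this definition, 1000 subjects with an events fraction of 0.1 and 10 candidate predictors give an EPV of 10. The calculation collapses the number of predictors, the total sample size and the events fraction into one ratio, whereas the abstract recommends considering those three quantities themselves.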
