Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate

2021 ◽  
Vol 29 (1) ◽  
Author(s):  
Hezlin Aryani Abd Rahman ◽  
Yap Bee Wah ◽  
Ong Seng Huat

Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data lead to biased parameter estimates and affect the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled imbalance ratios from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the Mean Square Error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000 and 2500–3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that parameter estimates were biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
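The kind of simulation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' design: it assumes a single balanced binary covariate, a true slope of 1.0, and treats the imbalance ratio as the marginal event rate. The function names `intercept_for_rate` and `simulate_mse` are hypothetical. With one binary covariate the maximum-likelihood slope equals the log odds ratio of the 2x2 table, so no fitting library is needed; replicates with an empty cell (where that estimate is not finite) are skipped.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def intercept_for_rate(target_rate, beta1, p_x=0.5):
    """Bisection for the intercept that yields the target marginal event rate."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        rate = (1 - p_x) * sigmoid(mid) + p_x * sigmoid(mid + beta1)
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_mse(n, event_rate, beta1=1.0, reps=200, seed=0):
    """MSE of the estimated slope over `reps` simulated datasets of size n."""
    rng = random.Random(seed)
    beta0 = intercept_for_rate(event_rate, beta1)
    squared_errors = []
    for _ in range(reps):
        counts = [[0, 0], [0, 0]]  # counts[x][y]
        for _ in range(n):
            x = 1 if rng.random() < 0.5 else 0  # assumed balanced covariate
            y = 1 if rng.random() < sigmoid(beta0 + beta1 * x) else 0
            counts[x][y] += 1
        (a, b), (c, d) = counts  # a: x=0,y=0  b: x=0,y=1  c: x=1,y=0  d: x=1,y=1
        if min(a, b, c, d) == 0:
            continue  # log odds ratio is not finite; skip this replicate
        estimate = math.log((d * a) / (c * b))  # MLE of the slope
        squared_errors.append((estimate - beta1) ** 2)
    if not squared_errors:
        return float("nan")
    return sum(squared_errors) / len(squared_errors)
```

Calling `simulate_mse(n=100, event_rate=0.05)` and `simulate_mse(n=2000, event_rate=0.5)` gives a rough feel for how the error in the slope estimate grows as the sample shrinks and the outcome becomes rarer. Skipping replicates with empty cells is a simplification that itself affects the result at very low event rates.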

2020 ◽  
Vol 28 (4) ◽  
Author(s):  
Hezlin Aryani Abd Rahman ◽  
Yap Bee Wah ◽  
Ong Seng Huat

Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data lead to biased parameter estimates and affect the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled imbalance ratios from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the Mean Square Error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000 and 2500–3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that parameter estimates were biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.


Author(s):  
Jeremy Freese

This article presents a method and program for identifying poorly fitting observations for maximum-likelihood regression models for categorical dependent variables. After estimating a model, the program leastlikely will list the observations that have the lowest predicted probabilities of observing the value of the outcome category that was actually observed. For example, when run after estimating a binary logistic regression model, leastlikely will list the observations with a positive outcome that had the lowest predicted probabilities of a positive outcome and the observations with a negative outcome that had the lowest predicted probabilities of a negative outcome. These can be considered the observations in which the outcome is most surprising given the values of the independent variables and the parameter estimates and, like observations with large residuals in ordinary least squares regression, may warrant individual inspection. Use of the program is illustrated with examples using binary and ordered logistic regression.
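The idea behind the listing can be sketched in a few lines. This is not the leastlikely program itself, only an illustration of the rule the abstract describes for the binary case; the function name `least_likely` and its arguments are hypothetical, and the predicted probabilities are assumed to come from an already fitted model.

```python
def least_likely(outcomes, p_positive, k=5):
    """For each observed outcome value, return the indices of the k observations
    whose predicted probability of that observed outcome is lowest."""
    by_outcome = {}
    for index, (y, p) in enumerate(zip(outcomes, p_positive)):
        # Probability the model assigned to what was actually observed.
        p_observed = p if y == 1 else 1.0 - p
        by_outcome.setdefault(y, []).append((p_observed, index))
    return {
        y: [index for _, index in sorted(rows)[:k]]
        for y, rows in by_outcome.items()
    }
```

For example, `least_likely([1, 1, 0, 0, 1], [0.9, 0.1, 0.2, 0.8, 0.5], k=1)` picks observation 1 among the positive outcomes and observation 3 among the negative outcomes, the two cases the model found most surprising.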


Author(s):  
Riswan Riswan

An Item Response Theory (IRT) model contains one or more parameters. These parameters are unknown, so they must be estimated. This paper aims to determine (1) the effect of sample size (N) on the stability of the item parameter estimates, (2) the effect of test length (n) on the stability of the examinee parameter estimates, (3) the effect of the model on the stability of the item and examinee parameter estimates, (4) the effect of sample size and test length on the stability of the item and examinee parameter estimates, and (5) the effect of sample size, test length, and model on the stability of the item and examinee parameter estimates. This paper is a simulation study in which the latent trait (θ) is drawn from a standard normal population, θ ~ N(0, 1), with a specific sample size (N) and test length (n), using the 1PL, 2PL and 3PL models in WinGen. Item analysis was carried out using both the classical test theory approach and modern test theory. The IRT analysis was performed in R with the ltm package. The results showed that the larger the sample size (N), the more stable the estimated item parameters. Likewise, the greater the test length (n), the more stable the estimated examinee parameter (θ).
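For readers unfamiliar with the three models named here, the standard item response function can be sketched as below. This is the textbook logistic form, not code from the study; the function name `irt_probability` is hypothetical, `theta` is the examinee's latent trait, `b` the item difficulty, `a` the discrimination and `c` the pseudo-guessing parameter.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With c = 0 this reduces to the 2PL model; with c = 0 and a fixed
    common discrimination (here a = 1) it reduces to the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

When `theta` equals `b`, the 1PL and 2PL models give a probability of 0.5, while the 3PL model gives the midpoint between `c` and 1.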


Author(s):  
Tadele Tesfaye Labiso

The overall objective of this study is to explore the practice and challenges of villagization in selected woredas of the Assosa zone, Benishangul-Gumuz regional state. To achieve the goals of the survey, a mixed research method was employed. A sample of 168 households was determined using S = X²NP(1 − P) ÷ [d²(N − 1) + X²P(1 − P)]. The research employed an exploratory research design on the challenges and implementation of the program, and it applied mainly qualitative methods. Depending on the type of data gathered and the instrument used, both quantitative and qualitative techniques of data analysis and binary logistic regression supported by SPSS were employed. Before villagization, the only good thing about life was farming, since people had fertile land. When villagization was implemented, the lives of the villagers improved because they started to have better access to social services. The study showed that villagization was implemented voluntarily and based on the consent of the local people. It is therefore possible to conclude that villagization has significantly improved the lives of the villagers by bringing positive changes that did not exist before.
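The sample size formula quoted above has the form usually attributed to Krejcie and Morgan (1970). A small sketch of the calculation follows. The abstract does not report the population size or the constants used, so the defaults here (chi-square value 3.841 for one degree of freedom at 95% confidence, P = 0.5, d = 0.05) are the conventional choices and are assumptions, as is the function name.

```python
def krejcie_morgan_sample_size(population, chi_square=3.841, p=0.5, d=0.05):
    """S = X^2 * N * P * (1 - P) / (d^2 * (N - 1) + X^2 * P * (1 - P)).

    population : N, the size of the population being sampled
    chi_square : X^2, chi-square value for 1 df (assumed 3.841, i.e. 95%)
    p          : P, assumed population proportion (0.5 maximizes S)
    d          : degree of accuracy as a proportion (assumed 0.05)
    """
    numerator = chi_square * population * p * (1.0 - p)
    denominator = d ** 2 * (population - 1) + chi_square * p * (1.0 - p)
    return numerator / denominator
```

With these defaults a population of 300 gives roughly 168.7, which rounds to 169. The population behind the study's figure of 168 is not stated in the abstract and is not inferred here.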


Psihologija ◽  
2018 ◽  
Vol 51 (4) ◽  
pp. 469-488
Author(s):  
Milica Popovic-Stijacic ◽  
Ljiljana Mihic ◽  
Dusica Filipovic-Djurdjevic

We compared three statistical analyses over binary outcomes. As applying ANOVA over proportions violates at least two classical assumptions of linear models, two alternatives are described: the binary logistic regression and the mixed logit model. Firstly, we compared the effects obtained by the three methods over the same data from a previous memory study. All three methods gave similar results: the effects of the tasks and the number of sensory modalities were observed, but not their interaction. Secondly, by using the bootstrap estimates of the parameters, the efficacy of each method was explored. As predicted, the bootstrap parameter estimates of the ANOVA had large bias and standard errors, and consequently wide confidence intervals. On the other hand, the bootstrap parameter estimates of the binary logistic regression and the mixed logit models were similar: both had low bias and standard errors and narrow confidence intervals.


Author(s):  
N. A. M. R. Senaviratna ◽  
T. M. J. A. Cooray

One of the key problems that arises in a binary logistic regression model is that the explanatory variables being considered are highly correlated among themselves. Multicollinearity causes unstable estimates and inaccurate variances, which affect confidence intervals and hypothesis tests. The aim of this study was to discuss some diagnostic measures for detecting multicollinearity, namely tolerance, Variance Inflation Factor (VIF), condition index and variance proportions. The adapted diagnostics are illustrated with data based on a study of road accidents. Secondary data from 2014 to 2016 were acquired from the Traffic Police headquarters, Colombo, Sri Lanka. The response variable is accident severity, which has two levels: grievous and non-grievous. Multicollinearity is identified by the correlation matrix, tolerance and VIF values, and confirmed by the condition index and variance proportions. A range of solutions is available for logistic regression, such as increasing the sample size, dropping one of the correlated variables, and combining variables into an index. It can safely be concluded that, without increasing the sample size, omitting one of the correlated variables reduces multicollinearity considerably.
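Two of the diagnostics named above, tolerance and VIF, can be sketched without any statistics library. The sketch relies on the standard identity that the VIF of predictor j is the j-th diagonal element of the inverse of the predictors' correlation matrix, and that tolerance is its reciprocal. The function names are hypothetical, and this is an illustration rather than the authors' procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_and_tolerance(columns):
    """Return (VIFs, tolerances) for a list of predictor columns."""
    k = len(columns)
    corr = [[1.0 if i == j else pearson(columns[i], columns[j])
             for j in range(k)] for i in range(k)]
    inverse = invert(corr)
    vifs = [inverse[j][j] for j in range(k)]
    return vifs, [1.0 / v for v in vifs]
```

Two predictors with a correlation of 0.8, for instance, each have a tolerance of 1 − 0.8² = 0.36 and a VIF of about 2.78. Which VIF threshold counts as problematic is a judgment call that the abstract does not specify.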


2021 ◽  
pp. 003464462110651
Author(s):  
Mona Ray

The inquiry in this paper has two parts: (1) an examination of potential disparities in exposure to airport noise pollution between Blacks (non-Hispanic) and Whites (non-Hispanic) around the Atlanta Hartsfield-Jackson airport (AHJA) area, and (2) a binary logistic regression analysis studying factors contributing to these disparities. The proposed model is that the difference in noise exposure, measured by the Net Exposure Difference score, is a function of the degree of Black-White residential segregation; differences in poverty rates between Blacks and Whites; some socio-economic and demographic variables; and four health indicators: noise annoyance (NA), sleep disturbance (SD), hearing impairment (HI), and cardiovascular disorder (CVD). A stratified random sampling method and a telephone survey using a 43-question questionnaire among adult households around the AHJA area produced 237 observations on Black and White households over a period of 2 years. Parameter estimates reveal disparities in exposure to aircraft noise between the Black and White households within the 10-mile radius of the airport area, indicating environmental injustice. The odds ratios from the binary logistic regression suggest that residential segregation, difference in poverty rates, race, education, and health conditions such as hearing impairment and sleep disturbance have a statistically significant association with this disparity in noise exposure.


2018 ◽  
Vol 28 (8) ◽  
pp. 2455-2474 ◽  
Author(s):  
Maarten van Smeden ◽  
Karel GM Moons ◽  
Joris AH de Groot ◽  
Gary S Collins ◽  
Douglas G Altman ◽  
...  

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.
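The quantities discussed in this abstract are simple to compute for a planned study. The sketch below is an assumption-laden illustration, not the authors' method: it takes the number of events to be the size of the smaller outcome group, counts candidate predictors as a single number, and uses the hypothetical function name `sample_size_summary`. It reports EPV alongside the three quantities the abstract says matter more.

```python
def sample_size_summary(n_events, n_nonevents, n_candidate_predictors):
    """Return EPV together with total sample size, events fraction and
    number of candidate predictors."""
    total = n_events + n_nonevents
    smaller_group = min(n_events, n_nonevents)  # assumed definition of "events"
    return {
        "epv": smaller_group / n_candidate_predictors,
        "total_sample_size": total,
        "events_fraction": n_events / total,
        "n_candidate_predictors": n_candidate_predictors,
    }
```

A dataset with 100 events, 900 non-events and 10 candidate predictors has an EPV of 10, meeting the EPV ≥10 rule, yet the abstract's argument is that the other three values in the summary are the ones that should drive the sample size decision.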

