Comparison of Logistic Regression and Linear Regression in Modeling Percentage Data

2001 ◽  
Vol 67 (5) ◽  
pp. 2129-2135 ◽  
Author(s):  
Lihui Zhao ◽  
Yuhuan Chen ◽  
Donald W. Schaffner

ABSTRACT Percentage is widely used to describe different results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data, percent-growth-positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes, were obtained from our own experiments and the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least 78% of the observations in all four data sets. In all cases, the deviation of the logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight into determining the key experimental factors.
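
As a rough, hedged illustration of the comparison described above (not the authors' code or data), the sketch below fits a straight line and a logistic curve to synthetic percentage data and reports two of the paper's five criteria: the share of observations predicted more closely by the logistic model, and the range of (predicted - observed) for each model.

```python
# Minimal sketch: linear vs. logistic fit to synthetic percentage data.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)                       # placeholder experimental factor
true_p = 1.0 / (1.0 + np.exp(-(x - 5.0)))        # underlying growth-positive fraction
y = np.clip(true_p + rng.normal(0, 0.05, x.size), 0.0, 1.0)

# Linear model: y = a*x + b
lin_coef = np.polyfit(x, y, 1)
y_lin = np.polyval(lin_coef, x)

# Logistic model: y = 1 / (1 + exp(-(c0 + c1*x)))
def logistic(x, c0, c1):
    return 1.0 / (1.0 + np.exp(-(c0 + c1 * x)))

(c0, c1), _ = curve_fit(logistic, x, y, p0=[0.0, 1.0])
y_log = logistic(x, c0, c1)

closer_logistic = np.mean(np.abs(y_log - y) < np.abs(y_lin - y))
print(f"logistic prediction closer for {closer_logistic:.0%} of observations")
print("linear residual range:  ", round((y_lin - y).min(), 3), "to", round((y_lin - y).max(), 3))
print("logistic residual range:", round((y_log - y).min(), 3), "to", round((y_log - y).max(), 3))
```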

2008 ◽  
Vol 11 (1) ◽  
pp. 275-288 ◽  
Author(s):  
Maria Noel Rodríguez Ayán ◽  
Maria Teresa Coello García

University students' academic achievement, measured by means of academic progress, is modeled through linear and logistic regression, employing prior achievement and demographic factors as predictors. The main aim of the present paper is to compare the results yielded by both statistical procedures, in order to identify the more suitable approach in terms of goodness of fit and predictive power. Grades awarded in basic scientific courses and demographic variables were entered into the models at the first step. Two hypotheses are proposed: (a) grades in basic courses as well as demographic factors are directly related to academic progress, and (b) logistic regression is more appropriate than linear regression due to its higher predictive power. Results partially confirm the first prediction, as grades are positively related to progress; however, not all of the demographic factors considered proved to be good predictors. With regard to the second hypothesis, logistic regression was shown to be a better approach than linear regression, yielding estimates that were more stable in the presence of ill-fitting patterns.
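
A minimal sketch of this kind of comparison, assuming synthetic data, a binary progress indicator, and invented predictor names (a course grade and one demographic dummy); it contrasts the fit and predictive power of a Logit and an OLS model with statsmodels.

```python
# Sketch only: logistic vs. linear regression of a binary progress outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
grade = rng.normal(7.0, 1.5, n)           # illustrative grade in a basic course
female = rng.integers(0, 2, n)            # illustrative demographic dummy
logit_p = -6.0 + 0.9 * grade + 0.2 * female
progress = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # adequate progress: yes/no

X = sm.add_constant(np.column_stack([grade, female]))
logit_res = sm.Logit(progress, X).fit(disp=False)
ols_res = sm.OLS(progress, X).fit()

print("Logit pseudo-R^2:", round(logit_res.prsquared, 3))
print("OLS R^2:         ", round(ols_res.rsquared, 3))
print("Logit classification accuracy:",
      round(np.mean((logit_res.predict(X) > 0.5) == progress), 3))
```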


Author(s):  
Patricia Cerrito

Ultimately, a patient severity index is used to compare patient outcomes across healthcare providers. If the outcome is mortality, logistic regression is used; if the outcome is cost, length of stay, or some other measure of resource utilization, then linear regression is used. A provider is ranked based upon the differential between predicted outcome and actual outcome: the greater this differential, the higher the quality ranking. There are two ways to increase this differential. The first is to improve care, decreasing actual mortality or length of stay. The second is to improve coding, increasing predicted mortality or length of stay. In practice, it is cheaper to increase the predicted values than it is to decrease the actual values, and many providers take this approach.
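
To make the ranking mechanism concrete, here is a hypothetical sketch (all provider labels, variables, and data are invented): expected outcomes per provider come from a pooled logistic severity model, and providers are ranked by the predicted-minus-observed differential.

```python
# Hypothetical sketch of risk-adjusted provider ranking by (predicted - observed) mortality.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 3000
severity = rng.normal(0, 1, n)                      # coded patient severity
provider = rng.choice(["A", "B", "C"], n)
p_death = 1 / (1 + np.exp(-(-3.0 + 1.2 * severity)))
died = rng.binomial(1, p_death)

# Severity model fitted on the pooled data, mortality as the outcome
model = LogisticRegression().fit(severity.reshape(-1, 1), died)
expected = model.predict_proba(severity.reshape(-1, 1))[:, 1]

df = pd.DataFrame({"provider": provider, "died": died, "expected": expected})
summary = df.groupby("provider").agg(observed=("died", "mean"),
                                     predicted=("expected", "mean"))
# Larger (predicted - observed) differential => higher apparent quality ranking
summary["differential"] = summary["predicted"] - summary["observed"]
print(summary.sort_values("differential", ascending=False))
```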


2009 ◽  
Vol 48 (03) ◽  
pp. 306-310 ◽  
Author(s):  
C. E. Minder ◽  
G. Gillmann

Summary Objectives: This paper is concerned with checking goodness-of-fit of binary logistic regression models. For practitioners of data analysis, the broad classes of procedures for checking goodness-of-fit available in the literature are described. The challenges of model checking in the context of binary logistic regression are reviewed. As a viable solution, a simple graphical procedure for checking goodness-of-fit is proposed. Methods: The proposed graphical procedure relies on pieces of information available from any logistic analysis; the focus is on combining and presenting these in an informative way. Results: The information gained using this approach is presented with three examples. In the discussion, the proposed method is put into context and compared with other graphical procedures for checking goodness-of-fit of binary logistic models available in the literature. Conclusion: A simple graphical method can significantly improve the understanding of any logistic regression analysis and help to prevent faulty conclusions.
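
The abstract does not reproduce the proposed plot, so the following is only a generic graphical check in the same spirit, assuming synthetic data: predicted probabilities from a logistic fit are grouped into deciles and the observed event fractions are plotted against them (a calibration-style plot). It is not the authors' procedure.

```python
# Generic graphical goodness-of-fit check for a binary logistic model (decile calibration plot).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
x = rng.normal(0, 1, (2000, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x[:, 0])))

p_hat = LogisticRegression().fit(x, y).predict_proba(x)[:, 1]

# Group predicted probabilities into deciles and compare with observed event fractions
edges = np.quantile(p_hat, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, 9)
pred_mean = np.array([p_hat[idx == k].mean() for k in range(10)])
obs_mean = np.array([y[idx == k].mean() for k in range(10)])

plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
plt.plot(pred_mean, obs_mean, "o-", label="decile groups")
plt.xlabel("mean predicted probability")
plt.ylabel("observed event fraction")
plt.legend()
plt.show()
```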


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. 9062-9062
Author(s):  
Corey Carter ◽  
Yusuke Tomita ◽  
Akira Yuno ◽  
Jonathan Baker ◽  
Min-Jung Lee ◽  
...  

9062 Background: In a Phase 2 trial called QUADRUPLE THREAT (QT) (NCT02489903), in which 2nd-line+ small cell lung cancer (SCLC) patients were treated with RRx-001 and a platinum doublet, the programmed death-ligand 1 (PD-L1) status of circulating tumor cells (CTCs) in 14 patient samples was evaluated. Methods: 26 consented patients received weekly RRx-001 4 mg followed by a reintroduced platinum doublet. Epithelial cell adhesion molecule-positive (EpCAM+) CTCs from 10 ml of blood at two timepoints, cycle 1 day 1 and cycle 3 day 8 (cycle duration = 1 week), were detected by EpCAM-based immunomagnetic capture and flow cytometric analysis. CTCs were further characterized for protein expression of PD-L1. Tumor response was classified as partial or complete response based on the Response Evaluation Criteria in Solid Tumors (RECIST v1.1) measured every 6 weeks. Results: The analyzed clinical data set comprised 14 RECIST-evaluable patients. 50% were female (7/14) and the median age at baseline was 64.5 years (min = 48.5, max = 84.2, SD = 10.3). The McFadden goodness-of-fit score for the logistic model (range 0 to 1) was 0.477, indicating a strong fit. The logistic model analyzing the association between CTC PD-L1 expression at the two timepoints and response had approximately 92.8% accuracy in predicting clinical benefit (SD/PR/CR), where accuracy is defined as 1 - (false-positive rate + false-negative rate). The estimated ROC displayed in Figure 1 suggests an ROC AUC of 0.93 (95% CI: 0.78 to 0.99), indicating excellent discrimination. Conclusions: Reduction of PD-L1 expression was correlated with good clinical outcome after RRx-001 + platinum doublet treatment. The reduction in PD-L1 expression among patients with RECIST clinical benefit on RRx-001 was significant compared with non-responders with progressive disease (PD). In the ongoing SCLC Phase 3 study REPLATINUM (NCT03699956), analyses are planned to correlate response and survival with expression of CD47 and PD-L1 on CTCs. Clinical trial information: NCT02489903.
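
For readers unfamiliar with the reported statistics, the sketch below computes McFadden's pseudo-R², the accuracy definition quoted above, and the ROC AUC on synthetic stand-in data; none of the variables or numbers come from the trial.

```python
# Sketch of the reported statistics on synthetic stand-in data (not the trial data).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(4)
n = 60
pdl1_change = rng.normal(0, 1, n)                               # stand-in for CTC PD-L1 change
benefit = (pdl1_change + rng.normal(0, 0.8, n) < 0).astype(int) # 1 = clinical benefit (SD/PR/CR)

X = sm.add_constant(pdl1_change)
res = sm.Logit(benefit, X).fit(disp=False)
p_hat = res.predict(X)
pred = (p_hat > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(benefit, pred).ravel()
fpr, fnr = fp / (fp + tn), fn / (fn + tp)

print("McFadden pseudo-R^2:", round(res.prsquared, 3))
print("accuracy = 1 - (FPR + FNR):", round(1 - (fpr + fnr), 3))
print("ROC AUC:", round(roc_auc_score(benefit, p_hat), 3))
```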


Rainfall prediction plays a significant role in agriculture, so accurate prediction of rainfall is essential for the economic development of the nation. In this paper, we use linear regression to predict yearly rainfall in different states of India. To estimate yearly rainfall, a linear regression model is fitted to the data set and its coefficients are used to predict yearly rainfall from the corresponding parameter values, so an estimate of the rainfall for given values and places can be obtained easily. We demonstrate how to predict yearly rainfall in all states from 1901 to 2015 using simple multiple linear regression. We then split the data with train_test_split, train the model, analyze performance measures such as mean squared error, root mean squared error, and R^2, and visualize the data using scatter plots, box plots, and plots of expected versus predicted values.
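
A minimal sketch of the described workflow using scikit-learn, assuming a synthetic stand-in for the state-wise rainfall table (the actual Indian rainfall records from 1901 to 2015 are not reproduced here):

```python
# Sketch: multiple linear regression with train/test split and the listed error metrics.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
n = 400
# Synthetic stand-ins for seasonal predictors of annual rainfall (mm)
df = pd.DataFrame({
    "jun_sep": rng.normal(900, 150, n),
    "oct_dec": rng.normal(250, 60, n),
    "jan_feb": rng.normal(40, 15, n),
})
df["annual"] = df.sum(axis=1) + rng.normal(0, 30, n)

X_train, X_test, y_train, y_test = train_test_split(
    df[["jun_sep", "oct_dec", "jan_feb"]], df["annual"],
    test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print("MSE: ", round(mse, 1))
print("RMSE:", round(np.sqrt(mse), 1))
print("R^2: ", round(r2_score(y_test, pred), 3))
```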


2004 ◽  
Author(s):  
Paul Galambos

In this paper, measured pressure vs. flow and flow resistance are compared with theoretical predictions (Poiseuille flow) for surface-micromachined microfluidic channels with thin-film deformable covers. Three sets of data are compared. In data set 3 the channel is narrow, channel deflections are relatively small, and there is very little deviation from the theoretical flow resistance prediction. In data sets 1 and 2 there is significant deviation from the Poiseuille flow predictions, with flow resistances as little as half the predicted values and decreasing further with increasing pressure. Two hypotheses to explain this discrepancy are discussed: (1) channel cover deflection leading to deeper, lower-resistance flow channels, and (2) an observed two-phase (air/water) flow phenomenon leading to reduced effective wall friction in the channel.
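
For context, a short sketch of the rigid-wall (Poiseuille) flow-resistance estimate such measurements are compared against, using the wide, shallow rectangular-channel approximation R ≈ 12μL/(wh³); the dimensions and fluid properties below are illustrative guesses, not the paper's channels.

```python
# Hydraulic resistance of a wide, shallow rectangular channel (w >> h),
# lubrication approximation: R = 12 * mu * L / (w * h**3). Values are illustrative only.
mu = 1.0e-3        # water viscosity, Pa*s
L = 5.0e-3         # channel length, m
w = 40.0e-6        # channel width, m
h = 5.0e-6         # channel depth, m

R = 12 * mu * L / (w * h**3)           # flow resistance, Pa*s/m^3
dP = 50e3                              # applied pressure drop, Pa
Q = dP / R                             # predicted volumetric flow, m^3/s

print(f"flow resistance R = {R:.3e} Pa*s/m^3")
print(f"predicted flow at {dP/1e3:.0f} kPa: {Q*1e12:.2f} nL/s")
```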


2016 ◽  
Vol 72 (6) ◽  
pp. 696-703 ◽  
Author(s):  
Julian Henn

An alternative measure to the goodness of fit (GoF) is developed and applied to experimental data. The alternative goodness of fit squared (aGoFs) demonstrates that the GoF regularly fails to provide evidence for the presence of systematic errors, because certain requirements are not met. These requirements are briefly discussed. It is shown that in many experimental data sets a correlation between the squared residuals and the variance of the observed intensities exists. These correlations corrupt the GoF and lead to artificially reduced values of the GoF and of the numerical value of wR(F²). Remaining systematic errors in the data sets are veiled by this mechanism. In data sets where these correlations do not appear for the entire data set, they often appear for the decile with the largest variances of the observed intensities. Additionally, statistical errors for the squared goodness of fit, GoFs, and for the aGoFs are developed and applied to experimental data. This measure shows how significantly the GoFs and aGoFs deviate from the ideal value of one.
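
The aGoFs itself is not defined in the abstract, so the sketch below only reproduces two standard ingredients it builds on, using synthetic data: the conventional goodness of fit, GoF = sqrt(Σ w (F_o² − F_c²)² / (n − p)), and the correlation between squared residuals and the variances of the observed intensities that the paper identifies as the corrupting mechanism.

```python
# Sketch: conventional crystallographic GoF and the residual-variance correlation check.
import numpy as np

rng = np.random.default_rng(6)
n, p = 5000, 100                        # reflections, refined parameters
F2_calc = rng.uniform(10, 1000, n)      # model intensities (arbitrary units)
sigma2 = 0.05 * F2_calc + 5.0           # variances of observed intensities
F2_obs = F2_calc + rng.normal(0, np.sqrt(sigma2))

w = 1.0 / sigma2                        # statistical weights
resid2 = (F2_obs - F2_calc) ** 2        # squared residuals

GoF = np.sqrt(np.sum(w * resid2) / (n - p))
corr = np.corrcoef(resid2, sigma2)[0, 1]

print("GoF (expected ~1 for a correct model and weights):", round(GoF, 3))
print("correlation of squared residuals with sigma^2:", round(corr, 3))
```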


2019 ◽  
Vol 115 (3/4) ◽  
Author(s):  
Douw G. Breed ◽  
Tanja Verster

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation, and examples of improved performance of predictive models exist for both approaches. However, both focus on a single aspect, either target separation or independent variable distribution, and combining them may deliver better results. This combination approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation-set Gini), but other techniques performed better on the remaining four data sets. Significance: We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression. In all comparisons, using semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) compared to using unsegmented logistic regression.
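
As a rough illustration of the evaluation metric only, the sketch below compares an unsegmented logistic regression with a crude two-segment alternative (a stand-in for segmentation in general, not the paper's semi-supervised techniques) using validation-set Gini = 2·AUC − 1 on invented data.

```python
# Sketch: validation-set Gini for unsegmented vs. naively segmented logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 4000
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
# Outcome whose relationship to x2 flips sign between segments of x1
logit = np.where(x1 > 0, 1.5 * x2, -1.5 * x2)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([x1, x2])

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Unsegmented model
p_unseg = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

# Naive two-segment model: separate logistic regressions per segment of x1
p_seg = np.empty(len(y_va))
for sign in (True, False):
    tr_m, va_m = (X_tr[:, 0] > 0) == sign, (X_va[:, 0] > 0) == sign
    seg_model = LogisticRegression().fit(X_tr[tr_m], y_tr[tr_m])
    p_seg[va_m] = seg_model.predict_proba(X_va[va_m])[:, 1]

gini = lambda y_true, p: 2 * roc_auc_score(y_true, p) - 1
print("Gini, unsegmented:", round(gini(y_va, p_unseg), 3))
print("Gini, segmented:  ", round(gini(y_va, p_seg), 3))
```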


Author(s):  
J. DIEBOLT ◽  
M.-A. EL-AROUI ◽  
V. DURBEC ◽  
B. VILLAIN

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimates. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQ's) in a nonparametric spirit, e.g., Pickands' excess method, methods based on Hill's estimate of the Pareto index, and the exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique for assessing and comparing these MEEQ's when they are to be used on a given data set is available. This paper is a first attempt to provide such techniques. We first compare the estimates given by the main MEEQ's on several simulated data sets. Then we suggest goodness-of-fit (GoF) tests to assess the MEEQ's by measuring the quality of their underlying approximations. It is shown that GoF techniques provide very relevant tools for assessing and comparing the ET and excess methods. Other empirical criteria for comparing MEEQ's are also proposed and studied through Monte Carlo analyses. Finally, these assessment and comparison techniques are applied to real data sets from an industrial context where extreme quantiles are needed to define maintenance policies.
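
To make one of the MEEQ's concrete, here is a hedged sketch of the exponential tail (ET) idea under common peaks-over-threshold assumptions: excesses over a high threshold are treated as exponential, the extreme quantile follows from the fitted scale, and a Kolmogorov-Smirnov test serves as a crude goodness-of-fit check of that approximation. The threshold choice and data are illustrative only.

```python
# Sketch: exponential-tail extreme quantile estimate plus a crude GoF check of the excesses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # heavy-ish tailed sample

p = 0.999                                 # target quantile level
u = np.quantile(x, 0.95)                  # high threshold (illustrative choice)
excess = x[x > u] - u
k, n = excess.size, x.size

scale = excess.mean()                     # MLE of the exponential scale
# Tail approximation P(X > x) ~ (k/n) * exp(-(x - u)/scale); solve for the p-quantile
q_et = u + scale * np.log((k / n) / (1 - p))

# Goodness-of-fit of the exponential approximation to the excesses
ks = stats.kstest(excess, "expon", args=(0, scale))

print(f"ET estimate of the {p:.1%} quantile: {q_et:.2f}")
print(f"empirical {p:.1%} quantile:         {np.quantile(x, p):.2f}")
print(f"KS test of exponential excesses: D = {ks.statistic:.3f}, p = {ks.pvalue:.3f}")
```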


Author(s):  
Jacob Nelson ◽  
G. Austin Marrs ◽  
Greg Schmidt ◽  
Joseph A. Donndelinger ◽  
Robert L. Nagel

The desire to use ever-growing qualitative data sets of user-generated content in the engineering design process in a computationally effective manner makes it increasingly necessary to draw representative samples. This work investigated the ability of alternative sampling algorithms to draw samples that conform to the characteristics of the original data set. The sampling methods investigated included random sampling, interval sampling, fixed-increment (or systematic) sampling, and stratified sampling. Data collected through the Vehicle Owner's Questionnaire, a survey administered by the U.S. National Highway Traffic Safety Administration, is used as a case study throughout this paper. The paper demonstrates that existing statistical methods may be used to evaluate goodness of fit for samples drawn from large bodies of qualitative data. Evaluation of goodness of fit not only provides confidence that a sample is representative of the data set from which it is drawn, but also yields valuable real-time feedback during the sampling process. This investigation revealed two interesting and counterintuitive trends in sampling algorithm performance. The first is that larger sample sizes do not necessarily lead to improved goodness of fit. The second is that, depending on the details of implementation, data cleansing may degrade the performance of data sampling algorithms rather than improving it. This work illustrates the importance of aligning sampling procedures to data structures and of validating the conformance of samples to the characteristics of the larger data set, to avoid drawing erroneous conclusions based on unexpectedly biased samples of data.
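
A small sketch of the kind of conformance check described, assuming an invented categorical attribute rather than the actual VOQ data: samples drawn by simple random and by fixed-increment (systematic) selection are compared with the full data set using a chi-square goodness-of-fit test.

```python
# Sketch: checking that a sample's category distribution conforms to the full data set.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(9)
N, n = 20000, 500
categories = np.array(["engine", "brakes", "electrical", "steering"])  # invented labels
population = rng.choice(categories, size=N, p=[0.4, 0.3, 0.2, 0.1])

def conformance(sample, population, categories):
    """Chi-square GoF of a sample's category counts against population proportions."""
    obs = np.array([(sample == c).sum() for c in categories])
    exp = np.array([(population == c).mean() for c in categories]) * len(sample)
    return chisquare(obs, exp)

random_sample = rng.choice(population, size=n, replace=False)
systematic_sample = population[::N // n][:n]          # fixed-increment sampling

for name, s in [("random", random_sample), ("systematic", systematic_sample)]:
    res = conformance(s, population, categories)
    print(f"{name:>10}: chi2 = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```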

