Categorical missing data imputation for software cost estimation by multinomial logistic regression

2006 ◽  
Vol 79 (3) ◽  
pp. 404-414 ◽  
Author(s):  
Panagiotis Sentas ◽  
Lefteris Angelis
Author(s):  
Panagiota Chatzipetrou

Software cost estimation (SCE) is a critical phase in software development projects. A common problem in building software cost models is that the available datasets contain projects with lots of missing categorical data. There are several techniques for handling missing data in the context of SCE. The purpose of this article is to show a state-of-art statistical and visualization approach of evaluating and comparing the effect of missing data on the accuracy of cost estimation models. Five missing data techniques were used: multinomial logistic regression, listwise deletion, mean imputation, expectation maximization and regression imputation; and compared with respect to their effect on the prediction accuracy of a least squares regression cost model. The evaluation is based on various expressions of the prediction error. The comparisons are conducted using statistical tests, resampling techniques and visualization tools like the regression error characteristic curves.


2018 ◽  
pp. 345-372
Author(s):  
Lefteris Angelis ◽  
Nikolaos Mittas ◽  
Panagiota Chatzipetrou

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain lots of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide visual comparison of different prediction models, and Regression Error Operating Curves, which examine predictive power of models with respect to under- or over-estimation.


Author(s):  
Lefteris Angelis ◽  
Nikolaos Mittas ◽  
Panagiota Chatzipetrou

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain lots of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide visual comparison of different prediction models, and Regression Error Operating Curves, which examine predictive power of models with respect to under- or over-estimation.


BMJ Open ◽  
2020 ◽  
Vol 10 (12) ◽  
pp. e041421
Author(s):  
Theodore J Iwashyna ◽  
Cheng Ma ◽  
Xiao Qing Wang ◽  
Sarah Seelye ◽  
Ji Zhu ◽  
...  

ObjectiveThere has been a proliferation of approaches to statistical methods and missing data imputation as electronic health records become more plentiful; however, the relative performance on real-world problems is unclear.Materials and methodsUsing 355 823 intensive care unit (ICU) hospitalisations at over 100 hospitals in the nationwide Veterans Health Administration system (2014–2017), we systematically varied three approaches: how we extracted and cleaned physiologic variables; how we handled missing data (using mean value imputation, random forest, extremely randomised trees (extra-trees regression), ridge regression, normal value imputation and case-wise deletion) and how we computed risk (using logistic regression, random forest and neural networks). We applied these approaches in a 70% development sample and tested the results in an independent 30% testing sample. Area under the receiver operating characteristic curve (AUROC) was used to quantify model discrimination.ResultsIn 355 823 ICU stays, there were 34 867 deaths (9.8%) within 30 days of admission. The highest AUROCs obtained for each primary classification method were very similar: 0.83 (95% CI 0.83 to 0.83) to 0.85 (95% CI 0.84 to 0.85). Likewise, there was relatively little variation within classification method by the missing value imputation method used—except when casewise deletion was applied for missing data.ConclusionVariation in discrimination was seen as a function of data cleanliness, with logistic regression suffering the most loss of discrimination in the least clean data. Losses in discrimination were not present in random forest and neural networks even in naively extracted data. Data from a large nationwide health system revealed interactions between missing data imputation techniques, data cleanliness and classification methods for predicting 30-day mortality.


Author(s):  
Lefteris Angelis ◽  
Panagiotis Sentas ◽  
Nikolaos Mittas ◽  
Panagiota Chatzipetrou

Software Cost Estimation is a critical phase in the development of a software project, and over the years has become an emerging research area. A common problem in building software cost models is that the available datasets contain projects with lots of missing categorical data. The purpose of this chapter is to show how a combination of modern statistical and computational techniques can be used to compare the effect of missing data techniques on the accuracy of cost estimation. Specifically, a recently proposed missing data technique, the multinomial logistic regression, is evaluated and compared with four older methods: listwise deletion, mean imputation, expectation maximization and regression imputation with respect to their effect on the prediction accuracy of a least squares regression cost model. The evaluation is based on various expressions of the prediction error and the comparisons are conducted using statistical tests, resampling techniques and a visualization tool, the regression error characteristic curves.


Author(s):  
Lefteris Angelis ◽  
Nikolaos Mittas ◽  
Panagiota Chatzipetrou

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain lots of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide visual comparison of different prediction models, and Regression Error Operating Curves, which examine predictive power of models with respect to under- or over-estimation.


Sign in / Sign up

Export Citation Format

Share Document