Categorical missing data imputation for software cost estimation by multinomial logistic regression

Software cost estimation (SCE) is a critical phase in software development projects. A common problem in building software cost models is that the available datasets contain projects with many missing categorical values. There are several techniques for handling missing data in the context of SCE. The purpose of this article is to show a state-of-the-art statistical and visualization approach to evaluating and comparing the effect of missing data on the accuracy of cost estimation models. Five missing data techniques were used: multinomial logistic regression, listwise deletion, mean imputation, expectation maximization and regression imputation. They were compared with respect to their effect on the prediction accuracy of a least squares regression cost model. The evaluation is based on various expressions of the prediction error. The comparisons are conducted using statistical tests, resampling techniques and visualization tools such as regression error characteristic curves.
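To make the idea of a regression error characteristic (REC) curve concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common definition in which the curve plots, for each error tolerance, the fraction of projects whose absolute prediction error does not exceed that tolerance. The function name `rec_curve` and all effort values are hypothetical and chosen only for illustration.

```python
def rec_curve(actual, predicted, tolerances):
    """Return (tolerance, accuracy) pairs for a REC curve.

    Accuracy at a tolerance is the fraction of cases whose absolute
    prediction error is less than or equal to that tolerance.
    """
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(errors)
    return [(t, sum(e <= t for e in errors) / n) for t in tolerances]


# Hypothetical effort values (person-hours) and predictions from two
# cost models, e.g. models built after two different imputation methods.
actual = [100.0, 250.0, 400.0, 800.0]
model_a = [110.0, 240.0, 380.0, 700.0]
model_b = [150.0, 300.0, 300.0, 500.0]
tolerances = [0, 10, 20, 50, 100, 300]

curve_a = rec_curve(actual, model_a, tolerances)
curve_b = rec_curve(actual, model_b, tolerances)

for (t, acc_a), (_, acc_b) in zip(curve_a, curve_b):
    print(f"tolerance={t:>3}  model A={acc_a:.2f}  model B={acc_b:.2f}")
```

A curve that lies above another at every tolerance, as model A does over model B in this toy data, indicates smaller errors across the board, which is what makes the plot useful for visual comparison.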


A Framework of Statistical and Visualization Techniques for Missing Data Analysis in Software Cost Estimation

Intelligent Systems ◽

10.4018/978-1-5225-5643-5.ch014 ◽

2018 ◽

pp. 345-372

Author(s):

Lefteris Angelis ◽

Nikolaos Mittas ◽

Panagiota Chatzipetrou

Keyword(s):

Missing Data ◽

Cost Estimation ◽

Prediction Models ◽

Software Cost Estimation ◽

Software Cost ◽

Regression Imputation ◽

Mean Imputation ◽

Regression Error ◽

Visualization Techniques ◽

Missing Data Techniques

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain a large amount of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide a visual comparison of different prediction models, and Regression Error Operating Curves, which examine the predictive power of models with respect to under- or over-estimation.
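Two of the simpler techniques named above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the chapter's code: missing values are represented as `None`, and for a categorical variable the most frequent observed category is used as the analogue of the mean. The column names and project records are hypothetical.

```python
from collections import Counter


def listwise_deletion(rows):
    """Keep only the rows that have no missing (None) value in any column."""
    return [dict(r) for r in rows if None not in r.values()]


def mode_imputation(rows, column):
    """Fill missing values of a categorical column with its most frequent
    observed category; returns new rows and leaves the input untouched."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{column: mode}) if r[column] is None else dict(r)
            for r in rows]


# Hypothetical project records with one missing category.
rows = [
    {"language": "Java", "effort": 1200},
    {"language": None, "effort": 900},
    {"language": "C", "effort": 400},
    {"language": "Java", "effort": 1500},
]

complete = listwise_deletion(rows)            # drops the incomplete project
imputed = mode_imputation(rows, "language")   # keeps it, filled with "Java"

print(len(complete), len(imputed), imputed[1]["language"])
```

The trade-off is visible even in this toy case: deletion shrinks the dataset available to the cost model, while imputation keeps every project at the price of an assumed category.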


A Framework of Statistical and Visualization Techniques for Missing Data Analysis in Software Cost Estimation

Handbook of Research on Innovations in Systems and Software Engineering - Advances in Systems Analysis, Software Engineering, and High Performance Computing ◽

10.4018/978-1-4666-6359-6.ch003 ◽

2015 ◽

pp. 71-97

Author(s):

Lefteris Angelis ◽

Nikolaos Mittas ◽

Panagiota Chatzipetrou

Keyword(s):

Missing Data ◽

Cost Estimation ◽

Prediction Models ◽

Software Cost Estimation ◽

Software Cost ◽

Regression Imputation ◽

Mean Imputation ◽

Regression Error ◽

Visualization Techniques ◽

Missing Data Techniques

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain a large amount of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide a visual comparison of different prediction models, and Regression Error Operating Curves, which examine the predictive power of models with respect to under- or over-estimation.


Variation in model performance by data cleanliness and classification methods in the prediction of 30-day ICU mortality, a US nationwide retrospective cohort and simulation study

BMJ Open ◽

10.1136/bmjopen-2020-041421 ◽

2020 ◽

Vol 10 (12) ◽

pp. e041421

Author(s):

Theodore J Iwashyna ◽

Cheng Ma ◽

Xiao Qing Wang ◽

Sarah Seelye ◽

Ji Zhu ◽

...

Keyword(s):

Neural Networks ◽

Logistic Regression ◽

Missing Data ◽

Random Forest ◽

Characteristic Curve ◽

Classification Method ◽

Data Imputation ◽

Classification Methods ◽

Health Administration ◽

Missing Data Imputation

Objective: There has been a proliferation of approaches to statistical methods and missing data imputation as electronic health records become more plentiful; however, the relative performance on real-world problems is unclear. Materials and methods: Using 355 823 intensive care unit (ICU) hospitalisations at over 100 hospitals in the nationwide Veterans Health Administration system (2014–2017), we systematically varied three approaches: how we extracted and cleaned physiologic variables; how we handled missing data (using mean value imputation, random forest, extremely randomised trees (extra-trees regression), ridge regression, normal value imputation and case-wise deletion) and how we computed risk (using logistic regression, random forest and neural networks). We applied these approaches in a 70% development sample and tested the results in an independent 30% testing sample. Area under the receiver operating characteristic curve (AUROC) was used to quantify model discrimination. Results: In 355 823 ICU stays, there were 34 867 deaths (9.8%) within 30 days of admission. The highest AUROCs obtained for each primary classification method were very similar: 0.83 (95% CI 0.83 to 0.83) to 0.85 (95% CI 0.84 to 0.85). Likewise, there was relatively little variation within classification method by the missing value imputation method used, except when casewise deletion was applied for missing data. Conclusion: Variation in discrimination was seen as a function of data cleanliness, with logistic regression suffering the most loss of discrimination in the least clean data. Losses in discrimination were not present in random forest and neural networks even in naively extracted data. Data from a large nationwide health system revealed interactions between missing data imputation techniques, data cleanliness and classification methods for predicting 30-day mortality.
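For readers unfamiliar with the discrimination measure used here, AUROC can be read as the probability that a randomly chosen positive case (a death) receives a higher predicted risk than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that pairwise definition. It is an illustration only, not the study's code, and the labels and risk scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class, 0 for the negative class.
    Counts 1 for each (positive, negative) pair ranked correctly and
    0.5 for each tied pair, then divides by the number of pairs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = died within 30 days) and predicted risks.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]

print(round(auroc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of cases, so a rank-based formulation would be preferable for a cohort of the size described above; the loop is kept here because it mirrors the definition.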


Methods for Statistical and Visual Comparison of Imputation Methods for Missing Data in Software Cost Estimation

Modern Software Engineering Concepts and Practices - Advances in Computer and Electrical Engineering ◽

10.4018/978-1-60960-215-4.ch009 ◽

2011 ◽

pp. 221-241

Author(s):

Lefteris Angelis ◽

Panagiotis Sentas ◽

Nikolaos Mittas ◽

Panagiota Chatzipetrou

Keyword(s):

Missing Data ◽

Cost Estimation ◽

Cost Model ◽

Statistical Tests ◽

Computational Techniques ◽

Software Project ◽

Software Cost Estimation ◽

Software Cost ◽

Regression Imputation ◽

Mean Imputation

Software Cost Estimation is a critical phase in the development of a software project, and over the years has become an emerging research area. A common problem in building software cost models is that the available datasets contain projects with a large amount of missing categorical data. The purpose of this chapter is to show how a combination of modern statistical and computational techniques can be used to compare the effect of missing data techniques on the accuracy of cost estimation. Specifically, a recently proposed missing data technique, multinomial logistic regression, is evaluated and compared with four older methods (listwise deletion, mean imputation, expectation maximization and regression imputation) with respect to their effect on the prediction accuracy of a least squares regression cost model. The evaluation is based on various expressions of the prediction error, and the comparisons are conducted using statistical tests, resampling techniques and a visualization tool, the regression error characteristic curve.
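The general idea behind imputation by multinomial logistic regression can be sketched as follows: fit a multinomial (softmax) model on the projects where the categorical variable is observed, then predict the most probable category for the projects where it is missing. The abstract does not give the chapter's exact procedure, so everything below is an assumption made for illustration: the plain gradient descent fit, the choice of predictors, the learning rate and epoch count, and the toy data (predictors scaled to the range 0 to 1, hypothetical platform codes).

```python
import math


def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, x):
    """Class probabilities for predictors x; W holds one weight vector
    per class, with the bias term first."""
    xb = [1.0] + list(x)
    return softmax([sum(w * v for w, v in zip(Wk, xb)) for Wk in W])


def fit_multinomial_logit(X, y, classes, lr=0.5, epochs=500):
    """Fit softmax regression by batch gradient descent on the
    average cross-entropy loss, starting from zero weights."""
    n_features = len(X[0]) + 1
    W = [[0.0] * n_features for _ in classes]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in classes]
        for x, label in zip(X, y):
            xb = [1.0] + list(x)
            probs = class_probs(W, x)
            for k, c in enumerate(classes):
                err = probs[k] - (1.0 if c == label else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(len(classes)):
            for j in range(n_features):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W


def impute_categorical(rows, target, predictors):
    """Fill missing (None) values of `target` with the most probable
    category predicted from `predictors`; the input is not modified."""
    complete = [r for r in rows if r[target] is not None]
    classes = sorted({r[target] for r in complete})
    X = [[r[p] for p in predictors] for r in complete]
    y = [r[target] for r in complete]
    W = fit_multinomial_logit(X, y, classes)

    def predict(r):
        probs = class_probs(W, [r[p] for p in predictors])
        return classes[probs.index(max(probs))]

    return [dict(r, **{target: predict(r)}) if r[target] is None else dict(r)
            for r in rows]


# Hypothetical projects: scaled size and team size, platform sometimes missing.
rows = [
    {"size": 0.1, "team": 0.1, "platform": "PC"},
    {"size": 0.2, "team": 0.1, "platform": "PC"},
    {"size": 0.9, "team": 0.1, "platform": "MR"},
    {"size": 1.0, "team": 0.2, "platform": "MR"},
    {"size": 0.1, "team": 0.9, "platform": "MF"},
    {"size": 0.2, "team": 1.0, "platform": "MF"},
    {"size": 0.95, "team": 0.15, "platform": None},
    {"size": 0.15, "team": 0.95, "platform": None},
]

imputed = impute_categorical(rows, "platform", ["size", "team"])
print([r["platform"] for r in imputed[6:]])
```

Unlike replacing every gap with the most frequent category, this approach lets the imputed value depend on the other attributes of each project. In practice a tested library implementation of multinomial logistic regression would normally replace the hand-written fit shown here.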


A Framework of Statistical and Visualization Techniques for Missing Data Analysis in Software Cost Estimation

Computer Systems and Software Engineering ◽

10.4018/978-1-5225-3923-0.ch017 ◽

2017 ◽

pp. 433-460

Author(s):

Lefteris Angelis ◽

Nikolaos Mittas ◽

Panagiota Chatzipetrou

Keyword(s):

Missing Data ◽

Cost Estimation ◽

Prediction Models ◽

Software Cost Estimation ◽

Software Cost ◽

Regression Imputation ◽

Mean Imputation ◽

Regression Error ◽

Visualization Techniques ◽

Missing Data Techniques

Software Cost Estimation (SCE) is a critical phase in software development projects. However, due to the growing complexity of the software itself, a common problem in building software cost models is that the available datasets contain a large amount of missing categorical data. The purpose of this chapter is to show how a framework of statistical, computational, and visualization techniques can be used to evaluate and compare the effect of missing data techniques on the accuracy of cost estimation models. Hence, the authors use five missing data techniques: Multinomial Logistic Regression, Listwise Deletion, Mean Imputation, Expectation Maximization, and Regression Imputation. The evaluation and the comparisons are conducted using Regression Error Characteristic curves, which provide a visual comparison of different prediction models, and Regression Error Operating Curves, which examine the predictive power of models with respect to under- or over-estimation.
