A new imputation method for small software project data sets

2007 ◽  
Vol 80 (1) ◽  
pp. 51-62 ◽  
Author(s):  
Qinbao Song ◽  
Martin Shepperd

2018 ◽  
Vol 232 ◽  
pp. 03017
Author(s):  
Jie Zhang ◽  
Gang Wang ◽  
Haobo Jiang ◽  
Fangzheng Zhao ◽  
Guilin Tian

Software defect prediction has been an important part of software engineering research since the 1970s. The technique uses the measurement and defect information of historical software modules to predict defects in new software modules. Currently, most software defect prediction models are built on data from a single software project: the training sets used to construct the model and the test sets used to validate it come from the same project. In practice, however, for projects with little historical data, or for entirely new projects, traditional prediction methods show poor forecasting performance. When historical data are insufficient, a traditional defect prediction model cannot be trained adequately, and high prediction accuracy is difficult to achieve. In cross-project prediction, the problem we face instead is the difference in data distributions between projects. To address these problems, this paper presents a software defect prediction model that combines transfer learning with a traditional defect prediction model, using existing project data sets to predict software defects across projects. The main work of this article includes: 1) Data preprocessing. This part includes data feature correlation analysis, noise reduction and so on, which mitigates the interference of over-fitting and noisy data with the prediction results. 2) Transfer learning. This part analyzes two different but related project data sets and reduces the impact of their data distribution differences. 3) Artificial neural networks. To address the class imbalance in the data sets, an artificial neural network with dynamic selection of training samples is used to reduce the influence of the skewed ratio of positive and negative samples on the prediction results.
The Relink and AEEEM project data sets are used to evaluate performance in terms of the F-measure, ROC curves, and AUC. Experiments show that the model achieves high predictive performance.
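The abstract evaluates the cross-project model with the F-measure and AUC. As a minimal, library-free sketch (not the paper's actual implementation), the two metrics can be computed from true labels, hard predictions, and predicted scores as follows:

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the defective (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the fraction of
    (positive, negative) pairs where the positive example scores higher,
    counting ties as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a defect-prediction setting, `y_true` would mark which modules turned out defective and `scores` would be the neural network's predicted defect probabilities.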


Author(s):  
SUMANTH YENDURI ◽  
S. S. IYENGAR

In this study, we compare the performance of four different imputation strategies, ranging from the commonly used listwise deletion to model-based approaches such as maximum likelihood, in enhancing completeness in incomplete software project data sets. We evaluate the impact of each of these methods by implementing them on six different real-time software project data sets, which are classified into categories based on their inherent properties. The reliability of the data sets constructed using these techniques is further tested by building prediction models using stepwise regression. The experimental results are noted and the findings are discussed.
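Listwise deletion is the only strategy the abstract names explicitly besides maximum likelihood. As an illustration of how deletion-based and substitution-based strategies differ, here is a small sketch contrasting listwise deletion with simple mean imputation; mean imputation is chosen only as a familiar baseline, not necessarily one of the four strategies actually studied:

```python
def listwise_delete(rows):
    """Keep only the rows that have no missing (None) values.
    Complete but potentially much smaller data set."""
    return [r for r in rows if None not in r]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.
    Keeps every row, at the cost of shrinking column variance."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]
```

On a data set with scattered missing values, listwise deletion can discard most rows, which is one reason the study compares it against model-based alternatives.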


Neurosurgery ◽  
2011 ◽  
Vol 68 (2) ◽  
pp. 496-505 ◽  
Author(s):  
Alexandra J. Golby ◽  
Gordon Kindlmann ◽  
Isaiah Norton ◽  
Alexander Yarmarkovich ◽  
Steven Pieper ◽  
...  

Abstract BACKGROUND: Diffusion tensor imaging (DTI) infers the trajectory and location of large white matter tracts by measuring the anisotropic diffusion of water. DTI data may then be analyzed and presented as tractography for visualization of the tracts in 3 dimensions. Despite the important information contained in tractography images, their usefulness for neurosurgical planning has been limited by the inability to identify the critical structures within the mass of demonstrated fibers and to clarify their relationship to the tumor. OBJECTIVE: To develop a method that allows interactive querying of tractography data sets for surgical planning and to provide a working software package for the research community. METHODS: The tool was implemented within an open source software project. Echo-planar DTI at 3 T was performed on 5 patients, followed by tensor calculation. Software was developed that allowed the placement of a dynamic seed point for local selection of fibers and for fiber display around a segmented structure, both with tunable parameters. A neurosurgeon was trained in the use of the software in < 1 hour and used it to review cases. RESULTS: Tracts near the tumor and critical structures were interactively visualized in 3 dimensions to determine their spatial relationships to the lesion. Tracts were selected using 3 methods: anatomical and functional magnetic resonance imaging-defined regions of interest, distance from the segmented tumor volume, and dynamic seed-point spheres. CONCLUSION: Interactive tractography successfully enabled inspection of white matter structures in proximity to lesions, critical structures, and functional cortical areas, allowing the surgeon to explore the relationships between them.
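The dynamic seed-point query described in METHODS amounts to keeping every fiber that passes within a tunable radius of a movable 3D point. A minimal geometric sketch of that selection step, with fibers represented as lists of 3D points (the actual tool operates on full tractography data inside an open source package, not on toy lists like these), might look like:

```python
def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_fibers(fibers, seed, radius):
    """Keep every fiber with at least one point inside the seed sphere.
    `seed` is the movable query point; `radius` is the tunable parameter."""
    r2 = radius ** 2  # compare squared distances to avoid square roots
    return [f for f in fibers if any(dist2(p, seed) <= r2 for p in f)]
```

Because the selection re-runs cheaply as the seed moves, the surgeon can drag the sphere through the volume and watch the displayed fiber subset update interactively.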


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, selected multiple imputation methods were considered, with special attention paid to using principal component analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputation compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI Machine Learning Repository. Missing values were then imputed using MICE, missForest and the PCA‑based method (MIPCA), and the normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as the multiple imputation method providing the lowest imputation error rates for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
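NRMSE definitions vary; one common convention (used, for example, by missForest) normalises the RMSE over the imputed entries by the standard deviation of the corresponding true values. Assuming that convention, which is an assumption here since the paper's exact normalisation is not reproduced in this abstract, a sketch for a single flattened variable:

```python
import math


def nrmse(imputed, truth, missing_mask):
    """Normalised RMSE over the originally-missing entries only.
    missing_mask[i] is True where the value was missing and later imputed.
    Normalisation by the standard deviation of the true values is an
    assumed convention (as in missForest), not taken from the paper."""
    errs = [(a - t) ** 2
            for a, t, m in zip(imputed, truth, missing_mask) if m]
    rmse = math.sqrt(sum(errs) / len(errs))
    true_vals = [t for t, m in zip(truth, missing_mask) if m]
    mean = sum(true_vals) / len(true_vals)
    var = sum((t - mean) ** 2 for t in true_vals) / len(true_vals)
    return rmse / math.sqrt(var)
```

A value near 0 means the imputations closely recover the hidden truth; values near 1 mean the method does no better than guessing the mean, which is how the study can rank MICE, missForest and MIPCA on a common scale.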


Author(s):  
George Chatzikonstantinou ◽  
Kostas Kontogiannis ◽  
Ioanna-Maria Attarian
