Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

2008 ◽  
Vol 81 (12) ◽  
pp. 2361-2370 ◽  
Author(s):  
Qinbao Song ◽  
Martin Shepperd ◽  
Xiangru Chen ◽  
Jun Liu

2018 ◽
Vol 232 ◽  
pp. 03017
Author(s):  
Jie Zhang ◽  
Gang Wang ◽  
Haobo Jiang ◽  
Fangzheng Zhao ◽  
Guilin Tian

Software defect prediction has been an important part of software engineering research since the 1970s. The technique uses the measurement and defect information of historical software modules to predict defects in new software modules. Currently, most software defect prediction models are built on data from a single software project: the training sets used to construct the model and the test sets used to validate it come from the same project. In practice, however, traditional prediction methods perform poorly for projects with little historical data or for entirely new projects: when historical data are insufficient, the defect prediction model cannot be adequately trained, and high prediction accuracy is difficult to achieve. Cross-project prediction, in turn, faces the problem of differences in data distribution between projects. To address these problems, this paper presents a software defect prediction model that combines transfer learning with a traditional software defect prediction model and uses existing project data sets to predict software defects across projects. The main work of this article includes: 1) Data preprocessing, covering feature correlation analysis, noise reduction, and related steps, which mitigates the interference of over-fitting and noisy data with the prediction results. 2) Transfer learning, which analyzes two different but related project data sets and reduces the impact of their differing data distributions. 3) Artificial neural networks: to address the class imbalance in the data sets, an artificial neural network with dynamic selection of training samples is used to reduce the influence of the imbalance between positive and negative samples on the prediction results. The Relink and AEEEM data sets are used to evaluate performance via the F-measure, the ROC curve, and AUC. Experiments show that the model has high predictive performance.
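As a rough illustration of the pipeline summarized above, the following sketch trains a small neural network on one project and evaluates it on a differently distributed one. It assumes synthetic stand-ins for the Relink and AEEEM data and uses simple per-project standardization in place of the paper's transfer-learning step, whose details the abstract does not give; all names are illustrative.

```python
# Hypothetical sketch of the cross-project pipeline described above.
# Per-project standardization stands in for the paper's transfer-learning
# component; the dynamic sample-selection step is omitted for brevity.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for a source project (e.g., Relink) and a
# differently distributed target project (e.g., AEEEM).
X_src = rng.normal(0.0, 1.0, (500, 20))
y_src = (X_src[:, 0] + 0.5 * X_src[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)
X_tgt = rng.normal(0.5, 1.5, (300, 20))          # shifted distribution
y_tgt = (X_tgt[:, 0] + 0.5 * X_tgt[:, 1] + rng.normal(0, 0.5, 300) > 0.5).astype(int)

# 1) Preprocessing: keep only features with some correlation to the label
#    in the source project (a toy stand-in for feature correlation analysis).
keep = np.abs(np.corrcoef(X_src.T, y_src)[-1, :-1]) > 0.01
X_src, X_tgt = X_src[:, keep], X_tgt[:, keep]

# 2) "Transfer" step: project both data sets onto a common scale so the
#    classifier is less sensitive to the distribution shift.
X_src = StandardScaler().fit_transform(X_src)
X_tgt = StandardScaler().fit_transform(X_tgt)

# 3) Neural network trained on the source project only.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(X_src, y_src)

# Evaluation on the target project: F-measure and AUC, as in the abstract.
proba = clf.predict_proba(X_tgt)[:, 1]
print("F-measure:", f1_score(y_tgt, (proba > 0.5).astype(int)))
print("AUC:      ", roc_auc_score(y_tgt, proba))
```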


Author(s):  
SUMANTH YENDURI ◽  
S. S. IYENGAR

In this study, we compare the performance of four different imputation strategies, ranging from the commonly used listwise deletion to model-based approaches such as maximum likelihood, for enhancing completeness in incomplete software project data sets. We evaluate the impact of each of these methods by applying them to six different real-time software project data sets, which are classified into different categories based on their inherent properties. The reliability of the data sets constructed with these techniques is further tested by building prediction models using stepwise regression. The experimental results are reported and the findings discussed.
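A hedged sketch of this kind of comparison, using scikit-learn imputers on synthetic data: plain linear regression stands in for the study's stepwise regression, and IterativeImputer stands in loosely for a maximum-likelihood-style approach; none of this reproduces the study's actual implementation.

```python
# Illustrative comparison of imputation strategies on an incomplete
# data set: each strategy completes the data, then a regression model
# is fit and its in-sample R^2 reported.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.3, 200)

# Knock out 15% of the entries at random to simulate missingness.
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan

strategies = {
    "listwise deletion": None,  # handled separately below
    "mean imputation": SimpleImputer(strategy="mean"),
    "k-NN imputation": KNNImputer(n_neighbors=5),
    "iterative (ML-like)": IterativeImputer(random_state=1),
}

for name, imputer in strategies.items():
    if imputer is None:
        rows = ~np.isnan(X_miss).any(axis=1)   # keep complete cases only
        Xc, yc = X_miss[rows], y[rows]
    else:
        Xc, yc = imputer.fit_transform(X_miss), y
    model = LinearRegression().fit(Xc, yc)
    print(f"{name:20s} in-sample R^2 = {r2_score(yc, model.predict(Xc)):.3f}")
```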


Neurosurgery ◽  
2011 ◽  
Vol 68 (2) ◽  
pp. 496-505 ◽  
Author(s):  
Alexandra J. Golby ◽  
Gordon Kindlmann ◽  
Isaiah Norton ◽  
Alexander Yarmarkovich ◽  
Steven Pieper ◽  
...  

Abstract BACKGROUND: Diffusion tensor imaging (DTI) infers the trajectory and location of large white matter tracts by measuring the anisotropic diffusion of water. DTI data may then be analyzed and presented as tractography for visualization of the tracts in 3 dimensions. Despite the important information contained in tractography images, their usefulness for neurosurgical planning has been limited by the inability to identify the critical structures within the mass of demonstrated fibers and to clarify their relationship to the tumor. OBJECTIVE: To develop a method that allows interactive querying of tractography data sets for surgical planning and to provide a working software package for the research community. METHODS: The tool was implemented within an open source software project. Echo-planar DTI at 3 T was performed on 5 patients, followed by tensor calculation. Software was developed that allowed the placement of a dynamic seed point for local selection of fibers and for fiber display around a segmented structure, both with tunable parameters. A neurosurgeon was trained in the use of the software in < 1 hour and used it to review the cases. RESULTS: Tracts near the tumor and critical structures were interactively visualized in 3 dimensions to determine their spatial relationships to the lesion. Tracts were selected using 3 methods: anatomical and functional magnetic resonance imaging-defined regions of interest, distance from the segmented tumor volume, and dynamic seed-point spheres. CONCLUSION: Interactive tractography successfully enabled inspection of white matter structures that were in proximity to lesions, critical structures, and functional cortical areas, allowing the surgeon to explore the relationships between them.
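The dynamic seed-point selection described in METHODS amounts to keeping every fiber that passes within a chosen radius of a movable seed. A minimal geometric sketch of that operation on synthetic streamlines follows; it does not use the actual tool's API, and all names are illustrative.

```python
# Toy illustration of seed-sphere fiber selection: keep each streamline
# that passes within `radius` mm of a movable seed point, as in the
# dynamic seed-point spheres described in the abstract.
import numpy as np

rng = np.random.default_rng(2)

def make_streamline(n_points: int = 50) -> np.ndarray:
    """A synthetic fiber: a smooth random walk of 3-D points (mm)."""
    steps = rng.normal(0, 0.5, (n_points, 3))
    return np.cumsum(steps, axis=0) + rng.uniform(-20, 20, 3)

def select_fibers(streamlines, seed, radius):
    """Return the streamlines with at least one point inside the sphere."""
    return [s for s in streamlines
            if np.min(np.linalg.norm(s - seed, axis=1)) <= radius]

fibers = [make_streamline() for _ in range(1000)]
seed = np.array([0.0, 0.0, 0.0])     # moved interactively in the real tool
selected = select_fibers(fibers, seed, radius=5.0)
print(f"{len(selected)} of {len(fibers)} fibers pass within 5 mm of the seed")
```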


Author(s):  
George Chatzikonstantinou ◽  
Kostas Kontogiannis ◽  
Ioanna-Maria Attarian

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 351
Author(s):  
Nezamoddin N. Kachouie ◽  
Meshal Shutaywi

Background: A common task in machine learning is clustering data into different groups based on similarities. Clustering methods can be divided into two groups: linear and nonlinear. A commonly used linear clustering method is K-means. Its extension, kernel K-means, is a nonlinear technique that utilizes a kernel function to project the data to a higher-dimensional space, where the projected data are then clustered into different groups. Different kernels do not perform similarly when applied to different datasets. Methods: A kernel function might be relevant for one application but perform poorly when projecting data for another. In turn, choosing the right kernel for an arbitrary dataset is a challenging task. To address this challenge, a potential approach is to aggregate the clustering results so as to obtain an impartial clustering result regardless of the selected kernel function. The main challenge then is how to aggregate the clustering results. A potential solution is to combine them using a weight function. In this work, we introduce Weighted Mutual Information (WMI) for calculating weights for different clustering methods based on their performance, and use these weights to combine the results. The performance of each method is evaluated using a training set with known labels. Results: We applied the proposed Weighted Mutual Information to four data sets that cannot be linearly separated. We also tested the method under different noise conditions. Conclusions: Our results show that the proposed Weighted Mutual Information method is impartial, does not rely on a single kernel, and performs better than each individual kernel, especially under high noise.
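A rough sketch of the aggregation idea on synthetic two-moons data: each kernel's clustering is weighted by its normalized mutual information with known labels on a training subset, and a weighted co-association matrix is then clustered to obtain the consensus. Spectral clustering stands in for kernel K-means here, and the paper's exact WMI combination rule may differ.

```python
# Illustrative kernel-clustering aggregation: weight each kernel's
# result by its normalized mutual information with known training
# labels, then cluster a weighted co-association matrix. This is a
# sketch of the idea, not the paper's exact WMI formulation.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

X, y = make_moons(n_samples=300, noise=0.08, random_state=3)
train = np.arange(60)                      # small labeled training subset

kernels = {"rbf": SpectralClustering(2, affinity="rbf", random_state=3),
           "knn": SpectralClustering(2, affinity="nearest_neighbors",
                                     random_state=3)}

labelings, weights = {}, {}
for name, model in kernels.items():
    labels = model.fit_predict(X)
    labelings[name] = labels
    # Weight = clustering quality on the labeled subset.
    weights[name] = normalized_mutual_info_score(y[train], labels[train])

total = sum(weights.values())
weights = {k: w / total for k, w in weights.items()}

# Weighted co-association matrix: weighted frequency with which each
# pair of points lands in the same cluster across the kernels.
n = len(X)
coassoc = np.zeros((n, n))
for name, labels in labelings.items():
    coassoc += weights[name] * (labels[:, None] == labels[None, :])

consensus = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1.0 - coassoc)
print("consensus NMI vs truth:", normalized_mutual_info_score(y, consensus))
```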


2007 ◽  
Vol 3 ◽  
pp. 518-527 ◽  
Author(s):  
Shuji Morisaki ◽  
Akito Monden ◽  
Haruaki Tamada ◽  
Tomoko Matsumura ◽  
Ken-ichi Matsumoto
