Mining Quantitative Rules in a Software Project Data Set

2007, Vol 3, pp. 518-527
Author(s):  
Shuji Morisaki ◽  
Akito Monden ◽  
Haruaki Tamada ◽  
Tomoko Matsumura ◽  
Ken-ichi Matsumoto

2018, Vol 232, pp. 03017
Author(s):  
Jie Zhang ◽  
Gang Wang ◽  
Haobo Jiang ◽  
Fangzheng Zhao ◽  
Guilin Tian

Software defect prediction has been an important part of software engineering research since the 1970s. The technique uses the measurement and defect information of historical software modules to predict defects in new software modules. Currently, most software defect prediction models are built on data from a single software project: the training data used to construct the model and the test data used to validate it come from the same project. In practice, however, for projects with little historical data or for entirely new projects, traditional prediction methods show poor forecasting performance; when historical data are insufficient, the defect prediction model cannot be trained adequately and high prediction accuracy is difficult to achieve. Cross-project prediction, in turn, faces the problem of differences in data distribution. To address these problems, this paper presents a software defect prediction model that combines transfer (migration) learning with a traditional software defect prediction model and uses existing project data sets to predict defects across projects. The main work of this article includes: 1) Data preprocessing, including feature correlation analysis and noise reduction, which reduces the interference of over-fitting and noisy data on the prediction results. 2) Transfer learning, which analyzes two different but related project data sets and reduces the impact of differences in data distribution. 3) Artificial neural networks: to handle the class imbalance in the data set, an artificial neural network with dynamic selection of training samples is used to reduce the influence of the skewed ratio of positive and negative samples on the prediction results. The Relink and AEEEM project data sets are used to evaluate performance in terms of the F-measure, the ROC curve, and AUC. Experiments show that the model achieves high predictive performance.
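
As a rough illustration of the cross-project setup described above, the sketch below uses synthetic stand-ins for the source and target projects rather than the actual Relink/AEEEM metrics: per-project standardisation plays the role of the distribution-alignment step, majority-class undersampling plays the role of dynamic training-sample selection, and a small scikit-learn neural network serves as the classifier. None of these choices is claimed to match the authors' exact pipeline.

```python
# Minimal sketch of cross-project defect prediction (assumed setup, not the
# authors' exact pipeline).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for a source project (training) and a target project
# (testing) whose feature distributions differ.
X_src = rng.normal(0.0, 1.0, size=(500, 20))
y_src = (X_src[:, 0] + rng.normal(0, 0.5, 500) > 0.8).astype(int)
X_tgt = rng.normal(0.5, 1.5, size=(300, 20))
y_tgt = (X_tgt[:, 0] + rng.normal(0, 0.5, 300) > 1.3).astype(int)

def zscore(X):
    # Standardising each project separately is one simple way to reduce
    # the difference in data distributions between projects.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)

X_src_n, X_tgt_n = zscore(X_src), zscore(X_tgt)

# Undersample the majority (non-defective) class so positive and negative
# samples are balanced before training.
pos, neg = np.where(y_src == 1)[0], np.where(y_src == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X_src_n[keep], y_src[keep])

proba = clf.predict_proba(X_tgt_n)[:, 1]
print("F-measure:", f1_score(y_tgt, (proba > 0.5).astype(int)))
print("AUC      :", roc_auc_score(y_tgt, proba))
```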


Author(s):  
Nuthan Munaiah ◽  
Steven Kroh ◽  
Craig Cabrey ◽  
Meiyappan Nagappan

Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such a large corpus of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity of querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew a study and may lead researchers to unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means of validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,994,977 GitHub repositories. We then used the data set to train classifiers capable of predicting whether a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground-truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. The stargazer-based classifier exhibited high precision (96%) but correspondingly low recall (27%). In contrast, our best classifier exhibited both high precision (82%) and high recall (83%). The stargazer-based criterion offers precision but fails to recall a significant portion of the population.
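
As a rough illustration of the precision/recall comparison above, the sketch below contrasts a stargazer-threshold rule with a classifier trained on hypothetical repository dimension scores. The feature names, thresholds, and synthetic labels are assumptions for demonstration only and are not reaper's actual dimensions or data.

```python
# Synthetic comparison of a stargazer-threshold rule vs. a classifier
# trained on repository "dimension" scores (assumed features and data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
n = 1000
engineered = rng.random(n) < 0.3                      # ground-truth labels
# Hypothetical dimension scores (e.g. history, CI, testing, documentation):
# engineered repositories score higher on average.
dims = rng.normal(loc=engineered[:, None] * 0.8, scale=1.0, size=(n, 4))
# Only some engineered repositories are popular, so stargazer counts are a
# precise but low-recall signal (mirroring the pattern reported above).
popular = engineered & (rng.random(n) < 0.3)
stars = rng.poisson(np.where(popular, 40, 2))

half = n // 2
test = slice(half, None)

# Baseline: call a repository "engineered" if it has many stargazers.
star_pred = stars[test] >= 10
print("stargazer rule :",
      "precision", round(precision_score(engineered[test], star_pred), 2),
      "recall", round(recall_score(engineered[test], star_pred), 2))

# Dimension-based classifier trained on the first half of the sample.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(dims[:half], engineered[:half])
dim_pred = clf.predict(dims[test])
print("dimension model:",
      "precision", round(precision_score(engineered[test], dim_pred), 2),
      "recall", round(recall_score(engineered[test], dim_pred), 2))
```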


2021, Vol 108 (Supplement_7)
Author(s):  
Nandu Nair ◽  
Vasileios Kalatzis ◽  
Madhavi Gudipati ◽  
Anne Gaunt ◽  
Vishnu Machineni

Abstract Aims During the period December 2018 to November 2019 a total of 84 cases were entered on the NELA website, whereas HES data suggested 392 laparotomies. This suggests a possible case ascertainment of 21%, prompting us to look at our data acquisition in detail. Methods The NELA data from January to March 2020 were interrogated using the NELA website and hospital records. Results Analysis revealed that during this period 45 patients had a laparotomy recorded on NELA, whereas the hospital database recorded 68 laparotomies. Of the 45 cases entered on the NELA database, only 1 patient had a complete data set entered, 22 cases had 87% data entry, and 22 cases had <50% of the data fields completed. Firstly, we were not capturing all patients who underwent an emergency laparotomy, and secondly, our data entry for the patients we did report was incomplete. This led us to engage in a quality improvement project with the following measures - Conclusions We re-assessed the case ascertainment and completeness of data collection in the period April 2020 to June 2020; the case ascertainment rate increased to 54% and all the entries were complete and locked.
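
For reference, a small worked check of the case-ascertainment arithmetic quoted above (NELA entries divided by laparotomies recorded in hospital or HES data); the helper function is illustrative and uses only the counts stated in the abstract.

```python
# Worked check of the case-ascertainment figures quoted in the abstract;
# the helper function is illustrative and only uses the stated counts.
def ascertainment(nela_entries: int, recorded_laparotomies: int) -> float:
    """Share of recorded laparotomies that were also entered on NELA."""
    return nela_entries / recorded_laparotomies

print(f"Dec 2018 - Nov 2019 (NELA 84 vs HES 392): {ascertainment(84, 392):.0%}")
print(f"Jan - Mar 2020 (NELA 45 vs hospital 68) : {ascertainment(45, 68):.0%}")
```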


Author(s):  
Lei Shi ◽  
Linda Newnes ◽  
Steve Culley ◽  
Bruce Allen

Abstract An engineering service project can be highly interactive, collaborative, and distributed. The implementation of such projects generates, utilizes, and shares large amounts of data and heterogeneous digital objects. The resulting information overload prevents the effective reuse of project data and knowledge, and makes it difficult to understand project characteristics. To address these issues, this paper emphasizes the use of data mining and machine learning techniques to improve the process of understanding project characteristics. The work presented in this paper proposes an automatic model and a set of analytical approaches for learning and predicting the characteristics of engineering service projects. To evaluate the model and demonstrate its functionality, an industrial data set from the aerospace sector is used as a case study. This work shows that the proposed model can enable project members to gain a comprehensive understanding of project characteristics from a multidimensional perspective, and that it has the potential to support them in evidence-based design and decision making.
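
As a rough illustration of the multidimensional characteristic analysis described above, the sketch below clusters a few made-up per-project features and assigns a new project to a characteristic group; the paper's actual model, features, and industrial aerospace data set are not reproduced here.

```python
# Minimal sketch of multidimensional project-characteristic analysis with
# made-up per-project features and a simple clustering step.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical per-project features: duration (months), digital objects
# produced, collaborating sites, and emails exchanged.
projects = np.column_stack([
    rng.integers(3, 36, 200),
    rng.integers(50, 5000, 200),
    rng.integers(1, 8, 200),
    rng.integers(100, 20000, 200),
]).astype(float)

scaler = StandardScaler().fit(projects)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(projects))

# A new engineering service project is characterised by the group whose
# historical projects it most resembles.
new_project = scaler.transform([[12, 800, 3, 4000]])
print("characteristic group:", model.predict(new_project)[0])
```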


Author(s):  
George Chatzikonstantinou ◽  
Kostas Kontogiannis ◽  
Ioanna-Maria Attarian
