complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

2021 · Vol 22 (1) · Author(s): Muhammed Maruf Ozturk

Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) shows advantages over traditional defect prediction settings. In CPDP, the training and testing data do not come from the same project; instead, dissimilar projects are employed. Selecting proper training data therefore plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting the training data of CPDP. The method determines membership values with the help of metrics that can be considered indicators of complexity. First, CPDP combinations are created on 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures including the area under the curve (AUC) and the F-measure. The method is superior to five other comparison algorithms in terms of the distance of cluster centers and prediction performance.
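
As a hedged illustration of the general idea, the sketch below selects source instances by fuzzy membership to cluster centers computed over complexity metrics. It is not the paper's exact complexFuzzy algorithm; the metric names, cluster count, and membership threshold are assumptions.

```python
# Sketch: fuzzy-membership-based selection of CPDP training instances
# from complexity metrics. NOT the paper's exact complexFuzzy method.
import numpy as np
from sklearn.cluster import KMeans

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership of each row of X to each center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# X_src: source-project instances described by complexity metrics
# (columns such as LOC, WMC, CBO are assumed here for illustration).
rng = np.random.default_rng(0)
X_src = rng.random((200, 3))

centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_src).cluster_centers_
u = fuzzy_memberships(X_src, centers)   # shape (200, 4)
selected = X_src[u.max(axis=1) > 0.6]   # keep only confidently clustered instances
print(selected.shape)
```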

2021 · Vol 9 (1) · pp. 52-68 · Author(s): Lipika Goel, Mayank Sharma, Sunil Kumar Khatri, D. Damodaran

Often, prior defect data for the same project are unavailable, which has led researchers to ask whether the defect data of other projects can be used for prediction. This has made cross-project defect prediction an open research issue. In this setting, the training data often suffer from the class imbalance problem. The present work is directed at homogeneous cross-project defect prediction. A novel ensemble model that operates in two stages is proposed: first, it handles the class imbalance problem of the dataset; second, it performs the prediction of the target class. To handle the imbalance problem, the training dataset is divided into data frames, and each data frame is balanced. An ensemble model using the majority vote of random forest classifiers, one per frame, is implemented. The proposed model shows better performance than the other baseline models, and the Wilcoxon signed-rank test is performed to validate it.
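
A minimal sketch of this two-stage idea, assuming equal-size frames and a simple minority-oversampling scheme (the paper's exact balancing procedure is not reproduced):

```python
# Sketch: partition training data into frames, balance each frame,
# fit one random forest per frame, combine by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def balance(X, y, random_state=0):
    """Oversample the minority class so both classes are equally frequent."""
    cls, counts = np.unique(y, return_counts=True)
    minority, majority = cls[np.argmin(counts)], cls[np.argmax(counts)]
    X_min, y_min = resample(X[y == minority], y[y == minority], replace=True,
                            n_samples=int(counts.max()), random_state=random_state)
    return (np.vstack([X[y == majority], X_min]),
            np.hstack([y[y == majority], y_min]))

def fit_voting_forests(X, y, n_frames=5):
    forests = []
    for i, (X_f, y_f) in enumerate(zip(np.array_split(X, n_frames),
                                       np.array_split(y, n_frames))):
        X_b, y_b = balance(X_f, y_f, random_state=i)
        forests.append(RandomForestClassifier(random_state=i).fit(X_b, y_b))
    return forests

def predict_majority(forests, X):
    votes = np.stack([f.predict(X) for f in forests])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote, binary labels
```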


Author(s): Shaojian Qiu, Lu Lu, Siyu Jiang, Yang Guo

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods operate in a within-project setting. However, there is usually little or no within-project training data from which to learn a supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses the labeled data of source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, with particular attention to the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects derived from five datasets. By analyzing a total of 37,504 results, we found that, in most cases, an IEL method that combines under-sampling and bagging is more effective than the other investigated methods.
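
One way to exercise the under-sampling-plus-bagging family the study found most effective is imbalanced-learn's BalancedBaggingClassifier, which under-samples the majority class inside each bootstrap bag. The sketch below assumes the source and target project arrays are already loaded; it is one representative of the family, not the study's full benchmark.

```python
# Sketch: under-sampling + bagging for CPDP with imbalanced-learn.
import numpy as np
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import roc_auc_score

# Placeholder data; in practice X_src/y_src come from labeled source
# projects and X_tgt/y_tgt from the target project.
rng = np.random.default_rng(0)
X_src, y_src = rng.random((500, 20)), rng.integers(0, 2, 500)
X_tgt, y_tgt = rng.random((200, 20)), rng.integers(0, 2, 200)

clf = BalancedBaggingClassifier(n_estimators=50, random_state=0)
clf.fit(X_src, y_src)
print("AUC:", roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1]))
```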


2014 · Vol 21 (1) · pp. 67-74 · Author(s): Mohamed Marzouk, Mohamed Alaraby

This paper presents a fuzzy subtractive modelling technique to predict the weight of telecommunication towers, which is in turn used to estimate their cost. The technique is implemented using data from previously installed telecommunication towers and considers four input parameters: a) tower height; b) allowed tilt or deflection; c) antenna area subjected to loading; and d) wind load. Telecommunication towers are classified according to the design code (TIA-222-F and TIA-222-G standards) and structure type (Self-Supporting Tower (SST) and Roof Top (RT)). As such, four fuzzy subtractive models are developed to represent the four classes. To build the fuzzy models, 90% of the data are fed to Matlab as training data; the remaining 10% are used to test model performance. A first-order Sugeno-type system is used to optimize model performance in predicting tower weights. Errors are estimated using the Mean Absolute Percentage Error (MAPE) and the Root Mean Square Error (RMSE) for both the training and testing data sets. A sensitivity analysis is carried out to validate the model and to observe the effect of the cluster radius on model performance.
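
For reference, here is a minimal sketch of subtractive clustering (Chiu's potential-based scheme, on which Matlab's subtractive-clustering FIS generation builds); the cluster radius and stopping ratio below are assumptions, not the paper's tuned values.

```python
# Sketch: subtractive clustering on normalized tower parameters.
import numpy as np

def subtractive_clustering(X, ra=0.5, stop_ratio=0.15):
    """Return cluster centers found by subtractive clustering on rows of X."""
    alpha, beta = 4.0 / ra**2, 4.0 / (1.5 * ra)**2
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    potential = np.exp(-alpha * sq).sum(axis=1)
    first_peak, centers = potential.max(), []
    while potential.max() > stop_ratio * first_peak:
        c = potential.argmax()
        centers.append(X[c])
        # Penalize points near the new center so the next peak lies elsewhere.
        potential -= potential[c] * np.exp(-beta * ((X - X[c]) ** 2).sum(axis=1))
    return np.array(centers)

# Inputs would be the four tower parameters -- height, allowed tilt,
# antenna loading area, wind load -- assumed scaled to [0, 1].
rng = np.random.default_rng(1)
print(len(subtractive_clustering(rng.random((120, 4)))), "cluster centers")
```

Each resulting center would seed one rule of the first-order Sugeno system, with the cluster radius controlling how many rules are generated, which is what the paper's sensitivity analysis varies.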


2021 · Vol 11 (11) · pp. 4793 · Author(s): Cong Pan, Minyan Lu, Biao Xu

Deep-learning-based software defect prediction has become popular in recent years, and the release of the CodeBERT model has made it possible to tackle many software engineering tasks with pre-trained language models. We propose several CodeBERT variants targeting software defect prediction: CodeBERT-NT, CodeBERT-PS, CodeBERT-PK, and CodeBERT-PT. We perform empirical studies using these models in cross-version and cross-project software defect prediction to investigate whether a neural language model such as CodeBERT can improve prediction performance, and we also investigate the effects of different prediction patterns in software defect prediction using CodeBERT models. The empirical results are discussed in detail.
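
A hedged sketch of how the public CodeBERT checkpoint can be loaded for binary defect classification with Hugging Face Transformers; the paper's NT/PS/PK/PT variants are its own constructions and are not reproduced here.

```python
# Sketch: CodeBERT with a fresh 2-class head (defective / clean).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

code = "public int divide(int a, int b) { return a / b; }"
inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is untrained here; scores become meaningful
# only after fine-tuning on labeled defect data.
print(torch.softmax(logits, dim=-1))
```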


Processes · 2021 · Vol 9 (3) · pp. 439 · Author(s): Xiaoling Zhang, Xiyu Liu

Clustering analysis, a key step in many data mining problems, can be applied in various fields. However, regardless of the clustering method, noise points are an important factor affecting the clustering result. In addition, in spectral clustering, the construction of the affinity matrix affects the formation of the new sample representation, which in turn affects the final clustering results. This study therefore proposes a noise-cutting and natural-neighbors spectral clustering method based on a coupled P system (NCNNSC-CP) to address these problems. The whole algorithm is carried out within the coupled P system. We propose a parameter-free natural-neighbors searching method that quickly determines the natural neighbors and the natural characteristic value of the data points. Based on these, the critical density and reverse density are obtained, and noise identification and cutting are performed. The affinity matrix constructed using core natural neighbors greatly improves the similarity between data points. Experimental results on nine synthetic data sets and six UCI data sets demonstrate that the proposed algorithm outperforms the comparison algorithms.
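
A hedged sketch of a parameter-free natural-neighbor search in the spirit described above: grow k until the set of points that no other point lists as a neighbor stops shrinking. The coupled P system machinery and the exact stopping rule of NCNNSC-CP are not reproduced.

```python
# Sketch: natural-neighbor search without a user-chosen k.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X):
    """Return the natural characteristic value (final k) and reverse-neighbor counts."""
    n = len(X)
    nn = NearestNeighbors().fit(X)
    reverse_counts = np.zeros(n, dtype=int)
    prev_orphans, k = -1, 1
    while True:
        # Column k of the result is each point's k-th neighbor
        # (column 0 is the point itself).
        idx = nn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, k]
        np.add.at(reverse_counts, idx, 1)
        orphans = int((reverse_counts == 0).sum())
        if orphans == 0 or orphans == prev_orphans:
            return k, reverse_counts
        prev_orphans, k = orphans, k + 1

rng = np.random.default_rng(2)
k, rc = natural_neighbor_search(rng.random((300, 2)))
print("natural characteristic value:", k)
```

Points whose reverse-neighbor count stays low at convergence are candidates for noise cutting, and the remaining counts can weight the affinity matrix fed to spectral clustering.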


The project “Disease Prediction Model” focuses on predicting the type of skin cancer, a disease that takes a huge toll on human well-being. It constructs a Convolutional Neural Network (CNN) sequential model to identify the type of skin cancer, since programmed methods can substantially increase identification accuracy. The data set considered for this project is the well-known HAM10000 dataset, collected from NCBI, which consists of a large number of dermatoscopic images of common pigmented skin lesions gathered from different patients. Once the dataset is collected and cleaned, it is split into training and testing sets. The CNN model is trained on the training data and then evaluated on the testing data. After the model is applied to the testing data, plots are drawn to analyze the relation between epochs and the loss function, and between epochs and accuracy, for both the training and testing data.
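
A minimal sketch of such a sequential CNN in Keras, assuming 28x28 RGB inputs and HAM10000's seven lesion classes; the layer sizes are illustrative, not the project's exact architecture.

```python
# Sketch: small sequential CNN for 7-class lesion classification.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),  # 7 HAM10000 lesion types
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# history = model.fit(x_train, y_train, epochs=20,
#                     validation_data=(x_test, y_test))
# history.history then holds the per-epoch loss/accuracy curves to plot.
```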


2021 · Vol 12 (1) · pp. 1-11 · Author(s): Kishore Sugali, Chris Sprunger, Venkata N Inukollu

Artificial Intelligence and Machine Learning have been around for a long time, and in recent years there has been a surge in the popularity of applications integrating AI and ML technology. As with traditional development, software testing is a critical component of a successful AI/ML application. However, the development methodology used for AI/ML contrasts significantly with traditional development, and this difference gives rise to distinct software testing challenges. The emphasis of this paper is on the challenge of effectively splitting the data into training and testing sets. By applying a k-Means clustering strategy to the data set, followed by a decision tree, we can significantly increase the likelihood that the training data set represents the domain of the full dataset, and thus avoid training a model that is likely to fail because it has learned only a subset of the full data domain.
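
A minimal sketch of the splitting idea, assuming the k-Means cluster labels are used to stratify the split so every cluster appears in both halves; the paper's follow-up decision-tree step and its cluster count are not reproduced.

```python
# Sketch: cluster-aware train/test split so training data spans
# the whole data domain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def cluster_aware_split(X, y, n_clusters=8, test_size=0.2, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    # Stratifying on cluster labels forces each cluster into both halves.
    return train_test_split(X, y, test_size=test_size,
                            stratify=labels, random_state=seed)

rng = np.random.default_rng(3)
X, y = rng.random((500, 6)), rng.integers(0, 2, 500)
X_tr, X_te, y_tr, y_te = cluster_aware_split(X, y)
print(X_tr.shape, X_te.shape)
```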


2008 · Author(s): Pieter Kitslaar, Michel Frenay, Elco Oost, Jouke Dijkstra, Berend Stoel, ...

This document describes a novel scheme for the automated extraction of the central lumen lines of coronary arteries from computed tomography angiography (CTA) data. The scheme first obtains a segmentation of the whole coronary tree and subsequently extracts the centerlines from this segmentation. The first steps of the segmentation algorithm consist of the detection of the aorta and the entire heart region. Next, candidate coronary artery components are detected in the heart region after masking of the cardiac blood pools. Based on their location and geometrical properties, the structures representing the right and left arteries are selected from the candidate list. Starting from the aorta, connections between these structures are made, resulting in a final segmentation of the whole coronary artery tree. A fast-marching level set method combined with a backtracking algorithm is employed to obtain the initial centerlines within this segmentation. For all vessels, a curved multiplanar reformatted (CMPR) image is constructed and used to detect the lumen contours. The final centerline is then defined by the center of gravity of the detected lumen in the transversal CMPR slices. Within the scope of the MICCAI challenge “Coronary Artery Tracking 2008”, the coronary tree segmentation and centerline extraction scheme was used to automatically detect centerlines in 24 datasets. Reference centerlines were available for 8 data sets; this training data was used during the development and tuning of the algorithm. The other sixteen data sets were provided as testing data. The proposed methodology was evaluated by submitting the resulting centerlines to the MICCAI challenge website.
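
A hedged, two-dimensional sketch of the fast-marching-plus-backtracking step on a toy binary vessel mask, using scikit-fmm; the mask, seed, and end voxels are placeholders, not the paper's CTA pipeline.

```python
# Sketch: fast marching inside a vessel mask, then gradient-descent
# backtracking from a distal point to recover a centerline.
import numpy as np
import skfmm

mask = np.zeros((64, 64), dtype=bool)
mask[31:33, 8:56] = True                  # toy "vessel": a 2-voxel-thick tube
seed, end = (32, 9), (32, 54)

phi = np.ma.MaskedArray(np.ones(mask.shape), ~mask)  # march inside vessel only
phi[seed] = -1                                       # zero contour around seed
t = skfmm.travel_time(phi, np.ones(mask.shape))

# Backtrack from the distal point along -grad(t) toward the seed.
gy, gx = np.gradient(t.filled(t.max()))
path, p = [end], np.array(end, dtype=float)
for _ in range(500):
    if np.hypot(p[0] - seed[0], p[1] - seed[1]) < 1.0:
        break
    iy = int(np.clip(round(p[0]), 0, 63))
    ix = int(np.clip(round(p[1]), 0, 63))
    g = np.array([gy[iy, ix], gx[iy, ix]])
    p = p - g / (np.linalg.norm(g) + 1e-12)          # unit step downhill
    path.append((p[0], p[1]))
print(len(path), "centerline samples")
```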

