complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

2021 · Vol 22 (1) · Author(s): Muhammed Maruf Ozturk

Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) shows advantages over traditional defect prediction settings. In CPDP, the training and testing data do not come from the same project; instead, dissimilar projects are employed. Selecting proper training data therefore plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting the training data of CPDP. The method determines membership values with the help of metrics that can be considered indicators of complexity. First, CPDP combinations are created on 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures including the area under the curve (AUC) and the F-measure. The method is superior to five other comparison algorithms in terms of the distance of cluster centers and prediction performance.
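
As a hedged illustration of the general idea, the sketch below selects source instances by fuzzy membership to cluster centers computed over complexity metrics. It is not the paper's exact complexFuzzy algorithm; the metric names, cluster count, and membership threshold are assumptions.

```python
# Sketch: fuzzy-membership-based selection of CPDP training instances
# from complexity metrics. NOT the paper's exact complexFuzzy method.
import numpy as np
from sklearn.cluster import KMeans

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership of each row of X to each center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# X_src: source-project instances described by complexity metrics
# (columns such as LOC, WMC, CBO are assumed here for illustration).
rng = np.random.default_rng(0)
X_src = rng.random((200, 3))

centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_src).cluster_centers_
u = fuzzy_memberships(X_src, centers)   # shape (200, 4)
selected = X_src[u.max(axis=1) > 0.6]   # keep only confidently clustered instances
print(selected.shape)
```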

2021 · Vol 9 (1) · pp. 52-68 · Author(s): Lipika Goel, Mayank Sharma, Sunil Kumar Khatri, D. Damodaran

Often, prior defect data for the same project are unavailable, which has led researchers to ask whether the defect data of other projects can be used for prediction. This has made cross-project defect prediction an open research issue. In this setting, the training data often suffer from the class imbalance problem. The present work is directed at homogeneous cross-project defect prediction. A novel ensemble model that operates in two stages is proposed: first, it handles the class imbalance problem of the dataset; second, it performs the prediction of the target class. To handle the imbalance problem, the training dataset is divided into data frames, and each data frame is balanced. An ensemble model using the majority vote of random forest classifiers, one per frame, is implemented. The proposed model shows better performance than the other baseline models, and the Wilcoxon signed-rank test is performed to validate it.
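
A minimal sketch of this two-stage idea, assuming equal-size frames and a simple minority-oversampling scheme (the paper's exact balancing procedure is not reproduced):

```python
# Sketch: partition training data into frames, balance each frame,
# fit one random forest per frame, combine by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def balance(X, y, random_state=0):
    """Oversample the minority class so both classes are equally frequent."""
    cls, counts = np.unique(y, return_counts=True)
    minority, majority = cls[np.argmin(counts)], cls[np.argmax(counts)]
    X_min, y_min = resample(X[y == minority], y[y == minority], replace=True,
                            n_samples=int(counts.max()), random_state=random_state)
    return (np.vstack([X[y == majority], X_min]),
            np.hstack([y[y == majority], y_min]))

def fit_voting_forests(X, y, n_frames=5):
    forests = []
    for i, (X_f, y_f) in enumerate(zip(np.array_split(X, n_frames),
                                       np.array_split(y, n_frames))):
        X_b, y_b = balance(X_f, y_f, random_state=i)
        forests.append(RandomForestClassifier(random_state=i).fit(X_b, y_b))
    return forests

def predict_majority(forests, X):
    votes = np.stack([f.predict(X) for f in forests])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote, binary labels
```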


Author(s): Shaojian Qiu, Lu Lu, Siyu Jiang, Yang Guo

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods operate in a within-project setting. However, there is usually little or no within-project training data from which to learn a supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses the labeled data of source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, with particular attention to the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects derived from five datasets. By analyzing a total of 37,504 results, we found that, in most cases, an IEL method that combines under-sampling and bagging is more effective than the other investigated methods.
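
One way to exercise the under-sampling-plus-bagging family the study found most effective is imbalanced-learn's BalancedBaggingClassifier, which under-samples the majority class inside each bootstrap bag. The sketch below assumes the source and target project arrays are already loaded; it is one representative of the family, not the study's full benchmark.

```python
# Sketch: under-sampling + bagging for CPDP with imbalanced-learn.
import numpy as np
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import roc_auc_score

# Placeholder data; in practice X_src/y_src come from labeled source
# projects and X_tgt/y_tgt from the target project.
rng = np.random.default_rng(0)
X_src, y_src = rng.random((500, 20)), rng.integers(0, 2, 500)
X_tgt, y_tgt = rng.random((200, 20)), rng.integers(0, 2, 200)

clf = BalancedBaggingClassifier(n_estimators=50, random_state=0)
clf.fit(X_src, y_src)
print("AUC:", roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1]))
```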


2014 · Vol 21 (1) · pp. 67-74 · Author(s): Mohamed Marzouk, Mohamed Alaraby

This paper presents a fuzzy subtractive modelling technique to predict the weight of telecommunication towers, which is in turn used to estimate their cost. The technique is implemented using data from previously installed telecommunication towers and considers four input parameters: a) tower height; b) allowed tilt or deflection; c) antenna area subjected to loading; and d) wind load. Telecommunication towers are classified according to the design code (TIA-222-F and TIA-222-G standards) and structure type (Self-Supporting Tower (SST) and Roof Top (RT)). As such, four fuzzy subtractive models are developed to represent the four classes. To build the fuzzy models, 90% of the data are fed to Matlab as training data; the remaining 10% are used to test model performance. A first-order Sugeno-type system is used to optimize model performance in predicting tower weights. Errors are estimated using the Mean Absolute Percentage Error (MAPE) and the Root Mean Square Error (RMSE) for both the training and testing data sets. A sensitivity analysis is carried out to validate the model and to observe the effect of the cluster radius on model performance.
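
For reference, here is a minimal sketch of subtractive clustering (Chiu's potential-based scheme, on which Matlab's subtractive-clustering FIS generation builds); the cluster radius and stopping ratio below are assumptions, not the paper's tuned values.

```python
# Sketch: subtractive clustering on normalized tower parameters.
import numpy as np

def subtractive_clustering(X, ra=0.5, stop_ratio=0.15):
    """Return cluster centers found by subtractive clustering on rows of X."""
    alpha, beta = 4.0 / ra**2, 4.0 / (1.5 * ra)**2
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    potential = np.exp(-alpha * sq).sum(axis=1)
    first_peak, centers = potential.max(), []
    while potential.max() > stop_ratio * first_peak:
        c = potential.argmax()
        centers.append(X[c])
        # Penalize points near the new center so the next peak lies elsewhere.
        potential -= potential[c] * np.exp(-beta * ((X - X[c]) ** 2).sum(axis=1))
    return np.array(centers)

# Inputs would be the four tower parameters -- height, allowed tilt,
# antenna loading area, wind load -- assumed scaled to [0, 1].
rng = np.random.default_rng(1)
print(len(subtractive_clustering(rng.random((120, 4)))), "cluster centers")
```

Each resulting center would seed one rule of the first-order Sugeno system, with the cluster radius controlling how many rules are generated, which is what the paper's sensitivity analysis varies.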


2021 · Vol 11 (11) · pp. 4793 · Author(s): Cong Pan, Minyan Lu, Biao Xu

Deep-learning-based software defect prediction has become popular in recent years, and the release of the CodeBERT model has made it possible to tackle many software engineering tasks with pre-trained language models. We propose several CodeBERT variants targeting software defect prediction: CodeBERT-NT, CodeBERT-PS, CodeBERT-PK, and CodeBERT-PT. We perform empirical studies using these models in cross-version and cross-project software defect prediction to investigate whether a neural language model such as CodeBERT can improve prediction performance, and we also investigate the effects of different prediction patterns in software defect prediction using CodeBERT models. The empirical results are discussed in detail.
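
A hedged sketch of how the public CodeBERT checkpoint can be loaded for binary defect classification with Hugging Face Transformers; the paper's NT/PS/PK/PT variants are its own constructions and are not reproduced here.

```python
# Sketch: CodeBERT with a fresh 2-class head (defective / clean).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

code = "public int divide(int a, int b) { return a / b; }"
inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is untrained here; scores become meaningful
# only after fine-tuning on labeled defect data.
print(torch.softmax(logits, dim=-1))
```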


Processes · 2021 · Vol 9 (3) · pp. 439 · Author(s): Xiaoling Zhang, Xiyu Liu

Clustering analysis, a key step in many data mining problems, can be applied in various fields. However, regardless of the clustering method, noise points are an important factor affecting the clustering result. In addition, in spectral clustering, the construction of the affinity matrix affects the formation of the new sample representation, which in turn affects the final clustering results. This study therefore proposes a noise-cutting and natural-neighbors spectral clustering method based on a coupled P system (NCNNSC-CP) to address these problems. The whole algorithm is carried out within the coupled P system. We propose a parameter-free natural-neighbors searching method that quickly determines the natural neighbors and the natural characteristic value of the data points. Based on these, the critical density and reverse density are obtained, and noise identification and cutting are performed. The affinity matrix constructed using core natural neighbors greatly improves the similarity between data points. Experimental results on nine synthetic data sets and six UCI data sets demonstrate that the proposed algorithm outperforms the comparison algorithms.
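
A hedged sketch of a parameter-free natural-neighbor search in the spirit described above: grow k until the set of points that no other point lists as a neighbor stops shrinking. The coupled P system machinery and the exact stopping rule of NCNNSC-CP are not reproduced.

```python
# Sketch: natural-neighbor search without a user-chosen k.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X):
    """Return the natural characteristic value (final k) and reverse-neighbor counts."""
    n = len(X)
    nn = NearestNeighbors().fit(X)
    reverse_counts = np.zeros(n, dtype=int)
    prev_orphans, k = -1, 1
    while True:
        # Column k of the result is each point's k-th neighbor
        # (column 0 is the point itself).
        idx = nn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, k]
        np.add.at(reverse_counts, idx, 1)
        orphans = int((reverse_counts == 0).sum())
        if orphans == 0 or orphans == prev_orphans:
            return k, reverse_counts
        prev_orphans, k = orphans, k + 1

rng = np.random.default_rng(2)
k, rc = natural_neighbor_search(rng.random((300, 2)))
print("natural characteristic value:", k)
```

Points whose reverse-neighbor count stays low at convergence are candidates for noise cutting, and the remaining counts can weight the affinity matrix fed to spectral clustering.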


The project “Disease Prediction Model” focuses on predicting the type of skin cancer, a disease that takes a huge toll on human well-being. It constructs a Convolutional Neural Network (CNN) sequential model to identify the type of skin cancer, since programmed methods can substantially increase identification accuracy. The data set considered for this project is the well-known HAM10000 dataset, collected from NCBI, which consists of a large number of dermatoscopic images of common pigmented skin lesions gathered from different patients. Once the dataset is collected and cleaned, it is split into training and testing sets. The CNN model is trained on the training data and then evaluated on the testing data. After the model is applied to the testing data, plots are drawn to analyze the relation between epochs and the loss function, and between epochs and accuracy, for both the training and testing data.
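
A minimal sketch of such a sequential CNN in Keras, assuming 28x28 RGB inputs and HAM10000's seven lesion classes; the layer sizes are illustrative, not the project's exact architecture.

```python
# Sketch: small sequential CNN for 7-class lesion classification.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),  # 7 HAM10000 lesion types
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# history = model.fit(x_train, y_train, epochs=20,
#                     validation_data=(x_test, y_test))
# history.history then holds the per-epoch loss/accuracy curves to plot.
```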


2021 · Vol 12 (1) · pp. 1-11 · Author(s): Kishore Sugali, Chris Sprunger, Venkata N Inukollu

Artificial Intelligence and Machine Learning have been around for a long time, and in recent years there has been a surge in the popularity of applications integrating AI and ML technology. As with traditional development, software testing is a critical component of a successful AI/ML application. However, the development methodology used for AI/ML contrasts significantly with traditional development, and this difference gives rise to distinct software testing challenges. The emphasis of this paper is on the challenge of effectively splitting the data into training and testing sets. By applying a k-Means clustering strategy to the data set, followed by a decision tree, we can significantly increase the likelihood that the training data set represents the domain of the full dataset, and thus avoid training a model that is likely to fail because it has learned only a subset of the full data domain.
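
A minimal sketch of the splitting idea, assuming the k-Means cluster labels are used to stratify the split so every cluster appears in both halves; the paper's follow-up decision-tree step and its cluster count are not reproduced.

```python
# Sketch: cluster-aware train/test split so training data spans
# the whole data domain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def cluster_aware_split(X, y, n_clusters=8, test_size=0.2, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    # Stratifying on cluster labels forces each cluster into both halves.
    return train_test_split(X, y, test_size=test_size,
                            stratify=labels, random_state=seed)

rng = np.random.default_rng(3)
X, y = rng.random((500, 6)), rng.integers(0, 2, 500)
X_tr, X_te, y_tr, y_te = cluster_aware_split(X, y)
print(X_tr.shape, X_te.shape)
```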


2008 · Author(s): Pieter Kitslaar, Michel Frenay, Elco Oost, Jouke Dijkstra, Berend Stoel, ...

This document describes a novel scheme for the automated extraction of the central lumen lines of coronary arteries from computed tomography angiography (CTA) data. The scheme first obtains a segmentation of the whole coronary tree and subsequently extracts the centerlines from this segmentation. The first steps of the segmentation algorithm consist of the detection of the aorta and the entire heart region. Next, candidate coronary artery components are detected in the heart region after masking of the cardiac blood pools. Based on their location and geometrical properties, the structures representing the right and left arteries are selected from the candidate list. Starting from the aorta, connections between these structures are made, resulting in a final segmentation of the whole coronary artery tree. A fast-marching level set method combined with a backtracking algorithm is employed to obtain the initial centerlines within this segmentation. For all vessels, a curved multiplanar reformatted (CMPR) image is constructed and used to detect the lumen contours. The final centerline is then defined by the center of gravity of the detected lumen in the transversal CMPR slices. Within the scope of the MICCAI challenge “Coronary Artery Tracking 2008”, the coronary tree segmentation and centerline extraction scheme was used to automatically detect centerlines in 24 datasets. Reference centerlines were available for 8 data sets; this training data was used during the development and tuning of the algorithm. The other sixteen data sets were provided as testing data. The proposed methodology was evaluated by submitting the resulting centerlines to the MICCAI challenge website.
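
A hedged, two-dimensional sketch of the fast-marching-plus-backtracking step on a toy binary vessel mask, using scikit-fmm; the mask, seed, and end voxels are placeholders, not the paper's CTA pipeline.

```python
# Sketch: fast marching inside a vessel mask, then gradient-descent
# backtracking from a distal point to recover a centerline.
import numpy as np
import skfmm

mask = np.zeros((64, 64), dtype=bool)
mask[31:33, 8:56] = True                  # toy "vessel": a 2-voxel-thick tube
seed, end = (32, 9), (32, 54)

phi = np.ma.MaskedArray(np.ones(mask.shape), ~mask)  # march inside vessel only
phi[seed] = -1                                       # zero contour around seed
t = skfmm.travel_time(phi, np.ones(mask.shape))

# Backtrack from the distal point along -grad(t) toward the seed.
gy, gx = np.gradient(t.filled(t.max()))
path, p = [end], np.array(end, dtype=float)
for _ in range(500):
    if np.hypot(p[0] - seed[0], p[1] - seed[1]) < 1.0:
        break
    iy = int(np.clip(round(p[0]), 0, 63))
    ix = int(np.clip(round(p[1]), 0, 63))
    g = np.array([gy[iy, ix], gx[iy, ix]])
    p = p - g / (np.linalg.norm(g) + 1e-12)          # unit step downhill
    path.append((p[0], p[1]))
print(len(path), "centerline samples")
```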

