A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for Imbalanced Learning in Software Defect Prediction

The most frequently used machine learning feature ranking approaches fail to produce an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), ReliefF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after the class distribution in the training data has been balanced. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets in their sampled and unsampled forms to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the feature subsets selected from the sampled and unsampled datasets. The performance of the models, measured using the Area Under the Receiver Operating Characteristic Curve (AUC), is compared across all FS methods considered. The Analysis Of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the other FS techniques, on both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
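As a rough illustration of feature ranking with a maximum likelihood logistic regression fit, the sketch below fits the model by gradient ascent on the log-likelihood and orders features by the absolute value of their standardized coefficients. The abstract does not say how MLLR scores features, so this scoring rule is an assumption, as are the function names (`standardize`, `fit_logistic`, `rank_features`), the learning rate, the epoch count, and the toy data.

```python
import math

def standardize(X):
    """Scale each column to zero mean and unit variance (constant columns stay at 0)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Maximize the mean Bernoulli log-likelihood by batch gradient ascent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(defective)
            err = yi - p                     # gradient of the log-likelihood w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * g / n for wj, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def rank_features(X, y):
    """Assumed scoring rule: larger |standardized coefficient| means more useful."""
    w, _ = fit_logistic(standardize(X), y)
    order = sorted(range(len(w)), key=lambda j: -abs(w[j]))
    return order, w

# Hypothetical toy data: column 0 tracks the defect label, column 1 is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 4.0],
     [5.0, 4.0], [6.0, 5.0], [7.0, 3.0], [8.0, 4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
order, weights = rank_features(X, y)
print(order, weights)
```

Standardizing first matters here because raw coefficients depend on each metric's scale; without it, the ranking would mostly reflect units rather than predictive value.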

PREDAIP: Computational Prediction and Analysis for Anti-inflammatory Peptide via a Hybrid Feature Selection Technique

Current Bioinformatics ◽ 10.2174/1574893616666210601111157 ◽ 2021 ◽ Vol 16

Author(s): Dan Lin ◽ Jialin Yu ◽ Ju Zhang ◽ Huan He ◽ Xinyun Guo ◽ ...

Keyword(s): Machine Learning ◽ Feature Selection ◽ Machine Learning Algorithms ◽ Selection Strategy ◽ Feature Subset ◽ Feature Selection Technique ◽ Selection Technique ◽ Anti Inflammatory ◽ Optimal Feature Subset ◽ Optimal Feature

Background: Anti-inflammatory peptides (AIPs) are potent therapeutic agents for inflammatory and autoimmune disorders because of their high specificity and minimal toxicity under normal conditions. Identifying AIPs is therefore an important step toward discovering novel and efficient AIP-based therapeutics. Recently, three computational approaches based on machine learning algorithms have been developed to identify potential AIPs. However, these three existing predictors face several challenges. Objective: A novel machine learning approach is needed to improve AIP prediction accuracy. Methods: This study attempts to improve the recognition of AIPs by employing multiple primary sequence-based feature descriptors and an efficient feature selection strategy. By sorting features with four enhanced minimal redundancy maximal relevance (emRMR) methods, and then applying wrapper methods built on seven different classifiers and the sequential forward selection (SFS) algorithm, we propose a hybrid feature selection technique, emRMR-SFS, to optimize the feature vectors. Furthermore, by evaluating seven classifiers trained with the optimal feature subset, we developed an extremely randomized tree (ERT) based predictor named PREDAIP for identifying AIPs. Results: We systematically compared the performance of PREDAIP with that of the existing tools on an independent test dataset, and the comparison demonstrates the effectiveness of PREDAIP. The correlation criterion used in emRMR affects which optimal feature subset is selected at the SFS-wrapper stage, which justifies considering different correlation criteria in emRMR. Conclusion: We expect that PREDAIP will be useful for the high-throughput prediction of AIPs and the development of AIP-based therapeutics.
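The filter stage of such a hybrid pipeline can be sketched with a plain greedy minimal redundancy maximal relevance (mRMR) ranking. This is not the paper's emRMR: the abstract does not define its four enhanced variants, so the sketch assumes absolute Pearson correlation for both relevance and redundancy and the common "relevance minus mean redundancy" score. The names `pearson` and `mrmr_rank` and the toy columns are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def mrmr_rank(columns, target):
    """Greedy mRMR: at each step pick the feature with the best
    relevance-to-target minus mean redundancy with features already picked."""
    relevance = [abs(pearson(c, target)) for c in columns]
    remaining = list(range(len(columns)))
    selected = []
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(abs(pearson(columns[j], columns[k]))
                             for k in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy descriptors: column 1 nearly duplicates column 0,
# column 2 is weaker but carries less overlapping information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 3, 5, 4, 6, 7, 8],
    [0, 1, 0, 1, 1, 0, 1, 1],
]
print(mrmr_rank(columns, target))
```

On this toy data the near-duplicate column is ranked last even though it is individually more relevant than column 2, which is the behaviour the redundancy penalty is meant to produce. In a hybrid scheme, an ordered list like this would then be handed to a wrapper stage such as SFS.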

EMPIRICAL ASSESSMENT OF MACHINE LEARNING BASED SOFTWARE DEFECT PREDICTION TECHNIQUES

International Journal of Artificial Intelligence Tools ◽ 10.1142/s0218213008003947 ◽ 2008 ◽ Vol 17 (02) ◽ pp. 389-400 ◽ Cited By ~ 41

Author(s): VENKATA UDAYA B. CHALLAGULLA ◽ FAROKH B. BASTANI ◽ I-LING YEN ◽ RAYMOND A. PAUL

Keyword(s): Machine Learning ◽ Feature Subset Selection ◽ Defect Prediction ◽ Data Sets ◽ Feature Subset ◽ Software Defect Prediction ◽ Software Defect ◽ Intelligent Software ◽ Instance Based Learning ◽ Prediction Techniques

Automated reliability assessment is essential for systems that entail dynamic adaptation based on runtime mission-specific requirements. One approach in this direction is to monitor and assess the system using machine learning-based software defect prediction techniques. Because of the dynamic nature of the software data collected, instance-based learning algorithms are proposed for this purpose. To evaluate the accuracy of these methods, the paper presents an empirical analysis of four different real-time software defect data sets using different predictor models. The results show that a combination of 1R and instance-based learning, together with the consistency-based subset evaluation technique, achieves accurate predictions more consistently than the other models. No direct relationship is observed between the skewness present in the data sets and the prediction accuracy of these models. Principal Component Analysis (PCA) does not show a consistent advantage in improving the accuracy of the predictions. While random reduction of attributes gave poor accuracy, simple Feature Subset Selection methods performed better than PCA for most prediction models. Based on these results, the paper presents a high-level design of an Intelligent Software Defect Analysis Tool (ISDAT) for dynamic monitoring and defect assessment of software modules.

Improved Feature Selection Based on Mutual Information for Regression Tasks

Journal of IT in Asia ◽ 10.33736/jita.330.2016 ◽ 2016 ◽ Vol 6 (1) ◽ pp. 11-24

Author(s): Muhammad A. Sulaiman ◽ Jane Labadin

Keyword(s): Machine Learning ◽ Feature Selection ◽ Mutual Information ◽ Greedy Algorithms ◽ Selection Criterion ◽ Feature Subset ◽ Classification Problems ◽ Machine Learning Applications ◽ Optimal Feature Subset ◽ Optimal Feature

Mutual Information (MI) is an information-theoretic concept that has recently been widely used as a criterion for feature selection methods, owing to its ability to capture both linear and non-linear dependencies between two variables. In theory, mutual information is formulated in terms of the probability density functions (pdfs) or entropies of the two variables. In most machine learning applications, mutual information estimation is formulated for classification problems (that is, data with labeled output). This study investigates the use of mutual information estimation as a feature selection criterion for regression tasks and introduces an enhancement in selecting the optimal feature subset based on previous work. Specifically, while focusing on regression tasks, it builds on previous work in which a scientifically sound stopping criterion for greedy feature selection algorithms was proposed. Four real-world regression datasets were used in this study: three are public datasets obtained from the UCI Machine Learning Repository, and the remaining one is a private well log dataset. Two machine learning models, namely multiple regression and artificial neural networks (ANN), were used to test the performance of IFSMIR. The results obtained demonstrate the effectiveness of the proposed method.
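To make the idea of MI as a selection criterion for a continuous target concrete, here is a minimal sketch that estimates MI with equal-width histograms and ranks features by it. The abstract does not specify the estimator used in the study, so the histogram estimator, the bin count, the names `discretize`, `mutual_information` and `rank_by_mi`, and the toy data are all assumptions.

```python
import math
from collections import Counter

def discretize(values, bins=4):
    """Map continuous values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=4):
    """Plug-in estimate of I(X;Y) in nats from a joint histogram."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log( p(i,j) / (p(i) p(j)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi

def rank_by_mi(features, target, bins=4):
    """Order feature names by estimated MI with the continuous target."""
    scores = {name: mutual_information(col, target, bins)
              for name, col in features.items()}
    return sorted(scores, key=lambda name: -scores[name]), scores

# Hypothetical toy regression data: the target depends on x non-linearly
# (y = x^2), so a linear correlation measure would miss the relationship.
x = [float(v) for v in range(-8, 9)]
y = [v * v for v in x]
features = {"driver": x, "constant": [1.0] * len(x)}
order, scores = rank_by_mi(features, y)
print(order, scores)
```

The symmetric quadratic is chosen deliberately: the Pearson correlation between `x` and `y` is zero here, while the MI estimate is clearly positive. Histogram estimates are biased on small samples, so the bin count is a tuning choice rather than a fixed rule.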

A Hybrid Data Preprocessing Technique based on Maximum Likelihood Logistic Regression with Filtering for Enhancing Software Defect Prediction

2019 IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering (ISKE) ◽ 10.1109/iske47853.2019.9170328 ◽ 2019

Author(s): Kamal Bashir ◽ Tayseer Ali ◽ Mahama Yahaya ◽ Ahmed Saad Hussein

Keyword(s): Logistic Regression ◽ Maximum Likelihood ◽ Data Preprocessing ◽ Defect Prediction ◽ Software Defect Prediction ◽ Software Defect ◽ Hybrid Data ◽ Preprocessing Technique

A Novel Rank Aggregation-Based Hybrid Multifilter Wrapper Feature Selection Method in Software Defect Prediction

Computational Intelligence and Neuroscience ◽ 10.1155/2021/5069016 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-19

Author(s): Abdullateef O. Balogun ◽ Shuib Basri ◽ Saipunidzam Mahamad ◽ Luiz Fernando Capretz ◽ Abdullahi Abubakar Imam ◽ ...

Keyword(s): Feature Selection ◽ Rank Aggregation ◽ Feature Subset Selection ◽ Selection Problem ◽ Defect Prediction ◽ Feature Subset ◽ Software Defect Prediction ◽ Local Optima ◽ Software Defect ◽ Wrapper Feature Selection

The high dimensionality of software metric features has long been noted as a data quality problem that affects the performance of software defect prediction (SDP) models. This drawback makes it necessary to apply feature selection (FS) algorithm(s) in SDP processes. FS approaches can be categorized into three types, namely, filter FS (FFS), wrapper FS (WFS), and hybrid FS (HFS). HFS has been established as superior because it combines the strengths of both FFS and WFS methods. However, selecting the most appropriate FFS method for HFS (the filter rank selection problem) is a challenge because the performance of FFS methods depends on the choice of datasets and classifiers. In addition, the local optima stagnation and high computational costs of WFS, which result from large search spaces, are inherited by the HFS method. As a solution, this study proposes a novel rank aggregation-based hybrid multifilter wrapper feature selection (RAHMFWFS) method for selecting relevant and irredundant features from software defect datasets. The proposed RAHMFWFS is divided into two stepwise stages. The first stage involves a rank aggregation-based multifilter feature selection (RMFFS) method that addresses the filter rank selection problem by aggregating individual rank lists from multiple filter methods, using a novel rank aggregation method to generate a single, robust, and non-disjoint rank list. In the second stage, the aggregated ranked features are further preprocessed by an enhanced wrapper feature selection (EWFS) method based on a dynamic reranking strategy that guides the feature subset selection process of the HFS method. This, in turn, reduces the number of evaluation cycles while amplifying or maintaining prediction performance. The feasibility of the proposed RAHMFWFS was demonstrated on benchmark software defect datasets with Naïve Bayes and Decision Tree classifiers, based on accuracy, the area under the curve (AUC), and F-measure values. The experimental results showed the effectiveness of RAHMFWFS in addressing the filter rank selection and local optima stagnation problems in HFS, as well as its ability to select optimal features from SDP datasets while maintaining or enhancing the performance of SDP models. To conclude, the proposed RAHMFWFS improved the prediction performance of SDP models across the selected datasets compared to existing state-of-the-art HFS methods.
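The first stage, combining several filter rankings into one list, can be sketched with a simple mean-position (Borda-style) aggregation. This is a stand-in technique, not the paper's novel aggregation method, which the abstract does not describe. The function name `aggregate_ranks`, the alphabetical tie-break, and the metric names are hypothetical.

```python
def aggregate_ranks(rank_lists):
    """Combine several rankings of the same features by summed position
    (lower total = better); ties are broken alphabetically."""
    features = list(rank_lists[0])
    expected = set(features)
    totals = {f: 0 for f in features}
    for ranking in rank_lists:
        if set(ranking) != expected or len(ranking) != len(features):
            raise ValueError("every rank list must order the same features")
        for position, f in enumerate(ranking):
            totals[f] += position
    return sorted(features, key=lambda f: (totals[f], f))

# Hypothetical rank lists, best feature first, from three filter methods.
chi_square = ["loc", "cyclomatic", "halstead", "comments"]
info_gain = ["cyclomatic", "loc", "comments", "halstead"]
relief = ["loc", "halstead", "cyclomatic", "comments"]

print(aggregate_ranks([chi_square, info_gain, relief]))
# -> ['loc', 'cyclomatic', 'halstead', 'comments']
```

Summed positions give every filter an equal vote, so no single filter has to be chosen up front. A weighted variant would be a natural extension if some filters are trusted more than others.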

Software Defect Prediction Using Wrapper Feature Selection Based on Dynamic Re-Ranking Strategy

Symmetry ◽ 10.3390/sym13112166 ◽ 2021 ◽ Vol 13 (11) ◽ pp. 2166

Author(s): Abdullateef Oluwagbemiga Balogun ◽ Shuib Basri ◽ Luiz Fernando Capretz ◽ Saipunidzam Mahamad ◽ Abdullahi Abubakar Imam ◽ ...

Keyword(s): Feature Selection ◽ Computational Cost ◽ High Dimensionality ◽ Computational Time ◽ Defect Prediction ◽ Feature Subset ◽ Sequential Search ◽ Software Defect Prediction ◽ Local Maxima ◽ Software Defect

Finding defects early in a software system is a crucial task, as it leaves adequate time for fixing such defects using available resources. Strategies such as symmetric testing have proven useful; however, their inability to differentiate incorrect implementations from correct ones is a drawback. Software defect prediction (SDP) is another feasible method for detecting defects early. However, high dimensionality, a data quality problem, has a detrimental effect on the predictive capability of SDP models. Feature selection (FS) has been used as a feasible solution to the high dimensionality issue in SDP. According to current literature, the two basic forms of FS approaches are filter-based feature selection (FFS) and wrapper-based feature selection (WFS). Of the two, WFS approaches have been deemed superior. However, WFS methods have a high computational cost due to the unknown number of executions required for feature subset search, evaluation, and selection. This characteristic of WFS often leads to overfitting of classifier models because the search is easily trapped in local maxima. The trapping of the WFS subset evaluator in local maxima can be overcome by using an effective search method in the evaluation process. Hence, this study proposes an enhanced WFS method that dynamically and iteratively selects features. The proposed enhanced WFS (EWFS) method is based on incrementally selecting features while considering previously selected features in its search space. The novelty of EWFS lies in enhancing the subset evaluation process of WFS methods by deploying a dynamic re-ranking strategy that iteratively selects germane features with few subset evaluation cycles while not compromising the prediction performance of the ensuing model. For evaluation, EWFS was deployed with Decision Tree (DT) and Naïve Bayes classifiers on software defect datasets with varying granularities. The experimental findings revealed that EWFS outperformed the existing metaheuristic and sequential search-based WFS approaches considered in this work. Additionally, EWFS selected fewer features in less computational time than the existing metaheuristic and sequential search-based WFS methods.
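For orientation, the sketch below shows a baseline wrapper: plain sequential forward selection that re-scores every remaining feature against the already selected ones in each round and counts how many subset evaluations it spends. It is not the EWFS re-ranking strategy, whose details the abstract does not give. The leave-one-out 1-nearest-neighbour evaluator, the names `loo_1nn_accuracy` and `forward_select`, and the toy data are assumptions chosen to keep the example self-contained.

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the feature indices in `subset`."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][k] - X[j][k]) ** 2 for k in subset)
            if d < best_d:
                best_d, best_j = d, j
        correct += y[best_j] == y[i]
    return correct / len(X)

def forward_select(X, y, evaluate=loo_1nn_accuracy):
    """Add one feature per round while the wrapper score strictly improves.
    Returns (selected indices, best score, number of subset evaluations)."""
    remaining = list(range(len(X[0])))
    selected, best_score, evaluations = [], 0.0, 0
    while remaining:
        scored = []
        for f in remaining:
            # Each candidate is judged together with the features already chosen.
            scored.append((evaluate(X, y, selected + [f]), f))
            evaluations += 1
        score, f = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score, evaluations

# Hypothetical toy modules: column 0 separates the classes, column 1 is noise.
X = [[1, 9], [2, 1], [3, 7], [8, 2], [9, 8], [10, 3]]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y))
```

The evaluation counter makes the cost concern visible: even this two-feature toy needs three classifier evaluations, and the count grows roughly quadratically with the number of features for plain forward selection. Stopping at the first non-improving round is also what makes such a search prone to local maxima.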
