Comparative Analysis of Selected Heterogeneous Classifiers for Software Defects Prediction Using Filter-Based Feature Selection Methods

Author(s):  
Abimbola G Akintola ◽  
Abdullateef Balogun ◽  
Fatimah B Lafenwa-Balogun ◽  
Hameed A Mojeed

Classification is a popular approach to predicting software defects. It involves categorizing modules, each represented by a set of metrics or code attributes, into fault-prone (FP) and non-fault-prone (NFP) classes by means of a classification model. However, low-quality, unreliable, redundant and noisy data negatively affect the process of discovering knowledge and useful patterns. Researchers therefore need to retrieve relevant data from huge records using feature selection methods. Feature selection is the process of identifying the most relevant attributes and removing redundant and irrelevant ones. In this study, the researchers investigated the effect of filter feature selection on classification techniques in software defect prediction. Ten publicly available datasets from the NASA Metric Data Program software repository were used. The most discriminatory attributes of each dataset were identified using Principal Component Analysis (PCA), CFS and FilterSubsetEval. The datasets were then classified by four classifiers chosen for their heterogeneity: Naïve Bayes from the Bayesian classifiers, KNN from the instance-based learners, the J48 decision tree from the tree-based classifiers, and Multilayer Perceptron from the neural network classifiers. The experimental results revealed that applying feature selection to datasets before classification improves software defect prediction and should be encouraged, and that Multilayer Perceptron with FilterSubsetEval had the best accuracy. It can be concluded that feature selection methods are capable of improving the performance of learning algorithms in software defect prediction.
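The abstract names CFS without spelling out how it scores a feature subset. Assuming it refers to correlation-based feature selection in Hall's formulation (an assumption, not something the abstract states), a subset of k features is scored by a merit that rewards correlation with the class and penalizes correlation among the features. The function name, parameter names and numbers below are illustrative only.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    Both correlations are assumed to be average absolute values in [0, 1].
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Two hypothetical 4-feature subsets: same relevance, different redundancy.
low_redundancy = cfs_merit(4, 0.5, 0.1)
high_redundancy = cfs_merit(4, 0.5, 0.8)
```

With equal feature-class correlation, the less redundant subset gets the higher merit, which is the behaviour a filter of this kind relies on when it searches over subsets.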

Author(s):  
Arshad Hashmi

In recent years, occupational stress mining has become a widely studied topic. The primary purpose of this study is to analyze filter feature selection methods for building an efficient occupational stress classification model. We propose and examine seven different filter feature selection techniques, including Chi-Square, Information Gain, Information Gain Ratio, Correlation, Principal Component Analysis, and Relief. The selected features are then used with popular classifiers, namely Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Gradient Boosted Trees (GBT), to detect occupational stress in the insurance sector. A survey-based primary psychological data set on occupational stress is used to evaluate the relative performance of these methods. The study demonstrates the significance of filter feature selection methods and shows how accurately they can help classify stress levels. Correlation-based feature selection with the SVM classifier obtained the best performance among the filter feature selection methods and classification models compared.
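As an illustration of one of the filters listed above, Information Gain scores a categorical feature by how much it reduces the entropy of the class labels. The sketch below is a minimal stdlib version, not the study's implementation. The toy survey answers and stress labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: four respondents, two stress levels.
stress = ["high", "high", "low", "low"]
overtime = ["yes", "yes", "no", "no"]      # separates the classes perfectly
department = ["a", "b", "a", "b"]          # carries no class information

gain_overtime = information_gain(overtime, stress)
gain_department = information_gain(department, stress)
```

A filter method would rank features by this score and keep the top-ranked ones before training any classifier.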


Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 2910
Author(s):  
Kei Suzuki ◽  
Tipporn Laohakangvalvit ◽  
Ryota Matsubara ◽  
Midori Sugaya

To our knowledge, human emotion estimation using an electroencephalogram (EEG) and heart rate variability (HRV) faces two main issues. The first is that measurement devices for physiological signals are expensive and not easy to wear. The second is that unnecessary physiological indexes have not been removed, which is likely to decrease the accuracy of machine learning models. In this study, we used a single-channel EEG sensor and a photoplethysmography (PPG) sensor, which are inexpensive and easy to wear. We collected data from 25 participants (18 males and 7 females) and used a deep learning algorithm to construct an emotion classification model based on the Arousal–Valence space, using several feature combinations obtained from physiological indexes selected according to our criteria, including our proposed feature selection methods. We then verified accuracy by applying stratified 10-fold cross-validation to the constructed models. The results showed that model accuracies reach 90% to 99% when the proposed feature selection methods are applied, which suggests that a small number of physiological indexes, even from inexpensive sensors, can be used to construct an accurate emotion classification model if an appropriate feature selection method is applied. Our results contribute to an emotion classification model that is more accurate, less costly and less time-consuming, and that has the potential to be applied in various areas.
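Stratified 10-fold cross-validation, as used for the accuracy verification above, keeps the class proportions roughly equal in every fold. The following is a minimal sketch of how such folds can be built, not the authors' code. The function name, the seed and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Return k lists of sample indices with classes spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical labels: 20 samples of one emotion class, 10 of another.
toy_labels = ["high_arousal"] * 20 + ["low_arousal"] * 10
folds = stratified_k_fold(toy_labels, k=10)
```

Each fold serves once as the test set while the other nine are used for training. Because every fold here holds two samples of the first class and one of the second, each test set preserves the 2:1 ratio of the full data.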


Author(s):  
Mohammad M. Masud ◽  
Latifur Khan ◽  
Bhavani Thuraisingham

This chapter applies data mining techniques to detect email worms. Email messages contain a number of different features, such as the total number of words in the message body or subject, the presence or absence of binary attachments, the type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature selection and dimension reduction. This step is necessary to remove noise and redundancy from the data. The feature selection technique is called Two-phase Selection (TPS), a novel combination of a decision tree and a greedy selection algorithm. The dimension reduction is performed by Principal Component Analysis. Second, the reduced data is used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms, and the results are compared with published results. The proposed TPS selection combined with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.


Author(s):  
Lidia S. Chao ◽  
Derek F. Wong ◽  
Philip C. L. Chen ◽  
Wing W. Y. Ng ◽  
Daniel S. Yeung

Ordinary feature selection methods select only the explicitly relevant attributes by filtering out the irrelevant ones, trading selection accuracy for execution time and complexity. In doing so, the hidden supportive information possessed by the irrelevant attributes may be lost, so these methods may miss some good combinations. We believe that attributes that are useless for the classification task by themselves may sometimes provide potentially useful supportive information to other attributes and thus benefit the classification task. A strategy that exploits them can minimize information loss and is therefore able to maximize classification accuracy, especially for datasets that contain hidden interactions among attributes. This paper proposes a feature selection methodology from a new angle: it selects not only the relevant features but also targets the potentially useful, falsely irrelevant attributes by measuring their supportive importance to other attributes. The empirical results validate the hypothesis by demonstrating that the proposed approach outperforms most state-of-the-art filter-based feature selection methods.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hetal Chauhan ◽  
Kirit Modi ◽  
Saurabh Shrivastava

Purpose: The COVID-19 pandemic is growing day by day and has affected lifestyles and economies worldwide. In the absence of a specific treatment, the only way to control the pandemic is to stop its spread, so early identification of affected persons is urgently needed. Diagnostic methods applied in hospitals are time-consuming, which delays the identification of positive patients. This study aims to develop a machine learning-based diagnosis model that can predict positive cases and help in decision-making.
Design/methodology/approach: In this research, the authors developed a diagnosis model based on an artificial neural network to check coronavirus positivity. The authors trained the model with clinically assessed symptoms, patient-reported symptoms, other medical histories and exposure data of the person. The authors explored filter-based feature selection methods such as Chi2, ANOVA F-score and Mutual Information to improve the performance of the classification model. The metrics used to evaluate the model are accuracy, precision, sensitivity and F1-score.
Findings: The authors obtained the highest classification performance with the model trained on features ranked by the ANOVA F-score (FS) method. The highest scores for accuracy, sensitivity, precision and F1-score of predictions are 0.93, 0.99, 0.94 and 0.93, respectively. The study reveals that the most relevant predictors for COVID-19 diagnosis are shortness of breath (SOB) severity, cough severity, SOB presence, cough presence, fatigue and number of days since symptom onset.
Originality/value: Treatment for COVID-19 is not available to date. The best way to control this pandemic is the isolation of positive persons, so it is necessary to identify positive persons at an early stage. The RT-PCR test used to check COVID-19 positivity is time-consuming, expensive and laborious. Current diagnosis methods used in hospitals demand more medical resources as coronavirus cases increase, which leads to a shortage of resources. The developed model offers a cheaper and faster solution, decreases the immediate need for medical resources and helps in decision-making.
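The ANOVA F-score used above to rank features compares how far the class means of a feature lie apart with how much the feature varies inside each class. A minimal sketch of the one-way F statistic follows. It is not the authors' code, and the feature name and values are made-up toy data.

```python
from collections import defaultdict

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a numeric feature against class labels."""
    groups = defaultdict(list)
    for x, y in zip(values, labels):
        groups[y].append(x)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(values) / n
    ss_between = 0.0
    ss_within = 0.0
    for group in groups.values():
        mean = sum(group) / len(group)
        ss_between += len(group) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in group)
    if ss_within == 0.0:
        # No spread inside any class: perfectly separated or constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical severity scores for three negative (0) and three positive (1) cases.
cough_severity = [1, 2, 3, 7, 8, 9]
diagnosis = [0, 0, 0, 1, 1, 1]
score = anova_f_score(cough_severity, diagnosis)
```

Features would be sorted by this score in descending order, and the top-ranked ones passed to the classifier.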


2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Hui-Qin Zou ◽  
Shuo Li ◽  
Ying-Hua Huang ◽  
Yong Liu ◽  
Rudolf Bauer ◽  
...  

Plants from the Asteraceae family are widely used as herbal medicines and food ingredients, especially in Asia. Therefore, authentication and quality control of these different Asteraceae plants are important for ensuring safety and efficacy for consumers. In recent decades, the electronic nose (E-nose) has been studied as an alternative approach. In this paper, we aim to develop a novel discriminative model by improving the radial basis function artificial neural network (RBF-ANN) classification model. Feature selection algorithms, including principal component analysis (PCA) and BestFirst + CfsSubsetEval (BC), were applied to improve the RBF-ANN models. The results show that the improved RBF-ANN models with lower-dimensional data retained the same classification accuracy (100%) as the original model with higher-dimensional data. This is the first time feature selection methods have been introduced to obtain valuable information on how to identify the more relevant MOS sensors; in this case, S1, S3, S4, S6, and S7 show better capability to distinguish these Asteraceae plants. This paper also gives insights for further research in this area, for instance, sensor array optimization and performance improvement of the classification model.


2020 ◽  
Vol 3 (1) ◽  
pp. 58-63
Author(s):  
Y. Mansour Mansour ◽  
Majed A. Alenizi

Email is currently the main communication method worldwide, as it has proven its efficiency. Phishing emails, on the other hand, are one of the major threats, resulting in significant losses estimated at billions of dollars. Phishing is a dynamic problem, a struggle between phishers and defenders in which the phishers have more flexibility in manipulating email features and evading anti-phishing techniques. Many solutions have been proposed to mitigate the impact of phishing emails on the targeted sectors, but none has achieved 100% detection and accuracy. As phishing techniques evolve, the solutions need to evolve and generalize in order to mitigate as much as possible. This article presents a new classification model based on a hybrid feature selection method that combines two common feature selection methods, Information Gain and Genetic Algorithm, and keeps only significant, high-quality features in the final classifier. The proposed hybrid approach achieved a 98.9% accuracy rate on a phishing email dataset comprising 8266 instances, an enhancement of almost 4%. Furthermore, the presented technique reduces the search space by reducing the number of selected features.


Author(s):  
Norsyela Muhammad Noor Mathivanan ◽  
Nor Azura Md.Ghani ◽  
Roziah Mohd Janor

The curse of dimensionality and the empty space phenomenon have emerged as critical problems in text classification. One way of dealing with them is to apply a feature selection technique before fitting a classification model. This helps to reduce time complexity and sometimes increases classification accuracy. This study introduces a feature selection technique using K-Means clustering to overcome the weaknesses of traditional feature selection techniques such as principal component analysis (PCA), which require a lot of time to transform all the input data. The proposed technique decides which features to retain based on the significance value of each feature in a cluster. The study found that K-Means clustering helps to increase the efficiency of the KNN model for a large data set, while the KNN model without feature selection is suitable for a small data set. A comparison between K-Means clustering and PCA as feature selection techniques shows that the proposed technique is better than PCA, especially in terms of computation time. Hence, K-Means clustering is found to be helpful in reducing data dimensionality with less time complexity than PCA, without affecting the accuracy of the KNN model for high-frequency data.


2021 ◽  
Vol 14 (1) ◽  
pp. 40
Author(s):  
Hamed Naseri ◽  
E. Owen D. Waygood ◽  
Bobin Wang ◽  
Zachary Patterson ◽  
Ricardo A. Daziano

Indications of people’s environmental concern are linked to transport decisions and can provide great support for policymaking on climate change. This study aims to better predict individual climate change stage of change (CC-SoC) based on different features of transport-related behavior, General Ecological Behavior, New Environmental Paradigm, and socio-demographic characteristics. Together these sources result in over 100 possible features that indicate someone’s level of environmental concern. Such a large number of features may create several analytical problems, such as overfitting, accuracy reduction, and high computational costs. To address this, a new feature selection technique, named the Coyote Optimization Algorithm-Quadratic Discriminant Analysis (COA-QDA), is first proposed to find the optimal features to predict CC-SoC with the highest accuracy. Different conventional feature selection methods (Lasso, Elastic Net, Random Forest Feature Selection, Extra Trees, and Principal Component Analysis Feature Selection) are employed for comparison with COA-QDA. Afterward, eight classification techniques are applied to solve the prediction problem. Finally, a sensitivity analysis is performed to determine the most important features affecting the prediction of CC-SoC. The results indicate that COA-QDA outperforms conventional feature selection methods, increasing average test-data accuracy by 0.7% to 5.6%. Logistic Regression surpasses other classifiers with the highest prediction accuracy.


2021 ◽  
Vol 12 ◽  
Author(s):  
Burcu Bakir-Gungor ◽  
Osman Bulut ◽  
Amhar Jabeer ◽  
O. Ufuk Nalbantoglu ◽  
Malik Yousef

Human gut microbiota is a complex community of organisms including trillions of bacteria. While these microorganisms are considered essential regulators of our immune system, some of them can cause several diseases. In recent years, next-generation sequencing technologies have accelerated the discovery of human gut microbiota, and machine learning techniques have become popular for analyzing disease-associated metagenomics datasets. Type 2 diabetes (T2D) is a chronic disease that affects millions of people around the world. Since early diagnosis of T2D is important for effective treatment, there is a pressing need to develop a classification technique that can accelerate T2D diagnosis. In this study, using T2D-associated metagenomics data, we aim to develop a classification model to facilitate T2D diagnosis and to discover T2D-associated biomarkers. The sequencing data of T2D patients and healthy individuals were taken from a metagenome-wide association study and categorized into disease states. The sequencing reads were assigned to taxa, and the identified species were used to train and test our model. To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization, Maximum Relevance and Minimum Redundancy, Correlation Based Feature Selection, and the select K best approach. To test the performance of classification based on the features selected by the different methods, we used a random forest classifier with 100-fold Monte Carlo cross-validation. In our experiments, we observed that 15 commonly selected features have a considerable effect in terms of minimizing the microbiota used for the diagnosis of T2D, thus reducing time and cost. When we performed biological validation of the identified species, we found that some of them are known to be related to T2D development mechanisms, and we identified additional species as potential biomarkers. Additionally, we attempted to find subgroups of T2D patients using k-means clustering. In summary, this study utilizes several supervised and unsupervised machine learning algorithms to increase the diagnostic accuracy of T2D, investigates potential biomarkers of T2D, and finds out which subset of microbiota is more informative than other taxa by applying state-of-the-art feature selection methods.
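Monte Carlo cross-validation, mentioned above, is commonly understood as repeated random train/test splitting, with each repetition drawn independently. The sketch below reads "100-fold" as 100 such repetitions. The 10% test fraction, the seed and the function name are assumptions, since the abstract does not give them.

```python
import random

def monte_carlo_splits(n_samples, n_splits=100, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest train.
        yield shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 50 samples.
splits = list(monte_carlo_splits(50))
```

Unlike k-fold cross-validation, a sample may appear in the test set of several repetitions or of none. The reported performance is typically the average over all repetitions.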

