Feature Selection Approach for Improving the Accuracy of Software Bug Prediction

Author(s):  
Emad Kaen and Abdullah Algarni

Artificial intelligence and its branches, such as Machine Learning (ML) and Deep Learning, have recently advanced and grown in many vital fields, including robotics, smart cars, smart cities, health care, and software engineering. Software bug prediction is one of the most important uses of ML in software engineering. Feature selection is an ML method that aims to reduce the set of features used for building models. In this paper, we propose using the Chi-Square feature selection method to calculate feature importance and then building ML models, first with the top ten most important features and second with the top five, based on three well-known ML classification algorithms: Support Vector Machine (SVM), Naïve Bayes (NB), and Linear Discriminant Analysis (LDA). We also further explore the effectiveness of a new metric, code smell intensity. Compared with the baseline, our approach improved average accuracy across nine datasets by up to 5.12%, 4.15%, and 1% for the NB, SVM, and LDA classifiers, respectively.
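As a rough illustration of chi-square feature ranking followed by top-k selection, here is a minimal pure-Python sketch. It is not the authors' implementation: it assumes categorical (here binary) features, and the feature names, toy data, and the helper names `chi_square_score` and `top_k_features` are all hypothetical.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between one categorical feature and the class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Expected cell count if the feature and the label were independent.
            expected = f_count * l_count / n
            diff = observed.get((f_value, l_value), 0) - expected
            score += diff * diff / expected
    return score


def top_k_features(columns, labels, k):
    """Rank named feature columns by chi-square score and keep the k highest."""
    ranked = sorted(
        columns,
        key=lambda name: chi_square_score(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: 1 = buggy module, 0 = clean module.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "high_complexity": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the label
    "long_method":     [1, 1, 1, 0, 1, 0, 0, 0],  # partly aligned with the label
    "has_comments":    [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}
selected = top_k_features(columns, labels, k=2)
```

The selected columns would then be passed to a classifier such as NB, SVM, or LDA; with real-valued software metrics, the features would first need to be discretized or scored with a chi-square variant suited to non-negative counts.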

2021 ◽  
Vol 11 ◽  
Author(s):  
Qi Wan ◽  
Jiaxuan Zhou ◽  
Xiaoying Xia ◽  
Jianfeng Hu ◽  
Peng Wang ◽  
...  

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2-weighted imaging (T2WI).
Material and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), the precision-recall plot, and the Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches.
Results: The 3D features were significantly superior to the 2D features, yielding many more machine learning combinations with an AUC greater than 0.7 in both the validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) performed relatively better. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed results similar to those of the 3D features. Incorporating clinical features with the 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively.
Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because more machine learning algorithm combinations achieve better performance with them. The feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
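The reported total of 1260 models is the product of the option counts given in the abstract (3 × 2 × 3 × 10 × 7). The sketch below enumerates such a grid and adds a plain implementation of the Matthews Correlation Coefficient. The abstract does not name the individual normalization, reduction, or selection options, so the option names here are placeholders, and returning 0 for an empty marginal is one common convention rather than something taken from the paper.

```python
from itertools import product
from math import sqrt

# Counts come from the abstract; the option names are placeholders.
normalizers = [f"norm_{i}" for i in range(3)]
reducers = [f"reduce_{i}" for i in range(2)]
selectors = [f"select_{i}" for i in range(3)]
classifiers = [f"clf_{i}" for i in range(10)]
feature_counts = list(range(3, 10))  # 3 through 9 inclusive, i.e. 7 settings

# Every combination of one option per pipeline stage.
pipelines = list(product(normalizers, reducers, selectors, classifiers, feature_counts))
n_pipelines = len(pipelines)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when any row or column total is zero
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it a useful complement to AUC on imbalanced test sets.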


Repositor ◽  
2019 ◽  
Vol 1 (1) ◽  
pp. 1
Author(s):  
Hendra Saputra ◽  
Setio Basuki ◽  
Mahar Faiqurahman

Abstract: Android malware has grown significantly as the platform has matured and the variety of techniques used in Android development has increased. Machine learning can now be used to model the patterns of static and dynamic features of Android malware. To classify malware types accurately, the researchers link the features of an application to the features required by each malware category. The categories used are malware types that are widely in circulation today. A Support Vector Machine (SVM) is used to classify the malware type, specifically a multi-class one-against-one SVM with an RBF kernel. The features used for classification are Permission and Broadcast Receiver. Feature selection is applied in an attempt to improve classification accuracy, using Correlation-based Feature Selection (CFS), Gain Ratio (GR), and Chi-Square (CHI). The results with feature selection are evaluated alongside the results without it. Classification accuracy was 90.83% with CFS, 91.25% with GR and CHI, and 91.67% without feature selection. The tests show that Permission and Broadcast Receiver can be used to classify malware types, but the feature selection methods used yield accuracy slightly below that obtained without feature selection.
Keywords: Android malware classification, feature selection, SVM, multi-class one-against-one SVM
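The one-against-one scheme mentioned in this abstract trains one binary classifier per pair of classes, k(k-1)/2 in total for k classes, and predicts by majority vote. The sketch below shows only the voting step. The malware category names, the one-dimensional prototypes, and the `nearest_prototype` decider are hypothetical stand-ins for trained pairwise RBF-kernel SVMs, which the abstract does not describe in detail.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_decide, sample):
    """Majority vote over every unordered pair of classes.

    pairwise_decide(a, b, sample) must return either a or b; in a real system
    it would wrap the binary SVM trained for that pair.
    """
    votes = Counter(pairwise_decide(a, b, sample) for a, b in combinations(classes, 2))
    # Ties are broken by the order of `classes`, an arbitrary choice for this sketch.
    return max(classes, key=lambda c: votes[c])


# Hypothetical stand-in for trained pairwise SVMs: choose the class whose
# prototype value is nearer to a one-dimensional sample.
prototypes = {"adware": 0.0, "ransomware": 5.0, "spyware": 10.0}


def nearest_prototype(a, b, sample):
    return a if abs(sample - prototypes[a]) <= abs(sample - prototypes[b]) else b


classes = list(prototypes)
n_pairs = len(list(combinations(classes, 2)))  # k * (k - 1) / 2 pairwise deciders
prediction = one_vs_one_predict(classes, nearest_prototype, sample=6.2)
```

With three classes there are three pairwise deciders; for the sample above, two of them vote for the same class, which becomes the prediction.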


2021 ◽  
pp. 134-146
Author(s):  
Surbhi Sharma ◽  
Anthony J. Bustamante

In this paper, we focus on improving the performance of a speech-based uni-modal depression detection system, which is non-invasive and involves low cost and computation time in comparison to multi-modal systems. The performance of a decision system depends mainly on the choice of feature selection method and classifier. We investigated combinations of four well-known multivariate filter methods (minimum Redundancy Maximum Relevance, Scatter Ratio, Mahalanobis Distance, and Fast Correlation Based feature selection) and four well-known classifiers (k-Nearest Neighbour, Linear Discriminant Classifier (LDC), Decision Tree, and Support Vector Machine) to obtain a minimal set of relevant and non-redundant features that improves performance. This speeds up the acquisition of features from speech and allows the decision system to be built with low cost and complexity. Experimental results on the high- and low-level features of recent work on the DAIC-WOZ dataset demonstrate the superior performance of the combination of Scatter Ratio and LDC, as well as that of Mahalanobis Distance and LDC, in comparison to other combinations and existing speech-based depression results, for both gender-independent and gender-based studies. These combinations also outperformed a few multimodal systems. Low-level features were found to be more discriminatory and to provide a better F1 score.


2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
M. A. Duarte-Mermoud ◽  
N. H. Beltrán ◽  
S. A. Salah

Recently, a new crossover technique for genetic algorithms has been proposed. The technique, called probabilistic adaptive crossover (PAX), estimates the probability distribution of the population and stores information about the best and worst solutions of the problem being solved in a probability vector. This paper reports the use of the technique to classify Chilean wines based on chromatograms obtained from HPLC. PAX is used in the first stage as the feature selection method, and then support vector machines (SVM) and linear discriminant analysis (LDA) are used as classifiers. The results are compared with those obtained using the standard uniform (discrete) crossover technique and a variant of PAX called mixed crossover.


Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, high classification accuracy is difficult to achieve because of high dimensionality and irrelevant, noisy data; such data also contain many gene expression measurements but few samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods, using Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used to select the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. Performance metrics such as Accuracy, Recall, Precision, and F1-Score are used for evaluation. The experimental results show that the proposed method outperforms the other feature selection methods.


2012 ◽  
Vol 532-533 ◽  
pp. 1191-1195 ◽  
Author(s):  
Zhen Yan Liu ◽  
Wei Ping Wang ◽  
Yong Wang

This paper introduces the design of a text categorization system based on Support Vector Machine (SVM). It analyzes the high-dimensional nature of text data and the reasons why SVM is suitable for text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training, and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built with different kinds of feature selection and feature extraction methods. No research indicates which method is best, so many feature selection and feature extraction methods are implemented in this system. For a specific classification task, every feature selection method and every feature extraction method is tested, and the best-performing set of methods is adopted.


Author(s):  
Gang Liu ◽  
Chunlei Yang ◽  
Sen Liu ◽  
Chunbao Xiao ◽  
Bin Song

A feature selection method based on mutual information and support vector machine (SVM) is proposed to eliminate redundant features and improve classification accuracy. First, the local correlation between features and the overall correlation are calculated using mutual information. The correlation reflects the information inclusion relationship between features, so features are evaluated and redundant features are eliminated by analyzing it. Subsequently, the concept of mean impact value (MIV) is defined, and the degree of influence of input variables on output variables for the SVM network is calculated based on MIV. The importance weights of the features, as described by MIV, are sorted in descending order. Finally, the SVM classifier is used to perform feature selection according to the classification accuracy of feature combinations, taking the MIV order of the features as a reference. Simulation experiments on three standard UCI data sets show that the method not only effectively reduces the feature dimension and achieves high classification accuracy, but also ensures good robustness.
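To make the redundancy check concrete, the following sketch computes mutual information between two discrete feature columns in pure Python. It illustrates only the general quantity I(X; Y), not the paper's specific local and overall correlation measures or its MIV step. The toy columns are hypothetical, and continuous features would need to be discretized first.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts to limit rounding error.
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi


# Hypothetical toy feature columns.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]  # copy of a: fully redundant
c = [0, 1, 0, 1, 0, 1, 0, 1]  # statistically independent of a

mi_redundant = mutual_information(a, b)
mi_independent = mutual_information(a, c)
```

A feature pair whose mutual information approaches the entropy of either feature carries nearly the same information, so one of the two is a candidate for elimination; a value near zero suggests the features are complementary.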


2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, a generalization of the feature selection method based on the depended degree of attribute in rough sets. We compared the feature selection method with established methods (the depended degree, chi-square, information gain, Relief-F, and symmetric uncertainty) and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended degree of attribute based method in robustness and applicability. Moreover, the method was comparable to the other four commonly used methods. More importantly, the method can exhibit the inherent classification difficulty of different gene expression datasets, indicating the inherent biology of specific cancers.


Author(s):  
Sarfaraz Masood ◽  
Khwaja Wisal ◽  
Om Pal ◽  
Chanchal Kumar

Parkinson’s disease (PD) is a highly common neurological disease affecting a large population worldwide. Several studies revealed that the degradation of voice is one of its initial symptoms, which is also known as dysarthria. In this work, we attempt to explore and harness the correlation between various features in the voice samples observed in PD subjects. To do so, a novel two-level ensemble-based feature selection method has been proposed, whose results were combined with an MLP based classifier using K-fold cross-validation as the re-sampling strategy. Three separate benchmark datasets of voice samples were used for the experimentation work. Results strongly suggest that the proposed feature selection framework helps in identifying an optimal set of features which further helps in highly accurate identification of PD patients using a Multi-Layer Perceptron from their voice samples. The proposed model achieves an overall accuracy of 98.3%, 95.1% and 100% on the three selected datasets respectively. These results are significantly better than those achieved by a non-feature selection based option, and even the recently proposed chi-square based feature selection option.

