Towards a software defect proneness model: feature selection

Vitaliy S. Yakovyna; Ivan I. Symets

doi:10.15276/aait.04.2021.5

Applied Aspects of Information Technology ◽

10.15276/aait.04.2021.5 ◽

2021 ◽

Vol 4 (4) ◽

pp. 354-365

Author(s):

Vitaliy S. Yakovyna ◽

Ivan I. Symets

Keyword(s):

Principal Component Analysis ◽

Feature Selection ◽

Random Forest ◽

Software Reliability ◽

Principal Component ◽

Component Analysis ◽

Support Vector ◽

Tree Classifier ◽

Code Metrics ◽

Software Code

This article focuses on improving static models of software reliability by using machine learning methods to select the software code metrics that most strongly affect reliability. The study used a merged dataset from the PROMISE Software Engineering repository, which contained test data for the software modules of five programs and twenty-one code metrics. For the prepared sample, the features that most affect software code quality were selected using the following feature selection methods: Boruta, Stepwise selection, Exhaustive Feature Selection, Random Forest Importance, LightGBM Importance, Genetic Algorithms, Principal Component Analysis, and the Xverse Python package. Based on voting over the results of these feature selection methods, a static (deterministic) model of software reliability was built, which relates the probability of a defect in a software module to the metrics of its code. This model includes the following code metrics: branch count of a program, McCabe’s lines of code and cyclomatic complexity, Halstead’s total number of operators and operands, intelligence, volume, and effort value. The effectiveness of the different feature selection methods was compared, in particular by studying the effect of the feature selection method on classification accuracy for the following classifiers: Random Forest, Support Vector Machine, k-Nearest Neighbors, Decision Tree classifier, AdaBoost classifier, and Gradient Boosting for classification. Using any of the feature selection methods increased classification accuracy by at least ten percent compared to the original dataset, which confirms the importance of this procedure for predicting software defects from metric datasets that contain a significant number of highly correlated software code metrics. The best prediction accuracy for most classifiers was reached using the set of features obtained from the proposed static model of software reliability. In addition, individual methods such as Autoencoder, Exhaustive Feature Selection, and Principal Component Analysis can also be used with an insignificant loss of classification and prediction accuracy.
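
The voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the voting rule, so the threshold `min_votes`, the helper `vote_features`, and the feature names and selector outputs below are all hypothetical and are not the paper's actual results.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Return features chosen by at least `min_votes` selection methods."""
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # each method votes at most once per feature
    kept = [name for name, count in votes.items() if count >= min_votes]
    # Order by vote count (descending), then by name for a stable result.
    return sorted(kept, key=lambda name: (-votes[name], name))


# Hypothetical outputs of three selectors (illustrative only).
selections = {
    "boruta": ["branch_count", "loc", "cyclomatic_complexity", "volume"],
    "random_forest_importance": ["loc", "volume", "effort", "branch_count"],
    "stepwise": ["loc", "effort", "comment_lines"],
}

selected = vote_features(selections, min_votes=2)
print(selected)
```

A stricter threshold keeps only features that nearly every method agrees on, while a looser one yields a larger and more redundant feature set.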

Multiclass classification of leukemia cancer data using Fuzzy Support Vector Machine (FSVM) with feature selection using Principal Component Analysis (PCA)

Journal of Physics Conference Series ◽

10.1088/1742-6596/1725/1/012012 ◽

2021 ◽

Vol 1725 ◽

pp. 012012

Author(s):

I R Fauzi ◽

Z Rustam ◽

A Wibowo

Keyword(s):

Principal Component Analysis ◽

Support Vector Machine ◽

Feature Selection ◽

Principal Component ◽

Component Analysis ◽

Multiclass Classification ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Cancer Data

Multivariate Analysis and Machine Learning for Ripeness Classification of Cape Gooseberry Fruits

Processes ◽

10.3390/pr7120928 ◽

2019 ◽

Vol 7 (12) ◽

pp. 928 ◽

Cited By ~ 2

Author(s):

Miguel De-la-Torre ◽

Omar Zatarain ◽

Himer Avila-George ◽

Mirna Muñoz ◽

Jimy Oblitas ◽

...

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Feature Selection ◽

Principal Component ◽

Component Analysis ◽

Support Vector ◽

Color Spaces ◽

Combination Methods ◽

Fruit Samples ◽

Cape Gooseberry

This paper explores five multivariate techniques for information fusion in sorting Cape gooseberry fruits by visual ripeness: principal component analysis, linear discriminant analysis, independent component analysis, eigenvector centrality feature selection, and multi-cluster feature selection. These techniques are applied to the concatenated channels of three color spaces (9 features in total): red, green, and blue (RGB); hue, saturation, and value (HSV); and lightness, red/green value, and blue/yellow value (L*a*b*). Machine learning techniques have been reported for sorting Cape gooseberry fruits by ripeness. Classifiers such as neural networks, support vector machines, and nearest neighbors discriminate fruit samples using different color spaces. Although the color spaces are equivalent up to a transformation, a few classifiers perform better because of differences in the pixel distribution of samples. Experimental results show that selection and combination of color channels allow classifiers to reach similar levels of accuracy; however, combination methods still require higher computational complexity. The highest level of accuracy was obtained using the seven-dimensional principal component analysis feature space.
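
As a rough illustration of the 9-feature vector mentioned above, the sketch below concatenates the RGB, HSV, and L*a*b* channels of a single color. It assumes sRGB input normalized to [0, 1] and a D65 reference white; the abstract does not say which conversion the authors used, and the function names are hypothetical.

```python
import colorsys

# sRGB (D65) to CIE XYZ matrix and D65 reference white (assumed here).
_RGB_TO_XYZ = (
    (0.4124564, 0.3575761, 0.1804375),
    (0.2126729, 0.7151522, 0.0721750),
    (0.0193339, 0.1191920, 0.9503041),
)
_WHITE = (0.95047, 1.0, 1.08883)


def _linearize(c):
    """Undo the sRGB gamma curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """Piecewise cube-root function from the CIE L*a*b* definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(r, g, b):
    """Convert one sRGB color to (L*, a*, b*)."""
    linear = [_linearize(c) for c in (r, g, b)]
    xyz = [sum(m * c for m, c in zip(row, linear)) for row in _RGB_TO_XYZ]
    fx, fy, fz = (_f(v / w) for v, w in zip(xyz, _WHITE))
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def color_features(r, g, b):
    """Concatenate RGB, HSV, and L*a*b* channels into one 9-value vector."""
    return [r, g, b, *colorsys.rgb_to_hsv(r, g, b), *rgb_to_lab(r, g, b)]


features = color_features(1.0, 1.0, 1.0)  # pure white
print(len(features), features)
```

Because the nine channels are deterministic transformations of the same three inputs, they are strongly dependent on one another, which is one reason a lower-dimensional projection such as PCA can retain most of the useful information.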

Feature Extraction and Classification for the Detection of Knee Joint Disorders using Random Forest Classifier

International Journal of Emerging Trends in Engineering Research ◽

10.30534/ijeter/2021/099102021 ◽

2021 ◽

Vol 9 (10) ◽

pp. 1348-1356

Keyword(s):

Principal Component Analysis ◽

Feature Selection ◽

Random Forest ◽

Knee Joint ◽

Principal Component ◽

Approximate Entropy ◽

Component Analysis ◽

Random Forest Classifier ◽

Invasive Technique ◽

Entropy Measures

A non-invasive technique using knee joint vibroarthrographic (VAG) signals can be used for the early diagnosis of knee joint disorders. Among the algorithms devised for the detection of knee joint disorders using VAG signals, algorithms based on entropy measures can provide better performance. In this work, the VAG signal is preprocessed using wavelet decomposition into sub-band signals. Features of the decomposed sub-bands, such as approximate entropy, sample entropy, and wavelet energy, are extracted as a quantified measure of the complexity of the signal. Feature selection based on Principal Component Analysis (PCA) is performed to select the significant features. The extracted features are then used to classify the VAG signal as normal or abnormal using a random forest classifier. The classifier provides better accuracy with feature selection using principal component analysis. The results show that the classifier is able to classify the signal with an accuracy of 87%, an error rate of 0.13, a sensitivity of 0.874, and a specificity of 0.777.
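
The four reported figures are standard confusion-matrix quantities. The sketch below shows how they relate to one another; the counts are hypothetical and are not taken from the paper, and the function name is an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,       # complement of accuracy
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# Hypothetical counts: 50 abnormal and 50 normal test signals.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

Note that the error rate is always one minus the accuracy, which matches the reported pair of 87% and 0.13.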

Detection of Knee Joint Disorders using SVM Classifier

International Journal of Scientific Research in Science and Technology ◽

10.32628/ijsrst218535 ◽

2021 ◽

pp. 261-271

Author(s):

Alphonsa Salu S. J. ◽

Jeraldin Auxillia D

Keyword(s):

Principal Component Analysis ◽

Feature Selection ◽

Knee Joint ◽

Principal Component ◽

Approximate Entropy ◽

Component Analysis ◽

Invasive Technique ◽

Support Vector ◽

Svm Classifier ◽

Entropy Measures

A non-invasive technique using knee joint vibroarthrographic (VAG) signals can be used for the early diagnosis of knee joint disorders. Among the algorithms devised for the detection of knee joint disorders using VAG signals, algorithms based on entropy measures can provide better performance. In this work, the VAG signal is preprocessed using wavelet decomposition into sub-band signals. Features of the decomposed sub-bands, such as approximate entropy, sample entropy, and wavelet energy, are extracted as a quantified measure of the complexity of the signal. Feature selection based on Principal Component Analysis (PCA) is performed to select the significant features. The extracted features are then used to classify the VAG signal as normal or abnormal using a support vector machine. The classifier provides better accuracy with feature selection using principal component analysis. The results show that the classifier was able to classify the signal with an accuracy of 82.6%, an error rate of 0.174, a sensitivity of 1.0, and a specificity of 0.888.
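
Approximate entropy, one of the features named above, can be sketched as follows. This is the textbook ApEn(m, r) definition with self-matches included, not necessarily the variant used by the authors; the embedding dimension `m = 2`, the absolute tolerance `r = 0.2`, and the toy signals are assumptions.

```python
import math
import random


def approximate_entropy(signal, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) with an absolute tolerance `r`."""
    n = len(signal)

    def phi(dim):
        templates = [signal[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for a in templates:
            # Count templates within Chebyshev distance r (self-match included,
            # so the count is never zero and the logarithm is always defined).
            matches = sum(
                1 for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


regular = [0.0, 1.0] * 50                   # perfectly periodic toy signal
rng = random.Random(0)                      # fixed seed for repeatability
noisy = [rng.random() for _ in range(100)]  # uniform noise in [0, 1)

apen_regular = approximate_entropy(regular)
apen_noisy = approximate_entropy(noisy)
print(apen_regular, apen_noisy)
```

A regular signal yields a value close to zero, while an irregular one yields a larger value, which is what makes the measure usable as a complexity feature. In practice the tolerance is often scaled by the standard deviation of the signal rather than fixed.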

Human activity recognition from smart watch sensor data using a hybrid of principal component analysis and random forest algorithm

Measurement and Control ◽

10.1177/0020294018813692 ◽

2018 ◽

Vol 52 (1-2) ◽

pp. 37-45 ◽

Cited By ~ 12

Author(s):

Serkan Balli ◽

Ensar Arif Sağbaş ◽

Musa Peker

Keyword(s):

Principal Component Analysis ◽

Random Forest ◽

Principal Component ◽

Component Analysis ◽

Sensor Data ◽

Support Vector ◽

Data Set ◽

Smart Watch ◽

Human Movements ◽

Daily Physical Activities

Background: Detecting human movements is an important task in various areas such as healthcare, fitness, and eldercare. It is now possible to achieve this aim using mobile applications. These applications give users, doctors, and related persons a better understanding of daily physical activities. Following users' activities in their daily life can also encourage various useful habits. In addition, dangerous events such as falls of elderly people or young children can be identified and the necessary precautions taken as soon as possible. Classification of human motions with motion sensor data is among the current topics of study. Smart watches have these sensors built in. Thus, it is possible to follow the activities of a user carrying only a smart watch. Methods: The purpose of this work is to detect human movements using smart watch sensor data and machine learning methods. The data are obtained from the accelerometer, gyroscope, step counter, and heart rate sensors of the smart watch. The obtained data were divided into 2 s windows, and a data set containing 500 patterns for each class was created. Results and Discussion: After the features were determined, the data set to which principal component analysis had been applied was classified by random forest, support vector machine, C4.5, and k-nearest neighbor methods, and their performances were compared. The most successful result was obtained from the random forest method.
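
The windowing step can be sketched as below. The abstract gives only the 2 s window length, so the sampling rate, the non-overlapping layout, and the per-window statistics are assumptions, and `window_features` is a hypothetical helper.

```python
from statistics import mean, pstdev


def window_features(samples, rate_hz, window_s=2.0):
    """Split one sensor channel into fixed windows and summarize each one."""
    size = int(rate_hz * window_s)  # samples per window
    features = []
    # Non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, len(samples) - size + 1, size):
        window = samples[start:start + size]
        features.append(
            (mean(window), pstdev(window), min(window), max(window))
        )
    return features


# Toy channel: 10 samples at an assumed 2 Hz gives two full 2 s windows.
channel = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = window_features(channel, rate_hz=2)
print(rows)
```

Each returned tuple would form part of one pattern in the data set; with several sensor channels, the per-channel tuples for the same window would be concatenated before applying PCA and a classifier.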

Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest

Genes & Genomics ◽

10.1007/s13258-021-01057-4 ◽

2021 ◽

Author(s):

Weiwen Zhang ◽

Lianglun Cheng ◽

Guoheng Huang

Keyword(s):

Principal Component Analysis ◽

Random Forest ◽

Population Stratification ◽

Principal Component ◽

Component Analysis ◽

Kernel Principal Component Analysis ◽

Fine Scale ◽

Scale Population

Feature Selection for Classification using Principal Component Analysis and Information Gain

Expert Systems with Applications ◽

10.1016/j.eswa.2021.114765 ◽

2021 ◽

Vol 174 ◽

pp. 114765 ◽

Cited By ~ 1

Author(s):

Erick Odhiambo Omuya ◽

George Onyango Okeyo ◽

Michael Waema Kimwele

Keyword(s):

Principal Component Analysis ◽

Feature Selection ◽

Information Gain ◽

Principal Component ◽

Component Analysis ◽

Selection For

Longitudinal Crack Detection Approach Based on Principal Component Analysis and Support Vector Machine for Slab Continuous Casting

steel research international ◽

10.1002/srin.202100168 ◽

2021 ◽

Author(s):

Haiyang Duan ◽

Jingjing Wei ◽

Lin Qi ◽

Xudong Wang ◽

Yu Liu ◽

...

Keyword(s):

Principal Component Analysis ◽

Support Vector Machine ◽

Continuous Casting ◽

Crack Detection ◽

Longitudinal Crack ◽

Principal Component ◽

Component Analysis ◽

Support Vector ◽

Slab Continuous Casting ◽

Detection Approach

Prediction of China’s Energy Consumption Based on Robust Principal Component Analysis and PSO-LSSVM Optimized by the Tabu Search Algorithm

Energies ◽

10.3390/en12010196 ◽

2019 ◽

Vol 12 (1) ◽

pp. 196 ◽

Cited By ~ 3

Author(s):

Lihui Zhang ◽

Riletu Ge ◽

Jianxue Chai

Keyword(s):

Principal Component Analysis ◽

Energy Consumption ◽

Tabu Search ◽

Industrial Structure ◽

Principal Component ◽

Component Analysis ◽

Support Vector ◽

Forecasting Model ◽

Robust Principal Component Analysis ◽

Consumption Structure

China’s energy consumption issues are closely associated with global climate issues, and the scale of energy consumption, peak energy consumption, and consumption investment are all the focus of national attention. In order to forecast China's energy consumption accurately, this article first selected GDP, population, industrial structure and energy consumption structure, energy intensity, total imports and exports, fixed asset investment, energy efficiency, urbanization, the level of consumption, and fixed investment in the energy industry as a preliminary set of factors. Secondly, we corrected the traditional principal component analysis (PCA) algorithm from the perspective of eliminating “bad points” and then judged a “bad spot” sample based on signal reconstruction ideas. On this basis, we put forward a robust principal component analysis (RPCA) algorithm and chose the first five principal components as the main factors affecting energy consumption, including GDP, population, industrial structure and energy consumption structure, and urbanization. Then, we applied the Tabu search (TS) algorithm to the least squares support vector machine (LSSVM) optimized by the particle swarm optimization (PSO) algorithm to forecast China’s energy consumption. We collected data from 1996 to 2010 as a training set and from 2010 to 2016 as the test set. For easy comparison, the sample data was also input into the LSSVM algorithm and the PSO-LSSVM algorithm. We used statistical indicators including the goodness-of-fit determination coefficient (R2), the root mean square error (RMSE), and the mean radial error (MRE) to compare the training results of the three forecasting models, which demonstrated that the proposed TS-PSO-LSSVM forecasting model had higher prediction accuracy, generalization ability, and training speed. Finally, the TS-PSO-LSSVM forecasting model was applied to forecast the energy consumption of China from 2017 to 2030. According to the predictions, China's energy consumption increases gradually from 2017 to 2030 and will break through 6000 million tons in 2030. However, the growth rate is gradually slowing, and China’s energy consumption economy will shift to a state of diminishing returns around 2026, which suggests that China should put more emphasis on the field of energy investment.
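
Two of the comparison indicators named above, R2 and RMSE, can be computed as in the following sketch. It uses the common textbook definitions, which may differ in detail from the authors' implementation, and the sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values, not data from the paper.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print(rmse(actual, predicted), r_squared(actual, predicted))
```

A lower RMSE and an R2 closer to 1 both indicate a better fit, so a model comparison of the kind described would rank candidates on these two values together.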
