A comprehensive investigation of the impact of feature selection techniques on crashing fault residence prediction models

Author(s):  
Kunsong Zhao ◽  
Zhou Xu ◽  
Meng Yan ◽  
Tao Zhang ◽  
Dan Yang ◽  
...  
Author(s):  
Abhishek Bhattacharya ◽  
Radha Tamal Goswami ◽  
Kuntal Mukherjee ◽  
Nhu Gia Nguyen

Each Android application requests a set of permissions at installation time, and these permissions can serve as features for permission-based identification of Android malware. Recently, ensemble feature selection techniques have received increasing attention over conventional techniques in different applications. In this work, a cluster-based voted ensemble feature selection technique combining five base wrapper approaches from R libraries is proposed for identifying the most prominent set of features in the predictive modeling of Android malware. The proposed method preserves both desirable properties of an ensemble feature selector: accuracy and diversity. Moreover, five different data partitioning ratios are considered, and the impact of those ratios on the predictive model is measured using the coefficient of determination (R-squared) and root mean square error. The proposed strategy yields significantly better outcomes in terms of the number of selected features and classification accuracy.
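
The voting idea behind an ensemble feature selector can be illustrated with a minimal sketch. It is not the authors' cluster-based method: the function name `vote_features`, the vote threshold, and the permission names are all assumptions made for illustration, and the five base selections are hand-written stand-ins for the output of real wrapper selectors.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Return the features chosen by at least `min_votes` base selectors."""
    # Count each feature once per selector, even if a selector lists it twice.
    votes = Counter(f for subset in selections for f in set(subset))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical outputs of five base selectors (illustrative permission names).
base_selections = [
    ["SEND_SMS", "READ_CONTACTS", "INTERNET"],
    ["SEND_SMS", "INTERNET", "CAMERA"],
    ["SEND_SMS", "READ_CONTACTS"],
    ["INTERNET", "SEND_SMS", "RECORD_AUDIO"],
    ["READ_CONTACTS", "SEND_SMS", "INTERNET"],
]

# Keep a feature only when a majority (3 of 5) of the selectors agree on it.
selected = vote_features(base_selections, min_votes=3)
print(selected)
```

Raising `min_votes` favors agreement among selectors and shrinks the feature set; lowering it keeps more of the diversity contributed by individual selectors.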


2021 ◽  
pp. 139-158
Author(s):  
Yi-Chen Chung ◽  
Hsien-Ming Chou ◽  
Chih-Neng Hung ◽  
Chihli Hung

Abstract This research proposes an integrated framework for the use of textual and economic features to predict the exchange rate of the TWD (Taiwan dollar) against the RMB (Chinese Renminbi). The exchange rate is affected by the current economic situation and expectations for the future economic climate. Exchange rate forecasting studies focus mainly on overall economic indices and the actual exchange rate, but overlook the influence of news. This research considers both textual and economic factors and builds three basic prediction models, i.e. multiple linear regression (MLR), support vector regression (SVR), and Gaussian process regression (GPR) for the prediction of the RMB exchange rate. In addition to the three basic prediction models, this research uses ensemble learning and feature selection techniques to improve prediction performance. Our experiments demonstrate that textual features also play an important role in predicting the RMB exchange rate. The SVR model is shown to outperform the other models and the MLR model is shown to perform worst. The ensemble of three basic models performs better than its individual counterparts. Finally, the models which use feature selection techniques demonstrate improved results in general, and different feature selection techniques are shown to be more suitable for different prediction models. JEL classification numbers: D80, F31, F47. Keywords: Exchange rate prediction, Text mining, Ensemble learning, Time series forecasting.


2021 ◽  
Vol 15 (1) ◽  
pp. 1-15
Author(s):  
Behrooz Abbaszadeh ◽  
Cesar Alexandre Domingues Teixeira ◽  
Mustapha C.E. Yagoub

Background: Because about 30% of epileptic patients suffer from refractory epilepsy, an efficient automatic seizure prediction tool is in great demand to improve their quality of life. Methods: In this work, time-domain features discriminating preictal from interictal states were efficiently extracted from the intracranial electroencephalogram of twelve patients, i.e., six with temporal and six with frontal lobe epilepsy. The performance of three types of feature selection methods was compared using the Matthews correlation coefficient (MCC). Results: Kruskal-Wallis, a non-parametric approach, was found to perform better than the other approaches because it is a simple, less resource-consuming strategy that also maintains the highest MCC score. The impact of dividing the electroencephalogram signals into various sub-bands was investigated as well. The strong performance of Kruskal-Wallis suggests the importance of univariate features such as complexity and interquartile ratio (IQR), along with autoregressive (AR) model parameters and the maximum (MAX) cross-correlation, for efficiently predicting epileptic seizures. Conclusion: The proposed approach has the potential to be implemented on a low-power device by considering a few simple time-domain characteristics for a specific sub-band. It should be noted that, as there is not a great deal of literature on frontal lobe epilepsy, the results of this work can be considered promising.
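
For readers unfamiliar with the metric, the MCC can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, e.g. preictal (positive) vs. interictal (negative) windows.
mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
print(round(mcc, 4))
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction). Because it uses all four cells, it remains informative when the two classes are imbalanced.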


Author(s):  
Malka N. Halgamuge

The emergence of new technologies to incorporate and analyze data with high-performance computing has expanded our capability to accurately predict any incident. Supervised machine learning (ML) can be utilized for fast and consistent prediction and to better capture the underlying pattern of the data. We develop a prediction strategy, for the first time, using supervised ML to observe the possible impact of weak radiofrequency electromagnetic fields (RF-EMF) on human and animal cells without performing in-vitro laboratory experiments. We extracted laboratory experimental data from 300 peer-reviewed scientific publications (1990–2015) describing 1127 experimental case studies of human and animal cell responses to RF-EMF. We used domain knowledge, Principal Component Analysis (PCA), and the Chi-squared feature selection technique to select six optimal features for computation and cost-efficiency. We then developed grouping or clustering strategies to allocate these selected features into five different laboratory experiment scenarios. The dataset was tested with ten different classifiers, and the outputs were estimated using the k-fold cross-validation method. The assessment of a classifier’s prediction performance is critical for assessing its suitability. Hence, a detailed comparison of the percentage of the model accuracy (PCC), Root Mean Squared Error (RMSE), precision, sensitivity (recall), 1 − specificity, Area under the ROC Curve (AUC), and precision-recall (PRC Area) for each classification method was carried out. Our findings suggest that the Random Forest algorithm excels in all groups in terms of all performance measures and shows AUC = 0.903 where k-fold = 60. A robust correlation was observed between the specific absorption rate (SAR) and frequency, and between cumulative effect or exposure time and SAR×time (impact of accumulated SAR within the exposure time) of RF-EMF. In contrast, the relationship between frequency and exposure time was not significant. In future, with more experimental data, the sample size can be increased, leading to more accurate work.
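
The k-fold cross-validation mentioned above partitions the samples into k folds and holds each fold out once for testing. The sketch below shows only the index bookkeeping, under the assumption of contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper and not the tooling used in the study.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice the sample order is usually shuffled (and often stratified by class) before splitting; that step is omitted here to keep the sketch deterministic.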


Symmetry ◽  
2020 ◽  
Vol 12 (7) ◽  
pp. 1147 ◽  
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Saipunidzam Mahamad ◽  
Said J. Abdulkadir ◽  
Malek A. Almomani ◽  
...  

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, many empirical studies on the impact and effectiveness of FS methods on SDP models lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to relative study limitations such as small datasets, limited FS search methods, and unsuitable prediction models in the respective scope of studies. It is hence critical to conduct an extensive empirical study to address these contradictions to guide researchers and buttress the scientific tenacity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated based on accuracy and AUC values. Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used for statistical ranking of the studied FS methods. The experimental results showed that there is no single best FS method, as their respective performance depends on the choice of classifiers, performance evaluation metrics, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods, respectively, in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended as it outperforms the conventional SFS and LHS-based WFS methods.


2015 ◽  
Author(s):  
Muthukumaran Kasinathan ◽  
Lalita Bhanu Murthy Neti

Several change metrics and source code metrics have been introduced and proved to be effective in bug prediction. Researchers have performed comparative studies of bug prediction models built using the individual metrics as well as combinations of these metrics. In this paper, we investigate the impact of feature selection in bug prediction models by analyzing the misclassification rates of these models with and without feature selection in place. We conduct our experiments on five open source projects by considering numerous change metrics and source code metrics. This study aims to identify a reliable subset of metrics that is common to all projects.


2013 ◽  
Vol 22 (05) ◽  
pp. 1360010 ◽  
Author(s):  
HUANJING WANG ◽  
TAGHI M. KHOSHGOFTAAR ◽  
QIANHUI (ALTHEA) LIANG

Software metrics (features or attributes) are collected during the software development cycle. Metric selection is one of the most important preprocessing steps in the process of building defect prediction models and may improve the final prediction result. However, the addition or removal of program modules (instances or samples) can alter the subsets chosen by a feature selection technique, rendering the previously selected feature sets invalid. Very little research has been done considering both stability (or robustness) and defect prediction model performance together in the software engineering domain, despite the importance of both aspects when choosing a feature selection technique. In this paper, we test the stability and classification model performance of eighteen feature selection techniques as the magnitude of change to the datasets and the size of the selected feature subsets are varied. All experiments were conducted on sixteen datasets from three real-world software projects. The experimental results demonstrate that Gain Ratio shows the least stability while two different versions of ReliefF show the most stability, followed by the PRC- and AUC-based threshold-based feature selection techniques. Results also show that the signal-to-noise ranker performed moderately in terms of robustness and was the best ranker in terms of model performance. Finally, we conclude that while for some rankers stability and classification performance are correlated, this is not true for other rankers, and therefore performance according to one scheme (stability or model performance) cannot be used to predict performance according to the other.
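
One simple way to quantify the stability described above is the average pairwise overlap between the feature subsets selected on perturbed versions of a dataset. The sketch below uses the Jaccard index as an assumed stand-in; the abstract does not say which stability measure the authors used, and the metric names in `runs` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_stability(subsets):
    """Average Jaccard similarity over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets chosen by one ranker on three perturbed datasets.
runs = [
    {"loc", "cyclomatic", "fan_out"},
    {"loc", "cyclomatic", "churn"},
    {"loc", "cyclomatic", "fan_out"},
]
stability = mean_pairwise_stability(runs)
print(round(stability, 3))
```

A value near 1 means the technique keeps selecting the same metrics as modules are added or removed. Stability is computed independently of model performance, which is why the two can disagree for a given ranker.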

