Correlation-based feature subset selection technique for web spam classification

2018 ◽  
Vol 13 (4) ◽  
pp. 363
Author(s):  
Surender Singh ◽  
Ashutosh Kumar Singh
Data Science ◽  
2019 ◽  
pp. 51-68 ◽  
Author(s):  
Agnip Dasgupta ◽  
Ardhendu Banerjee ◽  
Aniket Ghosh Dastidar ◽  
Antara Barman ◽  
Sanjay Chakraborty

2015 ◽  
Vol 25 (09n10) ◽  
pp. 1531-1550 ◽  
Author(s):  
Kehan Gao ◽  
Taghi M. Khoshgoftaar ◽  
Amri Napolitano

Defect prediction is an important process activity frequently used for improving the quality and reliability of software products. Defect prediction results provide a list of fault-prone modules which are necessary in helping project managers better utilize valuable project resources. In the software quality modeling process, high dimensionality and class imbalance are the two potential problems that may exist in data repositories. In this study, we investigate three data preprocessing approaches, in which feature selection is combined with data sampling, to overcome these problems in the context of software quality estimation. These three approaches are: Approach 1 — sampling performed prior to feature selection, but retaining the unsampled data instances; Approach 2 — sampling performed prior to feature selection, retaining the sampled data instances; and Approach 3 — sampling performed after feature selection. A comparative investigation is presented for evaluating the three approaches. In the experiments, we employed three sampling methods (random undersampling, random oversampling, and synthetic minority oversampling), each combined with a filter-based feature subset selection technique called correlation-based feature selection. We built the defect prediction models using five common classification algorithms. The case study was based on software metrics and defect data collected from multiple releases of a real-world software system. The results demonstrated that the type of sampling methods used in data preprocessing significantly affected the performance of the combination approaches. It was found that when the random undersampling technique was used, Approach 1 performed better than the other two approaches. However, when the feature selection technique was used in conjunction with an oversampling method (random oversampling or synthetic minority oversampling), we strongly recommended Approach 3.


2019 ◽  
Vol 8 (2) ◽  
pp. 3316-3322

Huge amount of Healthcare data are produced every day from the various health care sectors. The accumulated data can be effectively analyzed to identify people's risk from chronic diseases. The process of predicting the presence or absence of the disease and also to diagnosing the various disease using the historical medical data is known as Health Care Analytics. Health care analytics will improve patient care and also the harness practice of medical practitioner. The feature selection is considered as a core aspect of the machine learning which hugely contribute towards the performance of the machine learning model. In this paper symmetry based feature subset selection is proposed to select the optimal features from the Health care data which contribute towards the prediction outcome. The Multilayer perceptron algorithm(MLP) used as a classifier which will predict the outcome by using the features which are selected from the Symmetry-based feature subset selection technique. The chronic disease dataset Diabetes, Cancer, Breast Cancer, and Heart Disease data set accumulated from UCI repository is used to conduct the experiment. The experimental results demonstrate that the proposed hybrid combination of feature selection technique and the multilayer perceptron outperforms in accuracy compare to the existing approaches.


Sign in / Sign up

Export Citation Format

Share Document