Liver Cancer Classification Model Using Hybrid Feature Selection Based on Class-Dependent Technique for the Central Region of Thailand

Liver cancer data typically comprise large, multidimensional datasets. A dataset with a very large number of features and multiple classes may include features that are irrelevant to pattern classification in machine learning. Feature selection can therefore improve the performance of a classification model and help it reach maximum classification accuracy. The aims of the present study were to find the best feature subset and to evaluate the classification performance of the predictive model. This paper proposes a hybrid feature selection approach that combines information gain and sequential forward selection based on the class-dependent technique (IGSFS-CD) for the liver cancer classification model. Two classifiers (decision tree and naïve Bayes) were used to evaluate feature subsets. The liver cancer datasets were obtained from the Cancer Hospital Thailand database. Three ensemble methods (ensemble classifiers, bagging, and AdaBoost) were applied to improve classification performance. The IGSFS-CD method provided good accuracy of 78.36% (sensitivity 0.7841 and specificity 0.9159) on LC_dataset-1. In addition, LC_dataset II delivered the best performance with an accuracy of 84.82% (sensitivity 0.8481 and specificity 0.9437). The IGSFS-CD method achieved better classification performance than the class-independent method. Furthermore, selecting the best feature subset could help reduce the complexity of the predictive model.
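
The wrapper stage described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the information gain pre-filter is omitted, the decision tree and naïve Bayes evaluators are swapped for a toy lookup-table accuracy computed on the training rows, and all names (`sequential_forward_selection`, `class_dependent_selection`, `ROWS`, `LABELS`) are hypothetical. "Class-dependent" is interpreted here as running the selection once per class on one-vs-rest labels, which is an assumption.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence

# Toy data (hypothetical): three discrete features, three classes.
ROWS = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1]]
LABELS = ["A", "A", "B", "B", "C", "C"]


def lookup_table_accuracy(rows: Sequence[Sequence[int]],
                          subset: Sequence[int],
                          labels: Sequence[int]) -> float:
    """Toy evaluator: group rows by their values on `subset` and predict the
    majority label of each group. This is an optimistic training accuracy."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[f] for f in subset)][y] += 1
    return sum(max(c.values()) for c in groups.values()) / len(labels)


def sequential_forward_selection(candidates: Sequence[int],
                                 score: Callable[[List[int]], float]) -> List[int]:
    """Greedily add the feature that raises the score most; stop when no
    single addition gives a strict improvement."""
    selected: List[int] = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining:
        trial_score, trial_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if trial_score <= best:
            break
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return selected


def class_dependent_selection(rows: Sequence[Sequence[int]],
                              labels: Sequence[str],
                              candidates: Sequence[int]) -> Dict[str, List[int]]:
    """Run forward selection once per class on one-vs-rest labels."""
    result: Dict[str, List[int]] = {}
    for cls in sorted(set(labels)):
        binary = [1 if y == cls else 0 for y in labels]
        result[cls] = sequential_forward_selection(
            candidates, lambda subset: lookup_table_accuracy(rows, subset, binary)
        )
    return result


if __name__ == "__main__":
    print(class_dependent_selection(ROWS, LABELS, [0, 1, 2]))
```

On this toy data each class ends up with its own subset, which is the point of a class-dependent scheme: a feature that separates one class well may be useless for another.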

A New Hybrid Feature Subset Selection Framework Based on Binary Genetic Algorithm and Information Theory

International Journal of Computational Intelligence and Applications ◽ 10.1142/s1469026819500202 ◽ 2019 ◽ Vol 18 (03) ◽ pp. 1950020 ◽ Cited By ~ 13

Author(s): Alok Kumar Shukla ◽ Pradeep Singh ◽ Manu Vardhan

Keyword(s): Genetic Algorithm ◽ Feature Selection ◽ Classification Accuracy ◽ B Cell Lymphoma ◽ Feature Subset Selection ◽ Classification Model ◽ Significant Feature ◽ Support Vector ◽ Feature Subset ◽ Binary Genetic Algorithm

The explosion of high-dimensional datasets in scientific repositories has encouraged interdisciplinary research on data mining, pattern recognition and bioinformatics. The fundamental problem for an individual Feature Selection (FS) method is to extract informative features for a classification model and to detect malignant disease at low computational cost. In addition, existing FS approaches overlook the fact that, for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS), for classification problems and addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method, Conditional Mutual Information Maximization (CMIM), selects the high-ranked feature subset, while the succeeding method, Binary Genetic Algorithm (BGA), accelerates the search for significant feature subsets. One merit of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without sacrificing classification accuracy on the reduced dataset when a learning model is applied to the selected feature subsets. The efficacy of the proposed FWFS method is examined with a Naive Bayes (NB) classifier, which serves as the fitness function. The effectiveness of the selected feature subset is evaluated using numerous classifiers on five biological datasets and five UCI datasets of varied dimensionality and number of instances. The experimental results indicate that the proposed method significantly reduces the number of features and outperforms the existing methods. For the microarray datasets, the lowest classification accuracy is 61.24% on the SRBCT dataset and the highest is 99.32% on Diffuse large B-cell lymphoma (DLBCL). For the UCI datasets, the lowest classification accuracy is 40.04% on Lymphography using k-nearest neighbor (k-NN) and the highest is 99.05% on Ionosphere using support vector machine (SVM).
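
The wrapper half of such a filter-wrapper scheme can be sketched as a small binary genetic algorithm. This is only an illustration under stated assumptions: the CMIM filter is not implemented, the Naive Bayes fitness is replaced by a hypothetical `toy_fitness` that rewards two "informative" positions and penalises subset size, and the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) are common defaults rather than the paper's settings.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # one bit per candidate feature: 1 = keep, 0 = drop

INFORMATIVE = (0, 3)  # hypothetical "useful" feature positions for the demo


def toy_fitness(mask: Mask) -> float:
    """Stand-in for classifier accuracy: reward informative bits, penalise size."""
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


def binary_ga(n_bits: int,
              fitness: Callable[[Mask], float],
              pop_size: int = 20,
              generations: int = 30,
              crossover_rate: float = 0.9,
              mutation_rate: float = 0.05,
              seed: int = 0) -> Tuple[Mask, List[float]]:
    """Return the best mask found and the best fitness seen per generation."""
    rng = random.Random(seed)
    population: List[Mask] = [
        tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(pop_size)
    ]
    history: List[float] = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        next_population: List[Mask] = [elite]  # elitism: the best mask survives
        while len(next_population) < pop_size:
            # Binary tournament selection for each parent.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            if n_bits > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_bits)  # one-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            else:
                child = parent_a
            # Bit-flip mutation.
            child = tuple(b ^ 1 if rng.random() < mutation_rate else b for b in child)
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best_mask, best_per_generation = binary_ga(6, toy_fitness, seed=1)
    print(best_mask, best_per_generation[-1])
```

In a filter-wrapper setting the mask would index only the top-ranked features kept by the filter, so the search space the genetic algorithm explores is much smaller than the full feature set.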

Optimal Feature Subset Selection for Imbalanced Class Data using SMOTE and Binary ALO Algorithm

International Journal of Engineering and Advanced Technology - Regular Issue ◽ 10.35940/ijeat.c4734.029320 ◽ 2020 ◽ Vol 9 (3) ◽ pp. 344-349

Keyword(s): Feature Selection ◽ Class Imbalance ◽ Classification Performance ◽ Selection Model ◽ Feature Subset Selection ◽ Feature Subset ◽ Spatial Features ◽ Imbalanced Classes ◽ Optimal Feature Subset ◽ Optimal Feature

Feature selection in multispectral, high-dimensional data is a difficult machine learning problem because of the imbalanced classes present in the data. Most existing feature selection schemes in the literature ignore class imbalance: they choose features from the classes with more instances and miss significant features of the classes with fewer instances. In this paper, the SMOTE concept is used to produce the required samples from minority classes. The feature selection model is formulated with the objective of reducing the number of features while improving classification performance. The model performs dimensionality reduction by opting for a subset of relevant spectral, textural and spatial features and eliminating redundant features. Binary ALO is used to solve the feature selection model and select the optimal features. The proposed ALO-SVM with the wrapper concept is applied to each potential solution obtained during the optimization step. The methodology is tested on a LANDSAT multispectral image.
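
The oversampling step can be sketched as follows. This is a minimal, standard-library illustration of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the paper's pipeline; the function name `smote`, its parameters, and the `MINORITY` points are hypothetical.

```python
import math
import random
from typing import List, Sequence

# Hypothetical minority-class samples with two numeric features each.
MINORITY = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]


def smote(minority: Sequence[Sequence[float]],
          n_synthetic: int,
          k: int = 3,
          seed: int = 0) -> List[List[float]]:
    """Create `n_synthetic` points, each on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    for point in smote(MINORITY, n_synthetic=3, k=2, seed=42):
        print(point)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies.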

MapReduce-based big data classification model using feature subset selection and hyperparameter tuned deep belief network

Scientific Reports ◽ 10.1038/s41598-021-03019-y ◽ 2021 ◽ Vol 11 (1)

Author(s): Surendran Rajendran ◽ Osamah Ibrahim Khalaf ◽ Youseef Alotaibi ◽ Saleh Alghamdi

Keyword(s): Feature Selection ◽ Big Data ◽ Selection Process ◽ Data Classification ◽ Deep Belief Network ◽ Feature Subset Selection ◽ Classification Model ◽ Feature Subset ◽ Belief Network ◽ Big Data Classification

In recent times, big data classification has become a hot research topic in various domains, such as healthcare, e-commerce, and finance. Including a feature selection process helps to improve big data classification and can be done with metaheuristic optimization algorithms. This study focuses on the design of a big data classification model using chaotic pigeon inspired optimization (CPIO)-based feature selection with an optimal deep belief network (DBN) model. The proposed model is executed in the Hadoop MapReduce environment to manage big data. Initially, the CPIO algorithm is applied to select a useful subset of features. In addition, the Harris hawks optimization (HHO)-based DBN model is derived as a classifier to allocate appropriate class labels. Using the HHO algorithm to tune the hyperparameters of the DBN model helps boost classification performance. To examine the superiority of the presented technique, a series of simulations was performed, and the results were inspected under various dimensions. The resulting values highlighted the superiority of the presented technique over recent techniques.

Optimization of Text Feature Selection Process Based on Advanced Searching for News Classification

International Journal of Swarm Intelligence Research ◽ 10.4018/ijsir.2020100101 ◽ 2020 ◽ Vol 11 (4) ◽ pp. 1-23

Author(s): Khin Sandar Kyaw ◽ Somchai Limsiroratana

Keyword(s): Feature Selection ◽ Selection Process ◽ Computation Time ◽ Classification Problem ◽ Classification Performance ◽ Classification Model ◽ Feature Subset ◽ Advanced Search ◽ Text Feature ◽ Optimal Feature Subset

Nowadays, the way people access news around the world has changed from paper to electronic formats, and the rate at which newspapers and magazines publish on websites has increased dramatically. Therefore, selecting features from a high-dimensional text feature set for an automatic news classification model is becoming a top challenge, because irrelevant features can degrade accuracy and increase the computation time of the classification model. In this article, six advanced search policies based on evolutionary, swarm intelligence, and nature-inspired intelligence methods are examined for finding the global optimal feature subset and achieving optimal accuracy in the news classification problem. According to the experimental results, the advanced search schemes can provide flexibility in integrating a classifier in accordance with its objective function, such as optimal classification performance, by adjusting the rate of the modification parameters for the testing data.

Efficient Feature Subset Selection Algorithm for High Dimensional Data

International Journal of Electrical and Computer Engineering (IJECE) ◽ 10.11591/ijece.v6i4.pp1880-1888 ◽ 2016 ◽ Vol 6 (4) ◽ pp. 1880 ◽ Cited By ~ 3

Author(s): Smita Chormunge ◽ Sudarson Jena

Keyword(s): Feature Selection ◽ Information Gain ◽ High Dimensional Data ◽ Feature Subset Selection ◽ High Dimensional ◽ Feature Subset ◽ Selection Algorithm ◽ Feature Selection Algorithm ◽ Computational Performance ◽ Optimal Feature Subset

Feature selection solves the dimensionality problem by removing irrelevant and redundant features. Existing feature selection algorithms take a long time to obtain a feature subset for high-dimensional data. This paper proposes a feature selection algorithm based on information gain measures for high-dimensional data, termed IFSA (Information gain based Feature Selection Algorithm), to produce an optimal feature subset in efficient time and improve the computational performance of learning algorithms. The IFSA algorithm works in two steps: first, a filter is applied to the dataset; second, a small feature subset is produced using the information gain measure. Extensive experiments are carried out to compare the proposed algorithm with other methods using two different classifiers (Naive Bayes and IBk) on microarray and text datasets. The results demonstrate that IFSA not only produces the most select feature subset in efficient time but also improves classifier performance.
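
The information gain measure at the core of this kind of filter can be sketched for discrete features as below. This is a generic illustration, not the IFSA algorithm itself: the preliminary filter step is not shown, and the helper names (`entropy`, `information_gain`, `rank_features`) and the `ROWS`/`LABELS` data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence

# Hypothetical data: feature 0 determines the label, feature 1 is noise.
ROWS = [[0, 0], [0, 1], [1, 0], [1, 1]]
LABELS = [0, 0, 1, 1]


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """IG(Y; X) = H(Y) - sum over values v of P(X = v) * H(Y | X = v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows: Sequence[Sequence[Hashable]],
                  labels: Sequence[Hashable],
                  top_k: int) -> List[int]:
    """Return the indices of the `top_k` features with the highest gain."""
    gains = [
        (information_gain([row[j] for row in rows], labels), j)
        for j in range(len(rows[0]))
    ]
    gains.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return [j for _, j in gains[:top_k]]


if __name__ == "__main__":
    print(rank_features(ROWS, LABELS, top_k=1))
```

Continuous features, such as gene expression levels, would need to be discretised before this measure applies; that step is outside the sketch.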

Efficient Feature Subset Selection Algorithm for High Dimensional Data

International Journal of Electrical and Computer Engineering (IJECE) ◽ 10.11591/ijece.v6i4.9800 ◽ 2016 ◽ Vol 6 (4) ◽ pp. 1880 ◽ Cited By ~ 1

Author(s): Smita Chormunge ◽ Sudarson Jena

Keyword(s): Feature Selection ◽ Information Gain ◽ High Dimensional Data ◽ Feature Subset Selection ◽ High Dimensional ◽ Feature Subset ◽ Selection Algorithm ◽ Feature Selection Algorithm ◽ Computational Performance ◽ Optimal Feature Subset

Feature Selection using Genetic Programming

Zambia ICT Journal ◽ 10.33260/zictjournal.v3i2.62 ◽ 2019 ◽ Vol 3 (2) ◽ pp. 11-18

Author(s): George Mweshi

Keyword(s): Feature Selection ◽ Genetic Programming ◽ Information Gain ◽ Principal Component ◽ Search Space ◽ Classification Performance ◽ The Other ◽ Searching Strategy ◽ Mining Algorithms ◽ Feature Selection Techniques

Extracting useful and novel information from large amounts of collected data has become a necessity for corporations wishing to maintain a competitive advantage. One of the biggest issues in handling these large datasets is the curse of dimensionality. As the dimension of the data increases, the performance of the data mining algorithms employed to mine the data deteriorates. This deterioration is mainly caused by the large search space created by irrelevant, noisy and redundant features in the data. Feature selection is one of the techniques that can be used to remove these unnecessary features. It reduces the dimension of the data as well as the search space, which in turn increases the efficiency and accuracy of the mining algorithms. In this paper, we investigate the ability of Genetic Programming (GP), an evolutionary search strategy capable of automatically finding solutions in complex and large search spaces, to perform feature selection. We implement a basic GP algorithm and perform feature selection on 5 benchmark classification datasets from the UCI repository. To test the competitiveness and feasibility of the GP approach, we examine the classification performance of four classifiers, namely J48, Naive Bayes, PART, and Random Forests, using the GP-selected features, all the original features, and the features selected by other commonly used feature selection techniques, i.e. principal component analysis, information gain, ReliefF and CFS. The experimental results show that GP not only selects a smaller set of features from the original features, but classifiers using GP-selected features also achieve better classification performance than those using all the original features. Furthermore, compared to the other well-known feature selection techniques, GP achieves very competitive results.

A novel feature selection algorithm based on damping oscillation theory

PLoS ONE ◽ 10.1371/journal.pone.0255307 ◽ 2021 ◽ Vol 16 (8) ◽ pp. e0255307

Author(s): Fujun Wang ◽ Xing Wang

Keyword(s): Feature Selection ◽ Optimization Algorithm ◽ Euclidean Distance ◽ Oscillation Theory ◽ Feature Subset Selection ◽ Support Vector ◽ Data Sets ◽ Feature Subset ◽ Selection Algorithm ◽ Filter Model

Feature selection is an important task in big data analysis and information retrieval processing. It reduces the number of features by removing noisy and extraneous data. In this paper, a feature subset selection algorithm based on damping oscillation theory and a support vector machine classifier is proposed. This algorithm is called the Maximum Kendall coefficient Maximum Euclidean Distance Improved Gray Wolf Optimization algorithm (MKMDIGWO). In MKMDIGWO, first, a filter model based on the Kendall coefficient and Euclidean distance is proposed, which is used to measure the correlation and redundancy of the candidate feature subset. Second, the wrapper model is an improved grey wolf optimization algorithm, whose position update formula has been improved in order to achieve optimal results. Third, the filter model and the wrapper model are dynamically adjusted by the damping oscillation theory to find an optimal feature subset. Therefore, MKMDIGWO achieves both the efficiency of the filter model and the high precision of the wrapper model. Experimental results on five UCI public data sets and two microarray data sets demonstrate that the MKMDIGWO algorithm achieves higher classification accuracy than four other state-of-the-art algorithms. The maximum ACC value of the MKMDIGWO algorithm is at least 0.5% higher than other algorithms on 10 data sets.
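
The two quantities named in the filter model can be computed as in the sketch below. The abstract does not say how MKMDIGWO combines them, so this only shows the ingredients: a Kendall rank correlation (the tau-a variant, chosen here as an assumption) and a Euclidean distance between two feature vectors. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs.
    Pairs tied in either sequence count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two feature vectors of equal length."""
    return math.dist(a, b)


if __name__ == "__main__":
    feature = [1, 3, 2, 4]
    target = [1, 2, 3, 4]
    print(kendall_tau(feature, target))          # rank agreement with the target
    print(euclidean_distance(feature, target))   # how far apart the two vectors are
```

One plausible reading is that a high Kendall coefficient against the class label signals relevance, while a large distance between two features signals low redundancy; that interpretation is an assumption, not a statement of the paper's formula.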

A Hybrid PSO-SFS-SBS Algorithm in Feature Selection for Liver Cancer Data

Lecture Notes in Electrical Engineering - Power Electronics and Renewable Energy Systems ◽ 10.1007/978-81-322-2119-7_133 ◽ 2014 ◽ pp. 1369-1376 ◽ Cited By ~ 1

Author(s): S. Gunasundari ◽ S. Janakiraman

Keyword(s): Feature Selection ◽ Liver Cancer ◽ Cancer Data ◽ Selection For

Performance Analysis of Classifiers on Filter-Based Feature Selection Approaches on Microarray Data

Bio-Inspired Computing for Information Retrieval Applications - Advances in Knowledge Acquisition, Transfer, and Management ◽ 10.4018/978-1-5225-2375-8.ch002 ◽ 2017 ◽ pp. 41-70 ◽ Cited By ~ 5

Author(s): Arunkumar Chinnaswamy ◽ Ramakrishnan Srinivasan

Keyword(s): Feature Selection ◽ Microarray Data ◽ Classification Accuracy ◽ Information Gain ◽ Feature Subset ◽ Classification Problems ◽ Raw Data ◽ Correlation Based Feature Selection ◽ Feature Selection Approach ◽ Gene Expression Levels

Feature selection in machine learning involves reducing the number of features (genes), and related steps, while keeping classification accuracy at an acceptable level. This paper discusses filter-based feature selection methods such as Information Gain and Correlation coefficient. After feature selection is performed, the selected genes are evaluated with five classifiers: Naïve Bayes, Bagging, Random Forest, J48 and Decision Stump. The same experiment is performed on the raw data as well. Experimental results show that the filter-based approaches reduce the number of gene expression levels effectively and thereby yield a reduced feature subset that produces higher classification accuracy than the same experiment performed on the raw data. In addition, Correlation Based Feature Selection uses far fewer genes and produces higher accuracy than the Information Gain based Feature Selection approach.
