Ensemble Feature Selection from Cancer Gene Expression Data using Mutual Information and Recursive Feature Elimination

The application of gene expression data to cancer diagnosis and classification has become an active research topic. Gene expression data usually contain a large amount of tumor-free data and are high-dimensional. To select determinant genes related to breast cancer from the initial gene expression data, we propose a new feature selection method, namely support vector machine based on recursive feature elimination and parameter optimization (SVM-RFE-PO). The grid search (GS) algorithm, the particle swarm optimization (PSO) algorithm, and the genetic algorithm (GA) are applied to search for the optimal parameters in the feature selection process. The new feature selection method thus comprises three algorithms: support vector machine based on recursive feature elimination and grid search (SVM-RFE-GS), support vector machine based on recursive feature elimination and particle swarm optimization (SVM-RFE-PSO), and support vector machine based on recursive feature elimination and genetic algorithm (SVM-RFE-GA). The selected optimal feature subsets are then used to train the SVM classifier for cancer classification. We also use random forest feature selection (RFFS), random forest feature selection and grid search (RFFS-GS), and the minimal redundancy maximal relevance (mRMR) algorithm as feature selection methods for comparison with the SVM-RFE-PO algorithm. The results showed that the feature subset selected by the SVM-RFE-PSO algorithm achieved better prediction performance, measured by Area Under the Curve (AUC), on the testing data set. This algorithm is not only time-saving but also capable of extracting more representative and useful genes.
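A minimal sketch of the recursive feature elimination loop combined with a grid search over the SVM penalty, assuming a linear SVM trained by plain batch subgradient descent. The function names (`train_linear_svm`, `svm_rfe`, `grid_search_rfe`), the learning rate, the epoch count, the grid of `C` values, and the toy data are illustrative assumptions, not the authors' implementation; the PSO and GA variants would replace the grid loop with their own search.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int], C: float = 1.0,
                     epochs: int = 200, lr: float = 0.01) -> Tuple[List[float], float]:
    """Fit a linear SVM (labels in {-1, +1}) by batch subgradient descent on
    0.5 * ||w||^2 + C * sum(hinge loss). Hyperparameters are illustrative."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated: add the hinge subgradient
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def select_columns(X, cols):
    """Keep only the listed feature columns of X."""
    return [[row[j] for j in cols] for row in X]


def svm_rfe(X, y, n_keep: int, C: float = 1.0) -> List[int]:
    """Repeatedly retrain and drop the feature with the smallest squared weight."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        w, _ = train_linear_svm(select_columns(X, remaining), y, C=C)
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        del remaining[worst]
    return remaining


def accuracy(w, b, X, y) -> float:
    """Fraction of samples whose predicted sign matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += (1 if score >= 0 else -1) == yi
    return hits / len(y)


def grid_search_rfe(X_train, y_train, X_val, y_val, n_keep: int,
                    c_grid=(0.1, 1.0, 10.0)):
    """Run SVM-RFE for each C and keep the setting with the best validation accuracy."""
    best = (-1.0, None, [])
    for C in c_grid:
        feats = svm_rfe(X_train, y_train, n_keep, C=C)
        w, b = train_linear_svm(select_columns(X_train, feats), y_train, C=C)
        score = accuracy(w, b, select_columns(X_val, feats), y_val)
        if score > best[0]:
            best = (score, C, feats)
    return best


# Toy data (hypothetical): only column 0 carries the class signal.
X_toy = [[2.0, 0.1, 0.3], [2.5, -0.1, 0.3], [1.5, 0.1, -0.3], [2.0, -0.1, -0.3],
         [-2.0, 0.1, 0.3], [-2.5, -0.1, 0.3], [-1.5, 0.1, -0.3], [-2.0, -0.1, -0.3]]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

# The training data is reused for validation only to keep the example short;
# real use needs a held-out set or cross-validation.
best_score, best_C, best_features = grid_search_rfe(X_toy, y_toy, X_toy, y_toy, n_keep=1)
```

The elimination criterion here is the squared weight of each feature in the linear model, which is the usual ranking for linear SVM-RFE; a kernel SVM would need a different ranking rule.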

Feature Selection for Gene Expression Data Analysis – A Review

International Journal of Psychosocial Rehabilitation ◽

10.37200/ijpr/v24i5/pr2020695 ◽

2020 ◽

Vol 24 (5) ◽

pp. 6955-6964

Author(s):

Dr. Prema R

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Expression Data ◽

Gene Expression Data Analysis ◽

Selection For

An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220124756 ◽

2019 ◽

Vol 21 (9) ◽

pp. 631-645 ◽

Cited By ~ 5

Author(s):

Saeed Ahmed ◽

Muhammad Kabir ◽

Zakir Ali ◽

Muhammad Arif ◽

Farman Ali ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Classification Accuracy ◽

Early Stage ◽

Small Sample Size ◽

Feature Selection Method ◽

Small Sample ◽

Expression Data ◽

Base Function

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosis of this deadly disease at an early stage is an exceptionally new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension with a small sample size. Therefore, it is indispensable to develop efficient and robust feature selection methods that identify a small set of genes to achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches to select highly informative genes. The hybrid model with a radial basis function neural network (RBFNN) classifier was evaluated on 11 benchmark gene expression datasets using a 10-fold cross-validation test. Results: The experimental results were compared with seven conventional feature selection methods and with other methods in the literature; the comparison shows that our approach has clear advantages in classification accuracy and in the number of genes selected. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimal-sized predictive gene subset.
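As a hedged illustration of the CFS component, the sketch below computes the standard CFS merit heuristic, `merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)`, where `k` is the subset size, `r_cf` the mean feature-class correlation, and `r_ff` the mean feature-feature correlation. It rewards subsets whose genes correlate with the class but not with each other. The use of absolute Pearson correlation, the function names `pearson` and `cfs_merit`, and the toy values are assumptions for this example; the abstract does not state which correlation measure the authors used, and the MOEA search over subsets is not shown.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def cfs_merit(columns: Sequence[Sequence[float]], labels: Sequence[float]) -> float:
    """CFS merit of a feature subset; each entry of `columns` is one feature."""
    k = len(columns)
    r_cf = sum(abs(pearson(col, labels)) for col in columns) / k
    if k == 1:
        return r_cf  # no feature pairs, so the redundancy term vanishes
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy values: gene_a tracks the class, gene_noise does not.
labels = [0, 0, 1, 1]
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_noise = [1.0, -1.0, 1.0, -1.0]

merit_single = cfs_merit([gene_a], labels)
merit_duplicate = cfs_merit([gene_a, gene_a], labels)   # redundant copy adds nothing
merit_with_noise = cfs_merit([gene_a, gene_noise], labels)  # uninformative gene lowers merit
```

In a hybrid scheme like the one described, a score of this kind could serve as one objective for the evolutionary search, with subset size as another; that pairing is an assumption, not something the abstract confirms.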

Improving the Performance of Principal Components for Classification of Gene Expression Data Through Feature Selection

Studies in Classification, Data Analysis, and Knowledge Organization - Data Science and Classification ◽

10.1007/3-540-34416-0_35 ◽

2006 ◽

pp. 325-332

Author(s):

Edgar Acuña ◽

Jaime Porras

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Principal Components ◽

Expression Data

Effective Cancer Classification based on Gene Expression Data using Multidimensional Mutual Information and ELM

2018 IEEE 7th Data Driven Control and Learning Systems Conference (DDCLS) ◽

10.1109/ddcls.2018.8515927 ◽

2018 ◽

Cited By ~ 1

Author(s):

Qun-Xiong Zhu ◽

Yuan Fan ◽

Yan-Lin He ◽

Yuan Xu

Keyword(s):

Gene Expression ◽

Mutual Information ◽

Gene Expression Data ◽

Cancer Classification ◽

Expression Data

A filter feature selection method based LLRFC and redundancy analysis for tumor classification using gene expression data

2016 12th World Congress on Intelligent Control and Automation (WCICA) ◽

10.1109/wcica.2016.7578590 ◽

2016 ◽

Cited By ~ 2

Author(s):

Jiangeng Li ◽

Xiaodan Li ◽

Wei Zhang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Redundancy Analysis ◽

Feature Selection Method ◽

Selection Method ◽

Tumor Classification ◽

Expression Data

Inference of Genetic Networks From Time-Series and Static Gene Expression Data: Combining a Random-Forest-Based Inference Method With Feature Selection Methods

Frontiers in Genetics ◽

10.3389/fgene.2020.595912 ◽

2020 ◽

Vol 11 ◽

Author(s):

Shuhei Kimura ◽

Ryo Fukutomi ◽

Masato Tokuhisa ◽

Mariko Okada

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Random Forest ◽

Gene Expression Data ◽

Computational Cost ◽

Expression Data ◽

Selection Methods ◽

Inference Method ◽

Combined Application ◽

Inference Methods

Several researchers have focused on random-forest-based inference methods because of their excellent performance. Some of these inference methods also have a useful ability to analyze both time-series and static gene expression data. However, they are only of use in ranking all of the candidate regulations by assigning them confidence values. None have been capable of detecting the regulations that actually affect a gene of interest. In this study, we propose a method to remove unpromising candidate regulations by combining the random-forest-based inference method with a series of feature selection methods. In addition to detecting unpromising regulations, our proposed method uses outputs from the feature selection methods to adjust the confidence values of all of the candidate regulations that have been computed by the random-forest-based inference method. Numerical experiments showed that the combined application with the feature selection methods improved the performance of the random-forest-based inference method on 99 of the 100 trials performed on the artificial problems. However, the improvement tends to be small, since our combined method succeeded in removing only 19% of the candidate regulations at most. The combined application with the feature selection methods moreover makes the computational cost higher. While a bigger improvement at a lower computational cost would be ideal, we see no impediments to our investigation, given that our aim is to extract as much useful information as possible from a limited amount of gene expression data.
