Efficient Feature Selection Model for Gene Expression Data

2011 ◽  
Vol 110-116 ◽  
pp. 1948-1952
Author(s):  
Patharawut Saengsiri ◽  
Sageemas Na Wichian ◽  
Phayung Meesad

Finding subset of informative gene is very crucial for biology process because several genes increase sharply and most of them are not related with others. In general, feature selection technique consists of two steps 1) all genes is ranked by a filter approach 2) rank list is sent to a wrapper approach. Nevertheless, the accuracy rate for recognition gene is not enough. Therefore, this paper proposes efficient feature selection model for gene expression data. First, two filter approaches are used to define many subset of attribute such as Correlation based Feature Selection (Cfs) and Gain Ratio (GR). Second, wrapper approach is used to evaluate each length of attribute that based on Support Vector Machine (SVM) and Random Forest (RF). The result of experiment depicts CfsSVM, CfsRF, GRSVM, and GRRF based on proposed model produce higher accuracy rate such as 87.10%, 90.32%, 87.10, and 88.71%, respectively.

Author(s):  
Mekour Norreddine

One of the problems that gene expression data resolved is feature selection. There is an important process for choosing which features are important for prediction; there are two general approaches for feature selection: filter approach and wrapper approach. In this chapter, the authors combine the filter approach with method ranked information gain and wrapper approach with a searching method of the genetic algorithm. The authors evaluate their approach on two data sets of gene expression data: Leukemia, and the Central Nervous System. The classifier Decision tree (C4.5) is used for improving the classification performance.


Microarray technology has been developed as one of the powerful tools that have attracted many researchers to analyze gene expression level for a given organism. It has been observed that gene expression data have very large (in terms of thousands) of features and less number of samples (in terms of hundreds). This characteristic makes difficult to do an analysis of gene expression data. Hence efficient feature selection technique must be applied before we go for any kind of analysis. Feature selection plays a vital role in the classification of gene expression data. There are several feature selection techniques have been induced in this field. But Support Vector Machine with Recursive Feature Elimination (SVM-RFE) has been proven as the promising feature selection methods among others. SVM-RFE ranks the genes (features) by training the SVM classification model and with the combination of RFE method key genes are selected. Huge time consumption is the main issue of SVM-RFE. We introduced an efficient implementation of linier SVM to overcome this problem and improved the RFE with variable step size. Then, combined method was used for selecting informative genes. Effective resampling method is proposed to preprocess the datasets. This is used to make the distribution of samples balanced, which gives more reliable classification results. In this paper, we have also studied the applicability of common classifiers. Detailed experiments are conducted on four commonly used microarray gene expression datasets. The results show that the proposed method comparable classification performance


2019 ◽  
Vol 12 (04) ◽  
pp. 1950039 ◽  
Author(s):  
Sarah M. Ayyad ◽  
Ahmed I. Saleh ◽  
Labib M. Labib

Classification of gene expression data is a pivotal research area that plays a substantial role in diagnosis and prediction of diseases. Generally, feature selection is one of the extensively used techniques in data mining approaches, especially in classification. Gene expression data are usually composed of dozens of samples characterized by thousands of genes. This increases the dimensionality coupled with the existence of irrelevant and redundant features. Accordingly, the selection of informative genes (features) becomes difficult, which badly affects the gene classification accuracy. In this paper, we consider the feature selection for classifying gene expression microarray datasets. The goal is to detect the most possibly cancer-related genes in a distributed manner, which helps in effectively classifying the samples. Initially, the available huge amount of considered features are subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulted features are ranked, then a wrapper-based selection method is applied. Experimental results showed that our proposed feature selection technique performs better than other techniques since it produces lower time latency and improves classification performance.


2016 ◽  
Vol 78 (5-10) ◽  
Author(s):  
Farzana Kabir Ahmad

Deoxyribonucleic acid (DNA) microarray technology is the recent invention that provided colossal opportunities to measure a large scale of gene expressions simultaneously. However, interpreting large scale of gene expression data remain a challenging issue due to their innate nature of “high dimensional low sample size”. Microarray data mainly involved thousands of genes, n in a very small size sample, p which complicates the data analysis process. For such a reason, feature selection methods also known as gene selection methods have become apparently need to select significant genes that present the maximum discriminative power between cancerous and normal tissues. Feature selection methods can be structured into three basic factions; a) filter methods; b) wrapper methods and c) embedded methods. Among these methods, filter gene selection methods provide easy way to calculate the informative genes and can simplify reduce the large scale microarray datasets. Although filter based gene selection techniques have been commonly used in analyzing microarray dataset, these techniques have been tested separately in different studies. Therefore, this study aims to investigate and compare the effectiveness of these four popular filter gene selection methods namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test in selecting informative genes that can distinguish cancer and normal tissues. In this experiment, common classifiers, Support Vector Machine (SVM) is used to train the selected genes. These gene selection methods are tested on three large scales of gene expression datasets, namely breast cancer dataset, colon dataset, and lung dataset. This study has discovered that IG and SNR are more suitable to be used with SVM. Furthermore, this study has shown SVM performance remained moderately unaffected unless a very small size of genes was selected.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nageswara Rao Eluri ◽  
Gangadhara Rao Kancharla ◽  
Suresh Dara ◽  
Venkatesulu Dondeti

PurposeGene selection is considered as the fundamental process in the bioinformatics field. The existing methodologies pertain to cancer classification are mostly clinical basis, and its diagnosis capability is limited. Nowadays, the significant problems of cancer diagnosis are solved by the utilization of gene expression data. The researchers have been introducing many possibilities to diagnose cancer appropriately and effectively. This paper aims to develop the cancer data classification using gene expression data.Design/methodology/approachThe proposed classification model involves three main phases: “(1) Feature extraction, (2) Optimal Feature Selection and (3) Classification”. Initially, five benchmark gene expression datasets are collected. From the collected gene expression data, the feature extraction is performed. To diminish the length of the feature vectors, optimal feature selection is performed, for which a new meta-heuristic algorithm termed as quantum-inspired immune clone optimization algorithm (QICO) is used. Once the relevant features are selected, the classification is performed by a deep learning model called recurrent neural network (RNN). Finally, the experimental analysis reveals that the proposed QICO-based feature selection model outperforms the other heuristic-based feature selection and optimized RNN outperforms the other machine learning methods.FindingsThe proposed QICO-RNN is acquiring the best outcomes at any learning percentage. On considering the learning percentage 85, the accuracy of the proposed QICO-RNN was 3.2% excellent than RNN, 4.3% excellent than RF, 3.8% excellent than NB and 2.1% excellent than KNN for Dataset 1. For Dataset 2, at learning percentage 35, the accuracy of the proposed QICO-RNN was 13.3% exclusive than RNN, 8.9% exclusive than RF and 14.8% exclusive than NB and KNN. Hence, the developed QICO algorithm is performing well in classifying the cancer data using gene expression data accurately.Originality/valueThis paper introduces a new optimal feature selection model using QICO and QICO-based RNN for effective classification of cancer data using gene expression data. This is the first work that utilizes an optimal feature selection model using QICO and QICO-RNN for effective classification of cancer data using gene expression data.


Sign in / Sign up

Export Citation Format

Share Document