Methods for Gene Selection and Classification of Microarray Dataset

Handbook of Research on Biomimicry in Information Retrieval and Knowledge Management - Advances in Web Technologies and Engineering ◽

10.4018/978-1-5225-3004-6.ch004 ◽

2018 ◽

pp. 66-77

Author(s):

Mekour Norreddine

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Information Gain ◽

Microarray Dataset ◽

Data Sets ◽

Expression Data ◽

Wrapper Approach ◽

Filter Approach

Feature selection is one of the central problems in gene expression data analysis. It is the process of choosing which features are important for prediction, and there are two general approaches: the filter approach and the wrapper approach. In this chapter, the authors combine a filter approach that ranks genes by information gain with a wrapper approach that uses a genetic algorithm as its search method. The authors evaluate their approach on two gene expression data sets: Leukemia and Central Nervous System. The C4.5 decision tree classifier is used to improve classification performance.
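
The filter step named in this abstract, ranking genes by information gain, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and binarizing each gene at its mean expression value is an assumption made only to keep the example short.

```python
import math
from collections import Counter
from statistics import mean


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Information gain of splitting one gene's expression values at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


def rank_genes(expression, labels):
    """Rank genes by decreasing information gain.

    `expression` maps a gene name to its per-sample expression values.
    Assumption: each gene is split at its mean value (not taken from the source).
    Returns the ranked gene names and the score of every gene.
    """
    scores = {gene: information_gain(values, labels, mean(values))
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True), scores
```

A caller would keep the top-ranked genes from `rank_genes` and hand only those to the wrapper stage, which keeps the search space of the wrapper small.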


Filter/Wrapper Methods for Gene Selection and Classification of Microarray Dataset

International Journal of Software Innovation ◽

10.4018/ijsi.2019070104 ◽

2019 ◽

Vol 7 (3) ◽

pp. 65-80

Author(s):

Norreddine Mekour ◽

Reda Mohamed Hamou ◽

Abdelmalek Amine

Keyword(s):

Feature Selection ◽

Large Scale ◽

Gene Selection ◽

Information Gain ◽

Microarray Dataset ◽

Classification Performance ◽

Wrapper Approach ◽

Computer Scientists ◽

Filter Approach ◽

Searching Method

A wide variety of large-scale data has been produced through the extraction of genomic information and data mining. The problems addressed include genome sequencing, protein structure modeling, and the reconstruction of evolutionary trees (phylogeny). These problems require collaboration between biologists and computer scientists because of their great algorithmic difficulty. One of the most recent problems raised by gene expression data is feature selection. There are two general approaches to feature selection: the filter approach and the wrapper approach. In this article, the authors propose a new approach that combines a filter approach, which ranks genes by information gain, with a wrapper approach that uses a genetic algorithm as its search method. To test its overall performance, an experimental study is presented based on two gene microarray datasets from the bioinformatics and biomedical domains: leukemia and central nervous system (CNS). The C4.5 decision tree classifier is used to improve classification performance. The results show that the approach selects genes that yield more accurate classification, which underlines the effectiveness of the chosen genes and the approach's ability to filter out unsuitable genes.
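
The wrapper step, a genetic algorithm searching over gene subsets, might look like the sketch below. It is an illustration under stated assumptions rather than the authors' method: the fitness function uses leave-one-out accuracy of a nearest-centroid classifier as a stand-in for C4.5, and the names `loo_accuracy` and `genetic_search`, the size penalty, and all parameter defaults are hypothetical.

```python
import random


def loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`.

    Stand-in for the C4.5 decision tree named in the abstract (assumption).
    `samples` is a list of rows; `genes` is a list of column indices.
    """
    if not genes:
        return 0.0
    correct = 0
    for i, (held_out, truth) in enumerate(zip(samples, labels)):
        # Build one centroid per class from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(samples, labels)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(genes))
            for k, g in enumerate(genes):
                acc[k] += row[g]
            counts[label] = counts.get(label, 0) + 1

        def distance(label):
            return sum((held_out[g] - sums[label][k] / counts[label]) ** 2
                       for k, g in enumerate(genes))

        if min(sums, key=distance) == truth:
            correct += 1
    return correct / len(samples)


def genetic_search(samples, labels, n_genes, pop_size=20, generations=15,
                   mutation_rate=0.1, seed=0):
    """Search for a gene subset (bit mask) with high accuracy; needs n_genes >= 2."""
    rng = random.Random(seed)

    def fitness(mask):
        genes = [g for g, bit in enumerate(mask) if bit]
        # Small penalty per selected gene so that smaller subsets win ties.
        return loo_accuracy(samples, labels, genes) - 0.01 * len(genes)

    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return [g for g, bit in enumerate(best) if bit]
```

In a hybrid setup of the kind described, `samples` would contain only the genes that survived the filter stage, so the bit masks stay short enough for the search to be practical.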


Efficient Feature Selection Model for Gene Expression Data

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.110-116.1948 ◽

2011 ◽

Vol 110-116 ◽

pp. 1948-1952

Author(s):

Patharawut Saengsiri ◽

Sageemas Na Wichian ◽

Phayung Meesad

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Selection Model ◽

Support Vector ◽

Expression Data ◽

Accuracy Rate ◽

Feature Selection Technique ◽

Wrapper Approach ◽

Filter Approach

Finding a subset of informative genes is crucial in biological analysis because the number of genes has increased sharply and most of them are unrelated to the others. In general, a feature selection technique consists of two steps: 1) all genes are ranked by a filter approach, and 2) the ranked list is sent to a wrapper approach. Nevertheless, the accuracy rate for gene recognition is not sufficient. Therefore, this paper proposes an efficient feature selection model for gene expression data. First, two filter approaches, Correlation-based Feature Selection (Cfs) and Gain Ratio (GR), are used to define many subsets of attributes. Second, a wrapper approach based on Support Vector Machine (SVM) and Random Forest (RF) is used to evaluate each length of attribute subset. The experimental results show that CfsSVM, CfsRF, GRSVM, and GRRF based on the proposed model produce higher accuracy rates of 87.10%, 90.32%, 87.10%, and 88.71%, respectively.
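
The two-step scheme in this abstract, a filter ranking followed by a wrapper that scores subsets of different lengths, can be sketched as below. The gain ratio formula is the standard one (information gain divided by split information); the assumption that expression values have already been discretized, the helper `best_prefix`, and the caller-supplied `evaluate` function are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(feature, labels):
    """Gain ratio of a discrete feature: information gain / split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    # A constant feature carries no split information; score it as zero.
    return gain / split_info if split_info > 0 else 0.0


def best_prefix(ranked_genes, evaluate):
    """Score every top-k prefix of a ranked gene list and keep the best one.

    `evaluate` is a caller-supplied function (hypothetical here) that returns an
    accuracy estimate for a list of genes, for example a cross-validated
    SVM or Random Forest accuracy.
    """
    best_genes, best_score = [], float("-inf")
    for k in range(1, len(ranked_genes) + 1):
        score = evaluate(ranked_genes[:k])
        if score > best_score:
            best_genes, best_score = ranked_genes[:k], score
    return best_genes, best_score
```

Evaluating only prefixes of the ranked list keeps the number of classifier runs linear in the number of genes, which is the practical reason for placing a filter in front of a wrapper.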


Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

BioMed Research International ◽

10.1155/2019/2497509 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 1

Author(s):

Suyan Tian ◽

Chi Wang ◽

Bing Wang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Knowledge ◽

Expression Data ◽

Selection Methods ◽

Its Gene ◽

Active Research

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, pathway-based feature selection methods are first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods being regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.


CLASSIFYING TEMPORAL MICROARRAY DATA BY SELECTING INFORMATIVE GENES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013410060 ◽

2013 ◽

Vol 11 (03) ◽

pp. 1341006

Author(s):

QIANG LOU ◽

ZORAN OBRADOVIC

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Data ◽

Data Sets ◽

Temporal Data ◽

Expression Data ◽

Selection Methods ◽

Temporal Gene Expression ◽

Single Matrix

In order to more accurately predict an individual's health status, in clinical applications it is often important to perform analysis of high-dimensional gene expression data that varies with time. A major challenge in predicting from such temporal microarray data is that the number of biomarkers used as features is typically much larger than the number of labeled subjects. One way to address this challenge is to perform feature selection as a preprocessing step and then apply a classification method on selected features. However, traditional feature selection methods cannot handle multivariate temporal data without applying techniques that flatten temporal data into a single matrix in advance. In this study, a feature selection filter that can directly select informative features from temporal gene expression data is proposed. In our approach, we measure the distance between multivariate temporal data from two subjects. Based on this distance, we define the objective function of temporal margin based feature selection to maximize each subject's temporal margin in its own relevant subspace. The experimental results on synthetic and two real flu data sets provide evidence that our method outperforms the alternatives, which flatten the temporal data in advance.


Gene Expression Data For Gene Selection Using Ensemble Based Feature Selection

2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS) ◽

10.1109/icicis46948.2019.9014722 ◽

2019 ◽

Author(s):

Mohamad Aouf ◽

Amr Sharawi ◽

Khaled Samir ◽

Sultan Almotatiri ◽

Abdulla Bajahzar ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Expression Data


A hybrid filter/wrapper approach of feature selection for gene expression data

2008 IEEE International Conference on Systems, Man and Cybernetics ◽

10.1109/icsmc.2008.4811698 ◽

2008 ◽

Cited By ~ 1

Author(s):

Chao-Hsuan Ke ◽

Cheng-Hong Yang ◽

Li-Yeh Chuang ◽

Cheng-San Yang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Expression Data ◽

Hybrid Filter ◽

Wrapper Approach ◽

Selection For


A COMPARATIVE STUDY ON GENE SELECTION METHODS FOR TISSUES CLASSIFICATION ON LARGE SCALE GENE EXPRESSION DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8843 ◽

2016 ◽

Vol 78 (5-10) ◽

Author(s):

Farzana Kabir Ahmad

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Large Scale ◽

Gene Selection ◽

Support Vector ◽

Breast Cancer Dataset ◽

Expression Data ◽

Selection Methods ◽

Normal Tissues

Deoxyribonucleic acid (DNA) microarray technology is a recent invention that has provided enormous opportunities to measure a large number of gene expressions simultaneously. However, interpreting large-scale gene expression data remains a challenging issue due to its innate "high dimensional, low sample size" nature. Microarray data mainly involve thousands of genes, n, in a very small number of samples, p, which complicates the data analysis process. For this reason, feature selection methods, also known as gene selection methods, have become necessary to select significant genes that present the maximum discriminative power between cancerous and normal tissues. Feature selection methods can be grouped into three basic categories: a) filter methods, b) wrapper methods, and c) embedded methods. Among these, filter gene selection methods provide an easy way to identify the informative genes and can simply reduce large-scale microarray datasets. Although filter-based gene selection techniques have been commonly used in analyzing microarray datasets, these techniques have been tested separately in different studies. Therefore, this study aims to investigate and compare the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG), and t-Test, in selecting informative genes that can distinguish cancerous from normal tissues. In this experiment, a common classifier, the Support Vector Machine (SVM), is trained on the selected genes. These gene selection methods are tested on three large-scale gene expression datasets, namely a breast cancer dataset, a colon dataset, and a lung dataset. This study has found that IG and SNR are more suitable for use with SVM. Furthermore, this study has shown that SVM performance remained largely unaffected unless a very small number of genes was selected.
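
Three of the filter scores compared in this abstract can be written down compactly. The sketch below uses commonly cited two-class definitions (sample standard deviations and variances, and Welch's form of the t statistic); whether the study used exactly these variants is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def signal_to_noise(pos, neg):
    """SNR: difference of class means over the sum of class standard deviations."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher_criterion(pos, neg):
    """Fisher criterion: squared mean difference over the sum of class variances."""
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))


def welch_t(pos, neg):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / standard_error


def rank_by(score, tumor, normal):
    """Rank genes by the absolute value of `score`.

    `tumor` and `normal` map each gene name to its expression values in the
    cancerous and normal samples. Each class needs at least two samples and
    a nonzero spread, otherwise the scores are undefined.
    """
    return sorted(tumor, key=lambda g: abs(score(tumor[g], normal[g])), reverse=True)
```

Ranking by absolute value treats strongly over-expressed and strongly under-expressed genes as equally informative; a caller interested in only one direction would sort by the signed score instead.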


Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

Entropy ◽

10.3390/e23010002 ◽

2020 ◽

Vol 23 (1) ◽

pp. 2

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Biological Data ◽

Biological Information ◽

Background Information ◽

Biological Knowledge ◽

Expression Data

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.


Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

10.20944/preprints202012.0377.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Data ◽

Integrative Approach ◽

Biological Knowledge ◽

Expression Data

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. For gene expression data analysis, most of the existing feature selection methods rely on expression values alone to select the genes; and biological knowledge is integrated at the end of the analysis in order to gain biological insights or to support the initial findings. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. Since the integrative approach attracted attention in the gene expression domain, lately the gene selection process shifted from being purely data-centric to more incorporative analysis with additional biological knowledge.


A comparative study of biomarker gene selection methods in presence of outliers

Journal of Bio-Science ◽

10.3329/jbs.v25i0.37493 ◽

2018 ◽

Vol 25 ◽

pp. 9-16

Author(s):

M Shahjaman ◽

N Kumar ◽

AA Begum ◽

SMS Islam ◽

MNH Mollah

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Simulated Data ◽

Small Sample ◽

Data Sets ◽

Expression Data ◽

Selection Methods ◽

Cancer Data

The main purpose of gene expression data analysis is to identify biomarker genes by comparing gene expression levels between two different groups or conditions. There are several methods for selecting biomarker genes, and many comparative studies have been performed to select the appropriate method. However, these studies did not consider the problem of outliers in their data sets, although robustness is essential when choosing a method because outliers may occur at different steps of the process that generates gene expression data. In this paper, the performance of five popular statistical biomarker gene selection methods, namely T-test, SAM, LIMMA, KW, and FCROS, is evaluated using both simulated and real gene expression data sets in the absence and presence of outliers. In the simulated data analysis, the performance of these methods was assessed with measures such as TPR, TNR, FPR, FNR, and AUC. Based on these measures, it was found that in the absence of outliers all the methods perform almost equally well for both small-sample and large-sample cases, whereas in the presence of outliers only the FCROS method performs better than the other methods in the small-sample case. In a real colon cancer data analysis, the FCROS method identified 59 additional genes that were not detected by the other methods, and most of them belong to different cancer-related pathways. J. bio-sci. 25: 9-16, 2017
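
The rate-based measures listed in this abstract follow directly from comparing a selected gene set with the known biomarkers of a simulation. The sketch below shows the standard definitions; the function name and the set-based interface are hypothetical, and AUC is left out because it needs a ranking of all genes rather than a single selected set.

```python
def selection_rates(selected, true_biomarkers, all_genes):
    """Compute TPR, TNR, FPR, and FNR for a gene selection result.

    Assumes `all_genes` contains at least one true biomarker and at least one
    non-biomarker, so that no denominator is zero.
    """
    selected = set(selected)
    truth = set(true_biomarkers)
    universe = set(all_genes)

    tp = len(selected & truth)             # biomarkers that were selected
    fp = len(selected - truth)             # non-biomarkers that were selected
    fn = len(truth - selected)             # biomarkers that were missed
    tn = len(universe - selected - truth)  # non-biomarkers correctly left out

    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }
```

In a simulation study of this kind, the true biomarkers are known by construction, so the same function can be applied to the output of each selection method with and without injected outliers.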
