Gene Expression Data Classification Using Support Vector Machine and Mutual Information-based Gene Selection

In this paper we propose a novel Support Vector Machine-based architecture for medical diagnosis using multi-class gene expression data. It consists of a pre-processing unit and N-1 sequentially ordered blocks that classify N classes in a cascading manner. Each block contains both a gene selection module and a classification module, which gives the flexibility to construct block-specific gene expression spaces and hypersurfaces for discriminating between the different classes. The proposed architecture was applied to medical diagnostic tasks, including prostate and lung cancer diagnosis. Its performance was evaluated with a leave-one-out cross-validation approach that avoids the bias introduced by the gene selection process. The results show that it provides high accuracy, which in most cases exceeds the accuracy achieved by the popular one-vs-one and one-vs-all SVM combination schemes and by Nearest-Neighbor classifiers. The cascading SVMs can be successfully applied as a medical diagnostic tool.
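
The cascading control flow described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the class order is supplied by the caller, a signal-to-noise score stands in for the paper's mutual information criterion, and a nearest-centroid rule stands in for the SVM inside each block. All function names and the toy data are hypothetical.

```python
from statistics import mean, stdev

def snr(pos, neg):
    # Stand-in gene score (NOT the paper's mutual information criterion).
    # The small epsilon guards against zero spread in toy data.
    return abs(mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg) + 1e-12)

def train_block(X, y, target, k):
    """One block: select k genes for `target` vs. the rest, then fit a classifier."""
    pos = [row for row, lab in zip(X, y) if lab == target]
    neg = [row for row, lab in zip(X, y) if lab != target]
    n_genes = len(X[0])
    scores = [snr([r[g] for r in pos], [r[g] for r in neg]) for g in range(n_genes)]
    genes = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]
    # Nearest-centroid stand-in for the block's SVM, on the block-specific genes.
    centroid = lambda rows: [mean(r[g] for r in rows) for g in genes]
    return {"target": target, "genes": genes, "pos": centroid(pos), "neg": centroid(neg)}

def block_says_target(block, row):
    x = [row[g] for g in block["genes"]]
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return dist(block["pos"]) < dist(block["neg"])

def train_cascade(X, y, class_order, k=1):
    """Train N-1 blocks for N classes; each block sees only the classes not yet peeled off."""
    blocks = []
    for target in class_order[:-1]:
        blocks.append(train_block(X, y, target, k))
        keep = [i for i, lab in enumerate(y) if lab != target]
        X, y = [X[i] for i in keep], [y[i] for i in keep]
    return blocks

def predict(blocks, row, last_class):
    """Pass the sample down the cascade; the first block that claims it decides."""
    for block in blocks:
        if block_says_target(block, row):
            return block["target"]
    return last_class

# Toy data: gene 0 separates A from the rest, gene 1 separates B from C.
X = [[9.0, 5.0], [9.5, 1.0], [8.5, 3.0],
     [1.0, 8.0], [1.5, 8.5], [0.5, 7.5],
     [1.2, 1.0], [0.8, 1.5], [1.0, 0.5]]
y = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
blocks = train_cascade(X, y, class_order=["A", "B", "C"], k=1)
print([b["genes"] for b in blocks], predict(blocks, [1.0, 7.9], "C"))
```

Because each block runs its own gene selection, the two blocks in the toy example end up working in different gene subspaces, which is the flexibility the abstract highlights. For an unbiased leave-one-out estimate, the whole `train_cascade` call, including gene selection, would be repeated inside every fold.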

A COMPARATIVE STUDY ON GENE SELECTION METHODS FOR TISSUES CLASSIFICATION ON LARGE SCALE GENE EXPRESSION DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8843 ◽

2016 ◽

Vol 78 (5-10) ◽

Author(s):

Farzana Kabir Ahmad

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Large Scale ◽

Gene Selection ◽

Support Vector ◽

Breast Cancer Dataset ◽

Expression Data ◽

Selection Methods ◽

Normal Tissues

Deoxyribonucleic acid (DNA) microarray technology is a recent invention that makes it possible to measure the expression of a large number of genes simultaneously. However, interpreting large-scale gene expression data remains challenging because of its inherent "high dimension, low sample size" nature. Microarray data typically involve thousands of genes (n) measured in a very small number of samples (p), which complicates data analysis. For this reason, feature selection methods, also known as gene selection methods, are needed to select the significant genes that offer the greatest discriminative power between cancerous and normal tissues. Feature selection methods fall into three basic groups: a) filter methods; b) wrapper methods; and c) embedded methods. Among these, filter gene selection methods provide an easy way to identify informative genes and can simply reduce large-scale microarray datasets. Although filter-based gene selection techniques are commonly used in analyzing microarray datasets, they have been tested separately in different studies. Therefore, this study investigates and compares the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test, in selecting informative genes that can distinguish cancerous from normal tissues. In this experiment, a common classifier, the Support Vector Machine (SVM), is trained on the selected genes. The gene selection methods are tested on three large-scale gene expression datasets: a breast cancer dataset, a colon dataset, and a lung dataset. The study found that IG and SNR are better suited for use with SVM. Furthermore, it showed that SVM performance remained relatively unaffected unless only a very small number of genes was selected.
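
As a rough illustration of how a filter method scores genes independently of any classifier, the sketch below ranks genes by two of the criteria named above. The abstract does not give formulas, so the definitions used here are the commonly cited ones and should be read as assumptions: SNR as (mean1 - mean2) / (sd1 + sd2) and the Fisher criterion as (mean1 - mean2)^2 / (sd1^2 + sd2^2). The function names and toy data are hypothetical.

```python
from statistics import mean, stdev

def snr_score(pos, neg):
    """Signal-to-noise ratio for one gene (assumed definition, see text)."""
    # Assumes each class has at least two samples and non-zero spread.
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

def fisher_score(pos, neg):
    """Fisher criterion for one gene (assumed definition, see text)."""
    return (mean(pos) - mean(neg)) ** 2 / (stdev(pos) ** 2 + stdev(neg) ** 2)

def rank_genes(X, y, score=snr_score, k=2):
    """Return the indices of the k genes with the largest absolute score.

    X is a list of samples (one list of expression values per sample);
    y holds the labels, 1 for tumour and 0 for normal tissue.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(score(pos, neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

# Toy data: gene 0 is strongly discriminative, gene 2 weakly, gene 1 not at all.
X = [[5.0, 1.0, 2.0],
     [5.2, 0.9, 2.5],
     [4.8, 1.1, 3.0],
     [1.0, 1.0, 1.0],
     [1.2, 1.2, 1.5],
     [0.8, 0.8, 2.0]]
y = [1, 1, 1, 0, 0, 0]

print(rank_genes(X, y, snr_score, k=2))
print(rank_genes(X, y, fisher_score, k=3))
```

The selected gene indices would then be used to subset the expression matrix before training the classifier, so the cost of scoring grows only linearly with the number of genes.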

Cancer classification through filtering progressive transductive support vector machine based on gene expression data

10.1063/1.4992918 ◽

2017 ◽

Cited By ~ 2

Author(s):

Xinguo Lu ◽

Dan Chen

Keyword(s):

Gene Expression ◽

Support Vector Machine ◽

Gene Expression Data ◽

Cancer Classification ◽

Support Vector ◽

Expression Data ◽

Transductive Support Vector Machine

A Support Vector Machine Ensemble for Cancer Classification Using Gene Expression Data

Bioinformatics Research and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-540-72031-7_44 ◽

2007 ◽

pp. 488-495 ◽

Cited By ~ 1

Author(s):

Chen Liao ◽

Shutao Li

Keyword(s):

Gene Expression ◽

Support Vector Machine ◽

Gene Expression Data ◽

Cancer Classification ◽

Support Vector ◽

Expression Data

Partial least squares regression, support vector machine regression, and transcriptome-based distances for prediction of maize hybrid performance with gene expression data

Theoretical and Applied Genetics ◽

10.1007/s00122-011-1747-9 ◽

2011 ◽

Vol 124 (5) ◽

pp. 825-833 ◽

Cited By ~ 19

Author(s):

Junjie Fu ◽

K. Christin Falke ◽

Alexander Thiemann ◽

Tobias A. Schrag ◽

Albrecht E. Melchinger ◽

...

Keyword(s):

Gene Expression ◽

Support Vector Machine ◽

Least Squares ◽

Gene Expression Data ◽

Partial Least Squares Regression ◽

Support Vector ◽

Hybrid Performance ◽

Expression Data ◽

Least Squares Regression ◽

Maize Hybrid

Support vector machine model of developmental brain gene expression data for prioritization of Autism risk gene candidates

Bioinformatics ◽

10.1093/bioinformatics/btw498 ◽

2016 ◽

pp. btw498 ◽

Cited By ~ 5

Author(s):

S. Cogill ◽

L. Wang

Keyword(s):

Gene Expression ◽

Support Vector Machine ◽

Gene Expression Data ◽

Support Vector Machine Model ◽

Support Vector ◽

Expression Data ◽

Machine Model ◽

Brain Gene Expression ◽

Risk Gene

Use of relevancy and complementary information for discriminatory gene selection from high-dimensional gene expression data

PLoS ONE ◽

10.1371/journal.pone.0230164 ◽

2021 ◽

Vol 16 (10) ◽

pp. e0230164

Author(s):

Md Nazmul Haque ◽

Sadia Sharmin ◽

Amin Ahsan Ali ◽

Abu Ashfaqur Sajib ◽

Mohammad Shoyaib

Keyword(s):

Gene Expression ◽

Mutual Information ◽

Gene Expression Data ◽

Gene Selection ◽

De Novo ◽

Expression Profiles ◽

Biological Data ◽

Expression Data ◽

Finite Sample ◽

Key Genes

With the advent of high-throughput technologies, the life sciences are generating a huge amount of varied biomolecular data. Global gene expression profiles provide a snapshot of all the genes that are transcribed in a cell or tissue under a particular condition. The high dimensionality of such gene expression data (i.e., a very large number of features/genes analyzed with far fewer samples) makes it difficult to identify, de novo, the key genes (biomarkers) that truly contribute to a particular phenotype or condition such as cancer. Among the criteria in the existing literature for identifying key genes from gene expression data, mutual information (MI) is one of the most successful. However, existing methods do not take the finite-sample correction of MI into account. They also do not incorporate dynamic discretization of genes, which is important for selecting more relevant genes. In addition, current studies usually suggest removing redundant genes, which is particularly inappropriate for biological data, as a group of genes may connect to each other through downstream proteins. Thus, genes that provide additional useful information about the disease should be included even if they are redundant. To address these issues, we propose a Mutual information based Gene Selection method (MGS) for selecting informative genes. Moreover, to rank the selected genes, we extend MGS and propose two ranking methods: MGSf, based on frequency, and MGSrf, based on Random Forest. Compared with recently reported methods, the proposed method not only obtained better classification rates on gene expression datasets derived from different gene expression studies but also detected the key genes relevant to pathways with a causal relationship to the disease, which indicates that it should also be able to find the responsible genes in data for an unknown disease.
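
For readers unfamiliar with the baseline that MGS builds on, the following sketch shows plain plug-in mutual information between a discretized gene and the class label. It is not the MGS method: it uses fixed equal-width bins instead of dynamic discretization, applies no finite-sample correction, and ignores complementary information between genes. The function names, the bin count, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def discretize(values, bins=3):
    """Equal-width binning (a fixed scheme, unlike the dynamic one MGS argues for)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits, with no finite-sample correction."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def rank_by_mi(X, y, bins=3):
    """Score every gene against the labels and return (ranking, scores)."""
    n_genes = len(X[0])
    mi = [mutual_information(discretize([row[g] for row in X], bins), y)
          for g in range(n_genes)]
    order = sorted(range(n_genes), key=lambda g: mi[g], reverse=True)
    return order, mi

# Toy data: gene 0 tracks the label perfectly, gene 1 is unrelated to it.
X = [[0.1, 0.1],
     [0.2, 1.0],
     [0.9, 0.1],
     [1.0, 1.0]]
y = [0, 0, 1, 1]

order, mi = rank_by_mi(X, y, bins=2)
print(order, mi)
```

With only a handful of samples, this plug-in estimate tends to overstate the dependence between a gene and the label, which is the kind of finite-sample effect the abstract says existing methods leave uncorrected.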
