Informative Gene Selection Based on Cost-Sensitive Fast Correlation-Based Feature Selection

Background: Informative gene selection is an essential step in tumor classification. However, selecting tumor-related informative genes from large-scale gene expression profiles is difficult because the data are high-dimensional, have relatively few samples, are class-imbalanced, and contain many superfluous and irrelevant genes. Objective: Many researchers use machine learning methods to analyze and process gene expression data and obtain gene subsets for classification. However, tumor gene expression profiles often pose massive computational challenges. In addition, traditional feature selection algorithms often ignore cost estimation while improving feature importance and classification accuracy, which makes tumor classification more difficult. Method: In this study, a novel informative gene selection method based on cost-sensitive fast correlation-based feature selection (CS-FCBF) is proposed. Results: First, the symmetric uncertainty index is used to evaluate the correlation between genes and class labels, and a large number of irrelevant and redundant genes are then quickly filtered out according to importance, generating a candidate gene subset. Second, cost-sensitive learning, which introduces the misclassification cost matrix and support vector machine attribute evaluation, is used to obtain the top-ranked gene subset with minimum misclassification loss. Finally, the candidate gene subset is optimized. Conclusion: The method was validated on eight independent tumor datasets. Compared with three other hybrid methods that combine typical gene selection algorithms with cost-sensitive learning, the proposed method achieved better classification performance with fewer selected genes, which might provide guidance for tumor diagnosis and research.
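
The two quantities named in the abstract above, symmetric uncertainty and a misclassification cost matrix, can be illustrated with a minimal Python sketch. It assumes gene expression values have already been discretized, and it uses the standard definition SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)). The function names and the cost-matrix layout (`cost[true][predicted]`) are hypothetical choices for illustration and are not taken from the CS-FCBF paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; treat as uninformative
    joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - joint  # mutual information between x and y
    return 2.0 * info_gain / (hx + hy)


def misclassification_cost(y_true, y_pred, cost):
    """Total cost, where cost[t][p] is the (assumed) cost of predicting p for true class t."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))
```

For example, a gene whose discretized values match the class labels exactly gets SU = 1, while a gene that is independent of the labels gets SU = 0. An asymmetric cost matrix such as `[[0, 1], [5, 0]]` charges five times more for missing class 1 than for a false alarm.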

A Robust Gene Selection Method for Microarray-based Cancer Classification

Cancer Informatics, 2010, Vol 9, pp. CIN.S3794. DOI: 10.4137/cin.s3794. Cited By ~ 21

Author(s): Xiaosheng Wang, Osamu Gotoh

Keyword(s): Gene Expression, Feature Selection, Gene Selection, Information Gain, Expression Profiles, Feature Selection Method, Gene Expression Profiles, Molecular Classification, Selection Method, Chi Square

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because gene expression profiles of specific cancers have distinct inherent characteristics, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, which generalizes the feature selection method based on the depended degree of attribute in rough sets. We compared this method with established methods (the depended degree, chi-square, information gain, Relief-F, and symmetric uncertainty) and analyzed its properties through a series of classification experiments. The results revealed that our method was superior to the canonical depended-degree-based method in robustness and applicability. Moreover, the method was comparable to the other four commonly used methods. More importantly, the method can reveal the inherent classification difficulty of different gene expression datasets, which reflects the inherent biology of specific cancers.
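
As a rough illustration of the baseline mentioned above, the following Python sketch computes the classic rough-set dependency degree of a decision on a single discretized attribute: the fraction of samples whose attribute value determines the class unambiguously. This assumes that the "depended degree of attribute" refers to this standard rough-set quantity. The authors' generalization is not described in the abstract and is not reproduced here, and the function name is hypothetical.

```python
from collections import defaultdict


def dependency_degree(attribute, decision):
    """Fraction of samples in the positive region of `decision` w.r.t. `attribute`.

    A sample is in the positive region when every sample sharing its
    attribute value also shares its decision (class) value.
    """
    classes_seen = defaultdict(set)  # attribute value -> set of class labels
    counts = defaultdict(int)        # attribute value -> number of samples
    for a, d in zip(attribute, decision):
        classes_seen[a].add(d)
        counts[a] += 1
    positive = sum(counts[a] for a, ds in classes_seen.items() if len(ds) == 1)
    return positive / len(attribute)
```

With attribute values `['lo', 'lo', 'hi', 'hi', 'mid', 'mid']` and classes `[0, 0, 1, 1, 0, 1]`, the 'lo' and 'hi' groups are pure but 'mid' is mixed, so four of the six samples fall in the positive region.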

Heuristic Breadth-First Search Algorithm for Informative Gene Selection Based on Gene Expression Profiles

Chinese Journal of Computers, 2009, Vol 31 (4), pp. 636-649. DOI: 10.3724/sp.j.1016.2008.00636. Cited By ~ 3

Author(s): Shu-Lin WANG, Ji WANG, Huo-Wang CHEN, Shu-Tao LI, Bo-Yun ZHANG

Keyword(s): Gene Expression, Gene Selection, Search Algorithm, Expression Profiles, Gene Expression Profiles, Informative Gene, Breadth First Search

Correlation-based Gene Selection and Classification Using Taguchi-BPSO

Methods of Information in Medicine, 2010, Vol 49 (03), pp. 254-268. DOI: 10.3414/me09-01-0010. Cited By ~ 10

Author(s): C.-S. Yang, K.-C. Wu, C.-H. Yang, L.-Y. Chuang

Keyword(s): Gene Expression, Feature Selection, Microarray Data, Error Rate, Gene Expression Analysis, Gene Selection, Expression Profiles, Gene Expression Profiles, Classification Error, Classification Error Rate

Background: Microarray data on gene expression profiles have provided valuable results for a variety of problems and have contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and a small sample size, which makes it difficult for a general classification method to classify samples correctly. Moreover, not every gene is relevant for distinguishing the sample class. Thus, to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene extraction method is necessary for eliminating irrelevant genes and decreasing the classification error rate. Objective: The purpose of gene expression analysis is to discriminate between classes of samples and to predict the relative importance of each gene for sample classification. Method: In this paper, correlation-based feature selection (CFS) and Taguchi-binary particle swarm optimization (TBPSO) were combined into a hybrid method, and the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as a classifier for ten gene expression profiles. Results: Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method obtained the lowest classification error rate on all ten gene expression data sets tested. For six of the data sets, a classification error rate of zero was reached. Conclusion: The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.
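
The evaluation protocol named in the Method section, K-NN scored by leave-one-out cross-validation, can be sketched in a few lines of Python. This sketch assumes K = 1 and Euclidean distance, neither of which is stated in the abstract, and the function name is hypothetical. It covers only the evaluation step, not the CFS or TBPSO search.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def loocv_error_1nn(samples, labels):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier.

    Each sample is held out in turn and classified by the label of its
    closest remaining sample; the return value is the fraction misclassified.
    """
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        others = (j for j in range(len(samples)) if j != i)
        nearest = min(others, key=lambda j: dist(x, samples[j]))
        errors += labels[nearest] != y
    return errors / len(samples)
```

In a wrapper-style search, a function like this would be called once per candidate gene subset, with `samples` restricted to the selected genes, and the subset with the lowest error rate would be kept.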

Study of Informative Gene Selection for Gene Expression Profiles

2009 WRI Global Congress on Intelligent Systems, 2009. DOI: 10.1109/gcis.2009.94

Author(s): Quanzhong Liu, Yang Zhang, Yong Wang, Zhengguo Hu

Keyword(s): Gene Expression, Gene Selection, Expression Profiles, Gene Expression Profiles, Informative Gene, Selection For

Learning and Feature Selection Using the Set Covering Machine with Data-Dependent Rays on Gene Expression Profiles

Artificial Neural Networks in Pattern Recognition - Lecture Notes in Computer Science, 2006, pp. 286-297. DOI: 10.1007/11829898_26

Author(s): Hans A. Kestler, Wolfgang Lindner, André Müller

Keyword(s): Gene Expression, Feature Selection, Expression Profiles, Gene Expression Profiles, Set Covering

Gene Selection via Discretized Gene-Expression Profiles and Greedy Feature-Elimination

Methods and Applications of Artificial Intelligence - Lecture Notes in Computer Science, 2004, pp. 256-266. DOI: 10.1007/978-3-540-24674-9_27. Cited By ~ 18

Author(s): George Potamias, Lefteris Koumakis, Vassilis Moustakis

Keyword(s): Gene Expression, Gene Selection, Expression Profiles, Gene Expression Profiles

Improved Feature Selection by Incorporating Gene Similarity into the LASSO

International Journal of Knowledge Discovery in Bioinformatics, 2012, Vol 3 (1), pp. 1-22. DOI: 10.4018/jkdb.2012010101. Cited By ~ 1

Author(s): Christopher E. Gillies, Xiaoli Gao, Nilesh V. Patel, Mohammad-Reza Siadat, George D. Wilson

Keyword(s): Gene Expression, Feature Selection, Personalized Medicine, Objective Function, Expression Profiles, Gene Expression Profiles, Genetic Profile, Data Set, Coordinate Descent Algorithm, Gene Similarity

Personalized medicine is customizing treatments to a patient’s genetic profile and has the potential to revolutionize medical practice. An important process used in personalized medicine is gene expression profiling. Analyzing gene expression profiles is difficult, because there are usually few patients and thousands of genes, leading to the curse of dimensionality. To combat this problem, researchers suggest using prior knowledge to enhance feature selection for supervised learning algorithms. The authors propose an enhancement to the LASSO, a shrinkage and selection technique that induces parameter sparsity by penalizing a model’s objective function. Their enhancement gives preference to the selection of genes that are involved in similar biological processes. The authors’ modified LASSO selects similar genes by penalizing interaction terms between genes. They devise a coordinate descent algorithm to minimize the corresponding objective function. To evaluate their method, the authors created simulation data where they compared their model to the standard LASSO model and an interaction LASSO model. The authors’ model outperformed both the standard and interaction LASSO models in terms of detecting important genes and gene interactions for a reasonable number of training samples. They also demonstrated the performance of their method on a real gene expression data set from lung cancer cell lines.
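
For orientation, here is a minimal Python sketch of coordinate descent for the standard LASSO, the baseline the abstract above compares against. It minimizes (1/2n)·||y − Xβ||² + λ·||β||₁ using soft-thresholding updates. It does not include the authors' gene-similarity penalty on interaction terms, which the abstract does not specify. The function names and the fixed iteration count are assumptions made for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Standard LASSO via cyclic coordinate descent (X is a list of rows)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta initially all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero feature can never enter the model
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new_beta
    return beta
```

The soft-thresholding step is what produces sparsity: any gene whose correlation with the current residual falls below λ in absolute value receives a coefficient of exactly zero.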

PCA-based unsupervised feature extraction for gene expression analysis of COVID-19 patients

Scientific Reports, 2021, Vol 11 (1). DOI: 10.1038/s41598-021-95698-w

Author(s): Kota Fujisawa, Mamoru Shimo, Y.-H. Taguchi, Shinya Ikematsu, Ryota Miyata

Keyword(s): Gene Expression, Feature Extraction, Target Genes, Gene Selection, Expression Profiles, Gene Expression Profiles, Principal Component, Data Set, Immune Related Genes, Unsupervised Feature Extraction

Abstract: Coronavirus disease 2019 (COVID-19) is raging worldwide. This potentially fatal infectious disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, the complete mechanism of COVID-19 is not well understood. Therefore, we analyzed gene expression profiles of COVID-19 patients to identify disease-related genes through an innovative machine learning method that enables a data-driven strategy for gene selection from a data set with a small number of samples and many candidates. Principal-component-analysis-based unsupervised feature extraction (PCAUFE) was applied to the RNA expression profiles of 16 COVID-19 patients and 18 healthy control subjects. The results identified 123 genes as critical for COVID-19 progression from 60,683 candidate probes, including immune-related genes. The 123 genes were enriched in binding sites for transcription factors NFKB1 and RELA, which are involved in various biological phenomena such as immune response and cell survival: the primary mediator of canonical nuclear factor-kappa B (NF-κB) activity is the heterodimer RelA-p50. The genes were also enriched in histone modification H3K36me3, and they largely overlapped the target genes of NFKB1 and RELA. We found that the overlapping genes were downregulated in COVID-19 patients. These results suggest that canonical NF-κB activity was suppressed by H3K36me3 in COVID-19 patient blood.
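
The general idea of selecting genes from principal component scores without using class labels can be sketched as follows. This is a simplified illustration and not the authors' pipeline. It assumes that genes are scored on the first principal component only, that scores are tested as outliers with a chi-squared test (one degree of freedom), and that Benjamini-Hochberg-adjusted P-values below 0.01 are selected. None of these details is stated in the abstract, and all function names are hypothetical.

```python
from math import erfc, sqrt


def leading_component(Z, n_iter=200):
    """Power iteration for the top eigenvector of Z^T Z (one weight per sample)."""
    m = len(Z[0])
    v = [1.0] + [0.0] * (m - 1)
    for _ in range(n_iter):
        u = [sum(a * b for a, b in zip(row, v)) for row in Z]          # Z v
        w = [sum(Z[i][j] * u[i] for i in range(len(Z))) for j in range(m)]  # Z^T u
        norm = sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v


def pca_feature_selection(X, alpha=0.01):
    """Return indices of genes (rows of X) with outlying first-PC scores."""
    n, m = len(X), len(X[0])
    # Standardize each sample (column) to zero mean and unit variance across genes.
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [sqrt(sum((t - mu) ** 2 for t in c) / n) for c, mu in zip(cols, means)]
    Z = [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in X]
    v = leading_component(Z)
    scores = [sum(a * b for a, b in zip(row, v)) for row in Z]  # one score per gene
    mean_s = sum(scores) / n
    sd = sqrt(sum((s - mean_s) ** 2 for s in scores) / n)
    # P(chi2_1 > z^2) equals the two-sided normal tail erfc(|z| / sqrt(2)).
    pvals = [erfc(abs(s - mean_s) / sd / sqrt(2)) for s in scores]
    # Benjamini-Hochberg adjustment, walking from the largest P-value down.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * n / rank)
        adjusted[i] = running
    return sorted(i for i in range(n) if adjusted[i] < alpha)
```

Because class labels never enter the computation, the selection is unsupervised. In practice one would inspect which component separates patients from controls rather than always taking the first, and would use a linear algebra library instead of hand-written power iteration.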
