Comprior: facilitating the implementation and automated benchmarking of prior knowledge-based feature selection approaches on gene expression data sets

Abstract. Background: Reproducible benchmarking is important for assessing the effectiveness of novel feature selection approaches applied to gene expression data, especially for prior knowledge approaches that incorporate biological information from online knowledge bases. However, no full-fledged benchmarking system exists that is extensible, provides built-in feature selection approaches, and offers a comprehensive result assessment encompassing classification performance, robustness, and biological relevance. Moreover, the particular needs of prior knowledge feature selection approaches, i.e. uniform access to knowledge bases, are not addressed. As a consequence, prior knowledge approaches are not evaluated against each other, leaving open questions regarding their effectiveness. Results: We present the Comprior benchmark tool, which facilitates the rapid development and effortless benchmarking of feature selection approaches, with a special focus on prior knowledge approaches. Comprior is extensible with custom approaches, offers built-in standard feature selection approaches, enables uniform access to multiple knowledge bases, and provides a customizable evaluation infrastructure to compare multiple feature selection approaches regarding their classification performance, robustness, runtime, and biological relevance. Conclusion: Comprior allows reproducible benchmarking, especially of prior knowledge approaches, which facilitates their applicability and for the first time enables a comprehensive assessment of their effectiveness.
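The abstract lists robustness among the assessment criteria without saying how it is measured. One common way to quantify it is the mean pairwise Jaccard index between the feature sets selected on repeated resamples of the same data. The sketch below illustrates that idea only; it is not Comprior's API, and the function names and gene sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets (defined as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_robustness(selections):
    """Mean pairwise Jaccard index over feature sets from repeated runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene sets selected on three resamples of one data set.
runs = [
    {"TP53", "BRCA1", "EGFR"},
    {"TP53", "BRCA1", "MYC"},
    {"TP53", "EGFR", "MYC"},
]
print(selection_robustness(runs))  # each pair shares 2 of 4 genes -> 0.5
```

A value near 1 means the approach selects nearly the same genes on every resample; a value near 0 means the selection is unstable.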


Distributed feature selection (DFS) strategy for microarray gene expression data to improve the classification performance

Clinical Epidemiology and Global Health ◽ 10.1016/j.cegh.2018.04.001 ◽ 2019 ◽ Vol 7 (2) ◽ pp. 171-176 ◽ Cited By ~ 5

Author(s): Sai Prasad Potharaju ◽ M. Sreedevi

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Microarray Gene Expression Data ◽ Classification Performance ◽ Expression Data ◽ Microarray Gene Expression ◽ Microarray Gene


Integrative biomarker detection on high-dimensional gene expression data sets: a survey on prior knowledge approaches

Briefings in Bioinformatics ◽ 10.1093/bib/bbaa151 ◽ 2020 ◽ Cited By ~ 2

Author(s): Cindy Perscheid

Keyword(s): Gene Expression ◽ Prior Knowledge ◽ Gene Expression Data ◽ Gene Selection ◽ Knowledge Bases ◽ Biological Knowledge ◽ Biomarker Detection ◽ Expression Data ◽ Integrative Analyses ◽ Traditional Approaches

Abstract. Gene expression data provide the expression levels of tens of thousands of genes from several hundred samples. These data are analyzed to detect biomarkers that can be of prognostic or diagnostic use. Traditionally, biomarker detection for gene expression data is the task of gene selection: the vast number of genes is reduced to a few relevant ones that achieve the best performance for the respective use case. Traditional approaches select genes based on their statistical significance in the data set. This results in issues of robustness, redundancy and true biological relevance of the selected genes. Integrative analyses typically address these shortcomings by integrating multiple data artifacts from the same objects, e.g. gene expression and methylation data. When only gene expression data are available, integrative analyses instead use curated information on biological processes from public knowledge bases. With knowledge bases providing an ever-increasing amount of curated biological knowledge, such prior knowledge approaches become more powerful. This paper provides a thorough overview of the status quo of biomarker detection on gene expression data with prior biological knowledge. We discuss current shortcomings of traditional approaches, review recent external knowledge bases, provide a classification and qualitative comparison of existing prior knowledge approaches, and discuss open challenges for this kind of gene selection.


Feature Selection for Gene Expression Data Analysis – A Review

International Journal of Psychosocial Rehabilitation ◽ 10.37200/ijpr/v24i5/pr2020695 ◽ 2020 ◽ Vol 24 (5) ◽ pp. 6955-6964

Author(s): Dr. Prema R

Keyword(s): Gene Expression ◽ Feature Selection ◽ Data Analysis ◽ Gene Expression Data ◽ Expression Data ◽ Gene Expression Data Analysis ◽ Selection For


An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽ 10.2174/1386207322666181220124756 ◽ 2019 ◽ Vol 21 (9) ◽ pp. 631-645 ◽ Cited By ~ 5

Author(s): Saeed Ahmed ◽ Muhammad Kabir ◽ Zakir Ali ◽ Muhammad Arif ◽ Farman Ali ◽ ...

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Classification Accuracy ◽ Early Stage ◽ Small Sample Size ◽ Feature Selection Method ◽ Small Sample ◽ Expression Data ◽ Base Function

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosing this deadly disease at an early stage is an exceptionally new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension and a small sample size. It is therefore indispensable to develop efficient and robust feature selection methods that identify a small set of genes to achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches to select highly informative genes. The hybrid model with a radial basis function neural network (RBFNN) classifier was evaluated on 11 benchmark gene expression datasets using 10-fold cross-validation. Results: The experimental results were compared with seven conventional feature selection methods and other methods in the literature, showing that our approach has clear merits in terms of classification accuracy and the genes selected. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimal-sized predictive gene subset.
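The abstract names correlation-based feature selection (CFS) but does not state its scoring rule. As background, the standard CFS heuristic scores a subset of k features by k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below implements only this textbook merit, assuming non-negative correlation measures; it is not the authors' CFS-MOEA method, and the function and parameter names are hypothetical.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Standard CFS merit of a k-feature subset.

    Assumes both averages are non-negative correlation measures
    (e.g. absolute correlations), which keeps the square root well defined.
    """
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    if mean_feature_class_corr < 0 or mean_feature_feature_corr < 0:
        raise ValueError("correlation averages must be non-negative")
    redundancy = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / redundancy


# Four equally relevant genes: uncorrelated ones add information,
# perfectly redundant ones add none beyond a single gene.
print(cfs_merit(4, 0.5, 0.0))  # 1.0
print(cfs_merit(4, 0.5, 1.0))  # 0.5
print(cfs_merit(1, 0.5, 0.0))  # 0.5
```

The merit rewards subsets whose genes correlate with the class label and penalizes subsets whose genes correlate with each other, which is why CFS tends to return small, non-redundant gene sets.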


Improving the Performance of Principal Components for Classification of Gene Expression Data Through Feature Selection

Studies in Classification, Data Analysis, and Knowledge Organization - Data Science and Classification ◽ 10.1007/3-540-34416-0_35 ◽ 2006 ◽ pp. 325-332

Author(s): Edgar Acuña ◽ Jaime Porras

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Principal Components ◽ Expression Data


Ensemble Feature Selection from Cancer Gene Expression Data using Mutual Information and Recursive Feature Elimination

2020 Third International Conference on Advances in Electronics, Computers and Communications (ICAECC) ◽ 10.1109/icaecc50550.2020.9339518 ◽ 2020

Author(s): Nimrita Koul ◽ Sunilkumar S Manvi

Keyword(s): Gene Expression ◽ Feature Selection ◽ Mutual Information ◽ Gene Expression Data ◽ Recursive Feature Elimination ◽ Cancer Gene ◽ Expression Data


A filter feature selection method based LLRFC and redundancy analysis for tumor classification using gene expression data

2016 12th World Congress on Intelligent Control and Automation (WCICA) ◽ 10.1109/wcica.2016.7578590 ◽ 2016 ◽ Cited By ~ 2

Author(s): Jiangeng Li ◽ Xiaodan Li ◽ Wei Zhang

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Redundancy Analysis ◽ Feature Selection Method ◽ Selection Method ◽ Tumor Classification ◽ Expression Data


Inference of Genetic Networks From Time-Series and Static Gene Expression Data: Combining a Random-Forest-Based Inference Method With Feature Selection Methods

Frontiers in Genetics ◽ 10.3389/fgene.2020.595912 ◽ 2020 ◽ Vol 11

Author(s): Shuhei Kimura ◽ Ryo Fukutomi ◽ Masato Tokuhisa ◽ Mariko Okada

Keyword(s): Gene Expression ◽ Feature Selection ◽ Random Forest ◽ Gene Expression Data ◽ Computational Cost ◽ Expression Data ◽ Selection Methods ◽ Inference Method ◽ Combined Application ◽ Inference Methods

Several researchers have focused on random-forest-based inference methods because of their excellent performance. Some of these inference methods also have a useful ability to analyze both time-series and static gene expression data. However, they are only of use in ranking all of the candidate regulations by assigning them confidence values. None have been capable of detecting the regulations that actually affect a gene of interest. In this study, we propose a method to remove unpromising candidate regulations by combining the random-forest-based inference method with a series of feature selection methods. In addition to detecting unpromising regulations, our proposed method uses outputs from the feature selection methods to adjust the confidence values of all of the candidate regulations that have been computed by the random-forest-based inference method. Numerical experiments showed that the combined application with the feature selection methods improved the performance of the random-forest-based inference method on 99 of the 100 trials performed on the artificial problems. However, the improvement tends to be small, since our combined method succeeded in removing only 19% of the candidate regulations at most. The combined application with the feature selection methods moreover makes the computational cost higher. While a bigger improvement at a lower computational cost would be ideal, we see no impediments to our investigation, given that our aim is to extract as much useful information as possible from a limited amount of gene expression data.
