Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either a relevancy or a redundancy measure, and are usually judged through post-selection classification accuracy. In these methods, genes are ranked on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.
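
As a rough illustration of bootstrap-based subject sampling, the sketch below resamples subjects with replacement, ranks the genes in each resample, and records how often each gene lands in the top k. The scoring function (absolute difference of class means), the function names, and the toy data are all assumptions made for illustration. The abstract describes an SVM combined with Maximum Relevance and Minimum Redundancy and a nonparametric test statistic, none of which is reproduced here.

```python
import random

def mean(values):
    return sum(values) / len(values)

def relevance_scores(expr, labels):
    """Toy relevance score per gene: |mean(class 1) - mean(class 0)|.

    expr[i][g] is the expression of gene g in subject i (assumed layout).
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores

def bootstrap_top_k_frequency(expr, labels, k, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each gene ranks in the top k."""
    rng = random.Random(seed)
    n_subjects, n_genes = len(expr), len(expr[0])
    counts = [0] * n_genes
    used = 0
    for _ in range(n_boot):
        # Sample subjects (rows) with replacement.
        idx = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        boot_labels = [labels[i] for i in idx]
        if len(set(boot_labels)) < 2:
            continue  # skip resamples that lost one of the classes
        boot_expr = [expr[i] for i in idx]
        scores = relevance_scores(boot_expr, boot_labels)
        top = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]
        for g in top:
            counts[g] += 1
        used += 1
    return [c / used for c in counts]

# Toy data: gene 0 separates the classes, genes 1 and 2 are noise.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
expr = [[5.0 * y + data_rng.gauss(0, 1), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
        for y in labels]
freq = bootstrap_top_k_frequency(expr, labels, k=1)
```

Genes that stay near the top across resamples are less likely to be artifacts of one particular set of subjects, which is the intuition behind ranking on resampled rather than single datasets.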


Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data

10.20944/preprints202009.0699.v1 ◽

2020 ◽

Author(s):

Samarendra Das ◽

Shesh N. Rai

Keyword(s):

Gene Expression ◽

Statistical Approach ◽

Gene Selection ◽

Statistical Significance ◽

High Dimensional ◽

Support Vector ◽

Expression Data ◽

Test Statistic ◽

Biologically Relevant ◽

Selection Of


Gene Selection using a Hybrid RFE Along with LASSO for Cancer Classification

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1096.1291s419 ◽

2019 ◽

Vol 9 (1S3) ◽

pp. 83-96

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Selection ◽

Classification Model ◽

Recursive Feature Elimination ◽

Support Vector ◽

Expression Data ◽

Cancer Prediction ◽

Selection Operator ◽

Selection Of

Gene expression profiling using microarray technology is carried out with chip-based methods. Studying gene expression data is helpful for understanding and detecting various diseases. Recently, cancer prediction using gene expression data has become a promising area in bioinformatics. Using all the gene attributes of the samples does not necessarily give efficient classification. To overcome this limitation, a robust method is required for selecting the relevant gene features so that the classification model can be built effectively. Least absolute shrinkage and selection operator (LASSO) and recursive feature elimination (RFE) are automatic gene feature selection methods used for classification. In our proposed work, we use these two methods as a hybrid for selecting the features, which are then passed to a support vector machine (SVM) for classification. The hybrid performed best when compared with existing techniques on performance measures evaluated on six publicly available cancer datasets. It also gives good insight into the selection of features.


On the Performance of Variable Selection and Classification via Rank-Based Classifier

Mathematics ◽

10.3390/math7050457 ◽

2019 ◽

Vol 7 (5) ◽

pp. 457 ◽

Cited By ~ 1

Author(s):

Md Sarker ◽

Michael Pokojovy ◽

Sangjin Kim

Keyword(s):

Gene Expression ◽

Logistic Regression ◽

Gene Expression Data ◽

Gene Selection ◽

Laplacian Matrix ◽

Adaptive Lasso ◽

High Dimensional ◽

Expression Data ◽

Correlation Pattern ◽

Future Outcomes

In high-dimensional gene expression data analysis, the accuracy and reliability of cancer classification and the selection of important genes play a crucial role. To identify these important genes and predict future outcomes (tumor vs. non-tumor), various methods have been proposed in the literature, but only a few of them take into account correlation patterns and grouping effects among the genes. In this article, we propose a rank-based modification of the popular penalized logistic regression procedure based on a combination of ℓ1 and ℓ2 penalties capable of handling possible correlation among genes in different groups. While the ℓ1 penalty maintains sparsity, the ℓ2 penalty induces smoothness based on the information from the Laplacian matrix, which represents the correlation pattern among genes. We combined logistic regression with the BH-FDR (Benjamini and Hochberg false discovery rate) screening procedure and a newly developed rank-based selection method to come up with an optimal model retaining the important genes. Through simulation studies and a real-world application to high-dimensional colon cancer gene expression data, we demonstrated that the proposed rank-based method outperforms currently popular methods such as lasso, adaptive lasso, and elastic net when applied to both gene selection and classification.
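
The BH-FDR screening step mentioned above can be sketched on its own. The function below is a minimal implementation of the standard Benjamini and Hochberg step-up procedure. The function name and the p-values are made up for illustration, and how the screening is coupled to the penalized logistic regression in the paper is not shown.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
selected = benjamini_hochberg(p_values, alpha=0.05)
```

Because the rule is step-up, a gene whose p-value misses its own threshold is still retained when a gene with a larger p-value meets its threshold.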


Deep learning-based gene selection in comprehensive gene analysis in pancreatic cancer

Scientific Reports ◽

10.1038/s41598-021-95969-6 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yasukuni Mori ◽

Hajime Yokota ◽

Isamu Hoshino ◽

Yosuke Iwatate ◽

Kohei Wakamatsu ◽

...

Keyword(s):

Gene Expression ◽

Pancreatic Cancer ◽

Deep Learning ◽

Gene Selection ◽

Pancreatic Tissue ◽

Expression Data ◽

Tissue Samples ◽

Normal Pancreatic Tissue ◽

Model Training ◽

Selection Of

The selection of genes that are important for obtaining gene expression data is challenging. Here, we developed a deep learning-based feature selection method suitable for gene selection. Our novel deep learning model includes an additional feature-selection layer. After model training, the units in this layer with high weights correspond to the genes that worked effectively in the processing of the networks. Cancer tissue samples and adjacent normal pancreatic tissue samples were collected from 13 patients with pancreatic ductal adenocarcinoma during surgery and subsequently frozen. After processing, gene expression data were extracted from the specimens using RNA sequencing. Task 1 for the model training was to discriminate between cancerous and normal pancreatic tissue in six patients. Task 2 was to discriminate between patients with pancreatic cancer (n = 13) who survived for more than one year after surgery and those who did not. The most frequently selected genes were ACACB, ADAMTS6, NCAM1, and CADPS in Task 1, and CD1D, PLA2G16, DACH1, and SOWAHA in Task 2. According to The Cancer Genome Atlas dataset, these genes are all prognostic factors for pancreatic cancer. Thus, the feasibility of using our deep learning-based method for the selection of genes associated with pancreatic cancer development and prognosis was confirmed.


Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1055 ◽

2004 ◽

Vol 3 (1) ◽

pp. 1-29 ◽

Cited By ~ 58

Author(s):

Jörg Rahnenführer ◽

Francisco S Domingues ◽

Jochen Maydt ◽

Thomas Lengauer

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Biological Networks ◽

Permutation Test ◽

Statistical Significance ◽

Data Sets ◽

Expression Data ◽

Biologically Relevant ◽

Gene Sets ◽

Best Fitting

We present a statistical approach to scoring changes in activity of metabolic pathways from gene expression data. The method identifies the biologically relevant pathways with corresponding statistical significance. Based on gene expression data alone, only local structures of genetic networks can be recovered. Instead of inferring such a network, we propose a hypothesis-based approach. We use given knowledge about biological networks to improve sensitivity and interpretability of findings from microarray experiments. Recently introduced methods test if members of predefined gene sets are enriched in a list of top-ranked genes in a microarray study. We improve this approach by defining scores that depend on all members of the gene set and that also take pairwise co-regulation of these genes into account. We calculate the significance of co-regulation of gene sets with a nonparametric permutation test. On two data sets the method is validated and its biological relevance is discussed. It turns out that useful measures for co-regulation of genes in a pathway can be identified adaptively. We refine our method in two aspects specific to pathways. First, to overcome the ambiguity of enzyme-to-gene mappings for a fixed pathway, we introduce algorithms for selecting the best fitting gene for a specific enzyme in a specific condition. In selected cases, functional assignment of genes to pathways is feasible. Second, the sensitivity of detecting relevant pathways is improved by integrating information about pathway topology. The distance of two enzymes is measured by the number of reactions needed to connect them, and enzyme pairs with a smaller distance receive a higher weight in the score calculation.
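
A nonparametric permutation test for gene set co-regulation can be sketched as follows. This is a minimal example under stated assumptions: the score is the mean absolute pairwise Pearson correlation, and the null distribution comes from random gene sets of the same size. The paper's actual scores, its permutation scheme, and its topology weighting may differ. All names and the toy data are hypothetical.

```python
import random
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def coregulation_score(profiles, gene_set):
    """Mean absolute pairwise correlation among the genes in gene_set."""
    pairs = list(combinations(gene_set, 2))
    return sum(abs(pearson(profiles[a], profiles[b])) for a, b in pairs) / len(pairs)

def permutation_p_value(profiles, gene_set, n_perm=500, seed=0):
    """Share of random same-size gene sets scoring at least as high as gene_set."""
    rng = random.Random(seed)
    observed = coregulation_score(profiles, gene_set)
    all_genes = sorted(profiles)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(gene_set))
        if coregulation_score(profiles, random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

# Toy data: 30 genes over 12 conditions; g0..g3 share one expression pattern.
data_rng = random.Random(42)
pattern = [data_rng.gauss(0, 1) for _ in range(12)]
profiles = {}
for g in range(30):
    if g < 4:
        profiles[f"g{g}"] = [v + data_rng.gauss(0, 0.2) for v in pattern]
    else:
        profiles[f"g{g}"] = [data_rng.gauss(0, 1) for _ in range(12)]
pathway = ["g0", "g1", "g2", "g3"]
p = permutation_p_value(profiles, pathway)
```

The score uses every member of the gene set rather than only the top-ranked genes, which is the property the abstract emphasizes.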


Gene Selection Using High Dimensional Gene Expression Data: An Appraisal

Current Bioinformatics ◽

10.2174/1574893611666160610104946 ◽

2018 ◽

Vol 13 (3) ◽

pp. 225-233 ◽

Cited By ~ 7

Author(s):

Abhishek Bhola ◽

Shailendra Singh

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Selection ◽

High Dimensional ◽

Expression Data


CASCADING SVMS AS A TOOL FOR MEDICAL DIAGNOSIS USING MULTI-CLASS GENE EXPRESSION DATA

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213006002709 ◽

2006 ◽

Vol 15 (03) ◽

pp. 335-352

Author(s):

ILIAS N. FLAOUNAS ◽

DIMITRIS K. IAKOVIDIS ◽

DIMITRIS E. MAROULIS

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Medical Diagnosis ◽

Gene Selection ◽

Selection Process ◽

Support Vector ◽

Specific Gene ◽

Processing Unit ◽

Expression Data ◽

Medical Diagnostic

In this paper we propose a novel Support Vector Machine-based architecture for medical diagnosis using multi-class gene expression data. It consists of a pre-processing unit and N-1 sequentially ordered blocks capable of classifying N classes in a cascading manner. Each block embodies both a gene selection and a classification module. It offers the flexibility of constructing block-specific gene expression spaces and hypersurfaces for the discrimination of the different classes. The proposed architecture was applied to medical diagnostic tasks including prostate and lung cancer diagnosis. Its performance was evaluated using a leave-one-out cross-validation approach, which avoids the bias introduced by the gene selection process. The results show that it provides high accuracy, which in most cases exceeds the accuracy achieved by the popular one-vs-one and one-vs-all SVM combination schemes and by Nearest-Neighbor classifiers. The cascading SVMs can be successfully applied as a medical diagnostic tool.
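
The control flow of such a cascade can be sketched in a few lines. In this illustration each block carries its own gene subset and a binary decision function. The threshold functions are placeholders standing in for trained SVMs, and the names `Block` and `cascade_predict` are hypothetical. The pre-processing unit and the training of each block are not shown.

```python
from typing import Callable, List, Sequence, Tuple

# A block is (class label, indices of the genes it uses, binary decision function).
Block = Tuple[str, Sequence[int], Callable[[List[float]], bool]]

def cascade_predict(blocks: List[Block], fallback_label: str, sample: List[float]) -> str:
    """Pass the sample through the blocks in order; the first block to accept wins."""
    for label, gene_idx, is_member in blocks:
        features = [sample[i] for i in gene_idx]  # block-specific gene space
        if is_member(features):
            return label
    return fallback_label  # the N-th class needs no block of its own

# Three classes, hence two blocks. Thresholds are placeholders for trained SVMs.
blocks: List[Block] = [
    ("A", [0], lambda f: f[0] > 1.0),
    ("B", [1, 2], lambda f: f[0] + f[1] > 1.0),
]
predictions = [
    cascade_predict(blocks, "C", [2.0, 0.0, 0.0]),
    cascade_predict(blocks, "C", [0.0, 0.8, 0.7]),
    cascade_predict(blocks, "C", [0.0, 0.0, 0.0]),
]
```

Since each block only separates one class from the classes that remain, N classes need N-1 blocks, and each block can select a different set of genes.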


A COMPARATIVE STUDY ON GENE SELECTION METHODS FOR TISSUES CLASSIFICATION ON LARGE SCALE GENE EXPRESSION DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8843 ◽

2016 ◽

Vol 78 (5-10) ◽

Author(s):

Farzana Kabir Ahmad

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Large Scale ◽

Gene Selection ◽

Support Vector ◽

Breast Cancer Dataset ◽

Expression Data ◽

Selection Methods ◽

Normal Tissues

Deoxyribonucleic acid (DNA) microarray technology is a recent invention that provides colossal opportunities to measure gene expression on a large scale simultaneously. However, interpreting large-scale gene expression data remains a challenging issue due to its innate "high dimensional, low sample size" nature. Microarray data mainly involve thousands of genes, n, in a very small sample, p, which complicates the data analysis process. For this reason, feature selection methods, also known as gene selection methods, have become necessary to select significant genes that present the maximum discriminative power between cancerous and normal tissues. Feature selection methods can be divided into three basic categories: (a) filter methods, (b) wrapper methods, and (c) embedded methods. Among these, filter gene selection methods provide an easy way to identify the informative genes and can simply reduce large-scale microarray datasets. Although filter-based gene selection techniques have been commonly used in analyzing microarray datasets, these techniques have been tested separately in different studies. Therefore, this study aims to investigate and compare the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG), and t-test, in selecting informative genes that can distinguish cancerous from normal tissues. In this experiment, a common classifier, the Support Vector Machine (SVM), is trained on the selected genes. These gene selection methods are tested on three large-scale gene expression datasets, namely a breast cancer dataset, a colon dataset, and a lung dataset. This study found that IG and SNR are more suitable for use with SVM. Furthermore, this study showed that SVM performance remained largely unaffected unless a very small number of genes was selected.
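
Three of the four filter scores named above can be written down compactly. The sketch below uses common textbook definitions of the signal-to-noise ratio, the Fisher criterion, and a Welch-style t statistic. The study may use different variants, and Information Gain is omitted because it requires discretizing the expression values. The function names and the toy data are assumptions.

```python
from statistics import mean, stdev, variance

def snr(pos, neg):
    """Signal-to-noise ratio: (mean1 - mean2) / (sd1 + sd2)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

def fisher_criterion(pos, neg):
    """Fisher criterion: (mean1 - mean2)^2 / (var1 + var2)."""
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))

def welch_t(pos, neg):
    """Welch's t statistic for two samples with unequal variances."""
    se = (variance(pos) / len(pos) + variance(neg) / len(neg)) ** 0.5
    return (mean(pos) - mean(neg)) / se

def rank_genes(expr, labels, score):
    """Gene indices sorted by decreasing absolute score."""
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == 0]
        scores.append(abs(score(pos, neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)

# Toy data: rows are tissue samples, columns are genes; label 1 = cancer, 0 = normal.
expr = [
    [5.0, 1.0], [6.0, 2.0], [7.0, 1.5],   # cancer
    [1.0, 1.2], [2.0, 1.8], [3.0, 1.4],   # normal
]
labels = [1, 1, 1, 0, 0, 0]
cancer_g0 = [row[0] for row, y in zip(expr, labels) if y == 1]
normal_g0 = [row[0] for row, y in zip(expr, labels) if y == 0]
ranking = rank_genes(expr, labels, snr)
```

A filter method scores each gene independently of the classifier, so the top-ranked genes from `rank_genes` could then be handed to any classifier, such as an SVM.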
