A consensus multi-view multi-objective gene selection approach for improved sample classification

Sudipta Acharya; Laizhong Cui; Yi Pan

doi:10.1186/s12859-020-03681-5

A consensus multi-view multi-objective gene selection approach for improved sample classification

BMC Bioinformatics ◽

10.1186/s12859-020-03681-5 ◽

2020 ◽

Vol 21 (S13) ◽

Author(s):

Sudipta Acharya ◽

Laizhong Cui ◽

Yi Pan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Classification Accuracy ◽

Gene Selection ◽

Data Sets ◽

Expression Data ◽

Data Set ◽

Multi Objective ◽

Sample Classification ◽

Redundant Genes

Abstract Background In the field of computational biology, analyzing complex data helps to extract relevant biological information. Sample classification of gene expression data is one such popular bio-data analysis technique. However, the presence of a large number of irrelevant/redundant genes in expression data makes a sample classification algorithm working inefficiently. Feature selection is one such high-dimensionality reduction technique that helps to maximize the effectiveness of any sample classification algorithm. Recent advances in biotechnology have improved the biological data to include multi-modal or multiple views. Different ‘omics’ resources capture various equally important biological properties of entities. However, most of the existing feature selection methodologies are biased towards considering only one out of multiple biological resources. Consequently, some crucial aspects of available biological knowledge may get ignored, which could further improve feature selection efficiency. Results In this present work, we have proposed a Consensus Multi-View Multi-objective Clustering-based feature selection algorithm called CMVMC. Three controlled genomic and proteomic resources like gene expression, Gene Ontology (GO), and protein-protein interaction network (PPIN) are utilized to build two independent views. The concept of multi-objective consensus clustering has been applied within our proposed gene selection method to satisfy both incorporated views. Gene expression data sets of Multiple tissues and Yeast from two different organisms (Homo Sapiens and Saccharomyces cerevisiae, respectively) are chosen for experimental purposes. As the end-product of CMVMC, a reduced set of relevant and non-redundant genes are found for each chosen data set. These genes finally participate in an effective sample classification. Conclusions The experimental study on chosen data sets shows that our proposed feature-selection method improves the sample classification accuracy and reduces the gene-space up to a significant level. In the case of Multiple Tissues data set, CMVMC reduces the number of genes (features) from 5565 to 41, with 92.73% of sample classification accuracy. For Yeast data set, the number of genes got reduced to 10 from 2884, with 95.84% sample classification accuracy. Two internal cluster validity indices - Silhouette and Davies-Bouldin (DB) and one external validity index Classification Accuracy (CA) are chosen for comparative study. Reported results are further validated through well-known biological significance test and visualization tool.

Multi-view feature selection for identifying gene markers: a diversified biological data driven approach

BMC Bioinformatics ◽

10.1186/s12859-020-03810-0 ◽

2020 ◽

Vol 21 (S18) ◽

Author(s):

Sudipta Acharya ◽

Laizhong Cui ◽

Yi Pan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Selection ◽

Marker Gene ◽

Biological Data ◽

Protein Interaction Data ◽

Marker Genes ◽

Data Sets ◽

Gene Markers ◽

Multi Objective

Abstract Background In recent years, to investigate challenging bioinformatics problems, the utilization of multiple genomic and proteomic sources has become immensely popular among researchers. One such issue is feature or gene selection and identifying relevant and non-redundant marker genes from high dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm exploiting knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases with applications in specific epidemiology for a particular population. Results In the current article, we design the feature selection and marker gene detection as a multi-view multi-objective clustering problem. Regarding that, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence) along with gene expression values are collectively utilized to design two different views. UMVMO-select aims to reduce gene space without/minimally compromising the sample classification efficiency and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets. Conclusion A thorough comparative analysis has been performed with five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. Obtained results reveal the supremacy of the proposed method. Reported results are also validated through a proper biological significance test and heatmap plotting.

Hybrid Feature Selection Algorithm mRMR-ICA for Cancer Classification from Microarray Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180601074349 ◽

2018 ◽

Vol 21 (6) ◽

pp. 420-430 ◽

Cited By ~ 4

Author(s):

Shuaiqun Wang ◽

Wei Kong ◽

Aorigele ◽

Jin Deng ◽

Shangce Gao ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Classification Accuracy ◽

Microarray Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Microarray Gene Expression ◽

Redundant Genes ◽

Microarray Gene

Aims and Objective: Redundant information of microarray gene expression data makes it difficult for cancer classification. Hence, it is very important for researchers to find appropriate ways to select informative genes for better identification of cancer. This study was undertaken to present a hybrid feature selection method mRMR-ICA which combines minimum redundancy maximum relevance (mRMR) with imperialist competition algorithm (ICA) for cancer classification in this paper. Materials and Methods: The presented algorithm mRMR-ICA utilizes mRMR to delete redundant genes as preprocessing and provide the small datasets for ICA for feature selection. It will use support vector machine (SVM) to evaluate the classification accuracy for feature genes. The fitness function includes classification accuracy and the number of selected genes. Results: Ten benchmark microarray gene expression datasets are used to test the performance of mRMR-ICA. Experimental results including the accuracy of cancer classification and the number of informative genes are improved for mRMR-ICA compared with the original ICA and other evolutionary algorithms. Conclusion: The comparison results demonstrate that mRMR-ICA can effectively delete redundant genes to ensure that the algorithm selects fewer informative genes to get better classification results. It also can shorten calculation time and improve efficiency.

Methods for Gene Selection and Classification of Microarray Dataset

Handbook of Research on Biomimicry in Information Retrieval and Knowledge Management - Advances in Web Technologies and Engineering ◽

10.4018/978-1-5225-3004-6.ch004 ◽

2018 ◽

pp. 66-77

Author(s):

Mekour Norreddine

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Information Gain ◽

Microarray Dataset ◽

Data Sets ◽

Expression Data ◽

Wrapper Approach ◽

Filter Approach

One of the problems that gene expression data resolved is feature selection. There is an important process for choosing which features are important for prediction; there are two general approaches for feature selection: filter approach and wrapper approach. In this chapter, the authors combine the filter approach with method ranked information gain and wrapper approach with a searching method of the genetic algorithm. The authors evaluate their approach on two data sets of gene expression data: Leukemia, and the Central Nervous System. The classifier Decision tree (C4.5) is used for improving the classification performance.

An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220124756 ◽

2019 ◽

Vol 21 (9) ◽

pp. 631-645 ◽

Cited By ~ 5

Author(s):

Saeed Ahmed ◽

Muhammad Kabir ◽

Zakir Ali ◽

Muhammad Arif ◽

Farman Ali ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Classification Accuracy ◽

Early Stage ◽

Small Sample Size ◽

Feature Selection Method ◽

Small Sample ◽

Expression Data ◽

Base Function

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosis of this deadly disease at an early stage is exceptionally new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension with small sample size. Therefore, the development of efficient and robust feature selection methods is indispensable that identify a small set of genes to achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches which select the highly informative genes. The hybrid model with Redial base function neural network (RBFNN) classifier has been evaluated on 11 benchmark gene expression datasets by employing a 10-fold cross-validation test. Results: The experimental results are compared with seven conventional-based feature selection and other methods in the literature, which shows that our approach owned the obvious merits in the aspect of classification accuracy ratio and some genes selected by extensive comparing with other methods. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimal sized predictive gene subset.

Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

BioMed Research International ◽

10.1155/2019/2497509 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 1

Author(s):

Suyan Tian ◽

Chi Wang ◽

Bing Wang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Knowledge ◽

Expression Data ◽

Selection Methods ◽

Its Gene ◽

Active Research

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, a pathway-based feature selection is firstly divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods being regarded as a special case of pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential utilizations of pathway-guided gene selection in one active research avenue, namely, to analyze longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

BMC Genomics ◽

10.1186/s12864-020-07038-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Da Xu ◽

Jialin Zhang ◽

Hanxiao Xu ◽

Yusen Zhang ◽

Wei Chen ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Therapeutic Targets ◽

Genomic Data ◽

Data Sets ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Model Learning ◽

Data Set ◽

Multi Scale

Abstract Background The small number of samples and the curse of dimensionality hamper the better application of deep learning techniques for disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from being satisfactory due to their limitation in using unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. In the meantime, complex genomic data brought great challenges for the identification of biomarkers and therapeutic targets. The current some feature selection methods have the problem of low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, network recognition ensemble algorithm and feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that higher prediction results can be attained by identified biomarkers on the independent LUAD data set, and we also structured a drug-target network which may be good for LUAD therapy. Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.

CLASSIFYING TEMPORAL MICROARRAY DATA BY SELECTING INFORMATIVE GENES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013410060 ◽

2013 ◽

Vol 11 (03) ◽

pp. 1341006

Author(s):

QIANG LOU ◽

ZORAN OBRADOVIC

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Data ◽

Data Sets ◽

Temporal Data ◽

Expression Data ◽

Selection Methods ◽

Temporal Gene Expression ◽

Single Matrix

In order to more accurately predict an individual's health status, in clinical applications it is often important to perform analysis of high-dimensional gene expression data that varies with time. A major challenge in predicting from such temporal microarray data is that the number of biomarkers used as features is typically much larger than the number of labeled subjects. One way to address this challenge is to perform feature selection as a preprocessing step and then apply a classification method on selected features. However, traditional feature selection methods cannot handle multivariate temporal data without applying techniques that flatten temporal data into a single matrix in advance. In this study, a feature selection filter that can directly select informative features from temporal gene expression data is proposed. In our approach, we measure the distance between multivariate temporal data from two subjects. Based on this distance, we define the objective function of temporal margin based feature selection to maximize each subject's temporal margin in its own relevant subspace. The experimental results on synthetic and two real flu data sets provide evidence that our method outperforms the alternatives, which flatten the temporal data in advance.

Defining Immune Response Signatures in DLBCL As Potential Predictive Biomarkers for Outcome to Immunotherapy

Blood ◽

10.1182/blood.v126.23.2663.2663 ◽

2015 ◽

Vol 126 (23) ◽

pp. 2663-2663

Author(s):

Matthew A Care ◽

Stephen M Thirdborough ◽

Andrew J Davies ◽

Peter W.M. Johnson ◽

Andrew Jack ◽

...

Keyword(s):

Gene Expression ◽

Immune Response ◽

Network Analysis ◽

Gene Expression Data ◽

Research Funding ◽

Data Sets ◽

Expression Data ◽

Data Set ◽

Gene Correlation ◽

Cancer Types

Abstract Purpose To assess whether comparative gene network analysis can reveal characteristic immune response signatures that predict clinical response in Diffuse large B-cell lymphoma (DLBCL). Background The wealth of available gene expression data sets for DLBCL and other cancer types provides a resource to define recurrent pathological processes at the level of gene expression and gene correlation neighbourhoods. This is of particular relevance in the context of cancer immune responses, where convergence onto common patterns may drive shared gene expression profiles. Where existing and novel immunotherapies harness the immune response for therapeutic benefit such responses may provide predictive biomarkers. Methods We independently analysed publically available DLBCL gene expression data sets and a wide compendium of gene expression data from diverse cancer types, and then asked whether common elements of cancer host response could be identified from resulting networks. Using 10 DLBCL gene expression data sets, encompassing 2030 cases, we established pairwise gene correlation matrices per data set, which were merged to generate median correlations of gene pairs across all data sets. Gene network analysis and unsupervised clustering was then applied to define global representations of DLBCL gene expression neighbourhoods. In parallel a diverse range of solid and lymphoid malignancies including; breast, colorectal, oesophageal, head and neck, non-small cell lung, prostate, pancreatic cancer, Hodgkin lymphoma, Follicular lymphoma and DLBCL were independently analysed using an orthogonal weighted gene correlation network analysis of gene expression data sets from which correlated modules across diverse cancer types were identified. The biology of resulting gene neighbourhoods was assessed by signature and ontology enrichment, and the overlap between gene correlation neighbourhoods and WGCNA derived modules associated with immune/host responses was analysed. Results Amongst DLBCL data, we identified distinct gene correlation neighbourhoods associated with the immune response. These included both elements of IFN-polarised responses, core T-cell, and cytotoxic signatures as well as distinct macrophage responses. Neighbourhoods linked to macrophages separated CD163 from CD68 and CD14. In the WGCNA analysis of diverse cancer types clusters corresponding to these immune response neighbourhoods were independently identified including a highly similar cluster related to CD163. The overlapping CD163 clusters in both analyses linked to diverse Fc-Receptors, complement pathway components and patterns of scavenger receptors potentially linked to alternative macrophage activation. The relationship between the CD163 macrophage gene expression cluster and outcome was tested in DLBCL data sets, identifying a poor response in CD163 -cluster high patients, which reached statistical significance in one data set (GSE10846). Notably, the effect of the CD163-associated gene neighbourhood which correlates with poor outcome post rituximab containing immunochemotherapy is distinct from the effect of IFNG-STAT1-IRF1 polarised cytotoxic responses. The latter represents the predominant immune response pattern separating cell of origin unclassifiable (Type-III) DLBCL from either ABC or GCB DLBCL subsets, and is associated with a trend toward positive outcome. Conclusion Comparative gene expression network analysis identifies common immune response signatures shared between DLBCL and other cancer types. Gene expression clusters linked to CD163 macrophage responses and IFNG-STAT1-IRF1 polarised cytotoxic responses are common patterns with apparent divergent outcome association. Disclosures Davies: CTI: Honoraria; GIlead: Consultancy, Honoraria, Research Funding; Mundipharma: Honoraria, Research Funding; Bayer: Research Funding; Takeda: Honoraria, Research Funding; Janssen: Honoraria, Research Funding; Roche: Honoraria, Research Funding; GSK: Research Funding; Pfizer: Honoraria; Celgene: Honoraria, Research Funding. Jack:Jannsen: Research Funding.

Tanimoto Coefficient Similarity based Mean Shift Gentle Adaptive Boosted Clustering for Genomic Predictive Pattern Analytics

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3817.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 2034-2042

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Mean Shift ◽

Expression Data ◽

Data Set ◽

Tanimoto Coefficient ◽

Functional Relationships ◽

Boosting Method ◽

Similar Gene

Gene expression data clustering is a significant problem to be resolved as it provides functional relationships of genes in a biological process. Finding co-expressed groups of genes is a challenging problem. To identify interesting patterns from the given gene expression data set, a Tanimoto Coefficient Similarity based Mean Shift Gentle Adaptive Boosted Clustering (TCS-MSGABC) Model is proposed. TCS-MSGABC model comprises two processes namely feature selection and clustering. In first process, Tanimoto Coefficient Similarity Measurement based Feature selection (TCSM-FS) is introduced to identify relevant gene features based on the similarity value for performing the genomic expression clustering. Tanimoto Coefficient Similarity Value ranges from ‘ ’ to ‘ ’ where ‘ ’ is highest similarity. The gene feature with higher similarity value is taken to perform clustering process. After feature selection, Mean Shift Gentle Adaptive Boosted Clustering (MSGABC) algorithm is carried out in TCS-MSGABC model to cluster the similar gene expression data based on the selected features. The MSGABC algorithm is a boosting method for combining the many weak clustering results into one strong learner. By this way, the similar gene expression data are clustered with higher accuracy with minimal time. Experimental evaluation of TCS-MSGABC model is carried out on factors such as clustering accuracy, clustering time and error rate with respect to number of gene data. The experimental results show that the TCS-MSGABC model is able to increases the clustering accuracy and also minimizes clustering time of genomic predictive pattern analytics as compared to state-of-the-art works.

Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method

Bioinformatics ◽

10.1093/bioinformatics/17.12.1131 ◽

2001 ◽

Vol 17 (12) ◽

pp. 1131-1142 ◽

Cited By ~ 392

Author(s):

L. Li ◽

C. R. Weinberg ◽

T. A. Darden ◽

L. G. Pedersen

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Selection ◽

Expression Data ◽

Sample Classification ◽

Selection For