A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data

Background: Correctly classifying cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and for the personalized treatment of cancer patients. In recent years, classification of cancer subtypes using deep neural networks and gene expression data has become a research hotspot. However, most classifiers are prone to overfitting and low classification accuracy when dealing with small-sample, high-dimensional biological data. Results: In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model is proposed for the classification of cancer subtypes. The model is a cascading flexible neural forest that uses the deep flexible neural forest (DFNForest) as its base classifier. A hierarchical broadening ensemble method is proposed, which ensures the robustness of the classification results and avoids, as far as possible, wasting model structure and function. An output judgment mechanism is introduced at each layer of the forest to reduce the computational complexity of the model. The deep neural forest is extended to a densely connected deep neural forest to improve the prediction results. Experiments on RNA-seq gene expression data show that LACFNForest performs better than conventional methods in classifying cancer subtypes. Conclusion: The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach to the structural design of ensemble learning classifiers.
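The abstract does not say how the output judgment mechanism works. One plausible reading is an early exit: once a layer is confident enough about a sample, deeper layers are skipped. The sketch below illustrates that reading only. `cascade_predict`, the `threshold` parameter, and the two toy layers are hypothetical names, and each layer sees the original features plus the previous layer's class probabilities rather than the densely connected variant the paper describes.

```python
from typing import Callable, List, Sequence, Tuple

# A "layer" maps a feature vector to a list of class probabilities.
Layer = Callable[[Sequence[float]], List[float]]


def cascade_predict(
    x: Sequence[float], layers: Sequence[Layer], threshold: float = 0.9
) -> Tuple[int, int]:
    """Return (predicted_class, number_of_layers_used).

    Assumption: a sample leaves the cascade as soon as some layer's top
    class probability reaches `threshold`.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    features = list(x)
    probs: List[float] = []
    for depth, layer in enumerate(layers, start=1):
        probs = layer(features)
        if max(probs) >= threshold:
            # Confident enough: skip the remaining layers.
            return probs.index(max(probs)), depth
        # Augment the original features with this layer's output.
        features = list(x) + probs
    return probs.index(max(probs)), len(layers)


# Toy stand-ins for trained forests (hypothetical, for illustration only).
def weak_layer(f: Sequence[float]) -> List[float]:
    return [0.4, 0.6] if f[0] > 0 else [0.6, 0.4]


def strong_layer(f: Sequence[float]) -> List[float]:
    return [0.05, 0.95] if f[0] > 0 else [0.95, 0.05]
```

With these toy layers, a sample that the first layer classifies confidently never reaches the second layer, which is where the saving in computation would come from.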

A Cascade Flexible Neural Forest Model for Cancer Subtypes Classification on Gene Expression Data

Computational Intelligence and Neuroscience ◽ 10.1155/2021/6480456 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-11

Author(s): Lianxin Zhong ◽ Qingfang Meng ◽ Yuehui Chen

Keyword(s): Gene Expression ◽ Sample Size ◽ Gene Expression Data ◽ Small Sample Size ◽ Small Sample ◽ Expression Data ◽ Cancer Subtypes ◽ Subtype Classification ◽ Cancer Subtype

The correct classification of cancer subtypes is of great significance for the in-depth study of cancer pathogenesis and for the accurate treatment of cancer patients. In recent years, the classification of cancer subtypes using deep neural networks and gene expression data has become a hot topic. However, most classifiers are prone to overfitting and low classification accuracy when dealing with small-sample, high-dimensional biological data. In this paper, the Cascade Flexible Neural Forest (CFNForest) model is proposed for cancer subtype classification. CFNForest extends the traditional flexible neural tree (FNT) structure to an FNT Group Forest by means of a bagging ensemble strategy, and it can generate the model's structure and parameters automatically. To deepen the FNT Group Forest without introducing new hyperparameters, a multilayer cascade framework is used, which transforms features between levels and improves the performance of the model. CFNForest also improves operational efficiency and robustness through a sample selection mechanism between layers and by assigning different weights to the output of each layer. FNT Group Forests built on different feature sets enrich the structural diversity of the model, which makes it more suitable for small-sample datasets. Experiments on RNA-seq gene expression data show that CFNForest effectively improves the accuracy of cancer subtype classification and that the classification results are robust.
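The abstract mentions assigning different weights to the output of each layer but does not say how the weights are chosen or combined. As a minimal sketch, assuming the final class probabilities are a normalized weighted average of the per-layer probabilities (the function name and the example weights are hypothetical):

```python
from typing import List, Sequence


def combine_layer_outputs(
    layer_probs: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-layer class-probability vectors.

    Assumption: weights are non-negative and are normalized here so the
    result is still a probability vector.
    """
    if not layer_probs or len(layer_probs) != len(weights):
        raise ValueError("need one weight per layer output")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = [0.0] * len(layer_probs[0])
    for probs, w in zip(layer_probs, weights):
        for c, p in enumerate(probs):
            combined[c] += w * p / total
    return combined
```

For example, giving a deeper layer three times the weight of the first layer pulls the combined prediction toward the deeper layer's output.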

COMBINING GENERALIZED NMF AND DISCRIMINATIVE MIXTURE MODELS FOR CLASSIFICATION OF GENE EXPRESSION DATA

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001408006892 ◽ 2008 ◽ Vol 22 (08) ◽ pp. 1587-1598 ◽ Cited By ~ 3

Author(s): WEIXIANG LIU ◽ KEHONG YUAN ◽ JIAN WU ◽ DATIAN YE ◽ ZHEN JI ◽ ...

Keyword(s): Gene Expression ◽ Mixture Model ◽ Gene Expression Data ◽ Small Sample Size ◽ Data Classification ◽ Small Sample ◽ Training Data ◽ Microarray Data Analysis ◽ Expression Data

Classification of gene expression samples is a core task in microarray data analysis. Reducing thousands of genes to a manageable representation and selecting a suitable classifier are the two key issues in gene expression data classification. This paper introduces a framework that combines feature extraction and classification. Given the non-negativity, high dimensionality and small sample size of the data, we apply a discriminative mixture model designed for non-negative gene expression data classification, using non-negative matrix factorization (NMF) for dimension reduction. To enhance the sparseness of the training data for fast learning of the mixture model, a generalized NMF is also adopted. Experimental results on several real gene expression datasets show that classification accuracy, stability and decision quality can be significantly improved by using the generalized method, and that the proposed method performs better than some previously reported results on the same datasets.
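For orientation, the standard NMF problem and the widely used multiplicative update rules are recalled below. This is the basic Frobenius-norm formulation only, not the generalized NMF variant the paper adopts, whose exact form the abstract does not give. Here V is the non-negative expression matrix, r is the reduced dimension, and the symbol names are chosen for illustration.

```latex
\[
\min_{W \ge 0,\; H \ge 0} \; \lVert V - WH \rVert_F^2,
\qquad
V \in \mathbb{R}_{\ge 0}^{m \times n},\;
W \in \mathbb{R}_{\ge 0}^{m \times r},\;
H \in \mathbb{R}_{\ge 0}^{r \times n}
\]
\[
H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{V H^{\top}}{W H H^{\top}}
\]
```

Here \(\odot\) and the fraction bars denote element-wise multiplication and division. Because every factor in the updates is non-negative, W and H stay non-negative throughout.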

An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data

Combinatorial Chemistry & High Throughput Screening ◽ 10.2174/1386207322666181220124756 ◽ 2019 ◽ Vol 21 (9) ◽ pp. 631-645 ◽ Cited By ~ 5

Author(s): Saeed Ahmed ◽ Muhammad Kabir ◽ Zakir Ali ◽ Muhammad Arif ◽ Farman Ali ◽ ...

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Classification Accuracy ◽ Early Stage ◽ Small Sample Size ◽ Feature Selection Method ◽ Small Sample ◽ Expression Data ◽ Base Function

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosing this deadly disease at an early stage is a new clinical application of microarray data. In DNA microarray technology, gene expression data have high dimensionality and a small sample size. It is therefore indispensable to develop efficient and robust feature selection methods that identify a small set of genes and achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and a Multi-Objective Evolutionary Algorithm (MOEA) to select highly informative genes. The hybrid model, paired with a radial basis function neural network (RBFNN) classifier, was evaluated on 11 benchmark gene expression datasets using 10-fold cross-validation. Results: The experimental results were compared with seven conventional feature selection methods and with other methods in the literature. The comparison shows that our approach has clear advantages in classification accuracy and in the number of genes selected. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy on six of the eleven datasets with a minimally sized predictive gene subset.
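The abstract does not spell out the CFS criterion. In its commonly cited form, CFS scores a subset of k features by rewarding correlation with the class and penalizing redundancy among the features. The sketch below computes that merit from correlations that are assumed to have been estimated already. The function and parameter names are illustrative, the use of absolute values is an assumption, and the MOEA search that the paper combines with CFS is not shown.

```python
import math
from typing import Sequence


def cfs_merit(
    feature_class_corrs: Sequence[float], mean_feature_feature_corr: float
) -> float:
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean feature-class correlation of the subset and r_ff the
    mean feature-feature correlation; magnitudes are used here.
    """
    k = len(feature_class_corrs)
    if k == 0:
        raise ValueError("subset must contain at least one feature")
    r_cf = sum(abs(r) for r in feature_class_corrs) / k
    r_ff = abs(mean_feature_feature_corr)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Two genes that each correlate 0.5 with the class score higher as a pair when they are uncorrelated with each other than when they are perfectly redundant, which is the behaviour that steers the search toward small, informative gene subsets.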

A Novel Deep Flexible Neural Forest Model for Classification of Cancer Subtypes Based on Gene Expression Data

IEEE Access ◽ 10.1109/access.2019.2898723 ◽ 2019 ◽ Vol 7 ◽ pp. 22086-22095 ◽ Cited By ~ 8

Author(s): Jing Xu ◽ Peng Wu ◽ Yuehui Chen ◽ Qingfang Meng ◽ Hussain Dawood ◽ ...

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Expression Data ◽ Cancer Subtypes ◽ Forest Model

BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data

BMC Bioinformatics ◽ 10.1186/s12859-018-2095-4 ◽ 2018 ◽ Vol 19 (S5) ◽ Cited By ~ 21

Author(s): Yang Guo ◽ Shuhui Liu ◽ Zhanhuai Li ◽ Xuequn Shang

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Expression Data ◽ Cancer Subtypes ◽ Forest Model ◽ Deep Forest

Application of Sparse Bayesian Generalized Linear Model to Gene Expression Data for Classification of Prostate Cancer Subtypes

Open Journal of Statistics ◽ 10.4236/ojs.2014.47049 ◽ 2014 ◽ Vol 04 (07) ◽ pp. 518-526 ◽ Cited By ~ 4

Author(s): Behrouz Madahian ◽ Lih Y. Deng ◽ Ramin Homayouni

Keyword(s): Gene Expression ◽ Prostate Cancer ◽ Linear Model ◽ Gene Expression Data ◽ Generalized Linear Model ◽ Expression Data ◽ Cancer Subtypes ◽ Bayesian Generalized Linear Model

Variance-based Feature Selection for Classification of Cancer Subtypes Using Gene Expression Data

2018 International Joint Conference on Neural Networks (IJCNN) ◽ 10.1109/ijcnn.2018.8489279 ◽ 2018

Author(s): Aedan G. K. Roberts ◽ Daniel R. Catchpoole ◽ Paul J. Kennedy

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Expression Data ◽ Cancer Subtypes ◽ Selection For

Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique

International Journal of Molecular Sciences ◽ 10.3390/ijms19113398 ◽ 2018 ◽ Vol 19 (11) ◽ pp. 3398

Author(s): Yuanting Yan ◽ Tao Dai ◽ Meili Yang ◽ Xiuquan Du ◽ Yiwen Zhang ◽ ...

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Incomplete Data ◽ Missing Values ◽ Small Sample Size ◽ Heuristic Method ◽ Small Sample ◽ Expression Data ◽ Best First Search ◽ The Impact

(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years. Recent studies show that these imputation algorithms make little difference to classification. Some scholars therefore believe that selecting informative genes for downstream classification matters more than how MVs are imputed. However, most feature-selection (FS) algorithms require prior imputation, and the impact of prior MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can handle incomplete data directly, without any imputation method or missing-data assumption. The most informative genes are selected through a threshold. After that, a best-first search strategy is used to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset affects subsequent FS in terms of classification tasks. By conducting FS directly on incomplete data, our method avoids the disturbances that MV imputation can cause in subsequent FS procedures. An experiment on the small round blue cell tumor (SRBCT) dataset showed that our method found additional genes beyond the many genes it shares with the two existing methods compared.
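The abstract does not define the modified chi-square test or recursive element aggregation. To illustrate only the general idea of scoring a gene without imputing it, the sketch below computes the ordinary Pearson chi-square statistic between a discretized gene and the class label, using just the samples where the gene was observed. The function name, the `None` marker for missing values, and the `"hi"`/`"lo"` discretization are assumptions made for this example, not the authors' method.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def chi_square_observed(
    values: Sequence[Optional[Hashable]], labels: Sequence[Hashable]
) -> float:
    """Pearson chi-square of gene level vs. class, skipping missing entries."""
    # Keep only samples where this gene has an observed (non-None) value.
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    row = Counter(v for v, _ in pairs)   # counts per gene level
    col = Counter(y for _, y in pairs)   # counts per class
    cell = Counter(pairs)                # joint counts
    stat = 0.0
    for v, row_count in row.items():
        for y, col_count in col.items():
            expected = row_count * col_count / n
            stat += (cell[(v, y)] - expected) ** 2 / expected
    return stat
```

Genes would then be ranked by this score and kept if they exceed a chosen threshold, in the spirit of the thresholding step described above.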

Feature Selection Using Approximate Conditional Entropy Based on Fuzzy Information Granule for Gene Expression Data Classification

Frontiers in Genetics ◽ 10.3389/fgene.2021.631505 ◽ 2021 ◽ Vol 12

Author(s): Hengyi Zhang

Keyword(s): Gene Expression ◽ Feature Selection ◽ Gene Expression Data ◽ Large Scale ◽ Small Sample Size ◽ Conditional Entropy ◽ Small Sample ◽ Expression Data ◽ Fuzzy Information ◽ Information Granule

Classification is widely used in gene expression data analysis. Feature selection is usually performed before classification because of the large number of genes and the small sample size in gene expression data. In this article, a novel feature selection algorithm using approximate conditional entropy based on fuzzy information granules is proposed, and the correctness of the method is proved through the monotonicity of entropy. First, the fuzzy relation matrix is established using a Laplacian kernel. Second, the approximately equal relation on fuzzy sets is defined. Third, the approximate conditional entropy based on fuzzy information granules and the importance of internal attributes are defined. Approximate conditional entropy can measure the uncertainty of knowledge from two perspectives, information theory and algebra. Finally, a greedy algorithm based on the approximate conditional entropy is designed for feature selection. Experimental results on six large-scale gene datasets show that the algorithm not only greatly reduces the dimension of the gene datasets but also outperforms five state-of-the-art algorithms in classification accuracy.
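The first step, building a fuzzy relation matrix with a Laplacian kernel, can be sketched as follows. The abstract does not give the exact kernel form or bandwidth, so this example assumes the common definition exp(-d / sigma) with d the L1 (Manhattan) distance between samples. `laplacian_fuzzy_relation` and `sigma` are illustrative names.

```python
import math
from typing import List, Sequence


def laplacian_fuzzy_relation(
    samples: Sequence[Sequence[float]], sigma: float = 1.0
) -> List[List[float]]:
    """Pairwise similarity in (0, 1] from a Laplacian kernel.

    Entry [i][j] is exp(-||x_i - x_j||_1 / sigma); the L1 distance and the
    bandwidth `sigma` are assumptions of this sketch.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(samples)
    relation = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = sum(abs(a - b) for a, b in zip(samples[i], samples[j]))
            relation[i][j] = math.exp(-dist / sigma)
    return relation
```

The resulting matrix is reflexive (ones on the diagonal) and symmetric, the two properties a fuzzy similarity relation is normally expected to satisfy.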

Multi-view based integrative analysis of gene expression data for identifying biomarkers

Scientific Reports ◽ 10.1038/s41598-019-49967-4 ◽ 2019 ◽ Vol 9 (1) ◽ Cited By ~ 3

Author(s): Zi-Yi Yang ◽ Xiao-Ying Liu ◽ Jun Shu ◽ Hui Zhang ◽ Yan-Qiong Ren ◽ ...

Keyword(s): Gene Expression ◽ Machine Learning ◽ Gene Expression Data ◽ Gene Selection ◽ Small Sample Size ◽ Small Sample ◽ Integrative Analysis ◽ Expression Data ◽ Microarray Technology ◽ Classification Problems

The widespread application of microarray technology has produced a vast quantity of publicly available gene expression datasets. However, analyzing gene expression data with biostatistics and machine learning approaches is challenging because of (1) high noise; (2) small sample size with high dimensionality; (3) batch effects; and (4) low reproducibility of significant biomarkers. These issues reveal the complexity of gene expression data and significantly obstruct the use of microarray technology in clinical applications. Integrative analysis offers an opportunity to address these issues and provides a more comprehensive understanding of biological systems, but current methods have several limitations. This work leverages state-of-the-art machine learning for the integration of multiple gene expression datasets, classification, and identification of significant biomarkers. We design a novel integrative framework, MVIAm (Multi-View based Integrative Analysis of microarray data for identifying biomarkers). It applies multiple cross-platform normalization methods to aggregate multiple datasets into a multi-view dataset and uses a robust learning mechanism, Multi-View Self-Paced Learning (MVSPL), for gene selection in cancer classification problems. We demonstrate the capabilities of MVIAm using simulated data and studies of breast cancer and lung cancer. It can be applied flexibly and is an effective tool for facing the four challenges of gene expression data analysis. Our proposed model makes microarray integrative analysis more systematic and expands its range of applications.
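The abstract does not describe how MVSPL weights samples. In basic single-view self-paced learning with the hard regularizer, a sample takes part in the current training round only if its loss is below an "age" parameter, which is raised over time so that harder samples enter later. The sketch below shows that basic rule only; the multi-view coupling used by MVSPL is not represented, and `self_paced_weights` and `lam` are illustrative names.

```python
from typing import List, Sequence


def self_paced_weights(losses: Sequence[float], lam: float) -> List[int]:
    """Hard self-paced weights: 1 for 'easy' samples (loss < lam), else 0."""
    return [1 if loss < lam else 0 for loss in losses]
```

Raising `lam` between rounds gradually admits noisier samples, which is how self-paced schemes aim to limit the influence of high-noise data early in training.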
