Relevant and Non-Redundant Feature Selection for Cancer Classification and Subtype Detection

Cancers ◽  
2021 ◽  
Vol 13 (17) ◽  
pp. 4297
Author(s):  
Pratip Rana ◽  
Phuc Thai ◽  
Thang Dinh ◽  
Preetam Ghosh

Biologists seek to identify a small number of significant features that are important, non-redundant, and relevant from diverse omics data. For example, statistical methods such as LIMMA and DESeq identify genes that are differentially expressed between a case group and a control group from the transcript profile. Researchers also apply various column subset selection algorithms to genomics datasets for a similar purpose. Unfortunately, genes selected by such statistical or machine learning methods are often highly co-regulated, making their performance inconsistent. Here, we introduce a novel feature selection algorithm that selects highly disease-related and non-redundant features from a diverse set of omics datasets. We successfully applied this algorithm to three different biological problems: (a) disease-to-normal sample classification; (b) multiclass classification of different disease samples; and (c) disease subtype detection. In terms of classification ROC-AUC, false-positive rate, and false-negative rate, our algorithm outperformed other gene selection and differential expression (DE) methods on all six types of cancer datasets from TCGA considered here, for both binary and multiclass classification problems. Moreover, genes picked by our algorithm improved the disease subtyping accuracy for four different cancer types over state-of-the-art methods. Hence, we posit that our proposed feature reduction method can help the community solve various problems, including the selection of disease-specific biomarkers, precision medicine design, and disease subtype detection.
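The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of balancing relevance against redundancy. It uses a generic greedy heuristic in the spirit of mRMR, not the authors' method. The function names, the use of Pearson correlation as both the relevance and redundancy measure, and the score "relevance minus mean redundancy" are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def greedy_select(features, labels, k):
    """Pick up to k feature names, trading relevance against redundancy.

    features: dict mapping feature name -> list of values (one per sample)
    labels:   list of numeric class labels (one per sample)
    Hypothetical helper for illustration; not the algorithm from the paper.
    """
    # Relevance: absolute correlation between each feature and the labels.
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, float("-inf")
        for name, vals in features.items():
            if name in selected:
                continue
            # Redundancy: mean absolute correlation with features already chosen.
            if selected:
                redundancy = sum(
                    abs(pearson(vals, features[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: g2 is a scaled copy of g1 (fully redundant); g3 is weaker but distinct.
features = {
    "g1": [1, 2, 3, 4, 5, 6],
    "g2": [2, 4, 6, 8, 10, 12],
    "g3": [0, 1, 0, 1, 0, 1],
}
labels = [0, 0, 0, 1, 1, 1]
chosen = greedy_select(features, labels, 2)
```

In this toy case the co-regulated copy `g2` is skipped in favour of `g3`, even though `g2` is individually more relevant, which is the behaviour the abstract argues plain relevance ranking lacks.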

Author(s):  
João Batista ◽  
Ana Cabral ◽  
Maria Vasconcelos ◽  
Leonardo Vanneschi ◽  
Sara Silva

Genetic Programming (GP) is a powerful Machine Learning (ML) algorithm that can produce readable white-box models. Although successfully used for solving an array of problems in different scientific areas, GP is still not well known in Remote Sensing. The M3GP algorithm, a variant of the standard GP algorithm, performs Feature Construction by evolving hyper-features from the original ones. In this work, we use the M3GP algorithm on several satellite images over different countries to perform binary classification of burnt areas and multiclass classification of land cover types. We add the evolved hyper-features to the reference datasets and observe a significant improvement of the performance of three state-of-the-art ML algorithms (Decision Trees, Random Forests and XGBoost) on the multiclass classification datasets, with no significant effect on the binary classification ones. We show that adding the M3GP hyper-features to the reference datasets brings better results than adding the well-known spectral indices NDVI, NDWI and NBR. We also compare the performance of the M3GP hyper-features in the binary classification problems with those created by other Feature Construction methods like FFX and EFS.
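For readers unfamiliar with the spectral indices named above, here is a small sketch of their standard normalized-difference definitions. The abstract does not say which NDWI variant or which band assignments were used, so the green/NIR form of NDWI and the function names below are assumptions for illustration.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized Difference Water Index, green/NIR form (assumed variant)."""
    return normalized_difference(green, nir)


def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(nir, swir)


# Example reflectance values for one pixel (made-up numbers).
pixel = {"green": 0.08, "red": 0.10, "nir": 0.50, "swir": 0.20}
indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NDWI": ndwi(pixel["green"], pixel["nir"]),
    "NBR": nbr(pixel["nir"], pixel["swir"]),
}
```

Each index is a fixed, hand-designed combination of two bands, whereas the evolved hyper-features described above are combinations learned from the data.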


Big data analysis applications in the field of medical image processing have increased rapidly in recent years. Feature reduction plays a significant role in eliminating irrelevant features and building a successful research model for Big Data applications. Fuzzy clustering is used to segment the nucleus. Various features, including shape, texture, and color-based features, are used to describe the segmented nucleus. The Modified Dominance Soft Set Feature Selection Algorithm (MDSSA) is proposed in this paper to determine the most important features for the classification of leukaemia images. The results of the MDSSA are evaluated using analysis of variance (ANOVA). Of the features extracted from the dataset, the MDSSA selected 17 percent, and these were more promising than those selected by existing reduction algorithms. The proposed approach also reduces the time needed for further analysis of Big Data. The experimental findings confirm that the proposed reduction approach performs better than other approaches.


Author(s):  
Sourav Das ◽  
Anup Kumar Kolya ◽  
Dipankar Das

Twitter-based research for sentiment analysis has been popular for quite some time now. This is used to represent documents in a corpus usually. This increases the time of classification and also increases space complexity. It is hence natural to say that non-redundant feature reduction of the input space will improve the generalization of a classifier. In this approach, the researchers perform feature selection using a Genetic Algorithm (GA), which reduces the set of features to a smaller subset. The researchers also put forward a Genetic Algorithm approach to reduce the modelling complexity and training time of the classification algorithm for 10k Twitter data based on GST. They aim to improve on the classification accuracy obtained in their previous work, and achieved an accuracy of 87% in this work. Hence, the Genetic Algorithm performs feature selection to reduce the complexity of the classifier and improve the accuracy of tweet classification.
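As a rough illustration of GA-based feature selection, the sketch below evolves boolean feature masks against a caller-supplied fitness function. The abstract gives no details of the encoding, operators, or parameters used, so everything here (the name `ga_select`, tournament selection, one-point crossover, two-individual elitism, and the default rates) is an assumption. In practice the fitness would be a classifier's validation accuracy on the masked features.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Evolve a boolean mask over n_features that maximizes fitness(mask).

    Hypothetical sketch: tournament selection, one-point crossover,
    bit-flip mutation, and elitism of the two best individuals.
    """
    rng = random.Random(seed)
    # Random initial population of feature masks.
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        next_pop = ranked[:2]  # elitism: carry over the two best masks unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(ranked, 3), key=fitness)
            p2 = max(rng.sample(ranked, 3), key=fitness)
            # One-point crossover.
            cut = rng.randrange(1, n_features) if n_features > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


def toy_fitness(mask):
    """Toy objective: features 0 and 2 are useful, every kept feature costs 1."""
    return 2 * mask[0] + 2 * mask[2] - sum(mask)


best_mask = ga_select(5, toy_fitness)
```

The per-feature cost in `toy_fitness` stands in for the pressure toward smaller subsets that the abstract describes as reducing classifier complexity and training time.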


Author(s):  
Cheng-San Yang ◽  
Li-Yeh Chuang ◽  
Chao-Hsuan Ke ◽  
Cheng-Hong Yang ◽  
...  

Microarray data on gene expression profiles provide valuable answers to a variety of problems and contribute to advances in clinical medicine. The application of microarray data to the classification of cancer types has recently assumed increasing importance. The classification of microarray data samples involves feature selection, whose goal is to identify subsets of differentially expressed genes potentially relevant for distinguishing sample classes, and classifier design. We propose an efficient evolutionary approach for selecting gene subsets from gene expression data that achieves higher accuracy on classification problems. Our proposal combines a shuffled frog-leaping algorithm (SFLA) and a genetic algorithm (GA), and chooses genes (features) related to classification. The K-nearest neighbor (KNN) method with leave-one-out cross-validation (LOOCV) is used to evaluate classification accuracy. We apply a novel hybrid approach based on SFLA-GA and KNN classification to 11 classification problems from the literature. Experimental results show that the classification accuracy obtained using the selected features was higher than the accuracy obtained on the datasets without feature selection.
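The evaluation step mentioned above, KNN accuracy under leave-one-out cross-validation, can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the squared Euclidean distance, and the default `k=1` are assumptions, and the SFLA-GA search that would propose `feature_idx` is not shown.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k nearest training rows."""
    dists = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, query)), label)
         for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],  # sort by squared Euclidean distance only
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, feature_idx, k=1):
    """Leave-one-out accuracy of KNN restricted to the columns in feature_idx."""
    correct = 0
    for i in range(len(X)):
        # Hold out sample i; train on every other sample.
        train_X = [[row[j] for j in feature_idx]
                   for n, row in enumerate(X) if n != i]
        train_y = [label for n, label in enumerate(y) if n != i]
        query = [X[i][j] for j in feature_idx]
        correct += knn_predict(train_X, train_y, query, k) == y[i]
    return correct / len(X)


# Toy data: column 0 separates the classes, column 1 is misleading noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0], [1.0, 1.2], [1.1, 4.8], [1.2, 0.9]]
y = [0, 0, 0, 1, 1, 1]
acc_informative = loocv_accuracy(X, y, [0])
acc_noise = loocv_accuracy(X, y, [1])
```

A feature-selection search can use `loocv_accuracy` as its fitness value for each candidate subset, which is the role the abstract assigns to KNN with LOOCV.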


Author(s):  
Mohammad Subhi Al-Batah ◽  
Belal Mohammad Zaqaibeh ◽  
Saleh Ali Alomari ◽  
Mowafaq Salem Alzboon

Gene microarray classification problems are considered a challenging task because the datasets contain a small number of samples with a large number of genes (features). Gene subset selection in microarray data plays an important role in minimizing the computational load and solving classification problems. In this paper, the Correlation-based Feature Selection (CFS) algorithm is used in the feature selection process to reduce the dimensionality of the data and find a set of discriminatory genes. Then, the Decision Table, JRip, and OneR classifiers are employed for the classification process. The proposed approach to gene selection and classification is tested on 11 microarray datasets, and the performance on the filtered datasets is compared with that on the original datasets. The experimental results showed that CFS can effectively screen out irrelevant, redundant, and noisy features. In addition, the results for all datasets showed that the proposed approach can achieve high prediction accuracy and fast computational speed with a small number of genes. Considering the average accuracy over all the microarray data analyses, JRip achieved the best result compared with the Decision Table and OneR classifiers. The proposed approach has a remarkable impact on classification accuracy, especially when the data is complicated by multiple classes and a large number of genes.
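To make concrete how CFS screens out redundant features, the sketch below computes the standard CFS merit of a feature subset, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The function name and the choice to pass precomputed correlations are assumptions for illustration; the abstract does not say which correlation measure or search strategy was used.

```python
from math import sqrt


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """CFS merit of a subset of k features.

    feature_class_corrs:   k correlations, one per feature, with the class
    feature_feature_corrs: correlations for each pair of features in the subset
                           (empty when k == 1)
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    mean_cf = sum(abs(r) for r in feature_class_corrs) / k
    if feature_feature_corrs:
        mean_ff = sum(abs(r) for r in feature_feature_corrs) / len(feature_feature_corrs)
    else:
        mean_ff = 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


# Adding a perfectly redundant gene leaves the merit unchanged,
# while adding an equally relevant but uncorrelated gene raises it.
merit_single = cfs_merit([0.8], [])
merit_redundant_pair = cfs_merit([0.8, 0.8], [1.0])
merit_independent_pair = cfs_merit([0.8, 0.8], [0.0])
```

The denominator grows with inter-feature correlation, so a subset of highly correlated genes scores no better than one of them alone.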

