Identification of Potential Biomarkers using Improved Ranked Guided Iterative Feature Elimination

2021 ◽  
Vol 11 (1) ◽  
pp. 35-43
Author(s):  
Wen Xin Ng ◽  
Weng Howe Chan

In healthcare, biomarkers play an important role in disease classification, and many existing works focus on identifying potential biomarkers from gene expression data. However, the large number of redundant features in a high-dimensional dataset such as gene expression introduces bias into the classifier and degrades its performance. Embedded feature selection methods such as ranked guided iterative feature elimination have been widely adopted owing to their good performance in identifying informative features, but they do not consider the redundancy of the features. This paper therefore proposes an improved ranked guided iterative feature elimination method that introduces an additional filter selection step based on minimum redundancy maximum relevance (mRMR) to remove redundant features and retain a relevant feature subset to be ranked and used for classification. Experiments are conducted on two gene expression datasets, for prostate cancer and the central nervous system. Classification performance is measured in terms of accuracy and compared with existing methods, while the biological relevance of the identified features is verified against available knowledge databases. Our method shows improved classification accuracy, and the selected genes were found to be related to the diseases.
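A minimal sketch of the kind of redundancy-aware pre-filtering described here, assuming scikit-learn and synthetic data; the greedy `mrmr_filter` helper and all parameters are illustrative stand-ins, not the authors' implementation:

```python
# Sketch: mRMR-style pre-filtering before a ranked iterative elimination step (here plain RFE).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.svm import LinearSVC

def mrmr_filter(X, y, n_keep):
    """Greedy minimum-redundancy maximum-relevance selection (illustrative)."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < n_keep:
        best, best_score = None, -np.inf
        for j in candidates:
            # Redundancy: mean absolute correlation with already-selected features.
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        candidates.remove(best)
    return np.array(selected)

X, y = make_classification(n_samples=100, n_features=200, n_informative=10, random_state=0)
keep = mrmr_filter(X, y, n_keep=30)                     # filter out redundant features first
rfe = RFE(LinearSVC(max_iter=5000), n_features_to_select=5).fit(X[:, keep], y)
print("selected feature indices:", keep[rfe.support_])
```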

2016 ◽  
Vol 9 (2) ◽  
pp. 106
Author(s):  
Ratri Enggar Pawening ◽  
Tio Darmawan ◽  
Rizqa Raaiqa Bintana ◽  
Agus Zainal Arifin ◽  
Darlis Herumurti

Datasets with heterogeneous features can yield inappropriate feature selection results because it is difficult to evaluate heterogeneous features concurrently. Feature transformation (FT) is one way to handle heterogeneous feature subset selection, but transforming non-numerical features into numerical ones may introduce redundancy with respect to the original numerical features. In this paper, we propose a method to select a feature subset based on mutual information (MI) for classifying heterogeneous features. We use unsupervised feature transformation (UFT) to transform non-numerical features into numerical ones and joint mutual information maximization (JMIM) to select the feature subset while taking the class label into account. The transformed and original features are combined, a feature subset is then determined using JMIM, and the result is classified with a support vector machine (SVM). Classification accuracy is measured for each number of selected features and compared between the UFT-JMIM and Dummy-JMIM methods. Across all experiments in this study, the average classification accuracy achieved by UFT-JMIM is about 84.47% and by Dummy-JMIM about 84.24%. This result shows that UFT-JMIM can minimize information loss between transformed and original features and select a feature subset that avoids redundant and irrelevant features.
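A simplified sketch of the JMIM criterion on discretized features, assuming scikit-learn; an ordinal encoding stands in for UFT and the toy data, bin counts, and helper names are assumptions for illustration only:

```python
# Sketch: JMIM selection over a mix of numeric and (encoded) categorical features, then SVM.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def joint_mi(a, b, y):
    """I((a, b); y) for discrete vectors, obtained by pairing a and b into one variable."""
    pair = a * (b.max() + 1) + b
    return mutual_info_score(pair, y)

def jmim_select(X_disc, y, k):
    relevance = [mutual_info_score(X_disc[:, j], y) for j in range(X_disc.shape[1])]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        rest = [j for j in range(X_disc.shape[1]) if j not in selected]
        # JMIM: pick the feature whose weakest joint MI with any selected feature is largest.
        scores = [min(joint_mi(X_disc[:, j], X_disc[:, s], y) for s in selected) for j in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected

# Toy heterogeneous data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
num = rng.normal(size=(200, 2))
cat = rng.choice(["a", "b", "c"], size=(200, 1))
y = ((num[:, 0] + (cat[:, 0] == "a")) > 0.5).astype(int)
cat_num = OrdinalEncoder().fit_transform(cat)            # stand-in for UFT
X = np.hstack([num, cat_num])
X_disc = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          strategy="uniform").fit_transform(X).astype(int)
subset = jmim_select(X_disc, y, k=2)
print("selected columns:", subset,
      "CV accuracy:", cross_val_score(SVC(), X[:, subset], y, cv=5).mean())
```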


2018 ◽  
Vol 8 (2) ◽  
pp. 1-24 ◽  
Author(s):  
Abdullah Saeed Ghareb ◽  
Azuraliza Abu Bakara ◽  
Qasem A. Al-Radaideh ◽  
Abdul Razak Hamdan

The filtering of large amounts of data is an important process in data mining tasks, particularly for the categorization of unstructured high-dimensional data. A feature selection process is therefore desired to reduce the space of high-dimensional data to a small subset of relevant dimensions that best represent the features for text categorization. In this article, three enhanced filter feature selection methods are proposed: the Category Relevant Feature Measure, the Modified Category Discriminated Measure, and Odd Ratio2. These methods combine relevant information about features both within and across categories. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated using traditional measures of text categorization, namely macro-averaged precision, recall, and F-measure. Experiments are conducted on three Arabic text categorization datasets. The experimental results show that the proposed methods achieve better or comparable results when compared with 12 well-known traditional methods.
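The abstract does not give the formulas of the proposed measures, so the sketch below uses a classic odds-ratio filter as a stand-in for this style of category-based term scoring; the toy documents, labels, and smoothing are assumptions:

```python
# Sketch: an odds-ratio filter over a document-term matrix, followed by Naïve Bayes.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loans offer", "meeting schedule agenda", "win cheap prize offer",
        "project meeting notes", "offer prize now", "agenda for project meeting"]
labels = np.array([1, 0, 1, 0, 1, 0])                      # 1 = target category

vec = CountVectorizer().fit(docs)
M = vec.transform(docs).toarray() > 0                       # binary term presence
pos, neg = M[labels == 1], M[labels == 0]
# Odds ratio per term, with add-one smoothing to avoid division by zero.
p = (pos.sum(0) + 1) / (len(pos) + 2)
q = (neg.sum(0) + 1) / (len(neg) + 2)
odds_ratio = (p * (1 - q)) / (q * (1 - p))

top = np.argsort(odds_ratio)[::-1][:4]                      # keep the 4 highest-scoring terms
terms = np.array(vec.get_feature_names_out())
print("selected terms:", terms[top])
clf = MultinomialNB().fit(M[:, top], labels)
print("training accuracy:", clf.score(M[:, top], labels))
```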


Author(s):  
Damien François

In many applications, such as function approximation, pattern recognition, time series prediction, and data mining, one has to build a model relating features describing the data to some response value. Often, the features that are relevant for building the model are not known in advance. Feature selection methods remove irrelevant and/or redundant features so as to keep only the feature subset that is most useful for building a prediction model. The resulting model is simpler and easier to interpret, reducing the risks of overfitting, non-convergence, etc. By contrast with other dimensionality reduction techniques such as principal component analysis or more recent nonlinear projection techniques (Lee & Verleysen 2007), which build a new, smaller set of features, the features chosen by feature selection methods preserve their initial meaning, potentially bringing extra information about the process being modeled (Guyon 2006). Recently, the advent of high-dimensional data has raised new challenges for feature selection methods, from both the algorithmic and the conceptual point of view (Liu & Motoda 2007). The feature selection problem is exponential in nature, and many approximate algorithms are cubic with respect to the initial number of features, which may be intractable when the dimensionality of the data is large. Furthermore, high-dimensional data are often highly redundant, and two distinct subsets of features may have very similar predictive power, which can make it difficult to identify the best subset.
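A purely didactic sketch of the combinatorial point made above, assuming scikit-learn: an exhaustive search over subsets is infeasible, so a greedy forward wrapper approximates it. The dataset and subset size are arbitrary choices, not tied to the chapter:

```python
# Sketch: greedy forward selection as an approximation to an exponential subset search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=30, n_informative=5, random_state=0)
selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):                                   # pick 5 features greedily
    scores = [cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, selected + [j]], y, cv=3).mean() for j in remaining]
    best = remaining[int(np.argmax(scores))]
    selected.append(best)
    remaining.remove(best)
print("greedy subset:", sorted(selected))
# An exhaustive search would already evaluate C(30, 5) = 142506 subsets of size 5 alone.
```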


Author(s):  
Chaonan Shen ◽  
Kai Zhang

In recent years, evolutionary algorithms have shown great advantages in the field of feature selection because of their simplicity and potential global search capability. However, most existing feature selection algorithms based on evolutionary computation are wrapper methods, which are computationally expensive, especially for high-dimensional biomedical data. To significantly reduce the computational cost, an effective evaluation method is essential. In this paper, a two-stage improved gray wolf optimization (IGWO) algorithm for feature selection on high-dimensional data is proposed. In the first stage, a multilayer perceptron (MLP) network with group lasso regularization terms is trained, and the proposed algorithm solves the resulting integer optimization problem to pre-select features and optimize the hidden layer structure. The dataset is then compressed using the feature subset obtained in the first stage. In the second stage, an MLP network with group lasso regularization terms is retrained on the compressed dataset, and the proposed algorithm is employed to solve the discrete optimization problem for feature selection. Meanwhile, a rapid evaluation strategy is constructed to mitigate the evaluation cost and improve evaluation efficiency during feature selection. The effectiveness of the algorithm was analyzed on ten gene expression datasets. The experimental results show that the proposed algorithm not only removes more than 95.7% of the features in all datasets but also achieves better classification accuracy on the test set. In addition, the advantages of the proposed algorithm in terms of time consumption, classification accuracy, and feature subset size become increasingly prominent as the dimensionality of the feature selection problem increases. This indicates that the proposed algorithm is particularly suitable for solving high-dimensional feature selection problems.
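A much-simplified, single-stage sketch of a binary gray wolf optimizer used as a feature selection wrapper, assuming scikit-learn and synthetic data; it omits the MLP, group lasso, and rapid evaluation strategy of the two-stage IGWO and uses a plain cross-validated k-NN fitness instead:

```python
# Sketch: binary gray wolf optimization for feature selection with a k-NN fitness function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=120, n_features=50, n_informative=8, random_state=0)

def mask_of(p):
    return p > 0.5                                   # threshold continuous position into a mask

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.sum() / mask.size       # small penalty for large subsets

n_wolves, n_iter, dim = 10, 20, X.shape[1]
pos = rng.random((n_wolves, dim))                    # wolf positions in [0, 1]^dim
best_mask, best_fit = None, -np.inf
for t in range(n_iter):
    fit = np.array([fitness(mask_of(p)) for p in pos])
    order = np.argsort(fit)[::-1]
    if fit[order[0]] > best_fit:
        best_mask, best_fit = mask_of(pos[order[0]]).copy(), fit[order[0]]
    leaders = pos[order[:3]].copy()                  # alpha, beta, delta wolves
    a = 2 - 2 * t / n_iter                           # linearly decreasing coefficient
    for i in range(n_wolves):
        new = np.zeros(dim)
        for leader in leaders:
            A = 2 * a * rng.random(dim) - a
            C = 2 * rng.random(dim)
            new += leader - A * np.abs(C * leader - pos[i])
        pos[i] = np.clip(new / 3, 0, 1)
print("features kept:", int(best_mask.sum()), "of", dim, "best fitness:", round(best_fit, 3))
```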


2017 ◽  
Vol 10 (1) ◽  
pp. 15-28 ◽  
Author(s):  
Vimal Kumar Dubey ◽  
Amit Kumar Saxena

A novel hybrid method based on cosine similarity and mutual information is presented to identify a relevant feature subset. First, the supervised cosine similarity of each feature with respect to the class vector is calculated, and features are then grouped according to the obtained similarity values. From each group, the feature with the highest mutual information is selected. The selected feature subset is evaluated with three classifiers, namely Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Classification and Regression Trees (CART), to obtain classification accuracy. The proposed method is applied to various high-dimensional datasets. The results show that the method is capable of eliminating redundant and irrelevant features.
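A minimal sketch of the grouping idea, assuming scikit-learn; the binning rule used to form groups and the dataset are assumptions, since the abstract does not specify the grouping procedure:

```python
# Sketch: group features by cosine similarity to the class vector, then keep the
# most mutually-informative feature from each group.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=40, n_informative=6, random_state=1)
yv = y.astype(float)
cos = np.array([np.dot(X[:, j], yv) / (np.linalg.norm(X[:, j]) * np.linalg.norm(yv))
                for j in range(X.shape[1])])
# Bin the cosine-similarity values into groups (the grouping rule here is illustrative).
groups = np.digitize(cos, bins=np.linspace(cos.min(), cos.max(), 6))
mi = mutual_info_classif(X, y, random_state=1)
selected = [int(np.argmax(np.where(groups == g, mi, -np.inf))) for g in np.unique(groups)]
print("one representative feature per group:", sorted(selected))
```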


2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Deepti Sisodia ◽  
Dilip Singh Sisodia

Purpose: The problem of choosing the most useful features from hundreds of features of time-series user-click data arises in classifying fraudulent publishers in online advertising. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are common, but they neglect correlations among features; wrapper approaches, on the other hand, cannot be applied because of their complexity. Moreover, existing feature selection methods cannot handle such data well, which is one of the major causes of instability in feature selection.
Design/methodology/approach: To overcome these issues, a majority voting-based hybrid feature selection method, feature distillation and accumulated selection (FDAS), is proposed to find the optimal subset of relevant features for analyzing fraudulent publisher conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of relevant feature subsets is carried out to search for an optimal feature subset using effective machine learning (ML) models.
Findings: Empirical results show enhanced classification performance with the proposed features, in terms of average precision, recall, F1-score, and AUC, for publisher identification and classification.
Originality/value: FDAS is evaluated on the FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first with the original features, second with relevant feature subsets selected by feature selection (FS) methods, and third with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
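A hedged sketch of the two-phase idea, assuming scikit-learn and synthetic data: phase 1 takes a majority vote across several standard selectors, phase 2 accumulates the voted features and keeps the subset with the best cross-validated score. The particular selectors, vote threshold, and classifier are illustrative choices, not the FDAS implementation:

```python
# Sketch: majority-vote feature distillation followed by accumulated subset evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=60, n_informative=10, random_state=2)
k = 15
masks = [
    SelectKBest(f_classif, k=k).fit(X, y).get_support(),
    SelectKBest(mutual_info_classif, k=k).fit(X, y).get_support(),
    RFE(LogisticRegression(max_iter=2000), n_features_to_select=k).fit(X, y).support_,
]
votes = np.sum(masks, axis=0)
distilled = np.where(votes >= 2)[0]                  # phase 1: majority vote (>= 2 of 3)

# Phase 2: accumulate features (most-voted first), keep the subset with the best CV score.
order = distilled[np.argsort(votes[distilled])[::-1]]
best_subset, best_score = None, -np.inf
for i in range(1, len(order) + 1):
    score = cross_val_score(LogisticRegression(max_iter=2000),
                            X[:, order[:i]], y, cv=5).mean()
    if score > best_score:
        best_subset, best_score = order[:i], score
print("accumulated subset:", sorted(best_subset.tolist()), "CV accuracy:", round(best_score, 3))
```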


Author(s):  
Hicham Omara ◽  
Mohamed Lazaar ◽  
Youness Tabii

Feature selection attracts researchers who deal with machine learning and data mining. It consists of selecting the variables that have the greatest impact on dataset classification and discarding the rest. This dimensionality reduction allows classifiers to be faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets pre-processed with feature selection methods. An improvement of more than 9% in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.
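A simple illustration of the kind of comparison described, assuming scikit-learn; the dataset, the filter method, and the two classifiers are stand-ins rather than the paper's experimental setup:

```python
# Sketch: classifier accuracy with and without a filter feature-selection step.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)    # keep the 10 top-ranked features
for name, clf in [("kNN", KNeighborsClassifier()),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    full = cross_val_score(clf, X, y, cv=5).mean()
    sel = cross_val_score(clf, X_sel, y, cv=5).mean()
    print(f"{name}: all features {full:.3f} | selected features {sel:.3f}")
```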


2020 ◽  
Vol 11 (2) ◽  
pp. 75-81
Author(s):  
Masithoh Yessi Rochayani ◽  
Umu Sa'adah ◽  
Ani Budi Astuti

This research applied undersampling and gene selection as a starting point for cancer classification in gene expression datasets with high dimensionality and imbalanced classes. It investigated whether performing undersampling before gene selection gives better results than not performing it. Random Undersampling (RUS) was used for undersampling and Lasso for gene selection, and the selected genes were then validated against existing biological knowledge. To explore the effectiveness of applying RUS before gene selection, two gene expression datasets were used. Both datasets consisted of two classes, 1,545 observations, and 10,935 genes, but had different imbalance ratios. The results show that the proposed gene selection pipelines, Lasso and RUS + Lasso, can produce several important biomarkers, and the obtained model has high accuracy; however, the model is complicated because it involves too many genes. The results also indicate that undersampling has little effect when the class imbalance is mild, whereas, when the dataset is highly imbalanced, undersampling can remove a great deal of information from the majority class. The effectiveness of undersampling therefore remains unclear, and simulation studies could be carried out in future research to investigate when undersampling should be applied.
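A sketch of the RUS + Lasso-type pipeline on toy imbalanced data, assuming the imbalanced-learn package is available; an L1-penalized logistic regression stands in for the Lasso gene selector, and all sizes and parameters are illustrative:

```python
# Sketch: random undersampling followed by L1-penalized (lasso-type) gene selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler   # assumes imbalanced-learn is installed

# Toy imbalanced "gene expression" data: many features, few positive samples.
X, y = make_classification(n_samples=500, n_features=1000, n_informative=15,
                           weights=[0.9, 0.1], random_state=0)
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_rus, y_rus)
selected_genes = np.flatnonzero(lasso.coef_[0])           # genes with non-zero coefficients
print("selected genes:", len(selected_genes))
```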


Author(s):  
Bingbing Jiang ◽  
Xingyu Wu ◽  
Kui Yu ◽  
Huanhuan Chen

With increasing data dimensionality, feature selection has become a fundamental task in handling high-dimensional data. Semi-supervised feature selection addresses the problem of learning a relevant feature subset when abundant unlabeled data are available alongside few labeled data. In recent years, many semi-supervised feature selection algorithms have been proposed. However, these algorithms separate the processes of feature selection and classifier training, so they cannot simultaneously select features and learn a classifier with the selected features. Moreover, they ignore differences in reliability among unlabeled samples and use them directly in the training stage, which may cause performance degradation. In this paper, we propose a joint semi-supervised feature selection and classification algorithm (JSFS) that adopts a Bayesian approach to automatically select the relevant features and simultaneously learn a classifier. Instead of using all unlabeled samples indiscriminately, JSFS associates each unlabeled sample with a self-adjusting weight to distinguish between them, which can effectively eliminate irrelevant unlabeled samples by introducing a left-truncated Gaussian prior. Experiments on various datasets demonstrate the effectiveness and superiority of JSFS.
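A much-simplified stand-in for the joint idea, assuming scikit-learn; it is not the JSFS Bayesian model, only a self-training loop in which each unlabeled sample is weighted by the classifier's confidence in its pseudo-label while an L1 penalty performs feature selection during training:

```python
# Sketch: confidence-weighted self-training with an L1-penalized classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=100, n_informative=8, random_state=3)
rng = np.random.default_rng(3)
labeled = rng.choice(len(y), size=40, replace=False)          # only 40 labeled samples
unlabeled = np.setdiff1d(np.arange(len(y)), labeled)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X[labeled], y[labeled])
for _ in range(3):                                            # a few self-training rounds
    proba = clf.predict_proba(X[unlabeled])
    pseudo = proba.argmax(axis=1)
    weight = proba.max(axis=1)                                # reliability weight per sample
    X_all = np.vstack([X[labeled], X[unlabeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    w_all = np.concatenate([np.ones(len(labeled)), weight])
    clf.fit(X_all, y_all, sample_weight=w_all)
print("features with non-zero weights:", int(np.count_nonzero(clf.coef_)))
```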

