Experimental identification of hard data sets for classification and feature selection methods with insights on method selection

2018 ◽  
Vol 118 ◽  
pp. 41-51 ◽  
Author(s):  
Cuiju Luan ◽  
Guozhu Dong

2013 ◽
Vol 11 (03) ◽  
pp. 1341006
Author(s):  
QIANG LOU ◽  
ZORAN OBRADOVIC

To predict an individual's health status more accurately, clinical applications often require analysis of high-dimensional gene expression data that varies over time. A major challenge in predicting from such temporal microarray data is that the number of biomarkers used as features is typically much larger than the number of labeled subjects. One way to address this challenge is to perform feature selection as a preprocessing step and then apply a classification method on the selected features. However, traditional feature selection methods cannot handle multivariate temporal data without first flattening the temporal data into a single matrix. This study proposes a feature selection filter that selects informative features directly from temporal gene expression data. In our approach, we measure the distance between the multivariate temporal data of two subjects. Based on this distance, we define the objective function of temporal margin-based feature selection, which maximizes each subject's temporal margin in its own relevant subspace. Experimental results on synthetic and two real flu data sets provide evidence that our method outperforms alternatives that flatten the temporal data in advance.
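As a rough illustration of the temporal-margin idea, the sketch below scores features RELIEF-style on data of shape (subjects, time points, features): per-feature distances are computed over whole time trajectories rather than a flattened matrix, and a feature scores well if it separates each subject from its nearest miss more than from its nearest hit. The scoring rule is a simplified stand-in for the paper's objective, not the authors' algorithm.

```python
import numpy as np

def temporal_feature_margin(X, y):
    """RELIEF-style margin scores on temporal data.

    X : (n_subjects, n_timepoints, n_features), y : binary labels.
    Per-feature distances are Euclidean over whole time trajectories,
    so the temporal structure is never flattened away.
    """
    n, t, d = X.shape
    diff = X[:, None, :, :] - X[None, :, :, :]     # (n, n, t, d)
    feat_dist = np.sqrt((diff ** 2).sum(axis=2))   # per-feature distances
    total_dist = feat_dist.sum(axis=2)             # subject-to-subject
    np.fill_diagonal(total_dist, np.inf)

    scores = np.zeros(d)
    for i in range(n):
        same = y == y[i]
        same[i] = False
        hit = np.argmin(np.where(same, total_dist[i], np.inf))
        miss = np.argmin(np.where(~same, total_dist[i], np.inf))
        # A feature helps subject i's margin if it separates i from
        # its nearest miss more than from its nearest hit.
        scores += feat_dist[i, miss] - feat_dist[i, hit]
    return scores / n

# Toy check: feature 0 carries the class signal across time.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 20)
X = rng.normal(size=(20, 5, 30))
X[:, :, 0] += 3.0 * y[:, None]
print(np.argsort(temporal_feature_margin(X, y))[::-1][:5])
```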


Author(s):  
Wei Zheng ◽  
Xiaofeng Zhu ◽  
Yonghua Zhu ◽  
Shichao Zhang

Feature selection is an indispensable preprocessing step for high-dimensional data analysis, but previous feature selection methods usually ignore sample diversity (i.e., each sample contributes individually to model construction) and have limited ability to deal with incomplete data sets in which some training samples have unobserved values. To address these issues, this paper first proposes a robust feature selection framework that relieves the influence of outliers, and then introduces an indicator matrix that keeps unobserved values out of the numerical computation, so that both the proposed framework and existing feature selection frameworks can conduct feature selection on incomplete data sets. We further propose a new optimization algorithm for the resulting objective function and prove that it converges quickly. Experimental results on both real and artificial incomplete data sets demonstrate that the proposed method outperforms the compared feature selection methods in terms of clustering performance.
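A minimal sketch of the indicator-matrix idea: the mask M keeps unobserved entries out of every sum, so no imputation is needed anywhere. The masked variance criterion here is a deliberately simple placeholder; the paper plugs the same mask into richer feature selection objectives.

```python
import numpy as np

def masked_feature_scores(X, M):
    """Masked variance per feature on an incomplete matrix.

    X : (n_samples, n_features) with NaN where unobserved.
    M : same shape, 1 = observed, 0 = missing.
    The indicator keeps unobserved entries out of every sum.
    """
    Xz = np.where(M == 1, X, 0.0)          # neutralize missing entries
    counts = M.sum(axis=0)                 # observed count per feature
    mean = Xz.sum(axis=0) / np.maximum(counts, 1)
    var = (((Xz - mean) ** 2) * M).sum(axis=0) / np.maximum(counts - 1, 1)
    return var

# Usage: ~20% of entries missing, no imputation anywhere.
rng = np.random.default_rng(1)
M = (rng.random((50, 8)) > 0.2).astype(float)
X = np.where(M == 1, rng.normal(size=(50, 8)), np.nan)
print(masked_feature_scores(X, M).round(2))
```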


Author(s):  
MINGXIA LIU ◽  
DAOQIANG ZHANG

As thousands of features are available in many pattern recognition and machine learning applications, feature selection remains an important task for finding the most compact representation of the original data. Although a number of feature selection methods have been developed in the literature, most of them focus on optimizing specific objective functions. In this paper, we first propose a general graph-preserving feature selection framework in which the graphs to be preserved vary in their specific definitions, and show that a number of existing filter-type feature selection algorithms can be unified within this framework. Then, based on the proposed framework, a new filter-type feature selection method called sparsity score (SS) is proposed. This method aims to preserve the structure of a pre-defined l1 graph, which has been shown to be robust to data noise. Here, a modified sparse representation based on an l1-norm minimization problem is used to determine the graph adjacency structure and the corresponding affinity weight matrix simultaneously. Furthermore, a variant of SS called supervised SS (SuSS) is also proposed, in which the l1 graph to be preserved is constructed using only data points from the same class. Experimental results on clustering and classification tasks over a series of benchmark data sets show that the proposed methods achieve better performance than conventional filter-type feature selection methods.
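The sketch below illustrates the two SS ingredients with off-the-shelf tools: an l1 graph built by sparsely coding each sample with the remaining samples (scikit-learn's Lasso as the l1 solver; alpha is an illustrative choice), followed by a Laplacian-score-style criterion that favors features preserving that graph. It is a sketch of the idea, not the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, alpha=0.1):
    """Each sample is sparsely coded by all other samples; the
    absolute coefficients become (symmetrized) edge weights."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        dictionary = np.delete(X, i, axis=0).T      # columns = other samples
        coef = Lasso(alpha=alpha, fit_intercept=False,
                     max_iter=5000).fit(dictionary, X[i]).coef_
        W[i, np.arange(n) != i] = np.abs(coef)
    return (W + W.T) / 2

def sparsity_scores(X, W):
    """Laplacian-score-style criterion: a small score means the
    feature varies smoothly over the l1 graph, i.e. preserves it."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    scores = []
    for f in X.T:
        f = f - f.mean()
        scores.append((f @ L @ f) / max(f @ D @ f, 1e-12))
    return np.array(scores)
```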


2021 ◽  
Author(s):  
Ping Zhang ◽  
Jiyao Sheng ◽  
Wanfu Gao ◽  
Juncheng Hu ◽  
Yonghao Li

Abstract Multi-label feature selection attracts considerable attention in multi-label learning. Information-theoretic multi-label feature selection methods aim to select the most informative features and thereby reduce the uncertainty of the labels. Previous methods treat the amount of label uncertainty as constant. In fact, as features capture the classification information of the label set, the remaining uncertainty of each label changes dynamically. In this paper, we divide labels into two groups: one contains labels with little remaining uncertainty, meaning that most of their classification information has already been obtained by the selected features; the other contains labels with extensive remaining uncertainty, meaning that their classification information has been neglected by the selected features. Feature selection should favor new features that are highly relevant to the labels in the second group. Existing methods do not distinguish between the two groups and ignore the dynamically changing information content of labels. To this end, a relevancy ratio is designed to quantify the dynamically changing information of each label conditioned on the already-selected features. A weighted feature relevancy is then defined to evaluate candidate features. Finally, a new multi-label feature selection method based on weighted feature relevancy (WFRFS) is proposed. Experiments show encouraging results for WFRFS compared with six multi-label feature selection methods on thirteen real-world data sets.
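A greedy sketch of the weighting idea, assuming discretized features and binary labels: each label is weighted by the fraction of its entropy not yet covered by the selected features (standing in for the relevancy ratio), so neglected labels pull harder on the next pick. The additive approximation of remaining uncertainty is an assumption made for brevity, not the paper's exact definition.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def wfrfs_sketch(X, Y, k):
    """Greedy weighted-relevancy selection (sketch).

    X : (n, d) discretized features, Y : (n, m) binary labels.
    A label's weight is the fraction of its entropy not yet covered
    by the selected features (additive approximation, assumed here).
    """
    d, m = X.shape[1], Y.shape[1]
    selected, remaining = [], list(range(d))
    for _ in range(k):
        weights = np.array([
            max(entropy(Y[:, j])
                - sum(mutual_info_score(X[:, s], Y[:, j]) for s in selected),
                0.0)
            for j in range(m)])
        weights /= max(weights.sum(), 1e-12)
        best = max(remaining, key=lambda f: sum(
            weights[j] * mutual_info_score(X[:, f], Y[:, j])
            for j in range(m)))
        selected.append(best)
        remaining.remove(best)
    return selected
```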


Author(s):  
Jia Zhang ◽  
Yidong Lin ◽  
Min Jiang ◽  
Shaozi Li ◽  
Yong Tang ◽  
...  

Information-theoretic methods have attracted great attention in recent years and have achieved promising results on multi-label data with high dimensionality. However, most existing methods are either directly transformed from heuristic single-label feature selection methods or inefficient in exploiting label information, so they may not obtain an optimal feature selection result shared by multiple labels. In this paper, we propose a general global optimization framework that accounts for feature relevance, label relevance (i.e., label correlation), and feature redundancy, thus facilitating multi-label feature selection. Moreover, the proposed method has an effective mechanism for exploiting the inherent properties of multi-label learning. Specifically, we provide a formulation that extends the proposed method with label-specific features. Empirical studies on twenty multi-label data sets demonstrate the effectiveness and efficiency of the proposed method. Our implementation of the proposed method is available online at: https://jiazhang-ml.pub/GRRO-master.zip.
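The following greedy surrogate combines the three ingredients the abstract names: feature-label relevance, label correlation (correlated labels weigh relevance more), and feature redundancy, all estimated with mutual information on discrete data. The paper solves a global optimization; this sequential sketch only mirrors the trade-off, under assumed estimators.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def grro_style_select(X, Y, k):
    """Greedy relevance / label-correlation / redundancy trade-off.

    X : (n, d) discrete features, Y : (n, m >= 2) binary labels.
    """
    d, m = X.shape[1], Y.shape[1]
    # Label correlation: mean MI of each label with the others.
    lc = np.array([np.mean([mutual_info_score(Y[:, j], Y[:, l])
                            for l in range(m) if l != j])
                   for j in range(m)])
    w = 1.0 + lc / max(lc.max(), 1e-12)     # correlated labels weigh more
    rel = np.array([sum(w[j] * mutual_info_score(X[:, f], Y[:, j])
                        for j in range(m)) for f in range(d)])
    selected, remaining = [], list(range(d))
    for _ in range(k):
        def score(f):
            red = (np.mean([mutual_info_score(X[:, f], X[:, s])
                            for s in selected]) if selected else 0.0)
            return rel[f] - red             # penalize redundant picks
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```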


Author(s):  
Weichan Zhong ◽  
Xiaojun Chen ◽  
Guowen Yuan ◽  
Yiqin Li ◽  
Feiping Nie

In this paper, we propose a novel Adaptive Discriminant Analysis for semi-supervised feature selection, namely SADA. Instead of computing fixed similarities before performing feature selection, SADA simultaneously learns an adaptive similarity matrix S and a projection matrix W with an iterative method. In each iteration, S is computed from the projected distances under the learned W, and W is computed under the learned S. SADA can therefore learn a better projection matrix W by weakening the effect of noisy features through the adaptive similarity matrix. Experimental results on 4 data sets show the superiority of SADA over 5 semi-supervised feature selection methods.
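A compact sketch of the alternation described above: S is recomputed from pairwise distances in the current projection (the Gaussian kernel here is an assumption), and W is refit as the directions that keep S-similar samples close (a spectral stand-in for the paper's projection step, not its exact update).

```python
import numpy as np

def sada_sketch(X, n_components, n_iter=10, sigma=1.0, seed=0):
    """Alternate S (similarities in projected space) and W (projection).

    The Gaussian kernel for S and the spectral update for W are
    simplifying assumptions, not the paper's exact updates.
    """
    n, d = X.shape
    W = np.linalg.qr(np.random.default_rng(seed)
                     .normal(size=(d, n_components)))[0]
    for _ in range(n_iter):
        Z = X @ W                                    # current projection
        D2 = ((Z[:, None] - Z[None]) ** 2).sum(-1)   # pairwise sq. distances
        S = np.exp(-D2 / (2 * sigma ** 2))           # adaptive similarities
        np.fill_diagonal(S, 0.0)
        # Refit W: directions keeping S-similar samples close, i.e.
        # the smallest eigenvectors of X^T L X with L the Laplacian.
        L = np.diag(S.sum(axis=1)) - S
        M = X.T @ L @ X + 1e-6 * np.eye(d)
        W = np.linalg.eigh(M)[1][:, :n_components]
    return W, S
```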


2016 ◽  
Vol 15 ◽  
pp. CIN.S38000 ◽  
Author(s):  
Nathan Lawlor ◽  
Alec Fabbri ◽  
Peiyong Guan ◽  
Joshy George ◽  
R. Krishna Murthy Karuturi

Clustering is carried out to identify patterns in transcriptomics profiles and to determine clinically relevant subgroups of patients. Feature (gene) selection is a critical and integral part of this process. Many feature selection and clustering methods exist for identifying relevant genes and clustering samples, but choosing an appropriate methodology is difficult, and many feature selection methods are not supported by available packages. Hence, we developed an integrative R package called multiClust that lets researchers easily experiment with combinations of gene selection and clustering methods. Using multiClust, we identified the best-performing clustering methodology in the context of clinical outcome. Our observations demonstrate that simple methods such as variance-based ranking perform well on the majority of data sets, provided that the appropriate number of genes is selected. However, different gene ranking and selection methods remain relevant, as no single methodology works for all studies.
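multiClust itself is an R package, so the following is a generic Python sketch of the variance-ranking pipeline the abstract highlights, not the package's API: rank genes by variance, keep the top N, and cluster the samples on that subset.

```python
import numpy as np
from sklearn.cluster import KMeans

def variance_rank_and_cluster(expr, n_genes=1000, n_clusters=3, seed=0):
    """Rank genes by variance, keep the top n_genes, cluster samples.

    expr : (n_samples, n_genes_total) expression matrix.
    """
    top = np.argsort(expr.var(axis=0))[::-1][:n_genes]
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(expr[:, top])
    return top, labels
```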


2021 ◽  
Vol 12 ◽  
Author(s):  
David Källberg ◽  
Linda Vidman ◽  
Patrik Rydén

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. It is therefore reasonable to base the clustering on a set of carefully selected features rather than on all features. Several feature selection methods have been proposed, but how and when to use them is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards) that were used only for evaluation. The methods were characterized by the mean expression and standard deviation (SD) of the selected genes, their overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and to two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative controls, and for two data sets the gain was substantial, with ARI increasing from −0.01 and 0.39 to 0.66 and 0.72, respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach of selecting the genes with the highest SDs did not perform well in our study.
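The evaluation loop can be sketched as follows: rank genes by any per-gene statistic (SD as the common baseline, or a dip statistic from an external package such as diptest, an assumed dependency), cluster on the top-ranked genes, and score against the gold-standard subtypes with ARI; the gold labels are used only for scoring.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_selection(expr, gold, score_fn, n_genes=1000, seed=0):
    """Cluster on the top-ranked genes and score against the gold
    standard with ARI; gold labels are used only for evaluation.

    score_fn : any per-gene statistic, e.g. np.std, or a dip
    statistic from an external package such as `diptest`.
    """
    scores = np.apply_along_axis(score_fn, 0, expr)   # one score per gene
    top = np.argsort(scores)[::-1][:n_genes]
    pred = KMeans(n_clusters=len(np.unique(gold)), random_state=seed,
                  n_init=10).fit_predict(expr[:, top])
    return adjusted_rand_score(gold, pred)

# e.g. the SD baseline the study questions:
# ari = evaluate_selection(expr, gold, np.std)
```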


2021 ◽  
Author(s):  
Sandra Plancade ◽  
Magali Berland ◽  
Melisande Blein Nicolas ◽  
Olivier Langella ◽  
Ariane Bassignani ◽  
...  

One of the difficulties encountered in the statistical analysis of metaproteomics data is the high proportion of missing values, which are usually treated by imputation. However, imputation methods rest on restrictive assumptions about the missingness mechanism, namely missing "at random" or "not at random". To circumvent these limitations in the context of feature selection for a multi-class comparison, we propose a univariate selection method that combines a test of association between missingness and classes with a test for differences in observed intensities between classes. This approach implicitly handles both missingness mechanisms. We performed a quantitative and qualitative comparison of our procedure with imputation-based feature selection methods on two experimental data sets. While we observed similar performance in terms of prediction, the feature rankings from the various imputation-based methods diverged strongly. We showed that the combined test reaches a compromise, correlating reasonably well with the other methods.
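A per-feature sketch of the combined test under stated assumptions: a chi-square test for association between missingness and class, a Kruskal-Wallis test on the observed intensities across classes, and Fisher's method to merge the two p-values. The specific tests and the combination rule are plausible stand-ins, not necessarily the authors' exact statistic.

```python
import numpy as np
from scipy import stats

def combined_missingness_test(x, classes):
    """Combined per-feature test on intensities with missing values.

    x : 1-D intensities, NaN = missing; classes : 1-D class labels.
    (1) chi-square: is missingness associated with the classes?
    (2) Kruskal-Wallis: do observed intensities differ by class?
    Fisher's method merges the two p-values (assumed combination).
    """
    x = np.asarray(x, dtype=float)
    classes = np.asarray(classes)
    missing = np.isnan(x)
    labels = np.unique(classes)
    table = [[np.sum(missing[classes == c]), np.sum(~missing[classes == c])]
             for c in labels]
    p_miss = stats.chi2_contingency(table)[1]
    groups = [x[(classes == c) & ~missing] for c in labels]
    p_int = stats.kruskal(*[g for g in groups if g.size > 0])[1]
    stat = -2.0 * (np.log(p_miss) + np.log(p_int))   # Fisher's method
    return stats.chi2.sf(stat, df=4)
```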

