Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach

Author(s):  
Soha Ahmed ◽  
Mengjie Zhang ◽  
Lifeng Peng
2020 ◽  
Vol 36 (16) ◽  
pp. 4423-4431
Author(s):  
Wenbo Xu ◽  
Yan Tian ◽  
Siye Wang ◽  
Yupeng Cui

Abstract

Motivation: The classification of high-throughput protein data based on mass spectrometry (MS) is of great practical significance in medical diagnosis. MS data are generally characterized by high dimensionality, which leads to prohibitive computational cost. To address this problem, one-bit compressed sensing (CS), an extreme case of quantized CS, has been applied to MS data to select a small number of important features. Although it achieves a remarkable reduction in computational complexity, the current one-bit CS method does not account for the unavoidable noise in MS datasets and does not exploit the inherent structure of the underlying MS data.

Results: We propose two feature selection (FS) methods based on one-bit CS to deal with the noise and the underlying block-sparsity of the features, respectively. In the first method, the FS problem is modeled as a perturbed one-bit CS problem, where the perturbation represents the noise in MS data. By iterating between perturbation refinement and FS, this method selects the significant features from noisy data. The second method formulates the problem as a perturbed one-bit block CS problem and selects the features block by block. Block extraction is motivated by the observation that the significant features found by the first method usually cluster in groups. Experiments show that the two proposed methods achieve better classification performance on real MS data than the existing method, and that the second method outperforms the first.

Availability and implementation: The source code of our methods is available at: https://github.com/tianyan8023/OBCS.

Supplementary information: Supplementary data are available at Bioinformatics online.
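To make the one-bit CS view of feature selection more concrete, the following is a minimal sketch and not the authors' method: it treats the class labels as one-bit (sign) measurements and ranks features by a simple back-projection score. The perturbation refinement and block-wise selection described in the abstract are not implemented. The function name `one_bit_select`, the synthetic data, and the planted feature indices are all assumptions made for illustration.

```python
import random

def one_bit_select(X, y, k):
    """Return the indices of the k features with the largest |sum_i y_i * X[i][j]|.

    X: list of samples (rows), each a list of feature intensities.
    y: list of class labels in {-1, +1}, treated as one-bit measurements.
    This is a plain back-projection estimate, used here as a stand-in.
    """
    n_features = len(X[0])
    # Correlate every feature column with the sign labels.
    scores = [sum(label * row[j] for row, label in zip(X, y))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(ranked[:k])

# Synthetic demo (hypothetical data): three planted features determine the label.
rng = random.Random(0)
n_samples, n_features = 400, 50
true_support = [3, 17, 42]
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [1 if sum(row[j] for j in true_support) >= 0 else -1 for row in X]

selected = one_bit_select(X, y, k=3)
print(selected)
```

On noise-free synthetic data like this, the planted features would be expected to dominate the scores; real MS data are noisy, which is the gap the perturbed formulations in the abstract are meant to address.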


Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel in that it naturally combines two mutually dependent steps. Feature selection is the key first step in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier, such as linear regression, support vector machines, or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation-type Wilcoxon-based criterion for all final submissions. The method was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
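The abstract does not define the "separation type Wilcoxon-based criterion", so the sketch below substitutes the standard Wilcoxon rank-sum statistic (normal approximation, average ranks for ties, no tie correction in the variance) as a stand-in for ranking features between two classes. The function names and the toy data are hypothetical.

```python
def rank_sum_z(pos, neg):
    """z-score of the Wilcoxon rank-sum statistic for the `pos` group.

    Uses average ranks for tied values and the normal approximation;
    the variance is not corrected for ties.
    """
    values = sorted([(v, 1) for v in pos] + [(v, 0) for v in neg],
                    key=lambda t: t[0])
    n1, n2 = len(pos), len(neg)
    rank_sum = 0.0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j][0] == values[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(flag for _, flag in values[i:j])
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    return (rank_sum - mean) / var ** 0.5

def select_features(X, y, k):
    """Return the indices of the k features with the largest |z| (labels in {0, 1})."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(rank_sum_z(pos, neg)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data (hypothetical): only feature 0 separates the two classes.
X = [
    [5.1, 0.2, 3.00],
    [4.9, 0.8, 3.10],
    [5.3, 0.5, 2.90],
    [1.0, 0.6, 3.20],
    [1.2, 0.3, 2.80],
    [0.9, 0.7, 3.05],
]
y = [1, 1, 1, 0, 0, 0]

top = select_features(X, y, k=1)
print(top)
```

In a leave-one-out setting, a ranking like this would be recomputed inside each fold on the training samples only, so that the held-out sample does not influence which features are kept.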


Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 317 ◽  
Author(s):  
Vincenzo Dentamaro ◽  
Donato Impedovo ◽  
Giuseppe Pirlo

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known “curse of dimensionality” problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. The method is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
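The abstract does not give enough detail to reproduce LICIC, so the following sketch illustrates only the interpolation idea behind SMOTE, one of the baselines it names: each synthetic instance is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like`, its parameters, and the toy points are assumptions for illustration; this is not a reference implementation of SMOTE.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of feature vectors from the under-represented class.
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing samples, it stays inside the region already spanned by the minority class. Linear interpolation of this kind may not respect non-linear structure in the data, which is the limitation the abstract says LICIC is designed to address.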

