Feature selection and classification of noisy proteomics mass spectrometry data based on one-bit perturbed compressed sensing

2020 ◽  
Vol 36 (16) ◽  
pp. 4423-4431
Author(s):  
Wenbo Xu ◽  
Yan Tian ◽  
Siye Wang ◽  
Yupeng Cui

Abstract

Motivation: The classification of high-throughput protein data based on mass spectrometry (MS) is of great practical significance in medical diagnosis. MS data are generally high-dimensional, which leads to a prohibitive computational cost. To address this problem, one-bit compressed sensing (CS), an extreme case of quantized CS, has been applied to MS data to select a low-dimensional set of important features. Although it greatly reduces computational complexity, the current one-bit CS method does not account for the unavoidable noise in MS datasets and does not exploit the inherent structure of the underlying MS data.

Results: We propose two feature selection (FS) methods based on one-bit CS to deal with the noise and the underlying block-sparsity of the features, respectively. In the first method, the FS problem is modeled as a perturbed one-bit CS problem, where the perturbation represents the noise in the MS data. By iterating between perturbation refinement and FS, this method selects the significant features from noisy data. The second method formulates the problem as a perturbed one-bit block CS problem and selects the features block by block. Block extraction is motivated by the observation that the significant features found by the first method usually cluster in groups. Experiments show that the two proposed methods achieve better classification performance on real MS data than the existing method, and that the second method outperforms the first.

Availability and implementation: The source code of our methods is available at: https://github.com/tianyan8023/OBCS.

Supplementary information: Supplementary data are available at Bioinformatics online.
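As a rough illustration of feature selection by one-bit CS, the sketch below treats the class labels as one-bit measurements y = sign(Xw) of a sparse weight vector w and keeps the K features with the largest |X^T y|. This is a simple back-projection estimator chosen for illustration only; it is not the perturbed or block-wise method proposed in the paper, whose code is in the linked repository. The toy data and the names `one_bit_select` and `make_toy_data` are assumptions.

```python
import random


def one_bit_select(X, y, k):
    """Return the (sorted) indices of the k features with the largest |X^T y|.

    X is a list of samples (each a list of feature values) and y holds the
    class labels as +1 / -1, i.e. the one-bit measurements.
    """
    n_features = len(X[0])
    # Back-project the sign measurements onto every feature column.
    scores = [sum(row[j] * label for row, label in zip(X, y))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: abs(scores[j]), reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n_samples=600, n_features=40, informative=(3, 4, 5),
                  noise=0.1, seed=0):
    """Hypothetical toy data: labels depend on one contiguous block of features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = []
    for row in X:
        # Noisy linear score; only its sign (one bit) is observed.
        score = sum(row[j] for j in informative) + rng.gauss(0.0, noise)
        y.append(1 if score >= 0 else -1)
    return X, y


X, y = make_toy_data()
selected = one_bit_select(X, y, k=3)
print(selected)
```

The informative features in the toy data are deliberately placed in one contiguous block, mirroring the abstract's observation that significant features tend to cluster in groups.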

2021 ◽  
Author(s):  
Soha Ahmed

<p>Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. To classify MS samples, the proteins or peptides that are expressed differently between the classes must be identified (biomarker detection). However, the high dimensionality of the data and the small number of samples make classification of MS data extremely challenging. Another important aspect of biomarker detection is verification of the detected biomarkers, an intermediate step before they are passed to the experimental validation stage.
Biomarker detection aims to alter the input space of the learning algorithm in order to improve the classification of proteomic or metabolomic data. This task is performed through feature manipulation, which consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm with an intrinsic capability for all three aspects of feature manipulation. The ability of GP to perform feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis therefore proposes an embedded methodology for these three aspects of feature manipulation in high-dimensional MS data using GP. The thesis also presents a method for biomarker verification using GP, and investigates the use of GP for both single-objective and multi-objective feature selection and construction.
In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP's capability to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of features selected from the top-ranked features. This method also investigates the capability of GP as a classifier. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers.
In feature construction, this thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed algorithm outperforms two feature selection algorithms.
In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. This thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimising the cardinality of the constructed high-level features. The results show that GP can discover the complex relationships between the features and can significantly improve classification performance and reduce the cardinality.
For biomarker verification, the thesis proposes the first GP biomarker verification method based on measuring peptide detectability. The method solves the imbalance problem in the data and shows improvement over the benchmark algorithms. The algorithm also outperforms a well-known peptide detection method. Finally, the thesis introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help to improve the biomarker detection process.</p>
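The two competing objectives named in the abstract (higher classification accuracy, fewer selected features) can be illustrated with a small non-dominated filter. This is only a sketch of the Pareto-dominance idea that multi-objective methods rely on; it is not the thesis's GP algorithm, and the candidate values and function names are hypothetical.

```python
def dominates(p, q):
    """True if candidate p is at least as good as q on both objectives
    and strictly better on one. Candidates are (accuracy, n_features):
    higher accuracy is better, fewer features is better."""
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates)]


# Hypothetical (accuracy, number of features) pairs for five feature subsets.
candidates = [(0.90, 12), (0.88, 5), (0.90, 20), (0.80, 5), (0.93, 30)]
front = pareto_front(candidates)
print(front)
```

Here (0.90, 20) is dropped because (0.90, 12) reaches the same accuracy with fewer features, and (0.80, 5) is dropped because (0.88, 5) is more accurate with the same number of features.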


Author(s):  
Nor Idayu Mahat ◽  
Maz Jamilah Masnan ◽  
Ali Yeon Md Shakaff ◽  
Ammar Zakaria ◽  
Muhd Khairulzaman Abdul Kadir

This chapter gives an overview of multicollinearity in electronic nose (e-nose) classification and investigates some analytical solutions to the problem. Multicollinearity can prevent the analysis from producing good parameter estimates when the classification rule is constructed. The common approach to dealing with multicollinearity is feature extraction. However, the variance-based criterion used to extract the raw features may not be appropriate for the ultimate goal of classification accuracy. Feature selection is an advisable alternative because it retains only the valuable features. Two distance-based criteria for determining the right features for classification, Wilks' Lambda and the bounded Mahalanobis distance, are applied. Classification with features determined by the bounded Mahalanobis distance performs statistically better than classification with features determined by Wilks' Lambda. This chapter suggests that classification of e-nose data with feature selection is a good choice for limiting the cost of experiments while maintaining good classification performance.
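To make the two criteria concrete, the sketch below scores a single feature for two classes with the ordinary squared Mahalanobis distance (using the pooled variance) and with Wilks' Lambda (within-class over total sum of squares). It is a simplified, per-feature illustration under assumed toy values; the chapter's bounded Mahalanobis variant and its multivariate setting are not reproduced here, and the function name is hypothetical.

```python
def feature_scores(a, b):
    """Score one feature given its values in class A (a) and class B (b).

    Returns (d2, wilks): the squared Mahalanobis distance between the class
    means (larger is better) and Wilks' Lambda (smaller is better).
    """
    n1, n2 = len(a), len(b)
    m1 = sum(a) / n1
    m2 = sum(b) / n2
    # Within-class scatter, pooled over both classes.
    ss_within = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
    pooled_var = ss_within / (n1 + n2 - 2)
    d2 = (m1 - m2) ** 2 / pooled_var
    # Total scatter around the grand mean.
    both = a + b
    grand = sum(both) / len(both)
    ss_total = sum((v - grand) ** 2 for v in both)
    wilks = ss_within / ss_total
    return d2, wilks


# Hypothetical sensor readings for one feature in two classes.
class_a = [1.0, 2.0, 3.0]
class_b = [4.0, 5.0, 6.0]
d2, wilks = feature_scores(class_a, class_b)
print(d2, wilks)
```

In this reduced two-class, one-feature case the scores are tied by Lambda = 1 / (1 + n1·n2·D² / (n·(n − 2))) with n = n1 + n2, so they rank features identically when class sizes are fixed. Differences like those the chapter reports can therefore only arise in richer settings, such as multivariate feature subsets or the bounded variant.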


Author(s):  
Marcela Aguilera Flores ◽  
Iulia M Lazar

Abstract

Summary: The ‘Unknown Mutation Analysis (XMAn)’ database is a compilation of Homo sapiens mutated peptides in FASTA format that was constructed to facilitate the identification of protein sequence alterations by tandem mass spectrometry detection. The database comprises 2 539 031 non-redundant mutated entries from 17 599 proteins, of which 2 377 103 are missense and 161 928 are nonsense mutations. It can be used in conjunction with search engines that identify peptide amino acid sequences by matching experimental tandem mass spectrometry data to theoretical sequences from a database.

Availability and implementation: XMAn v2 can be accessed from github.com/lazarlab/XMAnv2.

Supplementary information: Supplementary data are available at Bioinformatics online.

