Hierarchical classification and visualization with multiple feature ranking criteria


USING STATISTICAL MEASURES FOR FEATURE RANKING

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001413500031 ◽

2013 ◽

Vol 27 (01) ◽

pp. 1350003 ◽

Cited By ~ 9

Author(s):

EGHBAL G. MANSOORI

Keyword(s):

Data Mining ◽

A Priori ◽

Scoring Function ◽

Cost Effective ◽

Classification Error ◽

Feature Ranking ◽

Statistical Parameters ◽

Statistical Measures ◽

Ranking Criteria ◽

Training Error

Feature ranking is a fundamental preprocessing step for feature selection, performed before any data mining task. When a problem has too many features, dimensionality reduction by discarding weak features is highly desirable. In this paper, we develop an efficient feature ranking algorithm for selecting the more relevant features before deriving classification predictors. Unlike ranking criteria that rely on the training error of a predictor built on a single feature, our approach is distance-based and uses only the statistical distribution of the classes in each feature. It uses a scoring function as the ranking criterion to evaluate the correlation between each feature and the classes. This function comprises three measures for each class: the statistical between-class distance, the interclass overlapping measure, and an estimate of class impurity. To compute the statistical parameters used in these measures, a normalized histogram obtained for each class is employed as its a priori probability density. Because the proposed algorithm examines each feature individually, it provides a fast and cost-effective method for feature ranking. We tested the effectiveness of our approach on several high-dimensional benchmark data sets. For this purpose, a number of top-ranked features were selected and used in rule-based classifiers as the target data mining task. Compared with some popular feature ranking methods, the experimental results show that our approach performs better, as it identifies the more relevant features, which leads to lower classification error.
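The per-feature, histogram-based idea described above can be illustrated with a minimal Python sketch. This is not the paper's scoring function, whose exact form is not given in the abstract. As an assumption, the score below combines only two simple stand-ins: the distance between class means estimated from normalized histograms, and the histogram overlap between classes. The class impurity term is omitted. The names `class_histograms`, `feature_score`, and `rank_features` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def class_histograms(values, labels, bins=10):
    """Normalized per-class histograms of one feature over a shared bin grid."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    counts = defaultdict(lambda: [0] * bins)
    for v, y in zip(values, labels):
        counts[y][min(int((v - lo) / width), bins - 1)] += 1
    # Each class histogram sums to 1 and acts as an empirical density.
    return {y: [c / sum(h) for c in h] for y, h in counts.items()}


def feature_score(values, labels, bins=10):
    """Assumed score: mean-distance times (1 - overlap), averaged over class pairs."""
    hists = class_histograms(values, labels, bins)
    centers = [(i + 0.5) / bins for i in range(bins)]  # bin centers rescaled to [0, 1]
    pair_scores = []
    for a, b in combinations(sorted(hists), 2):
        pa, pb = hists[a], hists[b]
        mean_a = sum(c * p for c, p in zip(centers, pa))
        mean_b = sum(c * p for c, p in zip(centers, pb))
        distance = abs(mean_a - mean_b)                    # between-class distance
        overlap = sum(min(x, y) for x, y in zip(pa, pb))   # shared histogram mass
        pair_scores.append(distance * (1.0 - overlap))
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0


def rank_features(rows, labels, bins=10):
    """Return feature indices ordered from highest to lowest score."""
    n_features = len(rows[0])
    scores = [feature_score([r[j] for r in rows], labels, bins)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

Because each feature is scored on its own, the cost grows linearly with the number of features, which is consistent with the abstract's claim that examining features individually keeps the ranking fast.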

Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data

10.26686/wgtn.17013680.v1 ◽

2021 ◽

Author(s):

Soha Ahmed

Keyword(s):

Mass Spectrometry ◽

Feature Selection ◽

Classification Performance ◽

Feature Ranking ◽

Feature Construction ◽

Biomarker Detection ◽

Multi Objective ◽

Multiple Feature ◽

High Level

Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. Classifying MS samples requires identifying the proteins or peptides that are expressed differently between the classes (biomarker detection). However, the high dimensionality of the data and the small number of samples make classification of MS data extremely challenging. Another important aspect of biomarker detection is verification of the detected biomarkers, an intermediate step before they are passed to the experimental validation stage. Biomarker detection aims at altering the input space of the learning algorithm to improve classification of proteomic or metabolomic data. This task is performed through feature manipulation, which consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm with an intrinsic capability for all three aspects of feature manipulation. The ability of GP for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis therefore proposes an embedded methodology for these three aspects of feature manipulation in high-dimensional MS data using GP. The thesis also presents a GP-based method for biomarker verification, and it investigates the use of GP for both single-objective and multi-objective feature selection and construction. In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of features selected from the top-ranked features. The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers. In feature construction, the thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed algorithm outperforms two feature selection algorithms. In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. The thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimising the cardinality of the constructed high-level features. The results show that GP can discover the complex relationships between the features, significantly improve classification performance, and reduce the cardinality. For biomarker verification, the thesis proposes the first GP biomarker verification method, which works by measuring peptide detectability. The method addresses the class imbalance problem in the data and shows improvement over the benchmark algorithms. The algorithm also outperforms a well-known peptide detection method. Finally, the thesis introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help in improving the biomarker detection process.
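The two objectives named in the abstract, higher classification accuracy and fewer selected features, generally conflict, so a multi-objective method returns a set of trade-off solutions rather than a single best subset. The Python sketch below illustrates only that Pareto-dominance idea, not the thesis's GP algorithm. It assumes each candidate feature subset has already been evaluated and is summarized as a hypothetical `(error_rate, n_features)` pair, with both values to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    candidates: list of (error_rate, n_features) tuples, both minimized.
    """
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only, not results from the thesis.
candidates = [(0.10, 50), (0.12, 20), (0.12, 30), (0.20, 5), (0.25, 5)]
front = pareto_front(candidates)
# (0.12, 30) is dominated by (0.12, 20); (0.25, 5) is dominated by (0.20, 5).
```

The surviving points represent different compromises between accuracy and subset size. Choosing among them is left to the user, which is the practical difference from a single-objective method that folds both goals into one fitness value.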

Comparing Hierarchical Models of Personality Pathology

10.31234/osf.io/wzde5 ◽

2018 ◽

Author(s):

Whitney R. Ringwald ◽

Aidan G.C. Wright ◽

Joseph E. Beeney ◽

Paul A. Pilkonis

Keyword(s):

Hierarchical Models ◽

Empirical Work ◽

Hierarchical Classification ◽

Personality Pathology ◽

General Factor ◽

Factor Analyses ◽

Conceptual Structures ◽

Individual Personality ◽

Exploratory Factor Analyses ◽

Specific Factors

Two types of dimensional, hierarchical classification models of personality pathology have emerged as alternatives to traditional categorical systems: multi-tiered models with increasing numbers of factors, and models that distinguish between a general factor of severity and specific factors reflecting style. Using a large sample (N = 840) with a range of psychopathology, we conducted exploratory factor analyses of individual personality disorder criteria to evaluate the validity of these conceptual structures. We estimated an oblique, "unfolding" hierarchy and a bifactor model, then examined correlations between these and multi-method functioning measures to enrich interpretation. Four-factor solutions for each model, which are rotations of each other, fit well and equivalently. The resulting structures are consistent with previous empirical work and provide support for each theoretical model.
