CAFÉ-Map: Context Aware Feature Mapping for mining high dimensional biomedical data

Discretization is a popular pre-processing technique that helps a learner handle continuous-valued attributes with wide ranges. The objective of this chapter is to explore how combining machine learning with evolutionary algorithms can improve performance on high-dimensional biomedical data, with a view to designing effective healthcare systems. To that end, the model targets the preprocessing phase: a framework was developed that combines Fisher Markov feature selection with evolutionary-based binary discretization (EBD) for microarray gene expression classification. Several experiments were conducted on publicly available microarray gene expression datasets, including colon tumor, lung cancer, and prostate cancer data. Performance is evaluated in terms of accuracy and standard deviation and is compared with other state-of-the-art techniques. The experimental results show that the EBD algorithm outperforms other contemporary discretization techniques.
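To make the idea of binary discretization concrete, the following is a minimal sketch that maps one continuous expression feature to 0/1 around a single cut-point. It is not the EBD algorithm from the abstract: the evolutionary search is replaced here by an exhaustive scan over candidate cut-points scored by class entropy, and the function names (`best_binary_cut`, `discretize`) and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_binary_cut(values, labels):
    """Return the cut-point that minimizes the weighted class entropy.

    Assumption: a plain exhaustive scan over midpoints between sorted
    values stands in for the evolutionary search described in the abstract.
    Returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best_cut, best_score = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no cut-point fits between equal values
        cut = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut


def discretize(values, cut):
    """Map each continuous value to 1 if it lies above the cut, else 0."""
    return [1 if v > cut else 0 for v in values]


# Hypothetical toy data: one gene's expression level and a tumor label.
expression = [0.2, 0.4, 0.5, 1.8, 2.1, 2.6]
tumor = [0, 0, 0, 1, 1, 1]

cut = best_binary_cut(expression, tumor)
bits = discretize(expression, cut)
print(cut, bits)
```

On this toy feature the two classes are separated by a gap, so the chosen cut falls inside that gap and the binary feature reproduces the class labels. Real microarray features rarely separate this cleanly, which is one reason a search over cut-points is paired with feature selection.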


SARA: A memetic algorithm for high-dimensional biomedical data

Applied Soft Computing ◽

10.1016/j.asoc.2020.107009 ◽

2020 ◽

pp. 107009

Author(s):

Santos Kumar Baliarsingh ◽

Khan Muhammad ◽

Sambit Bakshi

Keyword(s):

Memetic Algorithm ◽

High Dimensional ◽

Biomedical Data


Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm

Genes ◽

10.3390/genes11070717 ◽

2020 ◽

Vol 11 (7) ◽

pp. 717

Author(s):

Garba Abdulrauf Sharifai ◽

Zurinahni Zainol

Keyword(s):

Feature Selection ◽

Optimization Algorithm ◽

Imbalanced Data ◽

High Dimensional ◽

Data Sets ◽

Biomedical Data ◽

Data Set ◽

Grasshopper Optimization Algorithm ◽

Imbalanced Class ◽

Grasshopper Optimization

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding when samples are limited but the number of features is massive (high dimensionality). High dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data analysis. Numerous researchers have investigated either imbalanced class or high dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent both the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA), which employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to cast feature selection as an optimisation problem and select the best (near-optimal) combination of features from the majority and minority classes. The results, supported by statistical analysis, indicate that rCBR-BGOA can improve classification performance on high dimensional and imbalanced datasets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
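The two metrics named in the abstract can be sketched in a few lines. The example below computes the G-mean (the geometric mean of sensitivity and specificity) and the AUC via its pairwise-ranking definition. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the code covers only the evaluation metrics, not rCBR-BGOA itself.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels (1 = minority)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy set: 2 minority samples, 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class.
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)

print(accuracy, g_mean(y_true, always_majority))
```

The always-majority classifier reaches 80% accuracy on this toy set but has a G-mean of zero, because it never detects a minority sample. This is why imbalanced-data studies tend to report G-mean and AUC rather than accuracy alone.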


A sparse pair-preserving centroid-based supervised learning method for high-dimensional biomedical data or images

Biocybernetics and Biomedical Engineering ◽

10.1016/j.bbe.2020.03.008 ◽

2020 ◽

Vol 40 (2) ◽

pp. 774-786 ◽

Cited By ~ 1

Author(s):

Jan Kalina ◽

Ctirad Matonoha

Keyword(s):

Supervised Learning ◽

High Dimensional ◽

Biomedical Data ◽

Learning Method


Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data

International Journal of Molecular Sciences ◽

10.3390/ijms21010079 ◽

2019 ◽

Vol 21 (1) ◽

pp. 79 ◽

Cited By ~ 5

Author(s):

Jörn Lötsch ◽

Alfred Ultsch

Keyword(s):

Dimensional Space ◽

Cluster Structure ◽

Projection Methods ◽

High Dimensional ◽

Computational Techniques ◽

Correct Identification ◽

Data Sets ◽

Biomedical Data ◽

Self Organizing Maps ◽

Wrong Number

Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high dimensional cytometric data and suggest a robust alternative.
