k-Means, Ward and Probabilistic Distance-Based Clustering Methods with Contiguity Constraint

Author(s):  
Andrzej Młodak
2018
Vol 18 (13)
pp. 1110-1122
Author(s):  
Juan F. Morales ◽  
Lucas N. Alberca ◽  
Sara Chuguransky ◽  
Mauricio E. Di Ianni ◽  
Alan Talevi ◽  
...  

Much interest has been paid in the last decade to molecular predictors of promiscuity, including molecular weight, log P, molecular complexity, acidity constant and molecular topology, with the correlations between promiscuity and those descriptors appearing to be context-dependent. It has been observed that certain therapeutic categories (e.g. mood disorder therapies) display a tendency to include multi-target agents (i.e. selective non-selectivity). Numerous QSAR models based on topological descriptors suggest that the topology of a given drug could be used to infer its therapeutic applications. Here, we used descriptive statistics to explore the distribution of molecular topology descriptors and other promiscuity predictors across different therapeutic categories. Working with the publicly available ChEMBL database and 14 molecular descriptors, both hierarchical and non-hierarchical clustering methods were applied to the mean descriptor values of the therapeutic categories after refinement of the database (770 drugs grouped into 34 therapeutic categories). In parallel, another publicly available database (repoDB) was used to retrieve clinically approved drug repositioning examples that could be classified into the therapeutic categories considered by the aforementioned clusters (111 cases), and the correspondence between the two studies was evaluated. Interestingly, a 3-cluster hierarchical clustering scheme based on only 14 molecular descriptors linked to promiscuity seems to explain up to 82.9% of the approved cases of drug repurposing retrieved from repoDB. Therapeutic categories seem to display distinctive molecular patterns, which could be used as a basis for drug screening and drug design campaigns, and to unveil drug repurposing opportunities between particular therapeutic categories.
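The hierarchical step described above can be sketched as average-linkage agglomerative clustering cut at 3 clusters. The category vectors below are hypothetical standardized descriptor means (e.g. molecular weight, log P), not the study's actual ChEMBL values:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two descriptor mean vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering, cut at n_clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical standardized mean descriptor vectors per therapeutic category
categories = [(0.1, 0.2), (0.15, 0.25), (2.0, 1.9), (2.1, 2.0), (5.0, 0.1)]
print(agglomerative(categories, 3))  # → [[0, 1], [2, 3], [4]]
```

With only five toy categories, the nearest pairs merge first and the outlying category remains a singleton, mirroring how a dendrogram cut yields a fixed number of groups.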


2021
Vol 13 (11)
pp. 2125
Author(s):  
Bardia Yousefi ◽  
Clemente Ibarra-Castanedo ◽  
Martin Chamberland ◽  
Xavier P. V. Maldague ◽  
Georges Beaudoin

Clustering methods unequivocally show considerable influence on many recent algorithms and play an important role in hyperspectral data analysis. Here, we evaluate clustering for mineral identification using two different strategies in the hyperspectral long-wave infrared (LWIR, 7.7–11.8 μm). To that end, we compare two algorithms that perform mineral identification on a unique dataset. The first algorithm applies spectral comparison techniques to all the pixel spectra and creates RGB false color composites (FCC); a color-based clustering is then used to group the regions (called FCC-clustering). The second algorithm clusters all the pixel spectra to group the spectra directly; the first rank of non-negative matrix factorization (NMF) then extracts the representative of each cluster and compares the results with the JPL/NASA spectral library. These techniques yield the comparison values as features, which are converted into an RGB-FCC result (called clustering-rank1-NMF). We applied K-means as the clustering approach, which can be replaced by any other similar clustering method. The results of the clustering-rank1-NMF algorithm indicate significant computational efficiency (more than 20 times faster than the other approach) and promising performance for mineral identification, with average accuracies of up to 75.8% and 84.8% for the FCC-clustering and clustering-rank1-NMF algorithms (using the spectral angle mapper (SAM)), respectively. Furthermore, several other spectral comparison techniques are also used for both algorithms, such as the adaptive matched subspace detector (AMSD), the orthogonal subspace projection (OSP) algorithm, principal component analysis (PCA), the local matched filter (PLMF), SAM, and normalized cross correlation (NCC), and most of them show a similar range in accuracy. However, SAM and NCC are preferred due to their computational simplicity.
Our algorithms strive to identify eleven different mineral grains (biotite, diopside, epidote, goethite, kyanite, scheelite, smithsonite, tourmaline, pyrope, olivine, and quartz).
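The spectral comparison at the core of both algorithms can be sketched with the spectral angle mapper (SAM): a cluster representative is assigned to the library mineral with the smallest angle. The representative and library spectra below are hypothetical, not values from the JPL/NASA library:

```python
import math

def spectral_angle(a, b):
    """Spectral Angle Mapper: angle (radians) between two spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # clamp to [-1, 1] to guard against floating-point drift
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

# Hypothetical cluster-representative spectrum and two library spectra
representative = [0.2, 0.5, 0.9, 0.4]
library = {"quartz": [0.21, 0.48, 0.88, 0.41],
           "biotite": [0.9, 0.3, 0.1, 0.7]}

# identify the mineral whose library spectrum has the smallest angle
best = min(library, key=lambda m: spectral_angle(representative, library[m]))
print(best)  # → quartz
```

SAM is insensitive to overall brightness (scaling a spectrum leaves the angle unchanged), which is one reason it is computationally attractive compared to subspace methods such as AMSD or OSP.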


2021
Vol 11 (1)
Author(s):  
Gregoire Preud’homme ◽  
Kevin Duarte ◽  
Kevin Dalleau ◽  
Claire Lacomblez ◽  
Emmanuel Bresso ◽  
...  

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, and K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone), allowing us to illustrate the differences between clustering techniques. The simulations revealed the dominance of the K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and hierarchical clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
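The ARI used to score the simulations is computed from a contingency table of the two labelings. A minimal pure-Python sketch (illustrative only — the study itself benchmarked ready-to-use R tools):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: agreement between two labelings, corrected for chance."""
    n = len(labels_true)
    # contingency table: counts of samples sharing (true, predicted) label pairs
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under random labeling
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# identical partitions score 1.0 regardless of label names
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Because of the chance correction, a random labeling scores near 0 and some disagreements can even score below 0, which is why ARI is preferred over the raw Rand index for comparing clusterings.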


2021
Vol 11 (1)
Author(s):  
Adriana Leal ◽  
Mauro F. Pinto ◽  
Fábio Lopes ◽  
Anna M. Bianchi ◽  
Jorge Henriques ◽  
...  

Electrocardiogram (ECG) recordings, lasting hours before epileptic seizures, have been studied in the search for evidence of a preictal interval that follows a normal ECG trace and precedes the seizure's clinical manifestation. The preictal interval has not yet been clinically parametrized, and its duration varies across seizures, both among patients and within the same patient. In this study, we performed a heart rate variability (HRV) analysis to investigate the discriminative power of HRV features in the identification of the preictal interval. HRV information extracted from the linear time and frequency domains as well as from nonlinear dynamics was analysed. We inspected data from 238 temporal lobe seizures recorded from 41 patients with drug-resistant epilepsy from the EPILEPSIAE database. Unsupervised methods were applied to the HRV feature dataset, leading to a new perspective on preictal interval characterization. Distinguishable preictal behaviour was exhibited in 41% of the seizures and in 90% of the patients. Half of the preictal intervals were identified in the 40 min before seizure onset. The results demonstrate the potential of applying clustering methods to HRV features to deepen the current understanding of the preictal state.
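Two standard linear time-domain HRV features, SDNN and RMSSD, can be sketched from a series of RR intervals as below. The RR series is hypothetical, and the study's feature set (including frequency-domain and nonlinear measures) is much larger:

```python
import math

def hrv_time_domain(rr_ms):
    """SDNN and RMSSD from a sequence of RR intervals (milliseconds)."""
    n = len(rr_ms)
    mean = sum(rr_ms) / n
    # SDNN: sample standard deviation of the RR intervals
    sdnn = math.sqrt(sum((r - mean) ** 2 for r in rr_ms) / (n - 1))
    # RMSSD: root mean square of successive RR differences
    diffs = [rr_ms[i + 1] - rr_ms[i] for i in range(n - 1)]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return sdnn, rmssd

# Hypothetical RR series: a steady rhythm yields low SDNN/RMSSD
sdnn, rmssd = hrv_time_domain([800, 810, 790, 805, 795])
```

Feature vectors of this kind, computed over sliding windows before seizure onset, are what the unsupervised methods in the study operate on.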


Genes
2021
Vol 12 (2)
pp. 311
Author(s):  
Zhenqiu Liu

Single-cell RNA-seq (scRNA-seq) is a powerful tool to measure the expression patterns of individual cells and discover heterogeneity and functional diversity among cell populations. Due to variability, it is challenging to analyze such data efficiently. Many clustering methods have been developed, most using at least one free parameter; different choices for free parameters may lead to substantially different visualizations and clusters, and tuning free parameters is also time-consuming. Thus, there is a need for a simple, robust, and efficient clustering method. In this paper, we propose a new regularized Gaussian graphical clustering (RGGC) method for scRNA-seq data. RGGC is based on high-order (partial) correlations and subspace learning, and is robust over a wide range of the regularization parameter λ. Therefore, we can simply set λ=2 or λ=log(p) for the AIC (Akaike information criterion) or BIC (Bayesian information criterion) without cross-validation. Cell subpopulations are discovered by the Louvain community detection algorithm, which determines the number of clusters automatically; there is no free parameter to be tuned with RGGC. When evaluated on simulated and benchmark scRNA-seq data sets against widely used methods, RGGC is computationally efficient and one of the top performers. When applied to glioblastoma scRNA-seq data, it can detect inter-sample cell heterogeneity.
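A much-simplified stand-in for the graph-then-communities idea can be sketched as follows: build a cell–cell graph from pairwise Pearson correlations and group cells by connected components. This substitutes plain correlations for RGGC's regularized partial correlations and connected components for Louvain, and the expression profiles are hypothetical:

```python
from itertools import combinations

def pearson(x, y):
    # Pearson correlation between two expression profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def graph_clusters(cells, threshold=0.8):
    """Connect cells whose correlation exceeds the threshold, then return
    connected components (a simplified stand-in for Louvain communities)."""
    n = len(cells)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if pearson(cells[i], cells[j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, components = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:                      # depth-first traversal of one component
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        components.append(sorted(comp))
    return components

# Hypothetical expression profiles for five cells over four genes
cells = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2], [1, 1, 1, 2]]
print(graph_clusters(cells))  # → [[0, 1], [2, 3], [4]]
```

As in RGGC, the number of groups falls out of the graph structure rather than being supplied as a free parameter.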


Electronics
2021
Vol 10 (3)
pp. 302
Author(s):  
Chunde Liu ◽  
Xianli Su ◽  
Chuanwen Li

There is growing interest in safety warning for underground mining due to the serious hazards faced by underground mining workers. Sensor data acquisition based on the Internet of Things (IoT) is currently the main method, but anomaly detection and analysis of multi-sensor data is a challenging task: firstly, the data collected by different sensors in underground mining are heterogeneous; secondly, real-time performance is required for the anomaly detection that drives safety warnings. There are many existing anomaly detection methods, such as the traditional clustering methods K-means and C-means, and Artificial Intelligence (AI) is widely used in data analysis and prediction. However, K-means and C-means cannot directly process heterogeneous data, and AI algorithms require equipment with high computing and storage capabilities; IoT equipment in underground mining cannot perform complex calculations due to energy-consumption constraints. Therefore, many existing methods cannot be directly used for IoT applications in underground mining. In this paper, a multi-sensor data anomaly detection method based on edge computing is proposed. Firstly, an edge computing model is designed and, according to the computing capabilities of different types of devices, anomaly detection tasks are migrated to different edge devices, which solves the problem of insufficient device computing capability. Secondly, according to the requirements of different anomaly detection tasks, edge anomaly detection algorithms are designed for sensor nodes and sink nodes, respectively. Lastly, an experimental platform is built for performance comparison, and the experimental results show that the proposed algorithm performs better in anomaly detection accuracy, delay, and energy consumption.
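A detector lightweight enough for a constrained sensor node can be sketched with a rolling z-score check over a fixed window. This is an assumed illustration of the kind of edge-side algorithm described, not the paper's actual method:

```python
import math
from collections import deque

class EdgeAnomalyDetector:
    """Rolling z-score anomaly check, cheap enough for a constrained edge node."""

    def __init__(self, window=20, z_threshold=3.0):
        self.buffer = deque(maxlen=window)  # fixed memory footprint
        self.z_threshold = z_threshold

    def observe(self, value):
        """Return True if the reading deviates strongly from the recent window."""
        flag = False
        if len(self.buffer) >= 2:
            mean = sum(self.buffer) / len(self.buffer)
            var = sum((v - mean) ** 2 for v in self.buffer) / (len(self.buffer) - 1)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.z_threshold:
                flag = True  # anomaly: forward to the sink node for analysis
        self.buffer.append(value)
        return flag

detector = EdgeAnomalyDetector(window=10, z_threshold=3.0)
readings = [20.0, 20.1, 19.9, 20.2, 20.0, 19.8, 20.1, 35.0]
flags = [detector.observe(r) for r in readings]
print(flags)  # only the final spike is flagged
```

The sensor node only keeps a small window in memory and forwards flagged readings, matching the division of labor between sensor and sink nodes that the paper's edge computing model describes.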

