KMD clustering: Robust generic clustering of biological data

2020 ◽  
Author(s):  
Aviv Zelig ◽  
Noam Kaplan

Abstract: The challenges of clustering noisy high-dimensional biological data have spawned advanced clustering algorithms that are tailored to specific types of biological data. However, the performance of such methods varies greatly between datasets, they require post hoc tuning of cryptic hyperparameters, and they are often not transferable to other types of data. Here we present a novel generic clustering approach called k minimal distances (KMD) clustering, based on a simple generalization of single and average linkage hierarchical clustering. We show how a generalized silhouette-like function is predictive of clustering accuracy and exploit this property to eliminate the main hyperparameter k. We evaluated KMD clustering on standard simulated datasets, simulated datasets with high added noise, mass cytometry datasets and scRNA-seq datasets. When compared to standard generic and state-of-the-art specialized algorithms, KMD clustering's performance was consistently better than or comparable to that of the best algorithm on each of the tested datasets.
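The abstract does not spell out the linkage formula, but a natural reading of "k minimal distances" is a cluster-to-cluster linkage that averages the k smallest pairwise point distances, reducing to single linkage at k = 1 and to average linkage when k covers all pairs. A minimal sketch under that assumption (the function name and signature are illustrative, not the authors' API):

```python
import numpy as np

def kmd_linkage(A, B, k):
    """Distance between clusters A and B as the mean of the k smallest
    pairwise point-to-point distances: k=1 gives single linkage,
    k >= |A|*|B| gives average linkage."""
    # all pairwise Euclidean distances between the two clusters
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel()
    k = min(k, d.size)
    # np.partition places the k smallest values in the first k slots
    return np.partition(d, k - 1)[:k].mean()
```

Because the k smallest distances are a superset of the k-1 smallest, this linkage is monotonically non-decreasing in k, interpolating between the single- and average-linkage extremes mentioned in the abstract.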

2021 ◽  
Vol 15 (8) ◽  
pp. 898-911
Author(s):  
Yongqing Zhang ◽  
Jianrong Yan ◽  
Siyu Chen ◽  
Meiqin Gong ◽  
Dongrui Gao ◽  
...  

Rapid advances in biological research over recent years have significantly enriched biological and medical data resources. Deep learning-based techniques have been successfully utilized to process data in this field, and they have exhibited state-of-the-art performance even on high-dimensional, unstructured, and black-box biological data. The aim of the current study is to provide an overview of deep learning-based techniques used in biology and medicine and their state-of-the-art applications. In particular, we introduce the fundamentals of deep learning and then review the success of applying such methods to bioinformatics, biomedical imaging, biomedicine, and drug discovery. We also discuss the challenges and limitations of this field, and outline possible directions for further research.


Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998) cluster data points based on full dimensions. When the dimensional space grows higher, these algorithms lose their efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances over full dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in such dimensions. Although both methods reduce the dimensionality of the space before clustering, neither handles well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has recently been proposed to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
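The distance-concentration result of Beyer et al. is easy to reproduce numerically: for uniformly random points, the gap between the farthest and nearest neighbor shrinks relative to the nearest-neighbor distance as dimensionality grows. A small simulation (the helper name and the uniform-data setup are ours, for illustration only):

```python
import numpy as np

def relative_contrast(dim, n=500, rng=None):
    """For n uniform random points, return (d_max - d_min) / d_min
    measured from one query point to all others. This 'relative
    contrast' shrinks toward 0 as dimensionality grows, which is
    the concentration effect behind the curse of dimensionality."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.random((n, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()
```

In 2 dimensions the nearest neighbor is typically orders of magnitude closer than the farthest point; in 1000 dimensions all distances cluster around the same value, so full-dimensional nearest-neighbor queries lose their meaning.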


2017 ◽  
Author(s):  
Kevin R. Moon ◽  
David van Dijk ◽  
Zheng Wang ◽  
Scott Gigante ◽  
Daniel B. Burkhardt ◽  
...  

Abstract: With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
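The abstract describes PHATE only at a high level; the published pipeline builds a diffusion operator from local affinities, diffuses it for t steps, takes distances between log-potential representations, and embeds those with MDS. A toy sketch of that pipeline, assuming a fixed Gaussian kernel and fixed t (the real method uses an adaptive alpha-decay kernel and a data-driven choice of t, so this is a simplification, not PHATE itself):

```python
import numpy as np

def phate_like_embed(X, t=4, eps=1.0, n_components=2):
    """Toy PHATE-style pipeline: Gaussian affinities -> row-stochastic
    diffusion operator -> t-step diffusion -> log 'potential' distances
    -> classical MDS into n_components dimensions."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)                       # fixed-bandwidth kernel
    P = K / K.sum(1, keepdims=True)             # diffusion operator
    Pt = np.linalg.matrix_power(P, t)           # t-step diffusion
    U = -np.log(Pt + 1e-12)                     # potential representation
    pot_d = np.linalg.norm(U[:, None] - U[None, :], axis=-1)
    # classical MDS on the potential distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pot_d ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))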


2018 ◽  
Vol 7 (2.21) ◽  
pp. 291
Author(s):  
S Sivakumar ◽  
Kumar Narayanan ◽  
Swaraj Paul Chinnaraju ◽  
Senthil Kumar Janahan

Data mining is the extraction of useful information from data sets. Clustering is a key data mining task: it helps divide and organize the records of a data set into groups according to a given similarity measure. Clustering high-dimensional data has been a major challenge, since most existing clustering algorithms are inefficient when similarity is computed between data points in the full-dimensional space. Various projected clustering algorithms have been suggested to address these problems, but many of them face difficulties when clusters hide in subspaces of low dimensionality. These challenges motivate our proposal of a partitional distance-based projected clustering algorithm. The proposed work is designed to find projected clusters in high-dimensional space by adapting an improved k-medoids method. The second phase removes outliers, while the third phase finds clusters in different subspaces. The clustering technique is based on the k-medoids algorithm, with distances computed over the set of attributes where values are dense.
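The core idea of the abstract appears to be a k-medoids variant in which distances are restricted to each cluster's relevant dimensions. A hypothetical sketch of the assignment step under that reading (the function name, parameters, and the 1/sqrt(|dims|) normalization are ours, not the paper's):

```python
import numpy as np

def projected_assign(X, medoids, dims_per_cluster):
    """Assignment step of a projected k-medoids: each cluster has its
    own subset of relevant dimensions, and a point is compared to a
    medoid using only that cluster's dimensions, normalized by the
    subspace size so clusters with different |dims| are comparable."""
    costs = []
    for m, dims in zip(medoids, dims_per_cluster):
        diff = X[:, dims] - m[dims]
        costs.append(np.linalg.norm(diff, axis=1) / np.sqrt(len(dims)))
    # each point joins the cluster whose projected medoid is closest
    return np.argmin(np.stack(costs, axis=1), axis=1)
```

In a full algorithm this assignment step would alternate with medoid updates and with re-estimating each cluster's dense attribute set; only the assignment is sketched here.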


2018 ◽  
Vol 14 (3) ◽  
pp. 38-55 ◽  
Author(s):  
Kavan Fatehi ◽  
Mohsen Rezvani ◽  
Mansoor Fateh ◽  
Mohammad-Reza Pajoohan

Because of the curse of dimensionality in high-dimensional data, a significant amount of research has recently been conducted on subspace clustering, which aims at discovering clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, leading to a loss of clustering accuracy and increased running time. A bottom-up density-based approach is suggested in this article, in which the cluster structure serves as a similarity measure to generate the optimal subspaces, raising the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering the similarity of their cluster structure, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the results of the proposed approach are significantly better in quality and runtime than the state-of-the-art in clustering high-dimensional data.
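The article uses the cluster structure itself as the similarity measure between candidate subspaces. One simple way to realize such a measure is pairwise co-membership agreement (the Rand index) between the clusterings found in two subspaces; the abstract does not specify its exact measure, so this is only an illustrative stand-in:

```python
import numpy as np

def label_agreement(a, b):
    """Rand index between two clusterings: the fraction of point pairs
    on which the labelings agree (both pairs together, or both apart).
    Two subspaces with near-identical cluster structure score near 1
    and would be candidates for merging."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), 1)          # each unordered pair once
    return (same_a[iu] == same_b[iu]).mean()
```

Co-membership comparison has the convenient property of being invariant to label permutation, which matters here because the clusterings of two subspaces carry no shared label meaning.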


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Anna C. Belkina ◽  
Christopher O. Ciccolella ◽  
Rina Anno ◽  
Richard Halpert ◽  
Josef Spidlen ◽  
...  

Abstract: Accurate and comprehensive extraction of information from high-dimensional single cell datasets necessitates faithful visualizations to assess biological populations. A state-of-the-art algorithm for non-linear dimension reduction, t-SNE, requires multiple heuristics and fails to produce clear representations of datasets when millions of cells are projected. We develop opt-SNE, an automated toolkit for t-SNE parameter selection that utilizes Kullback-Leibler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner. The precise calibration of early exaggeration together with opt-SNE adjustment of gradient descent learning rate dramatically improves computation time and enables high-quality visualization of large cytometry and transcriptomics datasets, overcoming limitations of analysis tools with hard-coded parameters that often produce poorly resolved or misleading maps of fluorescent and mass cytometry data. In summary, opt-SNE enables superior data resolution in t-SNE space and thereby more accurate data interpretation.
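opt-SNE's key move is monitoring KL divergence during optimization so that, per dataset, the early-exaggeration phase ends when it stops helping rather than at a hard-coded iteration count. A toy version of such a stopping rule applied to a recorded KL trajectory (the threshold, function name, and plateau criterion are illustrative, not the toolkit's actual criterion):

```python
def stop_early_exaggeration(kl_values, rel_tol=0.02):
    """Return the 1-based iteration at which to end early exaggeration:
    the first iteration where the relative drop in KL divergence falls
    below rel_tol, i.e. the exaggerated embedding has plateaued. If no
    plateau is found, return len(kl_values) (run the phase to the end)."""
    for i in range(1, len(kl_values)):
        prev, cur = kl_values[i - 1], kl_values[i]
        if prev > 0 and (prev - cur) / prev < rel_tol:
            return i
    return len(kl_values)
```

With a hard-coded phase length, a small dataset wastes iterations after the plateau while a huge one exits too early; a divergence-driven rule like this adapts the phase length to the data, which is the behavior the abstract credits for the improved maps.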


2013 ◽  
Author(s):  
Natapol Pornputtapong ◽  
Amporn Atsawarungruangkit ◽  
Kawee Numpacharoen

2021 ◽  
Vol 7 (3) ◽  
pp. 49
Author(s):  
Daniel Carlos Guimarães Pedronette ◽  
Lucas Pascotti Valem ◽  
Longin Jan Latecki

Visual features and representation learning strategies experienced huge advances in the previous decade, mainly supported by deep learning approaches. However, retrieval tasks are still performed mainly based on traditional pairwise dissimilarity measures, while the learned representations lie on high-dimensional manifolds. With the aim of going beyond pairwise analysis, post-processing methods have been proposed to replace pairwise measures with globally defined measures, capable of analyzing collections in terms of the underlying data manifold. The most representative approaches are diffusion and rank-based methods. While diffusion approaches can be computationally expensive, rank-based methods lack theoretical background. In this paper, we propose an efficient Rank-based Diffusion Process which combines both approaches and avoids the drawbacks of each. The obtained method is capable of efficiently approximating a diffusion process by exploiting rank-based information, while assuring its convergence. The algorithm exhibits very low asymptotic complexity and can be computed regionally, making it suitable for queries outside of the dataset. An experimental evaluation conducted for image retrieval and person re-ID tasks on diverse datasets demonstrates the effectiveness of the proposed approach, with results comparable to the state-of-the-art.
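The paper combines rank information with diffusion; one common way to do this is to build affinities from reciprocal neighborhood ranks and then diffuse on the resulting graph. A hypothetical sketch along those lines (not the authors' exact formulation; the 1/(1+rank) weighting and top-k truncation are our illustrative choices):

```python
import numpy as np

def rank_affinity(D, k=3):
    """Rank-based affinity from a distance matrix D: rank neighbors per
    row (rank 0 = self), keep only the k nearest, weight by 1/(1+rank),
    and symmetrize by elementwise product -- the affinity is high only
    when i and j rank each other highly (reciprocal neighbors)."""
    ranks = np.argsort(np.argsort(D, axis=1), axis=1)
    W = np.where(ranks <= k, 1.0 / (1.0 + ranks), 0.0)
    return W * W.T

def diffuse(W, steps=2):
    """A few diffusion steps on the row-normalized affinity graph,
    spreading similarity along the data manifold."""
    P = W / W.sum(1, keepdims=True)
    return np.linalg.matrix_power(P, steps)
```

Because the affinity is built from top-k ranks rather than raw distances, the graph is sparse and scale-free by construction, which is what makes rank-driven approximations of diffusion cheap compared to dense diffusion on the full similarity matrix.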

