KMD clustering: Robust generic clustering of biological data

2020 ◽  
Author(s):  
Aviv Zelig ◽  
Noam Kaplan

Abstract: The challenges of clustering noisy high-dimensional biological data have spawned advanced clustering algorithms that are tailored to specific types of biological data. However, the performance of such methods varies greatly between datasets, they require post hoc tuning of cryptic hyperparameters, and they are often not transferable to other types of data. Here we present a novel generic clustering approach called k minimal distances (KMD) clustering, based on a simple generalization of single and average linkage hierarchical clustering. We show how a generalized silhouette-like function is predictive of clustering accuracy and exploit this property to eliminate the main hyperparameter k. We evaluated KMD clustering on standard simulated datasets, simulated datasets with high added noise, mass cytometry datasets and scRNA-seq datasets. When compared to standard generic and state-of-the-art specialized algorithms, KMD clustering's performance was consistently better than or comparable to that of the best algorithm on each of the tested datasets.
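The abstract does not spell out the linkage formula, but a natural reading of "k minimal distances" is a cluster-to-cluster linkage that averages the k smallest pairwise point distances, reducing to single linkage at k = 1 and to average linkage when k covers all pairs. A minimal sketch under that assumption (the function name and signature are illustrative, not the authors' API):

```python
import numpy as np

def kmd_linkage(A, B, k):
    """Distance between clusters A and B as the mean of the k smallest
    pairwise point-to-point distances: k=1 gives single linkage,
    k >= |A|*|B| gives average linkage."""
    # all pairwise Euclidean distances between the two clusters
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel()
    k = min(k, d.size)
    # np.partition places the k smallest values in the first k slots
    return np.partition(d, k - 1)[:k].mean()
```

Because the k smallest distances are a superset of the k-1 smallest, this linkage is monotonically non-decreasing in k, interpolating between the single- and average-linkage extremes mentioned in the abstract.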

2021 ◽  
Vol 15 (8) ◽  
pp. 898-911
Author(s):  
Yongqing Zhang ◽  
Jianrong Yan ◽  
Siyu Chen ◽  
Meiqin Gong ◽  
Dongrui Gao ◽  
...  

Rapid advances in biological research over recent years have significantly enriched biological and medical data resources. Deep learning-based techniques have been successfully utilized to process data in this field, and they have exhibited state-of-the-art performance even on high-dimensional, unstructured, and black-box biological data. The aim of the current study is to provide an overview of deep learning-based techniques used in biology and medicine and their state-of-the-art applications. In particular, we introduce the fundamentals of deep learning and then review the success of applying such methods to bioinformatics, biomedical imaging, biomedicine, and drug discovery. We also discuss the challenges and limitations of this field, and outline possible directions for further research.


Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998) cluster data points based on full dimensions. When the dimensional space grows higher, these algorithms lose their efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances over full dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in such dimensions. Although both methods reduce the dimensionality of the space before clustering, neither handles well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has recently been proposed to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
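The distance-concentration result of Beyer et al. is easy to reproduce numerically: for uniformly random points, the gap between the farthest and nearest neighbor shrinks relative to the nearest-neighbor distance as dimensionality grows. A small simulation (the helper name and the uniform-data setup are ours, for illustration only):

```python
import numpy as np

def relative_contrast(dim, n=500, rng=None):
    """For n uniform random points, return (d_max - d_min) / d_min
    measured from one query point to all others. This 'relative
    contrast' shrinks toward 0 as dimensionality grows, which is
    the concentration effect behind the curse of dimensionality."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.random((n, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()
```

In 2 dimensions the nearest neighbor is typically orders of magnitude closer than the farthest point; in 1000 dimensions all distances cluster around the same value, so full-dimensional nearest-neighbor queries lose their meaning.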


2017 ◽  
Author(s):  
Kevin R. Moon ◽  
David van Dijk ◽  
Zheng Wang ◽  
Scott Gigante ◽  
Daniel B. Burkhardt ◽  
...  

Abstract: With the advent of high-throughput technologies measuring high-dimensional biological data, there is a pressing need for visualization tools that reveal the structure and emergent patterns of data in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure in data by an information-geometric distance between datapoints. We perform extensive comparison between PHATE and other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data including continual progressions, branches, and clusters. We define a manifold preservation metric DEMaP to show that PHATE produces quantitatively better denoised embeddings than existing visualization methods. We show that PHATE is able to gain unique insight from a newly generated scRNA-seq dataset of human germ layer differentiation. Here, PHATE reveals a dynamic picture of the main developmental branches in unparalleled detail, including the identification of three novel subpopulations. Finally, we show that PHATE is applicable to a wide variety of datatypes including mass cytometry, single-cell RNA-sequencing, Hi-C, and gut microbiome data, where it can generate interpretable insights into the underlying systems.
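The abstract describes PHATE only at a high level; the published pipeline builds a diffusion operator from local affinities, diffuses it for t steps, takes distances between log-potential representations, and embeds those with MDS. A toy sketch of that pipeline, assuming a fixed Gaussian kernel and fixed t (the real method uses an adaptive alpha-decay kernel and a data-driven choice of t, so this is a simplification, not PHATE itself):

```python
import numpy as np

def phate_like_embed(X, t=4, eps=1.0, n_components=2):
    """Toy PHATE-style pipeline: Gaussian affinities -> row-stochastic
    diffusion operator -> t-step diffusion -> log 'potential' distances
    -> classical MDS into n_components dimensions."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)                       # fixed-bandwidth kernel
    P = K / K.sum(1, keepdims=True)             # diffusion operator
    Pt = np.linalg.matrix_power(P, t)           # t-step diffusion
    U = -np.log(Pt + 1e-12)                     # potential representation
    pot_d = np.linalg.norm(U[:, None] - U[None, :], axis=-1)
    # classical MDS on the potential distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pot_d ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))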


2018 ◽  
Vol 7 (2.21) ◽  
pp. 291
Author(s):  
S Sivakumar ◽  
Kumar Narayanan ◽  
Swaraj Paul Chinnaraju ◽  
Senthil Kumar Janahan

Data mining is the extraction of useful information from data sets. Clustering is a key data mining task: it helps divide and organize the records of a data set into groups according to a given similarity measure. Clustering high-dimensional data has been a major challenge, since most existing clustering algorithms are inefficient when similarity is computed between data points in the full-dimensional space. Various projected clustering algorithms have been suggested to address these problems, but many of them face difficulties when clusters hide in subspaces of low dimensionality. These challenges motivate our proposal of a partitional distance-based projected clustering algorithm. The proposed work is designed to find projected clusters in high-dimensional space by adapting an improved k-medoids method. The second phase removes outliers, while the third phase finds clusters in different subspaces. The clustering technique is based on the k-medoids algorithm, with distances computed over the set of attributes where values are dense.
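The core idea of the abstract appears to be a k-medoids variant in which distances are restricted to each cluster's relevant dimensions. A hypothetical sketch of the assignment step under that reading (the function name, parameters, and the 1/sqrt(|dims|) normalization are ours, not the paper's):

```python
import numpy as np

def projected_assign(X, medoids, dims_per_cluster):
    """Assignment step of a projected k-medoids: each cluster has its
    own subset of relevant dimensions, and a point is compared to a
    medoid using only that cluster's dimensions, normalized by the
    subspace size so clusters with different |dims| are comparable."""
    costs = []
    for m, dims in zip(medoids, dims_per_cluster):
        diff = X[:, dims] - m[dims]
        costs.append(np.linalg.norm(diff, axis=1) / np.sqrt(len(dims)))
    # each point joins the cluster whose projected medoid is closest
    return np.argmin(np.stack(costs, axis=1), axis=1)
```

In a full algorithm this assignment step would alternate with medoid updates and with re-estimating each cluster's dense attribute set; only the assignment is sketched here.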


2018 ◽  
Vol 14 (3) ◽  
pp. 38-55 ◽  
Author(s):  
Kavan Fatehi ◽  
Mohsen Rezvani ◽  
Mansoor Fateh ◽  
Mohammad-Reza Pajoohan

Because of the curse of dimensionality in high-dimensional data, a significant amount of research has recently been conducted on subspace clustering, which aims at discovering clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, leading to a loss of clustering accuracy and increased running time. A bottom-up density-based approach is suggested in this article, in which the cluster structure serves as a similarity measure to generate the optimal subspaces, raising the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering the similarity of their cluster structure, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the results of the proposed approach are significantly better in quality and runtime than the state-of-the-art in clustering high-dimensional data.
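The article uses the cluster structure itself as the similarity measure between candidate subspaces. One simple way to realize such a measure is pairwise co-membership agreement (the Rand index) between the clusterings found in two subspaces; the abstract does not specify its exact measure, so this is only an illustrative stand-in:

```python
import numpy as np

def label_agreement(a, b):
    """Rand index between two clusterings: the fraction of point pairs
    on which the labelings agree (both pairs together, or both apart).
    Two subspaces with near-identical cluster structure score near 1
    and would be candidates for merging."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), 1)          # each unordered pair once
    return (same_a[iu] == same_b[iu]).mean()
```

Co-membership comparison has the convenient property of being invariant to label permutation, which matters here because the clusterings of two subspaces carry no shared label meaning.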


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Anna C. Belkina ◽  
Christopher O. Ciccolella ◽  
Rina Anno ◽  
Richard Halpert ◽  
Josef Spidlen ◽  
...  

Abstract: Accurate and comprehensive extraction of information from high-dimensional single cell datasets necessitates faithful visualizations to assess biological populations. A state-of-the-art algorithm for non-linear dimension reduction, t-SNE, requires multiple heuristics and fails to produce clear representations of datasets when millions of cells are projected. We develop opt-SNE, an automated toolkit for t-SNE parameter selection that utilizes Kullback-Leibler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner. The precise calibration of early exaggeration together with opt-SNE adjustment of gradient descent learning rate dramatically improves computation time and enables high-quality visualization of large cytometry and transcriptomics datasets, overcoming limitations of analysis tools with hard-coded parameters that often produce poorly resolved or misleading maps of fluorescent and mass cytometry data. In summary, opt-SNE enables superior data resolution in t-SNE space and thereby more accurate data interpretation.
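opt-SNE's key move is monitoring KL divergence during optimization so that, per dataset, the early-exaggeration phase ends when it stops helping rather than at a hard-coded iteration count. A toy version of such a stopping rule applied to a recorded KL trajectory (the threshold, function name, and plateau criterion are illustrative, not the toolkit's actual criterion):

```python
def stop_early_exaggeration(kl_values, rel_tol=0.02):
    """Return the 1-based iteration at which to end early exaggeration:
    the first iteration where the relative drop in KL divergence falls
    below rel_tol, i.e. the exaggerated embedding has plateaued. If no
    plateau is found, return len(kl_values) (run the phase to the end)."""
    for i in range(1, len(kl_values)):
        prev, cur = kl_values[i - 1], kl_values[i]
        if prev > 0 and (prev - cur) / prev < rel_tol:
            return i
    return len(kl_values)
```

With a hard-coded phase length, a small dataset wastes iterations after the plateau while a huge one exits too early; a divergence-driven rule like this adapts the phase length to the data, which is the behavior the abstract credits for the improved maps.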


2013 ◽  
Author(s):  
Natapol Pornputtapong ◽  
Amporn Atsawarungruangkit ◽  
Kawee Numpacharoen

2021 ◽  
Vol 7 (3) ◽  
pp. 49
Author(s):  
Daniel Carlos Guimarães Pedronette ◽  
Lucas Pascotti Valem ◽  
Longin Jan Latecki

Visual features and representation learning strategies experienced huge advances in the previous decade, mainly supported by deep learning approaches. However, retrieval tasks are still performed mainly based on traditional pairwise dissimilarity measures, while the learned representations lie on high-dimensional manifolds. With the aim of going beyond pairwise analysis, post-processing methods have been proposed to replace pairwise measures with globally defined measures, capable of analyzing collections in terms of the underlying data manifold. The most representative approaches are diffusion and rank-based methods. While diffusion approaches can be computationally expensive, rank-based methods lack theoretical background. In this paper, we propose an efficient Rank-based Diffusion Process which combines both approaches and avoids the drawbacks of each. The obtained method is capable of efficiently approximating a diffusion process by exploiting rank-based information, while assuring its convergence. The algorithm exhibits very low asymptotic complexity and can be computed regionally, making it suitable for queries outside of the dataset. An experimental evaluation conducted for image retrieval and person re-ID tasks on diverse datasets demonstrates the effectiveness of the proposed approach, with results comparable to the state-of-the-art.
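The paper combines rank information with diffusion; one common way to do this is to build affinities from reciprocal neighborhood ranks and then diffuse on the resulting graph. A hypothetical sketch along those lines (not the authors' exact formulation; the 1/(1+rank) weighting and top-k truncation are our illustrative choices):

```python
import numpy as np

def rank_affinity(D, k=3):
    """Rank-based affinity from a distance matrix D: rank neighbors per
    row (rank 0 = self), keep only the k nearest, weight by 1/(1+rank),
    and symmetrize by elementwise product -- the affinity is high only
    when i and j rank each other highly (reciprocal neighbors)."""
    ranks = np.argsort(np.argsort(D, axis=1), axis=1)
    W = np.where(ranks <= k, 1.0 / (1.0 + ranks), 0.0)
    return W * W.T

def diffuse(W, steps=2):
    """A few diffusion steps on the row-normalized affinity graph,
    spreading similarity along the data manifold."""
    P = W / W.sum(1, keepdims=True)
    return np.linalg.matrix_power(P, steps)
```

Because the affinity is built from top-k ranks rather than raw distances, the graph is sparse and scale-free by construction, which is what makes rank-driven approximations of diffusion cheap compared to dense diffusion on the full similarity matrix.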

