Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data

The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods and serve as a foundation for future research.
Code and example data for DEWÄKSS are available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch.

10.1101/2020.02.28.970202 ◽ 2020

Author(s): Andreas Tjärnberg ◽ Omar Mahmood ◽ Christopher A Jackson ◽ Giuseppe-Antonio Saldi ◽ Kyunghyun Cho ◽ ...

Keyword(s): Objective Function ◽ Single Cell ◽ Missing Values ◽ Nearest Neighbor ◽ Projection Methods ◽ Future Research ◽ Specific Gene ◽ K Nearest Neighbor ◽ Single Cell Genomics ◽ And Diffusion

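The DEWÄKSS abstract describes denoising over a weighted kNN graph with hyperparameters chosen by a self-supervised objective, but it does not spell out that objective. The following is a minimal sketch, not the authors' implementation, under one assumed setup: each cell is reconstructed from its neighbors only (no self-edge), and the number of neighbors k is picked to minimize the mean squared error between the reconstruction and the observed data. The function names `knn_weights` and `tune_k`, the distance-based weighting, and the toy data are all illustrative assumptions.

```python
import numpy as np


def knn_weights(X, k):
    """Row-normalised kNN affinity matrix with a zero diagonal (no self-edges).

    X is a (cells x features) array. The weighting scheme is an assumption
    made for this sketch, not the kernel used by any particular package.
    """
    # Pairwise squared Euclidean distances between cells.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbour

    idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest cells
    rows = np.arange(X.shape[0])[:, None]
    nd = d2[rows, idx]  # distances to those neighbours

    # Closer neighbours get larger weights; scale by each cell's mean distance.
    w = np.exp(-nd / (nd.mean(axis=1, keepdims=True) + 1e-12))

    W = np.zeros_like(d2)
    W[rows, idx] = w
    return W / W.sum(axis=1, keepdims=True)


def tune_k(X, candidates):
    """Pick k by the error of predicting each cell from its neighbours only."""
    errors = {}
    for k in candidates:
        W = knn_weights(X, k)
        denoised = W @ X  # weighted neighbour average, one step, no diffusion
        errors[k] = float(np.mean((denoised - X) ** 2))
    best = min(errors, key=errors.get)
    return best, errors


# Toy data (hypothetical): two well-separated groups of noisy "cells".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])
best_k, errors = tune_k(X, [1, 5, 10, 20])
print(best_k, errors)
```

Because a cell's own value never enters its reconstruction, the error does not trivially fall to zero, which is what makes it usable as a tuning criterion in this sketch. Note that neighbor selection here still depends on the cell's own profile, so this is only an approximation of a fully self-supervised scheme.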

DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification

Vegetation of Russia ◽ 10.31111/vegrus/2015.27.125 ◽ 2015 ◽ pp. 125-138 ◽ Cited By ~ 2

Author(s): I. V. Goncharenko

Keyword(s): Cluster Analysis ◽ Clustering Algorithm ◽ Nearest Neighbor ◽ Clustering Algorithms ◽ Protein Structures ◽ Hierarchical Cluster ◽ Vegetation Classification ◽ K Nearest Neighbor ◽ Neighbor Graph ◽ Nearest Neighbor Graph

In this article we propose a new method of non-hierarchical cluster analysis based on the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The k-nearest neighbor (k-NN) classification method was originally developed in 1951 (Fix, Hodges, 1951). The term “k-NN graph” and several k-NN clustering algorithms appeared later (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an “excessive” graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, “upward” clustering, which assembles one cluster after another in sequence. Until now, graph-based cluster analysis has not been applied to the classification of vegetation datasets.

Symmetry Breaking and Training from Incomplete Data with Radial Basis Boltzmann Machines

International Journal of Neural Systems ◽ 10.1142/s0129065797000318 ◽ 1997 ◽ Vol 08 (03) ◽ pp. 301-315 ◽ Cited By ~ 8

Author(s): Marcel J. Nijman ◽ Hilbert J. Kappen

Keyword(s): Symmetry Breaking ◽ Incomplete Data ◽ Missing Values ◽ Nearest Neighbor ◽ Boltzmann Machine ◽ K Nearest Neighbor ◽ Data Set ◽ Input Space ◽ Learning Rules ◽ Radial Basis

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We also show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.

Multiple Regression and K-Nearest-Neighbor Based Algorithm for Estimating Missing Values within Sensor

10.1109/icnisc54316.2021.00116 ◽ 2021

Author(s): Xiantong Li ◽ Yuan Sui

Keyword(s): Multiple Regression ◽ Missing Values ◽ Nearest Neighbor ◽ K Nearest Neighbor

Differential abundance testing on single-cell data using k-nearest neighbor graphs

Nature Biotechnology ◽ 10.1038/s41587-021-01033-z ◽ 2021

Author(s): Emma Dann ◽ Neil C. Henderson ◽ Sarah A. Teichmann ◽ Michael D. Morgan ◽ John C. Marioni

Keyword(s): Single Cell ◽ Nearest Neighbor ◽ K Nearest Neighbor ◽ Differential Abundance ◽ Cell Data

A Novel clustering method based on hybrid K-nearest-neighbor graph

Pattern Recognition ◽ 10.1016/j.patcog.2017.09.008 ◽ 2018 ◽ Vol 74 ◽ pp. 1-14 ◽ Cited By ~ 19

Author(s): Yikun Qin ◽ Zhu Liang Yu ◽ Chang-Dong Wang ◽ Zhenghui Gu ◽ Yuanqing Li

Keyword(s): Nearest Neighbor ◽ K Nearest Neighbor ◽ Clustering Method ◽ Neighbor Graph ◽ Nearest Neighbor Graph

Discovery of Regional Co-location Patterns with k-Nearest Neighbor Graph

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽ 10.1007/978-3-642-37453-1_15 ◽ 2013 ◽ pp. 174-186 ◽ Cited By ~ 3

Author(s): Feng Qian ◽ Kevin Chiew ◽ Qinming He ◽ Hao Huang ◽ Lianhang Ma

Keyword(s): Nearest Neighbor ◽ K Nearest Neighbor ◽ Neighbor Graph ◽ Location Patterns ◽ Nearest Neighbor Graph

Data Clustering Based on Community Structure in Mutual k-Nearest Neighbor Graph

2018 41st International Conference on Telecommunications and Signal Processing (TSP) ◽ 10.1109/tsp.2018.8441226 ◽ 2018

Author(s): Honglei Zhang ◽ Serkan Kiranyaz ◽ Moncef Gabbouj

Keyword(s): Community Structure ◽ Data Clustering ◽ Nearest Neighbor ◽ K Nearest Neighbor ◽ Neighbor Graph ◽ Nearest Neighbor Graph

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v2 ◽ 2021

Author(s): Ayesha Sania ◽ Nicolo Pini ◽ Morgan Nelson ◽ Michael Myers ◽ Lauren Shuffrey ◽ ...

Keyword(s): Missing Data ◽ Missing Values ◽ Nearest Neighbor ◽ Learning Algorithm ◽ Drinking Behavior ◽ Nearest Neighbors ◽ First Trimester ◽ Epidemiologic Studies ◽ K Nearest Neighbor ◽ Timeline Followback

Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods: The Safe Passage study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, “K Nearest Neighbor” (K-NN). K-NN imputes missing values for a participant using data from the participants closest to that participant. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. Validation was done on randomly deleted data for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from first-trimester data with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
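The imputation step described in this abstract (fill a participant's missing values from the closest participants, weighting by distance) can be illustrated with a small sketch. This is not the study's code: the function name `knn_impute`, the inverse-distance weighting, and the use of root-mean-square distance over jointly observed columns are assumptions made for illustration, and the day-of-week matching mentioned in the abstract is left out.

```python
import numpy as np


def knn_impute(X, k=5):
    """Fill NaNs with an inverse-distance-weighted mean over the k nearest rows.

    Rows are participants, columns are days. Distances use only the columns
    that both rows observed (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)

    for i in range(X.shape[0]):
        for j in np.where(~observed[i])[0]:
            # Candidate donors: rows that actually observed column j.
            cand = np.where(observed[:, j])[0]
            dists = []
            for c in cand:
                shared = observed[i] & observed[c]
                if shared.any():
                    diff = X[i, shared] - X[c, shared]
                    dists.append(np.sqrt(np.mean(diff ** 2)))
                else:
                    dists.append(np.inf)  # nothing in common, cannot compare
            dists = np.array(dists)

            keep = np.isfinite(dists)
            cand, dists = cand[keep], dists[keep]
            if cand.size == 0:
                continue  # no usable donor; leave the value missing

            nearest = np.argsort(dists)[:k]
            w = 1.0 / (dists[nearest] + 1e-9)  # closer donors count more
            filled[i, j] = np.sum(w * X[cand[nearest], j]) / np.sum(w)
    return filled


# Toy example (hypothetical data): the second row is missing its last value.
data = [[1.0, 2.0, 3.0],
        [1.0, 2.0, np.nan],
        [10.0, 20.0, 30.0]]
print(knn_impute(data, k=1))
```

With `k=1` the missing entry is copied from the single closest row; larger `k` blends several donors, with the weights shrinking as distance grows.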

Using the k-Nearest Neighbor Graph for Proximity Searching in Metric Spaces

String Processing and Information Retrieval - Lecture Notes in Computer Science ◽ 10.1007/11575832_14 ◽ 2005 ◽ pp. 127-138 ◽ Cited By ~ 14

Author(s): Rodrigo Paredes ◽ Edgar Chávez

Keyword(s): Metric Spaces ◽ Nearest Neighbor ◽ K Nearest Neighbor ◽ Neighbor Graph ◽ Nearest Neighbor Graph
