A Clustering Technique for Reducing Noise in High Dimensional Non-Linear Data Using M-DENCLUE Algorithm

Author(s):  
Nandhakumar R ◽  
Author(s):  
Jan-Tobias Sohns ◽  
Michaela Schmitt ◽  
Fabian Jirasek ◽  
Hans Hasse ◽  
Heike Leitte


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Victor Bernal ◽  
Rainer Bischoff ◽  
Peter Horvatovich ◽  
Victor Guryev ◽  
Marco Grzegorczyk

Abstract Background In systems biology, it is important to reconstruct regulatory networks from quantitative molecular profiles. Gaussian graphical models (GGMs) are one of the most popular methods to this end. A GGM consists of nodes (representing the transcripts, metabolites or proteins) inter-connected by edges (reflecting their partial correlations). Learning the edges from quantitative molecular profiles is statistically challenging, as there are usually fewer samples than nodes (the ‘high-dimensional problem’). Shrinkage methods address this issue by learning a regularized GGM. However, it remains an open question how the shrinkage affects the final result and its interpretation. Results We show that the shrinkage biases the partial correlations in a non-linear way. This bias not only changes the magnitudes of the partial correlations but also affects their order. Furthermore, it makes networks obtained from different experiments incomparable and hinders their biological interpretation. We propose a method, referred to as ‘un-shrinking’ the partial correlation, which corrects for this non-linear bias. Unlike traditional methods, which use a fixed shrinkage value, the new approach provides partial correlations that are closer to the actual (population) values and that are easier to interpret. This is demonstrated on two gene expression datasets from Escherichia coli and Mus musculus. Conclusions GGMs are popular undirected graphical models based on partial correlations. The application of GGMs to reconstruct regulatory networks is commonly performed using shrinkage to overcome the ‘high-dimensional problem’. Besides its advantages, we have identified that the shrinkage introduces a non-linear bias in the partial correlations. Ignoring these effects of the shrinkage can obscure the interpretation of the network and impede the validation of previously reported results.
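
The shrinkage step and the resulting partial correlations can be sketched in a few lines of NumPy. This is a minimal illustration of identity-target shrinkage as used by standard GGM estimators (e.g. in the Schäfer-Strimmer style), not the authors' 'un-shrunk' implementation; the function name and the toy data are ours:

import numpy as np

def shrunk_partial_correlations(X, lam):
    # Sample correlation matrix (p x p), shrunk toward the identity target:
    # R(lam) = (1 - lam) * R + lam * I, which is full rank for any lam > 0.
    R = np.corrcoef(X, rowvar=False)
    p = R.shape[0]
    R_shrunk = (1.0 - lam) * R + lam * np.eye(p)
    # Partial correlations from the precision matrix:
    # pcor_ij = -P_ij / sqrt(P_ii * P_jj).
    P = np.linalg.inv(R_shrunk)
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))   # n = 20 samples, p = 50 variables (n < p)
weak = shrunk_partial_correlations(X, 0.2)
strong = shrunk_partial_correlations(X, 0.8)

Comparing weak and strong shows that heavier shrinkage does not rescale all partial correlations uniformly; their ranking can change with lam, which is the non-linear bias the abstract describes.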


Author(s):  
Dharmveer Singh Rajput ◽  
Pramod Kumar Singh ◽  
Mahua Bhattacharya

PLoS ONE ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. e0248046
Author(s):  
Elizabeth Hou ◽  
Earl Lawrence ◽  
Alfred O. Hero

The ensemble Kalman filter (EnKF) is a data assimilation technique that uses an ensemble of models, updated with data, to track the time evolution of a usually non-linear system. It does so by using an empirical approximation to the well-known Kalman filter. However, its performance can suffer when the ensemble size is smaller than the state space, as is often necessary for computationally burdensome models. In this scenario, the empirical estimate of the state covariance is not full rank and possibly quite noisy. To solve this problem in the high-dimensional regime, we propose a computationally fast, easy-to-implement algorithm called the penalized ensemble Kalman filter (PEnKF). Under certain conditions, it can be proven theoretically that the PEnKF is accurate (the estimation error converges to zero) despite having fewer ensemble members than state dimensions. Further, in contrast to localization methods, the proposed approach learns the covariance structure associated with the dynamical system. These theoretical results are supported by simulations of several non-linear, high-dimensional systems.
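
As a rough sketch of the idea (the paper's exact penalty and tuning scheme are not reproduced here; scikit-learn's graphical lasso stands in for the l1-penalized covariance estimator, and the function name is illustrative):

import numpy as np
from sklearn.covariance import GraphicalLasso

def penalized_enkf_analysis(ensemble, H, R, y, alpha=0.05, seed=None):
    # ensemble: (n_state, n_members) forecast ensemble; H: observation
    # operator; R: observation error covariance; y: observation vector.
    rng = np.random.default_rng(seed)
    n_members = ensemble.shape[1]
    # Replace the rank-deficient sample covariance with an l1-penalized
    # (sparse) estimate that stays well conditioned for n_members < n_state.
    P = GraphicalLasso(alpha=alpha).fit(ensemble.T).covariance_
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    # Perturbed-observation update applied to every ensemble member.
    perturbed = y[:, None] + rng.multivariate_normal(
        np.zeros(len(y)), R, size=n_members).T
    return ensemble + K @ (perturbed - H @ ensemble)

The key departure from the plain EnKF is the single line estimating P: a penalized estimator is substituted for the raw ensemble covariance, which is what allows the gain to remain stable when the ensemble is smaller than the state.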


2021 ◽  
Author(s):  
Ramon Nogueira ◽  
Chris C. Rodgers ◽  
Randy M. Bruno ◽  
Stefano Fusi

Adaptive behavior in humans, rodents, and other animals often requires integrating multiple sensory inputs over time. Here we studied the behavior and the neural activity of mice trained to actively integrate information from different whiskers to report the curvature of an object. Analysis of high-speed videos of the whiskers revealed that the task could be solved by linearly integrating the whisker contacts on the object. However, recordings from the mouse barrel cortex revealed that the neural representations are high dimensional, as the inputs from multiple whiskers are mixed non-linearly to produce the observed neural activity. This representation enables the animal to perform a broad class of significantly more complex tasks, with minimal disruption of the ability to generalize to novel situations in simpler tasks. Simulated recurrent neural networks trained to perform similar tasks reproduced both the behavioral and the neuronal experimental observations. Our work suggests that the somatosensory cortex operates in a regime that represents an efficient compromise between generalization, which typically requires pure and linear mixed selectivity representations, and the ability to perform complex discrimination tasks, which is granted by non-linear mixed representations.


Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains; it affects the time complexity, space complexity, scalability and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches to clustering are required. This research has focused on developing mathematical models, techniques and clustering algorithms specifically for high-dimensional data. With the rapid growth in the fields of communication and technology, there has been tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high-dimensional non-linear data, the data become very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high-dimensional data is to overcome this “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
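
The abstract does not spell out the M-DENCLUE modifications, so the following is only a sketch of the classic DENCLUE scheme it builds on (Gaussian kernel density, hill-climbing to density attractors, a density threshold for noise); the parameters sigma and xi are illustrative:

import numpy as np

def density_attractor(x, data, sigma):
    # Hill-climb to a local maximum of the Gaussian kernel density
    # (the fixed-point/mean-shift update used by DENCLUE 2.0).
    for _ in range(100):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-6:
            break
        x = x_new
    return x

def denclue_like(data, sigma=0.5, xi=2.0, tol=1e-2):
    # Points that climb to the same attractor share a cluster; attractors
    # whose density falls below the threshold xi are labeled noise (-1).
    labels = -np.ones(len(data), dtype=int)
    centers = []
    for i, x in enumerate(data):
        a = density_attractor(x, data, sigma)
        density = np.exp(-np.sum((data - a) ** 2, axis=1)
                         / (2 * sigma ** 2)).sum()
        if density < xi:
            continue
        for k, c in enumerate(centers):
            if np.linalg.norm(a - c) < tol:
                labels[i] = k
                break
        else:
            centers.append(a)
            labels[i] = len(centers) - 1
    return labels

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (40, 5)), rng.normal(3, 0.3, (40, 5))])
labels = denclue_like(data, sigma=0.7, xi=2.0)

Points whose attractor density falls below xi are discarded as noise, which is the mechanism behind the noise reduction named in the title.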

