Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience

10.1101/273128 ◽

2018 ◽

Cited By ~ 2

Author(s):

Emily L. Mackevicius ◽

Andrew H. Bahle ◽

Alex H. Williams ◽

Shijie Gu ◽

Natalia I. Denissenko ◽

...

Keyword(s):

Large Scale ◽

Temporal Structure ◽

Simulated Data ◽

Salient Feature ◽

High Dimensional ◽

Neural Data ◽

Neural Recordings ◽

Reduction Techniques ◽

Low Dimensional ◽

Neural Sequences

Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here we describe a software toolbox, called seqNMF, with new methods for extracting informative, non-redundant sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral data sets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.

Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience

eLife ◽

10.7554/elife.38471 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 26

Author(s):

Emily L Mackevicius ◽

Andrew H Bahle ◽

Alex H Williams ◽

Shijie Gu ◽

Natalia I Denisenko ◽

...

Keyword(s):

Large Scale ◽

Temporal Structure ◽

Simulated Data ◽

Salient Feature ◽

High Dimensional ◽

Neural Data ◽

Neural Recordings ◽

Reduction Techniques ◽

Low Dimensional ◽

Neural Sequences

Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox, called seqNMF, with new methods for extracting informative, non-redundant sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral data sets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.
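The abstract does not spell out the underlying model, so the following is only a sketch under an assumption: that a repeated sequence can be represented by a convolutional matrix factorization, in which each factor has a neuron-by-lag pattern `W[:, k, :]` and a time course `H[k, :]` marking when the pattern occurs. The function name `conv_reconstruct` and the toy arrays are hypothetical and are not the seqNMF toolbox API.

```python
import numpy as np

def conv_reconstruct(W, H):
    """Rebuild a (neurons x time) matrix from convolutional factors.

    W: array of shape (N, K, L), the pattern of factor k over N neurons and L lags.
    H: array of shape (K, T), when (and how strongly) each factor occurs.
    Returns X_hat of shape (N, T) with X_hat[n, t] = sum_k sum_l W[n, k, l] * H[k, t - l].
    """
    N, K, L = W.shape
    T = H.shape[1]
    X_hat = np.zeros((N, T))
    for lag in range(L):
        # Shift the time courses right by `lag` bins, padding with zeros.
        H_shift = np.zeros((K, T))
        H_shift[:, lag:] = H[:, :T - lag]
        X_hat += W[:, :, lag] @ H_shift
    return X_hat

# Toy example (hypothetical data): one factor in which neuron i fires at lag i.
W = np.zeros((3, 1, 3))
for i in range(3):
    W[i, 0, i] = 1.0

# The sequence occurs twice, starting at time bins 1 and 6.
H = np.zeros((1, 10))
H[0, [1, 6]] = 1.0

X_hat = conv_reconstruct(W, H)
print(X_hat.astype(int))
```

In this toy case each occurrence in `H` stamps a diagonal "staircase" of activity into `X_hat`, which is the kind of repeated temporal pattern that an ordinary, non-convolutional factorization would need several components to describe.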

Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

Scientific Programming ◽

10.1155/2015/180214 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Sai Kiranmayee Samudrala ◽

Jaroslaw Zola ◽

Srinivas Aluru ◽

Baskar Ganapathysubramanian

Keyword(s):

Dimensionality Reduction ◽

Organic Solar Cells ◽

Large Scale ◽

Parallel Implementation ◽

High Dimensional Data ◽

Real Life ◽

Processing Parameters ◽

High Dimensional ◽

Morphology Evolution ◽

Reduction Techniques

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

Using computational theory to constrain statistical models of neural data

10.1101/104737 ◽

2017 ◽

Cited By ~ 1

Author(s):

Scott W. Linderman ◽

Samuel J. Gershman

Keyword(s):

Large Scale ◽

Recent Literature ◽

Theory Of Computation ◽

Temporal Difference Learning ◽

Computational Theory ◽

Neural Data ◽

Neural Recordings ◽

First Order ◽

Clear Cut ◽

Worked Example

Computational neuroscience is, to first order, dominated by two approaches: the “bottom-up” approach, which searches for statistical patterns in large-scale neural recordings, and the “top-down” approach, which begins with a theory of computation and considers plausible neural implementations. While this division is not clear-cut, we argue that these approaches should be much more intimately linked. From a Bayesian perspective, computational theories provide constrained prior distributions on neural data, albeit highly sophisticated ones. By connecting theory to observation via a probabilistic model, we provide the link necessary to test, evaluate, and revise our theories in a data-driven and statistically rigorous fashion. This review highlights examples of this theory-driven pipeline for neural data analysis in recent literature and illustrates it with a worked example based on the temporal difference learning model of dopamine.
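For readers unfamiliar with temporal difference learning, a minimal sketch of tabular TD(0) may help. It is not the worked example from the review: the chain task, the function name `td0_chain`, and all parameter values are assumptions chosen for illustration. The prediction error `delta = r + gamma * V[s + 1] - V[s]` is the quantity that TD models of dopamine relate to neural activity.

```python
import numpy as np

def td0_chain(n_states=5, n_episodes=200, alpha=0.1, gamma=1.0):
    """Tabular TD(0) on a deterministic chain with a reward of 1 at the last step.

    Returns the learned values and the prediction errors of every episode.
    """
    V = np.zeros(n_states + 1)  # extra final entry is the terminal state, value 0
    deltas = np.zeros((n_episodes, n_states))
    for ep in range(n_episodes):
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0
            # Reward prediction error for the transition s -> s + 1.
            delta = r + gamma * V[s + 1] - V[s]
            V[s] += alpha * delta
            deltas[ep, s] = delta
    return V[:n_states], deltas

V, deltas = td0_chain()
print("learned values:", np.round(V, 3))
print("first-episode errors:", deltas[0])
print("last-episode errors:", np.round(deltas[-1], 4))
```

Early in training the error appears only at the rewarded step; as the values are learned, the error at every step shrinks toward zero because the reward becomes fully predicted.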

An Evaluation of Supervised Dimensionality Reduction For Large Scale Data

Journal of Machine and Computing ◽

10.53759/7669/jmc202202003 ◽

2022 ◽

pp. 17-25

Author(s):

Nancy Jan Sliper

Keyword(s):

Dimensionality Reduction ◽

Large Scale ◽

Simulated Data ◽

Principal Component ◽

Low Rank ◽

Learning Tools ◽

Large Scale Data ◽

Reduction Methods ◽

Low Dimensional ◽

Scale Data

Experimenters today frequently quantify millions or even billions of characteristics (measurements) per sample to address critical biological questions, in the hope that machine learning tools will be able to make correct data-driven judgments. An efficient analysis requires a low-dimensional representation that preserves the differentiating features (e.g., whether a certain ailment is present in the person's body) in data whose size and complexity are orders of magnitude apart. While there are several systems that can handle millions of variables and yet have strong empirical and conceptual guarantees, there are few that can be clearly understood. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for expanding Principal Component Analysis (PCA) by including category moment estimations in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. Using both experimental and simulated data benchmarks, we show that LOLR projections and their extensions enhance representations of data for future classifications while retaining computing flexibility and reliability. In terms of accuracy, LOLR prediction outperforms other modular linear dimension reduction methods that require much longer computation times on conventional computers. LOLR is applied to brain image processing datasets with more than 150 million attributes, and many genome sequencing datasets have more than half a million attributes.
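One way to read "expanding PCA by including the class-conditional means" is sketched below. This is an interpretation, not the paper's implementation: the function name `mean_augmented_projection`, the choice to centre each sample by its own class mean, and the QR orthonormalisation step are all assumptions made for illustration.

```python
import numpy as np

def mean_augmented_projection(X, y, n_components):
    """Build a (d x n_components) projection from class-mean differences plus PCA.

    X: (n_samples, d) data matrix; y: (n_samples,) class labels.
    """
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Differences of each class mean from the first class mean: (C - 1, d).
    mean_diffs = means[1:] - means[0]
    n_pca = n_components - mean_diffs.shape[0]
    if n_pca < 0:
        raise ValueError("n_components must be at least (number of classes - 1)")
    # Centre every sample by its own class mean, then take PCA directions via SVD.
    Xc = X - means[np.searchsorted(classes, y)]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Stack mean differences first, then the leading principal directions.
    A = np.concatenate([mean_diffs, Vt[:n_pca]], axis=0).T
    Q, _ = np.linalg.qr(A)  # orthonormal columns spanning the same subspace
    return Q

# Toy data (hypothetical): two classes in 50 dimensions, shifted along one axis.
rng = np.random.default_rng(0)
n_per, d = 20, 50
X0 = rng.normal(size=(n_per, d))
X1 = rng.normal(size=(n_per, d))
X1[:, 0] += 3.0
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n_per)

Q = mean_augmented_projection(X, y, n_components=3)
Z = X @ Q  # low-dimensional representation, shape (40, 3)
print(Z.shape)
```

Because the mean difference is placed first, the first projected coordinate is aligned with the direction that separates the class means, which plain PCA is not guaranteed to retain.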

Unsupervised Clusterless Decoding using a Switching Poisson Hidden Markov Model

10.1101/760470 ◽

2019 ◽

Cited By ~ 1

Author(s):

Etienne Ackermann ◽

Caleb T. Kemere ◽

John P. Cunningham

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Single Unit ◽

Hidden Markov ◽

Temporal Structure ◽

Simulated Data ◽

Model Parameters ◽

Behavioral Correlates ◽

Neural Data ◽

Multichannel Recordings

Spike sorting is a standard preprocessing step to obtain ensembles of single unit data from multiunit, multichannel recordings in neuroscience. However, more recently, some researchers have started doing analyses directly on the unsorted data. Here we present a new computational model that is an extension of the standard (unsupervised) switching Poisson hidden Markov model (where observations are time-binned spike counts from each of N neurons), to a clusterless approximation in which we observe only a d-dimensional mark for each spike. Such an unsupervised yet clusterless approach has the potential to incorporate more information than is typically available from spike-sorted approaches, and to uncover temporal structure in neural data without access to behavioral correlates. We show that our approach can recover model parameters from simulated data, and that it can uncover task-relevant structure from real neural data.
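As background for the standard model that this work extends, here is a minimal sketch of forward filtering in a Poisson hidden Markov model over binned spike counts. It covers only the sorted-spike case with known parameters, not the clusterless extension or parameter learning, and the function name `poisson_hmm_filter` and the toy numbers are hypothetical.

```python
import numpy as np
from math import lgamma

_gammaln = np.vectorize(lgamma)  # elementwise log-gamma, used for log(k!)

def poisson_hmm_filter(counts, rates, trans, init):
    """Forward (filtering) pass of a Poisson HMM.

    counts: (T, N) spike counts per time bin and neuron.
    rates:  (S, N) expected count per bin for each hidden state and neuron.
    trans:  (S, S) transition matrix with rows summing to 1.
    init:   (S,) initial state probabilities.
    Returns filtered state probabilities (T, S) and the data log-likelihood.
    """
    T = counts.shape[0]
    # log p(counts_t | state), assuming neurons are independent given the state.
    log_emit = (counts[:, None, :] * np.log(rates)[None]
                - rates[None]
                - _gammaln(counts + 1.0)[:, None, :]).sum(axis=-1)
    log_alpha = np.log(init) + log_emit[0]
    post = np.zeros((T, rates.shape[0]))
    loglik = 0.0
    for t in range(T):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            log_alpha = np.logaddexp.reduce(
                log_alpha[:, None] + np.log(trans), axis=0) + log_emit[t]
        norm = np.logaddexp.reduce(log_alpha)
        loglik += norm
        log_alpha = log_alpha - norm  # keep the filtered distribution normalised
        post[t] = np.exp(log_alpha)
    return post, loglik

# Toy example (hypothetical): 3 neurons, a quiet state and an active state.
counts = np.array([[0, 1, 0], [1, 0, 1], [18, 22, 19], [20, 21, 17], [0, 0, 1]])
rates = np.array([[1.0, 1.0, 1.0], [20.0, 20.0, 20.0]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
init = np.array([0.5, 0.5])

post, loglik = poisson_hmm_filter(counts, rates, trans, init)
print(post.argmax(axis=1))
```

The most probable state in each bin follows the switch between low and high firing, which is the kind of latent temporal structure such models are meant to expose without behavioral labels.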

Feature selection using autoencoders with Bayesian methods to high-dimensional data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211348 ◽

2021 ◽

pp. 1-10

Author(s):

Lei Shu ◽

Kun Huang ◽

Wenhao Jiang ◽

Wenming Wu ◽

Hongling Liu

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Bayesian Methods ◽

Large Scale ◽

High Dimensional Data ◽

Hybrid Approach ◽

High Dimensional ◽

Real World Data ◽

Learning Tasks ◽

Low Dimensional

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited. By learning low-dimensional representations of high-dimensional data, feature selection can retain useful features for machine learning tasks, and these features can then be used to train machine learning models effectively. Feature selection from high-dimensional data therefore remains a challenge. To address this issue, this paper proposes a hybrid approach to feature selection consisting of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer, which increases precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method provides a theoretical reference for analyzing the optimality of feature selection.

A distance based multisample test for high-dimensional compositional data with applications to the human microbiome

BMC Bioinformatics ◽

10.1186/s12859-020-3530-x ◽

2020 ◽

Vol 21 (S9) ◽

Author(s):

Qingyang Zhang ◽

Thy Dao

Keyword(s):

Large Scale ◽

Compositional Data ◽

Statistical Significance ◽

Human Microbiome ◽

Simulated Data ◽

Real Data ◽

Nonparametric Test ◽

High Dimensional ◽

Regularity Conditions ◽

Compositional Difference

Background: Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data. Results: In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method. Conclusions: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
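To make the idea of a nonparametric test based on inter-point distance concrete, the sketch below runs a generic permutation test on raw compositions. It is not the authors' statistic: the "between minus within mean distance" statistic, the function name `distance_permutation_test`, and the Dirichlet toy data are all assumptions chosen for illustration.

```python
import numpy as np

def distance_permutation_test(X, labels, n_perm=499, seed=0):
    """Permutation test for group differences using pairwise Euclidean distances.

    X: (n_samples, d) compositions (rows sum to 1); labels: (n_samples,) group ids.
    Returns the observed statistic and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    # Pairwise distances computed directly on the compositions, no transformation.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    off_diag = ~np.eye(len(labels), dtype=bool)

    def stat(lab):
        same = lab[:, None] == lab[None, :]
        within = D[same & off_diag].mean()
        between = D[~same].mean()
        return between - within

    observed = stat(labels)
    perms = np.array([stat(rng.permutation(labels)) for _ in range(n_perm)])
    # Add-one correction so the p-value is never exactly zero.
    p_value = (1 + np.sum(perms >= observed)) / (n_perm + 1)
    return observed, p_value

# Toy data (hypothetical): two populations of 3-part compositions.
rng = np.random.default_rng(1)
A = rng.dirichlet([8.0, 1.0, 1.0], size=20)
B = rng.dirichlet([1.0, 1.0, 8.0], size=20)
X = np.vstack([A, B])
labels = np.repeat([0, 1], 20)

observed, p_value = distance_permutation_test(X, labels)
print(f"statistic = {observed:.3f}, p = {p_value:.3f}")
```

Shuffling the labels approximates the null distribution of the statistic when the groups do not differ, so no distributional assumption on the compositions is needed.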

Performance Analysis of Dimensionality Reduction Techniques in the Context of Clustering

Asian Journal of Computer Science and Technology ◽

10.51983/ajcst-2019.8.s3.2084 ◽

2019 ◽

Vol 8 (S3) ◽

pp. 66-71

Author(s):

T. Sudha ◽

P. Nagendra Kumar

Keyword(s):

Principal Component Analysis ◽

Dimensionality Reduction ◽

High Dimensional Data ◽

Principal Component ◽

Component Analysis ◽

High Dimensional ◽

Reduction Techniques ◽

Dimensionality Reduction Techniques ◽

Low Dimensional ◽

Probabilistic Principal Component Analysis

Data mining is one of the major areas of research. Clustering is one of the main functionalities of data mining. High dimensionality is one of the main issues of clustering, and dimensionality reduction can be used as a solution to this problem. The present work makes a comparative study of dimensionality reduction techniques such as t-distributed stochastic neighbour embedding and probabilistic principal component analysis in the context of clustering. High dimensional data have been reduced to low dimensional data using these two techniques. Cluster analysis has been performed on the high dimensional data as well as on the low dimensional data sets obtained through t-distributed stochastic neighbour embedding and probabilistic principal component analysis, with a varying number of clusters. Mean squared error, time, and space have been considered as parameters for comparison. The results show that the time taken to convert the high dimensional data into low dimensional data using probabilistic principal component analysis is higher than the time taken using t-distributed stochastic neighbour embedding. The space required by the data set reduced through probabilistic principal component analysis is less than the storage space required by the data set reduced through t-distributed stochastic neighbour embedding.

Decoding of neural data using cohomological learning

10.1101/222331 ◽

2018 ◽

Cited By ~ 2

Author(s):

Erik Rybakken ◽

Nils Baas ◽

Benjamin Dunn

Keyword(s):

Spatial Organization ◽

Large Population ◽

Dimensional Structure ◽

Neural Population ◽

Head Direction ◽

Population Activity ◽

Neural Data ◽

Neural Recordings ◽

Data Driven Approach ◽

Low Dimensional

We introduce a novel data-driven approach to discover and decode features in the neural code coming from large population neural recordings with minimal assumptions, using cohomological learning. We apply our approach to neural recordings of mice moving freely in a box, where we find a circular feature. We then observe that the decoded value corresponds well to the head direction of the mouse. Thus we capture head direction cells and decode the head direction from the neural population activity without having to process the behaviour of the mouse. Interestingly, the decoded values convey more information about the neural activity than the tracked head direction does, with differences that have some spatial organization. Finally, we note that the residual population activity, after the head direction has been accounted for, retains some low-dimensional structure which is correlated with the speed of the mouse.

Evaluating State Space Discovery by Persistent Cohomology in the Spatial Representation System

Frontiers in Computational Neuroscience ◽

10.3389/fncom.2021.616748 ◽

2021 ◽

Vol 15 ◽

Author(s):

Louis Kang ◽

Boyan Xu ◽

Dmitriy Morozov

Keyword(s):

Topological Structure ◽

Spatial Representation ◽

Persistent Homology ◽

High Dimensional ◽

Head Direction ◽

Representation System ◽

Topological Structures ◽

Neural Recordings ◽

Persistent Cohomology ◽

Low Dimensional

Persistent cohomology is a powerful technique for discovering topological structure in data. Strategies for its use in neuroscience are still undergoing development. We comprehensively and rigorously assess its performance in simulated neural recordings of the brain's spatial representation system. Grid, head direction, and conjunctive cell populations each span low-dimensional topological structures embedded in high-dimensional neural activity space. We evaluate the ability of persistent cohomology to discover these structures for different dataset dimensions, variations in spatial tuning, and forms of noise. We quantify its ability to decode simulated animal trajectories contained within these topological structures. We also identify regimes under which mixtures of populations form product topologies that can be detected. Our results reveal how dataset parameters affect the success of topological discovery and suggest principles for applying persistent cohomology, as well as persistent homology, to experimental neural recordings.