Constructing plasticity phenotypes to classify experience-dependent development of the visual cortex

Author(s):  
Justin L. Balsor ◽  
David G. Jones ◽  
Kathryn M. Murphy

Abstract Many neural mechanisms regulate experience-dependent plasticity in the visual cortex (V1), and new techniques for quantifying large numbers of proteins or genes are moving the study of plasticity into the era of big data. With those large data sets comes the challenge of extracting biologically meaningful results about visual plasticity from data-driven analytical methods designed for high-dimensional data. In other areas of neuroscience, high-information-content methodologies are revealing more subtle aspects of neural development and individual variations that give rise to a richer picture of brain disorders. We have developed an approach for studying V1 plasticity that takes advantage of the known functions of many synaptic proteins in regulating visual plasticity and uses those functions to translate the results of high-dimensional analyses into a plasticity phenotype. Here we provide a primer for analyzing experience-dependent plasticity in V1, using example R code to identify high-dimensional changes in a group of proteins. We describe using PCA to classify high-dimensional plasticity features and to construct a plasticity phenotype from them. In the examples, we show how the plasticity phenotype can be visualized and used to identify neurobiological features in V1 that change during development or after different visual rearing conditions. We include an R package, “v1hdexplorer”, that aggregates the various coding packages and custom visualization scripts written in RStudio.
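
The PCA step described above can be sketched in a few lines of base R. This is a minimal illustration, not the authors' v1hdexplorer code: the samples-by-proteins matrix and the protein names are placeholders.

    ## Hypothetical matrix: rows are V1 tissue samples, columns are 7 proteins
    ## (names are illustrative, not the study's data)
    set.seed(1)
    proteins <- matrix(rnorm(40 * 7), nrow = 40,
                       dimnames = list(NULL, c("GluN1", "GluN2A", "GluN2B",
                                               "GluA2", "GABAAa1", "GABAAa3",
                                               "Synapsin")))

    ## Center and scale, then run PCA
    pca <- prcomp(proteins, center = TRUE, scale. = TRUE)

    ## Proportion of variance explained by each component
    summary(pca)$importance["Proportion of Variance", ]

    ## Loadings: which proteins weight each candidate plasticity feature
    round(pca$rotation[, 1:3], 2)

    ## Sample scores on the leading components can then be binned or thresholded
    ## to build a categorical plasticity phenotype for each sample
    scores <- pca$x[, 1:3]
    head(scores)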

2019 ◽  
Author(s):  
Justin L. Balsor ◽  
David G. Jones ◽  
Kathryn M. Murphy

Abstract New techniques for quantifying large numbers of proteins or genes are transforming the study of plasticity mechanisms in the visual cortex (V1) into the era of big data. With those changes comes the challenge of applying new analytical methods designed for high-dimensional data. Studies of V1, however, can take advantage of the known functions that many proteins have in regulating experience-dependent plasticity to link big data analyses with neurobiological functions. Here we discuss two workflows and provide example R code for analyzing high-dimensional changes in a group of proteins (or genes) using two data sets. The first data set includes 7 neural proteins, 9 visual conditions, and 3 regions in V1 from an animal model of amblyopia. The second data set includes 23 neural proteins and 31 ages (20 days to 80 years) from human post-mortem samples of V1. Each data set presents different challenges, and we describe using PCA, tSNE, and various clustering algorithms, including sparse high-dimensional clustering. We also describe a new approach for identifying high-dimensional features and using them to construct a plasticity phenotype that identifies neurobiological differences among clusters. We include an R package, “v1hdexplorer”, that aggregates the various coding packages and custom visualization scripts written in RStudio.
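
The tSNE-plus-clustering step can be sketched as follows. This is a hedged illustration, not the authors' workflow: expr is a placeholder samples-by-proteins matrix, and ordinary k-means stands in for the sparse high-dimensional clustering discussed in the text. It assumes the Rtsne package is installed.

    library(Rtsne)

    set.seed(2)
    expr <- matrix(rnorm(60 * 23), nrow = 60)   # placeholder for 23-protein data

    ## tSNE embedding of the scaled data into two dimensions
    emb <- Rtsne(scale(expr), dims = 2, perplexity = 15)$Y

    ## Simple k-means clustering of the embedded points (a stand-in for the
    ## sparse high-dimensional clustering described above)
    cl <- kmeans(emb, centers = 3)$cluster

    plot(emb, col = cl, pch = 19, xlab = "tSNE 1", ylab = "tSNE 2",
         main = "Clusters in the low-dimensional embedding")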


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and to segment the points accordingly. Our approach is computationally efficient and can be applied even to large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
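
A basic building block of such ID-based approaches, the two-nearest-neighbour (TwoNN) estimator of the intrinsic dimension, can be sketched in base R. This is a minimal illustration on synthetic data, not the authors' local-ID segmentation method.

    ## TwoNN intrinsic-dimension estimate: for each point take the ratio of the
    ## distances to its second and first nearest neighbours, then apply the
    ## maximum-likelihood estimator N / sum(log(mu))
    twonn_id <- function(X) {
      D <- as.matrix(dist(X))   # pairwise Euclidean distances
      diag(D) <- Inf            # ignore self-distances
      mu <- apply(D, 1, function(d) {
        nn <- sort(d)[1:2]      # first and second nearest neighbours
        nn[2] / nn[1]
      })
      nrow(X) / sum(log(mu))
    }

    set.seed(3)
    ## Points on a 2D plane embedded in 10 dimensions: the estimate should be ~2
    X <- cbind(matrix(rnorm(500 * 2), ncol = 2), matrix(0, nrow = 500, ncol = 8))
    twonn_id(X)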


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 272
Author(s):  
Zachary S.L. Foster ◽  
Scott Chamberlain ◽  
Niklaus J. Grünwald

The taxa R package provides a set of tools for defining and manipulating taxonomic data. The recent and widespread application of DNA sequencing to community composition studies is making large data sets with taxonomic information commonplace. However, compared to typical tabular data, this information is encoded in many different ways, and the hierarchical nature of taxonomic classifications makes it difficult to work with. There are many R packages that use taxonomic data to varying degrees, but there is currently no cross-package standard for how this information is encoded and manipulated. We developed the R package taxa to provide a robust and flexible solution for storing and manipulating taxonomic data in R, along with any application-specific information associated with it. Taxa provides parsers that can read common sources of taxonomic information (taxon IDs, sequence IDs, taxon names, and classifications) from nearly any format while preserving associated data. Once parsed, the taxonomic data and any associated data can be manipulated using a cohesive set of functions modeled after the popular R package dplyr. These functions take into account the hierarchical nature of taxa and can modify the taxonomy or associated data in such a way that both are kept in sync. Taxa is currently being used by the metacoder and taxize packages, which provide broadly useful functionality that we hope will speed adoption by users and developers.
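
A hedged sketch of this dplyr-style workflow follows, assuming the parse_tax_data() and filter_taxa() interfaces described in the taxa documentation; the toy table and taxon names are illustrative, not from the paper.

    library(taxa)

    otus <- data.frame(
      otu_id   = c("otu_1", "otu_2", "otu_3"),
      sequence = c("ACGT", "TGCA", "GATC"),
      lineage  = c("Bacteria;Proteobacteria;Rhizobiales",
                   "Bacteria;Firmicutes;Bacillales",
                   "Fungi;Ascomycota;Hypocreales"),
      stringsAsFactors = FALSE
    )

    ## Parse the embedded classifications while keeping the associated table linked
    obj <- parse_tax_data(otus, class_cols = "lineage", class_sep = ";")

    ## dplyr-like verbs act on the taxonomy and keep the associated data in sync
    filter_taxa(obj, taxon_names == "Bacteria", subtaxa = TRUE)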


2019 ◽  
Vol 19 (1) ◽  
pp. 1-4 ◽  
Author(s):  
Ivan Gavrilyuk ◽  
Boris N. Khoromskij

Abstract Many of the most important computational problems today involve processing large data sets and numerically solving high-dimensional integro-differential equations. These problems arise in numerical modeling in quantum chemistry, materials science, and multiparticle dynamics, as well as in machine learning, computer simulation of stochastic processes, and many other applications related to big data analysis. Modern tensor numerical methods enable the solution of multidimensional partial differential equations (PDEs) in ℝ^d by reducing them to one-dimensional calculations. They thus avoid the so-called “curse of dimensionality”, i.e., the exponential growth of computational complexity with the dimension d, in the numerical solution of high-dimensional problems. At present, both tensor numerical methods and the multilinear algebra of big data continue to expand actively into further theoretical and applied research topics. This issue of CMAM is devoted to recent developments in the theory of tensor numerical methods and their applications in scientific computing and data analysis. Current activities in this emerging field, on the effective numerical modeling of temporal and stationary multidimensional PDEs and beyond, are presented in the following ten articles, and some future trends are highlighted therein.
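
The storage idea behind these tensor formats can be illustrated in a few lines of base R: a separable (rank-1) function on a d-dimensional grid can be stored as d one-dimensional factors instead of an n^d array. This is a minimal illustration only; real tensor numerical methods use richer low-rank formats (canonical, Tucker, tensor-train).

    n <- 200
    x <- seq(0, 1, length.out = n)

    ## One-dimensional factors: 3 * n = 600 numbers instead of n^3 = 8,000,000
    f1 <- sin(pi * x); f2 <- exp(-x); f3 <- cos(pi * x)

    ## Evaluate f(x_i, y_j, z_k) = f1[i] * f2[j] * f3[k] without forming the cube
    eval_rank1 <- function(i, j, k) f1[i] * f2[j] * f3[k]

    eval_rank1(10, 20, 30)
    ## Check against the directly computed value
    sin(pi * x[10]) * exp(-x[20]) * cos(pi * x[30])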


Author(s):  
Frank Rehm ◽  
Frank Klawonn ◽  
Rudolf Kruse

Many applications in science and business, such as signal analysis or customer segmentation, deal with large amounts of data which are usually high-dimensional in the feature space. As a part of preprocessing and exploratory data analysis, visualization of the data helps to decide which kind of data mining method is likely to lead to good results, or whether outliers or noisy data need to be treated beforehand (Barnett & Lewis, 1994; Hawkins, 1980). Since the visual assessment of a feature space with more than three dimensions is not possible, it becomes necessary to find an appropriate visualization scheme for such data sets. Multidimensional scaling (MDS) is a family of methods that seek to present the important structure of the data in a reduced number of dimensions. Because conventional MDS techniques preserve pairwise distances, their memory and computation requirements are fairly high, which prevents their application to large data sets. In this work we present two methods that visualize high-dimensional data on the plane using a new approach, along with an algorithm that allows applying our method to larger data sets. We also present some results on a benchmark data set.
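
For reference, conventional distance-preserving MDS is available in base R via cmdscale(); the sketch below is classical MDS on simulated data, not the two new methods proposed by the authors, and it makes the memory cost visible in the full pairwise distance matrix.

    set.seed(4)
    X <- matrix(rnorm(150 * 10), nrow = 150)   # placeholder high-dimensional data

    ## The full pairwise distance matrix is the memory cost that limits
    ## conventional MDS on large data sets
    d <- dist(X)

    ## Embed into the plane and plot
    xy <- cmdscale(d, k = 2)
    plot(xy, pch = 19, xlab = "MDS 1", ylab = "MDS 2",
         main = "Classical MDS projection of 10-dimensional data")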


Author(s):  
Lixin Fu

In high-dimensional data sets, both the number of dimensions and the cardinalities of the dimensions are large, and the data are often very sparse, that is, most cubes are empty. For such large data sets, computing the aggregation of a measure over arbitrary combinations of dimensions efficiently is a well-known and challenging problem. In real-world applications, however, users are usually not interested in all the sparse cubes, most of which are empty or contain only one or a few tuples. Instead, they focus on the “big picture”: the highly aggregated data, where the “where” clauses of the SQL queries involve only a few dimensions. Although the input data set is sparse, this aggregate data is dense. The existing multi-pass, full-cube computation algorithms are prohibitively slow for this type of application involving very large input data sets. We propose a new dynamic data structure called the Restricted Sparse Statistics Tree (RSST) and a novel cube evaluation algorithm, which are especially well suited for efficiently computing dense sub-cubes embedded in high-dimensional sparse data sets. RSST only computes the aggregations of non-empty cube cells where the number of non-star coordinates (i.e., the number of group-by attributes) is restricted to be no more than a user-specified threshold. Our algorithms are scalable and I/O efficient. RSST is incrementally maintainable, which makes it suitable for data warehousing and the analysis of streaming data. We have compared our algorithms with state-of-the-art cube computation algorithms such as Dwarf and QCT in terms of construction time, query response time, and data compression. Experiments demonstrate the excellent performance and good scalability of our approach.
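
The query class that RSST targets, group-bys restricted to at most a user-specified number of dimensions, can be illustrated naively in base R. This sketch shows the restricted-cube idea on toy data; it is not the RSST data structure or algorithm.

    set.seed(5)
    sales <- data.frame(
      region  = sample(c("N", "S", "E", "W"), 1000, replace = TRUE),
      product = sample(paste0("p", 1:20), 1000, replace = TRUE),
      channel = sample(c("web", "store"), 1000, replace = TRUE),
      amount  = runif(1000, 1, 100)
    )

    dims     <- c("region", "product", "channel")
    max_dims <- 2   # user-specified limit on the number of group-by attributes

    ## Aggregate the measure over every combination of at most max_dims dimensions
    restricted_cube <- list()
    for (k in 1:max_dims) {
      for (cols in combn(dims, k, simplify = FALSE)) {
        key <- paste(cols, collapse = "+")
        restricted_cube[[key]] <- aggregate(sales["amount"], by = sales[cols], FUN = sum)
      }
    }

    head(restricted_cube[["region+channel"]])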


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Yuri Amorim Coutinho ◽  
Nico Vervliet ◽  
Lieven De Lathauwer ◽  
Nele Moelans

Abstract Multicomponent alloys show intricate microstructure evolution, providing materials engineers with a nearly inexhaustible variety of solutions to enhance material properties. Multicomponent microstructure evolution simulations are indispensable to exploit these opportunities. These simulations, however, require the handling of high-dimensional and prohibitively large data sets of thermodynamic quantities, whose size grows exponentially with the number of elements in the alloy, making it virtually impossible to handle the effects of four or more elements. In this paper, we introduce the use of tensor completion for high-dimensional data sets in materials science as a general and elegant solution to this problem. We show that we can obtain an accurate representation of the composition dependence of high-dimensional thermodynamic quantities, and that the decomposed tensor representation can be evaluated very efficiently in microstructure simulations. This realization enables true multicomponent thermodynamic and microstructure modeling for alloy design.
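
Why a decomposed tensor representation is cheap to evaluate inside a simulation can be sketched in base R: an entry of a rank-R tensor in canonical (CP) format is a sum over R of products of factor-matrix entries. The factors below are random placeholders, not fitted thermodynamic data, and the fitting/completion step is omitted.

    set.seed(6)
    n <- 50; R <- 4                     # grid points per composition axis, CP rank
    A <- matrix(runif(n * R), n, R)     # factor matrix for axis 1
    B <- matrix(runif(n * R), n, R)     # factor matrix for axis 2
    C <- matrix(runif(n * R), n, R)     # factor matrix for axis 3

    ## T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]  -- O(R) work per lookup
    eval_cp <- function(i, j, k) sum(A[i, ] * B[j, ] * C[k, ])

    eval_cp(3, 14, 27)

    ## Storage: 3 * n * R = 600 numbers versus n^3 = 125,000 for the full grid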


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Nada A. Alqahtani ◽  
Zakiah I. Kalantan

Data scientists use various machine learning algorithms to discover patterns in large data sets that can lead to actionable insights. In general, high-dimensional data are reduced by obtaining a set of principal components so as to highlight similarities and differences. In this work, we model the reduced data with a bivariate Gaussian mixture model. We discuss a heuristic for detecting important components by choosing the initial values of the location parameters using two different techniques: cluster means obtained from k-means or hierarchical clustering, and the default values of the “mixtools” R package. The parameters of the model are obtained via an expectation-maximization algorithm. Model-selection criteria from a Bayesian point of view are evaluated for both techniques, demonstrating that both are computationally efficient. The effectiveness of the discussed techniques is demonstrated through a simulation study and using real data sets from different fields.
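
The workflow just described can be sketched as follows, assuming the mvnormalmixEM() interface from the mixtools documentation; the data are simulated, not from the study.

    library(mixtools)

    set.seed(7)
    X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
               matrix(rnorm(200, mean = 3), ncol = 2))   # two overlapping groups

    ## Step 1: reduce with PCA (here the data are already bivariate)
    pcs <- prcomp(X, scale. = TRUE)$x[, 1:2]

    ## Step 2: initial location parameters from k-means cluster means
    km <- kmeans(pcs, centers = 2)
    mu_init <- lapply(seq_len(nrow(km$centers)), function(r) km$centers[r, ])

    ## Step 3: EM for the bivariate Gaussian mixture
    fit <- mvnormalmixEM(pcs, mu = mu_init, k = 2)
    fit$lambda   # mixing proportions
    fit$mu       # estimated component means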

