A learned embedding for efficient joint analysis of millions of mass spectra

2018
Author(s):  
Damon H. May ◽  
Jeffrey Bilmes ◽  
William S. Noble

Abstract: Despite an explosion of data in public repositories, peptide mass spectra are usually analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Others have jointly analyzed many mass spectra, often using clustering. However, mass spectra are not necessarily best summarized as clusters, and although new spectra can be added to existing clusters, clustering methods previously applied to mass spectra do not allow new clusters to be defined without completely re-clustering. As an alternative, we propose to train a deep neural network, called “GLEAMS,” to learn an embedding of spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another. We demonstrate empirically the utility of this learned embedding by propagating annotations from labeled to unlabeled spectra. We further use GLEAMS to detect groups of unidentified, proximal spectra representing the same peptide, and we show how to use these spectral communities to reveal misidentified spectra and to characterize frequently observed but consistently unidentified molecular species. We provide a software implementation of our approach, along with a tool to quickly embed additional spectra using a pre-trained model, to facilitate large-scale analyses.
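
A minimal sketch of how such annotation propagation might look in practice, assuming pre-computed embeddings and identifications; the file names, the 32-dimensional embedding size, and the confidence cutoff are illustrative stand-ins, not part of GLEAMS itself:

    # Hypothetical sketch: propagating peptide labels from identified to
    # unidentified spectra via nearest neighbors in a learned embedding space.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    labeled_emb = np.load("labeled_embeddings.npy")      # (n_labeled, 32), assumed file
    labels = np.load("peptide_labels.npy")               # one peptide label per spectrum
    unlabeled_emb = np.load("unlabeled_embeddings.npy")  # spectra to annotate

    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(labeled_emb, labels)

    # Propagate annotations only when the neighborhood votes confidently.
    proba = knn.predict_proba(unlabeled_emb)
    confident = proba.max(axis=1) >= 0.8                 # illustrative cutoff
    propagated = knn.predict(unlabeled_emb[confident])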

2017
Author(s):  
Peiran Gao ◽  
Eric Trautmann ◽  
Byron Yu ◽  
Gopal Santhanam ◽  
Stephen Ryu ◽  
...  

Abstract: In many experiments, neuroscientists tightly control behavior, record many trials, and obtain trial-averaged firing rates from hundreds of neurons in circuits containing billions of behaviorally relevant neurons. Dimensionality reduction methods reveal a striking simplicity underlying such multi-neuronal data: they can be reduced to a low-dimensional space, and the resulting neural trajectories in this space yield a remarkably insightful dynamical portrait of circuit computation. This simplicity raises profound and timely conceptual questions. What are its origins and its implications for the complexity of neural dynamics? How would the situation change if we recorded more neurons? When, if at all, can we trust dynamical portraits obtained from measuring an infinitesimal fraction of task-relevant neurons? We present a theory that answers these questions, and test it using physiological recordings from monkeys performing a reaching task. This theory reveals conceptual insights into how task complexity governs both neural dimensionality and accurate recovery of dynamic portraits, thereby providing quantitative guidelines for future large-scale experimental design.
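
As a concrete illustration of the kind of dimensionality reduction the abstract refers to (not the paper's theory itself), one might estimate neural dimensionality from trial-averaged rates with PCA; the array shape and the 90% variance threshold below are assumptions:

    # Illustrative sketch: low-dimensional neural trajectories via PCA.
    # `rates` is assumed to be a (n_neurons, n_timepoints) array of
    # trial-averaged firing rates.
    import numpy as np
    from sklearn.decomposition import PCA

    rates = np.load("trial_averaged_rates.npy")  # hypothetical file
    pca = PCA().fit(rates.T)                     # samples = timepoints

    # Dimensionality as the number of components capturing 90% of variance.
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    dim = int(np.searchsorted(cum_var, 0.90) + 1)

    # Low-dimensional neural trajectory for a dynamical portrait.
    trajectory = PCA(n_components=3).fit_transform(rates.T)  # (n_timepoints, 3)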


Author(s):  
Andrew Brock ◽  
Theodore Lim ◽  
J. M. Ritchie ◽  
Nick Weston

Large-scale scene generation is computationally intensive, and additional complexity arises when dynamic content generation is required. We propose a system capable of generating virtual content from non-expert input. The proposed system uses a 3-dimensional variational autoencoder to interactively generate new virtual objects by interpolating between extant objects in a learned low-dimensional space, as well as by randomly sampling in that space. We present an interface that allows a user to intuitively explore the latent manifold, taking advantage of the network’s ability to perform algebra in the latent space to help infer context and generalize to previously unseen inputs.
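
A hedged sketch of the latent-space operations such an interface relies on; the encoder/decoder, the 64-dimensional latent space, and the furniture example are hypothetical:

    # Latent interpolation and latent algebra, assuming hypothetical
    # `encoder`/`decoder` networks from a trained 3D VAE.
    import numpy as np

    def interpolate(z_a, z_b, steps=8):
        """Linearly interpolate between two latent codes to morph objects."""
        ts = np.linspace(0.0, 1.0, steps)[:, None]
        return (1 - ts) * z_a + ts * z_b          # (steps, latent_dim)

    def latent_algebra(z_armchair, z_chair, z_stool):
        """Vector arithmetic: transfer the 'arms' attribute onto a stool."""
        return z_stool + (z_armchair - z_chair)

    z_new = np.random.randn(1, 64)                # random sample in latent space
    # new_voxels = decoder.predict(z_new)         # decode to a 3D voxel grid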


2016
Author(s):  
Tsvi Tlusty ◽  
Albert Libchaber ◽  
Jean-Pierre Eckmann

How DNA is mapped to functional proteins is a basic question of living matter. We introduce and study a physical model of protein evolution which suggests a mechanical basis for this map. Many proteins rely on large-scale motion to function. We therefore treat the protein as learning amorphous matter that evolves towards such a mechanical function: genes are binary sequences that encode the connectivity of the amino acid network that makes a protein. The gene is evolved until the network forms a shear band across the protein, which allows for the long-range, soft modes required for protein function. The evolution reduces the high-dimensional sequence space to a low-dimensional space of mechanical modes, in accord with the observed dimensional reduction between genotype and phenotype of proteins. Spectral analysis of the space of 10^6 solutions shows a strong correspondence between localization around the shear band of both mechanical modes and the sequence structure. Specifically, our model shows how mutations are correlated among amino acids whose interactions determine the functional mode.
PACS numbers: 87.14.E-, 87.15.-v, 87.10.-e
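
A schematic mutate-and-select loop in the spirit of this model might look as follows; the fitness function is an explicitly labeled placeholder, since the paper's actual score depends on the mechanical response of the encoded amino acid network:

    # Schematic sketch: a binary gene encodes network connectivity, and
    # evolution favors genes whose networks support a soft shear mode.
    import numpy as np

    rng = np.random.default_rng(0)
    gene = rng.integers(0, 2, size=200)           # binary sequence

    def soft_mode_fitness(gene):
        """Placeholder: should score how close the encoded network is to
        forming a shear band with a long-range, soft mechanical mode."""
        return gene.mean()                        # stand-in objective only

    for generation in range(10_000):
        mutant = gene.copy()
        flip = rng.integers(len(gene))            # single point mutation
        mutant[flip] ^= 1
        if soft_mode_fitness(mutant) >= soft_mode_fitness(gene):
            gene = mutant                         # accept neutral/beneficial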


2020
Vol 2020
pp. 1-13
Author(s):  
Yunfang Chen ◽  
Li Wang ◽  
Dehao Qi ◽  
Tinghuai Ma ◽  
Wei Zhang

The large-scale and complex structure of real networks poses enormous challenges to traditional community detection methods. To detect community structure in large-scale networks more accurately and efficiently, we propose a community detection algorithm based on a network embedding representation. First, to address the sparsity of network data, we use the DeepWalk model to embed the high-dimensional network into a low-dimensional space that preserves topological information. Then, the low-dimensional data are processed, with each node treated as a sample and each dimension of the node vector as a feature. Finally, the samples are fed into a Gaussian mixture model (GMM), into which variational inference is introduced so that the number of communities is learned automatically. Experimental results on the DBLP dataset show that our method discovers communities in large-scale networks more effectively. Further analysis of the detected community structure reveals the organizational characteristics within communities.
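
A hedged end-to-end sketch of this pipeline, using networkx, gensim, and scikit-learn as stand-ins (the toy graph and all hyperparameters are illustrative); scikit-learn's BayesianGaussianMixture implements the variational inference that prunes unneeded mixture components:

    # DeepWalk-style embedding followed by a variational Gaussian mixture.
    import networkx as nx
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.mixture import BayesianGaussianMixture

    G = nx.karate_club_graph()                    # toy network stand-in
    rng = np.random.default_rng(42)

    def random_walks(G, num_walks=10, walk_len=40):
        walks = []
        for _ in range(num_walks):
            for start in G.nodes:
                walk, node = [start], start
                for _ in range(walk_len - 1):
                    node = rng.choice(list(G.neighbors(node)))
                    walk.append(node)
                walks.append([str(n) for n in walk])
        return walks

    # Skip-gram over walks yields low-dimensional node embeddings.
    model = Word2Vec(random_walks(G), vector_size=16, window=5, min_count=0, sg=1)
    X = np.array([model.wv[str(n)] for n in G.nodes])

    # A Dirichlet-process prior lets variational inference shrink the
    # weights of unneeded components, learning the number of communities.
    gmm = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500).fit(X)
    communities = gmm.predict(X)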


2018
Author(s):  
Jacqueline B. Hynes ◽  
David M. Brandman ◽  
Jonas B. Zimmerman ◽  
John P. Donoghue ◽  
Carlos E. Vargas-Irwin

Abstract: Recent technological advances have made it possible to simultaneously record the activity of thousands of individual neurons in the cortex of awake, behaving animals. However, the comparatively slower development of analytical tools capable of handling the scale and complexity of large-scale recordings is a growing problem for the field of neuroscience. We present the Similarity Networks (SIMNETS) algorithm: a computationally efficient and scalable method for identifying and visualizing sub-networks of functionally similar neurons within larger simultaneously recorded ensembles. While traditional approaches tend to group neurons according to the statistical similarities of inter-neuron spike patterns, our approach begins by mathematically capturing the intrinsic relationship between the spike train outputs of each neuron across experimental conditions, before any comparisons are made between neurons. This strategy estimates the intrinsic geometry of each neuron’s output space, allowing us to capture the information processing properties of each neuron in a common format that is easily compared between neurons. Dimensionality reduction tools are then used to map high-dimensional neuron similarity vectors into a low-dimensional space where functional groupings are identified using clustering and statistical techniques. SIMNETS makes minimal assumptions about single-neuron encoding properties, is efficient enough to run on consumer-grade hardware (100 neurons in < 4 s run-time), and has a computational complexity that scales near-linearly with neuron number. These properties make SIMNETS well suited for examining large networks of neurons during complex behaviors. We validate the ability of our approach to detect statistically and physiologically meaningful functional groupings in a population of synthetic neurons with known ground truth, as well as in three publicly available datasets of ensemble recordings from primate primary visual and motor cortex and the rat hippocampal CA1 region.
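
A much-simplified sketch of the underlying idea, characterizing each neuron by the similarity structure of its own responses before comparing neurons; the real method uses spike train metrics rather than the plain spike counts assumed here, and all sizes are illustrative:

    # Per-neuron similarity profiles, then dimensionality reduction and
    # clustering over neurons. `spike_counts` stands in for spike trains.
    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.cluster import KMeans

    spike_counts = np.random.poisson(5.0, size=(100, 200))  # (neurons, trials)

    # Each neuron's trial-by-trial similarity matrix, flattened to a vector.
    profiles = np.array([
        -np.abs(counts[:, None] - counts[None, :]).ravel()
        for counts in spike_counts
    ])

    # Distance between neurons = distance between their similarity profiles.
    low_d = MDS(n_components=3).fit_transform(profiles)
    groups = KMeans(n_clusters=4, n_init=10).fit_predict(low_d)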


2018
Author(s):  
Qiwen Hu ◽  
Casey S. Greene

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of large numbers of individual cells at high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low-dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAEs), generative models that learn the underlying structure of data by compressing it into a constrained, low-dimensional space. The low-dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features: for example, deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning: when tuned, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches (PCA, ZIFA, UMAP and t-SNE). On the other hand, without tuning, performance can be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties, or by method developers only on unseen benchmark data provided to all participants simultaneously, because the potential for performance differences due to unequal parameter tuning is so high.
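
For orientation, a minimal PyTorch sketch of the kind of VAE under evaluation (not Tybalt's exact architecture); the layer sizes and latent dimension are illustrative hyperparameters of exactly the sort the results show performance is sensitive to:

    # Minimal VAE: encoder compresses expression profiles into a
    # low-dimensional latent space; decoder reconstructs them.
    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, n_genes, latent_dim=32, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)
            self.logvar = nn.Linear(hidden, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_genes))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterization trick: sample z while keeping gradients.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.decoder(z), mu, logvar

    def loss_fn(x, recon, mu, logvar):
        recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl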


2005
Vol 23
pp. 1-40
Author(s):  
N. Roy ◽  
G. Gordon ◽  
S. Thrun

Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy over the entire belief space. However, in real-world POMDP problems, computing the optimal policy for the full belief space is often unnecessary for good control even for problems with complicated policy classes. The beliefs experienced by the controller often lie near a structured, low-dimensional subspace embedded in the high-dimensional belief space. Finding a good approximation to the optimal value function for only this subspace can be much easier than computing the full value function. We introduce a new method for solving large-scale POMDPs by reducing the dimensionality of the belief space. We use Exponential family Principal Components Analysis (Collins, Dasgupta & Schapire, 2002) to represent sparse, high-dimensional belief spaces using small sets of learned features of the belief state. We then plan only in terms of the low-dimensional belief features. By planning in this low-dimensional space, we can find policies for POMDP models that are orders of magnitude larger than models that can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and on mobile robot navigation tasks.
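
A hedged numpy sketch of the core idea: low-rank factorization of belief vectors under an exponential link, so reconstructions stay nonnegative. The dimensions, learning rate, and iteration count are illustrative, and the original work uses a more careful optimization than this plain gradient descent:

    # Exponential-family PCA sketch: beliefs B are approximated as
    # exp(U @ V), minimizing sum(exp(UV) - B * UV) over U and V.
    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.dirichlet(np.full(50, 0.1), size=500).T  # (n_states, n_beliefs), sparse-ish
    k = 5                                            # belief-feature dimension
    U = 0.01 * rng.standard_normal((B.shape[0], k))
    V = 0.01 * rng.standard_normal((k, B.shape[1]))

    lr = 1e-3
    for step in range(2000):
        R = np.exp(U @ V)                            # reconstructed beliefs
        G = R - B                                    # gradient of loss w.r.t. U @ V
        U -= lr * (G @ V.T)
        V -= lr * (U.T @ G)

    # Planning then operates on the k-dimensional features V[:, i]
    # instead of the raw high-dimensional beliefs.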


Author(s):  
Lukas Miklautz ◽  
Lena G. M. Bauer ◽  
Dominik Mautz ◽  
Sebastian Tschiatschek ◽  
Christian Böhm ◽  
...  

Deep clustering techniques combine representation learning with clustering objectives to improve their performance. Among existing deep clustering techniques, autoencoder-based methods are the most prevalent. While they achieve promising clustering results, they suffer from an inherent conflict between preserving detail, as expressed by the reconstruction loss, and finding similar groups by ignoring detail, as expressed by the clustering loss. This conflict leads to brittle training procedures, dependence on trade-off hyperparameters and less interpretable results. We propose our framework, ACe/DeC, which is compatible with Autoencoder Centroid based Deep Clustering methods and automatically learns a latent representation consisting of two separate spaces. The clustering space captures all cluster-specific information and the shared space explains general variation in the data. This separation resolves the above-mentioned conflict and allows our method to learn both detailed reconstructions and cluster-specific abstractions. We evaluate our framework with extensive experiments that show several benefits: (1) cluster performance: on various data sets we outperform relevant baselines; (2) no hyperparameter tuning: this improved performance is achieved without introducing new clustering-specific hyperparameters; (3) interpretability: isolating the cluster-specific information in a separate space is advantageous for data exploration and interpreting the clustering results; and (4) dimensionality of the embedded space: we automatically learn a low-dimensional space for clustering. Our ACe/DeC framework isolates cluster information and increases stability and interpretability, while improving cluster performance.
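
A conceptual PyTorch sketch of the two-space idea only (not the ACe/DeC training procedure): the latent code is split so that centroids live exclusively in the clustering space while reconstruction uses both spaces; all sizes are assumptions:

    # Autoencoder whose latent code splits into a clustering space and a
    # shared space; only the clustering half feeds the centroids.
    import torch
    import torch.nn as nn

    class SplitAE(nn.Module):
        def __init__(self, dim_in, dim_cluster=10, dim_shared=22, n_clusters=10):
            super().__init__()
            latent = dim_cluster + dim_shared
            self.enc = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                     nn.Linear(256, latent))
            self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, dim_in))
            self.centroids = nn.Parameter(torch.randn(n_clusters, dim_cluster))
            self.dim_cluster = dim_cluster

        def forward(self, x):
            z = self.enc(x)
            z_cluster = z[:, :self.dim_cluster]   # cluster-specific information
            recon = self.dec(z)                   # reconstruction uses both spaces
            # Distances to centroids are computed in the clustering space only.
            dists = torch.cdist(z_cluster, self.centroids)
            return recon, dists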


2019
Vol 21 (2)
pp. 541-552
Author(s):  
Cécile Chauvel ◽  
Alexei Novoloaca ◽  
Pierre Veyre ◽  
Frédéric Reynier ◽  
Jérémie Becker

Abstract: Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect large-scale omics data from the same set of biological samples. The joint analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omic layers. In this work, we present a thorough comparison of a selection of recent integrative clustering approaches, including Bayesian (BCC and MDI) and matrix factorization approaches (iCluster, moCluster, JIVE and iNMF). Based on simulations, the methods were evaluated on their sensitivity and on their ability to recover both the correct number of clusters and the simulated clustering at the common and data-specific levels. Standard non-integrative approaches were also included to quantify the added value of integrative methods. For most matrix factorization methods and one Bayesian approach (BCC), the shared and specific structures were successfully recovered with high and moderate accuracy, respectively. The opposite behavior was observed for non-integrative approaches, i.e., high performance on specific structures only. Finally, we applied the methods to the Cancer Genome Atlas breast cancer data set to check whether results based on experimental data were consistent with those obtained in the simulations.
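
The evaluation protocol can be illustrated with a few lines of scikit-learn; the label vectors below are hypothetical stand-ins for a method's output on simulated data:

    # Adjusted Rand index between a method's clustering and the simulated
    # ground truth, at both the shared and the data-specific level.
    from sklearn.metrics import adjusted_rand_score

    true_shared = [0, 0, 1, 1, 2, 2]      # simulated common structure
    true_specific = [0, 1, 0, 1, 0, 1]    # simulated omic-specific structure
    predicted = [0, 0, 1, 1, 2, 1]        # clustering from some method

    print("shared ARI:  ", adjusted_rand_score(true_shared, predicted))
    print("specific ARI:", adjusted_rand_score(true_specific, predicted))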


2021
Author(s):  
Wout Bittremieux ◽  
Kris Laukens ◽  
William Stafford Noble ◽  
Pieter C. Dorrestein

Abstract
Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. Here we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.
Methods: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively compare all spectra to each other. Finally, density-based clustering is performed to group similar spectra into clusters.
Results: Using a large draft human proteome dataset consisting of 25 million spectra, falcon generates clusters of a quality similar to MS-Cluster and spectra-cluster, two widely used clustering tools, while being considerably faster. Notably, at comparable cluster quality levels, falcon generates larger clusters than alternative tools, leading to a larger reduction in data volume without loss of relevant information, for more efficient downstream processing.
Conclusions: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
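
A hedged sketch of the described pipeline using scikit-learn stand-ins; falcon itself uses specialized approximate nearest neighbor indexing, so the toy spectra, hash width, and DBSCAN parameters here are purely illustrative:

    # Binning + feature hashing -> nearest-neighbor index -> sparse
    # distance matrix -> density-based clustering.
    import numpy as np
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    # Each spectrum as {binned m/z -> intensity}; toy data for illustration.
    spectra = [{f"mz_{int(mz)}": inten
                for mz, inten in zip(np.random.uniform(100, 1500, 50),
                                     np.random.rand(50))}
               for _ in range(1000)]

    # Feature hashing maps variable-size peak dicts to fixed low-dim vectors.
    X = FeatureHasher(n_features=400).transform(spectra).toarray()
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # Sparse pairwise distances from a nearest-neighbor index, then DBSCAN.
    nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(X)
    dist = nn.kneighbors_graph(mode="distance")
    clusters = DBSCAN(eps=0.2, min_samples=2,
                      metric="precomputed").fit_predict(dist)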

