Deciphering protein evolution and fitness landscapes with latent space models

2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Xinqiang Ding ◽  
Zhengting Zou ◽  
Charles L. Brooks III

Protein sequences contain rich information about protein evolution, fitness landscapes, and stability. Here we investigate how latent space models trained using variational auto-encoders can infer these properties from sequences. Using both simulated and real sequences, we show that the low-dimensional latent space representation of sequences, calculated using the encoder model, captures both evolutionary and ancestral relationships between sequences. Together with experimental fitness data and Gaussian process regression, the latent space representation also enables learning the protein fitness landscape in a continuous low-dimensional space. Moreover, the model is also useful in predicting protein mutational stability landscapes and quantifying the importance of stability in shaping protein evolution. Overall, we illustrate that the latent space models learned using variational auto-encoders provide a mechanism for exploration of the rich data contained in protein sequences regarding evolution, fitness and stability and hence are well-suited to help guide protein engineering efforts.
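
The combination of latent coordinates and Gaussian process regression described above can be illustrated with a minimal sketch. It is not the authors' implementation: the RBF kernel, the zero-mean prior, the function names, and the toy latent coordinates and fitness values are all assumptions chosen for illustration, and a real workflow would take the coordinates from a trained encoder.

```python
import math

def rbf_kernel(x, y, length_scale=1.0):
    """Squared-exponential kernel between two latent vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def gp_posterior_mean(train_z, train_fitness, query_z, noise=1e-6, length_scale=1.0):
    """Posterior mean of a zero-mean GP: k(q, Z) @ (K + noise * I)^-1 @ y."""
    n = len(train_z)
    gram = [[rbf_kernel(train_z[i], train_z[j], length_scale) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    weights = solve_linear_system(gram, train_fitness)
    return [sum(w * rbf_kernel(q, z, length_scale) for w, z in zip(weights, train_z))
            for q in query_z]

# Hypothetical 2-D latent coordinates and fitness values (placeholders, not data).
latent_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fitness = [0.1, 0.8, 0.4, 1.0]
predicted = gp_posterior_mean(latent_coords, fitness, [(0.5, 0.5)])
```

Because the regression runs in the continuous latent space, any point there can be queried, including points that correspond to no measured sequence. With a zero-mean prior, predictions far from all training points fall back towards zero.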

2021 ◽  
Vol 11 (3) ◽  
pp. 1013
Author(s):  
Zvezdan Lončarević ◽  
Rok Pahič ◽  
Aleš Ude ◽  
Andrej Gams

Autonomous robot learning in unstructured environments often faces the problem that the dimensionality of the search space is too large for practical applications. Dimensionality reduction techniques have been developed to address this problem and describe motor skills in low-dimensional latent spaces. Most of these techniques require the availability of a sufficiently large database of example task executions to compute the latent space. However, the generation of many example task executions on a real robot is tedious, and prone to errors and equipment failures. The main result of this paper is a new approach for efficient database gathering by performing a small number of task executions with a real robot and applying statistical generalization, e.g., Gaussian process regression, to generate more data. We have shown in our experiments that the data generated this way can be used for dimensionality reduction with autoencoder neural networks. The resulting latent spaces can be exploited to implement robot learning more efficiently. The proposed approach has been evaluated on the problem of robotic throwing at a target. Simulation and real-world results with a humanoid robot TALOS are provided. They confirm the effectiveness of generalization-based database acquisition and the efficiency of learning in a low-dimensional latent space.


2021 ◽  
Vol 15 ◽  
pp. 174830262110249
Author(s):  
Cong-Zhe You ◽  
Zhen-Qiu Shu ◽  
Hong-Hui Fan

Subspace clustering of multi-view data has recently become a research hotspot in artificial intelligence and machine learning. The goal is to divide data samples from different sources into different groups. In this paper, we propose a new subspace clustering method for multi-view data, termed Non-negative Sparse Laplacian regularized Latent Multi-view Subspace Clustering (NSL2MSC). The proposed method learns the latent space representation of multi-view data samples and performs data reconstruction in the latent space. The algorithm can cluster data in the latent representation space and exploit the relationships between different views. However, traditional representation-based methods do not consider the non-linear geometry inside the data, and may lose local and similarity information between data points during learning. By using graph regularization, we can capture not only the global low-dimensional structural features of the data but also its nonlinear geometric structure. The experimental results show that the proposed method is effective and that it outperforms most existing alternatives.


Author(s):  
Andrew Brock ◽  
Theodore Lim ◽  
J. M. Ritchie ◽  
Nick Weston

Large scale scene generation is a computationally intensive operation, and added complexities arise when dynamic content generation is required. We propose a system capable of generating virtual content from non-expert input. The proposed system uses a 3-dimensional variational autoencoder to interactively generate new virtual objects by interpolating between extant objects in a learned low-dimensional space, as well as by randomly sampling in that space. We present an interface that allows a user to intuitively explore the latent manifold, taking advantage of the network’s ability to perform algebra in the latent space to help infer context and generalize to previously unseen inputs.
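
Interpolating between extant objects in the learned space can be sketched as a straight-line walk between two latent codes. This is an illustrative sketch only: the function name and the toy two-dimensional codes are assumptions, and turning each intermediate code into an object would require the trained decoder, which is not reproduced here.

```python
def interpolate_latent(z_start, z_end, steps):
    """Return `steps` evenly spaced latent codes from z_start to z_end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    codes = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, from 0.0 to 1.0
        codes.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes

# Hypothetical latent codes of two existing objects.
path = interpolate_latent([0.0, 0.0], [1.0, 2.0], steps=3)
```

Each code along `path` would be passed to the decoder to produce an intermediate object.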


eLife ◽  
2016 ◽  
Vol 5 ◽  
Author(s):  
Nicholas C Wu ◽  
Lei Dai ◽  
C Anders Olson ◽  
James O Lloyd-Smith ◽  
Ren Sun

The structure of fitness landscapes is critical for understanding adaptive protein evolution. Previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, the dimensionality of protein sequence space is higher (20^L) and there may be higher-order interactions among more than two sites. Here we experimentally characterized the fitness landscape of four sites in protein GB1, containing 20^4 = 160,000 variants. We found that while reciprocal sign epistasis blocked many direct paths of adaptation, such evolutionary traps could be circumvented by indirect paths through genotype space involving gain and subsequent loss of mutations. These indirect paths alleviate the constraint on adaptive protein evolution, suggesting that the heretofore neglected dimensions of sequence space may change our views on how proteins evolve.
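
The size of the sequence space and the notion of reciprocal sign epistasis can be made concrete with a short sketch. The function names and fitness values below are illustrative assumptions, not the study's code or data. The epistasis test uses the standard definition: each mutation's fitness effect changes sign depending on whether the other mutation is present.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def sequence_space_size(num_sites, alphabet_size=20):
    """Number of distinct variants when every site can hold any residue."""
    return alphabet_size ** num_sites

def enumerate_variants(num_sites):
    """Lazily yield every variant of the given length."""
    return ("".join(residues) for residues in product(AMINO_ACIDS, repeat=num_sites))

def is_reciprocal_sign_epistasis(f_wt, f_a, f_b, f_ab):
    """True when the effect of each mutation flips sign on the other's background."""
    a_flips = (f_a - f_wt) * (f_ab - f_b) < 0
    b_flips = (f_b - f_wt) * (f_ab - f_a) < 0
    return a_flips and b_flips

four_site_variants = sequence_space_size(4)  # 20**4 = 160000

# Hypothetical fitness values: both single mutants are worse than wild type,
# yet the double mutant is better, so the direct two-step paths are blocked.
blocked = is_reciprocal_sign_epistasis(f_wt=1.0, f_a=0.5, f_b=0.6, f_ab=2.0)
```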


2021 ◽  
Author(s):  
Arnaud Mounier ◽  
Laure Raynaud ◽  
Lucie Rottner ◽  
Matthieu Plu

Using ensemble prediction systems (EPS) is challenging because of the large amount of information they provide. EPS forecasts are often summarised by statistical quantities (i.e., quantile maps). Although such a mathematical representation is efficient for capturing the ensemble distribution, it lacks physical consistency, which raises issues for many applications of EPS in an operational context. In order to provide a physically consistent synthesis of the French convection-permitting AROME-EPS forecasts, we propose to automatically draw a few scenarios that are representative of the different possible outcomes. Each scenario is a reduced set of EPS members. The design of a scenario synthesis can be divided into two parts. A first step extracts relevant features from each EPS member in order to reduce the problem dimensionality. Then, a clustering is performed on these features. The originality of our work is to leverage the capacities of deep learning for feature extraction. For that purpose, we use a convolutional autoencoder (CAE) to learn an optimal low-dimensional representation (also called latent space representation) of the input forecast field. In this work, the algorithm is developed to work on 1h-accumulated rainfall from AROME-EPS, with a focus on convective cases. The CAE is trained on around 150,000 forecasts and its performance is evaluated based on the quality of the input fields reconstructed from the latent space. To examine the reconstruction quality, an object-oriented approach is used. The CAE is also compared with the commonly used principal component analysis (PCA). In a second part, different clustering methods (k-means, HDBSCAN, …) are applied to EPS members in the latent space and evaluated using subjective and objective diagnostics.
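
The second step, clustering ensemble members in the latent space, can be sketched with a plain k-means loop. This is a minimal illustration under stated assumptions, not the authors' pipeline: the latent vectors are made-up two-dimensional placeholders standing in for encoder outputs, and the deterministic initialisation (first k points) is chosen only to keep the example reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical latent representations of six ensemble members.
latent_members = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(latent_members, k=2)
```

Members sharing a label would form one scenario, that is, one reduced set of EPS members.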


2016 ◽  
Author(s):  
Nicholas C. Wu ◽  
Lei Dai ◽  
C. Anders Olson ◽  
James O. Lloyd-Smith ◽  
Ren Sun

The structure of fitness landscapes is critical for understanding adaptive protein evolution (e.g. antimicrobial resistance, affinity maturation, etc.). Due to limited throughput in fitness measurements, previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, however, the dimensionality of protein sequence space is higher (20^L, L being the length of the relevant sequence) and there may be higher-order interactions among more than two sites. To study how these features impact the course of protein evolution, we experimentally characterized the fitness landscape of four sites in the IgG-binding domain of protein G, containing 20^4 = 160,000 variants. We found that the fitness landscape was rugged and direct paths of adaptation were often constrained by pairwise epistasis. However, while direct paths were blocked by reciprocal sign epistasis, we found systematic evidence that such evolutionary traps could be circumvented by "extra-dimensional bypass". Extra dimensions in sequence space - with a different amino acid at the site of interest or an additional interacting site - open up indirect paths of adaptation via gain and subsequent loss of mutations. These indirect paths alleviate the constraint on reaching high fitness genotypes via selectively accessible trajectories, suggesting that the heretofore neglected dimensions of sequence space may completely change our views on how proteins evolve.


NeuroImage ◽  
2021 ◽  
pp. 118200
Author(s):  
Sayan Ghosal ◽  
Qiang Chen ◽  
Giulio Pergola ◽  
Aaron L. Goldman ◽  
William Ulrich ◽  
...  

2021 ◽  
Vol 13 (2) ◽  
pp. 51
Author(s):  
Lili Sun ◽  
Xueyan Liu ◽  
Min Zhao ◽  
Bo Yang

Variational graph autoencoders, which can encode the structural and attribute information of a graph into low-dimensional representations, have become a powerful method for studying graph-structured data. However, most existing methods based on variational (graph) autoencoders assume that the prior of the latent variables follows the standard normal distribution, which encourages all nodes to gather around 0. This prevents the latent space from being fully utilized. Choosing a suitable prior without incorporating additional expert knowledge therefore becomes a challenge. Given this, we propose a novel noninformative prior-based interpretable variational graph autoencoder (NPIVGAE). Specifically, we use a noninformative prior as the prior distribution of the latent variables. This prior enables the posterior distribution parameters to be learned almost entirely from the sample data. Furthermore, we regard each dimension of a latent variable as the probability that the node belongs to each block, thereby improving the interpretability of the model. The correlation within and between blocks is described by a block–block correlation matrix. We compare our model with state-of-the-art methods on three real datasets, verifying its effectiveness and superiority.
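
Why a standard normal prior pulls nodes towards 0 can be seen from the usual KL regularization term. The expression below assumes, as is common for variational autoencoders but not stated in the abstract, a diagonal Gaussian posterior for each node $i$ with mean $\mu_i$ and variances $\sigma_i^2$ over $d$ latent dimensions; the symbols are illustrative.

```latex
\mathrm{KL}\!\left( \mathcal{N}\!\left(\mu_i, \operatorname{diag}(\sigma_i^2)\right) \,\middle\|\, \mathcal{N}(0, I) \right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{ij}^2 + \sigma_{ij}^2 - 1 - \log \sigma_{ij}^2 \right)
```

The $\mu_{ij}^2$ term penalizes every posterior mean in proportion to its squared distance from the origin, which is the pull towards 0 described above.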


Author(s):  
Alireza Vafaei Sadr ◽  
Bruce A. Bassett ◽  
M. Kunz

Anomaly detection is challenging, especially for large datasets in high dimensions. Here, we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. DRAMA is released as a general Python package that implements the general framework with a wide range of built-in options. This approach identifies the primary prototypes in the data with anomalies detected by their large distances from the prototypes, either in the latent space or in the original, high-dimensional space. DRAMA is tested on a wide variety of simulated and real datasets, in up to 3000 dimensions, and is found to be robust and highly competitive with commonly used anomaly detection algorithms, especially in high dimensions. The flexibility of the DRAMA framework allows for significant optimization once some examples of anomalies are available, making it ideal for online anomaly detection, active learning, and highly unbalanced datasets. Besides, DRAMA naturally provides clustering of outliers for subsequent analysis.
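
The idea of scoring anomalies by their distance from prototypes can be sketched in a few lines. This does not use or reproduce the DRAMA package's API: the function name, the Euclidean metric, and the toy prototypes and points are assumptions for illustration only.

```python
import math

def anomaly_scores(points, prototypes):
    """Score each point by its Euclidean distance to the nearest prototype."""
    return [min(math.dist(p, proto) for proto in prototypes) for p in points]

# Hypothetical prototypes (e.g. cluster centres) and points to score.
prototypes = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 10.0), (3.0, 4.0)]
scores = anomaly_scores(points, prototypes)

# Rank points from most to least anomalous.
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
```

The same scoring applies in either the latent space or the original space, depending on which coordinates are passed in.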


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4454 ◽  
Author(s):  
Marek Piorecky ◽  
Vlastimil Koudelka ◽  
Jan Strobl ◽  
Martin Brunovsky ◽  
Vladimir Krajca

Simultaneous recordings of electroencephalogram (EEG) and functional magnetic resonance imaging (fMRI) are at the forefront of technologies of interest to physicians and scientists because they combine the benefits of both modalities: better time resolution (hdEEG) and spatial resolution (fMRI). However, EEG measurements in the scanner are contaminated by the electromagnetic field induced in the leads by gradient switching, slight head movements, and vibrations, and they are further corrupted by changes in the measured potential caused by the Hall phenomenon. The aim of this study is to design and test a methodology for inspecting hidden EEG structures with respect to artifacts. We propose a top-down strategy to obtain additional information that is not visible in a single recording. The time-domain independent component analysis algorithm was employed to obtain independent components and spatial weights. A nonlinear dimension reduction technique, t-distributed stochastic neighbor embedding, was used to create a low-dimensional space, which was then partitioned using density-based spatial clustering of applications with noise (DBSCAN). The relationships between the data structure found and the criteria used were investigated. As a result, we were able to extract information from the data structure regarding electrooculographic, electrocardiographic, electromyographic and gradient artifacts. This new methodology could facilitate the identification of artifacts and their residues from simultaneous EEG in fMRI.
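
The final partitioning step can be illustrated with a compact DBSCAN sketch. It covers the clustering only, not the ICA or t-SNE stages, and it is not the authors' code: the function names, parameter values, and two-dimensional points (standing in for embedded components) are assumptions.

```python
def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[index], q)) <= eps ** 2]

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None marks an unvisited point
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical 2-D embedding: two dense groups and one isolated point.
embedded = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
            (10.0, 10.0), (10.0, 10.5), (10.5, 10.0),
            (50.0, 50.0)]
cluster_labels = dbscan(embedded, eps=1.0, min_samples=3)
```

A density-based method does not need the number of clusters in advance and can leave isolated points unassigned.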

