De Novo Protein Design for Novel Folds using Guided Conditional Wasserstein Generative Adversarial Networks (gcWGAN)

2019 ◽  
Author(s):  
Mostafa Karimi ◽  
Shaowen Zhu ◽  
Yue Cao ◽  
Yang Shen

Abstract
Motivation: Facing data quickly accumulating on protein sequence and structure, this study addresses the following question: to what extent could current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds?
Results: We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate more yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space.
Availability: Data and source codes will be available upon request.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Jeanne Trinquier ◽  
Guido Uguzzoni ◽  
Andrea Pagnani ◽  
Francesco Zamponi ◽  
Martin Weigt

Abstract Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^−80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
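The two quantities named in the abstract above, the probability of a given sequence and the model entropy, can be illustrated with a toy autoregressive model. This is a minimal sketch and not the authors' implementation: the softmax parametrization with site fields `h` and pairwise couplings `J`, the random parameters, the tiny alphabet (`q = 4`) and the short length (`L = 6`) are all assumptions chosen so the example runs quickly.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log of a softmax distribution.
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def conditional_logits(h, J, seq, i):
    # Unnormalized log-weights for site i given the residues at sites 0..i-1.
    logits = h[i].copy()
    for j in range(i):
        logits += J[i, j, :, seq[j]]
    return logits

def log_prob(h, J, seq):
    # Chain rule: log P(a_1..a_L) = sum_i log P(a_i | a_1..a_{i-1}).
    return sum(
        log_softmax(conditional_logits(h, J, seq, i))[a]
        for i, a in enumerate(seq)
    )

def sample(h, J, rng):
    # Draw one sequence site by site from the conditionals.
    L, q = h.shape
    seq = np.zeros(L, dtype=int)
    for i in range(L):
        p = np.exp(log_softmax(conditional_logits(h, J, seq, i)))
        seq[i] = rng.choice(q, p=p)
    return seq

def entropy_estimate(h, J, n_samples, rng):
    # Monte Carlo estimate of S = -E[log P(a)], in nats.
    return -np.mean([log_prob(h, J, sample(h, J, rng)) for _ in range(n_samples)])

# Hypothetical toy parameters (random, not fitted to any protein family).
rng = np.random.default_rng(0)
L, q = 6, 4
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.5, size=(L, L, q, q))

S = entropy_estimate(h, J, n_samples=500, rng=rng)
effective_size = np.exp(S)          # entropy-based size of the sequence space
fraction = effective_size / q ** L  # share of all q^L possible sequences
```

Because every conditional is normalized, the sequence probability is exact and needs no partition function, and `exp(S)` gives an effective count of sequences that can be compared with the `q ** L` possible ones, which is the kind of ratio the abstract reports for response regulators.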


2021 ◽  
Author(s):  
Jeanne Trinquier ◽  
Guido Uguzzoni ◽  
Andrea Pagnani ◽  
Francesco Zamponi ◽  
Martin Weigt

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally extremely efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost. Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Using these models, we can easily estimate both the model probability of a given sequence, and the size of the functional sequence space related to a specific protein family. In the case of response regulators, we find a huge number of ca. 10^68 sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.


2019 ◽  
Author(s):  
Donatas Repecka ◽  
Vykintas Jauniskis ◽  
Laurynas Karpus ◽  
Elzbieta Rembeza ◽  
Jan Zrimec ◽  
...  

ABSTRACT De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, mapping protein sequence to protein function is, however, neither computationally nor experimentally tangible [1,2]. Here we developed ProteinGAN, a specialised variant of the generative adversarial network [3] that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase as a template enzyme, we show that 24% of the ProteinGAN-generated and experimentally tested sequences are soluble and display wild-type level catalytic activity in the tested conditions in vitro, even in highly mutated (>100 mutations) sequences. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.


2016 ◽  
Author(s):  
Peter A. Andrews ◽  
Ivan Iossifov ◽  
Jude Kendall ◽  
Steven Marks ◽  
Lakshmi Muthuswamy ◽  
...  

Abstract
Motivation: Standard genome sequence alignment tools primarily designed to find one alignment per read have difficulty detecting inversion, translocation and large insertion and deletion (indel) events. Moreover, dedicated split read alignment methods that depend only upon the reference genome may misidentify or find too many potential split read alignments because of reference genome anomalies.
Methods: We introduce MUMdex, a Maximal Unique Match (MUM)-based genomic analysis software package consisting of a sequence aligner to the reference genome, a storage-indexing format and analysis software. Discordant reference alignments of MUMs are especially suitable for identifying inversion, translocation and large indel differences in unique regions. Extracted population databases are used as filters for flaws in the reference genome. We describe the concepts underlying MUM-based analysis, the software implementation and its usage.
Results: We demonstrate via simulation that the MUMdex aligner and alignment format are able to correctly detect and record genomic events. We characterize alignment performance and output file sizes for human whole genome data and compare to Bowtie 2 and the BAM format. Preliminary results demonstrate the practicality of the analysis approach by detecting de novo mutation candidates in human whole genome DNA sequence data from 510 families. We provide a population database of events from these families for use by others.
Availability: http://mumdex.com/
Contact: [email protected] (or [email protected])
Supplementary information: Supplementary data are available online.
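The idea that discordant alignments of maximal unique matches can reveal an indel is easiest to see on a toy example. The sketch below is a naive quadratic search, not the MUMdex aligner, and it assumes a simplified definition: a match counts as a MUM if it cannot be extended in either direction and its string occurs exactly once in the reference. The function names, the `min_len` threshold and the sequences are hypothetical.

```python
def count_occurrences(text, pattern):
    # Count occurrences of pattern in text, including overlapping ones.
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def find_mums(read, ref, min_len=4):
    # Return (read_pos, ref_pos, length) for every exact match that is
    # maximal in both directions and unique in the reference.
    mums = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Not left-maximal: this match is part of a longer one found earlier.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(read) and j + k < len(ref) and read[i + k] == ref[j + k]:
                k += 1
            if k >= min_len and count_occurrences(ref, read[i:i + k]) == 1:
                mums.append((i, j, k))
    return mums

# Toy reference and a read that skips reference bases 8..12 (a 5-base deletion).
ref = "ACGTACGGTTCAGGCTAATCG"
read = ref[:8] + ref[13:]
mums = find_mums(read, ref)

# Offset (ref_pos - read_pos) of each anchor; a change in offset between
# consecutive anchors signals a discordance such as an indel.
offsets = sorted({j - i for i, j, _ in mums})
```

In this toy case the two anchors on either side of the deletion carry offsets that differ by the deleted length, which is the kind of discordance the abstract describes. A practical aligner would use an index such as a suffix array instead of the nested loops.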


Molecules ◽  
2020 ◽  
Vol 25 (14) ◽  
pp. 3250 ◽  
Author(s):  
Eugene Lin ◽  
Chieh-Hsin Lin ◽  
Hsien-Yuan Lane

A growing body of evidence now suggests that artificial intelligence and machine learning techniques can serve as an indispensable foundation for the process of drug design and discovery. In light of latest advancements in computing technologies, deep learning algorithms are being created during the development of clinically useful drugs for treatment of a number of diseases. In this review, we focus on the latest developments for three particular arenas in drug design and discovery research using deep learning approaches, such as generative adversarial network (GAN) frameworks. Firstly, we review drug design and discovery studies that leverage various GAN techniques to assess one main application such as molecular de novo design in drug design and discovery. In addition, we describe various GAN models to fulfill the dimension reduction task of single-cell data in the preclinical stage of the drug development pipeline. Furthermore, we depict several studies in de novo peptide and protein design using GAN frameworks. Moreover, we outline the limitations in regard to the previous drug design and discovery studies using GAN models. Finally, we present a discussion of directions and challenges for future research.


2021 ◽  
Vol 2 ◽  
Author(s):  
George Tsialiamanis ◽  
David J. Wagg ◽  
Nikolaos Dervilis ◽  
Keith Worden

Abstract A framework is proposed for generative models as a basis for digital twins or mirrors of structures. The proposal is based on the premise that deterministic models cannot account for the uncertainty present in most structural modeling applications. Two different types of generative models are considered here. The first is a physics-based model based on the stochastic finite element (SFE) method, which is widely used when modeling structures that have material and loading uncertainties imposed. Such models can be calibrated according to data from the structure and would be expected to outperform any other model if the modeling accurately captures the true underlying physics of the structure. The potential use of SFE models as digital mirrors is illustrated via application to a linear structure with stochastic material properties. For situations where the physical formulation of such models does not suffice, a data-driven framework is proposed, using machine learning and conditional generative adversarial networks (cGANs). The latter algorithm is used to learn the distribution of the quantity of interest in a structure with material nonlinearities and uncertainties. For the examples considered in this work, the data-driven cGANs model outperforms the physics-based approach. Finally, an example is shown where the two methods are coupled such that a hybrid model approach is demonstrated.


2021 ◽  
Vol 17 (2) ◽  
pp. e1008736
Author(s):  
Alex Hawkins-Hooker ◽  
Florence Depardieu ◽  
Sebastien Baur ◽  
Guillaume Couairon ◽  
Arthur Chen ◽  
...  

The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

