Correction for Wu et al., Machine learning-assisted directed protein evolution with combinatorial libraries

To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validate this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
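The workflow described above (measure a sample of variants, train a model on them, rank the untested remainder of the combinatorial library in silico, then test only the top predictions) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the additive per-site model is a deliberately simple stand-in for the regression models a real workflow would use, `toy_fitness` is a hypothetical landscape invented for this example, and the library covers three sites only to keep the run fast.

```python
import itertools
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def build_library(n_sites):
    """Enumerate every variant of a simultaneous saturation library."""
    return ["".join(combo) for combo in itertools.product(ALPHABET, repeat=n_sites)]


def toy_fitness(variant):
    """Hypothetical landscape: additive term plus one pairwise (epistatic) bonus."""
    additive = sum(ALPHABET.index(aa) * (pos + 1) for pos, aa in enumerate(variant))
    bonus = 25.0 if variant[0] == "W" and variant[1] == "A" else 0.0
    return additive / 10.0 + bonus


def fit_additive_model(train):
    """Fit per-site residue effects as deviations from the mean training fitness.

    Assumption: this simple model stands in for a trained regressor.
    """
    mean = sum(train.values()) / len(train)
    sums, counts = {}, {}
    for variant, fitness in train.items():
        for pos, aa in enumerate(variant):
            key = (pos, aa)
            sums[key] = sums.get(key, 0.0) + fitness
            counts[key] = counts.get(key, 0) + 1
    effects = {key: sums[key] / counts[key] - mean for key in sums}
    return mean, effects


def predict(model, variant):
    """Predicted fitness; residues unseen in training contribute zero."""
    mean, effects = model
    return mean + sum(effects.get((pos, aa), 0.0) for pos, aa in enumerate(variant))


def mlde_round(library, measure, n_train, n_test, rng):
    """One model-guided round: sample, train, rank in silico, test the top picks."""
    train = {v: measure(v) for v in rng.sample(library, n_train)}
    model = fit_additive_model(train)
    untested = [v for v in library if v not in train]
    ranked = sorted(untested, key=lambda v: predict(model, v), reverse=True)
    tested = dict(train)
    tested.update({v: measure(v) for v in ranked[:n_test]})
    best = max(tested, key=tested.get)
    return best, tested[best], len(tested)  # len(tested) is the screening burden


if __name__ == "__main__":
    library = build_library(3)
    best, fitness, burden = mlde_round(
        library, toy_fitness, n_train=200, n_test=50, rng=random.Random(0)
    )
    print(best, fitness, burden)
```

Only `n_train + n_test` variants are ever passed to `measure`; every other member of the library is evaluated by the model alone, which is where the reduction in experimental effort comes from.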

Informed training set design enables efficient machine learning-assisted directed protein evolution

Cell Systems ◽

10.1016/j.cels.2021.07.008 ◽

2021 ◽

Cited By ~ 2

Author(s):

Bruce J. Wittmann ◽

Yisong Yue ◽

Frances H. Arnold

Keyword(s):

Machine Learning ◽

Protein Evolution ◽

Set Design ◽

Training Set ◽

Efficient Machine

Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden

10.1101/2020.12.04.408955 ◽

2020 ◽

Author(s):

Bruce J. Wittmann ◽

Yisong Yue ◽

Frances H. Arnold

Keyword(s):

Machine Learning ◽

Directed Evolution ◽

Path Dependence ◽

Fitness Landscape ◽

Combinatorial Libraries ◽

Single Step ◽

Saturation Mutagenesis ◽

Global Maximum ◽

Training Procedure ◽

Greedy Optimization

Abstract Due to screening limitations, in directed evolution (DE) of proteins it is rarely feasible to fully evaluate combinatorial mutant libraries made by mutagenesis at multiple sites. Instead, DE often involves a single-step greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. However, because the effects of a mutation can depend on the presence or absence of other mutations, the efficiency and effectiveness of a single-step greedy walk is influenced by both the starting variant and the order in which beneficial mutations are identified; that is, the process is path-dependent. We recently demonstrated a path-independent machine learning-assisted approach to directed evolution (MLDE) that allows in silico screening of full combinatorial libraries made by simultaneous saturation mutagenesis, thus explicitly capturing the effects of cooperative mutations and bypassing the path-dependence that can limit greedy optimization. Here, we thoroughly investigate and optimize an MLDE workflow by testing a number of design considerations of the MLDE pipeline. Specifically, we (1) test the effects of different encoding strategies on MLDE efficiency, (2) integrate new models and a training procedure more amenable to protein engineering tasks, and (3) incorporate training set design strategies to avoid information-poor low-fitness protein variants (“holes”) in the training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape of protein G domain B1 (GB1), the resulting focused training MLDE (ftMLDE) protocol achieved the global fitness maximum up to 92% of the time at a total screening burden of 470 variants. In contrast, minimal-screening-burden single-step greedy optimization over the GB1 fitness landscape reached the global maximum just 1.2% of the time; ftMLDE matching this minimal screening burden (80 total variants) achieved the global optimum up to 9.6% of the time with a 49% higher expected maximum fitness achieved. To facilitate further development of MLDE, we present the MLDE software package (https://github.com/fhalab/MLDE), which is designed for use by protein engineers without computational or machine learning expertise.
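The path dependence of single-step greedy optimization can be illustrated with a small sketch. It assumes a hypothetical two-site landscape (`TOY_LANDSCAPE`, invented for this example and unrelated to the GB1 data) and counts 20 screened variants per site, which is one plausible way to tally the screening burden rather than the accounting used in the paper.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

# Hypothetical epistatic landscape: any variant not listed has fitness 0.
TOY_LANDSCAPE = {"AA": 1.0, "DA": 2.0, "DE": 2.5, "AF": 1.5, "WF": 5.0}


def toy_fitness(variant):
    """Look up fitness on the hypothetical landscape."""
    return TOY_LANDSCAPE.get(variant, 0.0)


def greedy_walk(start, site_order, fitness):
    """Single-step greedy walk: saturate one site at a time and fix the best residue.

    Returns the final variant and the number of variants screened.
    """
    current = start
    n_screened = 0
    for pos in site_order:
        # Single-site saturation library around the current parent.
        candidates = [current[:pos] + aa + current[pos + 1:] for aa in ALPHABET]
        n_screened += len(candidates)
        # The parent is among the candidates, so fitness never decreases.
        current = max(candidates, key=fitness)
    return current, n_screened


if __name__ == "__main__":
    # Same start, same landscape, different site order.
    print(greedy_walk("AA", (0, 1), toy_fitness))  # ends at "DE" (fitness 2.5)
    print(greedy_walk("AA", (1, 0), toy_fitness))  # ends at "WF" (fitness 5.0)
```

Fixing `D` at the first site is the best single step from `AA`, but it closes off the route to `WF`, whose benefit appears only once `F` is already present at the second site. Screening the full combinatorial library, experimentally or in silico, does not depend on such an ordering.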

Creating the New from the Old: Combinatorial Libraries Generation with Machine-Learning-Based Compound Structure Optimization

Journal of Chemical Information and Modeling ◽

10.1021/acs.jcim.6b00426 ◽

2017 ◽

Vol 57 (2) ◽

pp. 133-147 ◽

Cited By ~ 7

Author(s):

Sabina Podlewska ◽

Wojciech M. Czarnecki ◽

Rafał Kafel ◽

Andrzej J. Bojarski

Keyword(s):

Machine Learning ◽

Combinatorial Libraries ◽

Structure Optimization ◽

Compound Structure

Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00195-4 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Trevor S. Frisby ◽

Christopher James Langmead

Keyword(s):

Machine Learning ◽

Protein Engineering ◽

Directed Evolution ◽

Protein Evolution ◽

Sequence Space ◽

Optimization Problem ◽

Bayesian Optimization ◽

Variant Selection ◽

Model Free ◽

Optimization Routine

Abstract Background: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, and so mutations introduced to improve the specified property may come at the expense of unmeasured but nevertheless important properties (e.g., solubility and thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem where the regularization term reflects evolutionary or structure-based constraints. Results: We applied our approach to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods. Conclusion: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.
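One way to picture how a regularization term can redirect variant selection is the sketch below. The abstract does not state the acquisition function or how the regularizer enters it, so everything here is an assumption made for illustration: an upper-confidence-bound score with an additive penalty, hypothetical variant names, and made-up posterior and penalty values standing in for a fitted surrogate model and an evolutionary or structure-based score.

```python
def regularized_ucb(mean, std, penalty, beta=2.0, lam=1.0):
    """Assumed acquisition: upper confidence bound minus a weighted penalty."""
    return mean + beta * std - lam * penalty


def select_next(candidates, posterior, penalty_fn, beta=2.0, lam=1.0):
    """Pick the candidate with the highest regularized acquisition score.

    posterior(v) returns (mean, std) from a surrogate model; penalty_fn(v)
    returns a non-negative cost (for example, a predicted loss of stability).
    Both are supplied by the caller and are hypothetical in this sketch.
    """
    def score(variant):
        mean, std = posterior(variant)
        return regularized_ucb(mean, std, penalty_fn(variant), beta, lam)

    return max(candidates, key=score)


# Made-up numbers for three hypothetical variants.
POSTERIOR = {"var_a": (1.0, 0.5), "var_b": (1.4, 0.5), "var_c": (1.2, 0.1)}
PENALTY = {"var_a": 0.0, "var_b": 1.0, "var_c": 0.2}

if __name__ == "__main__":
    candidates = list(POSTERIOR)
    # Without regularization the most optimistic prediction wins.
    print(select_next(candidates, POSTERIOR.get, PENALTY.get, lam=0.0))  # var_b
    # With the penalty switched on, the heavily penalized variant is passed over.
    print(select_next(candidates, POSTERIOR.get, PENALTY.get, lam=1.0))  # var_a
```

The weight `lam` controls how strongly the penalty steers the search; setting it to zero recovers the unregularized selection.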

Mind wandering as data augmentation: How mental travel supports abstraction

Behavioral and Brain Sciences ◽

10.1017/s0140525x1900311x ◽

2020 ◽

Vol 43 ◽

Author(s):

Myrthe Faber

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Mental Content ◽

Mind Wandering ◽

Theoretical Framework ◽

Important Addition

Abstract Gilead et al. state that abstraction supports mental travel, and that mental travel critically relies on abstraction. I propose an important addition to this theoretical framework, namely that mental travel might also support abstraction. Specifically, I argue that spontaneous mental travel (mind wandering), much like data augmentation in machine learning, provides variability in mental content and context necessary for abstraction.
