Machine learning-assisted directed protein evolution with combinatorial libraries

To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
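
The in silico screening idea above can be illustrated with a deliberately simple stand-in for the trained model. The sketch below is not the authors' method: it fits an additive surrogate from a parent and its single mutants, then ranks every member of a toy combinatorial library by predicted fitness. The three-letter alphabet, the parent sequence, the fitness values, and the function names are all hypothetical, and an additive model cannot capture the cooperative effects that the models described in the abstract are meant to learn.

```python
from itertools import product

ALPHABET = "ACD"   # toy alphabet (assumption); real libraries use 20 amino acids
PARENT = "AAA"     # hypothetical parent sequence


def fit_additive(measured, parent=PARENT):
    """Estimate one effect per (site, residue) from the parent and its single mutants."""
    base = measured[parent]
    effects = {}
    for variant, fitness in measured.items():
        diffs = [i for i, (a, b) in enumerate(zip(variant, parent)) if a != b]
        if len(diffs) == 1:  # only single mutants inform this simple surrogate
            site = diffs[0]
            effects[(site, variant[site])] = fitness - base
    return base, effects


def predict(variant, base, effects, parent=PARENT):
    """Predicted fitness: parent fitness plus the summed single-mutation effects."""
    return base + sum(
        effects.get((i, aa), 0.0)
        for i, aa in enumerate(variant)
        if aa != parent[i]
    )


def rank_library(base, effects, n_sites=3, top_k=3):
    """Score the full combinatorial library in silico and return the top predictions."""
    library = ("".join(p) for p in product(ALPHABET, repeat=n_sites))
    return sorted(library, key=lambda v: predict(v, base, effects), reverse=True)[:top_k]


# Hypothetical measurements: the parent plus all six single mutants.
MEASURED = {
    "AAA": 1.0,
    "CAA": 2.0, "DAA": 3.0,
    "ACA": 4.0, "ADA": 2.0,
    "AAC": 1.5, "AAD": 3.0,
}

BASE, EFFECTS = fit_additive(MEASURED)
TOP = rank_library(BASE, EFFECTS)
```

Only 7 of the 27 toy variants are "measured" here, yet all 27 receive a predicted score, which is the sense in which a model makes sampling sequence space cheap. The top-ranked predictions would then be the ones sent for experimental testing.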

Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden

10.1101/2020.12.04.408955 ◽

2020 ◽

Author(s):

Bruce J. Wittmann ◽

Yisong Yue ◽

Frances H. Arnold

Keyword(s):

Machine Learning ◽

Directed Evolution ◽

Path Dependence ◽

Fitness Landscape ◽

Combinatorial Libraries ◽

Single Step ◽

Saturation Mutagenesis ◽

Global Maximum ◽

Training Procedure ◽

Greedy Optimization

Due to screening limitations, in directed evolution (DE) of proteins it is rarely feasible to fully evaluate combinatorial mutant libraries made by mutagenesis at multiple sites. Instead, DE often involves a single-step greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. However, because the effects of a mutation can depend on the presence or absence of other mutations, the efficiency and effectiveness of a single-step greedy walk are influenced by both the starting variant and the order in which beneficial mutations are identified; that is, the process is path-dependent. We recently demonstrated a path-independent machine learning-assisted approach to directed evolution (MLDE) that allows in silico screening of full combinatorial libraries made by simultaneous saturation mutagenesis, thus explicitly capturing the effects of cooperative mutations and bypassing the path-dependence that can limit greedy optimization. Here, we thoroughly investigate and optimize an MLDE workflow by testing a number of design considerations of the MLDE pipeline. Specifically, we (1) test the effects of different encoding strategies on MLDE efficiency, (2) integrate new models and a training procedure more amenable to protein engineering tasks, and (3) incorporate training set design strategies to avoid information-poor low-fitness protein variants (“holes”) in the training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape of protein G domain B1 (GB1), the resulting focused training MLDE (ftMLDE) protocol achieved the global fitness maximum up to 92% of the time at a total screening burden of 470 variants. In contrast, minimal-screening-burden single-step greedy optimization over the GB1 fitness landscape reached the global maximum just 1.2% of the time; ftMLDE matching this minimal screening burden (80 total variants) achieved the global optimum up to 9.6% of the time with a 49% higher expected maximum fitness achieved. To facilitate further development of MLDE, we present the MLDE software package (https://github.com/fhalab/MLDE), which is designed for use by protein engineers without computational or machine learning expertise.
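
The path dependence of a single-step greedy walk can be seen on a toy landscape. The sketch below is illustrative only: the two-letter alphabet, the four fitness values, and the function names are assumptions chosen so that the outcome depends on the starting variant and the site order. It is not taken from the MLDE package.

```python
ALPHABET = "AB"  # toy alphabet (assumption); GB1 sites each have 20 options

# Hypothetical two-site landscape: each single change from "AA" lowers fitness,
# but the double mutant "BB" is the global maximum.
FITNESS = {"AA": 1.0, "AB": 0.5, "BA": 0.5, "BB": 2.0}


def greedy_walk(start, site_order, fitness=FITNESS, alphabet=ALPHABET):
    """Single-step greedy DE: saturate one site at a time and fix the best variant."""
    current = start
    for site in site_order:
        # Simulated single-site saturation mutagenesis at this position.
        candidates = [current[:site] + aa + current[site + 1:] for aa in alphabet]
        current = max(candidates, key=fitness.__getitem__)
    return current


def exhaustive_best(fitness=FITNESS):
    """Evaluate the full combinatorial library, as in silico screening would."""
    return max(fitness, key=fitness.get)


# Same starting variant, different site order, different end point.
RESULT_ORDER_01 = greedy_walk("BA", (0, 1))  # ends at "AA", a local optimum
RESULT_ORDER_10 = greedy_walk("BA", (1, 0))  # ends at "BB", the global optimum
```

Starting from "AA", the walk never leaves "AA" in either site order, because both single mutants are worse than the start. Scoring the whole library at once, as `exhaustive_best` does on this tiny example, does not depend on any walk order.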

Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00195-4 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Trevor S. Frisby ◽

Christopher James Langmead

Keyword(s):

Machine Learning ◽

Protein Engineering ◽

Directed Evolution ◽

Protein Evolution ◽

Sequence Space ◽

Optimization Problem ◽

Bayesian Optimization ◽

Variant Selection ◽

Model Free ◽

Optimization Routine

Background: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, and so mutations introduced to improve the specified property may come at the expense of unmeasured, but nevertheless important, properties (e.g., solubility and thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem where the regularization term reflects evolutionary or structure-based constraints. Results: We applied our approach to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods. Conclusion: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.
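
The abstract does not give the form of the regularized objective, so the following is only a generic illustration of how a penalty term can shift variant selection in Bayesian optimization. It assumes an upper-confidence-bound acquisition with a subtracted penalty. The function names, the `beta` and `lam` weights, and the candidate values are all hypothetical and should not be read as the paper's formulation.

```python
def regularized_ucb(mean, std, penalty, beta=2.0, lam=1.0):
    """Upper confidence bound minus a weighted penalty (illustrative form only).

    mean, std: surrogate-model prediction and uncertainty for the target property.
    penalty:   hypothetical score for violating an evolutionary or structural prior.
    """
    return mean + beta * std - lam * penalty


def select_variant(candidates, beta=2.0, lam=1.0):
    """Pick the candidate with the highest regularized acquisition value."""
    return max(
        candidates,
        key=lambda name: regularized_ucb(*candidates[name], beta=beta, lam=lam),
    )


# Hypothetical candidates: name -> (predicted mean, predicted std, penalty).
CANDIDATES = {
    "v1": (1.0, 0.1, 0.0),  # slightly lower predicted fitness, no penalty
    "v2": (1.2, 0.1, 1.0),  # higher predicted fitness, heavily penalized
}

UNREGULARIZED_PICK = select_variant(CANDIDATES, lam=0.0)  # "v2"
REGULARIZED_PICK = select_variant(CANDIDATES, lam=1.0)    # "v1"
```

With `lam=0.0` the selection follows predicted fitness and uncertainty alone. With a nonzero weight, the penalized candidate drops out, which is one way a regularizer can focus the search on a region of sequence space.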

Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration

10.1101/2021.08.13.456323 ◽

2021 ◽

Author(s):

Yutaka Saito ◽

Misaki Oikawa ◽

Takumi Sato ◽

Hikaru Nakazawa ◽

Tomoyuki Ito ◽

...

Keyword(s):

Machine Learning ◽

Enzyme Activity ◽

Directed Evolution ◽

Sequence Space ◽

Training Data ◽

Sortase A ◽

Library Design ◽

Design Cycle ◽

High Enzyme ◽

Evolution Study

Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of a known "highly positive" variant (i.e., a variant known to have high enzyme activity) in training data. We performed two separate series of ML-guided directed evolution of Sortase A with and without a known highly positive variant called 5M in training data. In each series, two rounds of ML were conducted: variants predicted by the first round were experimentally evaluated and used as additional training data for the second-round prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2-2.5 times higher than 5M. Intriguingly, the sequences of the improved variants were largely different between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence or absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data, but also by ML using a subset of the training data even when it lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.

Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration

ACS Catalysis ◽

10.1021/acscatal.1c03753 ◽

2021 ◽

pp. 14615-14624

Author(s):

Yutaka Saito ◽

Misaki Oikawa ◽

Takumi Sato ◽

Hikaru Nakazawa ◽

Tomoyuki Ito ◽

...

Keyword(s):

Machine Learning ◽

Directed Evolution ◽

Sequence Space ◽

Space Exploration ◽

Training Data ◽

Library Design ◽

Data Composition ◽

Design Cycle

Adaptation in protein fitness landscapes is facilitated by indirect paths

eLife ◽

10.7554/elife.16965 ◽

2016 ◽

Vol 5 ◽

Cited By ~ 66

Author(s):

Nicholas C Wu ◽

Lei Dai ◽

C Anders Olson ◽

James O Lloyd-Smith ◽

Ren Sun

Keyword(s):

Protein Evolution ◽

Sequence Space ◽

Fitness Landscape ◽

Empirical Studies ◽

Fitness Landscapes ◽

Complete Subgraph ◽

Genotype Space ◽

Subsequent Loss ◽

Protein Sequence Space ◽

Type Sequence

The structure of fitness landscapes is critical for understanding adaptive protein evolution. Previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, the dimensionality of protein sequence space is higher (20^L) and there may be higher-order interactions among more than two sites. Here we experimentally characterized the fitness landscape of four sites in protein GB1, containing 20^4 = 160,000 variants. We found that while reciprocal sign epistasis blocked many direct paths of adaptation, such evolutionary traps could be circumvented by indirect paths through genotype space involving gain and subsequent loss of mutations. These indirect paths alleviate the constraint on adaptive protein evolution, suggesting that the heretofore neglected dimensions of sequence space may change our views on how proteins evolve.
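
Two quantities in this abstract are easy to make concrete: the size of a combinatorially complete landscape, and the condition for reciprocal sign epistasis between two mutations. The sketch below uses the standard definition, in which each mutation's fitness effect changes sign depending on whether the other mutation is present. The function names and the example fitness values are hypothetical.

```python
def landscape_size(n_sites, alphabet_size=20):
    """Number of variants when every site can take any of `alphabet_size` residues."""
    return alphabet_size ** n_sites


def is_reciprocal_sign_epistasis(f_wt, f_a, f_b, f_ab):
    """True if each mutation's effect flips sign depending on the other's presence.

    f_wt: fitness of the starting genotype
    f_a, f_b: fitness of the two single mutants
    f_ab: fitness of the double mutant
    """
    effect_a_alone = f_a - f_wt    # effect of mutation a without b
    effect_a_with_b = f_ab - f_b   # effect of mutation a when b is present
    effect_b_alone = f_b - f_wt    # effect of mutation b without a
    effect_b_with_a = f_ab - f_a   # effect of mutation b when a is present
    return (effect_a_alone * effect_a_with_b < 0) and (effect_b_alone * effect_b_with_a < 0)


# Four fully randomized sites give the landscape size quoted in the abstract.
GB1_SIZE = landscape_size(4)  # 160000

# Hypothetical values: both singles are worse than the start, the double is best,
# so neither direct path to the double mutant is uphill at every step.
BLOCKED = is_reciprocal_sign_epistasis(1.0, 0.5, 0.5, 2.0)
# Hypothetical values with no sign change: both direct paths are uphill.
OPEN = is_reciprocal_sign_epistasis(1.0, 1.5, 1.2, 2.0)
```

When the check returns `True`, reaching the double mutant by strictly uphill single steps requires a detour through other genotypes, which corresponds to the indirect paths the abstract describes.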

Correction for Wu et al., Machine learning-assisted directed protein evolution with combinatorial libraries

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1921770117 ◽

2019 ◽

Vol 117 (1) ◽

pp. 788-789

Keyword(s):

Machine Learning ◽

Protein Evolution ◽

Combinatorial Libraries

Adaptation in protein fitness landscapes is facilitated by indirect paths

10.1101/045096 ◽

2016 ◽

Cited By ~ 1

Author(s):

Nicholas C. Wu ◽

Lei Dai ◽

C. Anders Olson ◽

James O. Lloyd-Smith ◽

Ren Sun

Keyword(s):

Protein Evolution ◽

Sequence Space ◽

Fitness Landscape ◽

Empirical Studies ◽

Affinity Maturation ◽

Fitness Landscapes ◽

Complete Subgraph ◽

Igg Binding ◽

Subsequent Loss ◽

Protein Sequence Space

The structure of fitness landscapes is critical for understanding adaptive protein evolution (e.g. antimicrobial resistance, affinity maturation, etc.). Due to limited throughput in fitness measurements, previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, however, the dimensionality of protein sequence space is higher (20^L, L being the length of the relevant sequence) and there may be higher-order interactions among more than two sites. To study how these features impact the course of protein evolution, we experimentally characterized the fitness landscape of four sites in the IgG-binding domain of protein G, containing 20^4 = 160,000 variants. We found that the fitness landscape was rugged and direct paths of adaptation were often constrained by pairwise epistasis. However, while direct paths were blocked by reciprocal sign epistasis, we found systematic evidence that such evolutionary traps could be circumvented by "extra-dimensional bypass". Extra dimensions in sequence space - with a different amino acid at the site of interest or an additional interacting site - open up indirect paths of adaptation via gain and subsequent loss of mutations. These indirect paths alleviate the constraint on reaching high fitness genotypes via selectively accessible trajectories, suggesting that the heretofore neglected dimensions of sequence space may completely change our views on how proteins evolve.

"Multi-Agent" Screening Improves the Efficiency of Directed Enzyme Evolution

10.1101/2021.04.06.438652 ◽

2021 ◽

Author(s):

Tian Yang ◽

Zhixia Ye ◽

Michael D Lynch

Keyword(s):

Directed Evolution ◽

Tertiary Structure ◽

Fitness Landscape ◽

Combinatorial Libraries ◽

Enzyme Engineering ◽

Enzyme Evolution ◽

Combinatorial Search ◽

Screening Process ◽

Multiple Substrates ◽

Multi Agent

Enzyme evolution has enabled numerous advances in biotechnology. However, directed evolution programs can still require many iterative rounds of screening to identify optimal mutant sequences. This is due to the sparsity of the fitness landscape, which, in turn, is due to hidden mutations that only offer improvements synergistically in combination with other mutations. These hidden mutations are only identified by evaluating mutant combinations, necessitating large combinatorial libraries or iterative rounds of screening. Here, we report a multi-agent directed evolution approach that incorporates diverse substrate analogues in the screening process. With multiple substrates acting like multiple agents navigating the fitness landscape, we are able to identify hidden mutant residues that impact substrate specificity without a need for testing numerous combinations. We initially validate this approach in engineering a malonyl-CoA synthetase for improved activity with a wide variety of non-natural substrates. We found that hidden mutations are often distant from the active site, making them hard to predict using popular structure-based methods. Interestingly, many of the hidden mutations identified in this case are expected to destabilize interactions between elements of tertiary structure, potentially affecting protein flexibility. This approach may be widely applicable to accelerate enzyme engineering. Lastly, approaches inspired by multi-agent systems may be more broadly useful in tackling other complex combinatorial search problems in biology.
