Facilitating Machine Learning‐Guided Protein Engineering with Smart Library Design and Massively Parallel Assays

2021 ◽  
pp. 2100038
Author(s):  
Hoi Yee Chu ◽  
Alan S. L. Wong
2021 ◽  
Vol 42 (3) ◽  
pp. 151-165
Author(s):  
Harini Narayanan ◽  
Fabian Dingfelder ◽  
Alessandro Butté ◽  
Nikolai Lorenzen ◽  
Michael Sokolov ◽  
...  

2021 ◽  
Author(s):  
Yutaka Saito ◽  
Misaki Oikawa ◽  
Takumi Sato ◽  
Hikaru Nakazawa ◽  
Tomoyuki Ito ◽  
...  

Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of a known "highly positive" variant (i.e., a variant known to have high enzyme activity) in training data. We performed two separate series of ML-guided directed evolution of Sortase A, with and without a known highly positive variant called 5M in the training data. In each series, two rounds of ML were conducted: variants predicted by the first round were experimentally evaluated and used as additional training data for the second-round prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2-2.5 times higher than 5M. Intriguingly, the sequences of the improved variants were largely different between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence or absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data, but also by ML using a subset of the training data, even when it lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
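The two-round procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the one-hot encoding, the ridge regression model, the batch size, and the names `one_hot`, `fit_ridge`, `make_toy_assay`, and `run_series` are all hypothetical, and the toy assay stands in for the wet-lab activity measurement.

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seqs):
    """Encode equal-length sequences as flattened one-hot vectors."""
    idx = {a: i for i, a in enumerate(AMINO)}
    X = np.zeros((len(seqs), len(seqs[0]) * len(AMINO)))
    for r, s in enumerate(seqs):
        for p, a in enumerate(s):
            X[r, p * len(AMINO) + idx[a]] = 1.0
    return X

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression (assumed model); returns the weights."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def make_toy_assay(length, seed=0):
    """Hypothetical stand-in for the experimental activity measurement."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=length * len(AMINO))
    return lambda s: float(one_hot([s])[0] @ w)

def run_series(train_seqs, assay, candidates, rounds=2, batch=5):
    """Run `rounds` of: fit model, pick top-scoring unseen variants, assay them."""
    seqs = list(train_seqs)
    acts = [assay(s) for s in seqs]
    for _ in range(rounds):
        w = fit_ridge(one_hot(seqs), np.array(acts))
        pool = [c for c in candidates if c not in seqs]
        scores = one_hot(pool) @ w
        picked = [pool[i] for i in np.argsort(scores)[::-1][:batch]]
        # Assayed picks become additional training data for the next round.
        seqs += picked
        acts += [assay(s) for s in picked]
    return seqs, acts
```

To mimic the comparison in the abstract, one could call `run_series` twice on the same candidate pool, once with the best-known variant included in `train_seqs` and once with it removed, and then compare the sequences each series selects.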


2022 ◽  
Vol 72 ◽  
pp. 145-152
Author(s):  
Brian L. Hie ◽  
Kevin K. Yang

Author(s):  
Benjamin B. Yellen ◽  
Jon S. Zawistowski ◽  
Eric A. Czech ◽  
Caleb I. Sanford ◽  
Elliott D. SoRelle ◽  
...  

Abstract. Single cell analysis tools have made significant advances in characterizing genomic heterogeneity; however, tools for measuring phenotypic heterogeneity have lagged due to the increased difficulty of handling live biology. Here, we report a single cell phenotyping tool capable of measuring image-based clonal properties at scales approaching 100,000 clones per experiment. These advances are achieved by exploiting a novel flow regime in ladder microfluidic networks that, under appropriate conditions, yield a mathematically perfect cell trap. Machine learning and computer vision tools are used to control the imaging hardware and analyze the cellular phenotypic parameters within these images. Using this platform, we quantified the responses of tens of thousands of single cell-derived acute myeloid leukemia (AML) clones to targeted therapy, identifying rare resistance and morphological phenotypes at frequencies down to 0.05%. This approach can be extended to higher-level cellular architectures such as cell pairs and organoids and on-chip live-cell fluorescence assays.


Author(s):  
Hanjie Shen ◽  
Pengjuan Liu ◽  
Zhanqing Li ◽  
Fang Chen ◽  
Hui Jiang ◽  
...  

Abstract. Background: Systematic errors can be introduced from DNA amplification during massively parallel sequencing (MPS) library preparation and sequencing array formation. Polymerase chain reaction (PCR)-free genomic library preparation methods were previously shown to improve whole genome sequencing (WGS) quality on the Illumina platform, especially in calling insertions and deletions (InDels). We hypothesized that substantial InDel errors continue to be introduced by the remaining PCR step of DNA cluster generation. In addition to library preparation and sequencing, data analysis methods are also important for the accuracy of the output data. In recent years, several machine learning variant calling pipelines have emerged, which can correct the systematic errors from MPS and improve variant calling performance. Results: Here, PCR-free libraries were sequenced on the PCR-free DNBSEQ™ arrays from MGI Tech Co., Ltd. (referred to as MGI) to accomplish the first true PCR-free WGS, in which the whole process is PCR-free during both library preparation and sequencing. We demonstrated that PCR-based WGS libraries have significantly (about 5 times) more InDel errors than PCR-free libraries. Furthermore, PCR-free WGS libraries sequenced on the PCR-free DNBSEQ™ platform have up to 55% fewer InDel errors compared to the NovaSeq platform, confirming that DNA clusters contain PCR-generated errors. In addition, low coverage bias and a read duplication rate of less than 1% were reproducibly obtained in DNBSEQ™ PCR-free using either ultrasonic or enzymatic DNA fragmentation MGI kits combined with MGISEQ-2000. Meanwhile, variant calling performance (single-nucleotide polymorphisms (SNPs) F-score > 99.94%, InDels F-score > 99.6%) exceeded widely accepted standards using machine learning (ML) methods (DeepVariant or DNAscope). Conclusions: Enabled by the new PCR-free library preparation kits, ultra high-throughput PCR-free sequencers and ML-based variant calling, true PCR-free DNBSEQ™ WGS provides a powerful solution for improving WGS accuracy while reducing cost and analysis time, thus facilitating future precision medicine, cohort studies, and large population genome projects.
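For readers unfamiliar with the F-score figures quoted above, the following sketch shows how such a value is conventionally computed from variant-call counts. It assumes the abstract refers to the balanced F1 score (the harmonic mean of precision and recall) measured against a truth set; the abstract itself does not spell this out, and the function name and the counts in the usage note are purely illustrative.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from variant-call counts.

    tp: calls matching the truth set, fp: calls absent from the truth set,
    fn: truth-set variants that were not called.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = 2PR / (P + R), written in the algebraically equivalent count form.
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0
    return precision, recall, f_score
```

For example, with hypothetical counts of 90 true positives, 10 false positives, and 10 false negatives, precision and recall are both 0.9 and so is the F-score.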


2020 ◽  
Author(s):  
Adam C. Mater ◽  
Mahakaran Sandhu ◽  
Colin Jackson

Abstract. Machine learning (ML) has the potential to revolutionize protein engineering. However, the field currently lacks standardized and rigorous evaluation benchmarks for sequence-fitness prediction, which makes accurate evaluation of the performance of different architectures difficult. Here we propose a unifying framework for ML-driven sequence-fitness prediction. Using simulated (the NK model) and empirical sequence landscapes, we define four key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to sparse training data, and ability to cope with epistasis/ruggedness. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness is revealed to be the greatest determinant of the accuracy of sequence-fitness prediction. We hope that this benchmarking method and the code that accompanies it will enable robust evaluation and comparison of novel architectures in this emerging field and assist in the adoption of ML for protein engineering.
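The NK model mentioned in this abstract can be illustrated with a small simulation. The sketch below is not the code that accompanies the paper; it assumes a binary alphabet and cyclic adjacent neighbours for simplicity (protein landscapes would use a larger alphabet), and the names `make_nk_landscape` and `count_local_optima` are hypothetical. In this model each of N sites contributes a random fitness value that depends on its own state and the states of K other sites, so K tunes epistasis, and larger K typically produces a more rugged landscape.

```python
import itertools
import numpy as np

def make_nk_landscape(n, k, seed=0):
    """Build a random NK fitness function over binary genotypes of length n."""
    rng = np.random.default_rng(seed)
    # One lookup table per site, with a random contribution for each of the
    # 2**(k+1) joint states of the site and its k interaction partners.
    tables = rng.random((n, 2 ** (k + 1)))

    def fitness(genotype):
        total = 0.0
        for i in range(n):
            # Site i plus its k cyclic right-hand neighbours (an assumption).
            bits = [genotype[(i + j) % n] for j in range(k + 1)]
            index = int("".join(map(str, bits)), 2)
            total += tables[i, index]
        return total / n

    return fitness

def count_local_optima(fitness, n):
    """Count genotypes fitter than all of their single-site mutants."""
    count = 0
    for g in itertools.product((0, 1), repeat=n):
        f = fitness(g)
        neighbours = (g[:i] + (1 - g[i],) + g[i + 1:] for i in range(n))
        if all(f > fitness(nb) for nb in neighbours):
            count += 1
    return count
```

With K = 0 the landscape is additive and has a single peak; comparing `count_local_optima` for increasing K on the same N gives a simple, exhaustive measure of ruggedness for small N.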


2021 ◽  
Author(s):  
Christian Dallago ◽  
Jody Mou ◽  
Kadina E Johnston ◽  
Bruce Wittmann ◽  
Nicholas Bhattacharya ◽  
...  

Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. To be useful in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed the fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. To enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties/home.

