CLADE: Cluster learning-assisted directed evolution

Author(s):  
Yuchi Qiu ◽  
Jian Hu ◽  
Guo-Wei Wei

Abstract Directed evolution (DE), a strategy for protein engineering, optimizes protein properties (i.e. fitness) through expensive and time-consuming screening or selection of a large combinatorial sequence space. Machine learning-assisted directed evolution (MLDE), which screens variant properties in silico, can reduce the experimental burden. However, MLDE trained on small, randomly sampled sets of experimentally labeled data yields low hitting rates for the global maximal fitness. This work introduces a cluster learning-assisted directed evolution (CLADE) framework, designed particularly for systems without high-throughput screening assays, that combines sampling through hierarchical unsupervised clustering with supervised learning to guide protein engineering. Based on general biological information, CLADE splits the genetic combinatorial space into subspaces with heterogeneous evolutionary traits, which guides the selection of experimental sampling sets and the subsequent construction of supervised learning training sets. By virtually screening two four-site combinatorial fitness landscapes from protein G domain B1 (GB1) and PhoQ, CLADE consistently achieved a nearly 3-fold improvement in the global maximal fitness hitting rate over randomly sampled training data. CLADE can be easily applied to various biological systems and customized for systems with different throughput levels to maximize its accuracy and efficiency. It promises a significant impact on protein engineering.
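
The core loop described above, unsupervised clustering to stratify the sequence space followed by supervised in silico screening, can be summarized in a short sketch. The clustering algorithm, sampling budget, and regressor below are illustrative assumptions, not the authors' exact CLADE configuration.

```python
# Sketch of cluster-guided sampling for directed evolution (assumed setup,
# not the authors' exact CLADE configuration).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def clade_like_screen(encodings, measure_fitness, n_clusters=16,
                      samples_per_cluster=4, top_k=96, seed=0):
    """encodings: (n_variants, d) numeric encoding of the combinatorial library;
    measure_fitness: callable returning experimental fitness for given indices."""
    rng = np.random.default_rng(seed)
    # 1) Split the sequence space into subspaces by unsupervised clustering.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(encodings)
    # 2) Draw a small experimental sampling set from every cluster.
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c),
                   size=min(samples_per_cluster, int(np.sum(labels == c))),
                   replace=False)
        for c in range(n_clusters)
    ])
    y_train = measure_fitness(train_idx)
    # 3) Train a supervised model on the labeled subset and screen all variants in silico.
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(encodings[train_idx], y_train)
    scores = model.predict(encodings)
    # 4) Propose the top-ranked variants for the next experimental round.
    return np.argsort(scores)[::-1][:top_k]
```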

2014 ◽  
Vol 24 (38) ◽  
pp. 97
Author(s):  
Antonio Rico-Sulayes

<p align="justify">This article proposes the architecture for a system that uses previously learned weights to sort query results from unstructured data bases when building specialized dictionaries. A common resource in the construction of dictionaries, unstructured data bases have been especially useful in providing information about lexical items frequencies and examples in use. However, when building specialized dictionaries, whose selection of lexical items does not rely on frequency, the use of these data bases gets restricted to a simple provider of examples. Even in this task, the information unstructured data bases provide may not be very useful when looking for specialized uses of lexical items with various meanings and very long lists of results. In the face of this problem, long lists of hits can be rescored based on a supervised learning model that relies on previously helpful results. The allocation of a vast set of high quality training data for this rescoring system is reported here. Finally, the architecture of sucha system,an unprecedented tool in specialized lexicography, is proposed.</p>


Micromachines ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 734 ◽  
Author(s):  
Lindong Weng ◽  
James E. Spoonamore

Protein engineering—the process of developing useful or valuable proteins—has successfully created a wide range of proteins tailored to specific agricultural, industrial, and biomedical applications. Protein engineering may rely on rational techniques informed by structural models, phylogenetic information, or computational methods, or on random techniques such as chemical mutation, DNA shuffling, error-prone polymerase chain reaction (PCR), etc. The increasing capabilities of rational protein design, coupled with the rapid production of large variant libraries, have seriously challenged the capacity of traditional screening and selection techniques. Similarly, random approaches based on directed evolution, which relies on the Darwinian principles of mutation and selection to steer proteins toward desired traits, also require the screening of very large libraries of mutants to be truly effective. For either rational or random approaches, the highest possible screening throughput facilitates efficient protein engineering strategies. In the last decade, high-throughput screening (HTS) for protein engineering has increasingly leveraged the emerging technology of droplet microfluidics. Droplet microfluidics, featuring controlled formation and manipulation of nano- to femtoliter droplets of one fluid phase in another, has presented a new paradigm for screening, providing increased throughput, reduced reagent volume, and scalability. We review here the recent droplet microfluidics-based HTS systems developed for protein engineering, particularly directed evolution. This review can also serve as a tutorial guide for protein engineers and molecular biologists who need a droplet microfluidics-based HTS system for their specific applications but may not have prior knowledge of microfluidics. Finally, several challenges and opportunities are identified to motivate the continued innovation of microfluidics with implications for protein engineering.


2021 ◽  
Vol 12 ◽  
Author(s):  
Deniz Akdemir ◽  
Simon Rio ◽  
Julio Isidro y Sánchez

A major barrier to the wider use of supervised learning in emerging applications, such as genomic selection, is the lack of sufficient and representative labeled data to train prediction models. The amount and quality of labeled training data in many applications is usually limited, and therefore careful selection of the training examples to be labeled can improve accuracy in predictive learning tasks. In this paper, we present an R package, TrainSel, which provides flexible, efficient, and easy-to-use tools for the selection of training populations (STP). We illustrate its use, performance, and potential in four different supervised learning applications within and outside of the plant breeding area.
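
TrainSel itself is an R package, so the snippet below is only a Python illustration of the underlying idea, choosing a training set that spreads across the candidate population; the greedy maximin criterion is an assumption and does not reflect TrainSel's actual API or optimization method.

```python
# Illustration of training-population selection via a greedy maximin design
# (an assumed criterion for illustration only; not TrainSel's API or algorithm).
import numpy as np

def greedy_training_set(X, n_train):
    """Pick n_train candidates that stay far apart in feature space."""
    # Start from the candidate farthest from the population mean.
    chosen = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    for _ in range(n_train - 1):
        # Distance of every candidate to its nearest already-selected candidate.
        d = np.min(np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2), axis=1)
        d[chosen] = -np.inf                  # never re-pick a selected candidate
        chosen.append(int(np.argmax(d)))     # add the candidate farthest from the set
    return np.array(chosen)
```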


2019 ◽  
Author(s):  
Huifang Xu ◽  
Weinan Liang ◽  
Linlin Ning ◽  
Yuanyuan Jiang ◽  
Wenxia Yang ◽  
...  

P450 fatty acid decarboxylases (FADCs) have recently been attracting considerable attention owing to their one-step direct production of industrially important 1-alkenes from biologically abundant feedstock free fatty acids under mild conditions. However, attempts to improve the catalytic activity of FADCs have met with little success. Protein engineering has been limited to selected residues and small mutant libraries due to the lack of an effective high-throughput screening (HTS) method. Here, we devise a catalase-deficient Escherichia coli host strain and report an HTS approach based on colorimetric detection of the H₂O₂-consumption activity of FADCs. Directed evolution enabled by this method has, for the first time, led to the effective identification of improved FADC variants for medium-chain 1-alkene production from both DNA shuffling and random mutagenesis libraries. Advantageously, this screening method can be extended to other enzymes that stoichiometrically utilize H₂O₂ as a co-substrate.


Electronics ◽  
2021 ◽  
Vol 10 (15) ◽  
pp. 1807
Author(s):  
Sascha Grollmisch ◽  
Estefanía Cano

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. Recent SSL methods have in common a strong reliance on the augmentation of unannotated data, which remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks covering music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNN) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications consistently outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only on the most challenging dataset, acoustic scene classification, showing that there is still room for improvement.
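
For readers unfamiliar with FixMatch, the sketch below shows its unlabeled-data loss: a pseudo-label from a weakly augmented view supervises a strongly augmented view whenever the prediction is confident. The augmentation functions and confidence threshold are placeholders; the paper's audio-specific augmentation selection is not reproduced here.

```python
# Minimal sketch of the FixMatch unlabeled-data loss (augmentations and threshold
# are placeholders; the paper's audio-specific augmentation selection is not shown).
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    with torch.no_grad():
        # Pseudo-label from a weakly augmented view.
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()       # keep only confident pseudo-labels
    # Enforce consistency on the strongly augmented view.
    logits_strong = model(strong_aug(x_unlabeled))
    per_example = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (per_example * mask).mean()
```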


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Aliaksei Vasilevich ◽  
Aurélie Carlier ◽  
David A. Winkler ◽  
Shantanu Singh ◽  
Jan de Boer

Abstract Natural evolution tackles optimization by producing many genetic variants and exposing these variants to selective pressure, resulting in the survival of the fittest. We use high-throughput screening of large libraries of materials with differing surface topographies to probe the interactions of implantable device coatings with cells and tissues. However, the vast size of the possible design parameter space precludes a brute-force approach to screening all topographical possibilities. Here, we took inspiration from nature to optimize material surface topographies using evolutionary algorithms. We show that successive cycles of material design, production, fitness assessment, selection, and mutation result in the optimization of biomaterial designs. Starting from a small selection of topographically designed surfaces that upregulate expression of an osteogenic marker, we used genetic crossover and random mutagenesis to generate new generations of topographies.
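
A toy sketch of one evolutionary cycle (fitness assessment, selection, crossover, mutation) is given below; the real-valued encoding, operators, and population sizes are generic placeholders rather than the study's actual topography parameterization.

```python
# Toy sketch of an evolutionary design loop (encoding and operators are generic
# placeholders, not the study's actual surface topography parameters).
import numpy as np

def evolve_designs(fitness, n_params, pop_size=30, n_generations=10,
                   mutation_rate=0.1, seed=0):
    """fitness: callable scoring one design, e.g. measured osteogenic marker expression."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_params))          # initial design parameters in [0, 1)
    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Selection: keep the fittest half of the population as parents.
        parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]
        # Genetic crossover: mix parameters from two randomly chosen parents.
        i, j = rng.integers(len(parents), size=(2, pop_size))
        mask = rng.random((pop_size, n_params)) < 0.5
        pop = np.where(mask, parents[i], parents[j])
        # Random mutagenesis: perturb a small fraction of parameters.
        mutate = rng.random(pop.shape) < mutation_rate
        pop[mutate] = rng.random(int(mutate.sum()))
    best = int(np.argmax([fitness(ind) for ind in pop]))
    return pop[best]
```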


Author(s):  
Carlos Lassance ◽  
Vincent Gripon ◽  
Antonio Ortega

For the past few years, deep learning (DL) robustness (i.e. the ability to maintain the same decision when inputs are subject to perturbations) has become a question of paramount importance, in particular in settings where misclassification can have dramatic consequences. To address this question, authors have proposed different approaches, such as adding regularizers or training with noisy examples. In this paper we introduce a regularizer based on the Laplacian of similarity graphs obtained from the representation of training data at each layer of the DL architecture. This regularizer penalizes large changes (across consecutive layers of the architecture) in the distance between examples of different classes, and as such enforces smooth variations of the class boundaries. We provide theoretical justification for this regularizer and demonstrate its effectiveness in improving robustness on classical supervised learning vision datasets for various types of perturbations. We also show that it can be combined with existing methods to increase overall robustness.
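
A rough sketch of one way to implement such a regularizer follows; the Gaussian similarity graph and the exact penalty form are assumptions for illustration, not the paper's precise formulation.

```python
# Sketch of a Laplacian-smoothness regularizer across consecutive layers
# (Gaussian similarity graph and penalty form are assumptions, not the paper's exact recipe).
import torch
import torch.nn.functional as F

def laplacian_smoothness(features, labels, n_classes, sigma=1.0):
    """Smoothness of one-hot class signals on a similarity graph built from one layer's
    (batch, dim) activations."""
    d = torch.cdist(features, features)                 # pairwise distances in this layer
    W = torch.exp(-d.pow(2) / (2 * sigma ** 2))         # Gaussian similarity graph
    L = torch.diag(W.sum(dim=1)) - W                    # graph Laplacian
    Y = F.one_hot(labels, n_classes).float()            # class indicator signals
    return torch.trace(Y.t() @ L @ Y) / features.shape[0]

def laplacian_regularizer(per_layer_features, labels, n_classes):
    """Penalize large changes in smoothness between consecutive layers."""
    s = [laplacian_smoothness(f, labels, n_classes) for f in per_layer_features]
    return sum((s[k + 1] - s[k]).abs() for k in range(len(s) - 1))
```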

