Learning Peptide Recognition Rules for a Low-Specificity Protein

Abstract: Many proteins interact with short linear regions of target proteins. For some proteins, however, it is difficult to identify a well-defined sequence motif that defines their target peptides. To overcome this difficulty, we used supervised machine learning to train a model that treats each peptide as a collection of easily calculated biochemical features rather than as an amino acid sequence. As a test case, we dissected the peptide-recognition rules for human S100A5 (hA5), a low-specificity calcium-binding protein. We trained a Random Forest model against a recently released, high-throughput phage display dataset collected for hA5. The model identifies hydrophobicity and shape complementarity, rather than polar contacts, as the primary determinants of peptide binding specificity in hA5. We tested this hypothesis by solving a crystal structure of hA5 and through computational docking studies of diverse peptides onto hA5. These structural studies revealed that peptides exhibit multiple binding modes at the hA5 peptide interface, all of which have few polar contacts with hA5. Finally, we used our trained model to predict new, plausible binding targets in the human proteome. This revealed that a fragment of the protein α-1-syntrophin binds to hA5. Our work clarifies the biochemistry and biology of hA5 and demonstrates how high-throughput experiments coupled with machine learning of biochemical features can reveal the determinants of binding specificity in low-specificity proteins.
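The idea of treating a peptide as a bag of biochemical features rather than a sequence can be illustrated with a minimal sketch. The abstract does not list the features actually used, so the three below (mean Kyte-Doolittle hydropathy, a crude net charge, and aromatic fraction) and the function name `peptide_features` are illustrative assumptions only.

```python
# Kyte-Doolittle hydropathy scale. Using this scale is an assumption made for
# illustration; it is not stated to be the feature set used in the study.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def peptide_features(seq):
    """Return a small dictionary of sequence-order-independent features."""
    seq = seq.upper()
    if not seq or any(res not in KYTE_DOOLITTLE for res in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")
    n = len(seq)
    return {
        "length": n,
        "mean_hydropathy": sum(KYTE_DOOLITTLE[r] for r in seq) / n,
        # Crude net charge: +1 for K/R, -1 for D/E; histidine and termini ignored.
        "net_charge": sum(r in "KR" for r in seq) - sum(r in "DE" for r in seq),
        "aromatic_fraction": sum(r in "FWY" for r in seq) / n,
    }


# Hypothetical peptide, used only to show the call.
features = peptide_features("FWYA")
```

Feature dictionaries of this kind could then be stacked into a table and passed to any classifier, such as a random forest, with binder/non-binder labels taken from the phage display data.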

Machine learning-assisted high-throughput exploration of interface energy space in multi-phase-field model with CALPHAD potential

Materials Theory, 2022, Vol 6 (1). DOI: 10.1186/s41313-021-00038-0

Author(s): Vahid Attari, Raymundo Arroyave

Keyword(s): Machine Learning, High Throughput, Phase Field, Interface Energy, Energy Space, Supervised Machine Learning, Phase Field Method, Phase Field Modeling, Field Modeling, Multi Phase

Abstract: Computational methods are increasingly being incorporated into the exploitation of microstructure–property relationships for microstructure-sensitive design of materials. In the present work, we propose non-intrusive materials informatics methods for the high-throughput exploration and analysis of a synthetic microstructure space using a machine learning-reinforced multi-phase-field modeling scheme. We specifically study the interface energy space as one of the most uncertain inputs in phase-field modeling and its impact on the shape and contact angle of a growing phase during heterogeneous solidification of a secondary phase between solid and liquid phases. We evaluate and discuss methods for the study of sensitivity and propagation of uncertainty in these input parameters, as reflected in the shape of the Cu6Sn5 intermetallic during growth over the Cu substrate inside the liquid Sn solder due to uncertain interface energies. The sensitivity results rank σSI, σSL, and σIL, respectively, as the most influential parameters on the shape of the intermetallic. Furthermore, we use a variational autoencoder, a deep generative neural network method, and label spreading, a semi-supervised machine learning method, to establish correlations between inputs and outputs of the computational model. We clustered the microstructures into three categories (“wetting”, “dewetting”, and “invariant”) using the label spreading method and compared the result with the trend observed in the Young-Laplace equation. In addition, a structure map in the interface energy space is developed that shows that σSI and σSL alter the shape of the intermetallic synchronously: an increase in the latter and a decrease in the former change the shape from dewetting structures to wetting structures. The study shows that the machine learning-reinforced phase-field method is a convenient approach to analyze the microstructure design space in the framework of ICME.
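The wetting trend described above can be sketched with the textbook Young force balance at a triple junction, σSL = σSI + σIL·cos θ. This is an assumption about the geometry (S = Cu substrate, I = intermetallic, L = liquid solder) and is not the paper's phase-field model; the function name `contact_angle_deg` is hypothetical.

```python
import math


def contact_angle_deg(sigma_si, sigma_sl, sigma_il):
    """Contact angle (degrees) from Young's relation, or None if no balance exists.

    Assumes sigma_sl = sigma_si + sigma_il * cos(theta), with the intermetallic
    as the growing phase on the substrate. Illustrative sketch only.
    """
    if sigma_il <= 0:
        raise ValueError("sigma_il must be positive")
    cos_theta = (sigma_sl - sigma_si) / sigma_il
    if abs(cos_theta) > 1.0:
        # No equilibrium angle: complete wetting or complete dewetting.
        return None
    return math.degrees(math.acos(cos_theta))


# Raising sigma_sl or lowering sigma_si lowers the angle (more wetting).
angle_equal = contact_angle_deg(1.0, 1.0, 1.0)
angle_lower_si = contact_angle_deg(0.5, 1.0, 1.0)
```

Under this relation, increasing σSL while decreasing σSI reduces the contact angle, which matches the direction of the dewetting-to-wetting transition reported in the abstract.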

Exploiting High-Throughput Indoor Phenotyping to Characterize the Founders of a Structured B. napus Breeding Population

Frontiers in Plant Science, 2022, Vol 12. DOI: 10.3389/fpls.2021.780250

Author(s): Jana Ebersbach, Nazifa Azam Khan, Ian McQuillan, Erin E. Higgins, Kyla Horner, ...

Keyword(s): Machine Learning, Image Processing, Drought Stress, High Throughput, Complex Traits, Phenotypic Diversity, Crop Improvement, Breeding Population, Supervised Machine Learning, Oilseed Crop

Phenotyping is considered a significant bottleneck impeding fast and efficient crop improvement. Similar to many crops, Brassica napus, an internationally important oilseed crop, suffers from low genetic diversity, and will require exploitation of diverse genetic resources to develop locally adapted, high yielding and stress resistant cultivars. A pilot study was completed to assess the feasibility of using indoor high-throughput phenotyping (HTP), semi-automated image processing, and machine learning to capture the phenotypic diversity of agronomically important traits in a diverse B. napus breeding population, SKBnNAM, introduced here for the first time. The experiment comprised 50 spring-type B. napus lines, grown and phenotyped in six replicates under two treatment conditions (control and drought) over 38 days in a LemnaTec Scanalyzer 3D facility. Growth traits including plant height, width, projected leaf area, and estimated biovolume were extracted and derived through processing of RGB and NIR images. Anthesis was automatically and accurately scored (97% accuracy) and the number of flowers per plant and day was approximated alongside relevant canopy traits (width, angle). Further, supervised machine learning was used to predict the total number of raceme branches from flower attributes with 91% accuracy (linear regression and Huber regression algorithms) and to identify mild drought stress, a complex trait which typically has to be empirically scored (0.85 area under the receiver operating characteristic curve, random forest classifier algorithm). The study demonstrates the potential of HTP, image processing and computer vision for effective characterization of agronomic trait diversity in B. napus, although limitations of the platform did create significant variation that limited the utility of the data. However, the results underscore the value of machine learning for phenotyping studies, particularly for complex traits such as drought stress resistance.
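The drought-stress classifier is summarized by its area under the ROC curve. As a reminder of what that number measures, here is a minimal sketch that computes AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The labels and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = drought-stressed, 0 = control (hypothetical scores).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a value of 0.85 means 85% of such pairs are ordered correctly.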

Supervised machine learning for power and bandwidth management in very high throughput satellite systems

International Journal of Satellite Communications and Networking, 2021. DOI: 10.1002/sat.1422

Author(s): Flor G. Ortiz‐Gómez, Daniele Tarchi, Ramón Martínez, Alessandro Vanelli‐Coralli, Miguel A. Salas‐Natera, ...

Keyword(s): Machine Learning, High Throughput, Supervised Machine Learning, Bandwidth Management, Satellite Systems, Very High

Rapid Assessment of T-Cell Receptor Specificity of the Immune Repertoire

2020. DOI: 10.1101/2020.04.06.028415

Author(s): Xingcheng Lin, Jason T. George, Nicholas P. Schafer, Kevin Ng Chau, Cecilia Clementi, ...

Keyword(s): Machine Learning, T Cell, Cancer Immunotherapy, High Throughput, Rapid Assessment, Peptide Binding, Supervised Machine Learning, Accurate Assessment, Antigen Specificity, Immune Repertoire

Abstract: Accurate assessment of TCR-antigen specificity at the whole immune repertoire level lies at the heart of improved cancer immunotherapy, but predictive models capable of high-throughput assessment of TCR-peptide pairs are lacking. Recent advances in deep sequencing and crystallography have enriched the data available for studying TCR-p-MHC systems. Here, we introduce a pairwise energy model, RACER, for rapid assessment of TCR-peptide affinity at the immune repertoire level. RACER applies supervised machine learning to efficiently and accurately resolve strong TCR-peptide binding pairs from weak ones. The trained parameters further enable a physical interpretation of interacting patterns encoded in each specific TCR-p-MHC system. When applied to simulate thymic selection of an MHC-restricted T-cell repertoire, RACER accurately estimates recognition rates for tumor-associated neoantigens and foreign peptides, thus demonstrating its utility in helping address the large computational challenge of reliably identifying the properties of tumor antigen-specific T-cells at the level of an individual patient’s immune repertoire.

Significance Statement: Effective TCR-epitope prediction for optimized cancer immunotherapy requires an accurate assessment of billions of TCR-antigen interacting pairs. We introduce RACER, a supervised, physics-based machine learning algorithm trained on deposited TCR-p-MHC sequences and structures. RACER is capable of estimating TCR-peptide binding affinity at a rate of 0.02 seconds per pair, thus enabling large-scale evaluations of TCR epitope recognition. When restricted to the same MHC allele, RACER accurately estimates TCR binding specificities by determining their associated strong binders. We apply RACER to simulate thymic negative selection, demonstrating that this technique can accurately quantify the recognition rate of tumor-associated neoantigens and foreign peptides. Taken together, our approach demonstrates RACER’s potential as a high-throughput tool for investigating TCR-peptide interactions between the TCR repertoire and the cancer peptidome.
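A pairwise energy model scores a TCR-peptide pair by summing a residue-type interaction term over residue pairs in contact. The sketch below shows only that general idea. The abstract does not give RACER's functional form, so the contact list, the `gamma` table, and the function name `pairwise_energy` are all hypothetical, and the numeric values are made up.

```python
def pairwise_energy(tcr, peptide, contacts, gamma):
    """Sum gamma[(tcr_residue, peptide_residue)] over the listed contacts.

    `contacts` is a list of (tcr_index, peptide_index) pairs; residue-type
    pairs missing from `gamma` contribute 0.0. Illustrative sketch only.
    """
    total = 0.0
    for i, j in contacts:
        total += gamma.get((tcr[i], peptide[j]), 0.0)
    return total


# Hypothetical interaction table (more negative = more favorable).
gamma = {("Y", "L"): -1.0, ("D", "K"): -2.0}

# Hypothetical two-residue TCR fragment and peptide, with two contacts.
energy = pairwise_energy("YD", "LK", [(0, 0), (1, 1)], gamma)
```

In a supervised setting, the entries of `gamma` would be the trainable parameters, fitted so that known strong binders receive lower energies than weak ones.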

Exploring the Use of Machine Learning to Automate the Qualitative Coding of Church-related Tweets

Fieldwork in Religion, 2020, Vol 14 (2), pp. 140-159. DOI: 10.1558/firn.40610

Author(s): Anthony-Paul Cooper, Emmanuel Awuni Kolog, Erkki Sutinen

Keyword(s): Machine Learning, Online Community, High Volume, Machine Learning Algorithms, Supervised Machine Learning, Social Media Data, Twitter Data, Resource Intensity, Media Data, Better Than

This article builds on previous research exploring the content of church-related tweets. It does so by examining whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve-Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
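For reference, the three reported metrics follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the counts are invented to show the arithmetic and are not results from the article.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one tweet category.
precision, recall, f1 = precision_recall_f1(tp=7, fp=3, fn=3)
```

With these made-up counts all three values come out at about 0.7, which is the threshold the article treats as acceptable.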

Application of Supervised Machine Learning Algorithms for Lithofacies Classification.

2019. DOI: 10.2523/19349-ms

Author(s): Subhadeep Sarkar, Chandan Majumdar

Keyword(s): Machine Learning, Learning Algorithms, Machine Learning Algorithms, Supervised Machine Learning, Lithofacies Classification

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

2017. DOI: 10.26434/chemrxiv.5513581.v1

Author(s): Sabrina Jaeger, Simone Fulle, Samo Turk

Keyword(s): Machine Learning, Language Processing, Supervised Machine Learning, Learning Approach, Learning Approaches, Unsupervised Machine Learning, Feature Representations, Machine Learning Approach, The Individual, Vector Representations

Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
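The step of encoding a compound by summing its substructure vectors can be sketched in a few lines. The embedding table, the substructure identifiers, and the function name `compound_vector` below are hypothetical. Skipping unseen substructures is a simplifying assumption made here, not a statement about how Mol2vec handles them.

```python
def compound_vector(substructure_ids, embeddings, dim):
    """Sum the embedding vectors of a compound's substructures.

    Substructures missing from `embeddings` are skipped (a simplifying
    assumption for this sketch).
    """
    vec = [0.0] * dim
    for sid in substructure_ids:
        emb = embeddings.get(sid)
        if emb is None:
            continue
        for k in range(dim):
            vec[k] += emb[k]
    return vec


# Hypothetical 2-dimensional embeddings for two substructure identifiers.
embeddings = {"sub_a": [1.0, 2.0], "sub_b": [0.5, -1.0]}

# A compound containing sub_a twice and sub_b once.
vector = compound_vector(["sub_a", "sub_b", "sub_a"], embeddings, dim=2)
```

The resulting fixed-length vector can serve as the input row for any supervised model, regardless of how many substructures the compound contains.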