Inferring protein sequence-function relationships with large-scale positive-unlabeled learning

Summary: Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. However, it is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data. Importantly, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function data sets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
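
To make the positive-unlabeled setting concrete, here is a minimal sketch on simulated data. It is not the authors' framework: it fits a plain logistic regression that separates selected (positive) variants from an unselected (unlabeled) library, and reads the signs of the coefficients. The toy library, the effect sizes in `true_w`, and the helper `fit_logistic` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, s, lr=0.5, n_iter=2000):
    """Plain gradient-descent logistic regression; returns (weights, intercept)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        grad = sigmoid(X @ w + b) - s      # gradient of the mean log-loss
        w -= lr * (X.T @ grad) / len(s)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
n_sites = 4
# Assumed ground truth: mutation 0 helps, mutation 1 hurts, 2 and 3 are neutral.
true_w = np.array([2.0, -2.0, 0.0, 0.0])

# Simulated library: each row marks which of the 4 mutations a variant carries.
library = rng.integers(0, 2, size=(20000, n_sites)).astype(float)
is_functional = rng.random(20000) < sigmoid(library @ true_w - 1.0)

# Positive set: variants that survived selection. No negatives are observed.
positives = library[is_functional][:3000]
# Unlabeled set: a fresh draw from the library, functional or not.
unlabeled = rng.integers(0, 2, size=(3000, n_sites)).astype(float)

X = np.vstack([positives, unlabeled])
s = np.concatenate([np.ones(len(positives)), np.zeros(len(unlabeled))])
w, b = fit_logistic(X, s)
print(np.round(w, 2))
```

In this simulation the fitted coefficients recover the direction of each mutation's effect, but the classifier estimates the probability of being labeled, not the probability of being functional. Turning one into the other requires a correction for the unknown fraction of functional variants in the unlabeled set, which this sketch omits.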


Neural networks to learn protein sequence-function relationships from deep mutational scanning data

10.1101/2020.10.25.353946 ◽

2020 ◽

Author(s):

Sam Gelman ◽

Philip A. Romero ◽

Anthony Gitter

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Internal Representation ◽

Superior Performance ◽

Network Architectures ◽

Convolutional Network ◽

Learning Framework ◽

And Function ◽

Multiple Neural Network ◽

Function Mapping

Abstract: The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Our software is available from https://github.com/gitter-lab/nn4dms.
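
The phrase "share parameters across sequence positions" can be illustrated with a toy one-dimensional convolution over a one-hot encoded sequence. This sketch is not taken from the nn4dms code; the helpers `one_hot` and `conv1d`, the example sequence, and the hand-set kernel are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode a protein sequence as a (length, 20) matrix of 0/1 indicators."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x

def conv1d(x, kernel):
    """Slide one kernel over every window, so all positions share its weights."""
    width = kernel.shape[0]
    n_windows = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * kernel) for i in range(n_windows)])

# Hand-set kernel (an assumption for illustration) that responds to "K then E".
kernel = np.zeros((2, len(AMINO_ACIDS)))
kernel[0, AA_INDEX["K"]] = 1.0
kernel[1, AA_INDEX["E"]] = 1.0

activations = conv1d(one_hot("AKEGGKEA"), kernel)
print(activations.tolist())  # → [0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0]
```

The same two-residue pattern produces the same activation at both places it occurs, using a single set of weights. A trained network would learn such kernels from the data and pass their outputs through nonlinear layers.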


Data analytics accelerates the experimental discovery of new thermoelectric materials with extremely high figure of merit

10.21203/rs.3.rs-926972/v1 ◽

2021 ◽

Author(s):

Sergey Levchenko ◽

Yaqiong Zhong ◽

Xiaojuan Hu ◽

Debalaya Sarker ◽

Qingrui Xia ◽

...

Keyword(s):

Thermoelectric Materials ◽

Figure Of Merit ◽

Large Scale ◽

Chemical Space ◽

Small Data ◽

Data Sets ◽

High Performing ◽

Learning Framework ◽

Small Data Sets ◽

P Type

Abstract: Thermoelectric (TE) materials are among the very few sustainable yet feasible energy solutions of the present time. This huge promise of energy harvesting is contingent on identifying or designing materials with higher efficiency than those presently available. However, due to the vastness of the chemical space of materials, only a small fraction of it has been scanned experimentally and/or computationally so far. Employing a compressed-sensing based symbolic regression in an active-learning framework, we have not only identified a trend in materials’ compositions for superior TE performance, but have also predicted and experimentally synthesized several extremely high performing novel TE materials. Among these, we found polycrystalline p-type Cu0.45Ag0.55GaTe2 to possess an experimental figure of merit as high as ~2.8 at 827 K. This is a breakthrough in the field, because all previously known thermoelectric materials with a comparable figure of merit are either unstable or much more difficult to synthesize, rendering them unusable in large-scale applications. The presented methodology demonstrates the importance and tremendous potential of physically informed descriptors in materials science, in particular for the relatively small data sets typically available from experiments at well-controlled conditions.


SPOT-Contact-Single: Improving Single-Sequence-Based Prediction of Protein Contact Map using a Transformer Language Model, Large Training Set and Ensembled Deep Learning

10.1101/2021.06.19.449089 ◽

2021 ◽

Author(s):

Jaspreet Singh ◽

Thomas Litfin ◽

Jaswinder Singh ◽

Kuldip Paliwal ◽

Yaoqi Zhou

Keyword(s):

Protein Sequence ◽

Large Scale ◽

Language Model ◽

Structure And Function ◽

Evolutionary Information ◽

Contact Map ◽

Training Set ◽

Computationally Intensive ◽

And Function ◽

Single Sequence

Motivation: Accurate prediction of protein contact maps is essential for accurate protein structure and function prediction. As a result, many methods have been developed for protein contact map prediction. However, most contact map prediction methods rely on protein sequence evolutionary information, which may not exist for many proteins due to lack of sequence homology. Moreover, generating evolutionary profiles is computationally intensive and time consuming. Therefore, we developed a contact map predictor utilizing the output of a pre-trained language model ESM-1B as an input along with a large training set and an ensemble of residual neural networks. Results: We showed that the proposed method makes a significant improvement over a single-sequence-based predictor SSCpred with 15% improvement in the F1-score for the independent CASP14-FM test set. It also outperforms evolutionary-profile-based methods TrRosetta and SPOT-Contact with 48.7% and 48.5% respective improvement in the F1-score on the proteins in the SPOT-2018 set without homologs (Neff=1). The new method provides a much faster and reasonably accurate alternative to profile-based methods, particularly useful for large-scale prediction.
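
Since the results are reported as F1-scores, a short sketch of how an F1-score can be computed for a binary contact map may help. The function `contact_f1`, the `min_sep` parameter, and the four-residue example are hypothetical; the sequence-separation cutoffs and evaluation protocol used in the paper are not stated here and are not reproduced.

```python
import numpy as np

def contact_f1(pred, true, min_sep=1):
    """F1-score over residue pairs (i, j) with j - i >= min_sep."""
    i, j = np.triu_indices(true.shape[0], k=min_sep)
    p = pred[i, j].astype(bool)
    t = true[i, j].astype(bool)
    tp = np.sum(p & t)     # predicted contacts that are real
    fp = np.sum(p & ~t)    # predicted contacts that are not real
    fn = np.sum(~p & t)    # real contacts that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def symmetric_map(n, pairs):
    """Build an n x n 0/1 contact map from a list of (i, j) pairs."""
    m = np.zeros((n, n), dtype=int)
    for a, b in pairs:
        m[a, b] = m[b, a] = 1
    return m

true_map = symmetric_map(4, [(0, 2), (0, 3), (1, 3)])
pred_map = symmetric_map(4, [(0, 2), (1, 3), (1, 2)])
f1 = contact_f1(pred_map, true_map)
print(round(f1, 3))  # → 0.667
```

Here two of the three predicted contacts are correct and two of the three true contacts are found, so precision and recall are both 2/3 and so is their harmonic mean.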


A fast hierarchical clustering algorithm for large-scale protein sequence data sets

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2014.02.016 ◽

2014 ◽

Vol 48 ◽

pp. 94-101 ◽

Cited By ~ 10

Author(s):

Sándor M. Szilágyi ◽

László Szilágyi

Keyword(s):

Hierarchical Clustering ◽

Protein Sequence ◽

Large Scale ◽

Clustering Algorithm ◽

Sequence Data ◽

Data Sets ◽

Protein Sequence Data ◽

Hierarchical Clustering Algorithm


Unsupervised Detection of Sub-Events in Large Scale Disasters

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5370 ◽

2020 ◽

Vol 34 (01) ◽

pp. 354-361 ◽

Cited By ~ 1

Author(s):

Chidubem Arachie ◽

Manas Gaur ◽

Sam Anzaroot ◽

William Groves ◽

Ke Zhang ◽

...

Keyword(s):

Social Media ◽

Large Scale ◽

State Of The Art ◽

Qualitative Evaluation ◽

Irrelevant Information ◽

Data Sets ◽

Emergency Responders ◽

Learning Framework ◽

Emergency Event ◽

2015 Nepal Earthquake

Social media plays a major role during and after major natural disasters (e.g., hurricanes, large-scale fires, etc.), as people “on the ground” post useful information on what is actually happening. Given the large number of posts, a major challenge is identifying the information that is useful and actionable. Emergency responders are largely interested in finding out what events are taking place so they can properly plan and deploy resources. In this paper we address the problem of automatically identifying important sub-events (within a large-scale emergency “event”, such as a hurricane). In particular, we present a novel, unsupervised learning framework to detect sub-events in Tweets for retrospective crisis analysis. We first extract noun-verb pairs and phrases from raw tweets as sub-event candidates. Then, we learn a semantic embedding of extracted noun-verb pairs and phrases, and rank them against a crisis-specific ontology. We filter out noisy and irrelevant information, then cluster the noun-verb pairs and phrases so that the top-ranked ones describe the most important sub-events. Through quantitative experiments on two large crisis data sets (Hurricane Harvey and the 2015 Nepal Earthquake), we demonstrate the effectiveness of our approach over the state of the art. Our qualitative evaluation shows better performance compared to our baseline.


Machine Learning Based Teaching Quality Evaluation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.271-273.1451 ◽

2011 ◽

Vol 271-273 ◽

pp. 1451-1454

Author(s):

Gang Zhang ◽

Jian Yin ◽

Liang Lun Cheng ◽

Chun Ru Wang

Keyword(s):

Machine Learning ◽

Large Scale ◽

Quality Evaluation ◽

College Teaching ◽

Real Data ◽

Teaching Quality ◽

Data Sets ◽

Stable Model ◽

Learning Framework ◽

Artificial Neural Network Ann

Teaching quality is a key metric in evaluating the effectiveness and ability of college teaching. In much of the previous literature, evaluation of this metric depends merely on the subjective judgment of a few experts based on their experience, which leads to false, biased, or unstable results. Moreover, purely human-based evaluation is expensive and difficult to extend to large scale. With the application of information technology, much information about college teaching is recorded and stored electronically, which lays the basis for computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning from and modeling electronic information associated with teaching quality, to obtain a stable model that describes the underlying principles of teaching quality. An artificial neural network (ANN) is selected as the main model in this work. Experimental results on real data sets consisting of 4 subjects / 8 semesters show the effectiveness of the proposed method.


CAFU: a Galaxy framework for exploring unmapped RNA-Seq data

Briefings in Bioinformatics ◽

10.1093/bib/bbz018 ◽

2019 ◽

Vol 21 (2) ◽

pp. 676-686 ◽

Cited By ~ 5

Author(s):

Siyuan Chen ◽

Chengzhi Ren ◽

Jingjing Zhai ◽

Jiantao Yu ◽

Xuyang Zhao ◽

...

Keyword(s):

Large Scale ◽

Biological Information ◽

Machine Learning Techniques ◽

Data Sets ◽

Rna Seq ◽

Mixed Species ◽

Short Reads ◽

Comprehensive Collection ◽

Expression Characterization ◽

And Function

Abstract: A widely used approach in transcriptome analysis is the alignment of short reads to a reference genome. However, owing to the deficiencies of specially designed analytical systems, short reads unmapped to the genome sequence are usually ignored, resulting in the loss of significant biological information and insights. To fill this gap, we present Comprehensive Assembly and Functional annotation of Unmapped RNA-Seq data (CAFU), a Galaxy-based framework that can facilitate the large-scale analysis of unmapped RNA sequencing (RNA-Seq) reads from single- and mixed-species samples. By taking advantage of machine learning techniques, CAFU addresses the issue of accurately identifying the species origin of transcripts assembled using unmapped reads from mixed-species samples. CAFU also represents an innovation in that it provides a comprehensive collection of functions required for transcript confidence evaluation, coding potential calculation, sequence and expression characterization and function annotation. These functions and their dependencies have been integrated into a Galaxy framework that provides access to CAFU via a user-friendly interface, dramatically simplifying complex exploration tasks involving unmapped RNA-Seq reads. CAFU has been validated with RNA-Seq data sets from wheat and Zea mays (maize) samples. CAFU is freely available via GitHub: https://github.com/cma2015/CAFU.


Probabilistic forecasting of replication studies

10.31234/osf.io/fhwb7 ◽

2019 ◽

Cited By ~ 1

Author(s):

Samuel Pawel ◽

Leonhard Held

Keyword(s):

Large Scale ◽

Statistical Significance ◽

Predictive Performance ◽

R Package ◽

Effect Estimate ◽

Data Sets ◽

Probabilistic Forecasting ◽

Questionable Research Practices ◽

Replication Studies ◽

Probabilistic Forecasts

Throughout the last decade, the so-called replication crisis has stimulated many researchers to conduct large-scale replication projects. With data from four of these projects, we computed probabilistic forecasts of the replication outcomes, which we then evaluated regarding discrimination, calibration and sharpness. A novel model, which can take into account both inflation and heterogeneity of effects, was used and predicted the effect estimate of the replication study with good performance in two of the four data sets. In the other two data sets, predictive performance was still substantially improved compared to the naive model which does not consider inflation and heterogeneity of effects. The results suggest that many of the estimates from the original studies were inflated, possibly caused by publication bias or questionable research practices, and also that some degree of heterogeneity between original and replication effects should be expected. Moreover, the results indicate that the use of statistical significance as the only criterion for replication success may be questionable, since from a predictive viewpoint, non-significant replication results are often compatible with significant results from the original study. The developed statistical methods as well as the data sets are available in the R package ReplicationSuccess.
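
The point that a non-significant replication can be compatible with a significant original result can be sketched with a simple normal forecast. This is a simplified illustration, not the model implemented in ReplicationSuccess: the function `replication_forecast` is hypothetical, it treats `shrinkage` as a fixed multiplier on the mean only, and it adds a between-study heterogeneity variance `tau**2` once for each study. The numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

def replication_forecast(theta_o, se_o, se_r, shrinkage=1.0, tau=0.0):
    """Normal forecast for the replication effect estimate.

    theta_o, se_o: original effect estimate and its standard error
    se_r:          standard error expected in the replication study
    shrinkage:     assumed factor in (0, 1] that deflates the original estimate
    tau:           assumed between-study heterogeneity (standard deviation)
    """
    mean = shrinkage * theta_o
    sd = sqrt(se_o**2 + se_r**2 + 2 * tau**2)
    return NormalDist(mean, sd)

# Original study: estimate 0.5 with standard error 0.2 (z = 2.5, significant).
forecast = replication_forecast(theta_o=0.5, se_o=0.2, se_r=0.2)

# 95% prediction interval for the replication estimate.
lower, upper = forecast.inv_cdf(0.025), forecast.inv_cdf(0.975)

# Forecast probability that the replication is significant in the same
# direction at the two-sided 5% level.
z_crit = NormalDist().inv_cdf(0.975)
p_significant = 1 - forecast.cdf(z_crit * 0.2)

print(round(lower, 3), round(upper, 3), round(p_significant, 2))
```

With these inputs the prediction interval extends slightly below zero, so a replication estimate too small to be significant on its own would still fall inside the range the original study predicts. Setting `shrinkage` below 1 or `tau` above 0 lowers the forecast probability of a significant replication further.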


Neural networks to learn protein sequence–function relationships from deep mutational scanning data

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2104878118 ◽

2021 ◽

Vol 118 (48) ◽

pp. e2104878118

Author(s):

Sam Gelman ◽

Sarah A. Fahlberg ◽

Pete Heinzelman ◽

Philip A. Romero ◽

Anthony Gitter

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Internal Representation ◽

Superior Performance ◽

Network Architectures ◽

Convolutional Network ◽

Learning Framework ◽

And Function ◽

Multiple Neural Network ◽

Function Mapping

The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence–function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence–function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.


Plant hormones, plant growth regulators

Orvosi Hetilap ◽

10.1556/oh.2014.29939 ◽

2014 ◽

Vol 155 (26) ◽

pp. 1011-1018 ◽

Cited By ~ 1

Author(s):

György Végvári ◽

Edina Vidéki

Keyword(s):

Nervous System ◽

Growth And Development ◽

Large Scale ◽

Essential Role ◽

Higher Plants ◽

Structure And Function ◽

Wide Sense ◽

Large Scale Integration ◽

Scale Integration ◽

And Function

Plants seem to be rather defenceless: unlike animals, they are unable to move and have no nervous system or immune system. Nevertheless, plants do have hormones, although these substances are not produced in glands. Although plants lag behind animals in complexity, plant organisms show large-scale integration in their structure and function. In higher plants, as in animals, intercellular communication is carried out by chemical messengers. These specific compounds in plants are called phytohormones or, in a wider sense, bioregulators. Even small quantities of these endogenous organic compounds are able to regulate the operation, growth and development of higher plants, and to maintain the connection between cells and tissues and the synergy between organs. Since plants do not have nervous and immune systems, phytohormones play an essential role in plants’ life. Orv. Hetil., 2014, 155(26), 1011–1018.
