Inferring protein sequence-function relationships with large-scale positive-unlabeled learning

Author(s):  
Hyebin Song ◽  
Bennett J. Bremer ◽  
Emily C. Hinds ◽  
Garvesh Raskutti ◽  
Philip A. Romero

Summary: Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging, however, to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data. Importantly, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function data sets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
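To make the positive-unlabeled setting concrete, the following minimal sketch shows one common PU correction, the Elkan-Noto rescaling. It is not the method of the paper summarized above. It assumes labeled positives are a random sample of all positives, so that P(positive | x) = P(labeled | x) / c, where c = P(labeled | positive). The function names and all scores are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_positives):
    """Estimate c = P(labeled | positive) as the mean classifier score
    on held-out sequences that are known to be positive."""
    return mean(scores_on_positives)


def pu_corrected_probability(score, c):
    """Rescale P(labeled | x) to P(positive | x) = score / c, clipped to [0, 1]."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    return min(1.0, max(0.0, score / c))


# Hypothetical outputs of a classifier trained to separate labeled
# (selected) sequences from unlabeled (initial library) sequences.
held_out_positive_scores = [0.5, 0.25, 0.75]
c = estimate_label_frequency(held_out_positive_scores)

unlabeled_scores = [0.1, 0.4, 0.6]
corrected = [pu_corrected_probability(s, c) for s in unlabeled_scores]
print(c, corrected)
```

The clipping step matters because a raw score can exceed the estimated c, which would otherwise produce a probability above 1.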

2020 ◽  
Author(s):  
Sam Gelman ◽  
Philip A. Romero ◽  
Anthony Gitter

Abstract: The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Our software is available from https://github.com/gitter-lab/nn4dms.


2021 ◽  
Author(s):  
Sergey Levchenko ◽  
Yaqiong Zhong ◽  
Xiaojuan Hu ◽  
Debalaya Sarker ◽  
Qingrui Xia ◽  
...  

Abstract: Thermoelectric (TE) materials are among the very few sustainable yet feasible energy solutions of the present time. This huge promise of energy harvesting is contingent on identifying or designing materials with higher efficiency than those presently available. However, due to the vastness of the chemical space of materials, only a small fraction of it has been scanned experimentally and/or computationally so far. Employing a compressed-sensing based symbolic regression in an active-learning framework, we have not only identified a trend in materials’ compositions for superior TE performance, but have also predicted and experimentally synthesized several extremely high-performing novel TE materials. Among these, we found polycrystalline p-type Cu0.45Ag0.55GaTe2 to possess an experimental figure of merit as high as ~2.8 at 827 K. This is a breakthrough in the field, because all previously known thermoelectric materials with a comparable figure of merit are either unstable or much more difficult to synthesize, rendering them unusable in large-scale applications. The presented methodology demonstrates the importance and tremendous potential of physically informed descriptors in materials science, in particular for the relatively small data sets typically available from experiments under well-controlled conditions.


2021 ◽  
Author(s):  
Jaspreet Singh ◽  
Thomas Litfin ◽  
Jaswinder Singh ◽  
Kuldip Paliwal ◽  
Yaoqi Zhou

Motivation: Accurate prediction of protein contact maps is essential for accurate protein structure and function prediction. As a result, many methods have been developed for protein contact map prediction. However, most contact map prediction methods rely on protein sequence evolutionary information, which may not exist for many proteins due to a lack of sequence homology. Moreover, generating evolutionary profiles is computationally intensive and time consuming. Therefore, we developed a contact map predictor that uses the output of the pre-trained language model ESM-1B as input, along with a large training set and an ensemble of residual neural networks. Results: We showed that the proposed method makes a significant improvement over the single-sequence-based predictor SSCpred, with a 15% improvement in the F1-score on the independent CASP14-FM test set. It also outperforms the evolutionary-profile-based methods TrRosetta and SPOT-Contact, with 48.7% and 48.5% respective improvements in the F1-score on the proteins in the SPOT-2018 set without homologs (Neff=1). The new method provides a much faster and reasonably accurate alternative to profile-based methods, useful in particular for large-scale prediction.
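For readers unfamiliar with the metric, the F1-score reported above can be computed for a contact map as in the following minimal sketch. It assumes contacts are represented as unordered pairs of residue indices. The function name and the example pairs are hypothetical, and the cited work's evaluation protocol (for example, sequence-separation cutoffs) is not reproduced here.

```python
def contact_f1(predicted, native):
    """F1-score between predicted and native contacts.

    Each contact is a pair of residue indices; (i, j) and (j, i)
    are treated as the same contact.
    """
    predicted = {tuple(sorted(pair)) for pair in predicted}
    native = {tuple(sorted(pair)) for pair in native}
    true_positives = len(predicted & native)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(native)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predicted contacts are among 4 native contacts.
native_contacts = [(1, 5), (2, 8), (3, 9), (4, 10)]
predicted_contacts = [(5, 1), (2, 8), (3, 7)]
score = contact_f1(predicted_contacts, native_contacts)
print(round(score, 4))
```

Here precision is 2/3 and recall is 2/4, giving an F1-score of 4/7.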


2020 ◽  
Vol 34 (01) ◽  
pp. 354-361 ◽  
Author(s):  
Chidubem Arachie ◽  
Manas Gaur ◽  
Sam Anzaroot ◽  
William Groves ◽  
Ke Zhang ◽  
...  

Social media plays a major role during and after major natural disasters (e.g., hurricanes, large-scale fires, etc.), as people “on the ground” post useful information on what is actually happening. Given the large volume of posts, a major challenge is identifying the information that is useful and actionable. Emergency responders are largely interested in finding out what events are taking place so they can properly plan and deploy resources. In this paper we address the problem of automatically identifying important sub-events (within a large-scale emergency “event”, such as a hurricane). In particular, we present a novel, unsupervised learning framework to detect sub-events in Tweets for retrospective crisis analysis. We first extract noun-verb pairs and phrases from raw tweets as sub-event candidates. Then, we learn a semantic embedding of the extracted noun-verb pairs and phrases, and rank them against a crisis-specific ontology. We filter out noisy and irrelevant information, then cluster the noun-verb pairs and phrases so that the top-ranked ones describe the most important sub-events. Through quantitative experiments on two large crisis data sets (Hurricane Harvey and the 2015 Nepal Earthquake), we demonstrate the effectiveness of our approach over the state-of-the-art. Our qualitative evaluation shows better performance compared to our baseline.


2011 ◽  
Vol 271-273 ◽  
pp. 1451-1454
Author(s):  
Gang Zhang ◽  
Jian Yin ◽  
Liang Lun Cheng ◽  
Chun Ru Wang

Teaching quality is a key metric in evaluating the effectiveness and ability of college teaching. In much of the previous literature, evaluation of this metric depends solely on the subjective judgment of a few experts based on their experience, which leads to false, biased, or unstable results. Moreover, purely human-based evaluation is expensive and difficult to extend to large scale. With the application of information technology, much information about college teaching is recorded and stored electronically, which provides the basis for computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning from and modeling electronic information associated with teaching quality, to obtain a stable model that describes the underlying principles of teaching quality. An Artificial Neural Network (ANN) is selected as the main model in this work. Experimental results on real data sets consisting of 4 subjects / 8 semesters show the effectiveness of the proposed method.


2019 ◽  
Vol 21 (2) ◽  
pp. 676-686 ◽  
Author(s):  
Siyuan Chen ◽  
Chengzhi Ren ◽  
Jingjing Zhai ◽  
Jiantao Yu ◽  
Xuyang Zhao ◽  
...  

Abstract: A widely used approach in transcriptome analysis is the alignment of short reads to a reference genome. However, owing to the deficiencies of specially designed analytical systems, short reads unmapped to the genome sequence are usually ignored, resulting in the loss of significant biological information and insights. To fill this gap, we present Comprehensive Assembly and Functional annotation of Unmapped RNA-Seq data (CAFU), a Galaxy-based framework that can facilitate the large-scale analysis of unmapped RNA sequencing (RNA-Seq) reads from single- and mixed-species samples. By taking advantage of machine learning techniques, CAFU addresses the issue of accurately identifying the species origin of transcripts assembled using unmapped reads from mixed-species samples. CAFU also represents an innovation in that it provides a comprehensive collection of functions required for transcript confidence evaluation, coding potential calculation, sequence and expression characterization and function annotation. These functions and their dependencies have been integrated into a Galaxy framework that provides access to CAFU via a user-friendly interface, dramatically simplifying complex exploration tasks involving unmapped RNA-Seq reads. CAFU has been validated with RNA-Seq data sets from wheat and Zea mays (maize) samples. CAFU is freely available via GitHub: https://github.com/cma2015/CAFU.


2019 ◽  
Author(s):  
Samuel Pawel ◽  
Leonhard Held

Throughout the last decade, the so-called replication crisis has stimulated many researchers to conduct large-scale replication projects. With data from four of these projects, we computed probabilistic forecasts of the replication outcomes, which we then evaluated regarding discrimination, calibration and sharpness. A novel model, which can take into account both inflation and heterogeneity of effects, was used and predicted the effect estimate of the replication study with good performance in two of the four data sets. In the other two data sets, predictive performance was still substantially improved compared to the naive model which does not consider inflation and heterogeneity of effects. The results suggest that many of the estimates from the original studies were inflated, possibly caused by publication bias or questionable research practices, and also that some degree of heterogeneity between original and replication effects should be expected. Moreover, the results indicate that the use of statistical significance as the only criterion for replication success may be questionable, since from a predictive viewpoint, non-significant replication results are often compatible with significant results from the original study. The developed statistical methods as well as the data sets are available in the R package ReplicationSuccess.


2021 ◽  
Vol 118 (48) ◽  
pp. e2104878118
Author(s):  
Sam Gelman ◽  
Sarah A. Fahlberg ◽  
Pete Heinzelman ◽  
Philip A. Romero ◽  
Anthony Gitter

The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence–function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence–function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.


2014 ◽  
Vol 155 (26) ◽  
pp. 1011-1018 ◽  
Author(s):  
György Végvári ◽  
Edina Vidéki

Plants seem rather defenceless: unlike animals, they cannot move and have no nervous or immune system. Nevertheless, plants do have hormones, although these substances are not produced in glands. Although plants lag behind animals in complexity, plant organisms show large-scale integration in their structure and function. In higher plants, as in animals, intercellular communication is carried out by chemical messengers. These specific compounds in plants are called phytohormones or, in a broader sense, bioregulators. Even a small quantity of these endogenous organic compounds can regulate the operation, growth and development of higher plants, and maintain the connection between cells and tissues and the synergy between organs. Since plants have no nervous or immune systems, phytohormones play an essential role in their life. Orv. Hetil., 2014, 155(26), 1011–1018.

