Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Aparna Elangovan ◽  
Yuan Li ◽  
Douglas E. V. Pires ◽  
Melissa J. Davis ◽  
Karin Verspoor

Abstract Motivation Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using distantly supervised training data and deep learning to aid human curation. Method We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Results and conclusion The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million PTM-PPI predictions (546,507 unique PTM-PPI triplets), and filtered ≈5,700 (4,584 unique) high-confidence predictions. Of the ≈5,700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration.
We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
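The high-confidence filtering described above, combining ensemble average confidence with low across-model variation, can be sketched as follows. This is an illustrative sketch, not the authors' code; the thresholds and array shapes are hypothetical:

```python
import numpy as np

def filter_high_confidence(ensemble_probs, min_mean=0.9, max_std=0.05):
    """Keep predictions whose ensemble mean confidence is high and whose
    across-model variation is low.

    ensemble_probs: shape (n_models, n_predictions), each model's confidence
    for its predicted class on each example. Returns a boolean mask.
    """
    probs = np.asarray(ensemble_probs, dtype=float)
    mean_conf = probs.mean(axis=0)   # average confidence across the ensemble
    std_conf = probs.std(axis=0)     # disagreement between ensemble members
    return (mean_conf >= min_mean) & (std_conf <= max_std)

# Three models, four predictions: only the first prediction is both
# confidently and consistently scored, so only it survives the filter.
probs = [[0.97, 0.95, 0.60, 0.99],
         [0.96, 0.70, 0.62, 0.98],
         [0.95, 0.92, 0.58, 0.80]]
mask = filter_high_confidence(probs)
```

Requiring low variation on top of high average confidence is what counteracts the class-imbalance effect: a single over-confident ensemble member cannot push an example through the filter on its own.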


2021 ◽  
Vol 13 (11) ◽  
pp. 2220
Author(s):  
Yanbing Bai ◽  
Wenqi Wu ◽  
Zhengxin Yang ◽  
Jinze Yu ◽  
Bo Zhao ◽  
...  

Identifying permanent water and temporary water in flood disasters efficiently has mainly relied on change detection methods applied to multi-temporal remote sensing imagery, but estimating the water type in flood disaster events from post-flood imagery alone remains challenging. Research progress in recent years has demonstrated the excellent potential of multi-source data fusion and deep learning algorithms in improving flood detection, but the field is still in its early stages due to the lack of large-scale labelled remote sensing images of flood events. Here, we present new deep learning algorithms and a multi-source data fusion driven flood inundation mapping approach by leveraging the large-scale publicly available Sen1Flood11 dataset, consisting of 4,831 labelled Sentinel-1 SAR and Sentinel-2 optical images gathered from flood events worldwide in recent years. Specifically, we propose an automatic segmentation method for surface water, permanent water, and temporary water identification, with all tasks sharing the same convolutional neural network architecture. We utilize focal loss to deal with the class (water/non-water) imbalance problem. Thorough ablation experiments and analysis confirmed the effectiveness of the various proposed designs. In comparison experiments, the method proposed in this paper is superior to other classical models. Our model achieves a mean Intersection over Union (mIoU) of 52.99%, Intersection over Union (IoU) of 52.30%, and Overall Accuracy (OA) of 92.81% on the Sen1Flood11 test set. On the Sen1Flood11 Bolivia test set, our model also achieves very high mIoU (47.88%), IoU (76.74%), and OA (95.59%) and shows good generalization ability.
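The focal loss used here to handle the water/non-water imbalance down-weights easy, well-classified pixels so the rarer class dominates the gradient. A minimal binary sketch (not the paper's implementation; the `gamma` and `alpha` defaults follow common practice and are assumptions):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    confident correct predictions contribute almost nothing.

    p: predicted probability of the positive (water) class; y: 0/1 label.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, dtype=float)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # per-class weighting
    return -at * (1 - pt) ** gamma * np.log(pt)

# An easy example is strongly down-weighted; a hard one keeps most
# of its cross-entropy loss.
easy = focal_loss(0.99, 1)
hard = focal_loss(0.10, 1)
```

With `gamma=0` and `alpha=0.5` this reduces to (half of) ordinary cross-entropy, which makes the modulating effect easy to verify.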


2021 ◽  
Vol 13 (3) ◽  
pp. 364
Author(s):  
Han Gao ◽  
Jinhui Guo ◽  
Peng Guo ◽  
Xiuwan Chen

Recently, deep learning has become the most innovative trend for a variety of high-spatial-resolution remote sensing imaging applications. However, large-scale land cover classification via traditional convolutional neural networks (CNNs) with sliding windows is computationally expensive and produces coarse results. Additionally, although such supervised learning approaches have performed well, collecting and annotating datasets for every task is extremely laborious, especially for fully supervised cases where the pixel-level ground-truth labels are dense. In this work, we propose a new object-oriented deep learning framework that leverages residual networks with different depths to learn adjacent feature representations by embedding a multibranch architecture in the deep learning pipeline. The idea is to exploit limited training data at different neighboring scales to make a tradeoff between weak semantics and strong feature representations for operational land cover mapping tasks. We draw on established geographic object-based image analysis (GEOBIA) as an auxiliary module to reduce the computational burden of spatial reasoning and optimize the classification boundaries. We evaluated the proposed approach on two subdecimeter-resolution datasets involving both urban and rural landscapes. It achieved better classification accuracy (88.9%) than traditional object-based deep learning methods and an excellent inference time (11.3 s/ha).
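The core idea of combining features from different neighboring scales can be illustrated with a toy sketch. This is not the paper's residual-network architecture; it only shows the multi-scale tradeoff the abstract describes, where small windows carry strong local detail and large windows carry weaker but wider context:

```python
import numpy as np

def multiscale_features(image, row, col, scales=(1, 2, 4)):
    """Concatenate per-channel mean features from square neighbourhoods of
    increasing radius around a pixel, one feature vector per scale.

    image: (H, W, C) array; scales: neighbourhood radii. Assumes the pixel
    is far enough from the border for the chosen radii.
    """
    feats = []
    for r in scales:
        r0, r1 = max(row - r, 0), row + r + 1
        c0, c1 = max(col - r, 0), col + r + 1
        patch = image[r0:r1, c0:c1]
        feats.append(patch.mean(axis=(0, 1)))  # per-channel mean
    return np.concatenate(feats)

img = np.zeros((16, 16, 3))
img[8, 8] = 1.0  # a single bright pixel
f = multiscale_features(img, 8, 8)
```

The bright pixel dominates the small-radius feature (mean 1/9 over a 3x3 window) but is diluted at larger radii, which is exactly the detail-versus-context tradeoff a multibranch network learns to balance.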


Author(s):  
Lok Man ◽  
William P. Klare ◽  
Ashleigh L. Dale ◽  
Joel A. Cain ◽  
Stuart J. Cordwell

Despite being considered the simplest form of life, bacteria remain enigmatic, particularly in light of pathogenesis and evolving antimicrobial resistance. After three decades of genomics, we remain some way from understanding these organisms, and a substantial proportion of genes remain functionally unknown. Methodological advances, principally mass spectrometry (MS), are paving the way for parallel analysis of the proteome, metabolome and lipidome. Each provides a global, complementary assay, in addition to genomics, and the ability to better comprehend how pathogens respond to changes in their internal (e.g. mutation) and external environments consistent with infection-like conditions. Such responses include accessing necessary nutrients for survival in a hostile environment where co-colonizing bacteria and normal flora are acclimated to the prevailing conditions. Multi-omics can be harnessed across temporal and spatial (sub-cellular) dimensions to understand adaptation at the molecular level. Gene deletion libraries, in conjunction with large-scale approaches and evolving bioinformatics integration, will greatly facilitate next-generation vaccines and antimicrobial interventions by highlighting novel targets and pathogen-specific pathways. MS is also central in phenotypic characterization of surface biomolecules such as lipid A, as well as aiding in the determination of protein interactions and complexes. There is increasing evidence that bacteria are capable of widespread post-translational modification, including phosphorylation, glycosylation and acetylation; with each contributing to virulence. This review focuses on the bacterial genotype to phenotype transition and surveys the recent literature showing how the genome can be validated at the proteome, metabolome and lipidome levels to provide an integrated view of organism response to host conditions.


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high RI for applications in opto-electronics.
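The "metric popular in web search engines" used to evaluate top-candidate ranking is not named in the abstract; normalised discounted cumulative gain (NDCG) is the standard such metric, so the sketch below assumes NDCG. It rewards placing high-relevance (here, high-RI) items near the top of the ranked list:

```python
import math

def ndcg_at_k(relevances, k):
    """Normalised discounted cumulative gain over the top-k of a ranking.

    relevances: graded relevance scores in the order the model ranked the
    items. Returns 1.0 for a perfect ordering, less otherwise.
    """
    def dcg(rels):
        # Gains are discounted logarithmically by rank position.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # already in ideal order
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)
```

Because the discount decays with rank, swapping the top two candidates costs far more NDCG than swapping two items near position 1,000, matching the study's emphasis on getting the very top of the list right.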


2020 ◽  
Vol 34 (05) ◽  
pp. 9193-9200
Author(s):  
Shaolei Wang ◽  
Wanxiang Che ◽  
Qi Liu ◽  
Pengda Qin ◽  
Ting Liu ◽  
...  

Most existing approaches to disfluency detection heavily rely on human-annotated data, which is expensive to obtain in practice. To tackle the training data bottleneck, we investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks where data can be collected without manual labeling. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled news data, and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically-incorrect sentences. We then combine these two tasks to jointly train a network. The pre-trained network is then fine-tuned using human-annotated disfluency detection training data. Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to previous systems (trained using the full dataset) by using less than 1% (1,000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.
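The pseudo-data construction for the tagging task can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the insertion probability and filler vocabulary are hypothetical, and the deletion-based corruption is omitted for brevity:

```python
import random

def make_pseudo_example(words, vocab, p_add=0.3, rng=None):
    """Corrupt a clean sentence by randomly inserting vocabulary words.

    Inserted words are tagged 1 ('added noise'); original words are tagged 0.
    The tagging pre-training task then learns to recover these labels.
    """
    rng = rng or random.Random(0)
    out_words, tags = [], []
    for w in words:
        if rng.random() < p_add:
            out_words.append(rng.choice(vocab))  # inserted noise word
            tags.append(1)
        out_words.append(w)                      # original word
        tags.append(0)
    return out_words, tags

sent = "the cat sat on the mat".split()
noisy, tags = make_pseudo_example(sent, vocab=["uh", "um", "well"])
```

The key property is that the clean sentence is always recoverable by dropping the words tagged 1, which is what makes the labels free: no human annotation is needed to produce arbitrarily large training sets.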


2021 ◽  
Author(s):  
Ronghui You ◽  
Wei Qu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Computationally predicting MHC-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences: (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring the biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer (BICL), which allows integrating all potential binding cores (in a given peptide) and the MHC pseudo (binding) sequence, by modeling the interaction with multiple convolutional kernels. Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as five-fold cross-validation, leave-one-molecule-out, validation with independent testing sets, and binding core prediction. All these results, together with visualization of the predicted binding cores, indicate the effectiveness and importance of properly modeling biological facts in deep learning for high performance and knowledge discovery. DeepMHCII is publicly available at https://weilab.sjtu.edu.cn/DeepMHCII/.
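The "consider all potential binding cores, keep the best" idea behind BICL can be illustrated with a toy sketch: score every contiguous 9-mer window of an encoded peptide against a convolutional kernel and max-pool over positions. This is not the DeepMHCII implementation; the encodings and kernel are hypothetical:

```python
import numpy as np

def best_core_score(peptide_enc, kernel, core_len=9):
    """Slide a kernel over every candidate binding core (contiguous
    core_len window) of an encoded peptide and max-pool the scores.

    peptide_enc: (length, dim) per-residue encoding; kernel: (core_len, dim).
    Returns (start index of the best core, its score).
    """
    L = peptide_enc.shape[0]
    scores = [float((peptide_enc[i:i + core_len] * kernel).sum())
              for i in range(L - core_len + 1)]
    best = int(np.argmax(scores))
    return best, max(scores)

# A toy 15-residue peptide with a 'motif' occupying positions 3..11;
# a kernel tuned to that motif locates the core exactly.
pep = np.zeros((15, 4))
pep[3:12, 0] = 1.0
kernel = np.zeros((9, 4))
kernel[:, 0] = 1.0
core_start, score = best_core_score(pep, kernel)
```

In the real model, multiple learned kernels play this role and the MHC pseudo-sequence parameterises the interaction, but the max-over-cores structure is the same.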


2018 ◽  
Author(s):  
Yanhui Hu ◽  
Richelle Sopko ◽  
Verena Chung ◽  
Romain A. Studer ◽  
Sean D. Landry ◽  
...  

Abstract Post-translational modification (PTM) serves as a regulatory mechanism for protein function, influencing stability, protein interactions, activity and localization, and is critical in many signaling pathways. The best characterized PTM is phosphorylation, whereby a phosphate is added to an acceptor residue, commonly serine, threonine or tyrosine. As proteins are often phosphorylated at multiple sites, identifying the sites that are important for function is a challenging problem. Considering that many phosphorylation sites may be non-functional, prioritizing evolutionarily conserved phosphosites provides a general strategy to identify putative functional sites with regard to regulation and function. To facilitate the identification of conserved phosphosites, we generated a large-scale phosphoproteomics dataset from Drosophila embryos collected from six closely-related species. We built iProteinDB (https://www.flyrnai.org/tools/iproteindb/), a resource integrating these data with other high-throughput PTM datasets, including vertebrates, and manually curated information for Drosophila. At iProteinDB, scientists can view the PTM landscape for any Drosophila protein and identify predicted functional phosphosites based on a comparative analysis of data from closely-related Drosophila species. Further, iProteinDB enables comparison of PTM data from Drosophila to that of orthologous proteins from other model organisms, including human, mouse, rat, Xenopus laevis, Danio rerio, and Caenorhabditis elegans.
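The conservation-based prioritisation described above amounts to keeping phosphosites observed in enough of the closely-related species. A minimal sketch, assuming a simple per-species site list; the species abbreviations and threshold are hypothetical:

```python
from collections import Counter

def conserved_sites(sites_by_species, min_species=4):
    """Prioritise phosphosites observed in at least min_species of the
    surveyed species, a proxy for functional importance."""
    counts = Counter(site
                     for sites in sites_by_species.values()
                     for site in set(sites))  # count each species once
    return sorted(s for s, c in counts.items() if c >= min_species)

# Toy data for six species: S15 is seen in five of them, T40 in three.
data = {"mel": ["S15", "T40"], "sim": ["S15"], "yak": ["S15", "T40"],
        "ere": ["S15"], "ana": ["S15", "Y99"], "pse": ["T40"]}
top = conserved_sites(data, min_species=4)
```

Real comparisons must first map sites between species via protein alignment; this sketch assumes that coordinate mapping has already been done.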


Author(s):  
Uzma Batool ◽  
Mohd Ibrahim Shapiai ◽  
Nordinah Ismail ◽  
Hilman Fauzi ◽  
Syahrizal Salleh

Silicon wafer defect data collected from fabrication facilities is intrinsically imbalanced because of the variable frequencies of defect types. Frequently occurring types will have more influence on the classification predictions if a model is trained on such skewed data. A fair classifier for such imbalanced data requires a mechanism to deal with type imbalance in order to avoid biased results. This study proposes a convolutional neural network for wafer map defect classification, employing oversampling as a technique to address the imbalance. To give all classes equal participation in the classifier’s training, data augmentation was employed to generate more samples in the minor classes. The proposed deep learning method was evaluated on a real wafer map defect dataset, and its classification results on the test set returned a 97.91% accuracy. The results were compared with another deep learning-based auto-encoder model, demonstrating that the proposed method is a potential approach for silicon wafer defect classification, though its robustness needs further investigation.
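The class-balancing step can be sketched as plain random oversampling. This is an illustration, not the paper's pipeline: in practice the duplicated minority samples would be replaced by augmented variants (flips, rotations) rather than exact copies, as the comment notes:

```python
import random

def oversample(samples, labels, rng=None):
    """Randomly duplicate minority-class samples until every class has as
    many examples as the largest class. A real pipeline would apply image
    augmentation to the duplicates instead of copying them verbatim."""
    rng = rng or random.Random(0)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(v) for v in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_s.extend(group + extra)
        out_y.extend([y] * target)
    return out_s, out_y

# Three 'A' wafer maps and one 'B': after oversampling, both classes
# contribute three examples to training.
X, y = ["a1", "a2", "a3", "b1"], ["A", "A", "A", "B"]
Xb, yb = oversample(X, y)
```

Balancing before training prevents the loss from being dominated by the frequent defect types, which is the bias the abstract warns about.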


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

Abstract The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most of the current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. Firstly, we show the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, the traditional Chinese medicine literature corpus and the i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (a relative improvement of 22.2%, 7.77% and 38.5% in F1 score, respectively, compared with a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
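The 1d-CNN head applied on top of the pre-trained encoder follows the standard text-CNN pattern: convolve a filter over the sequence of token embeddings and max-pool over positions. A toy numpy sketch, not the repository's code; dimensions and values are hypothetical:

```python
import numpy as np

def conv1d_maxpool(token_embs, filt):
    """Apply one 1-D convolution filter over a sequence of token embeddings
    and max-pool over positions, producing one scalar feature. A real head
    uses many filters of several widths, yielding a feature vector.

    token_embs: (seq_len, dim); filt: (width, dim).
    """
    width = filt.shape[0]
    L = token_embs.shape[0]
    acts = [float((token_embs[i:i + width] * filt).sum())
            for i in range(L - width + 1)]
    return max(acts)

# A 6-token sequence where only token 2 carries signal: the max-pooled
# feature fires wherever the filter's window covers that token.
embs = np.zeros((6, 3))
embs[2] = [1.0, 0.0, 0.0]
filt = np.ones((2, 3))
feat = conv1d_maxpool(embs, filt)
```

Max-pooling makes the feature position-invariant, which suits relation extraction, where the trigger phrase can appear anywhere between the two entity mentions.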

