Assessing predictors for new post translational modification sites: a case study on hydroxylation

Abstract: Post-translational modification (PTM) sites have become popular targets for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and from their sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. As a proxy for an effective experimental design, predictors require both high specificity and sensitivity. However, the self-reported performance is often not indicative of prediction quality, and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance widely overestimates the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating that the refined models are not sufficiently general to detect new sites. The number of false positives is high and precision is low, in particular for non-collagen proteins whose motifs are not conserved. In short, existing predictors for hydroxylation sites do not appear to generalize to new data. Caution is advised when dealing with PTM predictors in the absence of independent evaluations, in particular for unique specific sites such as those involved in signalling.

Author Summary: Machine learning methods are extensively used by biologists to design and interpret experiments. Predictors which take only the sequence as input are of particular interest due to the large amount of sequence data available, and their self-reported performance is often very high. In this work, we evaluated post-translational modification (PTM) predictors for hydroxylation sites and found that they perform no better than random, in strong contrast to the performance reported in the original publications.
PTMs are chemical alterations of amino acids that provide the cell with conditional mechanisms to fine-tune protein function, thereby regulating complex biological processes such as signalling and the cell cycle. Hydroxylation sites are a good PTM test case due to the availability of a range of predictors and an abundance of newly experimentally detected modification sites. The poor performance in our results highlights the overlooked problem of predicting PTMs when best practices are not followed and training data are likely incomplete. Experimentalists should be careful when using PTM predictors blindly, and more independent assessments are needed to separate the wheat from the chaff in the field.
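The link between sparse modification sites and low precision can be illustrated with Bayes' rule. The sketch below is not taken from the study: the function name `precision_from_rates` and the numbers (90% sensitivity, 90% specificity, 1% of candidate residues modified) are illustrative assumptions chosen only to show the effect.

```python
def precision_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) implied by Bayes' rule.

    prevalence is the fraction of candidate residues that are truly modified.
    """
    true_pos = sensitivity * prevalence                   # expected true positive fraction
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # expected false positive fraction
    if true_pos + false_pos == 0:
        raise ValueError("the predictor never predicts a positive")
    return true_pos / (true_pos + false_pos)


# Hypothetical predictor: 90% sensitivity and specificity, 1% of residues modified.
example = precision_from_rates(0.90, 0.90, 0.01)
print(f"precision = {example:.3f}")
```

Under these assumed rates, only about one prediction in twelve would be a true site, which is why a predictor with seemingly good sensitivity and specificity can still generate mostly false positives when sites are rare.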


TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding

10.1101/2020.09.27.315937 ◽

2020 ◽

Author(s):

Yue Cao ◽

Yang Shen

Keyword(s):

High Throughput ◽

Protein Function ◽

Sequence Data ◽

Sequence Similarity ◽

Directed Graphs ◽

Training Data ◽

Supplementary Information ◽

Function Annotation ◽

Source Codes ◽

Protein Function Annotation

Abstract. Motivation: Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on data besides sequences, or lack generalizability to novel sequences, species and functions. Results: To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we also embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low homology and never/rarely annotated novel species or functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability: The data, source codes and models are available at https://github.com/Shen-Lab/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding

Bioinformatics ◽

10.1093/bioinformatics/btab198 ◽

2021 ◽

Author(s):

Yue Cao ◽

Yang Shen

Keyword(s):

High Throughput ◽

Protein Function ◽

Sequence Data ◽

Sequence Similarity ◽

Directed Graphs ◽

Training Data ◽

Supplementary Information ◽

Sequence Information ◽

Function Annotation ◽

Protein Function Annotation

Abstract. Motivation: Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions. Results: To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability: The data, source codes and models are available at https://github.com/Shen-Lab/TALE Supplementary information: Supplementary data are available at Bioinformatics online.


Prospective Application of Aptamer-based Assays and Therapeutics in Bloodstream Infections

Mini-Reviews in Medicinal Chemistry ◽

10.2174/1389557520666200212105813 ◽

2020 ◽

Vol 20 (10) ◽

pp. 831-840

Author(s):

Weibin Li

Keyword(s):

Pathogenic Bacteria ◽

Genetic Diagnosis ◽

Bloodstream Infections ◽

High Specificity ◽

Peptide Sequence ◽

High Morbidity ◽

Specificity And Sensitivity ◽

Prospective Application ◽

Small Molecule Therapeutics

Sepsis is still a severe health problem worldwide with high morbidity and mortality. Blood bacterial culture remains the gold standard for the detection of pathogenic bacteria in bloodstream infections, but it is time-consuming and requires both sophisticated equipment and well-trained personnel. Immunoassays and genetic diagnosis are expensive and limited in specificity and sensitivity. Aptamers are single-stranded deoxyribonucleic acid (ssDNA) or ribonucleic acid (RNA) oligonucleotides or peptide sequences generated in vitro, based on aptamer-target binding affinity, by a process known as Systematic Evolution of Ligands by Exponential Enrichment (SELEX). With several advantages over monoclonal antibodies and other conventional small-molecule therapeutics, such as high specificity and affinity, negligible batch-to-batch variation, flexible modification and production, thermal stability, low immunogenicity and lack of toxicity, aptamers are becoming promising novel diagnostic and therapeutic agents. This review describes the prospective application of aptamer-based laboratory diagnostic assays and therapeutics for pathogenic bacteria and toxins in bloodstream infections.


CirBiTree: Citrullination Site Inference Based on a Fuzzy Neural Network and Flexible Neural Tree

Scientific Programming ◽

10.1155/2020/8847694 ◽

2020 ◽

Vol 2020 ◽

pp. 1-8

Author(s):

Chuandong Song ◽

Haifeng Wang

Keyword(s):

Neural Network ◽

Fuzzy Neural Network ◽

Classification Model ◽

Peptide Sequence ◽

Sequence Information ◽

Post Translational Modification ◽

Fuzzy Neural ◽

Human Complex ◽

Better Than

Emerging evidence demonstrates that post-translational modification plays an important role in several complex human diseases. Nevertheless, considering the inherent high cost and time consumption of classical in vitro experiments, increasing attention has been paid to the development of efficient and accessible computational tools to identify potential modification sites at the protein level. In this work, we propose a machine learning-based model called CirBiTree for identifying potential citrullination sites. More specifically, we first use the biprofile Bayesian method to extract peptide sequence information. Then, a flexible neural tree and a fuzzy neural network are employed as the classification model. Finally, the most suitable length of the identified peptides is selected for this model. To evaluate the performance of the proposed method, several state-of-the-art methods have been employed for comparison. The experimental results demonstrate that the proposed method is better than the other methods. CirBiTree achieves 83.07% in sensitivity (Sn), 80.50% in specificity (Sp), 0.8201 in F1, and 0.6359 in MCC.
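For readers unfamiliar with the four reported measures, the following minimal sketch shows how sensitivity, specificity, F1 and the Matthews correlation coefficient (MCC) are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the CirBiTree evaluation; the sketch assumes both classes are present in the test set.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # fraction of true sites recovered
    specificity = tn / (tn + fp)           # fraction of non-sites correctly rejected
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sensitivity, "sp": specificity, "f1": f1, "mcc": mcc}


# Hypothetical counts, for illustration only.
scores = binary_metrics(tp=8, fp=1, tn=9, fn=2)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance than accuracy, which is one reason it is commonly reported for PTM site predictors.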


Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks

BMC Bioinformatics ◽

10.1186/s12859-021-04101-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yingxi Yang ◽

Hui Wang ◽

Wen Li ◽

Xiaobo Wang ◽

Shizhao Wei ◽

...

Keyword(s):

Correlation Coefficient ◽

Sequence Data ◽

Rapid Development ◽

Pearson Correlation ◽

Structural Features ◽

Generative Adversarial Networks ◽

Post Translational Modification ◽

Generative Adversarial Network ◽

Data Imbalance ◽

Adversarial Network

Abstract. Background: Protein post-translational modification (PTM) is key to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and analysis of PTMs in proteins. Method: We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by the Pearson correlation coefficient. To solve the data imbalance problem, two influential deep generative methods, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), were leveraged and compared to generate new samples for the types with fewer samples. Finally, a random forest algorithm was used to predict the seven categories. Results: In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions: The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM.
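The abstract does not specify how the Pearson correlation filter was applied. One common reading is redundancy removal, in which a feature is dropped when it is highly correlated with a feature already kept. The sketch below illustrates that reading only; the helper names `pearson` and `drop_correlated`, the threshold of 0.9 and the toy feature columns are all assumptions, not details of MultiLyGAN.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature is treated as uncorrelated here
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns: list, threshold: float = 0.9) -> list:
    """Return indices of features kept after greedy redundancy filtering."""
    kept = []
    for j, col in enumerate(columns):
        # Keep feature j only if it is not strongly correlated with any kept feature.
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


# Toy feature columns (one list per feature), for illustration only.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 4.0, 6.2, 7.9, 10.0],  # almost a rescaled copy of the first feature
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
print(drop_correlated(features))
```

The greedy order matters: the first feature of each correlated group is retained, so a real pipeline would normally rank features by relevance before filtering.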


Post translational modifications of milk proteins in geographically diverse goat breeds

Scientific Reports ◽

10.1038/s41598-021-85094-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

P. K. Rout ◽

M. Verma

Keyword(s):

Protein Function ◽

Whey Proteins ◽

Milk Proteins ◽

Goat Milk ◽

Peptide Sequence ◽

Cow Milk ◽

Post Translational Modification ◽

Post Translational Modifications ◽

Functional Roles ◽

Dpp Iv

Abstract: Goat milk is a source of nutrition in difficult areas and is less allergenic than cow milk. It is a leading candidate for nutraceutical formulation and for drug development using the goat mammary gland as a bioreactor. Post translational modifications of a protein regulate its function, biological activity, stabilization and interactions. The protein variants of goat milk from 10 breeds were studied for post translational modifications by combining highly sensitive 2DE and Q-Exactive LC-MS/MS. Here we observed high levels of post translational modifications in 201 peptides of 120 goat milk proteins. The phosphosites observed for CSN2, CSN1S1, CSN1S2 and CSN3 were 11P, 13P, 17P and 6P, respectively, in 105 casein phosphopeptides. Whey proteins BLG and LALBA showed 19 and 4 phosphosites, respectively. Post translational modification was observed in 45 low abundant non-casein milk proteins mainly associated with signal transduction, immune system, developmental biology and metabolism pathways. Pasp is reported for the first time in 47 sites. The rare conserved peptide sequence (SSSEE) was observed in αS1 and αS2 casein. The functional roles of the identified phosphopeptides included anti-microbial, DPP-IV inhibitory, anti-inflammatory and ACE inhibitory activity. This is the first report from the tropics investigating post translational modifications in casein and non-casein goat milk proteins and studying their interactions.


SARS-CoV-2 specific antibody and neutralization assays reveal the wide range of the humoral immune response to virus

Communications Biology ◽

10.1038/s42003-021-01649-6 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Mikail Dogan ◽

Lina Kozhaya ◽

Lindsey Placek ◽

Courtney Gunter ◽

Mesut Yigit ◽

...

Keyword(s):

Specific Antibody ◽

Neutralizing Antibodies ◽

Vaccine Development ◽

High Specificity ◽

Spike Protein ◽

Humoral Immune ◽

Specificity And Sensitivity ◽

Wide Range ◽

Specific Igg ◽

Negative Controls

Abstract: Development of antibody protection during SARS-CoV-2 infection is a pressing question for public health and for vaccine development. We developed highly sensitive SARS-CoV-2-specific antibody and neutralization assays. SARS-CoV-2 Spike protein or Nucleocapsid protein specific IgG antibodies at titers more than 1:100,000 were detectable in all PCR+ subjects (n = 115) and were absent in the negative controls. Other isotype antibodies (IgA, IgG1-4) were also detected. SARS-CoV-2 neutralization was determined in COVID-19 and convalescent plasma at up to 10,000-fold dilution, using Spike protein pseudotyped lentiviruses, which were also blocked by neutralizing antibodies (NAbs). Hospitalized patients had up to 3000-fold higher antibody and neutralization titers compared to outpatients or convalescent plasma donors. Interestingly, some COVID-19 patients also possessed NAbs against SARS-CoV Spike protein pseudovirus. Together these results demonstrate the high specificity and sensitivity of our assays, which may impact understanding the quality or duration of the antibody response during COVID-19 and in determining the effectiveness of potential vaccines.


NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽

10.3390/s21072503 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2503

Author(s):

Taro Suzuki ◽

Yoshiharu Amano

Keyword(s):

Machine Learning ◽

Satellite System ◽

Training Data ◽

Support Vector ◽

Positioning Errors ◽

Automated Method ◽

Global Navigation Satellite ◽

Better Than ◽

Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, and features of this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. We also propose an automated method of collecting LOS and NLOS signals as training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that NN was better than SVM, and 97.7% of NLOS signals were correctly discriminated.


Highly Sensitive Fluorescent Detection of Acetylcholine Based on the Enhanced Peroxidase-Like Activity of Histidine Coated Magnetic Nanoparticles

Nanomaterials ◽

10.3390/nano11051207 ◽

2021 ◽

Vol 11 (5) ◽

pp. 1207

Author(s):

Hong Jae Cheon ◽

Quynh Huong Nguyen ◽

Moon Il Kim

Keyword(s):

Magnetic Nanoparticles ◽

High Sensitivity ◽

High Specificity ◽

Choline Oxidase ◽

Fluorescent Detection ◽

One Pot ◽

Site Structure ◽

Specificity And Sensitivity ◽

Highly Sensitive ◽

Peroxidase Mimics

Inspired by the active site structure of natural horseradish peroxidase having iron as a pivotal element with coordinated histidine residues, we have developed histidine coated magnetic nanoparticles (His@MNPs) with relatively uniform and small sizes (less than 10 nm) through one-pot heat treatment. In comparison to pristine MNPs and other amino acid coated MNPs, His@MNPs exhibited a considerably enhanced peroxidase-imitating activity, approaching 10-fold higher in catalytic reactions. With the high activity, His@MNPs then were exploited to detect the important neurotransmitter acetylcholine. By coupling choline oxidase and acetylcholine esterase with His@MNPs as peroxidase mimics, target choline and acetylcholine were successfully detected via fluorescent mode with high specificity and sensitivity with the limits of detection down to 200 and 100 nM, respectively. The diagnostic capability of the method is demonstrated by analyzing acetylcholine in human blood serum. This study thus demonstrates the potential of utilizing His@MNPs as peroxidase-mimicking nanozymes for detecting important biological and clinical targets with high sensitivity and reliability.


CrossGR

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies ◽

10.1145/3448100 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-23

Author(s):

Xinyi Li ◽

Liqiong Chang ◽

Fangfang Song ◽

Ju Wang ◽

Xiaojiang Chen ◽

...

Keyword(s):

Gesture Recognition ◽

Low Cost ◽

User Involvement ◽

Recognition System ◽

Training Data ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Target User ◽

Order Of Magnitude ◽

Training Examples

This paper focuses on a fundamental question in Wi-Fi-based gesture recognition: "Can we use the knowledge learned from some users to perform gesture recognition for others?". This problem is also known as cross-target recognition. It arises in many practical deployments of Wi-Fi-based gesture recognition where it is prohibitively expensive to collect training data from every single user. We present CrossGR, a low-cost cross-target gesture recognition system. As a departure from existing approaches, CrossGR does not require prior knowledge (such as who is currently performing a gesture) of the target user. Instead, CrossGR employs a deep neural network to extract user-agnostic but gesture-related Wi-Fi signal characteristics to perform gesture recognition. To provide sufficient training data to build an effective deep learning model, CrossGR employs a generative adversarial network to automatically generate a large amount of synthetic training data from a small set of real-world examples collected from a small number of users. Such a strategy allows CrossGR to minimize the user involvement and the associated cost in collecting training examples for building an accurate gesture recognition system. We evaluate CrossGR by applying it to perform gesture recognition across 10 users and 15 gestures. Experimental results show that CrossGR achieves an accuracy of over 82.6% (up to 99.75%). We demonstrate that CrossGR delivers comparable recognition accuracy, but uses an order of magnitude fewer training samples collected from the end-users when compared to state-of-the-art recognition systems.
