2018 YPIC Challenge: A case study in characterizing an unknown protein sample

For the 2018 YPIC Challenge, contestants were invited to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the questions, contestants were asked to determine the 3D structure and detect any post-translational modifications left by the host organism. We present our experimental and computational strategy to characterize this sample by identifying the unknown protein sequence and detecting the presence of post-translational modifications. The sample was acquired with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules, after which spectral clustering was used to generate high-quality consensus spectra. De novo spectrum identification was used to determine the synthetic protein sequence, and any post-translational modifications introduced by E. coli on the synthetic protein were analyzed via spectral networking. This workflow resulted in a de novo sequence coverage of 70%, on par with sequence database searching performance. Additionally, the spectral networking analysis indicated that no systematic modifications were introduced on the synthetic protein by E. coli. The strategy presented here can be directly used to analyze samples for which no protein sequence information is available or when the identity of the sample is unknown. All software and code to perform the bioinformatics analysis are available as open source, and self-contained Jupyter notebooks are provided to fully recreate the analysis.
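The sequence coverage figure quoted in the abstract can be illustrated with a minimal sketch. The function below is not the authors' notebook code; it assumes coverage is the fraction of protein residues that fall inside at least one identified peptide, and the protein and peptide strings are hypothetical placeholders.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Fraction of residues in `protein` covered by at least one peptide.

    Assumption: peptides are matched by exact substring search, so
    isobaric ambiguities such as I/L are not handled here.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue  # skip empty identifications
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            # look for further (possibly overlapping) occurrences
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


# Hypothetical example: 10 of 16 residues are covered.
example_protein = "WHATISTHEANSWERK"
example_peptides = ["WHAT", "ANSWER"]
coverage = sequence_coverage(example_protein, example_peptides)
```

In a real de novo workflow the reference sequence is itself unknown, so a calculation like this is only possible once the sequence has been revealed or assembled.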


2018 YPIC Challenge: A case study in characterizing an unknown protein sample

10.7287/peerj.preprints.27802v2 ◽

2019 ◽

Author(s):

Lindsay Pino ◽

Andy Lin ◽

Wout Bittremieux

Keyword(s):

Protein Sequence ◽

De Novo ◽

Signal To Noise Ratio ◽

3D Structure ◽

Sequence Information ◽

Sequence Coverage ◽

Post Translational Modifications ◽

E Coli ◽

Unknown Protein ◽

Synthetic Protein


2018 YPIC Challenge: A case study in characterizing an unknown protein sample

10.7287/peerj.preprints.27802v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Lindsay Pino ◽

Andy Lin ◽

Wout Bittremieux

Keyword(s):

De Novo ◽

Signal To Noise Ratio ◽

3D Structure ◽

Sequence Information ◽

Post Translational Modifications ◽

E Coli ◽

Protein Sample ◽

Unknown Protein ◽

Synthetic Protein ◽

Biological Interest

For the 2018 YPIC Challenge, contestants were invited to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the questions, contestants were asked to determine the 3D structure and detect any post-translational modifications left by the host organism. We present how we analyzed this unknown sample using a tryptic digest with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules. Subsequently, spectral clustering was used to generate high-quality consensus spectra and condense the acquired MS/MS spectral data. De novo spectrum identification was used to determine the English questions encoded by the synthetic protein, and any post-translational modifications introduced by E. coli on the synthetic protein were detected using spectral networking. Although the synthetic protein sample for the 2018 YPIC Challenge is not of biological interest, the experimental and computational strategy presented here can be directly used to analyze samples for which no protein sequence information is available or when the identity of the sample is unknown. All software and code to perform the bioinformatics analysis are available as open source, and a self-contained Jupyter notebook is provided to fully recreate the analysis.


All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1419956112 ◽

2015 ◽

Vol 112 (17) ◽

pp. 5413-5418 ◽

Cited By ~ 41

Author(s):

Sikander Hayat ◽

Chris Sander ◽

Debora S. Marks ◽

Arne Elofsson

Keyword(s):

Structure Prediction ◽

De Novo ◽

3D Structure ◽

3D Models ◽

Sequence Information ◽

Sequence Alignments ◽

Residue Contacts ◽

Machine Learning Approach ◽

3D Structure Prediction ◽

Structure Accuracy

Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand–strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases.
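For readers unfamiliar with the TM score used above, the following sketch shows the standard scoring formula for a fixed set of aligned residue pairs. It is an illustration only: it assumes the per-residue distances have already been obtained from a superposition, it does not search for the optimal superposition as dedicated TM-score programs do, and the function names are hypothetical.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 (in Å)."""
    if target_length <= 15:
        # Very short chains need special handling that this sketch omits.
        raise ValueError("this sketch only supports target_length > 15")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: list[float], target_length: int) -> float:
    """TM score for a fixed alignment.

    `distances` holds the distance in Å between each aligned residue pair
    after superposition; unaligned target residues simply contribute zero.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# A perfect model of a 100-residue target scores 1.0.
perfect = tm_score([0.0] * 100, 100)
```

Because each term is normalised by the target length, residues that are missing or far from their native position lower the score without any single outlier dominating it.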


Mass spectrometry-based sequencing of the anti-FLAG-M2 antibody using multiple proteases and a dual fragmentation scheme

10.1101/2021.01.07.425675 ◽

2021 ◽

Author(s):

Weiwei Peng ◽

Matti F Pronker ◽

Joost Snijder

Keyword(s):

Monoclonal Antibody ◽

De Novo ◽

High Energy ◽

De Novo Sequencing ◽

Structural Basis ◽

Sequence Information ◽

Energy Collision ◽

Sequence Coverage ◽

High Energy Collision ◽

Variable Regions

Antibody sequence information is crucial to understanding the structural basis for antigen binding and enables the use of antibodies as therapeutics and research tools. Here we demonstrate a method for direct de novo sequencing of monoclonal IgG from the purified antibody products. The method uses a panel of multiple complementary proteases to generate suitable peptides for de novo sequencing by LC-MS/MS in a bottom-up fashion. Furthermore, we apply a dual fragmentation scheme, using both stepped high-energy collision dissociation (stepped HCD) and electron transfer high-energy collision dissociation (EThcD) on all peptide precursors. The method achieves full sequence coverage of the monoclonal antibody Herceptin, with an accuracy of 98% in the variable regions. We applied the method to sequence the widely used anti-FLAG-M2 mouse monoclonal antibody, which we successfully validated by remodeling a high-resolution crystal structure of the Fab and demonstrating binding to a FLAG-tagged target protein in Western blot analysis. The method thus offers robust and reliable sequences of monoclonal antibodies.


Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model

Genes ◽

10.3390/genes10110924 ◽

2019 ◽

Vol 10 (11) ◽

pp. 924 ◽

Cited By ~ 2

Author(s):

Chen ◽

You ◽

Zhang ◽

Wang ◽

Cheng ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Protein Sequence ◽

De Novo ◽

Protein Sequences ◽

Computational Method ◽

Sequence Information ◽

Interacting Proteins ◽

Forest Model

Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional experimental methods for identifying SIPs have been developed in the past few years. However, these methods are costly, time-consuming, and inefficient, which limits their use for SIP prediction. The development of computational methods is therefore timely. In this paper, we propose for the first time a novel deep learning model that incorporates natural language processing (NLP) methods to predict potential SIPs from protein sequence information. More specifically, each protein sequence is de novo assembled from k-mers. We then obtain the global vectors representation of each protein sequence using an NLP technique. Finally, based on known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments on yeast and human datasets yielded accuracies of 91.45% and 93.12%, respectively. Our evaluations show that amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work has potential applications for various biological classification problems.
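The k-mer step can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the abstract gives neither the value of k nor a context window, so `k=3` and `window=2` are placeholders, and the co-occurrence counter only shows the kind of statistic that global-vector (GloVe-style) embeddings are trained on.

```python
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def cooccurrence(tokens: list[str], window: int = 2) -> Counter:
    """Count ordered token pairs that appear within `window` positions.

    Assumption: a simple unweighted forward window; real embedding
    toolkits typically weight pairs by their distance.
    """
    counts: Counter = Counter()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(left, tokens[j])] += 1
    return counts


# Hypothetical five-residue sequence.
words = kmers("MKVLA", k=3)          # three overlapping 3-mers
pairs = cooccurrence(words, window=1)
```

Treating k-mers as words lets off-the-shelf word embedding methods be applied to protein sequences without any hand-crafted features.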


Classifying Residues in Mechanically Stable and Unstable Substructures Based on a Protein Sequence: The Case Study of the DnaK Hsp70 Chaperone

Nanomaterials ◽

10.3390/nano11092198 ◽

2021 ◽

Vol 11 (9) ◽

pp. 2198

Author(s):

Michal Gala ◽

Gabriel Žoldák

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Building Blocks ◽

Support Vector ◽

Sequence Information ◽

E Coli ◽

Physico Chemical ◽

Machine Learning Model ◽

Supervised Methods ◽

The Moment

Artificial proteins can be constructed from stable substructures, whose stability is encoded in their protein sequence. Identifying stable protein substructures experimentally is the only available option at the moment because no suitable method exists to extract this information from a protein sequence. In previous research, we examined the mechanics of E. coli Hsp70 and found four mechanically stable (S class) and three unstable substructures (U class). Of the total 603 residues in the folded domains of Hsp70, 234 residues belong to one of four mechanically stable substructures, and 369 residues belong to one of three unstable substructures. Here our goal is to develop a machine learning model to categorize Hsp70 residues using sequence information. We applied three supervised methods: logistic regression (LR), random forest, and support vector machine. The LR method showed the highest accuracy, 0.925, to predict the correct class of a particular residue only when context-dependent physico-chemical features were included. The cross-validation of the LR model yielded a prediction accuracy of 0.879 and revealed that most of the misclassified residues lie at the borders between substructures. We foresee machine learning models being used to identify stable substructures as candidates for building blocks to engineer new proteins.
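The class sizes quoted above give a useful reference point for the reported accuracies. The short calculation below uses only the residue counts from the abstract; the "majority baseline" (always predicting the larger class) is an added point of comparison, not a result reported by the authors.

```python
# Residue counts taken from the abstract.
stable_residues = 234     # S class
unstable_residues = 369   # U class
total_residues = stable_residues + unstable_residues

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(stable_residues, unstable_residues) / total_residues

# Reported accuracies, for comparison against the baseline.
reported = {"LR (full data)": 0.925, "LR (cross-validation)": 0.879}
improvement = {name: acc - majority_baseline for name, acc in reported.items()}
```

With roughly 61% of residues in the U class, both reported accuracies sit well above what class imbalance alone would produce.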


Optimization of carbamylation conditions and study on the effects on the product ions of carbamylation and dual modification of the peptide by Q-TOF MS

European Journal of Mass Spectrometry ◽

10.1177/1469066718788665 ◽

2018 ◽

Vol 24 (5) ◽

pp. 384-396

Author(s):

Cheng Guo ◽

Xuefeng Guo ◽

Lei Zhao ◽

Dandan Chen ◽

Jin Wang ◽

...

Keyword(s):

Amino Acid ◽

De Novo ◽

De Novo Sequencing ◽

Peptide Sequence ◽

High Mass ◽

Sequence Information ◽

Amino Acid Residues ◽

Sequence Coverage ◽

Modification Method ◽

Modified Peptides

Modified peptides fragmented by collision-induced dissociation can offer additional sequence information, which is beneficial for the de novo sequencing of peptides. Here, the model peptide VQGESNDLK was carbamylated. The optimal conditions were a temperature of 90 °C, a pH of 7, and a reaction time of 60 min. We then studied the b- and y-series ions of the native, carbamylated, and dual-modified peptides. The short carbamylated peptides (≤10 amino acid residues) produced more b-series ions (including the b1 ion). The long carbamylated peptides (>10 amino acid residues) produced an additional b1 ion but fewer y-series ions (especially in the high-mass region). The short dual-modified peptides produced more b-series ions (including the b1 ion) and more y-series ions, and their peptide sequence coverage was almost 100%. The long dual-modified peptides produced the b1 ion and more y-series ions, and their peptide sequence coverage was generally above 90%. Therefore, both carbamylation and the dual modification method could be used to identify the N-terminal amino acid, and the dual modification method was also excellent for the de novo sequencing of tryptic peptides.
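To make the b- and y-series concrete, the sketch below computes singly charged fragment m/z values for the model peptide VQGESNDLK from standard monoisotopic residue masses. It is an illustration, not the authors' method: the optional `nterm_shift` models only an N-terminal carbamyl group (+43.005814 Da), lysine carbamylation and the second modification of the dual scheme are not modelled, and the function name is hypothetical.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
CARBAMYL = 43.005814  # mass added by one carbamyl group


def fragment_ions(peptide: str, nterm_shift: float = 0.0):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i - 1] is b_i and y_ions[i - 1] is y_i for i = 1 .. len - 1.
    `nterm_shift` is added to every b ion (assumed N-terminal modification).
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + nterm_shift + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


native_b, native_y = fragment_ions("VQGESNDLK")
carbamylated_b, _ = fragment_ions("VQGESNDLK", nterm_shift=CARBAMYL)
```

Under these assumptions the native b1 ion falls near m/z 100.08 and the N-terminally carbamylated b1 ion near m/z 143.08, which is why a b1 signal is informative about the N-terminal residue.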


A General Protease Digestion Procedure for Optimal Protein Sequence Coverage and Post-Translational Modifications Analysis of Recombinant Glycoproteins: Application to the Characterization of Human Lysyl Oxidase-like 2 Glycosylation

Analytical Chemistry ◽

10.1021/ac2017037 ◽

2011 ◽

Vol 83 (22) ◽

pp. 8484-8491 ◽

Cited By ~ 29

Author(s):

Kathryn R. Rebecchi ◽

Eden P. Go ◽

Li Xu ◽

Carrie L. Woodin ◽

Minae Mure ◽

...

Keyword(s):

Protein Sequence ◽

Lysyl Oxidase ◽

Sequence Coverage ◽

Post Translational Modifications ◽

Digestion Procedure ◽

Recombinant Glycoproteins ◽

Protease Digestion ◽

Protein Sequence Coverage


EVfold.org: Evolutionary Couplings and Protein 3D Structure Prediction

10.1101/021022 ◽

2015 ◽

Cited By ~ 14

Author(s):

Robert Sheridan ◽

Robert J. Fieldhouse ◽

Sikander Hayat ◽

Yichao Sun ◽

Yevgeniy Antipin ◽

...

Keyword(s):

Protein Function ◽

Structure Prediction ◽

De Novo ◽

3D Structure ◽

Sequence Information ◽

Major Advance ◽

Sequence Alignments ◽

Multiple Sequence ◽

Genomic Databases ◽

Multiple Sequence Alignments

Recently developed maximum entropy methods infer evolutionary constraints on protein function and structure from the millions of protein sequences available in genomic databases. The EVfold web server (at EVfold.org) makes these methods available to predict functional and structural interactions in proteins. The key algorithmic development has been to disentangle direct and indirect residue-residue correlations in large multiple sequence alignments and derive direct residue-residue evolutionary couplings (EVcouplings or ECs). For proteins of unknown structure, distance constraints obtained from evolutionary couplings between residue pairs are used to predict all-atom 3D structures de novo, often to good accuracy. Given sufficient sequence information in a protein family, this is a major advance toward solving the problem of computing the native 3D fold of proteins from sequence information alone. Availability: EVfold server at http://evfold.org/ Contact: [email protected]


Cryomicroscopy of crotoxin complex crystals at 400 KV

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100156109 ◽

1989 ◽

Vol 47 ◽

pp. 826-827

Author(s):

Jaap Brink ◽

Wah Chiu

Keyword(s):

Signal To Noise Ratio ◽

3D Structure ◽

Three Dimensional ◽

Carbon Films ◽

Depth Of Field ◽

Basic Subunit ◽

Ewald Sphere ◽

Diffraction Patterns ◽

Complex Crystals ◽

Acidic Subunit

The crotoxin complex is a potent neurotoxin composed of a basic subunit (Mr = 12,000) and an acidic subunit (Mr = 10,000). The basic subunit possesses phospholipase activity, whereas the acidic subunit shows no enzymatic activity at all. The complex's toxicity is expressed both pre- and post-synaptically. The crotoxin complex forms thin crystals suitable for electron crystallography. The crystals diffract up to 0.16 nm in the microscope, whereas images show reflections out to 0.39 nm. The ultimate goal of this study is to obtain a three-dimensional (3D) structure map of the protein at around 0.3 nm resolution. The use of 100 keV electrons for this purpose is limited; the unit cell's height c of 25.6 nm causes problems associated with multiple scattering, radiation damage, limited depth of field, and a more pronounced Ewald sphere curvature. In general, these lead to projections of the unit cell which, at the desired resolution, cannot be interpreted following the weak-phase approximation. Circumventing this problem is possible through the use of 400 keV electrons. Although the overall contrast is lowered due to a smaller scattering cross-section, the signal-to-noise ratio of especially the higher-order reflections will improve due to a smaller contribution of inelastic scattering. We report here our preliminary results demonstrating the feasibility of the data collection procedure at 400 kV. Crystals of crotoxin complex were prepared on carbon-covered holey-carbon films, quench frozen in liquid ethane, inserted into a Gatan 626 holder, transferred into a JEOL 4000EX electron microscope equipped with a pair of anticontaminators operating at −184°C, and examined under low-dose conditions. Selected area electron diffraction patterns (EDPs) and images of the crystals were recorded at 400 kV and −167°C with dose levels of 5 and 9.5 electrons/Å², respectively.
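The Ewald sphere argument can be made quantitative with the relativistic electron wavelength, since the sphere's radius is 1/λ and a shorter wavelength therefore gives a flatter sphere. The sketch below is an added illustration rather than part of the original report; it uses the standard formula λ = h / sqrt(2 m0 eV (1 + eV / (2 m0 c²))) with CODATA constants.

```python
import math

PLANCK = 6.62607015e-34          # J s
ELECTRON_MASS = 9.1093837015e-31  # kg
ELEMENTARY_CHARGE = 1.602176634e-19  # C
LIGHT_SPEED = 2.99792458e8       # m/s


def electron_wavelength(voltage: float) -> float:
    """Relativistic electron wavelength in metres for an accelerating voltage in volts."""
    energy = ELEMENTARY_CHARGE * voltage  # kinetic energy in joules
    rest_energy = ELECTRON_MASS * LIGHT_SPEED ** 2
    momentum_sq = 2.0 * ELECTRON_MASS * energy * (1.0 + energy / (2.0 * rest_energy))
    return PLANCK / math.sqrt(momentum_sq)


lambda_100kv = electron_wavelength(100e3)  # about 3.70 pm
lambda_400kv = electron_wavelength(400e3)  # about 1.64 pm
# Ratio of Ewald sphere radii (1/lambda) between 400 kV and 100 kV.
radius_ratio = lambda_100kv / lambda_400kv
```

Going from 100 kV to 400 kV more than doubles the Ewald sphere radius, so the sphere departs less from a plane over the same range of reciprocal-space vectors.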
