SPIDER: SOFTWARE FOR PROTEIN IDENTIFICATION FROM SEQUENCE TAGS WITH DE NOVO SEQUENCING ERROR

Evolutionary Algorithms for Improving De Novo Peptide Sequencing

10.26686/wgtn.17145581 ◽

2021 ◽

Author(s):

Samaneh Azari

Keyword(s):

Amino Acid ◽

De Novo ◽

Peptide Identification ◽

Peptide Sequencing ◽

De Novo Sequencing ◽

Amino Acid Sequences ◽

Full Length ◽

Multi Objective ◽

De Novo Peptide Sequencing ◽

De Novo Peptide

De novo peptide sequencing algorithms identify peptides from tandem mass spectra (MS/MS) in proteomics, and can be used to identify and discover novel peptides and proteins for which no database is available. Despite improvements in MS instrumentation and de novo sequencing methods, a significant number of CID MS/MS spectra still remain unassigned by current algorithms, often leading to low-confidence peptide assignments to the spectra. Moreover, current algorithms often fail to construct completely matched sequences and produce only partial matches, so identification of full-length peptides remains challenging. Another major challenge is noise in MS/MS spectra, which makes the data highly imbalanced. Missing peaks, caused by incomplete MS fragmentation, make it even more difficult to infer a full-length peptide sequence. In addition, the large search space of all possible amino acid sequences for each spectrum leads to a high false discovery rate.

This thesis focuses on improving the performance of current methods by developing new machine learning algorithms for three steps, preprocessing, sequence optimisation and post-processing, for more comprehensive interrogation of MS/MS datasets. From the machine learning point of view, the three steps can be addressed by solving different tasks such as classification, optimisation, and symbolic regression. Since Evolutionary Algorithms (EAs), as effective global search techniques, have shown promising results in solving these problems, this thesis investigates the capability of EAs to improve de novo peptide sequencing.

In the preprocessing step, this thesis proposes an effective genetic programming (GP) based method for classifying signal and noise peaks in highly imbalanced MS/MS spectra, with the aim of improving the reliability of peptide identification. The results show that the proposed algorithm is the most stable classification method across various noise ratios, outperforming six other benchmark classification algorithms. The experimental results show a significant improvement in high-confidence peptide assignments to MS/MS spectra when the data is preprocessed by the proposed GP method. The thesis also proposes the first multi-objective GP approach for classifying peaks in MS/MS data, which aims to maximise both the accuracy of the minority class (signal peaks) and the accuracy of the majority class (noise peaks). The results show that the multi-objective GP method outperforms the single-objective GP algorithm and a popular multi-objective approach in terms of retaining more signal peaks and removing more noise peaks. The multi-objective GP approach significantly improved the reliability of peptide identification.

This thesis proposes a genetic algorithm (GA) based method to solve the complex optimisation task of de novo peptide sequencing, aiming at constructing full-length sequences. The proposed method benefits from the GA's capability of searching a large space of potential amino acid sequences to find the most likely full-length sequence. The experimental results show that the proposed method outperforms the most commonly used de novo sequencing method at both the amino acid level and the peptide level.

This thesis also proposes a novel post-processing method for re-scoring and re-ranking the peptide spectrum matches (PSMs) produced by de novo peptide sequencing, aiming at minimising the false discovery rate. The proposed GP method evolves computer programs that perform regression and classification simultaneously in order to generate an effective scoring function for finding the correct PSMs among many incorrect ones. The results show that the new GP-based PSM scoring function significantly improves the identification of full-length peptides when it is used to post-process the de novo sequencing results.
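The two objectives named in the abstract (accuracy on the minority signal class and accuracy on the majority noise class) can be illustrated with a minimal Python sketch. This is not the thesis's GP method; the function names `class_accuracies` and `dominates`, and the label encoding (1 for signal, 0 for noise), are assumptions made only for illustration. It shows why overall accuracy is misleading on imbalanced peak data and how two candidate classifiers can be compared under Pareto dominance.

```python
from typing import Sequence, Tuple


def class_accuracies(y_true: Sequence[int], y_pred: Sequence[int],
                     signal: int = 1, noise: int = 0) -> Tuple[float, float]:
    """Return (signal accuracy, noise accuracy) for peak labels.

    The label encoding (1 = signal, 0 = noise) is an illustrative assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    sig_total = sum(1 for t in y_true if t == signal)
    noi_total = sum(1 for t in y_true if t == noise)
    if sig_total == 0 or noi_total == 0:
        raise ValueError("both classes must be present in y_true")
    sig_hit = sum(1 for t, p in zip(y_true, y_pred) if t == signal and p == signal)
    noi_hit = sum(1 for t, p in zip(y_true, y_pred) if t == noise and p == noise)
    return sig_hit / sig_total, noi_hit / noi_total


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if objective vector a Pareto-dominates b (both objectives maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Two signal peaks among ten; a classifier that labels everything as noise
# reaches 80% overall accuracy yet retains no signal peaks at all.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
all_noise = [0] * 10
print(class_accuracies(y_true, all_noise))
```

Treating the two accuracies as separate objectives, rather than collapsing them into one overall accuracy, is what lets a multi-objective search keep classifiers that trade a little noise removal for better signal retention.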

Enhancing TOF/TOF-based de Novo Sequencing Capability for High Throughput Protein Identification with Amino Acid-Coded Mass Tagging

Journal of Proteome Research ◽

10.1021/pr049850u ◽

2005 ◽

Vol 4 (1) ◽

pp. 83-90 ◽

Cited By ~ 14

Author(s):

Wenqing Shui ◽

Yinkun Liu ◽

Huizhi Fan ◽

Huimin Bao ◽

Shufang Liang ◽

...

Keyword(s):

Amino Acid ◽

High Throughput ◽

Protein Identification ◽

De Novo ◽

De Novo Sequencing

Uncovering thousands of new HLA antigens and phosphopeptides with deep learning-based sequence-mask-search de novo peptide sequencing framework

10.1101/667527 ◽

2019 ◽

Author(s):

Korrawe Karunratanakul ◽

Hsin-Yao Tang ◽

David W. Speicher ◽

Ekapol Chuangsuwanich ◽

Sira Sriswasdi

Keyword(s):

Deep Learning ◽

Amino Acid ◽

De Novo ◽

Hla Antigens ◽

Peptide Identification ◽

Peptide Sequencing ◽

Amino Acid Sequences ◽

Mass Spectrometry Data ◽

Model Organisms ◽

Invaluable Tool

Typical analyses of mass spectrometry data only identify amino acid sequences that exist in reference databases. This restricts the possibility of discovering new peptides such as those that contain uncharacterized mutations or originate from unexpected processing of RNAs and proteins. De novo peptide sequencing approaches address this limitation but often suffer from low accuracy and require extensive validation by experts. Here, we develop SMSNet, a deep learning-based hybrid de novo peptide sequencing framework that achieves >95% amino acid accuracy while retaining good identification coverage. Applications of SMSNet to landmark proteomics and peptidomics studies reveal over 10,000 previously uncharacterized HLA antigens and phosphopeptides and, in conjunction with database-search methods, expand the coverage of peptide identification by almost 30%. SMSNet's power to accurately identify new peptides would make it an invaluable tool for any future proteomics and peptidomics studies, especially cancer neoantigen discovery and proteome characterization of non-model organisms.

SPIDER: software for protein identification from sequence tags with de novo sequencing error

Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004. ◽

10.1109/csb.2004.1332434 ◽

2004 ◽

Cited By ~ 2

Author(s):

Yonghua Han ◽

Bin Ma ◽

Kaizhong Zhang

Keyword(s):

Protein Identification ◽

De Novo ◽

De Novo Sequencing ◽

Sequencing Error

AN AUTOMATA APPROACH TO MATCH GAPPED SEQUENCE TAGS AGAINST PROTEIN DATABASE

International Journal of Foundations of Computer Science ◽

10.1142/s012905410500311x ◽

2005 ◽

Vol 16 (03) ◽

pp. 487-497

Author(s):

YONGHUA HAN ◽

BIN MA ◽

KAIZHONG ZHANG

Keyword(s):

Mass Spectrometry ◽

Protein Identification ◽

De Novo ◽

De Novo Sequencing ◽

Computational Method ◽

Peptide Sequence ◽

Tandem Mass ◽

Protein Database ◽

Matching Algorithm ◽

Mass Gap

In biochemistry, tandem mass spectrometry (MS/MS) is the most common method for peptide and protein identification. One computational method for obtaining a peptide sequence from MS/MS data is de novo sequencing, which is becoming more and more important in this area. However, de novo sequencing can usually determine only partial sequences with confidence, while the undetermined parts are represented by "mass gaps". We call such a partially determined sequence a gapped sequence tag. When a gapped sequence tag is searched in a database for protein identification, the determined parts should match the database sequence exactly, while each mass gap should match a substring of amino acids whose masses add up to the value of the mass gap. In such a case, the standard string matching algorithm no longer works. In this paper, we present a new efficient algorithm to find the matches of gapped sequence tags in a protein database.
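The matching rule described above can be made concrete with a naive Python sketch. This is a brute-force illustration of the problem definition only, not the efficient automata-based algorithm the paper presents. The tag representation (a list mixing strings for determined parts and floats for mass gaps), the function names, and the 0.01 Da tolerance are all assumptions; the table holds standard monoisotopic residue masses.

```python
from typing import List, Sequence, Union

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# A tag element is either a determined substring or a mass gap (assumed encoding).
TagElement = Union[str, float]


def _match_at(tag: Sequence[TagElement], protein: str, pos: int, tol: float) -> bool:
    """Check whether the remaining tag elements match protein starting at pos."""
    if not tag:
        return True
    head, rest = tag[0], tag[1:]
    if isinstance(head, str):
        # Determined parts must match the database sequence exactly.
        if protein.startswith(head, pos):
            return _match_at(rest, protein, pos + len(head), tol)
        return False
    # A mass gap must match a non-empty substring whose residue masses sum to it.
    total = 0.0
    for end in range(pos, len(protein)):
        mass = RESIDUE_MASS.get(protein[end])
        if mass is None:          # unknown residue letter: the gap cannot span it
            break
        total += mass
        if total > head + tol:    # masses are positive, so extending cannot help
            break
        if abs(total - head) <= tol and _match_at(rest, protein, end + 1, tol):
            return True
    return False


def find_matches(tag: Sequence[TagElement], protein: str, tol: float = 0.01) -> List[int]:
    """Return every start index in protein at which the gapped tag matches."""
    return [i for i in range(len(protein)) if _match_at(tag, protein, i, tol)]


# Tag "KT", then a gap equal to the mass of A + Y, then "IAK".
gap = RESIDUE_MASS["A"] + RESIDUE_MASS["Y"]
print(find_matches(["KT", gap, "IAK"], "MKTAYIAKQR"))  # prints [1]
```

The sketch also shows why ordinary string matching fails here: a gap of 114.04293 Da is satisfied by a single N or by GG, so one tag corresponds to many literal strings.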

An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database

Journal of the American Society for Mass Spectrometry ◽

10.1016/1044-0305(94)80016-2 ◽

1994 ◽

Vol 5 (11) ◽

pp. 976-989 ◽

Cited By ~ 4235

Author(s):

Jimmy K. Eng ◽

Ashley L. McCormack ◽

John R. Yates

Keyword(s):

Amino Acid ◽

Spectral Data ◽

Amino Acid Sequences ◽

Tandem Mass ◽

Mass Spectral Data ◽

Protein Database ◽

Mass Spectral ◽

Tandem Mass Spectral Data

Protein Identification Strategies for the Greenshell Mussel Perna canaliculus

10.26686/wgtn.16999822.v1 ◽

2021 ◽

Author(s):

Cassidy Moeke

Keyword(s):

Heavy Metal ◽

Metal Pollution ◽

Heavy Metal Pollution ◽

Protein Identification ◽

De Novo ◽

De Novo Sequencing ◽

Cytoskeletal Proteins ◽

Est Database ◽

Perna Canaliculus ◽

Sequence Databases

The greenshell mussel Perna canaliculus is considered to be a suitable biomonitor for heavy metal pollution because of its ability to accumulate and tolerate heavy metals in its tissues. These characteristics make it useful for identifying protein biomarkers of heavy metal pollution, as well as proteins associated with heavy metal detoxification and homeostasis. However, the identification of such proteins is restricted by the greenshell mussel being poorly represented in sequence databases. Several strategies have previously been used to identify proteins in unsequenced species, but only one of these strategies has been applied to the greenshell mussel. The objective of this thesis was to examine different protein identification strategies using a combined two-dimensional gel electrophoresis and MALDI-TOF/TOF mass spectrometry approach. The protein identification strategies used include a Mascot database search, as well as de novo sequencing approaches using PEAKS DB and SPIDER homology searches. In total, 155 protein spots were excised and 68 were identified. Fifty-six proteins were identified using a Mascot search against the Mollusca, NCBInr and Invertebrate EST databases, with seven single-peptide identifications. De novo sequencing strategies identified additional proteins, with two from a PEAKS DB search and 10 from an error-tolerant SPIDER homology search. The most noticeable protein groups identified were cytoskeletal proteins, stress response proteins and those involved in protein biosynthesis. Actin and tubulin made up the bulk of the identifications, accounting for 39% of all proteins identified. This multifaceted approach was shown to be useful for identifying proteins in the greenshell mussel Perna canaliculus. Mascot and PEAKS DB performed equally well, while the error-tolerant functionality of SPIDER was useful for identifying additional proteins. A subsequent search against the Invertebrate EST database was also found to be useful for identifying additional proteins. Despite this, more than half of all proteins remained unidentified. Most of these proteins either failed to produce good quality MS spectra or did not find a match to a sequence in the database. Future research should first focus on obtaining quality MS spectra for all proteins concerned and then examine other strategies that may be more suitable for identifying proteins for species with poor representation in sequence databases.

Expression of intermediate filament proteins during development of Xenopus laevis. I. cDNA clones encoding different forms of vimentin

Development ◽

10.1242/dev.105.2.279 ◽

1989 ◽

Vol 105 (2) ◽

pp. 279-298

Author(s):

H. Herrmann ◽

B. Fouquet ◽

W.W. Franke

Keyword(s):

Amino Acid ◽

Xenopus Laevis ◽

Intermediate Filament ◽

De Novo ◽

Mesenchymal Cell ◽

Amino Acid Sequences ◽

Cytoskeletal Proteins ◽

Cdna Clones ◽

Mammalian Development ◽

Intermediate Filament Proteins

To provide a basis for studies of the expression of genes encoding the diverse kinds of intermediate-filament (IF) proteins during embryogenesis of Xenopus laevis we have isolated and characterized IF protein cDNA clones. Here we report the identification of two types of Xenopus vimentin, Vim1 and Vim4, with their complete amino acid sequences as deduced from the cloned cDNAs, both of which are expressed during early embryogenesis. In addition, we have obtained two further vimentin cDNAs (Vim2 and 3) which are sequence variants of closely related Vim1. The high evolutionary conservation of the amino acid sequences (Vim1: 458 residues; Mr approximately 52,800; Vim4: 463 residues; Mr approximately 53,500) to avian and mammalian vimentin and, to a lesser degree, to desmin from the same and higher vertebrate species, is emphasized, including conserved oligopeptide motifs in their head domains. Using these cDNAs in RNA blot and ribonuclease protection assays of various embryonic stages, we observed a dramatic increase of vimentin RNA at stage 14, in agreement with immunocytochemical results obtained with antibody VIM-3B4. The significance of very weak mRNA signals detected in earlier stages is discussed in relation to negative immunocytochemical results obtained in these stages. The first appearance of vimentin has been localized to a distinct mesenchymal cell layer underlying the neural plate or tube, respectively. The results are discussed in relation to programs of de novo synthesis of other cytoskeletal proteins in amphibian and mammalian development.

Unevolved De Novo Proteins Have Innate Tendencies to Bind Transition Metals

Life ◽

10.3390/life9010008 ◽

2019 ◽

Vol 9 (1) ◽

pp. 8 ◽

Cited By ~ 4

Author(s):

Michael S. Wang ◽

Kenric J. Hoegler ◽

Michael H. Hecht

Keyword(s):

Amino Acid ◽

Transition Metals ◽

Metal Binding ◽

Combinatorial Library ◽

De Novo ◽

Protein Sequences ◽

Amino Acid Sequences ◽

Ancestral Sequences ◽

Wide Range ◽

Catalytic Functions

Life as we know it would not exist without the ability of protein sequences to bind metal ions. Transition metals, in particular, play essential roles in a wide range of structural and catalytic functions. The ubiquitous occurrence of metalloproteins in all organisms leads one to ask whether metal binding is an evolved trait that occurred only rarely in ancestral sequences, or alternatively, whether it is an innate property of amino acid sequences, occurring frequently in unevolved sequence space. To address this question, we studied 52 proteins from a combinatorial library of novel sequences designed to fold into 4-helix bundles. Although these sequences were neither designed nor evolved to bind metals, the majority of them have innate tendencies to bind the transition metals copper, cobalt, and zinc with high nanomolar to low-micromolar affinity.
