The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site

Antimicrobial peptides (AMPs) are anti-infectives with the potential to serve as a novel and untapped class of biotherapeutics. Their modes of action include interaction with the cell envelope (cell wall, outer and inner membrane). A comprehensive understanding of the peculiarities of this interaction is necessary for the rational design of new biotherapeutics against which microbes cannot easily develop resistance. To enable de novo design at low cost and high throughput, in silico predictive models have to be invoked. Developing an efficient predictive model requires a comprehensive understanding of the sequence-to-function relationship. This knowledge will allow us to encode amino acid sequences expressively and to choose an accurate AMP classifier. A protective layer shared by microbial cells is the inner, plasma membrane. The interaction of AMPs with biological membranes (native and/or artificial) has been comprehensively studied. We review the mechanisms and results of the interactions of AMPs with the cell membrane, relying on a survey of the physicochemical, aggregative, and structural features of AMPs. The potency and mechanism of AMP action are presented in terms of amino acid composition and the distribution of polar and apolar residues along the chain, that is, in terms of physicochemical features of peptides such as hydrophobicity, hydrophilicity, and amphiphilicity. The survey of current data highlights topics that should be taken into account to arrive at a comprehensive explanation of the mechanisms of action of AMPs and to uncover the physicochemical faces of peptides that are essential to their function. Many different approaches have been used to classify AMPs, including machine learning. The survey of knowledge on the sequences, structures, and modes of action of AMPs leads to the conclusion that only comprehensive information on the physicochemical features of AMPs enables us to develop accurate classifiers and create effective methods of prediction. Consequently, this knowledge is necessary for the development of design tools for peptide-based antibiotics.
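
The physicochemical features named in this abstract can be made concrete with a small calculation. The following is a minimal sketch, not a method taken from the review: it assumes the Kyte-Doolittle hydropathy scale, an ideal α-helix with 100° between consecutive residues, and a crude net charge that counts only Lys/Arg and Asp/Glu. The function names and the example peptide are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed scale; other scales give other numbers).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq):
    """Average hydropathy over all residues of the peptide."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydrophobic_moment(seq, angle_deg=100.0):
    """Per-residue hydrophobic moment, a simple measure of amphiphilicity.

    Each residue contributes a vector of length H_i pointing at angle
    i * angle_deg around the helix axis; the moment is the length of the
    vector sum divided by the number of residues.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


def net_charge(seq):
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu (His and termini ignored)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


if __name__ == "__main__":
    peptide = "KLAKLAKKLAKLAK"  # hypothetical illustrative sequence
    print("mean hydrophobicity:", round(mean_hydrophobicity(peptide), 3))
    print("hydrophobic moment: ", round(hydrophobic_moment(peptide), 3))
    print("net charge:         ", net_charge(peptide))
```

Descriptors of this kind are one possible way to encode a sequence as numeric features before passing it to a classifier; which descriptors and which scale to use is a design choice that the abstract leaves open.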


Conserved Peptides Recognition by Ensemble of Neural Networks for Mining Protein Data – LPMO Case Study

Математическая биология и биоинформатика, 2020, Vol 15 (2), pp. 429-440. DOI: 10.17537/2020.15.429

Author(s): G.S. Dotsenko, A.S. Dotsenko

Keyword(s): Neural Networks, Amino Acid, Markov Models, Amino Acid Sequences, Peptide Pattern, Novel Approach, Polysaccharide Monooxygenases, Promising Area, Peptide Pattern Recognition, Bacterial Proteomes

Mining protein data is a recent and promising area of modern bioinformatics. In this work, we propose a novel approach for mining protein data: conserved peptides recognition by an ensemble of neural networks (CPRENN). This approach was applied to mining lytic polysaccharide monooxygenases (LPMOs) in 19 ascomycete, 18 basidiomycete, and 18 bacterial proteomes. LPMOs are recently discovered enzymes, and their mining is highly relevant for the biotechnology of lignocellulosic materials. CPRENN was compared with two conventional bioinformatic methods for mining protein data: profile hidden Markov model (HMM) search (HMMER program) and peptide pattern recognition (PPR program combined with the Hotpep application). The largest number of hypothetical LPMO amino acid sequences was discovered by HMMER. Profile HMM search proved to be a more sensitive method for mining LPMOs than conserved peptides recognition. In total, CPRENN found 76 %, 67 %, and 65 % of the hypothetical ascomycete, basidiomycete, and bacterial LPMOs discovered by HMMER, respectively. For the AA9, AA10, and AA11 families, which contain the major part of all LPMOs in the carbohydrate-active enzymes database (CAZy), CPRENN and PPR + Hotpep found 69–98 % and 62–95 % of the amino acid sequences discovered by HMMER, respectively. In contrast with PPR + Hotpep, CPRENN possessed perfect precision and provided more complete mining of basidiomycete and bacterial LPMOs.
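
The comparison in this abstract rests on two ratios: the share of HMMER hits that another method also finds, and the precision of that method's own hits. A minimal sketch of both is given below. The function names, the sequence identifiers, and the existence of a separate validated set are assumptions made for illustration; the abstract does not state how the authors computed these figures.

```python
def fraction_recovered(method_hits, reference_hits):
    """Share of the reference hits (e.g. from HMMER) that the method also reports."""
    if not reference_hits:
        raise ValueError("reference set is empty")
    return len(method_hits & reference_hits) / len(reference_hits)


def precision(method_hits, validated_hits):
    """Share of the method's hits that belong to a validated set of true positives."""
    if not method_hits:
        raise ValueError("method reported no hits")
    return len(method_hits & validated_hits) / len(method_hits)


if __name__ == "__main__":
    # Hypothetical identifiers, not data from the study.
    hmmer_hits = {"seq01", "seq02", "seq03", "seq04"}
    other_hits = {"seq01", "seq02", "seq03"}
    print("recovered:", fraction_recovered(other_hits, hmmer_hits))
    print("precision:", precision(other_hits, hmmer_hits))
```

In this toy case the second method recovers three of the four reference hits while every hit it reports is in the validated set, which is the pattern the abstract describes: lower sensitivity than the reference search combined with perfect precision.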


FIND: Identifying Functionally and Structurally Important Features in Protein Sequences with Deep Neural Networks

2019. DOI: 10.1101/592808

Author(s): Ranjani Murali, James Hemp, Victoria Orphan, Yonatan Bisk

Keyword(s): Neural Networks, Amino Acid, Hidden Markov Models, Markov Models, Genomic Sequence, Hidden Markov, Amino Acid Sequences, Homologous Proteins, Biological Studies, Insight Into

The ability to correctly predict the functional role of proteins from their amino acid sequences would significantly advance biological studies at the molecular level by improving our ability to understand the biochemical capability of biological organisms from their genomic sequence. Existing methods that are geared towards protein function prediction or annotation mostly use alignment-based approaches and probabilistic models such as Hidden Markov Models. In this work we introduce a deep learning architecture (Function Identification with Neural Descriptions, or FIND) which performs protein annotation from primary sequence. The accuracy of our methods matches state-of-the-art techniques, such as protein classifiers based on Hidden Markov Models. Further, our approach allows for model introspection via a neural attention mechanism, which weights parts of the amino acid sequence proportionally to their relevance for functional assignment. In this way, the attention weights automatically uncover structurally and functionally relevant features of the classified protein and find novel functional motifs in previously uncharacterized proteins. While this model is applicable to any database of proteins, we chose to apply this model to superfamilies of homologous proteins, with the aim of extracting features inherent to divergent protein families within a larger superfamily. This provided insight into the functional diversification of an enzyme superfamily and its adaptation to different physiological contexts. We tested our approach on three families (nitrogenases, cytochrome bd-type oxygen reductases and heme-copper oxygen reductases) and present a detailed analysis of the sequence characteristics identified in previously characterized proteins in the heme-copper oxygen reductase (HCO) superfamily. These are correlated with their catalytic relevance and evolutionary history. FIND was then applied to discover features in previously uncharacterized members of the HCO superfamily, providing insight into their unique sequence features. This modeling approach demonstrates the power of neural networks to recognize patterns in large datasets and can be utilized to discover biochemically and structurally important features in proteins from their amino acid sequences.
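
The idea of attention weights that highlight relevant residues can be illustrated with a generic attention-pooling step. The sketch below is not the FIND architecture, whose details the abstract does not give. It assumes that some upstream network has already produced one relevance score and one feature vector per residue; the scores, vectors, and function names here are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-residue scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(residue_vectors, scores):
    """Weighted average of per-residue vectors, using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(residue_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, residue_vectors))
        for d in range(dim)
    ]
    return pooled, weights


def top_positions(sequence, weights, k=3):
    """Return the k residues with the largest attention weights as (index, residue, weight)."""
    ranked = sorted(range(len(sequence)), key=lambda i: weights[i], reverse=True)
    return [(i, sequence[i], weights[i]) for i in ranked[:k]]


if __name__ == "__main__":
    sequence = "MHKWE"                       # hypothetical fragment
    scores = [0.1, 2.0, 0.3, 1.5, 0.2]       # hypothetical relevance scores
    vectors = [[float(i), 1.0] for i in range(len(sequence))]  # placeholder features
    pooled, weights = attention_pool(vectors, scores)
    print("pooled representation:", pooled)
    print("most attended residues:", top_positions(sequence, weights, k=2))
```

Inspecting the highest-weighted positions, as `top_positions` does, is the kind of model introspection the abstract refers to: the same weights that build the pooled representation also indicate which residues influenced the prediction most.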


Evolutionary Algorithms for Improving De Novo Peptide Sequencing

2021. DOI: 10.26686/wgtn.17145581.v1

Author(s): Samaneh Azari

Keyword(s): Amino Acid, De Novo, Peptide Identification, Peptide Sequencing, De Novo Sequencing, Amino Acid Sequences, Full Length, Multi Objective, De Novo Peptide Sequencing, De Novo Peptide

De novo peptide sequencing algorithms have been developed for peptide identification in proteomics from tandem mass spectra (MS/MS), and can be used to identify and discover novel peptides and proteins for which no database is available. Despite improvements in MS instrumentation and de novo sequencing methods, a significant number of CID MS/MS spectra still remain unassigned by current algorithms, often leading to low confidence of peptide assignments to the spectra. Moreover, current algorithms often fail to construct completely matched sequences and produce partial matches instead. Therefore, identification of full-length peptides remains challenging. Another major challenge is the existence of noise in MS/MS spectra, which makes the data highly imbalanced. Also, missing peaks caused by incomplete MS fragmentation make it more difficult to infer a full-length peptide sequence. In addition, the large search space of all possible amino acid sequences for each spectrum leads to a high false discovery rate. This thesis focuses on improving the performance of current methods by developing new algorithms for the three steps of preprocessing, sequence optimisation, and post-processing, using machine learning for more comprehensive interrogation of MS/MS datasets. From the machine learning point of view, the three steps can be addressed by solving different tasks such as classification, optimisation, and symbolic regression. Since Evolutionary Algorithms (EAs), as effective global search techniques, have shown promising results in solving these problems, this thesis investigates the capability of EAs to improve de novo peptide sequencing. In the preprocessing step, this thesis proposes an effective method based on genetic programming (GP) for classifying signal and noise peaks in highly imbalanced MS/MS spectra, with the purpose of improving the reliability of peptide identification. The results show that the proposed algorithm is the most stable classification method across various noise ratios, outperforming six other benchmark classification algorithms. The experimental results show a significant improvement in high-confidence peptide assignments to MS/MS spectra when the data is preprocessed by the proposed GP method. Moreover, this thesis proposes the first multi-objective GP approach for classification of peaks in MS/MS data, aiming at maximising both the accuracy of the minority class (signal peaks) and the accuracy of the majority class (noise peaks). The results show that the multi-objective GP method outperforms the single-objective GP algorithm and a popular multi-objective approach in terms of retaining more signal peaks and removing more noise peaks. The multi-objective GP approach significantly improved the reliability of peptide identification. This thesis also proposes a method based on a genetic algorithm (GA) to solve the complex optimisation task of de novo peptide sequencing, aiming at constructing full-length sequences. The proposed method benefits from the GA's capability of searching a large space of potential amino acid sequences to find the most likely full-length sequence. The experimental results show that the proposed method outperforms the most commonly used de novo sequencing method at both the amino acid level and the peptide level. Finally, this thesis proposes a novel post-processing method for re-scoring and re-ranking the peptide spectrum matches (PSMs) resulting from de novo peptide sequencing, aiming at minimising the false discovery rate. The proposed GP method evolves computer programs that perform regression and classification simultaneously in order to generate an effective scoring function for finding the correct PSMs among many incorrect ones. The results show that the new GP-based PSM scoring function significantly improves the identification of full-length peptides when it is used to post-process the de novo sequencing results.
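
The two objectives named in this abstract, accuracy on signal peaks and accuracy on noise peaks, can be written down directly. The sketch below only illustrates why the two are tracked separately on imbalanced data and how two classifiers can be compared by Pareto dominance. It is not the thesis's GP method; the label encoding (1 for signal, 0 for noise), the function names, and the toy data are assumptions.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of peaks labelled correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred, signal=1, noise=0):
    """Return (signal accuracy, noise accuracy); assumes both classes are present."""
    signal_total = sum(1 for t in y_true if t == signal)
    noise_total = sum(1 for t in y_true if t == noise)
    signal_hits = sum(1 for t, p in zip(y_true, y_pred) if t == signal and p == signal)
    noise_hits = sum(1 for t, p in zip(y_true, y_pred) if t == noise and p == noise)
    return signal_hits / signal_total, noise_hits / noise_total


def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy spectrum: one signal peak among nine noise peaks.
    y_true = [1] + [0] * 9
    all_noise = [0] * 10             # classifier that discards everything
    cautious = [1, 1] + [0] * 8      # keeps the signal peak, mislabels one noise peak

    print("overall accuracy, all-noise:", overall_accuracy(y_true, all_noise))
    print("per-class, all-noise:", per_class_accuracy(y_true, all_noise))
    print("per-class, cautious: ", per_class_accuracy(y_true, cautious))
    print("cautious dominates all-noise:",
          dominates(per_class_accuracy(y_true, cautious),
                    per_class_accuracy(y_true, all_noise)))
```

In the toy data the classifier that labels every peak as noise scores high on overall accuracy while retaining no signal at all, and neither classifier dominates the other. A multi-objective search keeps such trade-off solutions side by side instead of collapsing them into a single score.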


Expression of intermediate filament proteins during development of Xenopus laevis. I. cDNA clones encoding different forms of vimentin

Development, 1989, Vol 105 (2), pp. 279-298. DOI: 10.1242/dev.105.2.279

Author(s): H. Herrmann, B. Fouquet, W.W. Franke

Keyword(s): Amino Acid, Xenopus Laevis, Intermediate Filament, De Novo, Mesenchymal Cell, Amino Acid Sequences, Cytoskeletal Proteins, Cdna Clones, Mammalian Development, Intermediate Filament Proteins

To provide a basis for studies of the expression of genes encoding the diverse kinds of intermediate-filament (IF) proteins during embryogenesis of Xenopus laevis we have isolated and characterized IF protein cDNA clones. Here we report the identification of two types of Xenopus vimentin, Vim1 and Vim4, with their complete amino acid sequences as deduced from the cloned cDNAs, both of which are expressed during early embryogenesis. In addition, we have obtained two further vimentin cDNAs (Vim2 and 3) which are sequence variants of closely related Vim1. The high evolutionary conservation of the amino acid sequences (Vim1: 458 residues; Mr approximately 52,800; Vim4: 463 residues; Mr approximately 53,500) to avian and mammalian vimentin and, to a lesser degree, to desmin from the same and higher vertebrate species, is emphasized, including conserved oligopeptide motifs in their head domains. Using these cDNAs in RNA blot and ribonuclease protection assays of various embryonic stages, we observed a dramatic increase of vimentin RNA at stage 14, in agreement with immunocytochemical results obtained with antibody VIM-3B4. The significance of very weak mRNA signals detected in earlier stages is discussed in relation to negative immunocytochemical results obtained in these stages. The first appearance of vimentin has been localized to a distinct mesenchymal cell layer underlying the neural plate or tube, respectively. The results are discussed in relation to programs of de novo synthesis of other cytoskeletal proteins in amphibian and mammalian development.


Unevolved De Novo Proteins Have Innate Tendencies to Bind Transition Metals

Life, 2019, Vol 9 (1), pp. 8. DOI: 10.3390/life9010008. Cited By ~ 4

Author(s): Michael S. Wang, Kenric J. Hoegler, Michael H. Hecht

Keyword(s): Amino Acid, Transition Metals, Metal Binding, Combinatorial Library, De Novo, Protein Sequences, Amino Acid Sequences, Ancestral Sequences, Wide Range, Catalytic Functions

Life as we know it would not exist without the ability of protein sequences to bind metal ions. Transition metals, in particular, play essential roles in a wide range of structural and catalytic functions. The ubiquitous occurrence of metalloproteins in all organisms leads one to ask whether metal binding is an evolved trait that occurred only rarely in ancestral sequences, or alternatively, whether it is an innate property of amino acid sequences, occurring frequently in unevolved sequence space. To address this question, we studied 52 proteins from a combinatorial library of novel sequences designed to fold into 4-helix bundles. Although these sequences were neither designed nor evolved to bind metals, the majority of them have innate tendencies to bind the transition metals copper, cobalt, and zinc with high nanomolar to low-micromolar affinity.
