A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure

Mapping Intimacies ◽

10.1101/2021.09.02.458674 ◽

2021 ◽

Author(s):

Richard John Wheeler

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Structure Prediction ◽

Sequence Data ◽

Three Dimensional ◽

Model Organisms ◽

Dimensional Structure ◽

Sequence Alignments ◽

High Quality ◽

Multiple Sequence

AlphaFold2 and RoseTTAfold represent a transformative advance for predicting protein structure. They are able to make very high-quality predictions given a high-quality alignment of the protein sequence with related proteins. These predictions are now readily available via the AlphaFold database of predicted structures and AlphaFold/RoseTTAfold Colaboratory notebooks for custom predictions. However, predictions for some species tend to be lower confidence than those for model organisms. This includes Trypanosoma cruzi and Leishmania infantum: important unicellular eukaryotic human parasites in an early-branching eukaryotic lineage. The cause appears to be poor sampling of this branch of life in the protein sequence databases used for the AlphaFold database and ColabFold. Here, by comprehensively gathering openly available protein sequence data for species from this lineage, significant improvements to AlphaFold2 protein structure prediction over the AlphaFold database and ColabFold are demonstrated. This is made available as an easy-to-use tool for the parasitology community in the form of Colaboratory notebooks for generating multiple sequence alignments and AlphaFold2 predictions of protein structure for Trypanosoma, Leishmania and related species.

A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure

PLoS ONE ◽

10.1371/journal.pone.0259871 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0259871

Author(s):

Richard John Wheeler

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Structure Prediction ◽

Sequence Data ◽

Three Dimensional ◽

Model Organisms ◽

Dimensional Structure ◽

Sequence Alignments ◽

High Quality ◽

Multiple Sequence

AlphaFold2 and RoseTTAfold represent a transformative advance for predicting protein structure. They are able to make very high-quality predictions given a high-quality alignment of the protein sequence with related proteins. These predictions are now readily available via the AlphaFold database of predicted structures and AlphaFold or RoseTTAfold Colaboratory notebooks for custom predictions. However, predictions for some species tend to be lower confidence than those for model organisms. Problematic species include Trypanosoma cruzi and Leishmania infantum: important unicellular eukaryotic human parasites in an early-branching eukaryotic lineage. The cause appears to be poor sampling of this branch of life (Discoba) in the protein sequence databases used for the AlphaFold database and ColabFold. Here, by comprehensively gathering openly available protein sequence data for Discoba species, significant improvements to AlphaFold2 protein structure prediction over the AlphaFold database and ColabFold are demonstrated. This is made available as an easy-to-use tool for the parasitology community in the form of Colaboratory notebooks for generating multiple sequence alignments and AlphaFold2 predictions of protein structure for Trypanosoma, Leishmania and related species.
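
The "lower confidence" mentioned above can be inspected directly in a predicted model. AlphaFold writes its per-residue confidence score (pLDDT, 0 to 100) into the B-factor column of the PDB files it outputs, so a mean confidence can be read back with a few lines of code. The following is a minimal sketch, not part of the resource described in the abstract; the helper names `make_ca_line` and `mean_plddt` and the two-residue toy model are hypothetical.

```python
def make_ca_line(serial, res_name, res_seq, plddt):
    """Build a fixed-width PDB ATOM record for a C-alpha atom (toy data only)."""
    return "ATOM  {:>5d}  CA  {:>3s} A{:>4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
        serial, res_name, res_seq, 0.0, 0.0, 0.0, 1.0, plddt
    )


def mean_plddt(pdb_text):
    """Average the B-factor column (columns 61-66) over C-alpha atoms.

    In AlphaFold output this column holds pLDDT rather than a true B-factor.
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of an ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


# Hypothetical two-residue model: one confident residue, one poorly predicted.
pdb_text = "\n".join([
    make_ca_line(1, "MET", 1, 90.0),
    make_ca_line(2, "ALA", 2, 50.0),
])
print(mean_plddt(pdb_text))
```

Comparing this mean for the same protein predicted with different alignments is one simple way to see whether a deeper alignment raised confidence.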

A COMPARATIVE STUDY OF PROTEIN TERTIARY STRUCTURE PREDICTION METHODS

International Journal of Computer Science and Informatics ◽

10.47893/ijcsi.2014.1168 ◽

2014 ◽

pp. 15-18

Author(s):

CHANDRAYANI N. ROKDE ◽

DR. MANALI KSHIRSAGAR

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

Tertiary Structure ◽

Sequence Data ◽

Protein Structures ◽

Three Dimensional ◽

Data Bank ◽

Dimensional Structure ◽

X Ray Crystallography ◽

Protein Tertiary Structure Prediction

Protein structure prediction (PSP) from amino acid sequence is one of the most actively studied problems in bioinformatics today, because the biological function of a protein is determined by its three-dimensional structure. Understanding protein structures is vital for determining the function of a protein and its interactions with DNA, RNA and enzymes. Thus, protein structure prediction is a fundamental area of computational biology. Its importance is heightened by the large amounts of sequence data coming from the PDB (Protein Data Bank) and by the fact that the experimental methods used to determine protein structures, such as X-ray crystallography and Nuclear Magnetic Resonance (NMR), remain very expensive and time-consuming. In this paper, different types of protein structures and methods for their prediction are described.

Prediction of Structural and Functional Aspects of Protein

Advances in Secure Computing, Internet Services, and Applications - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-4666-4940-8.ch016 ◽

2014 ◽

pp. 317-333

Author(s):

Arun G. Ingale

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Three Dimensional ◽

Dimensional Structure ◽

Sequence Information ◽

Predict Protein Structure ◽

Basic Ideas

Predicting the structure of a protein from its primary amino acid sequence is computationally difficult. An investigation of the methods and algorithms used to predict protein structure, together with a thorough knowledge of the function and structure of proteins, is critical for the advancement of biology and the life sciences as well as for the development of better drugs, higher-yield crops, and even synthetic bio-fuels. To that end, this chapter sheds light on the methods used for protein structure prediction. It covers the applications of modeled protein structures and explores the relationship between pure sequence information and three-dimensional structure, which continues to be one of the greatest challenges in molecular biology. It presents a broad examination of the problems, methods, tools, servers, databases, and applications of protein structure prediction, giving insight into the future applications of modeled protein structures. Current protein structure prediction methods are reviewed, covering the background to structure prediction, the prediction of structural fundamentals, tertiary structure prediction, and functional aspects. The basic ideas and advances in these directions are discussed in detail.

BiRDS - Binding Residue Detection from Protein Sequences using Deep ResNets

10.33774/chemrxiv-2021-013gn-v2 ◽

2021 ◽

Author(s):

Vineeth Chelur ◽

U. Deva Priyakumar

Keyword(s):

Binding Site ◽

Binding Sites ◽

Tertiary Structure ◽

Solvent Accessibility ◽

Three Dimensional ◽

Dimensional Structure ◽

Relative Solvent Accessibility ◽

Single Chain ◽

Sequence Alignments ◽

Multiple Sequence

Protein-drug interactions play important roles in many biological processes and therapeutics. Prediction of the active binding site of a protein helps discover and optimise these interactions leading to the design of better ligand molecules. The tertiary structure of a protein determines the binding sites available to the drug molecule. A quick and accurate prediction of the binding site from sequence alone without utilising the three-dimensional structure is challenging. Deep Learning has been used in a variety of biochemical tasks and has been hugely successful. In this paper, a Residual Neural Network (leveraging skip connections) is implemented to predict a protein's most active binding site. An Annotated Database of Druggable Binding Sites from the Protein DataBank, sc-PDB, is used for training the network. Features extracted from the Multiple Sequence Alignments (MSAs) of the protein generated using DeepMSA, such as Position-Specific Scoring Matrix (PSSM), Secondary Structure (SS3), and Relative Solvent Accessibility (RSA), are provided as input to the network. A weighted binary cross-entropy loss function is used to counter the substantial imbalance in the two classes of binding and non-binding residues. The network performs very well on single-chain proteins, providing a pocket that has good interactions with a ligand.
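
The weighted binary cross-entropy mentioned above can be illustrated with a small standalone example. This is a sketch of the general technique rather than the loss used in BiRDS: the function names, the toy labels and probabilities, and the inverse-frequency choice of weight are all assumptions made for illustration.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive (binding) class up-weighted."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def inverse_frequency_weight(y_true):
    """Hypothetical heuristic: ratio of non-binding to binding residues."""
    positives = sum(y_true)
    return (len(y_true) - positives) / positives


# Toy protein of 10 residues, only 2 of which are binding (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.1, 0.05, 0.3, 0.1, 0.2, 0.1, 0.4, 0.7]

w = inverse_frequency_weight(labels)
print("positive-class weight:", w)
print("unweighted loss:", weighted_bce(labels, probs, pos_weight=1.0))
print("weighted loss:  ", weighted_bce(labels, probs, pos_weight=w))
```

With the weight applied, errors on the rare binding residues contribute more to the loss, so a network cannot score well simply by predicting "non-binding" everywhere.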

SECAPR - A bioinformatics pipeline for the rapid and user-friendly alignment of hybrid enrichment sequences, from raw reads to alignments

10.7287/peerj.preprints.26477v2 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (= hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments

10.7287/peerj.preprints.26477 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Angela Cano ◽

Alexander Zizka ◽

Christine Bacon ◽

Alexandre Antonelli

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Model Organisms ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Capture ◽

Sequencing Platforms ◽

User Friendly

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing platforms such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

Contrastive learning on protein embeddings enlightens midnight zone at lightning speed

10.1101/2021.11.14.468528 ◽

2021 ◽

Author(s):

Michael Heinzinger ◽

Maria Littmann ◽

Ian Sillitoe ◽

Nicola Bordin ◽

Christine Orengo ◽

...

Keyword(s):

Structure Prediction ◽

Sequence Similarity ◽

3D Structure ◽

Three Dimensional ◽

Hierarchical Classification ◽

Language Models ◽

Sequence Alignments ◽

Sequence Comparisons ◽

Multiple Sequence ◽

3D Structures

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so-called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin to random relations and is therefore hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments, allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT .
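
The idea of embedding-based annotation transfer can be sketched as a nearest-neighbour lookup: the query inherits the label of the closest annotated protein in embedding space. The example below is a toy illustration only, not the Rostlab/EAT code; the 3-dimensional vectors, the fold labels, the Euclidean metric, and the `max_distance` cut-off are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotation(query, lookup, max_distance=None):
    """Return (label, distance) of the nearest lookup embedding.

    If max_distance is given and the nearest hit is farther away,
    no annotation is transferred and the label is None.
    """
    best_label, best_dist = None, math.inf
    for label, embedding in lookup:
        d = euclidean(query, embedding)
        if d < best_dist:
            best_label, best_dist = label, d
    if max_distance is not None and best_dist > max_distance:
        return None, best_dist
    return best_label, best_dist


# Hypothetical lookup set: real pLM embeddings have hundreds of dimensions.
lookup = [
    ("fold_A", [0.9, 0.1, 0.0]),
    ("fold_B", [0.0, 1.0, 0.2]),
    ("fold_C", [0.1, 0.0, 1.0]),
]

label, dist = transfer_annotation([0.8, 0.2, 0.1], lookup)
print(label, round(dist, 3))
```

Because the lookup needs only vector distances and no alignment, it scales to large annotated sets far more cheaply than pairwise sequence comparison.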

Protein language model embeddings for fast, accurate, alignment-free protein structure prediction

10.1101/2021.07.31.454572 ◽

2021 ◽

Author(s):

Konstantin Weissenow ◽

Michael Heinzinger ◽

Burkhard Rost

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

Prediction Models ◽

Language Model ◽

Structural Features ◽

Language Models ◽

Evolutionary Information ◽

Major Advance ◽

Sequence Alignments ◽

Multiple Sequence

All state-of-the-art (SOTA) protein structure predictions rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) using only single sequences as input are easier and cheaper but perform so poorly that speed becomes irrelevant. Here, we described the first competitive AI solution exclusively inputting embeddings extracted from pre-trained protein Language Models (pLMs), namely from the transformer pLM ProtT5, from single sequences into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models never required an MSA, they matched the performance of methods relying on co-evolution. Although not reaching the very top, our lean approach came close at substantially lower cost, thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish between structural features differentiating members of the same family of proteins with similar structure predicted alike by all other top methods.
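
"Protein structure in 2D" refers to representing a structure as a matrix of inter-residue distances. The sketch below shows how such a matrix, and a binary contact map derived from it, can be computed from 3D coordinates. It is an illustration under stated assumptions: the coordinates are invented, one point per residue is used, and the 8 Å contact threshold is a common convention rather than a value taken from the abstract.

```python
import math


def distance_map(coords):
    """Symmetric matrix of pairwise Euclidean distances between residues."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(dmap, threshold=8.0):
    """Binary matrix: 1 where two residues are closer than the threshold."""
    return [[1 if d < threshold else 0 for d in row] for row in dmap]


# Hypothetical straight chain of four residues spaced 3.8 Angstrom apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

dmap = distance_map(coords)
cmap = contact_map(dmap)
for row in cmap:
    print(row)
```

A network trained on this representation predicts the matrix entries from sequence-derived features, and a 3D model can then be reconstructed from the predicted distances.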

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.
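
Parameter-mapping sonification means mapping a property of each data item onto a sound parameter such as pitch. The sketch below maps each residue's hydrophobicity to a MIDI note number. It is not one of the five algorithms from the paper: the use of the Kyte-Doolittle hydropathy scale, the note range 48 to 84, and the function name `sonify` are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

LOW_NOTE, HIGH_NOTE = 48, 84  # hypothetical MIDI range (three octaves)
MIN_H, MAX_H = -4.5, 4.5      # extremes of the hydropathy scale


def sonify(sequence):
    """Map each residue to a MIDI pitch: more hydrophobic means higher pitch."""
    notes = []
    for residue in sequence.upper():
        h = HYDROPATHY[residue]  # raises KeyError for non-standard residues
        fraction = (h - MIN_H) / (MAX_H - MIN_H)
        notes.append(round(LOW_NOTE + fraction * (HIGH_NOTE - LOW_NOTE)))
    return notes


print(sonify("MKVLAI"))
```

The resulting note list could be passed to any MIDI library for playback; a hydrophobic stretch would then be heard as a run of high notes.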

Protein Structure Prediction Based on Improved Genetic Algorithm

International Journal of Environmental Science and Development ◽

10.18178/ijesd.2020.11.9.1289 ◽

2020 ◽

Vol 11 (9) ◽

pp. 450-454

Author(s):

Jiaxi Liu ◽

Keyword(s):

Genetic Algorithm ◽

Protein Structure ◽

Amino Acid ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Protein Structures ◽

Three Dimensional ◽

Dimensional Structure ◽

Improved Genetic Algorithm ◽

Research Areas

The prediction of protein three-dimensional structure from amino acid sequence has been a challenging problem in bioinformatics, owing to the many potential applications for robust protein structure prediction methods. Protein structure prediction is essential to bioscience, and its research results are important for other research areas. Methods for the prediction and design of protein structures have advanced dramatically. The prediction of protein structure based on average hydrophobic values is discussed, and an improved genetic algorithm is proposed to solve the optimization problem of hydrophobic protein structure prediction. An adjustment operator is designed with the average hydrophobic value to prevent the overlapping of amino acid positions. Finally, some numerical experiments are conducted to verify the feasibility and effectiveness of the proposed algorithm by comparison with the traditional HNN algorithm.
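
The kind of fitness function a genetic algorithm optimises in hydrophobic structure prediction can be sketched with the standard 2D HP lattice model. The abstract does not specify the paper's model or its adjustment operator, so everything here is an assumption: the lattice encoding, the move alphabet, and the function names. The overlap check shows the constraint such an operator is meant to enforce.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_to_coords(moves):
    """Place residues on a 2D lattice by following a string of unit moves."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def is_self_avoiding(coords):
    """True when no two residues occupy the same lattice site."""
    return len(set(coords)) == len(coords)


def hp_energy(sequence, moves):
    """Energy of a fold: -1 per non-consecutive H-H lattice contact.

    Returns None for invalid folds (wrong length or overlapping residues).
    """
    coords = fold_to_coords(moves)
    if len(coords) != len(sequence) or not is_self_avoiding(coords):
        return None
    position = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts each neighbouring pair once.
        for dx, dy in ((1, 0), (0, 1)):
            j = position.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


print(hp_energy("HPPH", "RUL"))  # compact fold bringing the two H residues together
print(hp_energy("HPPH", "RRR"))  # fully extended chain
print(hp_energy("HPPH", "RLR"))  # chain steps back onto itself
```

A genetic algorithm would treat the move string as the chromosome, apply crossover and mutation to it, and use this energy (lower is better) to select folds, repairing or discarding those for which the overlap check fails.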
