Parallel and Scalable Precise Clustering for Homologous Protein Discovery

This paper presents a new parallel implementation of clustering and demonstrates its utility in greatly speeding up the identification of homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of n elements such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison within each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full O(n²) comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604× on 768 cores (1400× faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.
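
The definition of precise clustering given in the abstract can be made concrete with a small check. The sketch below is not ClusterMerge; it only tests whether a given set of clusters is precise with respect to a brute-force comparison, and counts the comparisons each approach needs. The `identity` function, the threshold, and the toy sequences are assumptions made for illustration and stand in for a real alignment-based similarity test.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def similar_pairs(seqs, members, threshold):
    """All-to-all comparison restricted to the given member indices."""
    return {
        (i, j)
        for i, j in combinations(sorted(members), 2)
        if identity(seqs[i], seqs[j]) >= threshold
    }


def is_precise(seqs, clusters, threshold):
    """True if every similar pair appears together in at least one cluster."""
    truth = similar_pairs(seqs, range(len(seqs)), threshold)
    found = set()
    for cluster in clusters:
        found |= similar_pairs(seqs, cluster, threshold)
    return truth <= found


def comparisons(clusters):
    """Number of pairwise comparisons for all-to-all within each cluster."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


# Hypothetical data: indices 0-2 form one family, 3-4 another.
SEQS = ["MKVLAA", "MKVLAG", "MKVIAG", "QQQQQQ", "QQQQQW"]
THRESHOLD = 0.8
precise = [[0, 1, 2], [3, 4]]   # keeps the similar pair (1, 2) together
lossy = [[0, 1], [2], [3, 4]]   # separates the similar pair (1, 2)

print(is_precise(SEQS, precise, THRESHOLD), comparisons(precise))
print(is_precise(SEQS, lossy, THRESHOLD), comparisons(lossy))
print(comparisons([list(range(len(SEQS)))]))  # full all-to-all baseline
```

In this toy case the precise clustering needs 4 comparisons against 10 for the full set, while the second clustering is cheaper still but loses a similar pair.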

Computational Analysis of Therapeutic Enzyme Uricase from Different Source Organisms

Current Proteomics ◽

10.2174/1570164616666190617165107 ◽

2020 ◽

Vol 17 (1) ◽

pp. 59-77

Author(s):

Anand Kumar Nelapati ◽

JagadeeshBabu PonnanEttiyappan

Keyword(s):

Uric Acid ◽

Amino Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Sequences ◽

Amino Acid Sequences ◽

Amino Acid Residues ◽

Multiple Sequence ◽

Physiochemical Properties ◽

Pharmaceutical Industries

Background: Hyperuricemia and gout are conditions that result from the accumulation of uric acid in the blood and urine. Uric acid is the product of the purine metabolic pathway in humans. Uricase is a therapeutic enzyme that enzymatically reduces the concentration of uric acid in serum and urine by converting it into the more soluble allantoin. Uricases are widely available from several sources, such as bacteria, fungi, yeast, plants and animals. Objective: The present study aims to elucidate the structure and physicochemical properties of uricase by in silico analysis. Methods: A total of sixty amino acid sequences of uricase from different sources were obtained from NCBI, and analyses including Multiple Sequence Alignment (MSA), homology search, phylogenetic relation, motif search, domain architecture and physicochemical properties (including pI, EC, Ai and Ii) were performed. Results: Multiple sequence alignment of all the selected protein sequences exhibited distinct differences between bacterial, fungal, plant and animal sources based on the position-specific existence of conserved amino acid residues. The maximum homology of all the selected protein sequences is between 51-388. Within individual categories, homology is between 16-337 for bacterial uricase, 14-339 for fungal uricase, 12-317 for plant uricase, and 37-361 for animal uricase. The phylogenetic tree constructed from the amino acid sequences disclosed clusters indicating that the uricases are from different sources. The physicochemical features revealed that uricase sequences contain between 300 and 338 amino acid residues, with a molecular weight of 33-39 kDa and a theoretical pI ranging from 4.95 to 8.88. The amino acid composition results showed that valine has a high average frequency of 8.79% compared to the other amino acids in all analyzed species. Conclusion: In the field of bioinformatics, this work may be informative and a stepping-stone for other researchers to get an idea about the physicochemical features, evolutionary history and structural motifs of uricase, which can be widely used in the biotechnological and pharmaceutical industries. Therefore, the proposed in silico analysis can be considered for protein engineering work, as well as for gout therapy.
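
An average residue frequency such as the 8.79% reported for valine can be computed as the mean of per-sequence composition percentages. The following is a minimal sketch under that assumption; the abstract does not state which tool or averaging scheme the authors used, and the two short sequences are made up for illustration.

```python
from collections import Counter


def composition_percent(sequence: str) -> dict:
    """Percentage of each residue (one-letter code) in a single sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


def mean_frequency(sequences, residue: str) -> float:
    """Mean percentage of one residue across sequences (unweighted by length)."""
    values = [composition_percent(s).get(residue, 0.0) for s in sequences]
    return sum(values) / len(values)


# Hypothetical toy sequences, not uricase data.
toy = ["MVVA", "VKLA"]
print(mean_frequency(toy, "V"))  # 50% and 25% average to 37.5
```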

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented here progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
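
To make "shortest possible tree" concrete: the length of a tree is the minimum number of character changes needed to explain the sequences at its tips. The sketch below scores a fixed rooted binary tree with Fitch's standard small-parsimony procedure. It is not the construction or verification method of the paper; the nested-tuple tree encoding, the taxon names, and the three-site alignment are assumptions chosen for illustration.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is either a taxon name (leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for k in range(n_sites):
        states = {taxon: seq[k] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical nucleotide alignment for four taxa.
alignment = {"A": "GAT", "B": "GAC", "C": "GTC", "D": "GTT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

Here the first topology needs 3 changes and the second needs 4, so the second cannot be minimal. Proving that no other topology is shorter is the harder problem the abstract addresses.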

Molecular Analysis of pDL10 from Acidianus ambivalens Reveals a Family of Related Plasmids from Extremely Thermophilic and Acidophilic Archaea

Genetics ◽

10.1093/genetics/152.4.1307 ◽

1999 ◽

Vol 152 (4) ◽

pp. 1307-1314

Author(s):

Arnulf Kletzin ◽

Angelika Lieke ◽

Tim Urich ◽

Robert L Charlebois ◽

Christoph W Sensen

Keyword(s):

Amino Acid ◽

Regulatory Protein ◽

Amino Acid Sequences ◽

Open Reading Frames ◽

Mung Bean Nuclease ◽

Stem Loop ◽

Rolling Circle ◽

Rep Protein ◽

Homologous Protein ◽

Rep Proteins

The 7598-bp plasmid pDL10 from the extremely thermophilic, acidophilic, and chemolithoautotrophic archaeon Acidianus ambivalens was sequenced. It contains 10 open reading frames (ORFs) organized in five putative operons. The deduced amino acid sequence of the largest ORF (909 aa) showed similarity to bacterial Rep proteins known from phages and plasmids with rolling-circle (RC) replication. From the comparison of the amino acid sequences, a novel family of RC Rep proteins was defined. The pDL10 Rep protein shared 45-80% identical residues with homologous protein genes encoded by the Sulfolobus islandicus plasmids pRN1 and pRN2. Two DNA regions capable of forming extended stem-loop structures were also conserved in the three plasmids (48-69% sequence identity). In addition, a putative plasmid regulatory protein gene (plrA) was found, which was conserved among the three plasmids and the conjugative Sulfolobus plasmid pNOB8. A homolog of this gene was also found in the chromosome of S. solfataricus. Single-stranded DNA of both pDL10 strands was detected with a mung bean nuclease protection assay using PCR detection of protected fragments, giving additional evidence for an RC mechanism of replication.

FIND: Identifying Functionally and Structurally Important Features in Protein Sequences with Deep Neural Networks

10.1101/592808 ◽

2019 ◽

Author(s):

Ranjani Murali ◽

James Hemp ◽

Victoria Orphan ◽

Yonatan Bisk

Keyword(s):

Neural Networks ◽

Amino Acid ◽

Hidden Markov Models ◽

Markov Models ◽

Genomic Sequence ◽

Hidden Markov ◽

Amino Acid Sequences ◽

Homologous Proteins ◽

Biological Studies ◽

Insight Into

The ability to correctly predict the functional role of proteins from their amino acid sequences would significantly advance biological studies at the molecular level by improving our ability to understand the biochemical capability of biological organisms from their genomic sequence. Existing methods that are geared towards protein function prediction or annotation mostly use alignment-based approaches and probabilistic models such as Hidden Markov Models. In this work we introduce a deep learning architecture (Function Identification with Neural Descriptions, or FIND) which performs protein annotation from primary sequence. The accuracy of our methods matches state-of-the-art techniques, such as protein classifiers based on Hidden Markov Models. Further, our approach allows for model introspection via a neural attention mechanism, which weights parts of the amino acid sequence proportionally to their relevance for functional assignment. In this way, the attention weights automatically uncover structurally and functionally relevant features of the classified protein and find novel functional motifs in previously uncharacterized proteins. While this model is applicable to any database of proteins, we chose to apply this model to superfamilies of homologous proteins, with the aim of extracting features inherent to divergent protein families within a larger superfamily. This provided insight into the functional diversification of an enzyme superfamily and its adaptation to different physiological contexts. We tested our approach on three families (nitrogenases, cytochrome bd-type oxygen reductases and heme-copper oxygen reductases) and present a detailed analysis of the sequence characteristics identified in previously characterized proteins in the heme-copper oxygen reductase (HCO) superfamily. These are correlated with their catalytic relevance and evolutionary history. FIND was then applied to discover features in previously uncharacterized members of the HCO superfamily, providing insight into their unique sequence features. This modeling approach demonstrates the power of neural networks to recognize patterns in large datasets and can be utilized to discover biochemically and structurally important features in proteins from their amino acid sequences.
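
The idea of attention weights that rank sequence positions by relevance can be sketched in a few lines. This is not the FIND architecture, whose layers and training are not described in the abstract; it only shows how per-residue scores (here invented numbers standing in for a trained model's output) are normalized with a softmax and used to pick out the most heavily weighted residues.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_residues(sequence, scores, k):
    """Return the k positions with the largest attention weight."""
    weights = softmax(scores)
    ranked = sorted(range(len(sequence)), key=lambda i: weights[i], reverse=True)
    return [(i, sequence[i], weights[i]) for i in ranked[:k]]


# Hypothetical sequence and scores; a real model would learn the scores.
sequence = "MHCHE"
scores = [0.1, 2.0, 0.3, 1.5, 0.2]
for position, residue, weight in top_residues(sequence, scores, 2):
    print(position, residue, round(weight, 3))
```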

SIMILAR AMINO ACID SEQUENCES REVISITED

Proteins: Form and Function ◽

10.1016/b978-1-85166-512-9.50011-x ◽

1990 ◽

pp. 71-74 ◽

Cited By ~ 1

Author(s):

RUSSELL F. DOOLITTLE

Keyword(s):

Amino Acid ◽

Amino Acid Sequences ◽

Similar Amino Acid

Unevolved De Novo Proteins Have Innate Tendencies to Bind Transition Metals

Life ◽

10.3390/life9010008 ◽

2019 ◽

Vol 9 (1) ◽

pp. 8 ◽

Cited By ~ 4

Author(s):

Michael S. Wang ◽

Kenric J. Hoegler ◽

Michael H. Hecht

Keyword(s):

Amino Acid ◽

Transition Metals ◽

Metal Binding ◽

Combinatorial Library ◽

De Novo ◽

Protein Sequences ◽

Amino Acid Sequences ◽

Ancestral Sequences ◽

Wide Range ◽

Catalytic Functions

Life as we know it would not exist without the ability of protein sequences to bind metal ions. Transition metals, in particular, play essential roles in a wide range of structural and catalytic functions. The ubiquitous occurrence of metalloproteins in all organisms leads one to ask whether metal binding is an evolved trait that occurred only rarely in ancestral sequences, or alternatively, whether it is an innate property of amino acid sequences, occurring frequently in unevolved sequence space. To address this question, we studied 52 proteins from a combinatorial library of novel sequences designed to fold into 4-helix bundles. Although these sequences were neither designed nor evolved to bind metals, the majority of them have innate tendencies to bind the transition metals copper, cobalt, and zinc with high nanomolar to low-micromolar affinity.

BIOPEP-UWM Database of Bioactive Peptides: Current Opportunities

International Journal of Molecular Sciences ◽

10.3390/ijms20235978 ◽

2019 ◽

Vol 20 (23) ◽

pp. 5978 ◽

Cited By ~ 49

Author(s):

Minkiewicz ◽

Iwaniak ◽

Darewicz

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Chronic Diseases ◽

Bioactive Peptides ◽

Protein Sequences ◽

Batch Processing ◽

Amino Acid Sequences ◽

Quantitative Parameters ◽

New Information

The BIOPEP-UWM™ database of bioactive peptides (formerly BIOPEP) has recently become a popular tool in research on bioactive peptides, especially those derived from foods and forming part of diets that prevent the development of chronic diseases. The database is continuously updated and modified. The addition of new peptides and the introduction of new information about the existing ones (e.g., chemical codes and references to other databases) is in progress. New opportunities include the possibility of annotating peptides containing D-enantiomers of amino acids, a batch processing option, conversion of amino acid sequences into SMILES code, new quantitative parameters characterizing the presence of bioactive fragments in protein sequences, and finding proteinases that release particular peptides.

Reducing communication in algebraic multigrid with multi-step node aware communication

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020925535 ◽

2020 ◽

Vol 34 (5) ◽

pp. 547-561

Author(s):

Amanda Bienz ◽

William D Gropp ◽

Luke N Olson

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

Parallel Implementation ◽

Algebraic Multigrid ◽

Sparse Linear Systems ◽

Parallel Scalability ◽

Strong Scaling ◽

The Cost ◽

Communication Schedule ◽

Inter Process Communication

Algebraic multigrid (AMG) is often viewed as a scalable [Formula: see text] solver for sparse linear systems. Yet, AMG lacks parallel scalability due to increasingly large costs associated with communication, both in the initial construction of a multigrid hierarchy and in the iterative solve phase. This work introduces a parallel implementation of AMG that reduces the cost of communication, yielding improved parallel scalability. It is common in Message Passing Interface (MPI), particularly in the MPI-everywhere approach, to arrange inter-process communication so that messages are sent regardless of the location of the send and receive processes. Performance tests show notable differences in the cost of intra- and internode communication, motivating a restructuring of communication. In this case, the communication schedule takes advantage of the less costly intra-node communication, reducing both the number and the size of internode messages. Node-centric communication extends to the range of components in both the setup and solve phases of AMG, yielding an increase in the weak and strong scaling of the entire method.
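
The reduction in the number of internode messages can be illustrated by counting. The sketch below is a simplified model, not the multi-step algorithm of the paper and not MPI code: it assumes ranks are placed on nodes in contiguous blocks of `ppn` processes (a hypothetical parameter) and that all messages between the same ordered pair of nodes can be combined into one.

```python
def node_of(rank: int, ppn: int) -> int:
    """Node index of a rank, assuming contiguous blocks of ppn ranks per node."""
    return rank // ppn


def count_internode(messages, ppn):
    """Return (direct, aggregated) internode message counts.

    direct: one message per (source rank, destination rank) crossing nodes.
    aggregated: one message per ordered (source node, destination node) pair.
    """
    direct = 0
    node_pairs = set()
    for src, dst in messages:
        src_node, dst_node = node_of(src, ppn), node_of(dst, ppn)
        if src_node != dst_node:
            direct += 1
            node_pairs.add((src_node, dst_node))
    return direct, len(node_pairs)


# Hypothetical communication pattern on 2 nodes with 4 ranks each.
messages = [(0, 4), (1, 4), (2, 5), (3, 6), (0, 1), (4, 0)]
print(count_internode(messages, ppn=4))
```

In this pattern five rank-to-rank messages cross the network, but they involve only two ordered node pairs, so aggregation would send two internode messages and handle the rest with cheaper intra-node transfers.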

Nucleotide and derived amino acid sequences of a cDNA coding for pre-uteroglobin from the lung of the hare (Lepus capensis)

Biochemical Journal ◽

10.1042/bj2350895 ◽

1986 ◽

Vol 235 (3) ◽

pp. 895-898 ◽

Cited By ~ 12

Author(s):

M S López de Haro ◽

A Nieto

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Nucleotide Sequence ◽

Amino Acid Sequence ◽

Amino Acid Sequences ◽

Untranslated Regions ◽

Coding Region ◽

Homologous Proteins ◽

Lepus Capensis ◽

Rabbit Gene

An almost full-length cDNA coding for pre-uteroglobin from hare lung was cloned and sequenced. The derived amino acid sequence indicated that hare pre-uteroglobin contained 91 amino acids, including a signal peptide of 21 residues. Comparison of the nucleotide sequence of hare pre-uteroglobin cDNA with that previously reported for the rabbit gene indicated five silent point substitutions and six others leading to amino acid changes in the coding region. The untranslated regions of both pre-uteroglobin mRNAs were very similar. The amino acid changes observed are discussed in relation to the different progesterone-binding abilities of both homologous proteins.
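
The distinction between silent substitutions and those leading to amino acid changes follows from translating each codon with the standard genetic code. The sketch below classifies differing codons between two aligned coding sequences of equal length. The function name and the short example sequences are hypothetical and are not the hare or rabbit uteroglobin sequences; a codon that differs at more than one position is counted once.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(cds_a: str, cds_b: str):
    """Return (silent, replacement) counts of differing codons."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    silent = replacement = 0
    for pos in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[pos:pos + 3], cds_b[pos:pos + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1       # same amino acid, different codon
        else:
            replacement += 1  # amino acid change
    return silent, replacement


# Made-up example: GCT -> GCC keeps Ala, AAA -> AGA changes Lys to Arg.
print(classify_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```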
