Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
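To make "shortest possible tree" concrete, the following is a minimal sketch of how the length of one fixed tree can be counted. It assumes Fitch's small-parsimony counting on a rooted binary tree, which is a standard technique and not necessarily the procedure used in the paper. The names `fitch_site` and `tree_length`, the nested-tuple tree encoding, and the toy sequences are hypothetical.

```python
# Sketch only: Fitch small-parsimony counting, assumed here for illustration.
# A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).

def fitch_site(tree, states):
    """Return (candidate state set, number of changes) for one alignment column."""
    if isinstance(tree, str):
        # A leaf contributes its observed state and no changes.
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, seqs):
    """Sum the per-column change counts over equal-length aligned sequences."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, column)[1]
    return total

# Hypothetical toy data: four taxa, three nucleotide sites.
toy = {"a": "AAG", "b": "AAG", "c": "ACT", "d": "ACT"}
good_tree = (("a", "b"), ("c", "d"))
poor_tree = (("a", "c"), ("b", "d"))
```

On the toy data, `tree_length(good_tree, toy)` is smaller than `tree_length(poor_tree, toy)`, so the first topology would be preferred. A search for a minimal tree would compare such lengths across candidate topologies.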


AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.
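The idea of measuring similarity through compression can be sketched as follows. This is an illustration only: it assumes the normalized compression distance (NCD) as the similarity measure and uses Python's general-purpose `zlib` as a stand-in compressor. The abstract does not state which measure AC2 uses, and `zlib` is not AC2. The function names and the toy sequences are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in for a protein compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values suggest more shared content."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical toy sequences, not real proteins.
seq_a = "MKTAYIAKQR" * 20
seq_b = "GHWLVPNDSCEF" * 20
```

A sequence compared with itself should score lower than a sequence compared with unrelated content, which is the property a compression-based similarity ranking relies on.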


FEGS: a novel feature extraction model for protein sequences and its applications

BMC Bioinformatics ◽

10.1186/s12859-021-04223-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zengchao Mu ◽

Ting Yu ◽

Xiaoping Liu ◽

Hongyu Zheng ◽

Leyi Wei ◽

...

Keyword(s):

Feature Extraction ◽

Protein Sequence ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Statistical Features ◽

Research Areas ◽

Protein Functions ◽

Protein Sequence Data ◽

Extraction Model

Background: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. Results: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. Conclusion: The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in the related studies of protein sequence analyses.
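As an illustration of what "statistical features" of a protein sequence can look like, here is a minimal sketch that assumes amino acid frequencies and dipeptide frequencies as the features. This is an assumption for illustration, not the FEGS definition, and it covers neither the graphical features nor the full 578 dimensions. The name `composition_features` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def composition_features(seq):
    """Return 20 residue frequencies followed by 400 dipeptide frequencies."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    aa_freq = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    # Overlapping adjacent pairs; a one-residue sequence has none.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n_pairs = max(len(pairs), 1)
    dp_freq = [pairs.count(d) / n_pairs for d in DIPEPTIDES]
    return aa_freq + dp_freq
```

Each sequence becomes a fixed-length vector (420 values in this sketch) regardless of its length, which is what allows sequences to be compared with ordinary distance measures.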


Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.


Using Genome-Wide Protein Sequence Data to Predict Amino Acid Conservation

The Protein Journal ◽

10.1007/s10930-008-9150-3 ◽

2008 ◽

Vol 27 (6) ◽

pp. 401-407 ◽

Cited By ~ 2

Author(s):

Peter Palenchar ◽

Mathew Mount ◽

Douglas Cusato ◽

Jeffery Dougherty

Keyword(s):

Amino Acid ◽

Protein Sequence ◽

Sequence Data ◽

Genome Wide ◽

Protein Sequence Data ◽

Amino Acid Conservation


Many protein products from a few loci: assignment of human salivary proline-rich proteins to specific loci.

Genetics ◽

10.1093/genetics/120.1.255 ◽

1988 ◽

Vol 120 (1) ◽

pp. 255-265 ◽

Cited By ~ 2

Author(s):

K M Lyons ◽

E A Azen ◽

P A Goodman ◽

O Smithies

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Proteolytic Cleavage ◽

Amino Acid Sequences ◽

Null Alleles ◽

Cleavage Sites ◽

Genetic Studies ◽

Protein Sequence Data ◽

Po Protein ◽

Six Genes

Earlier studies of protein polymorphisms led to the description of 13 linked loci thought to encode the human salivary proline-rich proteins (PRPs). However, more recent studies at the DNA level have shown that there are only six genes which encode PRPs. The present study was undertaken in order to reconcile these observations. Nucleotide and decoded amino acid sequences from each of the six genes were compared with the available protein sequence data for PRPs. This analysis allowed assignment of the PmF, PmS and Pe proteins to the PRB1 locus, the Gl protein to the PRB3 locus, the Po protein to the PRB4 locus, the Ps protein to the PRB2 locus, and the CON1 and CON2 proteins to the PRB4 locus. Correlations between insertion/deletion RFLPs and PRP protein phenotypes were observed for the PmF, PmS, Gl and CON2 proteins. Our overall analysis indicates that in many instances several proteins previously considered to be the products of separate loci are actually proteolytic cleavage products of a large precursor specified by one or other of the six genes identified at the DNA level. Our analysis also demonstrates that some of the "null" alleles proposed to occur at 11 of the 13 loci in the earlier genetic studies are actually productive alleles having alterations at proteolytic cleavage sites within the relevant precursor protein. The absence of cleavage leads to the persistence of longer precursor peptides not resolved electrophoretically, concurrently with an absence of the smaller PRPs seen when cleavage occurs.


An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Algorithms ◽

10.3390/a14020059 ◽

2021 ◽

Vol 14 (2) ◽

pp. 59

Author(s):

Roman Zoun ◽

Kay Schallert ◽

David Broneske ◽

Ivayla Trifonova ◽

Xiao Chen ◽

...

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Memory Consumption ◽

Mass Spectrometers ◽

Protein Sequence Data ◽

Relational Systems ◽

Spectrometer Data ◽

Database Engine ◽

High Storage

Mass spectrometers enable the identification of proteins in biological samples, leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, HDDmap, database engine, and extended radix tree and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
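The simplest of the transformation strategies named above, an in-memory map, can be sketched as follows. This is a minimal illustration of reading FASTA records into a dictionary, under the assumption that the first whitespace-delimited token of each header line serves as the key. It is not the authors' implementation and does not show the extended radix tree or the columnar schema. The name `parse_fasta` and the sample records are hypothetical.

```python
def parse_fasta(lines):
    """Map each record identifier to its full sequence.

    `lines` is any iterable of text lines, for example an open file.
    """
    records = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            current_id = line[1:].split()[0]
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records

# Hypothetical sample input.
sample = [">sp|P1 example protein", "MKT", "AYI", "", ">sp|P2", "GG"]
```

A plain dictionary holds every sequence in memory at once, which is the kind of memory cost the abstract's comparison of strategies is concerned with.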


Improving Generalizability of Protein Sequence Models with Data Augmentations

10.1101/2021.02.18.431877 ◽

2021 ◽

Author(s):

Hongyu Shen ◽

Layne C. Price ◽

Taha Bahadori ◽

Franziska Seeger

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Data Augmentation ◽

Sequence Data ◽

Protein Sequences ◽

Representation Learning ◽

Amino Acid Replacement ◽

Fine Tuning ◽

Protein Sequence Data ◽

Tuning Methods

While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein’s function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.
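The string manipulations mentioned above (amino acid replacement, random sub-regions, and whole-sequence shuffling) can be sketched as plain Python functions. This is an illustration under stated assumptions: the `SIMILAR` substitution table is a hypothetical pairing of residues with broadly similar chemistry, the function names are invented, and the sub-region function simply crops the string, whereas the paper describes restricting Transformer attention to a sub-region.

```python
import random

# Hypothetical table pairing residues with broadly similar chemistry.
SIMILAR = {
    "A": "V", "V": "A", "S": "T", "T": "S", "F": "Y", "Y": "F",
    "K": "R", "R": "K", "D": "E", "E": "D", "N": "Q", "Q": "N",
    "I": "L", "L": "I",
}

def replace_residues(seq, p, rng):
    """Swap each residue for its listed partner with probability p."""
    return "".join(SIMILAR.get(aa, aa) if rng.random() < p else aa for aa in seq)

def shuffle_sequence(seq, rng):
    """Randomly permute the residues, keeping composition but not order."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def random_subregion(seq, length, rng):
    """Return a contiguous window of the given length at a random offset."""
    if not 0 < length <= len(seq):
        raise ValueError("length must be between 1 and len(seq)")
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]

# Usage with a seeded generator so runs are repeatable.
rng = random.Random(0)
example = "MKTAYIAKQRQISFVK"
augmented = replace_residues(example, 0.3, rng)
```

Passing an explicit `random.Random` instance rather than using the module-level generator keeps each augmentation reproducible, which matters when comparing many training configurations.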


Middle Pleistocene protein sequences from the rhinoceros genus Stephanorhinus and the phylogeny of extant and extinct Middle/Late Pleistocene Rhinocerotidae

PeerJ ◽

10.7717/peerj.3033 ◽

2017 ◽

Vol 5 ◽

pp. e3033 ◽

Cited By ~ 26

Author(s):

Frido Welker ◽

Geoff M. Smith ◽

Jarod M. Hutson ◽

Lutz Kindler ◽

Alejandro Garcia-Moreno ◽

...

Keyword(s):

Mass Spectrometry ◽

Phylogenetic Analysis ◽

Late Pleistocene ◽

Phylogenetic Relationships ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Middle Pleistocene ◽

Extant Species ◽

Protein Sequence Data

Background: Ancient protein sequences are increasingly used to elucidate the phylogenetic relationships between extinct and extant mammalian taxa. Here, we apply these recent developments to Middle Pleistocene bone specimens of the rhinoceros genus Stephanorhinus. No biomolecular sequence data is currently available for this genus, leaving phylogenetic hypotheses on its evolutionary relationships to extant and extinct rhinoceroses untested. Furthermore, recent phylogenies based on Rhinocerotidae (partial or complete) mitochondrial DNA sequences differ in the placement of the Sumatran rhinoceros (Dicerorhinus sumatrensis). Therefore, studies utilising ancient protein sequences from Middle Pleistocene contexts have the potential to provide further insights into the phylogenetic relationships between extant and extinct species, including Stephanorhinus and Dicerorhinus. Methods: ZooMS screening (zooarchaeology by mass spectrometry) was performed on several Late and Middle Pleistocene specimens from the genus Stephanorhinus, followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) to obtain ancient protein sequences from a Middle Pleistocene Stephanorhinus specimen. We performed parallel analysis on a Late Pleistocene woolly rhinoceros specimen and extant species of rhinoceroses, resulting in the availability of protein sequence data for five extant species and two extinct genera. Phylogenetic analysis additionally included all extant Perissodactyla genera (Equus, Tapirus), and was conducted using Bayesian (MrBayes) and maximum-likelihood (RAxML) methods. Results: Various ancient proteins were identified in both the Middle and Late Pleistocene rhinoceros samples. Protein degradation and proteome complexity are consistent with an endogenous origin of the identified proteins. Phylogenetic analysis of informative proteins resolved the Perissodactyla phylogeny in agreement with previous studies with regard to the placement of the families Equidae, Tapiridae, and Rhinocerotidae. Stephanorhinus is shown to be most closely related to the genera Coelodonta and Dicerorhinus. The protein sequence data further places the Sumatran rhino in a clade together with the genus Rhinoceros, as opposed to forming a clade with the black and white rhinoceros species. Discussion: The first biomolecular dataset available for Stephanorhinus places this genus together with the extinct genus Coelodonta and the extant genus Dicerorhinus. This is in agreement with morphological studies, although we are unable to resolve the order of divergence between these genera based on the protein sequences available. Our data supports the placement of the genus Dicerorhinus in a clade together with extant Rhinoceros species. Finally, the availability of protein sequence data for both extinct European rhinoceros genera allows future investigations into their geographic distribution and extinction chronologies.


Phylogeny of Firmicutes with special reference to Mycoplasma (Mollicutes) as inferred from phosphoglycerate kinase amino acid sequence data

International Journal of Systematic and Evolutionary Microbiology ◽

10.1099/ijs.0.02868-0 ◽

2004 ◽

Vol 54 (3) ◽

pp. 871-875 ◽

Cited By ~ 75

Author(s):

Matthias Wolf ◽

Tobias Müller ◽

Thomas Dandekar ◽

J. Dennis Pollack

Keyword(s):

Amino Acid ◽

Phylogenetic Trees ◽

Gene Sequence ◽

Sequence Data ◽

Phosphoglycerate Kinase ◽

Amino Acid Sequences ◽

Phylogenetic Position ◽

rRNA Gene ◽

Spiroplasma Citri ◽

Mycoplasma Mycoides

The phylogenetic position of the Mollicutes has been re-examined by using phosphoglycerate kinase (Pgk) amino acid sequences. Hitherto unpublished sequences from Mycoplasma mycoides subsp. mycoides, Mycoplasma hyopneumoniae and Spiroplasma citri were included in the analysis. Phylogenetic trees based on Pgk data indicated a monophyletic origin for the Mollicutes within the Firmicutes, whereas Bacilli (Firmicutes) and Clostridia (Firmicutes) appeared to be paraphyletic. With two exceptions, i.e. Thermotoga (Thermotogae) and Fusobacterium (Fusobacteria), which clustered within the Firmicutes, comparative analyses show that at a low taxonomic level, the resolved phylogenetic relationships that were inferred from both the Pgk protein and 16S rRNA gene sequence data are congruent.


New families in the classification of glycosyl hydrolases based on amino acid sequence similarities

Biochemical Journal ◽

10.1042/bj2930781 ◽

1993 ◽

Vol 293 (3) ◽

pp. 781-788 ◽

Cited By ~ 1335

Author(s):

B Henrissat ◽

A Bairoch

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Protein Sequence ◽

Sequence Data ◽

Data Bank ◽

The Other ◽

Glycosyl Hydrolases ◽

Protein Sequence Data ◽

Sequence Similarities

301 glycosyl hydrolases and related enzymes corresponding to 39 EC entries of the I.U.B. classification system have been classified into 35 families on the basis of amino-acid-sequence similarities [Henrissat (1991) Biochem. J. 280, 309-316]. Approximately half of the families were found to be monospecific (containing only one EC number), whereas the other half were found to be polyspecific (containing at least two EC numbers). A > 60% increase in sequence data for glycosyl hydrolases (181 additional enzyme or enzyme-domain sequences have since become available) allowed us to update the classification not only by the addition of more members to already identified families, but also by the finding of ten new families. On the basis of a comparison of 482 sequences corresponding to 52 EC entries, 45 families, out of which 22 are polyspecific, can now be defined. This classification has been implemented in the SWISS-PROT protein sequence data bank.
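The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract; the variable names are hypothetical, and the count of monospecific families is derived on the assumption that every family is either monospecific or polyspecific.

```python
# Figures taken from the abstract.
previous_sequences = 301
additional_sequences = 181
previous_families = 35
new_families = 10
polyspecific_families = 22

total_sequences = previous_sequences + additional_sequences
relative_increase = additional_sequences / previous_sequences
total_families = previous_families + new_families
# Derived value, assuming the two categories cover all families.
monospecific_families = total_families - polyspecific_families

print(total_sequences, round(relative_increase * 100, 1), total_families, monospecific_families)
```

The totals agree with the abstract's 482 sequences and 45 families, and the relative increase of about 60.1% matches the stated "> 60%".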
