Improving Generalizability of Protein Sequence Models with Data Augmentations

Mapping Intimacies ◽

10.1101/2021.02.18.431877 ◽

2021 ◽

Author(s):

Hongyu Shen ◽

Layne C. Price ◽

Taha Bahadori ◽

Franziska Seeger

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Data Augmentation ◽

Sequence Data ◽

Protein Sequences ◽

Representation Learning ◽

Amino Acid Replacement ◽

Fine Tuning ◽

Protein Sequence Data ◽

Tuning Methods

Abstract

While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein’s function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.
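The string manipulations named in this abstract (amino acid replacement, sub-region sampling, whole-sequence shuffling) might look roughly like the Python sketch below. It is an illustration only, not the authors' implementation: the `SIMILAR` substitution table and all function names are hypothetical, and plain cropping stands in for the attention restriction described in the abstract.

```python
import random

# Hypothetical substitution table pairing chemically similar residues.
# The abstract does not specify which replacements were used.
SIMILAR = {
    "A": "V", "V": "A", "S": "T", "T": "S", "F": "Y", "Y": "F",
    "L": "I", "I": "L", "D": "E", "E": "D", "K": "R", "R": "K",
    "N": "Q", "Q": "N",
}


def replace_amino_acids(seq, p, rng):
    """Swap each residue for its listed partner with probability p."""
    return "".join(
        SIMILAR.get(aa, aa) if rng.random() < p else aa for aa in seq
    )


def random_subsequence(seq, length, rng):
    """Return a randomly positioned contiguous window of the sequence."""
    if length >= len(seq):
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def shuffle_sequence(seq, rng):
    """Randomly permute all residues (an information-destroying augmentation)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same output
    example = "MKTAYIAKQRQISFVKSHFSRQ"
    print(replace_amino_acids(example, 0.2, rng))
    print(random_subsequence(example, 10, rng))
    print(shuffle_sequence(example, rng))
```

Passing an explicit `random.Random` instance keeps each augmentation reproducible, which helps when comparing fine-tuning runs that should differ only in the augmentation applied.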

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

Application of different machine learning techniques in identifying features of protein sequence data

2016 1st India International Conference on Information Processing (IICIP) ◽

10.1109/iicip.2016.7975376 ◽

2016 ◽

Author(s):

Swati Mishra ◽

Mukesh Kumar ◽

Santanu Kumar Rath

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Sequence Data ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Protein Sequence Data

FEGS: a novel feature extraction model for protein sequences and its applications

BMC Bioinformatics ◽

10.1186/s12859-021-04223-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zengchao Mu ◽

Ting Yu ◽

Xiaoping Liu ◽

Hongyu Zheng ◽

Leyi Wei ◽

...

Keyword(s):

Feature Extraction ◽

Protein Sequence ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Statistical Features ◽

Research Areas ◽

Protein Functions ◽

Protein Sequence Data ◽

Extraction Model

Abstract

Background: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions.

Results: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model for protein sequences. FEGS combines a new technique for the graphical representation of protein sequences, based on the physicochemical properties of amino acids, with the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods.

Conclusion: The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Abstract

Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research.

Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters.

Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.

An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Algorithms ◽

10.3390/a14020059 ◽

2021 ◽

Vol 14 (2) ◽

pp. 59

Author(s):

Roman Zoun ◽

Kay Schallert ◽

David Broneske ◽

Ivayla Trifonova ◽

Xiao Chen ◽

...

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Memory Consumption ◽

Mass Spectrometers ◽

Protein Sequence Data ◽

Relational Systems ◽

Spectrometer Data ◽

Database Engine ◽

High Storage

Mass spectrometers enable the identification of proteins in biological samples, leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, an HDDmap, a database engine, and an extended radix tree, and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
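For readers unfamiliar with the data structure, a plain radix tree (a prefix tree whose edges carry whole substrings) can be sketched in Python as follows. This is a generic textbook version, not the "extended" variant the authors propose, and the class names and protein identifiers are hypothetical.

```python
class RadixNode:
    def __init__(self):
        # Maps the first character of an edge label to (label, child node).
        self.children = {}
        # Identifiers of every sequence that ends exactly at this node.
        self.values = []


class RadixTree:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, key, value):
        """Store value under key, splitting an edge where keys diverge."""
        node = self.root
        while key:
            entry = node.children.get(key[0])
            if entry is None:
                # No edge shares a prefix with key: attach the rest as a leaf.
                leaf = RadixNode()
                node.children[key[0]] = (key, leaf)
                node = leaf
                break
            label, child = entry
            # Length of the common prefix of the edge label and the key.
            n = 0
            while n < len(label) and n < len(key) and label[n] == key[n]:
                n += 1
            if n < len(label):
                # Split the edge at the point of divergence.
                middle = RadixNode()
                middle.children[label[n]] = (label[n:], child)
                node.children[key[0]] = (label[:n], middle)
                child = middle
            node = child
            key = key[n:]
        node.values.append(value)

    def lookup(self, key):
        """Return the identifiers stored under key (empty list if absent)."""
        node = self.root
        while key:
            entry = node.children.get(key[0])
            if entry is None:
                return []
            label, child = entry
            if not key.startswith(label):
                return []
            key = key[len(label):]
            node = child
        return list(node.values)


if __name__ == "__main__":
    tree = RadixTree()
    tree.insert("MKVLAA", "P1")  # hypothetical protein identifiers
    tree.insert("MKVLGG", "P2")
    tree.insert("MKVLAA", "P3")
    print(tree.lookup("MKVLAA"))
```

Because shared prefixes are stored once, sequences that begin with the same residues reuse the same edges, which is the general reason prefix-compressed trees can reduce memory use for redundant sequence collections.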

Middle Pleistocene protein sequences from the rhinoceros genus Stephanorhinus and the phylogeny of extant and extinct Middle/Late Pleistocene Rhinocerotidae

PeerJ ◽

10.7717/peerj.3033 ◽

2017 ◽

Vol 5 ◽

pp. e3033 ◽

Cited By ~ 26

Author(s):

Frido Welker ◽

Geoff M. Smith ◽

Jarod M. Hutson ◽

Lutz Kindler ◽

Alejandro Garcia-Moreno ◽

...

Keyword(s):

Mass Spectrometry ◽

Phylogenetic Analysis ◽

Late Pleistocene ◽

Phylogenetic Relationships ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Middle Pleistocene ◽

Extant Species ◽

Protein Sequence Data

Background: Ancient protein sequences are increasingly used to elucidate the phylogenetic relationships between extinct and extant mammalian taxa. Here, we apply these recent developments to Middle Pleistocene bone specimens of the rhinoceros genus Stephanorhinus. No biomolecular sequence data is currently available for this genus, leaving phylogenetic hypotheses on its evolutionary relationships to extant and extinct rhinoceroses untested. Furthermore, recent phylogenies based on Rhinocerotidae (partial or complete) mitochondrial DNA sequences differ in the placement of the Sumatran rhinoceros (Dicerorhinus sumatrensis). Therefore, studies utilising ancient protein sequences from Middle Pleistocene contexts have the potential to provide further insights into the phylogenetic relationships between extant and extinct species, including Stephanorhinus and Dicerorhinus.

Methods: ZooMS screening (zooarchaeology by mass spectrometry) was performed on several Late and Middle Pleistocene specimens from the genus Stephanorhinus, subsequently followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) to obtain ancient protein sequences from a Middle Pleistocene Stephanorhinus specimen. We performed parallel analysis on a Late Pleistocene woolly rhinoceros specimen and extant species of rhinoceroses, resulting in the availability of protein sequence data for five extant species and two extinct genera. Phylogenetic analysis additionally included all extant Perissodactyla genera (Equus, Tapirus), and was conducted using Bayesian (MrBayes) and maximum-likelihood (RAxML) methods.

Results: Various ancient proteins were identified in both the Middle and Late Pleistocene rhinoceros samples. Protein degradation and proteome complexity are consistent with an endogenous origin of the identified proteins. Phylogenetic analysis of informative proteins resolved the Perissodactyla phylogeny in agreement with previous studies in regards to the placement of the families Equidae, Tapiridae, and Rhinocerotidae. Stephanorhinus is shown to be most closely related to the genera Coelodonta and Dicerorhinus. The protein sequence data further places the Sumatran rhino in a clade together with the genus Rhinoceros, opposed to forming a clade with the black and white rhinoceros species.

Discussion: The first biomolecular dataset available for Stephanorhinus places this genus together with the extinct genus Coelodonta and the extant genus Dicerorhinus. This is in agreement with morphological studies, although we are unable to resolve the order of divergence between these genera based on the protein sequences available. Our data supports the placement of the genus Dicerorhinus in a clade together with extant Rhinoceros species. Finally, the availability of protein sequence data for both extinct European rhinoceros genera allows future investigations into their geographic distribution and extinction chronologies.

Human Protein Sequence Classification using Machine Learning and Statistical Classification Techniques

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b3224.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 3591-3599

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Protein Function ◽

Sequence Data ◽

Human Protein ◽

Statistical Classification ◽

Classification Technique ◽

Unknown Protein ◽

Protein Sequence Data

In computational biology, either profile-based or sequence-based protein data is used to obtain meaningful and accurate features for protein function prediction. Predicting the enzyme class of an unknown protein is currently an active research topic, and machine learning and statistical classification techniques have been applied to it. In this article, we use six different machine learning and statistical classification techniques (CRT, QUEST, CHAID, C5.0, ANN and SVM) to classify 4314 human protein sequences. These data are extracted from the UniProtKB databank with the help of the PROFEAT server and are categorized into seven different classes. To handle the high-dimensional protein sequence data, which has some missing values, SPSS is used for classification and for estimating the performance of each classification technique. The experimental results highlight that the data for classes C4, C5, C6 and C7 are imbalanced, which affects the overall performance of the classification techniques. This article provides an extensive comparative analysis of different classification techniques on sequence-based protein data. The experimental analysis highlights that the SVM and C5.0 classification techniques give better results than the others and can be used for protein classification and prediction.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.
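The general idea of measuring sequence similarity with a compressor can be illustrated with the normalized compression distance (NCD). The Python sketch below is an assumption-laden stand-in: it uses the general-purpose `zlib` compressor rather than AC2, and the abstract does not state which similarity measure the authors actually compute.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower values mean more similar."""
    size_a = compressed_size(a.encode("ascii"))
    size_b = compressed_size(b.encode("ascii"))
    size_ab = compressed_size((a + b).encode("ascii"))
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


if __name__ == "__main__":
    # Made-up toy sequences, used only to show the call pattern.
    query = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS"
    candidate = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHT"
    print(ncd(query, candidate))
```

A specialized protein compressor would be expected to give sharper estimates than `zlib`, since the distance is only as good as the compressor's model of the data.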

Coupling Between Protein Level Selection and Codon Usage Optimization in the Evolution of Bacteria and Archaea

mBio ◽

10.1128/mbio.00956-14 ◽

2014 ◽

Vol 5 (2) ◽

Cited By ~ 25

Author(s):

Wenqi Ran ◽

David M. Kristensen ◽

Eugene V. Koonin

Keyword(s):

Codon Usage ◽

Protein Level ◽

Codon Usage Bias ◽

Protein Sequence ◽

Gc Content ◽

Protein Sequences ◽

Microbial Evolution ◽

Fine Tuning ◽

Selection For ◽

Genomic Gc Content

ABSTRACT: The relationship between the selection affecting codon usage and selection on protein sequences of orthologous genes in diverse groups of bacteria and archaea was examined by using the Alignable Tight Genome Clusters database of prokaryote genomes. The codon usage bias is generally low, with 57.5% of the gene-specific optimal codon frequencies (F_opt) being below 0.55. This apparent weak selection on codon usage contrasts with the strong purifying selection on amino acid sequences, with 65.8% of the gene-specific dN/dS ratios being below 0.1. For most of the genomes compared, a limited but statistically significant negative correlation between F_opt and dN/dS was observed, which is indicative of a link between selection on protein sequence and selection on codon usage. The strength of the coupling between the protein level selection and codon usage bias showed a strong positive correlation with the genomic GC content. Combined with previous observations on the selection for GC-rich codons in bacteria and archaea with GC-rich genomes, these findings suggest that selection for translational fine-tuning could be an important factor in microbial evolution that drives the evolution of genome GC content away from mutational equilibrium. This type of selection is particularly pronounced in slowly evolving, “high-status” genes. A significantly stronger link between the two aspects of selection is observed in free-living bacteria than in parasitic bacteria and in genes encoding metabolic enzymes and transporters than in informational genes. These differences might reflect the special importance of translational fine-tuning for the adaptability of gene expression to environmental changes. The results of this work establish the coupling between protein level selection and selection for translational optimization as a distinct and potentially important factor in microbial evolution.

IMPORTANCE: Selection affects the evolution of microbial genomes at many levels, including both the structure of proteins and the regulation of their production. Here we demonstrate the coupling between the selection on protein sequences and the optimization of codon usage in a broad range of bacteria and archaea. The strength of this coupling varies over a wide range and strongly and positively correlates with the genomic GC content. The cause(s) of the evolution of high GC content is a long-standing open question, given the universal mutational bias toward AT. We propose that optimization of codon usage could be one of the key factors that determine the evolution of GC-rich genomes. This work establishes the coupling between selection at the level of protein sequence and at the level of codon choice optimization as a distinct aspect of genome evolution.
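As a rough illustration of a gene-specific optimal codon frequency, the Python sketch below counts the fraction of a coding sequence's codons that fall in a given set of optimal codons. It assumes the simplest possible definition (all complete codons counted, none excluded); the abstract does not say how the authors define or compute the quantity, and the codon set in the example is made up.

```python
def optimal_codon_frequency(cds, optimal_codons):
    """Fraction of complete codons in cds that belong to optimal_codons.

    Assumes cds is an in-frame coding sequence. A trailing partial codon
    is ignored. Which codons count as optimal is organism-specific and
    must be supplied by the caller.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    hits = sum(1 for codon in codons if codon in optimal_codons)
    return hits / len(codons)


if __name__ == "__main__":
    # Hypothetical optimal codon set for demonstration only.
    optimal = {"GCT", "AAA", "CTG"}
    print(optimal_codon_frequency("ATGGCTAAACTGGCC", optimal))
```

Published definitions often exclude codons for amino acids with a single codon, as well as stop codons, so values from this sketch are not directly comparable to reported figures.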

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Abstract

Summary: When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when associated together. GEMPROT is a new analysis tool that takes a phased VCF file and visualizes the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence.

Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/.

Supplementary information: Supplementary data are available at Bioinformatics online.
