An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Roman Zoun; Kay Schallert; David Broneske; Ivayla Trifonova; Xiao Chen; Robert Heyer; Dirk Benndorf; Gunter Saake

doi:10.3390/a14020059

An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Algorithms ◽

10.3390/a14020059 ◽

2021 ◽

Vol 14 (2) ◽

pp. 59

Author(s):

Roman Zoun ◽

Kay Schallert ◽

David Broneske ◽

Ivayla Trifonova ◽

Xiao Chen ◽

...

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Memory Consumption ◽

Mass Spectrometers ◽

Protein Sequence Data ◽

Relational Systems ◽

Spectrometer Data ◽

Database Engine ◽

High Storage

Mass spectrometers enable identifying proteins in biological samples leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and system that stores the protein sequences. Due to its standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, HDDmap, database engine, and extended radix tree and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds Hendy & Penny (1979) J. Mol. Evol. 13. 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

FEGS: a novel feature extraction model for protein sequences and its applications

BMC Bioinformatics ◽

10.1186/s12859-021-04223-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zengchao Mu ◽

Ting Yu ◽

Xiaoping Liu ◽

Hongyu Zheng ◽

Leyi Wei ◽

...

Keyword(s):

Feature Extraction ◽

Protein Sequence ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Statistical Features ◽

Research Areas ◽

Protein Functions ◽

Protein Sequence Data ◽

Extraction Model

Abstract Background Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. Results In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. Conclusion The FEGS method is carefully designed, which is practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in the related studies of protein sequence analyses.

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Abstract Background The use of sound to represent sequence data—sonification—has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.

Improving Generalizability of Protein Sequence Models with Data Augmentations

10.1101/2021.02.18.431877 ◽

2021 ◽

Author(s):

Hongyu Shen ◽

Layne C. Price ◽

Taha Bahadori ◽

Franziska Seeger

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Data Augmentation ◽

Sequence Data ◽

Protein Sequences ◽

Representation Learning ◽

Amino Acid Replacement ◽

Fine Tuning ◽

Protein Sequence Data ◽

Tuning Methods

AbstractWhile protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein’s function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.

Middle Pleistocene protein sequences from the rhinoceros genusStephanorhinusand the phylogeny of extant and extinct Middle/Late Pleistocene Rhinocerotidae

PeerJ ◽

10.7717/peerj.3033 ◽

2017 ◽

Vol 5 ◽

pp. e3033 ◽

Cited By ~ 26

Author(s):

Frido Welker ◽

Geoff M. Smith ◽

Jarod M. Hutson ◽

Lutz Kindler ◽

Alejandro Garcia-Moreno ◽

...

Keyword(s):

Mass Spectrometry ◽

Phylogenetic Analysis ◽

Late Pleistocene ◽

Phylogenetic Relationships ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Middle Pleistocene ◽

Extant Species ◽

Protein Sequence Data

BackgroundAncient protein sequences are increasingly used to elucidate the phylogenetic relationships between extinct and extant mammalian taxa. Here, we apply these recent developments to Middle Pleistocene bone specimens of the rhinoceros genusStephanorhinus. No biomolecular sequence data is currently available for this genus, leaving phylogenetic hypotheses on its evolutionary relationships to extant and extinct rhinoceroses untested. Furthermore, recent phylogenies based on Rhinocerotidae (partial or complete) mitochondrial DNA sequences differ in the placement of the Sumatran rhinoceros (Dicerorhinus sumatrensis). Therefore, studies utilising ancient protein sequences from Middle Pleistocene contexts have the potential to provide further insights into the phylogenetic relationships between extant and extinct species, includingStephanorhinusandDicerorhinus.MethodsZooMS screening (zooarchaeology by mass spectrometry) was performed on several Late and Middle Pleistocene specimens from the genusStephanorhinus, subsequently followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) to obtain ancient protein sequences from a Middle PleistoceneStephanorhinusspecimen. We performed parallel analysis on a Late Pleistocene woolly rhinoceros specimen and extant species of rhinoceroses, resulting in the availability of protein sequence data for five extant species and two extinct genera. Phylogenetic analysis additionally included all extant Perissodactyla genera (Equus,Tapirus), and was conducted using Bayesian (MrBayes) and maximum-likelihood (RAxML) methods.ResultsVarious ancient proteins were identified in both the Middle and Late Pleistocene rhinoceros samples. Protein degradation and proteome complexity are consistent with an endogenous origin of the identified proteins. Phylogenetic analysis of informative proteins resolved the Perissodactyla phylogeny in agreement with previous studies in regards to the placement of the families Equidae, Tapiridae, and Rhinocerotidae.Stephanorhinusis shown to be most closely related to the generaCoelodontaandDicerorhinus. The protein sequence data further places the Sumatran rhino in a clade together with the genusRhinoceros, opposed to forming a clade with the black and white rhinoceros species.DiscussionThe first biomolecular dataset available forStephanorhinusplaces this genus together with the extinct genusCoelodontaand the extant genusDicerorhinus. This is in agreement with morphological studies, although we are unable to resolve the order of divergence between these genera based on the protein sequences available. Our data supports the placement of the genusDicerorhinusin a clade together with extantRhinocerosspecies. Finally, the availability of protein sequence data for both extinct European rhinoceros genera allows future investigations into their geographic distribution and extinction chronologies.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Abstract Summary When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases as variants that are neutral on their own can become deleterious when associated together. GEMPROT is a new analysis tool that allows, from a phased vcf file, to visualize the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information Supplementary data are available at Bioinformatics online.

Phylogenetic Analysis of Protein Sequence Data Using the Randomized Axelerated Maximum Likelihood ( RAXML ) Program

Current Protocols in Molecular Biology ◽

10.1002/0471142727.mb1911s96 ◽

2011 ◽

Vol 96 (1) ◽

Cited By ~ 19

Author(s):

Antonis Rokas

Keyword(s):

Phylogenetic Analysis ◽

Maximum Likelihood ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequence Data

Protein Sequence Data Banks: The Continuing Search for Related Structures

Protein Engineering ◽

10.1016/b978-0-12-372485-4.50005-x ◽

1973 ◽

pp. 15-27

Author(s):

R.F. DOOLITTLE

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequence Data ◽

Data Banks

Capturing Protein Sequence Data: UniProt Knowledgebase (UniProtKB) submissions and Journal Scanning

Nature Precedings ◽

10.1038/npre.2009.3120.1 ◽

2009 ◽

Author(s):

Paul Browne ◽

Ruth Eberhardt

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequence Data