Adopting the MapReduce framework to pre-train 1-D and 2-D protein structure predictors with large protein datasets

Given the amino acid composition of natural proteins, all possible sequences of three or four amino acids would be expected to occur at least once in large protein datasets purely by chance. However, in some species or cellular contexts, specific short amino acid motifs are missing for unknown reasons. We describe these as Avoided Motifs: short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins specifically located in the cytoplasm, and two more in secreted proteins. Our results support the hypothesis that the characterization of Avoided Motifs in particular contexts can provide us with information about functional motifs, pointing to a new approach in the use of molecular sequences for the discovery of protein function.
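The search described in this abstract can be illustrated with a minimal sketch: enumerate every possible motif of length k over the amino acid alphabet and subtract the motifs actually observed in a set of sequences. The function name `avoided_motifs`, its `alphabet` parameter, and the toy sequences are assumptions made for illustration; the abstract does not describe the authors' implementation or any statistical filtering they may have applied.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def avoided_motifs(sequences, k, alphabet=AMINO_ACIDS):
    """Return the sorted list of length-k motifs that never occur in `sequences`.

    Hypothetical helper for illustration: it only checks presence/absence
    and does not test whether an absence is statistically surprising.
    """
    observed = set()
    for seq in sequences:
        # Slide a window of width k along each sequence.
        for i in range(len(seq) - k + 1):
            observed.add(seq[i:i + k])
    # All len(alphabet)**k possible motifs, e.g. 20**4 = 160,000 for k = 4.
    all_motifs = {"".join(p) for p in product(alphabet, repeat=k)}
    return sorted(all_motifs - observed)


# Toy example with a two-letter alphabet (made-up sequences, not real proteins).
print(avoided_motifs(["AAC", "CA"], 2, alphabet="AC"))  # prints ['CC']
```

For k = 3 or k = 4 the full motif set (8,000 or 160,000 strings) fits comfortably in memory, so a plain set difference is enough for a sketch like this.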

Bioinformatic Prediction of S-Nitrosylation Sites in Large Protein Datasets

Methods in Molecular Biology - Nitric Oxide ◽

10.1007/978-1-4939-7695-9_19 ◽

2018 ◽

pp. 241-250 ◽

Cited By ~ 1

Author(s):

Rosario Carmona ◽

M. Claros ◽

Juan de Alché

Keyword(s):

Large Protein ◽

Protein Datasets

RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets

BMC Bioinformatics ◽

10.1186/s12859-019-2973-4 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Bruno Thiago de Lima Nichio ◽

Aryel Marlus Repula de Oliveira ◽

Camilla Reginatto de Pierri ◽

Leticia Graziela Costa Santos ◽

Alexandre Quadros Lejambre ◽

...

Keyword(s):

Large Protein ◽

Protein Datasets

INDEX-BASED SIMILARITY SEARCH FOR PROTEIN STRUCTURE DATABASES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720004000491 ◽

2004 ◽

Vol 02 (01) ◽

pp. 99-126 ◽

Cited By ~ 8

Author(s):

ORHAN ÇAMOĞLU ◽

TAMER KAHVECI ◽

AMBUJ K. SINGH

Keyword(s):

Protein Structure ◽

Pairwise Alignment ◽

Query Protein ◽

Index Structure ◽

Feature Vectors ◽

New Methods ◽

Alignment Tool ◽

Structure Databases ◽

Protein Datasets ◽

Protein Dataset

We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST by a factor of 3 to 3.5, while keeping the sensitivity similar. Our technique can also be incorporated with DALI and CE to improve their running times by factors of 2 and 2.7, respectively. The software is available online at .
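The filter-and-refine idea in this abstract (index feature vectors, retrieve promising proteins cheaply, then align only those) can be sketched as follows. This is not the authors' index: a simple uniform grid stands in for the multidimensional index structure, the feature vectors are treated as given numeric tuples, and the names `GridIndex` and `promising_proteins` are hypothetical. The final alignment step with a tool such as VAST is not shown.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Toy stand-in for a multidimensional index: buckets vectors by grid cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, vec):
        # Map a feature vector to the integer coordinates of its grid cell.
        return tuple(floor(x / self.cell_size) for x in vec)

    def add(self, protein_id, vec):
        self.cells[self._cell(vec)].append((protein_id, vec))

    def votes(self, query_vec):
        """Count, per protein, the indexed vectors sharing the query's cell."""
        counts = defaultdict(int)
        for protein_id, _ in self.cells.get(self._cell(query_vec), []):
            counts[protein_id] += 1
        return counts


def promising_proteins(index, query_vectors, min_votes):
    """Return proteins whose vectors land near at least `min_votes` query vectors."""
    total = defaultdict(int)
    for qv in query_vectors:
        for protein_id, n in index.votes(qv).items():
            total[protein_id] += n
    return sorted(p for p, n in total.items() if n >= min_votes)


# Made-up 2-D feature vectors; real vectors would be derived from SSE triplets.
index = GridIndex(cell_size=1.0)
index.add("p1", (0.2, 0.3))
index.add("p1", (2.5, 2.5))
index.add("p2", (5.1, 5.9))
print(promising_proteins(index, [(0.4, 0.1), (2.9, 2.1)], min_votes=1))  # prints ['p1']
```

Only the proteins returned by `promising_proteins` would be passed to the expensive pairwise alignment, which is where the reported speed-up comes from. Looking up a single cell misses near neighbours that fall just across a cell boundary; a real index would also probe adjacent cells or use a tree-based structure.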

PARROT is a flexible recurrent neural network framework for analysis of large protein datasets

eLife ◽

10.7554/elife.70576 ◽

2021 ◽

Vol 10 ◽

Author(s):

Daniel Griffith ◽

Alex S Holehouse

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

High Throughput ◽

Recurrent Neural Network ◽

Transcriptional Activation ◽

Network Architecture ◽

Learning Approaches ◽

Large Protein ◽

Protein Datasets

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.

Efficient protein structure alignment algorithms under the MapReduce framework

4th IEEE International Conference on Cloud Computing Technology and Science Proceedings ◽

10.1109/cloudcom.2012.6427604 ◽

2012 ◽

Author(s):

Che-Lun Hung ◽

Yaw-Ling Lin ◽

Chen-En Hsieh ◽

Guan-Jie Hua

Keyword(s):

Protein Structure ◽

Structure Alignment ◽

Protein Structure Alignment ◽

Mapreduce Framework ◽

Alignment Algorithms

Homology-Based Annotation of Large Protein Datasets

Methods in Molecular Biology - Data Mining Techniques for the Life Sciences ◽

10.1007/978-1-4939-3572-7_8 ◽

2016 ◽

pp. 153-176

Author(s):

Marco Punta ◽

Jaina Mistry

Keyword(s):

Large Protein ◽

Protein Datasets

On the intrinsic disorder status of the major players in programmed cell death pathways

F1000Research ◽

10.12688/f1000research.2-190.v1 ◽

2013 ◽

Vol 2 ◽

pp. 190 ◽

Cited By ~ 15

Author(s):

Alexey V Uversky ◽

Bin Xue ◽

Zhenling Peng ◽

Lukasz Kurgan ◽

Vladimir N Uversky

Keyword(s):

Cell Death ◽

Programmed Cell Death ◽

Posttranslational Modification ◽

Bioinformatics Analysis ◽

Intrinsic Disorder ◽

Large Protein ◽

Evolutionary Patterns ◽

Cell Death Pathways ◽

Related Proteins ◽

Protein Datasets

Earlier computational and bioinformatics analysis of several large protein datasets across 28 species showed that proteins involved in regulation and execution of programmed cell death (PCD) possess substantial amounts of intrinsic disorder. Based on the comprehensive analysis of these datasets by a wide array of modern bioinformatics tools it was concluded that disordered regions of PCD-related proteins are involved in a multitude of biological functions and interactions with various partners, possess numerous posttranslational modification sites, and have specific evolutionary patterns (Peng et al. 2013). This study extends our previous work by providing information on the intrinsic disorder status of some of the major players of the three major PCD pathways: apoptosis, autophagy, and necroptosis. We also present a detailed description of the disorder status and interactomes of selected proteins that are involved in the p53-mediated apoptotic signaling pathways.
