Adopting the MapReduce framework to pre-train 1-D and 2-D protein structure predictors with large protein datasets

Author(s):  
Jesse Eickholt ◽  
Suman Karki
2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Pablo Mier ◽  
Miguel A. Andrade-Navarro

According to the amino acid composition of natural proteins, all possible sequences of three or four amino acids would be expected to occur at least once in large protein datasets purely by chance. However, in some species or cellular contexts, specific short amino acid motifs are missing for unknown reasons. We describe these as Avoided Motifs: short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins specifically located in the cytoplasm, and two more in secreted proteins. Our results support the hypothesis that the characterization of Avoided Motifs in particular contexts can provide us with information about functional motifs, pointing to a new approach in the use of molecular sequences for the discovery of protein function.
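The idea of a motif that is missing even though it is expected by chance can be illustrated with a minimal sketch. The abstract does not describe the authors' actual pipeline, so everything here is an assumption: the function names are hypothetical, the toy sequences are made up, and the expectation uses a simple composition-only model in which residues are treated as independent.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_kmers(sequences, k):
    """Count every overlapping window of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def absent_motifs(sequences, k):
    """Return all possible k-mers that never occur in the dataset."""
    observed = count_kmers(sequences, k)
    all_kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [kmer for kmer in all_kmers if kmer not in observed]


def expected_count(motif, sequences):
    """Expected occurrences of a motif if residues were independent
    and drawn from the dataset's overall amino acid composition
    (an assumed null model, not necessarily the one used in the paper)."""
    residues = Counter("".join(sequences))
    total = sum(residues.values())
    windows = sum(max(len(s) - len(motif) + 1, 0) for s in sequences)
    probability = 1.0
    for aa in motif:
        probability *= residues[aa] / total
    return windows * probability


if __name__ == "__main__":
    toy = ["ACDA", "CDAC"]  # made-up sequences, for illustration only
    missing = absent_motifs(toy, 2)
    print(len(missing), "of", 20 ** 2, "possible 2-mers are absent")
    print("expected count of 'AC':", expected_count("AC", toy))
```

On a real proteome, a motif would be a candidate only if it is absent while its expected count is well above zero; in a toy dataset this small almost every motif is absent simply because there are too few windows.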


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Bruno Thiago de Lima Nichio ◽  
Aryel Marlus Repula de Oliveira ◽  
Camilla Reginatto de Pierri ◽  
Leticia Graziela Costa Santos ◽  
Alexandre Quadros Lejambre ◽  
...  

2004 ◽  
Vol 02 (01) ◽  
pp. 99-126 ◽  
Author(s):  
ORHAN ÇAMOĞLU ◽  
TAMER KAHVECI ◽  
AMBUJ K. SINGH

We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST by a factor of 3 to 3.5, while keeping the sensitivity similar. Our technique can also be incorporated with DALI and CE to improve their running times by a factor of 2 and 2.7 respectively. The software is available online at .
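The filter step described above (index feature vectors, then retrieve promising proteins before running an expensive aligner) can be sketched roughly as follows. This is not the authors' method: the abstract does not say how the SSE-triplet features are computed or which index structure is used, so the vectors here are placeholder numbers, a uniform grid stands in for the multidimensional index, and all function names are hypothetical. The refinement step with VAST, DALI, or CE is not shown.

```python
from collections import Counter, defaultdict
from itertools import product


def cell_of(vector, cell_size):
    """Map a feature vector to the grid cell that contains it."""
    return tuple(int(x // cell_size) for x in vector)


def build_index(protein_features, cell_size):
    """Index every feature vector of every protein by its grid cell.

    protein_features maps a protein id to a list of equal-length
    feature vectors (one per SSE triplet, in this sketch)."""
    index = defaultdict(list)
    for protein_id, vectors in protein_features.items():
        for vector in vectors:
            index[cell_of(vector, cell_size)].append((protein_id, vector))
    return index


def candidates(index, query_vectors, cell_size, radius):
    """Rank proteins by how many query vectors have a close match.

    Only the query's cell and its direct neighbours are scanned, so
    radius must not exceed cell_size for the search to be complete."""
    votes = Counter()
    for query in query_vectors:
        base = cell_of(query, cell_size)
        matched = set()
        for offset in product((-1, 0, 1), repeat=len(query)):
            cell = tuple(b + o for b, o in zip(base, offset))
            for protein_id, vector in index.get(cell, ()):
                squared = sum((a - b) ** 2 for a, b in zip(query, vector))
                if squared <= radius ** 2:
                    matched.add(protein_id)
        votes.update(matched)  # each protein gets one vote per query vector
    return votes.most_common()


if __name__ == "__main__":
    # Placeholder feature vectors, for illustration only.
    database = {
        "p1": [(1.0, 2.0, 3.0), (4.0, 4.0, 4.0)],
        "p2": [(10.0, 10.0, 10.0)],
    }
    index = build_index(database, cell_size=1.0)
    query = [(1.1, 2.1, 2.9), (4.2, 4.0, 3.9)]
    print(candidates(index, query, cell_size=1.0, radius=0.5))
```

Only the top-ranked proteins would then be passed to the pairwise aligner, which is where a reduction in running time of the kind reported would come from.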


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Daniel Griffith ◽  
Alex S Holehouse

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.


F1000Research ◽  
2013 ◽  
Vol 2 ◽  
pp. 190 ◽  
Author(s):  
Alexey V Uversky ◽  
Bin Xue ◽  
Zhenling Peng ◽  
Lukasz Kurgan ◽  
Vladimir N Uversky

Earlier computational and bioinformatics analysis of several large protein datasets across 28 species showed that proteins involved in regulation and execution of programmed cell death (PCD) possess substantial amounts of intrinsic disorder. Based on the comprehensive analysis of these datasets by a wide array of modern bioinformatics tools, it was concluded that disordered regions of PCD-related proteins are involved in a multitude of biological functions and interactions with various partners, possess numerous posttranslational modification sites, and have specific evolutionary patterns (Peng et al. 2013). This study extends our previous work by providing information on the intrinsic disorder status of some of the major players of the three major PCD pathways: apoptosis, autophagy, and necroptosis. We also present a detailed description of the disorder status and interactomes of selected proteins that are involved in the p53-mediated apoptotic signaling pathways.


2012 ◽  
Vol 28 (19) ◽  
pp. 2431-2440 ◽  
Author(s):  
Edvin Fuglebakk ◽  
Julián Echave ◽  
Nathalie Reuter
