THREE MSA TOOLS ANALYSIS in DNA and PROTEIN DATASETS

2021 ◽  
Vol 7 (2) ◽  
pp. 89-99
Author(s):  
Fırat AŞIR ◽  
Tuğcan KORAK ◽  
Özgür ÖZTÜRK
2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Pablo Mier ◽  
Miguel A. Andrade-Navarro

According to the amino acid composition of natural proteins, all possible sequences of three or four amino acids would be expected to occur at least once in large protein datasets purely by chance. However, in some species or cellular contexts, specific short amino acid motifs are missing for unknown reasons. We describe these as Avoided Motifs: short amino acid combinations missing from biological sequences. Here we identify 209 human and 154 bacterial Avoided Motifs of length four amino acids, and discuss their possible functionality according to their presence in other species. Furthermore, we determine two Avoided Motifs of length three amino acids in human proteins specifically located in the cytoplasm, and two more in secreted proteins. Our results support the hypothesis that characterizing Avoided Motifs in particular contexts can provide information about functional motifs, pointing to a new approach in the use of molecular sequences for the discovery of protein function.
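
The notion of a missing motif can be illustrated with a minimal sketch, which is not the authors' method. It enumerates every motif of length k over the 20 standard amino acids and reports those never observed in a set of sequences. The function names `find_avoided_motifs` and `expected_occurrences` are hypothetical, and the uniform residue frequencies used for the expectation are a simplifying assumption; the abstract itself refers to natural amino acid composition.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def find_avoided_motifs(sequences, k=3):
    """Return the sorted list of length-k motifs absent from every sequence."""
    observed = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            observed.add(seq[i:i + k])
    all_motifs = {"".join(p) for p in product(AMINO_ACIDS, repeat=k)}
    return sorted(all_motifs - observed)


def expected_occurrences(total_residues, k):
    """Rough expected count of one motif, assuming uniform residue
    frequencies (an assumption made only for this illustration)."""
    return total_residues / (len(AMINO_ACIDS) ** k)
```

Under the uniform assumption there are 20^4 = 160,000 possible motifs of length four, so a dataset of a few million residues would be expected to contain each of them several times, which is why a consistently absent motif is notable.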


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Bruno Thiago de Lima Nichio ◽  
Aryel Marlus Repula de Oliveira ◽  
Camilla Reginatto de Pierri ◽  
Leticia Graziela Costa Santos ◽  
Alexandre Quadros Lejambre ◽  
...  

2004 ◽  
Vol 02 (01) ◽  
pp. 99-126 ◽  
Author(s):  
ORHAN ÇAMOĞLU ◽  
TAMER KAHVECI ◽  
AMBUJ K. SINGH

We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST by a factor of 3 to 3.5 while keeping the sensitivity similar. Our technique can also be combined with DALI and CE to improve their running times by factors of 2 and 2.7, respectively. The software is available online at .


2021 ◽  
Vol 9 ◽  
Author(s):  
Matej Medvecky ◽  
Manolis Mandalakis

The majority of studies focusing on microbial functioning in various environments are based on DNA or RNA sequencing techniques that have inherent limitations and usually provide a distorted picture of the functional status of the studied system. Untargeted proteomics is better suited for that purpose, but it suffers from low efficiency when applied to complex consortia. In practice, the scanning capabilities of currently employed LC-MS/MS systems provide limited coverage of key-acting proteins, hardly allowing a semiquantitative assessment of the most abundant ones from the most prevalent species. When particular biological processes of high importance are under investigation, the analysis of specific proteins using targeted proteomics is a more appropriate strategy, as it offers superior sensitivity together with increased throughput, dynamic range and selectivity. However, the development of targeted assays requires a priori knowledge of the optimal peptides to be screened for each protein of interest. In complex, multi-species systems, a specific biochemical process may be driven by a large number of homologous proteins with considerable differences in their amino acid sequences, complicating LC-MS/MS detection. To overcome the complexity of such systems, we have developed an automated pipeline that interrogates the UniProt database or user-created protein datasets (e.g. from metagenomic studies) to gather homologous proteins with a defined functional role and extract their peptide sequences, while computing several protein/peptide properties and relevant statistics to deduce a small list of the most representative, process-specific and LC-MS/MS-amenable peptides for the microbial enzymatic activity of interest.
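
The peptide-selection idea can be sketched as follows. This is not the published pipeline: the abstract does not name a protease or filtering criteria, so the simplified trypsin rule (cleave after K or R unless followed by P), the 7 to 25 residue length window, and the function names `tryptic_digest` and `rank_shared_peptides` are all assumptions made for illustration. The sketch ranks peptides by how many homologs contain them, as a crude stand-in for "representative".

```python
from collections import Counter


def tryptic_digest(sequence):
    """Split a protein after K or R unless the next residue is P
    (a simplified trypsin rule, assumed for this sketch)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def rank_shared_peptides(homologs, min_len=7, max_len=25):
    """Rank peptides by the number of homologs that contain them.

    `homologs` maps a protein name to its sequence; the length window
    is a hypothetical filter, not a value taken from the source.
    """
    coverage = Counter()
    for sequence in homologs.values():
        # Count each peptide once per protein.
        unique = {p for p in tryptic_digest(sequence)
                  if min_len <= len(p) <= max_len}
        coverage.update(unique)
    # Highest coverage first; ties broken alphabetically for stable output.
    return sorted(coverage.items(), key=lambda item: (-item[1], item[0]))
```

A real assay design would also weigh properties such as uniqueness to the process of interest and detectability, which the abstract mentions but this sketch omits.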


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Daniel Griffith ◽  
Alex S Holehouse

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.


2017 ◽  
Author(s):  
Guifre Torruella ◽  
Xavier Grau-Bove ◽  
David Moreira ◽  
Sergey A Karpov ◽  
John Burns ◽  
...  

Aphelids are poorly known phagotrophic parasites of algae whose life cycle and morphology resemble those of the widely diverse parasitic rozellids (Cryptomycota, Rozellomycota). In previous phylogenetic analyses of RNA polymerase and rRNA genes, aphelids and rozellids formed a monophyletic group together with the extremely reduced parasitic Microsporidia, named Opisthosporidia, which was sister to Fungi. However, the statistical support for that group was always moderate. We generated the first transcriptome data for one aphelid species, Paraphelidium tribonemae. In-depth multi-gene phylogenomic analyses using various protein datasets place aphelids as the closest relatives of Fungi to the exclusion of rozellids and Microsporidia. In contrast with the comparatively reduced Rozella allomycis genome, we infer a rich, free-living-like aphelid proteome, including cellulases likely involved in algal cell-wall penetration, enzymes involved in chitin biosynthesis and several metabolic pathways. Our results suggest that Fungi evolved from a complex aphelid-like ancestor that lost phagotrophy and became osmotrophic.


2018 ◽  
Author(s):  
Akanksha Pandey ◽  
Edward L. Braun

Phylogenomics has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life. This could reflect the poor fit of the models used to analyze heterogeneous datasets; that heterogeneity is likely to have many explanations. However, it seems reasonable to hypothesize that the different patterns of selection on proteins based on their structures might represent a source of heterogeneity. To test that hypothesis, we developed an efficient pipeline to divide phylogenomic datasets that comprise proteins into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had different signals for the deepest branches in the metazoan tree of life. Sites located in different structural environments did support distinct tree topologies. The most striking difference in phylogenetic signal reflected relative solvent accessibility; analyses of sites on the surface of proteins yielded a tree that placed ctenophores sister to all other animals, whereas sites buried inside proteins yielded a tree with a sponge-ctenophore clade. These differences in phylogenetic signal were not ameliorated when we repeated our analyses using the site-heterogeneous CAT model, a mixture model that is often used for analyses of protein datasets. In fact, analyses using the CAT model resulted in rearrangements that are unlikely to represent evolutionary history. These results provide striking evidence that it will be necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.
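
The partitioning step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the function name `split_alignment_by_rsa`, the per-column relative solvent accessibility vector, and the 0.25 threshold are hypothetical. The sketch divides the columns of a protein alignment into "buried" and "exposed" sub-alignments that could then be analyzed separately.

```python
def split_alignment_by_rsa(alignment, rsa_per_column, threshold=0.25):
    """Split alignment columns into buried and exposed subsets.

    `alignment` maps taxon name to an aligned sequence; `rsa_per_column`
    holds one relative solvent accessibility value per column. The
    threshold is an assumed cutoff chosen only for this example.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(rsa_per_column)}:
        raise ValueError("alignment rows and RSA vector must have equal length")

    buried_cols = [i for i, v in enumerate(rsa_per_column) if v < threshold]
    exposed_cols = [i for i, v in enumerate(rsa_per_column) if v >= threshold]

    def take(cols):
        # Keep only the selected columns for every taxon.
        return {name: "".join(seq[i] for i in cols)
                for name, seq in alignment.items()}

    return take(buried_cols), take(exposed_cols)
```

Each resulting sub-alignment keeps all taxa, so the two subsets can be passed to the same tree-inference workflow and their topologies compared.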

