protein datasets
Recently Published Documents


TOTAL DOCUMENTS

48
(FIVE YEARS 26)

H-INDEX

8
(FIVE YEARS 3)

2021 ◽  
Vol 7 (2) ◽  
pp. 89-99
Author(s):  
Fırat AŞIR ◽  
Tuğcan KORAK ◽  
Özgür ÖZTÜRK

2021 ◽  
Author(s):  
Samantha M Powell ◽  
Irina V Novikova ◽  
Doo Nam Kim ◽  
James E Evans

Despite rapid adaptation of micro-electron diffraction (MicroED) for protein and small molecule structure determination to sub-angstrom resolution, the lack of automation tools for easy MicroED data processing remains a challenge for expanding the technique to the broader scientific community. In particular, automation tools that are novice-friendly, compatible with heterogeneous datasets and able to run in unison with data collection to judge the quality of incoming data (similar to cryoSPARC Live for single particle cryoEM) do not exist. Here, we present AutoMicroED, a cohesive and semi-automatic MicroED data processing pipeline that runs through image conversion, indexing, integration and scaling of data, followed by merging of successful datasets that are pushed through phasing and final structure determination. AutoMicroED is compatible with both small molecule and protein datasets and provides a straightforward and reproducible method to solve single structures from pure samples, or multiple structures from mixed populations. Immediate feedback on data quality, data completeness and other parameters helps users identify whether they have collected enough data for their needs. Overall, AutoMicroED permits efficient structure elucidation for both novice and experienced users, with results comparable to more laborious manual processing.


2021 ◽  
Author(s):  
Cuong Cao Dang ◽  
Bui Quang Minh ◽  
Hanon McShea ◽  
Joanna Masel ◽  
Jennifer Eleanor James ◽  
...  

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce a maximum likelihood approach nQMaker, an extension of the recently published QMaker method, that allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.
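To make the time-reversibility assumption concrete, here is a minimal Python sketch. It is not nQMaker or any IQ-TREE code; the function names are hypothetical. It relies only on the standard definition: a rate matrix Q with stationary frequencies π is time-reversible when π_i Q_ij = π_j Q_ji for every pair of states. A three-state toy alphabet stands in for the 20 amino acids.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def build_reversible_q(exchange: Sequence[Sequence[float]],
                       freqs: Sequence[float]) -> Matrix:
    """Build a rate matrix from symmetric exchangeabilities and frequencies.

    Off-diagonal entries are Q[i][j] = exchange[i][j] * freqs[j]; each
    diagonal entry makes its row sum to zero. A symmetric `exchange`
    matrix guarantees that the result satisfies detailed balance.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the sum is taken
    return q


def is_time_reversible(q: Matrix, freqs: Sequence[float],
                       tol: float = 1e-9) -> bool:
    """Check detailed balance: freqs[i] * Q[i][j] == freqs[j] * Q[j][i]."""
    n = len(freqs)
    return all(
        abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy reversible model (3 states instead of 20 amino acids).
toy_freqs = [0.5, 0.3, 0.2]
toy_exchange = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
reversible_q = build_reversible_q(toy_exchange, toy_freqs)

# Toy non-reversible model: substitutions only flow 0 -> 1 -> 2 -> 0.
# Uniform frequencies are stationary here, but detailed balance fails.
cyclic_q = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
```

In the cyclic example the process has a direction in time, which is the kind of asymmetry a non-reversible model can exploit to place a root without an outgroup.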


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Daniel Griffith ◽  
Alex S Holehouse

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Luis Javier Galindo ◽  
Purificación López-García ◽  
Guifré Torruella ◽  
Sergey Karpov ◽  
David Moreira

Compared to multicellular fungi and unicellular yeasts, unicellular fungi with free-living flagellated stages (zoospores) remain poorly known and their phylogenetic position is often unresolved. Recently, rRNA gene phylogenetic analyses of two atypical parasitic fungi with amoeboid zoospores and long kinetosomes, the sanchytrids Amoeboradix gromovi and Sanchytrium tribonematis, showed that they formed a monophyletic group without close affinity with known fungal clades. Here, we sequence single-cell genomes for both species to assess their phylogenetic position and evolution. Phylogenomic analyses using different protein datasets and a comprehensive taxon sampling result in an almost fully-resolved fungal tree, with Chytridiomycota as sister to all other fungi, and sanchytrids forming a well-supported, fast-evolving clade sister to Blastocladiomycota. Comparative genomic analyses across fungi and their allies (Holomycota) reveal an atypically reduced metabolic repertoire for sanchytrids. We infer three main independent flagellum losses from the distribution of over 60 flagellum-specific proteins across Holomycota. Based on sanchytrids’ phylogenetic position and unique traits, we propose the designation of a novel phylum, Sanchytriomycota. In addition, our results indicate that most of the hyphal morphogenesis gene repertoire of multicellular fungi had already evolved in early holomycotan lineages.


2021 ◽  
Vol 9 ◽  
Author(s):  
Matej Medvecky ◽  
Manolis Mandalakis

The majority of studies focusing on microbial functioning in various environments are based on DNA or RNA sequencing techniques that have inherent limitations and usually provide a distorted picture of the functional status of the studied system. Untargeted proteomics is better suited for that purpose, but it suffers from low efficiency when applied to complex consortia. In practice, the scanning capabilities of the currently employed LC-MS/MS systems provide limited coverage of key-acting proteins, hardly allowing a semiquantitative assessment of the most abundant ones from the most prevalent species. When particular biological processes of high importance are under investigation, the analysis of specific proteins using targeted proteomics is a more appropriate strategy, as it offers superior sensitivity and comes with the added benefits of increased throughput, dynamic range and selectivity. However, the development of targeted assays requires a priori knowledge of the optimal peptides to be screened for each protein of interest. In complex, multi-species systems, a specific biochemical process may be driven by a large number of homologous proteins with considerable differences in their amino acid sequences, complicating LC-MS/MS detection. To overcome the complexity of such systems, we have developed an automated pipeline that interrogates the UniProt database or user-created protein datasets (e.g. from metagenomic studies) to gather homologous proteins with a defined functional role and extract the respective peptide sequences, while computing several protein/peptide properties and relevant statistics to deduce a short list of the most representative, process-specific and LC-MS/MS-amenable peptides for the microbial enzymatic activity of interest.
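The idea of extracting peptides from a set of homologs and keeping the most representative ones can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline. It assumes tryptic digestion (cleavage after K or R unless the next residue is P), which the abstract does not state, and it scores representativeness simply as the number of homologs containing a peptide. The function names, length limits and toy sequences are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tryptic_peptides(sequence: str, min_len: int = 6,
                     max_len: int = 25) -> List[str]:
    """In-silico tryptic digest with no missed cleavages.

    Splits after K or R unless the next residue is P, then keeps
    peptides whose length falls within [min_len, max_len].
    Requires Python 3.7+ (re.split on zero-width matches).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def rank_shared_peptides(homologs: Dict[str, str]) -> List[Tuple[str, int]]:
    """Rank peptides by how many homologous proteins contain them."""
    counts: Counter = Counter()
    for seq in homologs.values():
        # Use a set so a peptide counts once per protein.
        counts.update(set(tryptic_peptides(seq)))
    # Most widely shared first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy homologs (made-up sequences, for illustration only).
toy_homologs = {
    "protein_1": "MAAAAAKTTTTTTRPGGGK",
    "protein_2": "GGGGGGKTTTTTTRPGGGK",
}
ranking = rank_shared_peptides(toy_homologs)
```

A real assay design would also weigh properties such as uniqueness to the process of interest and detectability by LC-MS/MS, which this sketch leaves out.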


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Heba M. Afify ◽  
Mohamed B. Abdelhalim ◽  
Mai S. Mabrouk ◽  
Ahmed Y. Sayed

Background: The computational biology approach has advanced exponentially in protein secondary structure prediction (PSSP), which is vital for the pharmaceutical industry. Extracting protein structure from the laboratory has insufficient information for PSSP that is used in bioinformatics studies. In this paper, the support vector machine (SVM) model and decision tree are presented on the RS126 dataset to address the problem of PSSP. A decision tree is applied to the SVM outcome to obtain the relevant guidelines possible for PSSP. Furthermore, the number of produced rules was fairly small, and they show a greater degree of comprehensibility compared to other rules. Several of the proposed principles have compelling and relevant biological clarification. Results: The results confirmed that the existence of a particular amino acid in a protein sequence increases the stability for the forecast of protein secondary structure. The suggested algorithm achieved 85% accuracy for the E|~E classifier. Conclusions: The proposed rules can be very important in managing wet laboratory experiments aimed at determining protein secondary structure. Lastly, future work will focus mainly on large protein datasets without overfitting and increase the number of extracted rules for PSSP.


2021 ◽  
Author(s):  
Héléna Alexandra Gaspar ◽  
Mohamed Ahmed ◽  
Thomas Edlich ◽  
Benedek Fabian ◽  
Zsolt Varszegi ◽  
...  

Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences. Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings based on subword-segmented language models trained on mammalian sequences against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy protein one-hot encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small and combining NLP-based and MSA-based descriptors increases predictive performance consistently across different splitting strategies. This work has been presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry in September 2020.
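As a rough illustration of the one-hot baseline mentioned above, the Python sketch below encodes each protein by identity alone, so the descriptor carries no sequence information. Reading "dummy protein one-hot encodings" as identity encodings is an assumption, and the function name and example identifiers are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Sequence


def one_hot_protein_ids(protein_ids: Sequence[str]) -> Dict[str, List[int]]:
    """Map each distinct protein identifier to a one-hot vector.

    Identifiers are sorted so the encoding is deterministic; the vector
    length equals the number of distinct proteins in the dataset.
    """
    unique = sorted(set(protein_ids))
    index = {pid: i for i, pid in enumerate(unique)}
    return {
        pid: [1 if i == index[pid] else 0 for i in range(len(unique))]
        for pid in unique
    }


# Hypothetical protein labels from a small protein-ligand activity table.
encoding = one_hot_protein_ids(["CYP3A4", "CYP2D6", "CYP3A4"])
```

In a PCM model, a vector like this would be concatenated with the ligand descriptors. Because every protein is equally distant from every other, such a baseline cannot generalize to proteins outside the training set, which makes it a useful reference point for learned embeddings.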

