protein databases
Recently Published Documents


Total documents: 145 (five years: 43)
H-index: 24 (five years: 4)

2021 ◽  
Author(s):  
Wesley S. van de Geer ◽  
Job van Riet ◽  
Harmen J. G. van de Werken

Summary: We present ProteoDisco, an open-source R package that flexibly incorporates genomic variants, fusion genes and (aberrant) transcriptomic variants from standardized formats into protein variant sequences. Its step-by-step workflow can be customized in depth to suit a wide range of research approaches in proteogenomics, for any organism with an available reference genome and transcript annotations. Availability and Implementation: ProteoDisco (R package version ≥ 0.99) is available from https://github.com/ErasmusMC-CCBC/ProteoDisco/. Contact: [email protected]. Supplementary information: Supplementary table, figures and data files available.


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Sangjeong Lee ◽  
Heejin Park ◽  
Hyunwoo Kim

Background: The target-decoy strategy effectively estimates the false-discovery rate (FDR) by creating a decoy database with a size identical to that of the target database. Decoy databases are created by various methods, such as the reverse, pseudo-reverse, shuffle, pseudo-shuffle, and de Bruijn methods. The FDR is sometimes over- or under-estimated depending on which decoy database is used, because the ratios of redundant peptides differ between target databases; that is, the numbers of unique (non-redundant) peptides in the target and decoy databases differ. Results: We used two protein databases (the UniProt Saccharomyces cerevisiae protein database and the UniProt human protein database) to compare the FDRs of various decoy databases. When the ratio of redundant peptides in the target database is low, the FDR is not overestimated by any decoy construction method. However, if the ratio of redundant peptides in the target database is high, the FDR is overestimated when the (pseudo) shuffle decoy database is used. Additionally, the human and S. cerevisiae six-frame translation databases, which are large databases, showed outcomes similar to those from the UniProt human protein database. Conclusion: The FDR must be estimated using the correction factor proposed by Elias and Gygi or that by Kim et al. when (pseudo) shuffle decoy databases are used.
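The decoy constructions and the basic FDR estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, digestion is simplified to cleavage after K or R with no proline rule, and the estimator is the plain decoy-to-target ratio without the correction factors of Elias and Gygi or Kim et al.

```python
def tryptic_peptides(protein):
    """Split a protein after every K or R (simplified trypsin rule, an assumption)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def reverse_decoy(protein):
    """Reverse decoy: the whole protein sequence read backwards."""
    return protein[::-1]


def pseudo_reverse_decoy(protein):
    """Pseudo-reverse decoy: reverse each peptide but keep its C-terminal K/R in place."""
    parts = []
    for pep in tryptic_peptides(protein):
        if pep[-1] in "KR":
            parts.append(pep[:-1][::-1] + pep[-1])
        else:
            parts.append(pep[::-1])
    return "".join(parts)


def unique_peptide_count(proteins):
    """Number of distinct peptides in a database (redundant peptides counted once)."""
    return len({pep for seq in proteins for pep in tryptic_peptides(seq)})


def estimate_fdr(psms, threshold):
    """Uncorrected FDR estimate: decoy hits / target hits at or above the threshold.

    psms is an iterable of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Comparing `unique_peptide_count` on a target database and on its decoy gives a quick check of whether the two differ in the number of unique peptides, which is the situation the abstract identifies as the source of biased FDR estimates.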


2021 ◽  
Author(s):  
Laura Fancello ◽  
Thomas Burger

Background: Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases (i.e., built from reference protein databases only retaining proteins whose transcripts are detected in the sample-matched transcriptome), which were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach. Results: First, we established that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative FDRs are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible. Conclusion: In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable to excessively small databases, transcriptome information can reduce ambiguity of protein identifications.
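A reduced transcriptome-informed database of the kind evaluated above can be sketched as a simple filter over a reference FASTA. Everything here is an assumption made for illustration: the function names, the protein-to-transcript mapping, the expression table and the `min_tpm` cutoff are hypothetical and not taken from the paper.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}; the identifier is the first header token."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def reduce_database(proteins, protein_to_transcript, expression, min_tpm=1.0):
    """Keep only proteins whose transcript is detected at or above min_tpm.

    proteins: {protein_id: sequence}
    protein_to_transcript: {protein_id: transcript_id} (hypothetical mapping)
    expression: {transcript_id: abundance}, e.g. TPM from the sample-matched transcriptome
    """
    kept = {}
    for protein_id, sequence in proteins.items():
        transcript_id = protein_to_transcript.get(protein_id)
        if transcript_id is not None and expression.get(transcript_id, 0.0) >= min_tpm:
            kept[protein_id] = sequence
    return kept
```

The same `reduce_database` filter could in principle be applied to the list of identified proteins after searching the full reference database, which corresponds to the post-hoc filtering that the abstract describes as more transparent and reproducible.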


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Zheng-Wen Yu ◽  
Ni Zhang ◽  
Chun-Yan Jiang ◽  
Shao-Xiong Wu ◽  
Xia-Yu Feng ◽  
...  

Dihydroquercetin (DHQ), a compound present at extremely low content (less than 3%) in plants, is an important component of dietary supplements and is used as a functional food for its antioxidant activity. Moreover, dihydromyricetin (DHM), a downstream metabolite of DHQ, reaches an extremely high content of up to 38.5% in Ampelopsis grossedentata. However, the mechanisms involved in the biosynthesis and regulation from DHQ to DHM in A. grossedentata remain unclear. In this study, a comparative transcriptome analysis of A. grossedentata containing extreme amounts of DHM was performed on the Illumina HiSeq 2000 sequencing platform. A total of 167,415,597 high-quality clean reads were obtained and assembled into 100,584 unigenes having an N50 value of 1489. Among these contigs, 57,016 (56.68%) were successfully annotated in seven public protein databases. From the differentially expressed gene (DEG) analysis, 926 DEGs were identified between the B group (low DHM: 210.31 mg/g) and D group (high DHM: 359.12 mg/g) libraries, including 446 up-regulated genes and 480 down-regulated genes (B vs. D). The flavonoid (DHQ, DHM)-related DEGs comprised ten structural enzyme genes, three myeloblastosis transcription factors (MYB TFs), one basic helix–loop–helix (bHLH) TF, and one WD40 domain-containing protein. The enzyme genes comprised three PALs, two CLs, two CHSs, one F3’H, one F3’5’H (directly converts DHQ to DHM), and one ANS. The expression profiles of randomly selected genes were consistent with the RNA-seq results. Our findings thus provide comprehensive gene expression resources for revealing the molecular mechanism from DHQ to DHM in A. grossedentata. Importantly, this work will spur further genetic studies of A. grossedentata and may eventually lead to genetic improvements of the DHQ content in this plant.
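The N50 statistic quoted for the assembly can be computed from a list of sequence lengths as sketched below. This uses the common definition (the length at which the sequences of that length or longer cover at least half of the total assembled bases); the function name is hypothetical and the paper does not state which tool produced its value.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and return the length at which the
    running total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest sequences already account for 16 of the 32 total bases.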


2021 ◽  
Author(s):  
Karin Schork ◽  
Michael Turewicz ◽  
Julian Uszkoreit ◽  
Jörg Rahnenführer ◽  
Martin Eisenacher

Motivation: In bottom-up proteomics, proteins are enzymatically digested before measurement with mass spectrometry. The relationship between proteins and peptides can be represented by bipartite graphs. This representation is useful to aid protein inference and quantification, which is complex due to the occurrence of shared peptides. We conducted a comprehensive analysis of bipartite graphs using theoretical peptides from in silico digestion of protein databases as well as peptides quantified from real data sets. Results: The graphs based on quantified peptides are smaller and have less complex structures compared to graphs using theoretical peptides. The proportions of protein nodes without unique peptides and of graphs that contain such proteins are considerably greater for real data. Large differences between the two analyzed organisms (mouse and yeast) at the database as well as the quantitative level have been observed. Insights from this analysis may be useful for the development of protein inference and quantification algorithms.
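The bipartite protein-peptide representation can be sketched in a few lines. This is an illustrative reading of the idea rather than the authors' code: the function names are hypothetical, and digestion is simplified to cleavage after K or R with no missed cleavages or length filters.

```python
from collections import defaultdict


def digest(sequence):
    """In silico digestion: cleave after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def build_bipartite_graph(proteins):
    """Return {peptide: set of protein ids}; each pair is one edge of the bipartite graph."""
    peptide_to_proteins = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in digest(sequence):
            peptide_to_proteins[peptide].add(protein_id)
    return dict(peptide_to_proteins)


def connected_components(peptide_to_proteins):
    """Split the graph into connected components of (protein set, peptide set)."""
    protein_to_peptides = defaultdict(set)
    for peptide, protein_ids in peptide_to_proteins.items():
        for protein_id in protein_ids:
            protein_to_peptides[protein_id].add(peptide)
    seen, components = set(), []
    for start in protein_to_peptides:
        if start in seen:
            continue
        seen.add(start)
        stack, comp_proteins, comp_peptides = [start], set(), set()
        while stack:
            protein_id = stack.pop()
            comp_proteins.add(protein_id)
            for peptide in protein_to_peptides[protein_id]:
                comp_peptides.add(peptide)
                for neighbour in peptide_to_proteins[peptide]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        components.append((comp_proteins, comp_peptides))
    return components


def proteins_without_unique_peptides(peptide_to_proteins):
    """Proteins whose peptides are all shared with at least one other protein."""
    all_proteins = set()
    with_unique = set()
    for protein_ids in peptide_to_proteins.values():
        all_proteins |= protein_ids
        if len(protein_ids) == 1:
            with_unique |= protein_ids
    return all_proteins - with_unique
```

Running the same functions on theoretical peptides and on the subset of peptides actually quantified would give the two kinds of graphs the abstract compares.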


2021 ◽  
Vol 12 ◽  
Author(s):  
Dehia Sahmi-Bounsiar ◽  
Jean-Pierre Baudoin ◽  
Sihem Hannat ◽  
Philippe Decloquement ◽  
Eric Chabrieres ◽  
...  

One of the most curious findings associated with the discovery of Acanthamoeba polyphaga mimivirus (APMV) was the presence of many proteins and RNAs within the virion. Although some hypotheses on their role in Acanthamoeba infection have been put forward, none have been validated. In this study, we directly transfected mimivirus DNA, with or without additional proteinase K treatment of the extracted DNA, into Acanthamoeba castellanii. In this way, it was possible to generate infectious APMV virions, but only without extra proteinase K treatment of the extracted DNA. The virus genomes before and after transfection were identical. We searched for the remaining DNA-associated proteins that were digested by proteinase K and could visualize at least five putative proteins. Matrix-assisted laser desorption/ionization time-of-flight and liquid chromatography–mass spectrometry comparison with protein databases allowed the identification of four hypothetical proteins (L442, L724, L829, and R387) and the putative GMC-type oxidoreductase R135. We believe that L442 plays a major role in this protein–DNA interaction. In the future, expression in vectors followed by X-ray diffraction of protein crystals could help reveal the exact structure of this protein and its precise role.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11330
Author(s):  
Julian Echave

Studying the effect of perturbations on protein structure is a basic approach in protein research. Important problems, such as predicting pathological mutations and understanding patterns of structural evolution, have been addressed by computational simulations that model mutations using forces and predict the resulting deformations. In single mutation-response scanning simulations, a sensitivity matrix is obtained by averaging deformations over point mutations. In double mutation-response scanning simulations, a compensation matrix is obtained by minimizing deformations over pairs of mutations. These very useful simulation-based methods may be too slow to deal with large proteins, protein complexes, or large protein databases. To address this issue, I derived analytical closed formulas to calculate the sensitivity and compensation matrices directly, without simulations. Here, I present these derivations and show that the resulting analytical methods are much faster than their simulation counterparts.


Author(s):  
Xiaolong Cao ◽  
Jinchuan Xing

Summary: As next-generation sequencing technology becomes broadly applied, genomics and transcriptomics are becoming more commonly used in both research and clinical settings. However, proteomics is still an obstacle to be conquered. For most peptide search programs in proteomics, a standard reference protein database is used. Because of the thousands of coding DNA variants in each individual, a standard reference database does not provide a perfect match for many proteins/peptides of an individual. A personalized reference database can improve the detection power and accuracy for individual proteomics data. To connect genomics and proteomics, we designed a Python package, PrecisionProDB, that is specialized for generating a personalized protein database for proteomics applications. PrecisionProDB supports multiple popular file formats and reference databases, and can generate a personalized database in minutes. To demonstrate the application of PrecisionProDB, we generated human population-specific reference protein databases with PrecisionProDB, which improved the number of identified peptides by 0.34% on average. In addition, by incorporating cell line-specific variants into the protein database, we demonstrated a 0.71% improvement for peptide identification in the Jurkat cell line. With PrecisionProDB and these datasets, researchers and clinicians can improve their peptide search performance by adopting the more representative protein database or adding population- and individual-specific proteins to the search database with minimal additional effort. Availability and implementation: PrecisionProDB and pre-calculated protein databases are freely available at https://github.com/ATPs/PrecisionProDB and https://github.com/ATPs/PrecisionProDB_references. Supplementary information: Supplementary data are available at Bioinformatics online.
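The idea of a personalized database can be illustrated with a deliberately simplified sketch. It does not use or reproduce the PrecisionProDB API: it assumes variants have already been translated to protein-level single amino-acid substitutions, whereas the package itself starts from DNA variants and gene annotations. All function names and the `_var` suffix are hypothetical.

```python
def apply_substitutions(reference, variants):
    """Apply single amino-acid substitutions to a reference protein sequence.

    variants: list of (position, ref_aa, alt_aa) with 1-based positions.
    Raises ValueError if a position is out of range or the reference residue
    does not match, so that mismatched annotations are not silently applied.
    """
    residues = list(reference)
    for position, ref_aa, alt_aa in variants:
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} outside sequence of length {len(residues)}")
        if residues[position - 1] != ref_aa:
            raise ValueError(f"expected {ref_aa} at position {position}, found {residues[position - 1]}")
        residues[position - 1] = alt_aa
    return "".join(residues)


def personalized_database(reference_db, variants_by_protein):
    """Return the reference database plus one variant entry per altered protein.

    reference_db: {protein_id: sequence}
    variants_by_protein: {protein_id: [(position, ref_aa, alt_aa), ...]}
    Variant entries are stored under protein_id + "_var" (a naming assumption).
    """
    database = dict(reference_db)
    for protein_id, variants in variants_by_protein.items():
        if protein_id in reference_db and variants:
            database[protein_id + "_var"] = apply_substitutions(reference_db[protein_id], variants)
    return database
```

Keeping the reference entry alongside the variant entry mirrors the option mentioned in the abstract of adding individual-specific proteins to the search database; replacing the reference entry instead would be the other design choice.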


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Keiichi Inoue ◽  
Masayuki Karasuyama ◽  
Ryoko Nakamura ◽  
Masae Konno ◽  
Daichi Yamada ◽  
...  

Microbial rhodopsins are photoreceptive membrane proteins, which are used as molecular tools in optogenetics. Here, a machine learning (ML)-based experimental design method is introduced for screening rhodopsins that are likely to be red-shifted from representative rhodopsins in the same subfamily. Among 3,022 ion-pumping rhodopsins that were suggested by a protein BLAST search in several protein databases, the ML-based method selected 65 candidate rhodopsins. The wavelengths of 39 of them could be determined experimentally by expressing the proteins with the Escherichia coli system, and 32 (82%, p = 7.025 × 10⁻⁵) actually showed red-shift gains. In addition, four showed red-shift gains >20 nm, and two were found to have desirable ion-transporting properties, indicating that they would be potentially useful in optogenetics. These findings suggest that data-driven ML-based approaches play effective roles in the experimental design of rhodopsin and other photobiological studies.


2021 ◽  
Vol 12 ◽  
Author(s):  
Wenjie Zhang ◽  
Hongyuan Xu ◽  
Xiaxia Duan ◽  
Jing Hu ◽  
Jingjing Li ◽  
...  

Chrysanthemum rhombifolium (Ling et C. Shih) is an endemic plant that is extremely well-adapted to harsh environments. However, little is known about the molecular biology of its stress-resistance traits, or even of the plant overall. To investigate the molecular biology of C. rhombifolium and its mechanism of stress adaptation, we performed transcriptome sequencing of its leaves using an Illumina platform. A total of 130,891 unigenes were obtained, and 97,496 (~74.5%) unigenes were annotated in the public protein database. The similarity search indicated that 40,878 and 74,084 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Of these, 56,213 and 42,005 unigenes were assigned to the Gene Ontology (GO) database and Cluster of Orthologous Groups (COG), respectively, and 38,918 unigenes were mapped into five main categories, including 18 KEGG pathways. Metabolism was the largest category (23,128, 59.4%) among the main KEGG categories, suggesting active metabolic processes in C. rhombifolium. About 2,459 unigenes were annotated to have a role in defense mechanism or stress tolerance. Transcriptome analysis of C. rhombifolium revealed the presence of 12,925 microsatellites in 10,524 unigenes, with mono-, tri-, and dinucleotide repeats having higher polymorphism rates. The phylogenetic analysis based on the GME gene among related species confirmed the reliability of the transcriptomic data. This work is the first genetic study of C. rhombifolium as a new plant resource of stress-tolerant genes. This large number of transcriptome sequences enabled us to comprehensively understand the basic genetics of C. rhombifolium and discover novel genes that will be helpful in the molecular improvement of chrysanthemums.

