Direct mapping of Peptide-to-Spectra-Matches to genome information facilitates qualifying proteomics information

Author(s):  
John Anders ◽  
Hannes Petruschke ◽  
Nico Jehmlich ◽  
Sven-Bastiaan Haange ◽  
Martin von Bergen ◽  
...  

Abstract Background: Small proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins fraught with large fractions of false-positive predictions. Results: We observe that the number and quality of the Peptide-to-Spectra-Matches (PSMs) that map to a candidate ORF can be highly informative for the purpose of distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. The workflow is in particular capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration with transcriptomics data and other sources of genome-level information.
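The aggregation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' workflow: the `PSM` record, the `score` field (assumed here to be a quality value where higher is better), the descriptor names, and the thresholds in `likely_translated` are all hypothetical and chosen only to show how per-ORF PSM evidence might be condensed into simple descriptors.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PSM:
    orf_id: str    # hypothetical identifier of the ORF the peptide maps to
    peptide: str   # matched peptide sequence
    score: float   # hypothetical match-quality score, higher is better


def summarize_orfs(psms):
    """Group PSMs by ORF and return simple per-ORF descriptors."""
    by_orf = defaultdict(list)
    for psm in psms:
        by_orf[psm.orf_id].append(psm)

    summary = {}
    for orf_id, hits in by_orf.items():
        scores = [h.score for h in hits]
        summary[orf_id] = {
            "n_psms": len(hits),
            "n_unique_peptides": len({h.peptide for h in hits}),
            "mean_score": mean(scores),
            "best_score": max(scores),
        }
    return summary


def likely_translated(summary, min_psms=2, min_mean_score=0.5):
    """Return the sorted ORF ids that pass illustrative (assumed) thresholds."""
    return sorted(
        orf_id
        for orf_id, d in summary.items()
        if d["n_psms"] >= min_psms and d["mean_score"] >= min_mean_score
    )
```

In this sketch an ORF supported by several good-quality PSMs passes the filter, while an ORF with a single low-scoring match does not. A real analysis would also need the local genomic context mentioned in the abstract, which is omitted here.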

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
John Anders ◽  
Hannes Petruschke ◽  
Nico Jehmlich ◽  
Sven-Bastiaan Haange ◽  
Martin von Bergen ◽  
...  

Abstract Background: Small proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins fraught with large fractions of false-positive predictions. Results: We observe that number and quality of the peptide-spectrum matches (PSMs) that map to a candidate ORF can be highly informative for the purpose of distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. The workflow is in particular capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information.


2020 ◽  
Vol 40 (6) ◽  
Author(s):  
Corrine Corrina R. Hartford ◽  
Ashish Lal

ABSTRACT Recent advancements in genetic and proteomic technologies have revealed that more of the genome encodes proteins than originally thought possible. Specifically, some putative long noncoding RNAs (lncRNAs) have been misannotated as noncoding. Numerous lncRNAs have been found to contain short open reading frames (sORFs) which have been overlooked because of their small size. Many of these sORFs encode small proteins or micropeptides with fundamental biological importance. These micropeptides can aid in diverse processes, including cell division, transcription regulation, and cell signaling. Here we discuss strategies for establishing the coding potential of putative lncRNAs and describe various functions of known micropeptides.


2017 ◽  
Vol 61 (5) ◽  
Author(s):  
Helena Turano ◽  
Fernando Gomes ◽  
Gesiele A. Barros-Carvalho ◽  
Ralf Lopes ◽  
Louise Cerdeira ◽  
...  

ABSTRACT A novel transposon belonging to the Tn3-like family was identified on the chromosome of a commensal strain of Pseudomonas aeruginosa sequence type 2343 (ET02). Tn6350 is 7,367 bp long and harbors eight open reading frames (ORFs): an ATPase (IS481 family), a transposase (DDE catalytic type), a Tn3 resolvase, three hypothetical proteins, and genes encoding the new pyocin S8 with its immunity protein. We show that pyocin S8 displays activity against carbapenemase-producing P. aeruginosa, including IMP-1, SPM-1, VIM-1, GES-5, and KPC-2 producers.


2020 ◽  
Vol 36 (6-7) ◽  
pp. 675-677
Author(s):  
Bertrand Jordan

A systematic search for non-conventional open reading frames in human DNA reveals a large number of small ORFs encoding peptides generally smaller than 100 amino acids. These ORFs are transcribed and translated into small proteins, which are demonstrated to have functional significance by bulk CRISPR inactivation. Evidence is also found for bicistronic mRNAs including such a small ORF upstream of a canonical coding sequence. These findings add a new facet to our understanding of biological processes.


RNA Biology ◽  
2015 ◽  
Vol 12 (12) ◽  
pp. 1289-1300 ◽  
Author(s):  
Laura Patrucco ◽  
Clelia Peano ◽  
Andrea Chiesa ◽  
Filomena Guida ◽  
Imma Luisi ◽  
...  

1998 ◽  
Vol 180 (17) ◽  
pp. 4693-4703 ◽  
Author(s):  
Tendai Mhlanga-Mutangadura ◽  
Gregory Morlin ◽  
Arnold L. Smith ◽  
Abraham Eisenstark ◽  
Miriam Golomb

ABSTRACT Haemophilus influenzae is a ubiquitous colonizer of the human respiratory tract and causes diseases ranging from otitis media to meningitis. Many H. influenzae isolates express pili (fimbriae), which mediate adherence to epithelial cells and facilitate colonization. The pilus gene (hif) cluster of H. influenzae type b maps between purE and pepN and resembles a pathogenicity island: it is present in invasive strains, absent from the nonpathogenic Rd strain, and flanked by direct repeats of sequence at the insertion site. To investigate the evolution and role in pathogenesis of the hif cluster, we compared the purE-pepN regions of various H. influenzae laboratory strains and clinical isolates. Unlike Rd, most strains had an insert at this site, which usually was the only chromosomal locus of hif DNA. The inserts are diverse in length and organization: among 20 strains, nine different arrangements were found. Several nontypeable isolates lack hif genes but have two conserved open reading frames (hicA and hicB) upstream of purE; their inferred products are small proteins with no data bank homologs. Other isolates have hif genes but lack hic DNA or have combinations of hif and hic genes. By comparing these arrangements, we have reconstructed a hypothetical ancestral genotype, the extended hif cluster. The hif region of INT1, an invasive nontypeable isolate, resembles the hypothetical ancestor. We propose that a progenitor strain acquired the extended cluster by horizontal transfer and that other variants arose as deletions. The structure of the hif cluster may correlate with colonization site or pathogenicity.


2019 ◽  
Author(s):  
Jeremy Weaver ◽  
Fuad Mohammad ◽  
Allen R. Buskirk ◽  
Gisela Storz

ABSTRACT Small proteins consisting of 50 or fewer amino acids have been identified as regulators of larger proteins in bacteria and eukaryotes. Despite the importance of these molecules, the true prevalence of small proteins remains unknown because conventional annotation pipelines usually exclude small open reading frames (smORFs). We previously identified several dozen small proteins in the model organism Escherichia coli using theoretical bioinformatic approaches based on sequence conservation and matches to canonical ribosome binding sites. Here, we present an empirical approach for discovering new proteins, taking advantage of recent advances in ribosome profiling in which antibiotics are used to trap newly-initiated 70S ribosomes at start codons. This approach led to the identification of many novel initiation sites in intergenic regions in E. coli. We tagged 41 smORFs on the chromosome and detected protein synthesis for all but three. The corresponding genes are not only intergenic, but are also found antisense to other genes, in operons, and overlapping other open reading frames (ORFs), some impacting the translation of larger downstream genes. These results demonstrate the utility of this method for identifying new genes, regardless of their genomic context. IMPORTANCE Proteins comprised of 50 or fewer amino acids have been shown to interact with and modulate the function of larger proteins in a range of organisms. Despite the possible importance of small proteins, the true prevalence and capabilities of these regulators remain unknown as the small size of the proteins places serious limitations on their identification, purification and characterization. Here, we present a ribosome profiling approach with stalled initiation complexes that led to the identification of 38 new small proteins.


2017 ◽  
Author(s):  
Sondos Samandi ◽  
Annie V. Roy ◽  
Vivian Delcourt ◽  
Jean-François Lucier ◽  
Jules Gagnon ◽  
...  

Abstract Recent studies in eukaryotes have demonstrated the translation of alternative open reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and evolutionary patterns indicate that altORFs are particularly constrained in CDSs that evolve slowly. Thousands of predicted alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. Protein domains and co-conservation analyses suggest a potential functional relationship between small and large proteins encoded in the same genes. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many coding genes code for more than one protein, and that these proteins are often functionally related.

