Direct mapping of Peptide-to-Spectra-Matches to genome information facilitates qualifying proteomics information

Abstract Background: Small proteins have received increasing attention in recent years. In particular, they have been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins, because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically, this yields large numbers of putative novel small proteins, a large fraction of which are false-positive predictions. Results: We observe that the number and quality of the Peptide-to-Spectra-Matches (PSMs) that map to a candidate ORF can be highly informative for distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated, ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. In particular, the workflow is capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information.
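
The idea of aggregating PSM information per candidate ORF can be illustrated with a minimal Python sketch. It is not the authors' workflow: the `ORF` and `PSM` records, the `score` field (assumed to be a quality score where higher is better), and the chosen descriptors are hypothetical, and reading-frame checks are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ORF:
    orf_id: str
    contig: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class PSM:
    peptide: str
    contig: str
    start: int   # genomic coordinates of the mapped peptide
    end: int
    strand: str
    score: float  # hypothetical quality score, higher is better


def aggregate_psms(orfs, psms):
    """Return {orf_id: descriptors} for ORFs that fully contain at least one PSM."""
    summary = {}
    for orf in orfs:
        # A PSM supports an ORF if it lies on the same contig and strand
        # and is completely contained in the ORF's interval.
        hits = [
            p for p in psms
            if p.contig == orf.contig
            and p.strand == orf.strand
            and orf.start <= p.start
            and p.end <= orf.end
        ]
        if not hits:
            continue
        scores = [p.score for p in hits]
        summary[orf.orf_id] = {
            "n_psms": len(hits),
            "n_unique_peptides": len({p.peptide for p in hits}),
            "best_score": max(scores),
            "mean_score": sum(scores) / len(scores),
        }
    return summary
```

ORFs supported by several distinct, high-scoring peptides would then rank above ORFs hit by a single low-scoring PSM; any thresholds on these descriptors would need to be calibrated on real data.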

A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations

BMC Bioinformatics ◽

10.1186/s12859-021-04159-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

John Anders ◽

Hannes Petruschke ◽

Nico Jehmlich ◽

Sven-Bastiaan Haange ◽

Martin von Bergen ◽

...

Keyword(s):

False Positive ◽

Open Reading Frames ◽

Local Context ◽

Quality Information ◽

Hypothetical Proteins ◽

Novel Proteins ◽

Small Proteins ◽

Large Numbers ◽

Genomic Locations ◽

Reading Frames

Abstract Background: Small proteins have received increasing attention in recent years. In particular, they have been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins, because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically, this yields large numbers of putative novel small proteins, a large fraction of which are false-positive predictions. Results: We observe that the number and quality of the peptide-spectrum matches (PSMs) that map to a candidate ORF can be highly informative for distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated, ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. In particular, the workflow is capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information.
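
To make "all possible open reading frames" concrete, the following Python sketch enumerates stop-free stretches in all six reading frames of a DNA sequence. It is only an illustration under stated assumptions: the function names, the standard stop codons, the stop-to-stop ORF definition, and the `min_codons` cutoff are choices made here, not details taken from the paper.

```python
STOPS = {"TAA", "TAG", "TGA"}          # standard stop codons (assumption)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def six_frame_orfs(seq, min_codons=10):
    """Yield (strand, frame, start, end) for stop-free stretches in six frames.

    Coordinates are 0-based and end-exclusive, measured on the scanned
    strand (i.e. on the reverse complement for strand "-").
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = frame
            # Walk codon by codon; a stop codon closes the current stretch.
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - orf_start) // 3 >= min_codons:
                        yield (strand, frame, orf_start, i)
                    orf_start = i + 3
            # Report the trailing stretch that runs to the sequence end.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - orf_start) // 3 >= min_codons:
                yield (strand, frame, orf_start, last)
```

A permissive `min_codons` keeps short candidates in the search space, which is what makes small proteins discoverable and also what inflates the number of spurious candidates that later filtering has to remove.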

[34] GATEWAY recombinational cloning: Application to the cloning of large numbers of open reading frames or ORFeomes

Methods in Enzymology - Applications of Chimeric Genes and Hybrid Proteins - Part C: Protein-Protein Interactions and Genomics ◽

10.1016/s0076-6879(00)28419-x ◽

2000 ◽

pp. 575-IN7 ◽

Cited By ~ 421

Author(s):

Albertha J.M. Walhout ◽

Gary F. Temple ◽

Michael A. Brasch ◽

James L. Hartley ◽

Monique A. Lorson ◽

...

Keyword(s):

Open Reading Frames ◽

Large Numbers ◽

Reading Frames

When Long Noncoding Becomes Protein Coding

Molecular and Cellular Biology ◽

10.1128/mcb.00528-19 ◽

2020 ◽

Vol 40 (6) ◽

Cited By ~ 14

Author(s):

Corrine Corrina R. Hartford ◽

Ashish Lal

Keyword(s):

Cell Division ◽

Cell Signaling ◽

Transcription Regulation ◽

Noncoding RNAs ◽

Long Noncoding RNAs ◽

Open Reading Frames ◽

Protein Coding ◽

Small Proteins ◽

Coding Potential ◽

Reading Frames

ABSTRACT Recent advancements in genetic and proteomic technologies have revealed that more of the genome encodes proteins than originally thought possible. Specifically, some putative long noncoding RNAs (lncRNAs) have been misannotated as noncoding. Numerous lncRNAs have been found to contain short open reading frames (sORFs) which have been overlooked because of their small size. Many of these sORFs encode small proteins or micropeptides with fundamental biological importance. These micropeptides can aid in diverse processes, including cell division, transcription regulation, and cell signaling. Here we discuss strategies for establishing the coding potential of putative lncRNAs and describe various functions of known micropeptides.

Tn6350, a Novel Transposon Carrying Pyocin S8 Genes Encoding a Bacteriocin with Activity against Carbapenemase-Producing Pseudomonas aeruginosa

Antimicrobial Agents and Chemotherapy ◽

10.1128/aac.00100-17 ◽

2017 ◽

Vol 61 (5) ◽

Cited By ~ 2

Author(s):

Helena Turano ◽

Fernando Gomes ◽

Gesiele A. Barros-Carvalho ◽

Ralf Lopes ◽

Louise Cerdeira ◽

...

Keyword(s):

Pseudomonas Aeruginosa ◽

Type A ◽

Sequence Type ◽

Open Reading Frames ◽

Hypothetical Proteins ◽

Immunity Protein ◽

Content Type ◽

Commensal Strain ◽

Genes Encoding ◽

Reading Frames

ABSTRACT A novel transposon belonging to the Tn3-like family was identified on the chromosome of a commensal strain of Pseudomonas aeruginosa sequence type 2343 (ET02). Tn6350 is 7,367 bp long and harbors eight open reading frames (ORFs): an ATPase (IS481 family), a transposase (DDE catalytic type), a Tn3 resolvase, three hypothetical proteins, and genes encoding the new pyocin S8 and its immunity protein. We show that pyocin S8 displays activity against carbapenemase-producing P. aeruginosa, including IMP-1, SPM-1, VIM-1, GES-5, and KPC-2 producers.

Chroniques génomiques

médecine/sciences ◽

10.1051/medsci/2020108 ◽

2020 ◽

Vol 36 (6-7) ◽

pp. 675-677

Author(s):

Bertrand Jordan

Keyword(s):

Amino Acids ◽

Functional Significance ◽

Open Reading Frames ◽

Systematic Search ◽

Biological Processes ◽

Coding Sequence ◽

Small Proteins ◽

Human DNA ◽

Small ORFs ◽

Reading Frames

A systematic search for non-conventional open reading frames in human DNA reveals a large number of small ORFs encoding peptides generally smaller than 100 amino acids. These ORFs are transcribed and translated into small proteins, which are shown to have functional significance by bulk CRISPR inactivation. Evidence is also found for bicistronic mRNAs that include such a small ORF upstream of a canonical coding sequence. These findings add a new facet to our understanding of biological processes.

Mass Spectrometry-Based Proteomics Analyses Using the OpenProt Database to Unveil Novel Proteins Translated from Non-Canonical Open Reading Frames

Journal of Visualized Experiments ◽

10.3791/59589 ◽

2019 ◽

Cited By ~ 4

Author(s):

Marie A. Brunet ◽

Xavier Roucou

Keyword(s):

Mass Spectrometry ◽

Open Reading Frames ◽

Novel Proteins ◽

Reading Frames

Identification of novel proteins binding the AU-rich element of α-prothymosin mRNA through the selection of open reading frames (RIDome)

RNA Biology ◽

10.1080/15476286.2015.1107702 ◽

2015 ◽

Vol 12 (12) ◽

pp. 1289-1300 ◽

Cited By ~ 2

Author(s):

Laura Patrucco ◽

Clelia Peano ◽

Andrea Chiesa ◽

Filomena Guida ◽

Imma Luisi ◽

...

Keyword(s):

Open Reading Frames ◽

Novel Proteins ◽

Reading Frames ◽

Selection Of

Evolution of the Major Pilus Gene Cluster of Haemophilus influenzae

Journal of Bacteriology ◽

10.1128/jb.180.17.4693-4703.1998 ◽

1998 ◽

Vol 180 (17) ◽

pp. 4693-4703 ◽

Cited By ~ 40

Author(s):

Tendai Mhlanga-Mutangadura ◽

Gregory Morlin ◽

Arnold L. Smith ◽

Abraham Eisenstark ◽

Miriam Golomb

Keyword(s):

Insertion Site ◽

Data Bank ◽

Open Reading Frames ◽

Direct Repeats ◽

Chromosomal Locus ◽

Human Respiratory Tract ◽

Small Proteins ◽

Haemophilus Influenzae ◽

Hypothetical Ancestor ◽

Reading Frames

ABSTRACT Haemophilus influenzae is a ubiquitous colonizer of the human respiratory tract and causes diseases ranging from otitis media to meningitis. Many H. influenzae isolates express pili (fimbriae), which mediate adherence to epithelial cells and facilitate colonization. The pilus gene (hif) cluster of H. influenzae type b maps between purE and pepN and resembles a pathogenicity island: it is present in invasive strains, absent from the nonpathogenic Rd strain, and flanked by direct repeats of sequence at the insertion site. To investigate the evolution and role in pathogenesis of the hif cluster, we compared the purE-pepN regions of various H. influenzae laboratory strains and clinical isolates. Unlike Rd, most strains had an insert at this site, which usually was the only chromosomal locus of hif DNA. The inserts are diverse in length and organization: among 20 strains, nine different arrangements were found. Several nontypeable isolates lack hif genes but have two conserved open reading frames (hicA and hicB) upstream of purE; their inferred products are small proteins with no data bank homologs. Other isolates have hif genes but lack hic DNA or have combinations of hif and hic genes. By comparing these arrangements, we have reconstructed a hypothetical ancestral genotype, the extended hif cluster. The hif region of INT1, an invasive nontypeable isolate, resembles the hypothetical ancestor. We propose that a progenitor strain acquired the extended cluster by horizontal transfer and that other variants arose as deletions. The structure of the hif cluster may correlate with colonization site or pathogenicity.

Identifying small proteins by ribosome profiling with stalled initiation complexes

10.1101/511675 ◽

2019 ◽

Author(s):

Jeremy Weaver ◽

Fuad Mohammad ◽

Allen R. Buskirk ◽

Gisela Storz

Keyword(s):

Amino Acids ◽

Model Organism ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Genomic Context ◽

True Prevalence ◽

New Genes ◽

Small Proteins ◽

Intergenic Regions ◽

Reading Frames

ABSTRACT Small proteins consisting of 50 or fewer amino acids have been identified as regulators of larger proteins in bacteria and eukaryotes. Despite the importance of these molecules, the true prevalence of small proteins remains unknown because conventional annotation pipelines usually exclude small open reading frames (smORFs). We previously identified several dozen small proteins in the model organism Escherichia coli using theoretical bioinformatic approaches based on sequence conservation and matches to canonical ribosome binding sites. Here, we present an empirical approach for discovering new proteins, taking advantage of recent advances in ribosome profiling in which antibiotics are used to trap newly initiated 70S ribosomes at start codons. This approach led to the identification of many novel initiation sites in intergenic regions in E. coli. We tagged 41 smORFs on the chromosome and detected protein synthesis for all but three. The corresponding genes are not only intergenic, but are also found antisense to other genes, in operons, and overlapping other open reading frames (ORFs), some impacting the translation of larger downstream genes. These results demonstrate the utility of this method for identifying new genes, regardless of their genomic context.
IMPORTANCE Proteins comprised of 50 or fewer amino acids have been shown to interact with and modulate the function of larger proteins in a range of organisms. Despite the possible importance of small proteins, the true prevalence and capabilities of these regulators remain unknown, as the small size of the proteins places serious limitations on their identification, purification, and characterization. Here, we present a ribosome profiling approach with stalled initiation complexes that led to the identification of 38 new small proteins.

Deep transcriptome annotation suggests that small and large proteins encoded in the same genes often cooperate

10.1101/142992 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sondos Samandi ◽

Annie V. Roy ◽

Vivian Delcourt ◽

Jean-François Lucier ◽

Jules Gagnon ◽

...

Keyword(s):

Functional Relationship ◽

Mitochondrial Fission ◽

Open Reading Frames ◽

Gene Encoding ◽

Evolutionary Patterns ◽

Protein Coding ◽

Coding Sequences ◽

Large Proteins ◽

Small Proteins ◽

Reading Frames

Abstract Recent studies in eukaryotes have demonstrated the translation of alternative open reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by altORFs. The putative alternative proteins translated from altORFs have orthologs in many species, and evolutionary patterns indicate that altORFs are particularly constrained in CDSs that evolve slowly. Thousands of predicted alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. Protein domains and co-conservation analyses suggest a potential functional relationship between small and large proteins encoded in the same genes. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many coding genes code for more than one protein and that these proteins are often functionally related.
