ORFLine: a bioinformatic pipeline to prioritise small open reading frames identifies candidate secreted small proteins from lymphocytes

Abstract: The annotation of small open reading frames (smORFs) of fewer than 100 codons (<300 nucleotides) is challenging due to the large number of such sequences in the genome. The recent development of next-generation sequencing and ribosome profiling enables the identification of actively translated smORFs. In this study, we developed a computational pipeline, which we have named ORFLine, that stringently identifies smORFs and classifies them according to their position within transcripts. We identified a total of 5744 unique smORFs in datasets from mouse B and T lymphocytes and systematically characterized them using ORFLine. We further searched smORFs for the presence of a signal peptide, which predicted known secreted chemokines as well as novel micropeptides. Five novel micropeptides show evidence of secretion and are therefore candidate mediators of immunoregulatory functions.
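
The two steps named in this abstract, finding ORFs under the 100-codon threshold and labelling them by position within a transcript, can be illustrated with a minimal sketch. This is not ORFLine's implementation: the function names `find_smorfs` and `classify_by_position`, the ATG-only start rule, the choice to exclude the stop codon from the codon count, and the three position labels are all assumptions made for illustration.

```python
# Illustrative sketch only; not the ORFLine pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(seq, max_codons=100, min_codons=2):
    """Scan the three forward frames for ATG-initiated ORFs below max_codons.

    Returns sorted (start, end, n_codons) tuples. Coordinates are 0-based and
    end-exclusive, and the span includes the stop codon. n_codons counts sense
    codons only (assumption: the stop codon is not counted).
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons < max_codons:
                    hits.append((start, i + 3, n_codons))
                start = None
    return sorted(hits)


def classify_by_position(orf_start, orf_end, cds_start, cds_end):
    """Label an ORF relative to an annotated CDS on the same transcript.

    The labels are hypothetical placeholders, not ORFLine's categories.
    """
    if orf_end <= cds_start:
        return "upstream"
    if orf_start >= cds_end:
        return "downstream"
    return "overlapping"


# Usage: a toy transcript with one short ORF in the third frame.
for start, end, n_codons in find_smorfs("CCATGAAATTTTAGGG"):
    print(start, end, n_codons, classify_by_position(start, end, 20, 50))
```

A sequence scan like this only enumerates candidates. The abstract's point is that ribosome profiling evidence is what separates actively translated smORFs from the very large number of such sequences in the genome.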

SmProt: a reliable repository with comprehensive annotation of small proteins identified from ribosome profiling

10.1101/2021.04.29.441405 ◽ 2021

Author(s): Yanyan Li ◽ Honghong Zhou ◽ Xiaomin Chen ◽ Yu Zheng ◽ Quan Kang ◽ ...

Keyword(s): Genetic Variants ◽ Rattus Norvegicus ◽ Homo Sapiens ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Small Proteins ◽ Data Volume ◽ Reading Frames ◽ Disease Specific ◽ Small Open Reading Frames

Small proteins specifically refer to proteins consisting of fewer than 100 amino acids translated from small open reading frames (sORFs), which were usually missed in previous genome annotation. The significance of small proteins has been revealed in recent years, along with the discovery of their diverse functions. However, systematic annotation of small proteins is still insufficient. SmProt was specially developed to provide valuable information on small proteins for the scientific community. Here we present the update of SmProt, which emphasizes reliability of translated sORFs, genetic variants in translated sORFs, disease-specific sORF translation events or sequences, and significantly increased data volume. More components such as non-AUG translation initiation, function, and new sources are also included. SmProt incorporates 638,958 unique small proteins curated from 3,165,229 primary records, which were computationally predicted from 419 ribosome profiling (Ribo-seq) datasets and collected from the literature and other sources originating from 370 cell lines or tissues in 8 species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, and Escherichia coli). In addition, small protein families identified from human microbiomes were collected. All datasets in SmProt are free to access and available for browsing, searching, and bulk download at http://bigdata.ibp.ac.cn/SmProt/.

Identifying small proteins by ribosome profiling with stalled initiation complexes

10.1101/511675 ◽ 2019

Author(s): Jeremy Weaver ◽ Fuad Mohammad ◽ Allen R. Buskirk ◽ Gisela Storz

Keyword(s): Amino Acids ◽ Model Organism ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Genomic Context ◽ True Prevalence ◽ New Genes ◽ Small Proteins ◽ Intergenic Regions ◽ Reading Frames

ABSTRACT: Small proteins consisting of 50 or fewer amino acids have been identified as regulators of larger proteins in bacteria and eukaryotes. Despite the importance of these molecules, the true prevalence of small proteins remains unknown because conventional annotation pipelines usually exclude small open reading frames (smORFs). We previously identified several dozen small proteins in the model organism Escherichia coli using theoretical bioinformatic approaches based on sequence conservation and matches to canonical ribosome binding sites. Here, we present an empirical approach for discovering new proteins, taking advantage of recent advances in ribosome profiling in which antibiotics are used to trap newly-initiated 70S ribosomes at start codons. This approach led to the identification of many novel initiation sites in intergenic regions in E. coli. We tagged 41 smORFs on the chromosome and detected protein synthesis for all but three. The corresponding genes are not only intergenic, but are also found antisense to other genes, in operons, and overlapping other open reading frames (ORFs), some impacting the translation of larger downstream genes. These results demonstrate the utility of this method for identifying new genes, regardless of their genomic context.

IMPORTANCE: Proteins comprised of 50 or fewer amino acids have been shown to interact with and modulate the function of larger proteins in a range of organisms. Despite the possible importance of small proteins, the true prevalence and capabilities of these regulators remain unknown as the small size of the proteins places serious limitations on their identification, purification and characterization. Here, we present a ribosome profiling approach with stalled initiation complexes that led to the identification of 38 new small proteins.

RiboReport - Benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria

10.1101/2021.06.08.447495 ◽ 2021

Author(s): Rick Gelhausen ◽ Teresa Müller ◽ Sarah Svensson ◽ Omer S. Alkhnbashi ◽ Cynthia M. Sharma ◽ ...

Keyword(s): High Sensitivity ◽ Predictive Performance ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Rna Seq ◽ E Coli ◽ Prediction Tools ◽ Small Proteins ◽ Significant Difference ◽ Reading Frames

Small proteins, those encoded by open reading frames of 50 or fewer codons, are emerging as an important class of cellular macromolecules in all kingdoms of life. However, they are recalcitrant to detection by proteomics or in silico methods. Ribosome profiling (Ribo-seq) has revealed widespread translation of sORFs in diverse species, and this has driven the development of ORF detection tools using Ribo-seq read signals. However, only a handful of tools have been designed for bacterial data, and these have not yet been systematically compared. Here, we have performed a comprehensive benchmark of ORF prediction tools which handle bacterial Ribo-seq data. For this, we created a novel Ribo-seq dataset for E. coli, and based on this plus three publicly available datasets for different bacteria, we created a benchmark set by manual labeling of translated ORFs using their Ribo-seq expression profile. This was then used to investigate the predictive performance of four Ribo-seq-based ORF detection tools that we found to be compatible with bacterial data (REPARATION_blast, DeepRibo, Ribo-TISH and SPECtre). The tool IRSOM was also included as a comparison for tools using coding potential and RNA-seq coverage only. DeepRibo and REPARATION_blast robustly predicted translated ORFs, including sORFs, with no significant difference for those inside or outside of operons. However, none of the tools was able to predict a set of recently identified, novel, experimentally verified sORFs with high sensitivity. Overall, we find there is potential for improving the performance, applicability, usability, and reproducibility of prokaryotic ORF prediction tools that use Ribo-seq as input.
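
To make the notion of sensitivity against a manually labeled benchmark set concrete, here is a minimal sketch. It is not RiboReport's evaluation code: the function name `benchmark`, the `(contig, start, end, strand)` tuple layout, and the exact-coordinate matching rule are assumptions chosen for simplicity. A real benchmark may accept partial overlaps instead.

```python
# Illustrative sketch only; exact-coordinate matching is an assumption.
def benchmark(predicted, labeled):
    """Compare predicted ORFs with a labeled set of translated ORFs.

    Each ORF is a hashable (contig, start, end, strand) tuple. A prediction
    counts as a true positive only if it matches a labeled ORF exactly.
    """
    predicted, labeled = set(predicted), set(labeled)
    true_positives = len(predicted & labeled)
    sensitivity = true_positives / len(labeled) if labeled else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return {"tp": true_positives, "sensitivity": sensitivity, "precision": precision}


# Usage with made-up coordinates: two of four labeled ORFs are recovered.
labeled = [("chr", 10, 40, "+"), ("chr", 100, 160, "-"),
           ("chr", 500, 560, "+"), ("chr", 700, 730, "-")]
predicted = [("chr", 10, 40, "+"), ("chr", 100, 160, "-"), ("chr", 300, 330, "+")]
print(benchmark(predicted, labeled))
```

Reporting sensitivity separately for subsets of the labeled set, for example sORFs only or ORFs inside versus outside operons, follows the same pattern by filtering `labeled` before the call.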

smORFer: a modular algorithm to detect small ORFs in prokaryotes

10.1101/2020.05.21.109181 ◽ 2020

Author(s): Alexander Bartholomäus ◽ Baban Kolte ◽ Ayten Mustafayeva ◽ Ingrid Goebel ◽ Stephan Fuchs ◽ ...

Keyword(s): Integrated Approach ◽ Structural Features ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Data Sets ◽ Physiological Processes ◽ Small Proteins ◽ Prokaryotic Genomes ◽ Modular Algorithm ◽ Reading Frames

ABSTRACT: Emerging evidence places small proteins (≤ 50 amino acids) more centrally in physiological processes. Yet, the identification of functional small proteins and the systematic genome annotation of their cognate small open reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling or Ribo-Seq (that is, deep sequencing of ribosome-protected fragments) enables the detection of actively translated open reading frames (ORFs) and the empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple identifiers of ORFs that use 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. Yet, they have difficulties evaluating prokaryotic genomes due to the unique architecture of prokaryotic genomes (e.g. polycistronic messages, overlapping ORFs, leaderless translation, non-canonical initiation etc.). Here, we present our new algorithm, smORFer, which detects smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is that it uses an integrated approach: it considers structural features of the genetic sequence along with in-register translation, and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way and, depending on the data available for a particular organism, allows different modules to be used for the smORF search.
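
As a rough illustration of how a Fourier transform can turn 3-nt periodicity into a single score, the sketch below computes the share of a coverage profile's non-constant spectral power that sits at period 3. It is not smORFer's scoring: the function name `periodicity_score`, the per-nucleotide count input, the mean-centering, and the normalisation by total power are all assumptions.

```python
# Illustrative sketch only; not the smORFer scoring scheme.
import cmath


def periodicity_score(counts):
    """Fraction of non-constant spectral power at period 3 (between 0 and 1).

    counts: per-nucleotide read counts along a candidate ORF, for example
    the number of ribosome-protected fragments assigned to each position.
    """
    n = len(counts) - len(counts) % 3  # trim to a whole number of codons
    if n < 6:
        return 0.0
    mean = sum(counts[:n]) / n
    centered = [value - mean for value in counts[:n]]  # remove the constant term
    total = sum(value * value for value in centered)
    if total == 0:
        return 0.0  # flat coverage carries no periodic signal
    k = n // 3  # DFT bin corresponding to a period of exactly 3 nt
    coeff = sum(value * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, value in enumerate(centered))
    # By Parseval's theorem the squared bin magnitudes sum to n * total; for
    # real input, bins k and n - k hold equal power, hence the factor 2.
    return 2 * abs(coeff) ** 2 / (n * total)


# Usage: reads piling up on the first position of every codon give a score
# close to 1, while flat coverage gives 0.
print(periodicity_score([10, 0, 0] * 10))
print(periodicity_score([5] * 30))
```

A score near 1 indicates that almost all variation in coverage repeats every three nucleotides, which is the in-register pattern the abstract attributes to genuinely translating ribosomes. Any threshold for calling a smORF translated would have to be chosen empirically.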

Small Protein Enrichment Improves Proteomics Detection of sORF Encoded Polypeptides

Frontiers in Genetics ◽ 10.3389/fgene.2021.713400 ◽ 2021 ◽ Vol 12

Author(s): Igor Fijalkowski ◽ Marlies K. R. Peeters ◽ Petra Van Damme

Keyword(s): Protein Extraction ◽ Protein Solubility ◽ Extraction Methods ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Physiochemical Properties ◽ Small Proteins ◽ Efficient Detection ◽ Optimized Protocol ◽ Small Open Reading Frames

With the rapid growth in the number of sequenced genomes, genome annotation efforts have become almost exclusively reliant on automated pipelines. Despite their unquestionable utility, these methods have been shown to underestimate the true complexity of the studied genomes, with small open reading frames (sORFs; ORFs typically considered shorter than 300 nucleotides) and, in consequence, their protein products (sORF encoded polypeptides or SEPs) being the primary example of a poorly annotated and highly underexplored class of genomic elements. With the advent of advanced translatomics such as ribosome profiling, reannotation efforts have progressed a great deal in providing translation evidence for numerous, previously unannotated sORFs. However, proteomics validation of these riboproteogenomics discoveries remains challenging due to their short length and often highly variable physiochemical properties. In this work we evaluate and compare tailored, yet easily adaptable, protein extraction methodologies for their efficacy in the extraction and concomitant proteomics detection of SEPs expressed in the prokaryotic model pathogen Salmonella typhimurium (S. typhimurium). Further, an optimized protocol for the enrichment and efficient detection of SEPs, making use of the amphipathic polymer amphipol A8-35 and relying on differential peptide vs. protein solubility, was developed and compared with global extraction methods making use of chaotropic agents. Given the versatile biological functions SEPs have been shown to exert, this work provides an accessible protocol for proteomics exploration of this fascinating class of small proteins.

Frequent translation of small open reading frames in evolutionary conserved lncRNA regions

10.1101/348326 ◽ 2018 ◽ Cited By ~ 1

Author(s): Jorge Ruiz-Orera ◽ M. Mar Albà

Keyword(s): Ribosomal Protein ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Protein Coding ◽ Conserved Regions ◽ Biochemical Measurements ◽ Rna Interaction ◽ Human And Mouse ◽ Reading Frames ◽ Small Open Reading Frames

SUMMARY: The mammalian transcriptome includes thousands of transcripts that do not correspond to annotated protein-coding genes. Although many of these transcripts show homology between human and mouse, only a small proportion of them have been functionally characterized. Here we use ribosome profiling data to identify translated open reading frames, as well as non-ribosomal protein-RNA interactions, in evolutionarily conserved and non-conserved transcripts. We find that conserved regions are subject to significant evolutionary constraints and are enriched in translated open reading frames, as well as non-ribosomal protein-RNA interaction signatures, when compared to non-conserved regions. Translated ORFs can be divided into two classes: those encoding functional micropeptides and those that show no evidence of protein functionality. This study underscores the importance of combining evolutionary and biochemical measurements to advance toward a more complete understanding of the transcriptome.

Identifying Small Proteins by Ribosome Profiling with Stalled Initiation Complexes

mBio ◽ 10.1128/mbio.02819-18 ◽ 2019 ◽ Vol 10 (2) ◽ Cited By ~ 45

Author(s): Jeremy Weaver ◽ Fuad Mohammad ◽ Allen R. Buskirk ◽ Gisela Storz

Keyword(s): Amino Acids ◽ Model Organism ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Genomic Context ◽ Content Type ◽ New Genes ◽ Small Proteins ◽ Intergenic Regions ◽ Reading Frames

ABSTRACT: Small proteins consisting of 50 or fewer amino acids have been identified as regulators of larger proteins in bacteria and eukaryotes. Despite the importance of these molecules, the total number of small proteins remains unknown because conventional annotation pipelines usually exclude small open reading frames (smORFs). We previously identified several dozen small proteins in the model organism Escherichia coli using theoretical bioinformatic approaches based on sequence conservation and matches to canonical ribosome binding sites. Here, we present an empirical approach for discovering new proteins, taking advantage of recent advances in ribosome profiling in which antibiotics are used to trap newly initiated 70S ribosomes at start codons. This approach led to the identification of many novel initiation sites in intergenic regions in E. coli. We tagged 41 smORFs on the chromosome and detected protein synthesis for all but three. Not only are the corresponding genes intergenic but they are also found antisense to other genes, in operons, and overlapping other open reading frames (ORFs), some impacting the translation of larger downstream genes. These results demonstrate the utility of this method for identifying new genes, regardless of their genomic context.

IMPORTANCE: Proteins comprised of 50 or fewer amino acids have been shown to interact with and modulate the functions of larger proteins in a range of organisms. Despite the possible importance of small proteins, the true prevalence and capabilities of these regulators remain unknown as the small size of the proteins places serious limitations on their identification, purification, and characterization. Here, we present a ribosome profiling approach with stalled initiation complexes that led to the identification of 38 new small proteins.

Detecting Misannotated Long Non-coding RNAs with Training Dynamics of Deep Sequence Classification

10.1101/2020.11.07.372771 ◽ 2020

Author(s): Afshan Nabi ◽ Ogun Adebali ◽ Oznur Tastan

Keyword(s): Ground Truth ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Sequence Classification ◽ Learning Models ◽ Coding Sequences ◽ Lncrna Transcript ◽ Non Coding Rnas ◽ Reading Frames ◽ Small Open Reading Frames

Abstract: Long non-coding RNAs (lncRNAs) are the largest class of non-coding RNAs (ncRNAs). However, recent experimental evidence has shown that some lncRNAs contain small open reading frames (sORFs) that are translated into functional micropeptides. Current methods to detect misannotated lncRNAs rely on ribosome-profiling (ribo-seq) experiments, which are expensive and cell-type dependent. In addition, while very accurate machine learning models have been trained to distinguish between coding and non-coding sequences, little attention has been paid to the increasing evidence that the ground-truth labels of some lncRNAs in the underlying training datasets are incorrect. We present a framework that leverages deep learning models' training dynamics to determine whether a given lncRNA transcript is misannotated. Our models achieve AUC scores > 91% and AUPR > 93% in classifying non-coding vs. coding sequences while allowing us to identify possible misannotated lncRNAs present in the dataset. Our results overlap significantly with a set of experimentally validated misannotated lncRNAs as well as with coding sORFs within lncRNAs found in a ribo-seq dataset. The general framework applied here offers promising potential for use in curating datasets used for training coding potential predictors and assisting experimental efforts in characterizing the hidden proteome encoded by misannotated lncRNAs. Source code is available at https://github.com/nabiafshan/DetectingMisannotatedLncRNAs.