Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Abstract: Pseudogenes are gene copies presumed to be mainly functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we use CRISPR-Cas9 to delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

10.1101/2021.03.29.437610 ◽ 2021

Author(s): Robin-Lee Troskie ◽ Yohaann Jafrani ◽ Tim R Mercer ◽ Adam D Ewing ◽ Geoffrey J Faulkner ◽ ...

Keyword(s): Cultured Cells ◽ Sequence Similarity ◽ Open Reading Frames ◽ cDNA Sequencing ◽ High Sequence Similarity ◽ Protein Coding ◽ Dynamic Component ◽ Gene Copies ◽ Sequencing Platforms ◽ The Impact

Pseudogenes are gene copies presumed to be mainly functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. When transcribed, pseudogenes may encode proteins or enact RNA-intrinsic regulatory mechanisms. However, the extent, characteristics and functional relevance of the human pseudogene transcriptome are unclear. Short-read sequencing platforms have limited power to resolve and accurately quantify pseudogene transcripts owing to the high sequence similarity of pseudogenes and their parent genes. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes. Pseudogene transcripts are expressed in tissue-specific patterns, exhibit complex splicing patterns and contribute to the coding sequences of known genes. We survey pseudogene transcripts encoding intact open reading frames (ORFs), representing potential unannotated protein-coding genes, and demonstrate their efficient translation in cultured cells. To assess the impact of noncoding pseudogenes on the cellular transcriptome, we delete the nucleus-enriched pseudogene PDCL3P4 transcript from HAP1 cells and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the transcriptional landscape underpinning human biology and disease.

Disrupting upstream translation in mRNAs is associated with human disease

Nature Communications ◽ 10.1038/s41467-021-21812-1 ◽ 2021 ◽ Vol 12 (1)

Author(s): David S. M. Lee ◽ Joseph Park ◽ Andrew Kromer ◽ Aris Baras ◽ Daniel J. Rader ◽ ...

Keyword(s): Protein Expression ◽ Biological Significance ◽ Ribosome Profiling ◽ Open Reading Frames ◽ Protein Coding ◽ Stop Codons ◽ Human Genes ◽ Strong Negative Selection ◽ Disease Associations ◽ Reading Frames

Abstract: Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5′ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.
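
The notion of a uORF variant that introduces a new stop codon can be made concrete with a small sketch. This is not the authors' pipeline: the function names `find_uorfs` and `introduces_stop`, the ATG-only start rule and the toy sequence are all assumptions chosen for illustration, and only single-base substitutions are handled.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's method).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs of ATG-initiated ORFs that terminate
    inside the 5' UTR. `end` is exclusive and includes the stop codon."""
    utr5 = utr5.upper()
    uorfs = []
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


def introduces_stop(utr5: str, uorf: tuple[int, int], pos: int, alt: str) -> bool:
    """True if substituting base `alt` at 0-based `pos` creates a new
    in-frame stop codon between the uORF start codon and its existing stop."""
    start, end = uorf
    if not (start + 3 <= pos < end - 3):
        return False  # outside the uORF body (start and stop codons excluded)
    codon_start = start + ((pos - start) // 3) * 3
    mutated = utr5[:pos] + alt + utr5[pos + 1:]
    return mutated[codon_start:codon_start + 3].upper() in STOP_CODONS


utr = "CCATGGCTTGGCAATAACC"        # toy 5' UTR: ATG GCT TGG CAA TAA
uorf = find_uorfs(utr)[0]          # the single uORF in this toy sequence
gain = introduces_stop(utr, uorf, 10, "A")   # TGG -> TGA
neutral = introduces_stop(utr, uorf, 6, "A") # GCT -> GAT
```

In the toy sequence, the first substitution converts a tryptophan codon into a premature stop, while the second changes only the encoded amino acid.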

Overexpression-based detection of translatable circular RNAs is vulnerable to coexistent linear RNA byproducts

10.1101/2021.03.23.433163 ◽ 2021

Author(s): Yanyi Jiang ◽ Xiaofan Chen ◽ Wei Zhang

Keyword(s): Open Reading Frames ◽ Systematic Evaluation ◽ Circular RNAs ◽ Protein Coding ◽ Rolling Circle ◽ Functional Investigation ◽ Overexpression System ◽ Translation Signals ◽ Coding Potential ◽ Reading Frames

Abstract: In the RNA field, the demarcation between coding and non-coding has been blurred by the recent discovery of occasionally translated circular RNAs (circRNAs). Although they lack a 5′ cap structure, circRNAs can be translated cap-independently. Complementary intron-mediated overexpression is one of the most widely used methodologies in circRNA research, but it has drawn recurring skepticism for its poorly defined mechanism and latent coexistent side products. In this study, leveraging such a circRNA overexpression system, we have interrogated the protein-coding potential of 30 human circRNAs containing infinite open reading frames in HEK293T cells. Surprisingly, pervasive translation signals are detected by immunoblotting. However, intensive mutagenesis reveals that numerous translation signals are generated independently of circRNA synthesis. We have developed a dual tag strategy to isolate translation noise and directly demonstrate that the fallacious translation signals originate from cryptically spliced linear transcripts. The concomitant linear RNA byproducts, presumably concatemers, can be translated to produce pseudo rolling circle translation signals, and can contain the backsplicing junction (BSJ), which disqualifies BSJ-based evidence for circRNA translation. We also find that non-AUG start codons may engage in the translation initiation of circRNAs. Taken together, our systematic evaluation sheds light on heterogeneous translational outputs from the circRNA overexpression vector and comes with a caveat that the ectopic overexpression technique necessitates an extremely rigorous control setup in circRNA translation and functional investigation.
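
The abstract does not define "infinite open reading frame". A common reading, assumed here, is a start codon on a circular RNA from which translation never meets an in-frame stop codon, however many laps the ribosome makes. The sketch below checks that property; `infinite_orf_starts` is a hypothetical helper, not code from the study, and it considers AUG starts only by default.

```python
from math import gcd

# Illustrative sketch only (assumed definition of an "infinite ORF").
STOP_CODONS = {"UAA", "UAG", "UGA"}


def infinite_orf_starts(circ: str, start_codon: str = "AUG") -> list[int]:
    """Return 0-based positions of start codons on a circular RNA from which
    no in-frame stop codon is ever reached."""
    seq = circ.upper().replace("T", "U")
    n = len(seq)
    if n < 3:
        return []

    def codon_at(i: int) -> str:
        # Read three bases with wrap-around across the circle's junction.
        return "".join(seq[(i + k) % n] for k in range(3))

    # After lcm(n, 3) / 3 codons the frame returns to where it began:
    # one lap if n is a multiple of 3, otherwise three laps.
    n_codons = n // gcd(n, 3)
    starts = []
    for i in range(n):
        if codon_at(i) != start_codon:
            continue
        if not any(codon_at(i + 3 * c) in STOP_CODONS for c in range(n_codons)):
            starts.append(i)
    return starts


open_circle = infinite_orf_starts("AUGGCCAAA")     # 9 nt, no stop in frame
closed_circle = infinite_orf_starts("AUGGCCUAA")   # in-frame UAA on the first lap
frame_shifted = infinite_orf_starts("AUGGCUGAAC")  # 10 nt, UGA met on a later lap
```

When the circle length is not a multiple of three, each lap shifts the reading frame, so a stop codon in any of the three frames ends translation; the third example illustrates this case.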

A micropeptide encoded by lncRNA MIR155HG suppresses autoimmune inflammation via modulating antigen presentation

Science Advances ◽ 10.1126/sciadv.aaz2059 ◽ 2020 ◽ Vol 6 (21) ◽ pp. eaaz2059 ◽ Cited By ~ 4

Author(s): Liman Niu ◽ Fangzhou Lou ◽ Yang Sun ◽ Libo Sun ◽ Xiaojie Cai ◽ ...

Keyword(s): Antigen Presentation ◽ Inflammatory Diseases ◽ Open Reading Frames ◽ Protein Coding ◽ Histocompatibility Complex ◽ Antigen Trafficking ◽ Heat Shock Cognate Protein ◽ Antigen Presenting ◽ Cognate Protein ◽ Reading Frames

Many annotated long noncoding RNAs (lncRNAs) harbor predicted short open reading frames (sORFs), but the coding capacities of these sORFs and the functions of the resulting micropeptides remain elusive. Here, we report that human lncRNA MIR155HG encodes a 17–amino acid micropeptide, which we termed miPEP155 (P155). MIR155HG is highly expressed by inflamed antigen-presenting cells, leading to the discovery that P155 interacts with the adenosine 5′-triphosphate binding domain of heat shock cognate protein 70 (HSC70), a chaperone required for antigen trafficking and presentation in dendritic cells (DCs). P155 modulates major histocompatibility complex class II–mediated antigen presentation and T cell priming by disrupting the HSC70-HSP90 machinery. Exogenously injected P155 improves two classical mouse models of DC-driven autoinflammation. Collectively, we demonstrate the endogenous existence of a micropeptide encoded by a transcript annotated as “non-protein coding” and characterize a micropeptide as a regulator of antigen presentation and a suppressor of inflammatory diseases.

When Long Noncoding Becomes Protein Coding

Molecular and Cellular Biology ◽ 10.1128/mcb.00528-19 ◽ 2020 ◽ Vol 40 (6) ◽ Cited By ~ 14

Author(s): Corrine Corrina R. Hartford ◽ Ashish Lal

Keyword(s): Cell Division ◽ Cell Signaling ◽ Transcription Regulation ◽ Noncoding RNAs ◽ Long Noncoding RNAs ◽ Open Reading Frames ◽ Protein Coding ◽ Small Proteins ◽ Coding Potential ◽ Reading Frames

ABSTRACT Recent advancements in genetic and proteomic technologies have revealed that more of the genome encodes proteins than originally thought possible. Specifically, some putative long noncoding RNAs (lncRNAs) have been misannotated as noncoding. Numerous lncRNAs have been found to contain short open reading frames (sORFs) which have been overlooked because of their small size. Many of these sORFs encode small proteins or micropeptides with fundamental biological importance. These micropeptides can aid in diverse processes, including cell division, transcription regulation, and cell signaling. Here we discuss strategies for establishing the coding potential of putative lncRNAs and describe various functions of known micropeptides.

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽ 10.1128/jvi.78.20.11187-11197.2004 ◽ 2004 ◽ Vol 78 (20) ◽ pp. 11187-11197 ◽ Cited By ~ 105

Author(s): Lisa M. Kattenhorn ◽ Ryan Mills ◽ Markus Wagner ◽ Alexandre Lomsadze ◽ Vsevolod Makeev ◽ ...

Keyword(s): Gene Prediction ◽ Polyacrylamide Gel Electrophoresis ◽ Sodium Dodecyl ◽ Open Reading Frames ◽ Murine Cytomegalovirus ◽ Prediction Algorithm ◽ Sequencing Analysis ◽ Protein Coding ◽ Coding Potential ◽ Reading Frames

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

Long antiparallel open reading frames are unlikely to be encoding essential proteins in prokaryotic genomes

10.1101/724807 ◽ 2019

Author(s): Denis Moshensky ◽ Andrei Alexeevski

Keyword(s): Negative Selection ◽ Stop Codon ◽ Biological Significance ◽ Open Reading Frames ◽ Overlapping Genes ◽ Base Pairs ◽ Protein Coding ◽ Essential Proteins ◽ Prokaryotic Genomes ◽ Reading Frames

Abstract: The origin and evolution of genes that share base pairs (overlapping genes) are of particular interest because such genes influence each other. Especially intriguing are gene pairs with long overlaps. In prokaryotes, co-directional overlaps longer than 60 bp were shown to be nonexistent except for some instances. A few antiparallel prokaryotic genes with long overlaps have been described in the literature. We have analyzed putative long antiparallel overlapping genes to determine whether open reading frames (ORFs) located opposite to genes (antiparallel ORFs) can be protein-coding genes. We have confirmed that long antiparallel ORFs (AORFs) are reliably observed more frequently than expected. There are 10 472 000 AORFs with an overlap length of more than 180 bp in the 929 analyzed genomes. Stop codons on the strand opposite the coding strand are avoided in 2 898 cases at a Benjamini-Hochberg threshold of 0.01. Using Ka/Ks ratio calculations, we have found that long AORFs do not affect the type of selection acting on genes in the vast majority of cases. This observation indicates that translations of long AORFs are commonly not under negative selection. An illustrative example is the 282 AORFs longer than 1 800 bp found opposite extremely conserved dnaK genes. Translations of these AORFs were annotated as “glutamate dehydrogenases” and were included in the Pfam database as a third protein family of glutamate dehydrogenases, PF10712. Ka/Ks analysis has demonstrated that if these translations correspond to proteins, they are not subject to negative selection, while dnaK genes are under strong stabilizing selection. Moreover, we have found other arguments against the hypothesis that these AORFs encode essential proteins, that is, proteins indispensable for the cellular machinery. However, some AORFs, in particular those related to dnaK, have been found to slightly resist synonymous changes in the genes. This indicates the possibility of their translation. We speculate that translations of certain AORFs might have a functional role other than encoding essential proteins. Essential genes are unlikely to be encoded by AORFs in prokaryotic genomes. Nevertheless, some AORFs might have biological significance associated with their translations.

Author summary: Genes that share base pairs are called overlapping genes. We have examined the most intriguing case: whether gene pairs encoded on opposite DNA strands exist in prokaryotes. An overlap length threshold of 180 bp has been used. A few such pairs of genes have been experimentally confirmed. We have detected all long antiparallel ORFs in 929 prokaryotic genomes and have found that the number of open reading frames located opposite annotated genes is much larger than expected under a statistical model. We have developed a measure of stop codon avoidance on the opposite strand. The lengths of the antiparallel ORFs found with stop codon avoidance are typical of prokaryotic genes. Comparative genomics analysis shows that long antiparallel ORFs (AORFs) are unlikely to be essential protein-coding genes. We have analyzed the distributions of features typical of essential proteins among formal translations of all long AORFs: prevalence of negative selection, non-uniformity of the distribution of conserved positions in a multiple alignment of homologous proteins, and the character of the distribution of homologs in the phylogenetic tree of prokaryotes. None of these features has been observed for the majority of long AORFs. In particular, the same results have been obtained for some experimentally confirmed AOGs. Thus, pairs of antiparallel overlapping essential genes are unlikely to exist. On the other hand, some antiparallel ORFs affect the evolution of the genes opposite which they are located. Consequently, translations of some antiparallel ORFs might have as yet unknown biological significance.
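
To illustrate what a long antiparallel ORF looks like at the sequence level, the following minimal sketch scans the reverse complement of a gene for stop-free stretches longer than a length threshold. It is a simplification under stated assumptions, not the authors' procedure: the helper names are hypothetical, an ORF is taken to be any stop-free in-frame stretch (no start codon is required), and the standard stop codons are assumed. Only the 180 bp threshold comes from the abstract.

```python
# Illustrative sketch only (hypothetical helpers, simplified ORF definition).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def antiparallel_orfs(gene: str, min_overlap: int = 180) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for stop-free stretches longer than
    `min_overlap` bp on the strand opposite `gene`. Coordinates are 0-based,
    end-exclusive, and refer to the reverse-complement sequence."""
    rc = reverse_complement(gene)
    hits = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(rc) - 2, 3):
            if rc[pos:pos + 3] in STOPS:
                if pos - run_start > min_overlap:
                    hits.append((frame, run_start, pos))
                run_start = pos + 3  # the next stretch begins after the stop
        # Close the final stretch at the last complete codon of this frame.
        end = frame + ((len(rc) - frame) // 3) * 3
        if end - run_start > min_overlap:
            hits.append((frame, run_start, end))
    return hits


# Toy gene whose opposite strand carries one stop codon (TAA) in frame 0.
toy_gene = "GCC" * 30 + "TTA" + "GCC" * 70
toy_hits = antiparallel_orfs(toy_gene)
```

A genome-wide analysis would additionally need a null model for how many such stretches are expected by chance, which is the role the abstract assigns to its statistical model and Benjamini-Hochberg correction; that step is not sketched here.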

Accurate Annotation of Protein‐coding Small Open Reading Frames in the Human Genome

The FASEB Journal ◽ 10.1096/fasebj.2020.34.s1.03051 ◽ 2020 ◽ Vol 34 (S1) ◽ pp. 1-1

Author(s): Thomas F. Martinez ◽ Qian Chu ◽ Cynthia Donaldson ◽ Dan Tan ◽ Maxim N. Shokhirev ◽ ...

Keyword(s): Human Genome ◽ Open Reading Frames ◽ Protein Coding ◽ Reading Frames ◽ Small Open Reading Frames

Deep transcriptome annotation suggests that small and large proteins encoded in the same genes often cooperate

10.1101/142992 ◽ 2017 ◽ Cited By ~ 1

Author(s): Sondos Samandi ◽ Annie V. Roy ◽ Vivian Delcourt ◽ Jean-François Lucier ◽ Jules Gagnon ◽ ...

Keyword(s): Functional Relationship ◽ Mitochondrial Fission ◽ Open Reading Frames ◽ Gene Encoding ◽ Evolutionary Patterns ◽ Protein Coding ◽ Coding Sequences ◽ Large Proteins ◽ Small Proteins ◽ Reading Frames

Abstract: Recent studies in eukaryotes have demonstrated the translation of alternative open reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by altORFs. The putative alternative proteins translated from altORFs have orthologs in many species, and evolutionary patterns indicate that altORFs are particularly constrained in CDSs that evolve slowly. Thousands of predicted alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. Protein domains and co-conservation analyses suggest a potential functional relationship between small and large proteins encoded in the same genes. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many coding genes code for more than one protein and that these proteins are often functionally related.

Landscape of the Dark Transcriptome Revealed Through Re-mining Massive RNA-Seq Data

Frontiers in Genetics ◽ 10.3389/fgene.2021.722981 ◽ 2021 ◽ Vol 12

Author(s): Jing Li ◽ Urminder Singh ◽ Zebulun Arendsee ◽ Eve Syrkin Wurtele

Keyword(s): Open Reading Frames ◽ RNA-Seq ◽ Protein Coding ◽ Interactive Analysis ◽ Yeast Data ◽ Specific Proteins ◽ Species Specific ◽ Developmental Conditions ◽ Reading Frames ◽ Expression Matrix

The “dark transcriptome” can be considered the multitude of sequences that are transcribed but not annotated as genes. We evaluated expression of 6,692 annotated genes and 29,354 unannotated open reading frames (ORFs) in the Saccharomyces cerevisiae genome across diverse environmental, genetic and developmental conditions (3,457 RNA-Seq samples). Over 30% of the highly transcribed ORFs have translation evidence. Phylostratigraphic analysis infers most of these transcribed ORFs would encode species-specific proteins (“orphan-ORFs”); hundreds have mean expression comparable to annotated genes. These data reveal unannotated ORFs most likely to be protein-coding genes. We partitioned a co-expression matrix by Markov Chain Clustering; the resultant clusters contain 2,468 orphan-ORFs. We provide the aggregated RNA-Seq yeast data with extensive metadata as a project in MetaOmGraph (MOG), a tool designed for interactive analysis and visualization. This approach enables reuse of public RNA-Seq data for exploratory discovery, providing a rich context for experimentalists to make novel, experimentally testable hypotheses about candidate genes.
