Reducing the structure bias of RNA-Seq reveals a large number of non-annotated non-coding RNA

Abstract The study of RNA expression is the fastest growing area of genomic research. However, despite the dramatic increase in the number of sequenced transcriptomes, we still do not have accurate estimates of the number and expression levels of non-coding RNA genes. Non-coding transcripts are often overlooked due to incomplete genome annotation. In this study, we use annotation-independent detection of RNA reads generated using a reverse transcriptase with low structure bias to identify non-coding RNA. Transcripts between 20 and 500 nucleotides were filtered and crosschecked with non-coding RNA annotations, revealing 111 non-annotated non-coding RNAs expressed in different cell lines and tissues. Inspecting the sequence and structural features of these transcripts indicated that 60% of them correspond to new snoRNA and tRNA-like genes. The identified genes exhibited features of their respective families in terms of structure, expression, conservation and response to depletion of interacting proteins. Together, our data reveal a new group of RNAs that are difficult to detect using standard gene prediction and RNA sequencing techniques, suggesting that reliance on current gene annotation and sequencing techniques distorts the perceived architecture of the human transcriptome.
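The filtering step described in the abstract (keep transcripts of 20 to 500 nucleotides, then crosscheck against known non-coding RNA annotations) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Interval` type, the function names, the inclusive size bounds, the same-strand overlap rule, and the toy coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval, 0-based half-open, on one strand."""
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def length(self) -> int:
        return self.end - self.start


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def non_annotated_candidates(transcripts, annotations, min_len=20, max_len=500):
    """Keep transcripts inside the size window that overlap no known annotation.

    The 20-500 nt window comes from the abstract; treating both bounds as
    inclusive is an assumption.
    """
    sized = [t for t in transcripts if min_len <= t.length <= max_len]
    return [t for t in sized if not any(overlaps(t, a) for a in annotations)]


# Toy data (made up for the example).
transcripts = [
    Interval("chr1", 100, 180, "+"),    # 80 nt, overlaps the annotation below
    Interval("chr1", 1000, 1120, "+"),  # 120 nt, overlaps nothing
    Interval("chr2", 50, 60, "-"),      # 10 nt, too short
    Interval("chr2", 500, 1200, "-"),   # 700 nt, too long
]
annotations = [Interval("chr1", 150, 250, "+")]

candidates = non_annotated_candidates(transcripts, annotations)
print(candidates)
```

The pairwise overlap check is quadratic and only suitable for small inputs; a real analysis would more likely use an interval index or an established interval-intersection tool.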

Identification of Long Non-Coding RNAs Involved in Porcine Fat Deposition Using Two High-Throughput Sequencing Methods

Genes ◽

10.3390/genes12091374 ◽

2021 ◽

Vol 12 (9) ◽

pp. 1374

Author(s):

Yibing Liu ◽

Ying Yu ◽

Hong Ao ◽

Fengxia Zhang ◽

Xitong Zhao ◽

...

Keyword(s):

High Throughput Sequencing ◽

Target Genes ◽

Gene Prediction ◽

Fat Deposition ◽

Acid Oxidation ◽

Rna Seq ◽

Non Coding Rna ◽

Non Coding Rnas ◽

Rate Limiting ◽

Long Non Coding Rna

Adipose is an important body tissue in pigs, and fatty traits are critical in pig production. Previous studies have shown that long non-coding RNA (lncRNA) functions in fat deposition and metabolism. In this study, we collected the adipose tissue of six Landrace pigs with contrasting backfat thickness (n_high = 3, n_low = 3), after which we performed strand-specific RNA sequencing (RNA-seq) based on pooling and biological replicate methods. Biological replicate and pooling RNA-seq revealed 1870 and 1618 lncRNAs, respectively. Using edgeR, we determined that 1512 genes and 220 lncRNAs were differentially expressed in biological replicate RNA-seq, and 2240 genes and 127 lncRNAs in pooling RNA-seq. After target gene prediction, we found that ACSL3 was cis-targeted by lncRNA TCONS_00052400 and could activate the conversion of long-chain fatty acids. In addition, the gene ACACB, cis-regulated by lncRNA TCONS_00041740, regulates the rate-limiting enzyme in fatty acid oxidation. Since these genes have necessary functions in fat metabolism, the results imply that the lncRNAs detected in our study may affect backfat deposition in swine through regulation of their target genes. Our study explored the regulation of lncRNA and their target genes in porcine backfat deposition and provided new insights for further investigation of the biological functions of lncRNA.

Machine learning reduced gene/non-coding RNA features that classify Schizophrenia patients accurately and highlight insightful gene clusters

10.1101/2020.06.08.20125906 ◽

2020 ◽

Author(s):

Yichuan Liu ◽

Hui-Qi Qu ◽

Xiao Chang ◽

Lifeng Tian ◽

Joseph Glessner ◽

...

Keyword(s):

Machine Learning ◽

Neurodevelopmental Disorder ◽

Wnt Signaling Pathway ◽

Gene Clusters ◽

Differentially Expressed ◽

Cell Junction ◽

Mrna Levels ◽

Rna Seq ◽

Non Coding Rna ◽

Non Coding Rnas

Abstract Schizophrenia (SCZ) is a chronic and severely disabling neurodevelopmental disorder that affects people worldwide. RNA-seq has been a powerful method to detect the differentially expressed genes/non-coding RNAs in patients; however, due to overfitting problems, differentially expressed targets (DETs) cannot be used properly as biomarkers. In this study, dorsolateral prefrontal cortex (dlpfc) RNA-seq data from 254 individuals were obtained from the CommonMind consortium and analyzed with machine learning methods, including random forest, forward feature selection (ffs), and factor analysis, to reduce the number of gene/non-coding RNA feature vectors, overcome the overfitting problem, and explore the functional clusters involved. In 2-fold shuffle testing, the average predictive accuracy for SCZ patients was 67% based on coding genes, and 96% based on long non-coding RNAs (lncRNAs). Coding genes were further clustered into 14 factors and lncRNAs were clustered into 45 factors to represent the underlying features. The largest contribution factor for coding genes contains a number of genes critical in neurodevelopment and previously reported in relation to various brain disorders. Genomic loci of lncRNAs were more insightful, enriched for genes critical in synapse function (p=7.3E-3), cell junction (p=0.017), neuron differentiation (p=8.3E-3), phosphorylation (p=8.2E-4), and involving the Wnt signaling pathway (p=0.029). Taken together, machine learning is a powerful approach to reduce functional biomarkers in SCZ patients. The lncRNAs capture the characteristics of SCZ tissue more accurately than mRNA, as the former regulate every level of gene expression, not limited to mRNA levels.
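As a rough illustration of what "2-fold shuffle testing" can look like, the Python sketch below repeatedly shuffles the samples, splits them in half, trains on one half, tests on the other, swaps the halves, and averages the held-out accuracy. The abstract does not specify the exact procedure, so the number of repeats, the unstratified split, the function names, and the toy 1-nearest-neighbour classifier (standing in for the random forest used in the study) are all assumptions.

```python
import random


def two_fold_shuffle_accuracy(samples, labels, fit, predict, n_repeats=10, seed=0):
    """Average held-out accuracy over repeated random 50/50 splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        first, second = indices[:half], indices[half:]
        # Each half serves once as the training set and once as the test set.
        for train, test in ((first, second), (second, first)):
            model = fit([samples[i] for i in train], [labels[i] for i in train])
            hits = sum(predict(model, samples[i]) == labels[i] for i in test)
            scores.append(hits / len(test))
    return sum(scores) / len(scores)


def fit_1nn(xs, ys):
    """Hypothetical stand-in classifier: memorise the training pairs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Return the label of the closest training value."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Toy one-feature data (made up): class 0 has low values, class 1 high values.
samples = [0.1, 0.3, 0.2, 0.4, 0.5, 2.1, 2.4, 2.2, 2.6, 2.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = two_fold_shuffle_accuracy(samples, labels, fit_1nn, predict_1nn, n_repeats=5)
print(f"mean held-out accuracy: {accuracy:.2f}")
```

Because the test half is never seen during fitting, the averaged score estimates how well a reduced feature set generalises, which is the overfitting concern the abstract raises.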

The role of non-coding RNA/microRNAs in cardiac disease

10.1093/med/9780198757269.003.0031 ◽

2018 ◽

Author(s):

Yolan J. Reckman ◽

Yigal M. Pinto

Keyword(s):

Myocardial Infarction ◽

Cardiac Hypertrophy ◽

Cardiac Disease ◽

Cardiac Development ◽

Rna Transcripts ◽

The Past ◽

Non Coding Rna ◽

Organ Systems ◽

Non Coding Rnas

In the past two decades, our knowledge about non-coding DNA has increased tremendously. While non-coding DNA was initially discarded as ‘junk DNA’, we are now aware of the important and often crucial roles of RNA transcripts that do not translate into protein. Non-coding RNAs (ncRNAs) play important functions in normal cellular homeostasis and also in many diseases across all organ systems. Among the different ncRNAs, microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) have been studied the most. In this chapter we discuss the role of miRNAs and lncRNAs in cardiac disease. We present examples of miRNAs with fundamental roles in cardiac development (miR-1), hypertrophy (myomiRs, miR-199, miR-1/133), fibrosis (miR-29, miR-21), myocardial infarction (miR-15, miR17~92), and arrhythmias/conduction (miR-1). We provide examples of lncRNAs related to cardiac hypertrophy (MHRT, CHRF), myocardial infarction (ANRIL, MIAT), and arrhythmias (KCNQ1OT1). We also discuss miRNAs and lncRNAs as potential therapeutic targets or biomarkers in cardiac disease.

Emerging Roles of Estrogen-Regulated Enhancer and Long Non-Coding RNAs

International Journal of Molecular Sciences ◽

10.3390/ijms21103711 ◽

2020 ◽

Vol 21 (10) ◽

pp. 3711

Author(s):

Melina J. Sedano ◽

Alana L. Harrison ◽

Mina Zilaie ◽

Chandrima Das ◽

Ramesh Choudhari ◽

...

Keyword(s):

Rna Sequencing ◽

Expression Patterns ◽

Biological Significance ◽

Rna Seq ◽

Biological Functions ◽

Protein Coding ◽

Rna Molecules ◽

Non Coding Rna ◽

Genome Wide ◽

Non Coding Rnas

Genome-wide RNA sequencing has shown that only a small fraction of the human genome is transcribed into protein-coding mRNAs. While once thought to be “junk” DNA, recent findings indicate that the rest of the genome encodes many types of non-coding RNA molecules with a myriad of functions still being determined. Among the non-coding RNAs, long non-coding RNAs (lncRNA) and enhancer RNAs (eRNA) are found to be most copious. While their exact biological functions and mechanisms of action are currently unknown, technologies such as next-generation RNA sequencing (RNA-seq) and global nuclear run-on sequencing (GRO-seq) have begun deciphering their expression patterns and biological significance. In addition to their identification, it has been shown that the expression of long non-coding RNAs and enhancer RNAs can vary due to spatial, temporal, developmental, or hormonal variations. In this review, we explore newly reported information on estrogen-regulated eRNAs and lncRNAs and their associated biological functions to help outline their markedly prominent roles in estrogen-dependent signaling.

Pan-cancer systematic identification of lncRNAs associated with cancer prognosis

PeerJ ◽

10.7717/peerj.8797 ◽

2020 ◽

Vol 8 ◽

pp. e8797 ◽

Cited By ~ 1

Author(s):

Matthew Ung ◽

Evelien Schaafsma ◽

Daniel Mattox ◽

George L. Wang ◽

Chao Cheng

Keyword(s):

Drug Targets ◽

Patient Survival ◽

Cancer Prognosis ◽

Rna Seq ◽

Non Coding Rna ◽

Cancer Types ◽

Non Coding Rnas ◽

Microarray Studies ◽

Systematic Identification ◽

Pan Cancer

Background The “dark matter” of the genome harbors several non-coding RNA species, including long non-coding RNAs (lncRNAs), which have been implicated in neoplasia but remain understudied. RNA-seq has provided deep insights into the nature of lncRNAs in cancer, but current RNA-seq data are rarely accompanied by longitudinal patient survival information. In contrast, a plethora of microarray studies have collected these clinical metadata, which can be leveraged to identify novel associations between gene expression and clinical phenotypes. Methods In this study, we developed an analysis framework that computationally integrates RNA-seq and microarray data to systematically screen 9,463 lncRNAs for association with mortality risk across 20 cancer types. Results In total, we identified a comprehensive list of associations between lncRNAs and patient survival and demonstrate that these prognostic lncRNAs are under selective pressure and may be functional. Our results provide valuable insights that facilitate further exploration of lncRNAs and their potential as cancer biomarkers and drug targets.

Evolutionary Patterns of Non-Coding RNA in Cardiovascular Biology

Non-Coding RNA ◽

10.3390/ncrna5010015 ◽

2019 ◽

Vol 5 (1) ◽

pp. 15 ◽

Cited By ~ 5

Author(s):

Shrey Gandhi ◽

Frank Ruehle ◽

Monika Stoll

Keyword(s):

Vascular System ◽

Functional Characterization ◽

Evolutionary Patterns ◽

Protein Coding ◽

Rna Transcripts ◽

Non Coding Rna ◽

High Prevalence ◽

Cardiovascular Biology ◽

Non Coding Rnas

Cardiovascular diseases (CVDs) affect the heart and the vascular system with a high prevalence and place a huge burden on society as well as the healthcare system. These complex diseases are often the result of multiple genetic and environmental risk factors and pose a great challenge to understanding their etiology and consequences. With the advent of next generation sequencing, many non-coding RNA transcripts, especially long non-coding RNAs (lncRNAs), have been linked to the pathogenesis of CVD. Despite increasing evidence, the proper functional characterization of most of these molecules is still lacking. The exploration of conservation of sequences across related species has been used to functionally annotate protein coding genes. In contrast, the rapid evolutionary turnover and weak sequence conservation of lncRNAs make it difficult to characterize functional homologs for these sequences. Recent studies have tried to explore other dimensions of interspecies conservation to elucidate the functional role of these novel transcripts. In this review, we summarize various methodologies adopted to explore the evolutionary conservation of cardiovascular non-coding RNAs at sequence, secondary structure, syntenic, and expression level.

Long Non-Coding RNA in the Pathogenesis of Cancers

Cells ◽

10.3390/cells8091015 ◽

2019 ◽

Vol 8 (9) ◽

pp. 1015 ◽

Cited By ~ 72

Author(s):

Chi ◽

Wang ◽

Yu ◽

Yang

Keyword(s):

Early Stage ◽

Gene Expressions ◽

Rna Transcripts ◽

The Past ◽

Non Coding Rna ◽

Incidence And Mortality ◽

Non Coding Rnas ◽

Tumor Growth And Metastasis ◽

New Biomarkers ◽

Long Non Coding Rna

The incidence and mortality rate of cancer has been quickly increasing in the past decades. At present, cancer has become the leading cause of death worldwide. Most of the cancers cannot be effectively diagnosed at the early stage. Although there are multiple therapeutic treatments, including surgery, radiotherapy, chemotherapy, and targeted drugs, their effectiveness is still limited. The overall survival rate of malignant cancers is still low. It is necessary to further study the mechanisms for malignant cancers, and explore new biomarkers and targets that are more sensitive and effective for early diagnosis, treatment, and prognosis of cancers than traditional biomarkers and methods. Long non-coding RNAs (lncRNAs) are a class of RNA transcripts with a length greater than 200 nucleotides. Generally, lncRNAs are not capable of encoding proteins or peptides. LncRNAs exert diverse biological functions by regulating gene expressions and functions at transcriptional, translational, and post-translational levels. In the past decade, it has been demonstrated that the dysregulated lncRNA profile is widely involved in the pathogenesis of many diseases, including cancer, metabolic disorders, and cardiovascular diseases. In particular, lncRNAs have been revealed to play an important role in tumor growth and metastasis. Many lncRNAs have been shown to be potential biomarkers and targets for the diagnosis and treatment of cancers. This review aims to briefly discuss the latest findings regarding the roles and mechanisms of some important lncRNAs in the pathogenesis of certain malignant cancers, including lung, breast, liver, and colorectal cancers, as well as hematological malignancies and neuroblastoma.

The basic mechanism of gene transcription

Epigenetics, Nuclear Organization & Gene Function ◽

10.1093/oso/9780198831204.003.0003 ◽

2019 ◽

pp. 17-32

Author(s):

John C. Lucchesi

Keyword(s):

Rna Polymerase Ii ◽

Basic Mechanism ◽

Mediator Complex ◽

Initiation Complex ◽

Rna Transcripts ◽

Transcription Process ◽

Dna Strands ◽

Non Coding Rnas ◽

A Chain ◽

Rna Genes

Transcription is initiated by factors that interact with RNA polymerases and recruit them to specific sites, unwind the DNA molecules and allow the synthesis of RNA transcripts complementary to one of the single DNA strands. RNA polymerase II (RNAPII) transcribes genes that encode proteins and some non-coding RNAs; RNAPI transcribes ribosomal RNA genes; RNAPIII transcribes genes that encode tRNAs and other non-coding RNAs. The transcription process starts with a pre-initiation complex (PIC), its activation and promoter clearance. Activation involves chromatin looping, usually promoted by the large multiprotein Mediator complex. RNAPII often makes a promoter-proximal pause, then resumes productive elongation of the transcript. Transition through the different phases of transcription is orchestrated by the phosphorylation of the main subunit of RNAPII. The 5′ end of many transcripts is protected by a methylated guanosine “cap,” and the 3′ end by the addition of a chain of adenosine monophosphates (polyadenylation). Many transcripts undergo splicing to remove regions that interrupt the coding sequence.

Genome-wide analysis of long non-coding RNA expression and function in colorectal cancer

Tumor Biology ◽

10.1177/1010428317703650 ◽

2017 ◽

Vol 39 (5) ◽

pp. 101042831770365 ◽

Cited By ~ 10

Author(s):

Fangyuan Jing ◽

Huicheng Jin ◽

Yingying Mao ◽

Yingjun Li ◽

Ye Ding ◽

...

Keyword(s):

Colorectal Cancer ◽

Expression Profile ◽

Colorectal Carcinogenesis ◽

Rna Expression ◽

Rna Transcripts ◽

Non Coding Rna ◽

Tumor Tissues ◽

Non Coding Rnas ◽

Long Non Coding Rna

Long non-coding RNAs (lncRNAs) are widely transcribed in the genome, but their expression profile and roles in colorectal cancer are not well understood. The aim of this study was to investigate the long non-coding RNA expression profile in colorectal cancer and look for potential diagnostic biomarkers of colorectal cancer. Long non-coding RNA microarray was applied to investigate the global long non-coding RNA expression profile in colorectal cancer. Gene ontology and Kyoto Encyclopedia of Genes and Genomes pathway analyses were performed using standard enrichment computational methods. The expression levels of selected long non-coding RNAs were validated by quantitative reverse transcription polymerase chain reaction. The relationship between long non-coding RNA expression levels and clinicopathological characteristics of colorectal cancer patients was assessed. Coexpression analyses were carried out to find the coexpressed genes of differentially expressed long non-coding RNAs, followed by gene ontology analysis to predict the possible role of the selected long non-coding RNAs in colorectal carcinogenesis. In this study, a total of 1596 long non-coding RNA transcripts and 1866 messenger RNA transcripts were dysregulated in tumor tissues compared with paired normal tissues. The top upregulated long non-coding RNAs in tumor tissues were CCAT1, UCA1, RP5-881L22.5, NOS2P3, and BC005081 and the top downregulated long non-coding RNAs were AK055386, AC078941.1, RP4-800J21.3, RP11-628E19.3, and RP11-384P7.7. Long non-coding RNA UCA1 was significantly upregulated in colon cancer, and AK055386 was significantly downregulated in tumor with dimension <5 cm. Functional prediction analyses showed that both the long non-coding RNAs coexpress with cell cycle related messenger RNAs. The current long non-coding RNA study provided novel insights into expression profile in colorectal cancer and predicted the potential roles of long non-coding RNAs in colorectal carcinogenesis. Among the dysregulated long non-coding RNAs, UCA1 was found to be associated with anatomic site, and AK055386 was found associated with tumor size. Further functional investigations into the molecular mechanisms are warranted to clarify the role of long non-coding RNA in colorectal carcinogenesis.

FINDER: An automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences

10.1101/2021.02.04.429837 ◽

2021 ◽

Author(s):

Sagnik Banerjee ◽

Priyanka Bhandary ◽

Margaret Woodhouse ◽

Taner Z. Sen ◽

Roger P. Wise ◽

...

Keyword(s):

Gene Annotation ◽

Gene Prediction ◽

Active Regions ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Eukaryotic Genes ◽

Associated Proteins ◽

Gene Structures ◽

Automated Software

Abstract Background Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative. Results We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species. Conclusions FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision – ideal for bench researchers with limited experience in handling computational tools.
