scholarly journals BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq

2019 ◽  
Author(s):  
Paulo Rapazote-Flores ◽  
Micha Bayer ◽  
Linda Milne ◽  
Claus-Dieter Mayer ◽  
John Fuller ◽  
...  

AbstractBackgroundTime consuming computational assembly and quantification of gene expression and splicing analysis from RNA-seq data vary considerably. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high quality, comprehensive reference transcripts dataset (RTD), which are rarely available in plants.ResultsA high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts – BaRTv1.0) has been generated. BaRTv1.0, was constructed from a range of tissues, cultivars and abiotic treatments and transcripts assembled and aligned to the barley cv. Morex reference genome (Mascher et al., 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al., 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissue. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5’ and 3’ UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2,791 differentially alternatively spliced genes and 2,768 transcripts with differential transcript usage.ConclusionA high confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Paulo Rapazote-Flores ◽  
Micha Bayer ◽  
Linda Milne ◽  
Claus-Dieter Mayer ◽  
John Fuller ◽  
...  

Abstract Background The time required to analyse RNA-seq data varies considerably, due to discrete steps for computational assembly, quantification of gene expression and splicing analysis. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high quality, comprehensive reference transcripts dataset (RTD), which are rarely available in plants. Results A high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts – BaRTv1.0) has been generated. BaRTv1.0, was constructed from a range of tissues, cultivars and abiotic treatments and transcripts assembled and aligned to the barley cv. Morex reference genome (Mascher et al. Nature; 544: 427–433, 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al. Plant Physiol; 156: 20–28, 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissue. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5′ and 3′ UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2791 differentially alternatively spliced genes and 2768 transcripts with differential transcript usage. Conclusion A high confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.


2021 ◽  
Author(s):  
Wenbin Guo ◽  
Max Coulter ◽  
Robbie Waugh ◽  
Runxuan Zhang

High quality transcriptome assembly using short reads from RNA-seq data still heavily relies upon reference-based approaches, of which the primary step is to align RNA-seq reads to a single reference genome of haploid sequence. However, it is increasingly apparent that while different genotypes within a species share core genes, they also contain variable numbers of specific genes that are only present a subset of individuals. Using a common reference may thus lead to a loss of genotype-specific information in the assembled transcript dataset and the generation of erroneous, incomplete or misleading transcriptomics analysis results. With the recent development of pan-genome information in many species, it is important that we understand the limitations of single genotype references for transcriptomics analysis. In this study, we quantitively evaluated the advantages of using genotype-specific reference genomes for transcriptome assembly and analysis using cultivated barley as a model. We mapped barley cultivar Barke RNA-seq reads to the Barke genome and to the cultivar Morex genome (common barley genome reference) to construct a genotype specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Datasets (cRTD), respectively. We compared the two RTDs according to their transcript diversity, transcript sequence and structure similarity and the accuracy they provided for transcript quantification and differential expression analysis. Our evaluation shows that the sRTD has a significantly higher diversity of transcripts and alternative splicing events. Despite using a high-quality reference genome for assembly of the cRTD, we miss ca. 40% transcripts present in the sRTD and cRTD only has ca. 70% true assemblies. We found that the sRTD is more accurate for transcript quantification as well as differential expression and differential alternative splicing analysis. However, gene level quantification and comparative expression analysis are less affected by the source RTD, which indicates that analysing transcriptomic data at the gene level may be a reasonable compromise when a high-quality genotype-specific reference is not available.


2016 ◽  
Author(s):  
Runxuan Zhang ◽  
Cristiane P. G. Calixto ◽  
Yamile Marquez ◽  
Peter Venhuizen ◽  
Nikoleta A. Tzioutziou ◽  
...  

AbstractBackgroundAlternative splicing is the major post-transcriptional mechanism by which gene expression is regulated and affects a wide range of processes and responses in most eukaryotic organisms. RNA-sequencing (RNA-seq) can generate genome-wide quantification of individual transcript isoforms to identify changes in expression and alternative splicing. RNA-seq is an essential modern tool but its ability to accurately quantify transcript isoforms depends on the diversity, completeness and quality of the transcript information.ResultsWe have developed a new Reference Transcript Dataset for Arabidopsis (AtRTD2) for RNA-seq analysis containing over 82k non-redundant transcripts, whereby 74,194 transcripts originate from 27,667 protein-coding genes. A total of 13,524 protein-coding genes have at least one alternatively spliced transcript in AtRTD2 such that about 60% of the 22,453 protein-coding, intron-containing genes in Arabidopsis undergo alternative splicing. More than 600 putative U12 introns were identified in more than 2,000 transcripts. AtRTD2 was generated from transcript assemblies of ca. 8.5 billion pairs of reads from 285 RNA-seq data sets obtained from 129 RNA-seq libraries and merged along with the previous version, AtRTD, and Araport11 transcript assemblies. AtRTD2 increases the diversity of transcripts and through application of stringent filters represents the most extensive and accurate transcript collection for Arabidopsis to date. We have demonstrated a generally good correlation of alternative splicing ratios from RNA-seq data analysed by Salmon and experimental data from high resolution RT-PCR. However, we have observed inaccurate quantification of transcript isoforms for genes with multiple transcripts which have variation in the lengths of their UTRs. This variation is not effectively corrected in RNA-seq analysis programmes and will therefore impact RNA-seq analyses generally. To address this, we have tested different genome-wide modifications of AtRTD2 to improve transcript quantification and alternative splicing analysis. As a result, we release AtRTD2-QUASI specifically for use in Quantification of Alternatively Spliced Isoforms and demonstrate that it out-performs other available transcriptomes for RNA-seq analysis.ConclusionsWe have generated a new transcriptome resource for RNA-seq analyses in Arabidopsis (AtRTD2) designed to address quantification of different isoforms and alternative splicing in gene expression studies. Experimental validation of alternative splicing changes identified inaccuracies in transcript quantification due to UTR length variation. To solve this problem, we also release a modified reference transcriptome, AtRTD2-QUASI for quantification of transcript isoforms, which shows high correlation with experimental data.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Shuhua Zhan ◽  
Cortland Griswold ◽  
Lewis Lukens

Abstract Background Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated if mapping bias occurred and if chromosomal features associated with mapping bias. Zea mays presents a model species to assess these questions, given it has genotypically distinct and well-studied genetic lines. Results In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


BMC Cancer ◽  
2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Wenshuai Liu ◽  
Hanxing Tong ◽  
Chenlu Zhang ◽  
Rongyuan Zhuang ◽  
He Guo ◽  
...  

Abstract Background Treating patients with advanced sarcomas is challenging due to great histologic diversity among its subtypes. Leiomyosarcoma (LMS) and de-differentiated liposarcoma (DDLPS) are two common and aggressive subtypes of soft tissue sarcoma (STS). They differ significantly in histology and clinical behaviors. However, the molecular driving force behind the difference is unclear. Methods We collected 20 LMS and 12 DDLPS samples and performed whole exome sequencing (WES) to obtain their somatic mutation profiles. We also performed RNA-Seq to analyze the transcriptomes of 8 each of the LMS and DDLPS samples and obtained information about differential gene expression, pathway enrichment, immune cell infiltration in tumor microenvironment, and chromosomal rearrangement including gene fusions. Selected gene fusion events from the RNA-seq prediction were checked by RT-PCR in tandem with Sanger sequencing. Results We detected loss of function mutation and deletion of tumor suppressors mostly in LMS, and oncogene amplification mostly in DDLPS. A focal amplification affecting chromosome 12q13–15 region which encodes MDM2, CDK4 and HMGA2 is notable in DDLPS. Mutations in TP53, ATRX, PTEN, and RB1 are identified in LMS but not DDLPS, while mutation of HERC2 is only identified in DDLPS but not LMS. RNA-seq revealed overexpression of MDM2, CDK4 and HMGA2 in DDLPS and down-regulation of TP53 and RB1 in LMS. It also detected more fusion events in DDLPS than LMS (4.5 vs. 1, p = 0.0195), and the ones involving chromosome 12 in DDLPS stand out. RT-PCR and Sanger sequencing verified the majority of the fusion events in DDLPS but only one event in LMS selected to be tested. The tumor microenvironmental signatures are highly correlated with histologic types. DDLPS has more endothelial cells and fibroblasts content than LMS. Conclusions Our analysis revealed different recurrent genetic variations in LMS and DDLPS including simultaneous upregulation of gene expression and gene copy number amplification of MDM2 and CDK4. Up-regulation of tumor related genes is favored in DDLPS, while loss of suppressor function is favored in LMS. DDLPS harbors more frequent fusion events which can generate neoepitopes and potentially targeted by personalized immune treatment.


2017 ◽  
Vol 29 (1) ◽  
pp. 173
Author(s):  
Z. Jiang ◽  
J. Sun ◽  
S. Marjani ◽  
H. Dong ◽  
X. Zheng ◽  
...  

Appropriate reference genes for accurate normalization in RT-PCR are essential for the study of gene expression. Ideal reference genes should not only have stable expression across stages of embryo development, but also be expressed at comparable levels to the target genes. Using RNA-seq data from in vivo-produced bovine oocytes and embryos from the 2-cell to blastocyst stage (Jiang et al., 2014 BMC Genomics 15, 756), we tried to establish a catalogue of all reference genes for RT-PCR analysis. One-way ANOVA generated 4055 genes that did not differ across stages. To reduce this list, we used the entire RNA-seq data set and first removed genes with a FPKM (fragments per kilobase of transcript per million mapped reads) of <1, and then rescaled each gene’s expression values within a range of 0 to 1. We subsequently calculated the expression variance for each gene across all stages. By assuming that the calculated variances follow a Gaussian distribution and that the majority of the genes do not have a stable expression level, a gene was classified as a reference if its variance significantly deviated (P < 0.05) from these assumptions. We identified 346 potential reference genes, all of which were among the candidates from the ANOVA analysis. We arbitrarily assigned genes in this list to high (FPKM ≥ 100), medium (10 < FPKM < 100), and low expression levels (FPKM ≤ 10), and 37, 154, and 155 genes, respectively, fell into these groups. Surprisingly, none of the commonly used reference genes, such as GAPDH, PPIA, ACTB, PRL15, GUSB, and H3F2A, were identified as being stably expressed across in vivo development. This is consistent with findings of prior RT-PCR studies (Robert et al. 2002 Biol. Reprod. 67, 1465–1472; Ross et al. 2010 Cell Reprogram. 12, 709–717). The following gene ontology terms were significantly enriched for the 346 genes: cell cycle, translation, transport, chromatin, cell division, and metabolic process, indicating that the early embryos maintained constant levels of genes involved in fundamental biological functions. Finally, we performed RT-PCR to validate the RNA-seq results using different bovine in vivo-derived oocytes and embryos (n = 3/stage). We successfully validated 10 selected genes, including those in the high (CS, PGD, and ACTR3), medium (CCT5, MRPL47, COG2, CRT9, and HELLS), and low expression groups (CDC23 and TTF1). In conclusion, we recommend the use of reference genes that are expressed at comparable levels to target genes. This study offers a useful resource to aid in the appropriate selection of reference genes, which will improve the accuracy of quantitative gene expression analyses across bovine embryo pre-implantation development.


2020 ◽  
Vol 20 (2) ◽  
Author(s):  
Tyler Doughty ◽  
Eduard Kerkhoven

ABSTRACT Over the past decade, improvements in technology and methods have enabled rapid and relatively inexpensive generation of high-quality RNA-seq datasets. These datasets have been used to characterize gene expression for several yeast species and have provided systems-level insights for basic biology, biotechnology and medicine. Herein, we discuss new techniques that have emerged and existing techniques that enable analysts to extract information from multifactorial yeast RNA-seq datasets. Ultimately, this minireview seeks to inspire readers to query datasets, whether previously published or freshly obtained, with creative and diverse methods to discover and support novel hypotheses.


2016 ◽  
Author(s):  
Christopher A. Odhams ◽  
Andrea Cortini ◽  
Lingyan Chen ◽  
Amy L. Roberts ◽  
Ana Vinuela ◽  
...  

AbstractStudies attempting to functionally interpret complex-disease susceptibility loci by GWAS and eQTL integration have predominantly employed microarrays to quantify gene-expression. RNA-Seq has the potential to discover a more comprehensive set of eQTLs and illuminate the underlying molecular consequence. We examine the functional outcome of 39 variants associated with Systemic Lupus Erythematosus (SLE) through integration of GWAS and eQTL data from the TwinsUK microarray and RNA-Seq cohort in lymphoblastoid cell lines. We use conditional analysis and a Bayesian colocalisation method to provide evidence of a shared causal-variant, then compare the ability of each quantification type to detect disease relevant eQTLs and eGenes. We discovered a greater frequency of candidate-causal eQTLs using RNA-Seq, and identified novel SLE susceptibility genes that were concealed using microarrays (e.g. NADSYN1, SKP1, and TCF7). Many of these eQTLs were found to influence the expression of several genes, suggesting risk haplotypes may harbour multiple functional effects. We pinpointed eQTLs modulating expression of four non-coding RNAs; three of which were replicated in whole-blood. Novel SLE associated splicing events were identified in the T-reg restricted transcription factor, IKZF2, the autophagy-related gene WDFY4, and the redox coenzyme NADSYN1, through asQTL mapping using the Geuvadis cohort. We have significantly increased our understanding of the genetic control of gene-expression in SLE by maximising the leverage of RNA-Seq and performing integrative GWAS-eQTL analysis against gene, exon, and splice-junction quantifications. In doing so, we have identified novel SLE candidate genes and specific molecular mechanisms that will serve as the basis for targeted follow-up studies.


2021 ◽  
Vol 8 ◽  
Author(s):  
Liliana Florea ◽  
Lindsay Payer ◽  
Corina Antonescu ◽  
Guangyu Yang ◽  
Kathleen Burns

Alu exonization events functionally diversify the transcriptome, creating alternative mRNA isoforms and accounting for an estimated 5% of the alternatively spliced (skipped) exons in the human genome. We developed computational methods, implemented into a software called Alubaster, for detecting incorporation of Alu sequences in mRNA transcripts from large scale RNA-seq data sets. The approach detects Alu sequences derived from both fixed and polymorphic Alu elements, including Alu insertions missing from the reference genome. We applied our methods to 117 GTEx human frontal cortex samples to build and characterize a collection of Alu-containing mRNAs. In particular, we detected and characterized Alu exonizations occurring at 870 fixed Alu loci, of which 237 were novel, as well as hundreds of putative events involving Alu elements that are polymorphic variants or rare alleles not present in the reference genome. These methods and annotations represent a unique and valuable resource that can be used to understand the characteristics of Alu-containing mRNAs and their tissue-specific expression patterns.


Author(s):  
Bidossessi Wilfried Hounkpe ◽  
Francine Chenou ◽  
Franciele de Lima ◽  
Erich Vinicius De Paula

Abstract Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br) a web-based database which addresses some of the previously observed limitations in the identification of these genes, and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 11 281 and 507 high-quality RNA-seq samples from 52 human non-disease tissues/cells and 14 healthy tissues/cells of C57BL/6 wild type mouse, respectively. User can visualize the expression and download lists of 2158 human HK transcripts from 2176 HK genes and 3024 mouse HK transcripts from 3277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with a regulatory elements resource from Epiregio server.


Sign in / Sign up

Export Citation Format

Share Document