Virtue as the mean: Pan-human consensus genome significantly improves the accuracy of RNA-seq analyses

The Human Reference Genome serves as the foundation for modern genomic analyses. However, in its present form, it does not adequately represent the vast genetic diversity of the human population. In this study, we explored the consensus genome as a potential successor of the current Reference genome, and assessed its effect on the accuracy of RNA-seq read alignment. In order to find the best haploid genome representation, we constructed consensus genomes at the Pan-human, Super-population and Population levels, utilizing variant information from the 1000 Genomes project. Using personal haploid genomes as the ground truth, we compared mapping errors for real RNA-seq reads aligned to the consensus genomes versus the Reference genome. For reads overlapping homozygous variants, we found that the mapping error decreased by a factor of ∼2-3 when the Reference was replaced with the Pan-human consensus genome. Interestingly, we also found that using more population-specific consensuses resulted in little to no increase over using the Pan-human consensus, suggesting a limit in the utility of incorporating more specific genomic variation. To assess the functional impact, we performed transcript expression quantification and found that the Pan-human consensus increases accuracy of transcript quantification for hundreds of transcripts.
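As a rough illustration of the consensus idea described in this abstract, the sketch below swaps in the alternate allele wherever its population frequency exceeds 0.5, so that every site carries the major allele. It covers single-nucleotide substitutions only. The function name `build_consensus`, the tuple layout of a variant, and the toy data are assumptions made for illustration; the abstract does not describe the authors' actual pipeline beyond its use of 1000 Genomes variant information.

```python
from typing import Iterable, Tuple

# Hypothetical variant record (not a real file format):
# (0-based position, reference base, alternate base, alternate allele frequency)
Variant = Tuple[int, str, str, float]


def build_consensus(reference: str, variants: Iterable[Variant],
                    threshold: float = 0.5) -> str:
    """Return a copy of `reference` carrying the major allele at each SNV site."""
    seq = list(reference)
    for pos, ref, alt, alt_freq in variants:
        # Sanity check against the untouched reference sequence.
        if reference[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        # Replace the reference base only when the alternate allele is the major one.
        if alt_freq > threshold:
            seq[pos] = alt
    return "".join(seq)


# Toy data, invented for this example.
toy_reference = "ACGTACGT"
toy_variants = [(1, "C", "T", 0.8), (4, "A", "G", 0.3), (7, "T", "A", 0.51)]
consensus = build_consensus(toy_reference, toy_variants)
print(consensus)
```

A population-level consensus would, under the same assumptions, use the same function with allele frequencies computed within that population instead of across all samples.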

Towards a reference genome that captures global genetic diversity

Nature Communications ◽

10.1038/s41467-020-19311-w ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Karen H. Y. Wong ◽

Walfred Ma ◽

Chun-Yu Wei ◽

Erh-Chan Yeh ◽

Wan-Jia Lin ◽

...

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Regulatory Elements ◽

Human Populations ◽

Single Individual ◽

Rna Seq ◽

Human Reference Genome ◽

Reference Sequences ◽

Genome Annotations ◽

Unmapped Reads

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.

Revealing the missing expressed genes beyond the human reference genome by RNA-Seq

BMC Genomics ◽

10.1186/1471-2164-12-590 ◽

2011 ◽

Vol 12 (1) ◽

Cited By ~ 22

Author(s):

Geng Chen ◽

Ruiyuan Li ◽

Leming Shi ◽

Junyi Qi ◽

Pengzhan Hu ◽

...

Keyword(s):

Reference Genome ◽

Rna Seq ◽

Human Reference Genome

The value of genotype-specific reference for transcriptome analyses

10.1101/2021.09.14.460213 ◽

2021 ◽

Author(s):

Wenbin Guo ◽

Max Coulter ◽

Robbie Waugh ◽

Runxuan Zhang

Keyword(s):

Alternative Splicing ◽

Reference Genome ◽

Transcriptome Assembly ◽

Specific Reference ◽

Rna Seq ◽

High Quality ◽

Common Reference ◽

Transcript Quantification ◽

Gene Level ◽

Reference Transcript

High quality transcriptome assembly using short reads from RNA-seq data still heavily relies upon reference-based approaches, of which the primary step is to align RNA-seq reads to a single reference genome of haploid sequence. However, it is increasingly apparent that while different genotypes within a species share core genes, they also contain variable numbers of specific genes that are only present in a subset of individuals. Using a common reference may thus lead to a loss of genotype-specific information in the assembled transcript dataset and the generation of erroneous, incomplete or misleading transcriptomics analysis results. With the recent development of pan-genome information in many species, it is important that we understand the limitations of single genotype references for transcriptomics analysis. In this study, we quantitatively evaluated the advantages of using genotype-specific reference genomes for transcriptome assembly and analysis using cultivated barley as a model. We mapped barley cultivar Barke RNA-seq reads to the Barke genome and to the cultivar Morex genome (common barley genome reference) to construct a genotype-specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Dataset (cRTD), respectively. We compared the two RTDs according to their transcript diversity, transcript sequence and structure similarity and the accuracy they provided for transcript quantification and differential expression analysis. Our evaluation shows that the sRTD has a significantly higher diversity of transcripts and alternative splicing events. Despite using a high-quality reference genome for assembly of the cRTD, we miss ca. 40% of the transcripts present in the sRTD, and the cRTD has only ca. 70% true assemblies. We found that the sRTD is more accurate for transcript quantification as well as differential expression and differential alternative splicing analysis. However, gene level quantification and comparative expression analysis are less affected by the source RTD, which indicates that analysing transcriptomic data at the gene level may be a reasonable compromise when a high-quality genotype-specific reference is not available.

Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples

GigaScience ◽

10.1093/gigascience/giz145 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 6

Author(s):

Hong Zheng ◽

Kevin Brennan ◽

Mikel Hernaez ◽

Olivier Gevaert

Keyword(s):

Rna Sequencing ◽

Ground Truth ◽

The Cancer Genome Atlas ◽

Rna Seq ◽

Optimal Method ◽

Lncrna Expression ◽

Transcript Quantification ◽

Gene Quantification ◽

Non Coding Rnas ◽

Expression Quantification

Abstract Background Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Conclusions Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

BMC Bioinformatics ◽

10.1186/1471-2105-12-323 ◽

2011 ◽

Vol 12 (1) ◽

Cited By ~ 7066

Author(s):

Bo Li ◽

Colin N Dewey

Keyword(s):

Reference Genome ◽

De Novo ◽

Software Tool ◽

Read Length ◽

Rna Seq ◽

Efficient Design ◽

De Novo Transcriptome ◽

Transcript Quantification ◽

Abundance Estimates ◽

User Friendly

Abstract Background RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq

10.1101/638106 ◽

2019 ◽

Cited By ~ 2

Author(s):

Paulo Rapazote-Flores ◽

Micha Bayer ◽

Linda Milne ◽

Claus-Dieter Mayer ◽

John Fuller ◽

...

Keyword(s):

Gene Expression ◽

Reference Genome ◽

Splice Junction ◽

Rna Seq ◽

Rt Pcr ◽

High Quality ◽

Transcript Quantification ◽

Reference Transcript ◽

Alternatively Spliced ◽

Comprehensive Reference

Abstract Background Time consuming computational assembly and quantification of gene expression and splicing analysis from RNA-seq data vary considerably. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high quality, comprehensive reference transcript dataset (RTD), which is rarely available in plants. Results A high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts – BaRTv1.0) has been generated. BaRTv1.0 was constructed from a range of tissues, cultivars and abiotic treatments and transcripts assembled and aligned to the barley cv. Morex reference genome (Mascher et al., 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al., 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissues. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5’ and 3’ UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2,791 differentially alternatively spliced genes and 2,768 transcripts with differential transcript usage. Conclusion A high confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.

Superenhancer–transcription factor regulatory network in malignant tumors

Open Medicine ◽

10.1515/med-2021-0326 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1564-1582

Author(s):

Yuan Liang ◽

Linlin Li ◽

Tian Xin ◽

Binru Li ◽

Dalin Zhang

Keyword(s):

Bladder Cancer ◽

Malignant Tumor ◽

Regulatory Network ◽

Reference Genome ◽

Malignant Tumors ◽

Rna Seq ◽

Human Reference Genome ◽

Bladder Cancer Cells ◽

On Chip

Abstract Objective This study aims to identify superenhancer (SE)–transcriptional factor (TF) regulatory network related to eight common malignant tumors based on ChIP-seq data modified by histone H3K27ac in the enhancer region of the SRA database. Methods H3K27ac ChIP-seq data of eight common malignant tumor samples were downloaded from the SRA database and subjected to comparison with the human reference genome hg19. TFs regulated by SEs were screened with HOMER software. Core regulatory circuitry (CRC) in malignant tumor samples was defined through CRCmapper software and validated by RNA-seq data in TCGA. The findings were substantiated in bladder cancer cell experiments. Results Different malignant tumors could be distinguished through the H3K27ac signal. After SE identification in eight common malignant tumor samples, 35 SE-regulated genes were defined as malignant tumor-specific. SE-regulated specific TFs effectively distinguished the types of malignant tumors. Finally, we obtained 60 CRC TFs, and SMAD3 exhibited a strong H3K27ac signal in eight common malignant tumor samples. In vitro experimental data verified the presence of a SE–TF regulatory network in bladder cancer, and SE–TF regulatory network enhanced the malignant phenotype of bladder cancer cells. Conclusion The SE–TF regulatory network with SMAD3 as the core TF may participate in the carcinogenesis of malignant tumors.

Faculty Opinions recommendation of Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13296969.14657090 ◽

2011 ◽

Author(s):

Steven Salzberg ◽

Michael Schatz

Keyword(s):

Reference Genome ◽

Transcriptome Assembly ◽

Full Length ◽

Rna Seq

Geometric property-based convolutional neural network for indoor object detection

International Journal of Advanced Robotic Systems ◽

10.1177/1729881421993323 ◽

2021 ◽

Vol 18 (1) ◽

pp. 172988142199332

Author(s):

Xintao Ding ◽

Boquan Li ◽

Jinbao Wang

Keyword(s):

Neural Network ◽

Object Detection ◽

Convolutional Neural Network ◽

Geometric Property ◽

Ground Truth ◽

Geometric Constraints ◽

Depth Information ◽

Training Set ◽

Object Knowledge ◽

The Mean

Indoor object detection is a very demanding and important task for robot applications. Object knowledge, such as two-dimensional (2D) shape and depth information, may be helpful for detection. In this article, we focus on region-based convolutional neural network (CNN) detector and propose a geometric property-based Faster R-CNN method (GP-Faster) for indoor object detection. GP-Faster incorporates geometric property in Faster R-CNN to improve the detection performance. In detail, we first use mesh grids that are the intersections of direct and inverse proportion functions to generate appropriate anchors for indoor objects. After the anchors are regressed to the regions of interest produced by a region proposal network (RPN-RoIs), we then use 2D geometric constraints to refine the RPN-RoIs, in which the 2D constraint of every classification is a convex hull region enclosing the width and height coordinates of the ground-truth boxes on the training set. Comparison experiments are implemented on two indoor datasets SUN2012 and NYUv2. Since the depth information is available in NYUv2, we involve depth constraints in GP-Faster and propose 3D geometric property-based Faster R-CNN (DGP-Faster) on NYUv2. The experimental results show that both GP-Faster and DGP-Faster increase the performance of the mean average precision.
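The 2D geometric constraint mentioned in this abstract can be sketched as follows: build the convex hull of the (width, height) pairs of one class's ground-truth boxes, then test whether a proposed region's (width, height) falls inside that hull. This is a minimal sketch under assumptions. The abstract does not say how GP-Faster uses the hull to refine RPN-RoIs, so simply keeping or discarding a proposal is a guess, and the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (width, height) of a bounding box


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain because it repeats the other chain's first point.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return p in hull  # degenerate hull: only exact matches count
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


# Toy (width, height) pairs for one object class, invented for this example.
train_sizes = [(2, 2), (6, 2), (6, 8), (2, 8), (4, 5)]
hull = convex_hull(train_sizes)
print(inside_hull(hull, (3, 4)), inside_hull(hull, (9, 3)))
```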

Improved PET/MRI attenuation correction in the pelvic region using a statistical decomposition method on T2-weighted images

EJNMMI Physics ◽

10.1186/s40658-020-00336-5 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Elin Wallstén ◽

Jan Axelsson ◽

Joakim Jonsson ◽

Camilla Thellenberg Karlsson ◽

Tufve Nyholm ◽

...

Keyword(s):

Attenuation Correction ◽

Research Study ◽

Standardized Uptake Value ◽

Ground Truth ◽

Whole Body ◽

Mr Images ◽

Pelvic Region ◽

Mri Scans ◽

The Mean ◽

Prostate Cancer Research

Abstract Background Attenuation correction of PET/MRI is a remaining problem for whole-body PET/MRI. The statistical decomposition algorithm (SDA) is a probabilistic atlas-based method that calculates synthetic CTs from T2-weighted MRI scans. In this study, we evaluated the application of SDA for attenuation correction of PET images in the pelvic region. Materials and method Twelve patients were retrospectively selected from an ongoing prostate cancer research study. The patients had same-day scans of [11C]acetate PET/MRI and CT. The CT images were non-rigidly registered to the PET/MRI geometry, and PET images were reconstructed with attenuation correction employing CT, SDA-generated CT, and the built-in Dixon sequence-based method of the scanner. The PET images reconstructed using CT-based attenuation correction were used as ground truth. Results The mean whole-image PET uptake error was reduced from −5.4% for Dixon-PET to −0.9% for SDA-PET. The prostate standardized uptake value (SUV) quantification error was significantly reduced from −5.6% for Dixon-PET to −2.3% for SDA-PET. Conclusion Attenuation correction with SDA improves quantification of PET/MR images in the pelvic region compared to the Dixon-based method.
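The uptake errors reported in this abstract are signed percentages relative to a ground truth. One plausible way to compute such a figure is sketched below, assuming the error is the signed relative difference between a test reconstruction and the CT-based reconstruction, averaged over patients. The abstract does not give the exact formula, and the function names and numbers here are hypothetical.

```python
from statistics import mean
from typing import Sequence


def percent_error(measured: float, truth: float) -> float:
    """Signed error of `measured` relative to `truth`, in percent."""
    return 100.0 * (measured - truth) / truth


def mean_percent_error(measured: Sequence[float], truth: Sequence[float]) -> float:
    """Average the signed percent error over paired measurements."""
    if len(measured) != len(truth):
        raise ValueError("measured and truth must have the same length")
    return mean(percent_error(m, t) for m, t in zip(measured, truth))


# Invented uptake values for three patients (not data from the study).
truth_uptake = [4.0, 5.0, 8.0]   # stands in for CT-based attenuation correction
test_uptake = [3.8, 4.9, 8.0]    # stands in for an MRI-based method
print(round(mean_percent_error(test_uptake, truth_uptake), 2))
```

A negative result means the test method underestimates uptake on average, which matches the sign convention of the values quoted in the abstract.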
