Fusion transcripts and their genomic breakpoints in polyadenylated and ribosomal RNA–minus RNA sequencing data

GigaScience ◽  
2021 ◽  
Vol 10 (12) ◽  
Author(s):  
Youri Hoogstrate ◽  
Malgorzata A Komor ◽  
René Böttcher ◽  
Job van Riet ◽  
Harmen J G van de Werken ◽  
...  

Abstract Background: Fusion genes are typically identified by RNA sequencing (RNA-seq) without elucidating the causal genomic breakpoints. However, non–poly(A)-enriched RNA-seq contains large proportions of intronic reads that also span genomic breakpoints. Results: We have developed an algorithm, Dr. Disco, that searches for fusion transcripts by taking an entire reference genome into account as search space. This includes exons but also introns, intergenic regions, and sequences that do not meet splice junction motifs. Using 1,275 RNA-seq samples, we investigated to what extent genomic breakpoints can be extracted from RNA-seq data and their implications regarding poly(A)-enriched and ribosomal RNA–minus RNA-seq data. Comparison with whole-genome sequencing data revealed that most genomic breakpoints are not, or minimally, transcribed while, in contrast, the genomic breakpoints of all 32 TMPRSS2-ERG–positive tumours were present at RNA level. We also revealed tumours in which the ERG breakpoint was located before ERG, which co-existed with additional deletions and messenger RNA that incorporated intergenic cryptic exons. In breast cancer we identified rearrangement hot spots near CCND1 and in glioma near CDK4 and MDM2 and could directly associate this with increased expression. Furthermore, in all datasets we find fusions to intergenic regions, often spanning multiple cryptic exons that potentially encode neo-antigens. Thus, fusion transcripts other than classical gene-to-gene fusions are prominently present and can be identified using RNA-seq. Conclusion: By using the full potential of non–poly(A)-enriched RNA-seq data, sophisticated analysis can reliably identify expressed genomic breakpoints and their transcriptional effects.

2021 ◽  
Author(s):  
Youri Hoogstrate ◽  
Malgorzata A. Komor ◽  
René Böttcher ◽  
Job van Riet ◽  
Harmen J.G. van de Werken ◽  
...  

Spliced fusion transcripts are typically identified by RNA-seq without elucidating the causal genomic breakpoints. However, non-poly(A)-enriched RNA-seq contains large proportions of intronic reads that also span genomic breakpoints. Using 1,274 RNA-seq samples, we investigated what additional information is embedded in non-poly(A)-enriched RNA-seq data. Here, we present our novel, graph-based Dr. Disco algorithm, which uses both intronic and exonic RNA-seq reads to identify not only fusion transcripts but also genomic breakpoints, in genes as well as in intergenic regions. Dr. Disco identified TMPRSS2-ERG fusions with genomic breakpoints and other transcribed rearrangements from multiple RNA-sequencing cohorts. In breast cancer and glioma samples, Dr. Disco identified rearrangement hotspots near CCND1 and MDM2 and could directly associate these with increased expression. A comparison with matched DNA sequencing revealed that most genomic breakpoints are not, or minimally, transcribed, while also revealing highly expressed translocations missed by DNA-seq. By using the full potential of non-poly(A)-enriched RNA-seq data, Dr. Disco can reliably identify expressed genomic breakpoints and their transcriptional effects.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Li Chen ◽  
Ruirui Yang ◽  
Tony Kwan ◽  
Chao Tang ◽  
Stephen Watt ◽  
...  

Abstract Both poly(A) enrichment and ribosomal RNA depletion are commonly used for RNA sequencing. Each has advantages and disadvantages that may lead to biases in downstream analyses. To better assess these effects, we carried out both ribosomal RNA-depleted and poly(A)-selected RNA-seq for naive CD4+ T cells isolated from 40 healthy individuals from the Blueprint Project. For these 40 individuals, genomic and epigenetic data were also available. This dataset offers a unique opportunity to understand how library construction influences differential gene expression, alternative splicing and molecular QTL (quantitative trait loci) analyses for human primary cells.


2021 ◽  
Author(s):  
Yu-Sheng Chen ◽  
Shuaiyao Lu ◽  
Bing Zhang ◽  
Tingfu Du ◽  
Wen-Jie Li ◽  
...  

SARS-CoV-2, the cause of the severe COVID-19 epidemic, is a highly transmissible positive-sense single-stranded RNA virus. However, whether SARS-CoV-2 can integrate into the host genome needs thorough investigation. Here, we performed both RNA sequencing (RNA-seq) and whole genome sequencing on SARS-CoV-2-infected human and monkey cells and investigated the presence of host-virus chimeric events. Through RNA-seq, we did detect chimeric host-virus reads in the infected cells. However, further analysis using mixed libraries of infected cells and uninfected zebrafish embryos demonstrated that these reads are falsely generated during library construction. In support of this, whole genome sequencing also did not identify chimeric reads in the corresponding regions. Therefore, evidence for the integration of SARS-CoV-2 into the host genome is lacking.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Lilian J. Gehrke ◽  
Maulik Upadhyay ◽  
Kristin Heidrich ◽  
Elisabeth Kunz ◽  
Daniela Klaus-Halla ◽  
...  

Abstract Polledness in cattle is an autosomal dominant trait. Previous studies have revealed allelic heterogeneity at the polled locus, and four different variants were identified, all in intergenic regions. In this study, we report the case of a polled bull (FV-Polled1) born to horned parents, indicating a de novo origin of this polled condition. Using 50K genotyping and whole genome sequencing data, we identified on chromosome 2 an 11-bp deletion (AC_000159.1:g.52364063_52364073del; Del11) in the second exon of the ZEB2 gene as the causal mutation for this de novo polled condition. We predicted that the deletion would shorten the protein product of ZEB2 by almost 91%. Moreover, we showed that all animals carrying the Del11 mutation displayed symptoms similar to Mowat-Wilson syndrome (MWS) in humans, which is also associated with genetic variations in ZEB2. The symptoms in cattle include delayed maturity, small body stature and an abnormally shaped skull. This is the first report of a de novo dominant mutation affecting only ZEB2 and associated with a genetic absence of horns. Our results therefore demonstrate that ZEB2 plays an important role in horn ontogenesis as well as in the regulation of overall development and growth of animals.


2019 ◽  
Vol 38 (11) ◽  
pp. 1207-1222
Author(s):  
Yu-Jie Zhou ◽  
Gui-Qi Zhu ◽  
Qing-Wei Zhang ◽  
Kenneth I. Zheng ◽  
Jin-Nan Chen ◽  
...  

Author(s):  
Paul L. Auer ◽  
Rebecca W Doerge

RNA sequencing technology is providing data of unprecedented throughput, resolution, and accuracy. Although there are many different computational tools for processing these data, there are a limited number of statistical methods for analyzing them, and even fewer that acknowledge the unique nature of individual gene transcription. We introduce a simple and powerful statistical approach, based on a two-stage Poisson model, for modeling RNA sequencing data and testing for biologically important changes in gene expression. The advantages of this approach are demonstrated through simulations and real data applications.
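The abstract above does not spell out the two stages of the model, so the following is only a minimal sketch of the underlying building block: a plain Poisson likelihood-ratio test for a difference in one gene's expression rate between two groups, with library sizes as exposure. It is not the authors' two-stage procedure. The function name `poisson_lrt` and the example counts are hypothetical.

```python
import math

def poisson_lrt(counts_a, sizes_a, counts_b, sizes_b):
    """Likelihood-ratio test for equal Poisson rates in two groups.

    counts_*: read counts for one gene, one value per sample.
    sizes_*:  library-size factors (exposure) for the same samples.
    Returns (statistic, p_value) using a chi-square reference with 1 df.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    expo_a, expo_b = sum(sizes_a), sum(sizes_b)
    rate_null = (total_a + total_b) / (expo_a + expo_b)  # shared rate under H0

    def term(total, exposure):
        # Contribution total * log(rate_group / rate_null); 0 * log(0) is taken as 0.
        if total == 0:
            return 0.0
        return total * math.log((total / exposure) / rate_null)

    stat = max(0.0, 2.0 * (term(total_a, expo_a) + term(total_b, expo_b)))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts for one gene in two conditions, equal library sizes.
stat, p_value = poisson_lrt([10, 12, 11], [1.0, 1.0, 1.0], [30, 28, 35], [1.0, 1.0, 1.0])
```

A pure Poisson test like this assumes the variance equals the mean, so it tends to be too optimistic for genes whose counts vary more than that between replicates.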


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Yuxiang Tan ◽  
Yann Tambouret ◽  
Stefano Monti

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
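As an illustration of the accuracy measures mentioned above, the sketch below computes precision and recall for a fusion caller once the simulated (known) fusions are available. It is not part of SimFuse. The function name and the gene pairs are placeholders chosen only for illustration.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) of predicted fusions against known fusions."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Placeholder inputs: four simulated fusions, three calls of which two are correct.
simulated_fusions = {("TMPRSS2", "ERG"), ("BCR", "ABL1"), ("EML4", "ALK"), ("KIF5B", "RET")}
called_fusions = {("TMPRSS2", "ERG"), ("BCR", "ABL1"), ("GENE_A", "GENE_B")}
precision, recall = precision_recall(called_fusions, simulated_fusions)
```

Here two of the three calls are correct and two of the four simulated fusions are recovered, so precision is 2/3 and recall is 1/2.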


2018 ◽  
Author(s):  
Xianwen Ren ◽  
Liangtao Zheng ◽  
Zemin Zhang

Abstract Clustering is a prevalent means of analyzing single-cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large-scale single-cell RNA sequencing data, which greatly improves clustering accuracy, robustness and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory used by the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold, depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.
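To give a feel for the random projection step named above, here is a minimal sketch of a Gaussian random projection that reduces each cell's expression vector to a few dimensions before clustering. It is written in plain Python and is not the sscClust implementation (which is in R). The function name, matrix sizes and seed are assumptions made for the example.

```python
import math
import random

def random_projection(rows, k, seed=0):
    """Project each row (a list of floats) onto k random Gaussian directions."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are drawn from N(0, 1/k), so squared distances between rows
    # are preserved in expectation after projection.
    matrix = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]

# Toy expression matrix: 6 cells x 200 genes of made-up values.
toy_rng = random.Random(1)
cells = [[toy_rng.random() for _ in range(200)] for _ in range(6)]
reduced = random_projection(cells, k=20)  # 6 cells x 20 projected features
```

Any clustering algorithm can then be run on `reduced` instead of the full matrix, which is where the speed and memory savings would come from.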


2020 ◽  
Author(s):  
Sihao Xiao ◽  
Zhentian Kai ◽  
David Brown ◽  
Claire L Shovlin ◽  

Summary Whole genome sequencing (WGS) is championed by the UK National Health Service (NHS) to identify genetic variants that cause particular diseases. The full potential of WGS has yet to be realised, as early data analytic steps prioritise protein-coding genes and effectively ignore the less well annotated non-coding genome, which is rich in transcribed and critical regulatory regions. To address this, we developed a filter, which we call GROFFFY, and validated it in WGS data from hereditary haemorrhagic telangiectasia patients within the 100,000 Genomes Project. Before filter application, the mean number of DNA variants compared to human reference sequence GRCh38 was 4,867,167 (range 4,786,039-5,070,340), and one-third lay within intergenic areas. GROFFFY removed a mean of 2,812,015 variants per DNA. In combination with allele frequency and other filters, GROFFFY enabled a 99.56% reduction in variant number. The proportion of intergenic variants was maintained, and no pathogenic variants in disease genes were lost. We conclude that the filter applied to NHS diagnostic samples in the 100,000 Genomes pipeline offers an efficient method to prioritise intergenic, intronic and coding gDNA variants. Reducing the overwhelming number of variants while retaining functional genome variation of importance to patients enhances the near-term value of WGS in clinical diagnostics.


2017 ◽  
Author(s):  
Luke Zappia ◽  
Belinda Phipson ◽  
Alicia Oshlack

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
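The gamma-Poisson idea mentioned above can be sketched in a few lines: each cell's expected expression is drawn from a gamma distribution around the gene mean, and the observed count is then drawn from a Poisson distribution with that expectation. This is a simplified illustration in plain Python, not Splat's actual parameterisation or the Splatter API. The function names, gene means and dispersion value are assumptions.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small means)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_counts(gene_means, n_cells, dispersion=0.5, seed=0):
    """Simulate a genes x cells count matrix from a gamma-Poisson mixture."""
    rng = random.Random(seed)
    shape = 1.0 / dispersion
    counts = []
    for mean in gene_means:
        row = []
        for _ in range(n_cells):
            # Gamma with this shape and scale has expectation equal to `mean`.
            lam = rng.gammavariate(shape, mean / shape)
            row.append(sample_poisson(lam, rng))
        counts.append(row)
    return counts

# Three hypothetical genes with low, medium and higher mean expression.
counts = simulate_counts([0.5, 5.0, 20.0], n_cells=8)
```

Mixing the Poisson rate over a gamma distribution yields counts that are more variable than a pure Poisson model with the same mean, which is the overdispersion typically seen in count data.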

