XenoCP: Cloud-based BAM cleansing tool for RNA and DNA from Xenograft

Abstract Summary: Xenografts are important models for cancer research, and the presence of mouse reads in xenograft next-generation sequencing data can potentially confound interpretation of experimental results. We present an efficient, cloud-based BAM-to-BAM cleaning tool called XenoCP to remove mouse reads from xenograft BAM files. We show application of XenoCP in obtaining accurate gene expression quantification in RNA-seq and tumor heterogeneity in WGS of xenografts derived from brain and solid tumors. Availability and Implementation: St. Jude Cloud (https://pecan.stjude.cloud/permalink/xenocp) and St. Jude GitHub (https://github.com/stjude/XenoCP).

Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

10.21203/rs.3.rs-421080/v1 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Refseq Gene ◽

Rna Seq ◽

Sequencing Data ◽

Microarray Expression Data ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Expression Quantification

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: Our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
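
The counting step described in this abstract (assign each mapped read to the gene whose annotated exons contain it) can be illustrated with a minimal Python sketch. This is not the authors' pipeline or any specific tool; the function name, the toy read positions, and the two toy annotations ("conservative" and "expanded") are assumptions made up for illustration. It only shows why the same reads can yield different gene counts under different annotations.

```python
from collections import Counter


def count_reads_per_gene(read_positions, annotation):
    """Count reads that fall inside the exons of exactly one gene.

    read_positions: list of (chrom, pos) tuples for mapped reads.
    annotation: dict mapping gene name -> list of (chrom, start, end)
                exon intervals, 1-based and inclusive.
    Both structures are hypothetical simplifications of BAM and GTF data.
    """
    counts = Counter({gene: 0 for gene in annotation})
    for chrom, pos in read_positions:
        hits = [
            gene
            for gene, exons in annotation.items()
            if any(c == chrom and s <= pos <= e for c, s, e in exons)
        ]
        # Reads overlapping no gene, or more than one gene, are not counted.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# Toy data: four mapped reads and two alternative annotations of the same locus.
reads = [("chr1", 120), ("chr1", 180), ("chr1", 450), ("chr1", 900)]
conservative = {
    "GENE_A": [("chr1", 100, 200)],
    "GENE_B": [("chr1", 800, 1000)],
}
expanded = {
    "GENE_A": [("chr1", 100, 200), ("chr1", 400, 500)],  # one extra exon
    "GENE_B": [("chr1", 800, 1000)],
}

counts_conservative = count_reads_per_gene(reads, conservative)
counts_expanded = count_reads_per_gene(reads, expanded)
print(counts_conservative)
print(counts_expanded)
```

In this toy case the read at position 450 is counted for GENE_A only under the expanded annotation, so the two annotations disagree on that gene's count even though the alignments are identical.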

Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments

Molecular Ecology ◽

10.1111/mec.12014 ◽

2012 ◽

Vol 22 (3) ◽

pp. 620-634 ◽

Cited By ~ 167

Author(s):

Nagarjun Vijay ◽

Jelmer W. Poelstra ◽

Axel Künstner ◽

Jochen B. W. Wolf

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Transcriptome Assembly ◽

Rna Seq ◽

Gene Expression Quantification ◽

Differential Gene ◽

Expression Quantification ◽

Challenges And Strategies

The effect of human genome annotation complexity on RNA-Seq gene expression quantification

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops ◽

10.1109/bibmw.2012.6470224 ◽

2012 ◽

Cited By ~ 3

Author(s):

Po-Yen Wu ◽

John H. Phan ◽

May D. Wang

Keyword(s):

Gene Expression ◽

Human Genome ◽

Genome Annotation ◽

Rna Seq ◽

Gene Expression Quantification ◽

Expression Quantification

Methods for analyzing next-generation sequencing data III. From setting a Linux environment to manipulating Lactobacillus RNA-seq data

Japanese Journal of Lactic Acid Bacteria ◽

10.4109/jslab.26.32 ◽

2015 ◽

Vol 26 (1) ◽

pp. 32-41

Author(s):

Jianqiang Sun ◽

Aya Miura ◽

Kentaro Shimizu ◽

Koji Kadota

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Impact of gene annotation choice on the quantification of RNA-seq data

10.1101/2021.01.07.425794 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Rna Seq ◽

Microarray Expression Data ◽

Refseq Annotation ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Microarray Expression ◽

Expression Quantification

RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.
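
The accuracy criterion mentioned here, correlation between RNA-seq quantification and a ground truth such as PCR measurements, can be sketched as follows. The numbers are invented purely for illustration and do not come from the SEQC data; the labels "annotation_1" and "annotation_2" are placeholders and do not stand for Ensembl or RefSeq. Pearson correlation is used as one plausible choice; the abstract does not say which correlation measure the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up log-expression values for five genes (assumption, not real data).
ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. PCR-based measurements
annotation_1 = [1.1, 2.0, 2.9, 4.2, 5.0]   # quantification under one annotation
annotation_2 = [1.5, 1.8, 3.6, 3.5, 5.4]   # quantification under another

r1 = pearson(ground_truth, annotation_1)
r2 = pearson(ground_truth, annotation_2)
print(f"annotation_1 r = {r1:.3f}")
print(f"annotation_2 r = {r2:.3f}")
```

Under this criterion, the annotation whose quantification correlates more strongly with the ground truth would be judged more accurate.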

GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

10.1101/771063 ◽

2019 ◽

Cited By ~ 2

Author(s):

Bastian Seelbinder ◽

Thomas Wolf ◽

Steffen Priebe ◽

Sylvie McNamara ◽

Silvia Gerber ◽

...

Keyword(s):

Gene Expression ◽

Single Species ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Sequencing Data ◽

Interacting Species ◽

Link Type ◽

Fastq Format ◽

Standard Tool ◽

Processing Steps

Abstract: In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified, as many steps of the data pre-processing can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, lightweight, easy to install via Conda and easy to use, but still very flexible through modular programming and many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.

The ISMARA client

F1000Research ◽

10.12688/f1000research.9794.1 ◽

2016 ◽

Vol 5 ◽

pp. 2851 ◽

Cited By ~ 1

Author(s):

Panu Artimo ◽

Séverine Duvaud ◽

Mikhail Pachkov ◽

Vassilios Ioannidis ◽

Erik van Nimwegen ◽

...

Keyword(s):

Gene Expression ◽

Client Application ◽

Rna Seq ◽

State Data ◽

Regulatory Interactions ◽

Link Type ◽

Major Bottleneck ◽

High Throughput Gene Expression ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

ISMARA (ismara.unibas.ch) automatically infers the key regulators and regulatory interactions from high-throughput gene expression or chromatin state data. However, given the large sizes of current next generation sequencing (NGS) datasets, data uploading times are a major bottleneck. Additionally, for proprietary data, users may be uncomfortable with uploading entire raw datasets to an external server. Both these problems could be alleviated by providing a means by which users could pre-process their raw data locally, transferring only a small summary file to the ISMARA server. We developed a stand-alone client application that pre-processes large input files (RNA-seq or ChIP-seq data) on the user's computer for performing ISMARA analysis in a completely automated manner, including uploading of small processed summary files to the ISMARA server. This reduces file sizes by up to a factor of 1000, and upload times from many hours to mere seconds. The client application is available from ismara.unibas.ch/ISMARA/client.

Wx: a neural network-based feature selection algorithm for next-generation sequencing data

10.1101/221911 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sungsoo Park ◽

Bonggun Shin ◽

Yoonjung Choi ◽

Kilsoo Kang ◽

Keunsoo Kang

Keyword(s):

Neural Network ◽

Gene Expression ◽

Next Generation Sequencing ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Selection Algorithm ◽

Sequencing Data ◽

Optimal Set ◽

Generation Sequencing

Abstract Motivation: Next-generation sequencing (NGS), which allows the simultaneous sequencing of billions of DNA fragments, has revolutionized how we study genomics and molecular biology by generating genome-wide molecular maps of molecules of interest. For example, an NGS-based transcriptomic assay called RNA-seq can be used to estimate the abundance of approximately 190,000 transcripts together. As the cost of next-generation sequencing sharply declines, researchers in many fields have been conducting research using NGS. The amount of information produced by NGS has made it difficult for researchers to choose the optimal set of target genes (or genomic loci). Results: We have sought to resolve this issue by developing a neural network-based feature (gene) selection algorithm called Wx. The Wx algorithm ranks genes based on the discriminative index (DI) score that represents the classification power for distinguishing given groups. With a gene list ranked by DI score, researchers can intuitively select the optimal set of genes from the highest-ranking ones. We applied the Wx algorithm to a TCGA pan-cancer gene-expression cohort to identify an optimal set of gene-expression biomarker (universal gene-expression biomarkers) candidates that can distinguish cancer samples from normal samples for 12 different types of cancer. The 14 gene-expression biomarker candidates identified by Wx were comparable to or outperformed previously reported universal gene expression biomarkers, highlighting the usefulness of the Wx algorithm for next-generation sequencing data. Thus, we anticipate that the Wx algorithm can complement current state-of-the-art analytical applications for the identification of biomarker candidates as an alternative method. Availability: https://github.com/deargen/[email protected] Supplementary information: Supplementary data are available online.

Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples

F1000Research ◽

10.12688/f1000research.10082.1 ◽

2016 ◽

Vol 5 ◽

pp. 2741 ◽

Cited By ~ 27

Author(s):

Miika J. Ahdesmäki ◽

Simon R. Gray ◽

Justin H. Johnson ◽

Zhongwu Lai

Keyword(s):

Open Source ◽

Variant Calling ◽

High Sensitivity ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Novel Approach ◽

Gene Expression Quantification ◽

Closed Source ◽

Generation Sequencing ◽

Bioinformatics Community

Grafting of cell lines and primary tumours is a crucial step in the drug development process between cell line studies and clinical trials. Disambiguate is a program for computationally separating the sequencing reads of two species derived from grafted samples. Disambiguate operates on alignments to the two species and separates the components at very high sensitivity and specificity as illustrated in artificially mixed human-mouse samples. This allows for maximum recovery of data from target tumours for more accurate variant calling and gene expression quantification. Given that no general use open source algorithm accessible to the bioinformatics community exists for the purposes of separating the two species data, the proposed Disambiguate tool presents a novel approach and improvement to performing sequence analysis of grafted samples. Both Python and C++ implementations are available and they are integrated into several open and closed source pipelines. Disambiguate is open source and is freely available at https://github.com/AstraZeneca-NGS/disambiguate.
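
The general idea of separating reads using alignments to both species can be sketched in Python as below. This is a simplified illustration, not Disambiguate's actual algorithm: the real tool works on aligner output, whereas this sketch assumes a single hypothetical per-read alignment score (higher is better) for each species, and the function and variable names are invented.

```python
def assign_species(human_scores, mouse_scores):
    """Assign each read to 'human', 'mouse', or 'ambiguous'.

    human_scores / mouse_scores: dict mapping read name -> alignment score
    against that species' reference (higher is better). A read missing from
    a dict is treated as unaligned to that species. These inputs are a
    hypothetical simplification of two BAM files.
    """
    assignment = {}
    for read in sorted(set(human_scores) | set(mouse_scores)):
        h = human_scores.get(read)
        m = mouse_scores.get(read)
        if m is None or (h is not None and h > m):
            assignment[read] = "human"      # aligned only, or better, to human
        elif h is None or m > h:
            assignment[read] = "mouse"      # aligned only, or better, to mouse
        else:
            assignment[read] = "ambiguous"  # equal scores in both species
    return assignment


# Toy scores for four reads (made-up values).
human = {"r1": 60, "r2": 30, "r3": 45}
mouse = {"r2": 58, "r3": 45, "r4": 50}

species = assign_species(human, mouse)
for read, label in species.items():
    print(read, label)
```

Keeping an explicit "ambiguous" class, rather than forcing every read into one species, lets downstream variant calling or expression quantification decide how conservative to be with reads that align equally well to both genomes.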

CoverView: a sequence quality evaluation tool for next generation sequencing data

Wellcome Open Research ◽

10.12688/wellcomeopenres.14306.1 ◽

2018 ◽

Vol 3 ◽

pp. 36 ◽

Cited By ~ 5

Author(s):

Márton Münz ◽

Shazia Mahamdallie ◽

Shawn Yost ◽

Andrew Rimmer ◽

Emma Poyastro-Pearson ◽

...

Keyword(s):

Quality Control ◽

Next Generation Sequencing ◽

Quality Evaluation ◽

Reference Sample ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Evaluation Tool ◽

Link Type ◽

Generation Sequencing

Quality assurance and quality control are essential for robust next generation sequencing (NGS). Here we present CoverView, a fast, flexible, user-friendly quality evaluation tool for NGS data. CoverView processes mapped sequencing reads and user-specified regions to report depth of coverage, base and mapping quality metrics with increasing levels of detail from a chromosome-level summary to per-base profiles. CoverView can flag regions that do not fulfil user-specified quality requirements, allowing suboptimal data to be systematically and automatically presented for review. It also provides an interactive graphical user interface (GUI) that can be opened in a web browser and allows intuitive exploration of results. We have integrated CoverView into our accredited clinical cancer predisposition gene testing laboratory that uses the TruSight Cancer Panel (TSCP). CoverView has been invaluable for optimisation and quality control of our testing pipeline, providing transparent, consistent quality metric information and automatic flagging of regions that fall below quality thresholds. We demonstrate this utility with TSCP data from the Genome in a Bottle reference sample, which CoverView analysed in 13 seconds. CoverView uses data routinely generated by NGS pipelines, reads standard input formats, and rapidly creates easy-to-parse output text (.txt) files that are customised by a simple configuration file. CoverView can therefore be easily integrated into any NGS pipeline. CoverView and detailed documentation for its use are freely available at github.com/RahmanTeamDevelopment/CoverView/releases and www.icr.ac.uk/CoverView
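
The flagging of regions that fall below user-specified quality requirements can be illustrated with a short Python sketch. It does not use CoverView's code, configuration keys, or output format; the threshold names, region labels, and per-base depth values are assumptions chosen for illustration.

```python
def flag_regions(depths_by_region, min_mean_depth, min_base_depth):
    """Return the names of regions that fail either coverage requirement.

    depths_by_region: dict mapping region name -> list of per-base depths.
    min_mean_depth:   hypothetical threshold on the region's mean depth.
    min_base_depth:   hypothetical threshold on the lowest per-base depth.
    """
    flagged = []
    for region, depths in depths_by_region.items():
        mean_depth = sum(depths) / len(depths)
        if mean_depth < min_mean_depth or min(depths) < min_base_depth:
            flagged.append(region)
    return flagged


# Toy per-base depth profiles for three regions (made-up values).
depth_profiles = {
    "region_1": [52, 48, 55, 50],  # passes both requirements
    "region_2": [40, 12, 38, 41],  # one base drops below the per-base minimum
    "region_3": [20, 22, 19, 21],  # mean depth too low
}

flagged_regions = flag_regions(depth_profiles, min_mean_depth=30, min_base_depth=15)
print(flagged_regions)
```

Checking both a mean and a per-base minimum mirrors the abstract's point about reporting metrics at several levels of detail: a region can look adequate on average while still containing an under-covered base.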
