Basal Contamination of Sequencing: Lessons from the GTEx dataset

2019 ◽  
Author(s):  
Tim O. Nieuwenhuis ◽  
Stephanie Yang ◽  
Rohan X. Verma ◽  
Vamsee Pillalamarri ◽  
Dan E. Arking ◽  
...  

Abstract: One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p=9.5e-237 and p=5e-260 respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as ≥10 PRSS1 TPM). It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
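The screening idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline (they report a linear model); it only flags non-pancreas samples at or above the 10 PRSS1 TPM level quoted above and notes whether a pancreas sample shares the sequencing date. The record fields (`id`, `tissue`, `seq_date`, `prss1_tpm`), the function name, and the sample values are all hypothetical.

```python
PRSS1_TPM_THRESHOLD = 10.0  # low-level contamination cutoff quoted in the abstract


def flag_contaminated(samples, native_tissue="Pancreas",
                      threshold=PRSS1_TPM_THRESHOLD):
    """Map each flagged non-native sample ID to True/False:
    True when a native-tissue sample was sequenced on the same date."""
    # Dates on which the natively expressing tissue was sequenced.
    native_dates = {s["seq_date"] for s in samples
                    if s["tissue"] == native_tissue}
    flagged = {}
    for s in samples:
        if s["tissue"] == native_tissue:
            continue  # native expression is not contamination
        if s["prss1_tpm"] >= threshold:
            flagged[s["id"]] = s["seq_date"] in native_dates
    return flagged


# Hypothetical records, invented purely for illustration.
samples = [
    {"id": "S1", "tissue": "Pancreas", "seq_date": "2012-03-01", "prss1_tpm": 25000.0},
    {"id": "S2", "tissue": "Lung", "seq_date": "2012-03-01", "prss1_tpm": 42.0},
    {"id": "S3", "tissue": "Lung", "seq_date": "2012-03-05", "prss1_tpm": 0.3},
    {"id": "S4", "tissue": "Heart", "seq_date": "2012-03-07", "prss1_tpm": 11.0},
]
flagged = flag_contaminated(samples)
```

Here `S2` is flagged and shares a sequencing date with a pancreas sample, while `S4` is flagged without one, so a same-day check alone would not explain every flagged sample.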

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2851 ◽  
Author(s):  
Panu Artimo ◽  
Séverine Duvaud ◽  
Mikhail Pachkov ◽  
Vassilios Ioannidis ◽  
Erik van Nimwegen ◽  
...  

ISMARA (ismara.unibas.ch) automatically infers the key regulators and regulatory interactions from high-throughput gene expression or chromatin state data. However, given the large sizes of current next generation sequencing (NGS) datasets, data uploading times are a major bottleneck. Additionally, for proprietary data, users may be uncomfortable with uploading entire raw datasets to an external server. Both these problems could be alleviated by providing a means by which users could pre-process their raw data locally, transferring only a small summary file to the ISMARA server. We developed a stand-alone client application that pre-processes large input files (RNA-seq or ChIP-seq data) on the user's computer for performing ISMARA analysis in a completely automated manner, including uploading of small processed summary files to the ISMARA server. This reduces file sizes by up to a factor of 1000, and upload times from many hours to mere seconds. The client application is available from ismara.unibas.ch/ISMARA/client.


2017 ◽  
Vol 29 (1) ◽  
pp. 173
Author(s):  
Z. Jiang ◽  
J. Sun ◽  
S. Marjani ◽  
H. Dong ◽  
X. Zheng ◽  
...  

Appropriate reference genes for accurate normalization in RT-PCR are essential for the study of gene expression. Ideal reference genes should not only have stable expression across stages of embryo development, but also be expressed at comparable levels to the target genes. Using RNA-seq data from in vivo-produced bovine oocytes and embryos from the 2-cell to blastocyst stage (Jiang et al., 2014 BMC Genomics 15, 756), we tried to establish a catalogue of all reference genes for RT-PCR analysis. One-way ANOVA generated 4055 genes that did not differ across stages. To reduce this list, we used the entire RNA-seq data set and first removed genes with a FPKM (fragments per kilobase of transcript per million mapped reads) of <1, and then rescaled each gene’s expression values within a range of 0 to 1. We subsequently calculated the expression variance for each gene across all stages. By assuming that the calculated variances follow a Gaussian distribution and that the majority of the genes do not have a stable expression level, a gene was classified as a reference if its variance significantly deviated (P < 0.05) from these assumptions. We identified 346 potential reference genes, all of which were among the candidates from the ANOVA analysis. We arbitrarily assigned genes in this list to high (FPKM ≥ 100), medium (10 < FPKM < 100), and low expression levels (FPKM ≤ 10), and 37, 154, and 155 genes, respectively, fell into these groups. Surprisingly, none of the commonly used reference genes, such as GAPDH, PPIA, ACTB, PRL15, GUSB, and H3F2A, were identified as being stably expressed across in vivo development. This is consistent with findings of prior RT-PCR studies (Robert et al. 2002 Biol. Reprod. 67, 1465–1472; Ross et al. 2010 Cell Reprogram. 12, 709–717). 
The following gene ontology terms were significantly enriched for the 346 genes: cell cycle, translation, transport, chromatin, cell division, and metabolic process, indicating that the early embryos maintained constant levels of genes involved in fundamental biological functions. Finally, we performed RT-PCR to validate the RNA-seq results using different bovine in vivo-derived oocytes and embryos (n = 3/stage). We successfully validated 10 selected genes, including those in the high (CS, PGD, and ACTR3), medium (CCT5, MRPL47, COG2, CRT9, and HELLS), and low expression groups (CDC23 and TTF1). In conclusion, we recommend the use of reference genes that are expressed at comparable levels to target genes. This study offers a useful resource to aid in the appropriate selection of reference genes, which will improve the accuracy of quantitative gene expression analyses across bovine embryo pre-implantation development.
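The variance-based filtering steps described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code, and it makes several assumptions the abstract does not settle: a gene is dropped when its peak FPKM is below 1, rescaling divides each gene by its own maximum, and "significantly deviated" is taken as a one-sided lower-tail test against a Gaussian fitted to all variances. The function name and the toy FPKM values are hypothetical.

```python
from statistics import NormalDist, mean, stdev, variance


def select_reference_genes(fpkm_by_gene, min_fpkm=1.0, alpha=0.05):
    """Return gene names whose across-stage variance is unusually low."""
    # Step 1: drop lowly expressed genes (assumption: judged by peak FPKM).
    kept = {g: v for g, v in fpkm_by_gene.items() if max(v) >= min_fpkm}
    # Step 2: rescale each gene into the 0-1 range
    # (assumption: divide by the gene's own maximum).
    scaled = {g: [x / max(v) for x in v] for g, v in kept.items()}
    # Step 3: expression variance of each gene across stages.
    variances = {g: variance(v) for g, v in scaled.items()}
    # Step 4: fit a Gaussian to all variances and keep genes in its lower tail.
    values = list(variances.values())
    null = NormalDist(mean(values), stdev(values))
    return sorted(g for g, var in variances.items() if null.cdf(var) < alpha)


# Toy FPKM values for five stages; all numbers are invented.
fpkm = {
    "G1": [10, 100, 10, 100, 10],
    "G2": [5, 50, 200, 20, 5],
    "G3": [300, 30, 3, 30, 300],
    "G4": [2, 40, 2, 40, 2],
    "G5": [80, 8, 80, 8, 80],
    "G6": [15, 150, 30, 150, 15],
    "LOW": [0.2, 0.5, 0.1, 0.3, 0.2],
    "STABLE": [50, 52, 49, 51, 50],
}
reference_genes = select_reference_genes(fpkm)
```

With only a handful of genes the fitted Gaussian is crude; the sketch is meant to show the order of the steps, not to reproduce the reported list of 346 genes.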


Blood ◽  
2013 ◽  
Vol 122 (21) ◽  
pp. 1199-1199 ◽  
Author(s):  
Brian Liddicoat ◽  
Robert Piskol ◽  
Alistair Chalk ◽  
Miyoko Higuchi ◽  
Peter Seeburg ◽  
...  

Abstract The role of RNA and its regulation is becoming increasingly appreciated as a vital component of hematopoietic development. RNA editing by members of the Adenosine Deaminase Acting on RNA (ADAR) gene family is a form of post-transcriptional modification which converts genomically encoded adenosine to inosine (A-to-I) in double-stranded RNA. A-to-I editing by ADAR directly converts the sequence of the RNA substrate and can alter the structure, function, processing, and localization of the targeted RNA. ADAR1 is ubiquitously expressed and we have previously described essential roles in the development of hematopoietic and hepatic organs. Germline ablation of murine ADAR1 results in a significant upregulation of interferon (IFN) stimulated genes and embryonic death between E11.5 and E12.5 associated with fetal liver disintegration and failed hemopoiesis. To determine the biological importance of A-to-I editing by ADAR1, we generated an editing dead knock-in allele of ADAR1 (ADAR1E861A). Mice homozygous for the ADAR1E861A allele died in utero at ∼E13.5. The fetal liver (FL) was small and had significantly lower cellularity than in controls. Analysis of hemopoiesis demonstrated increased apoptosis and a loss of hematopoietic stem cells (HSC) and all mature lineages. Most notably erythropoiesis was severely impaired with ∼7-fold reduction across all erythrocyte progenitor populations compared to controls. These data are consistent with our previous findings that ADAR1 is essential for erythropoiesis (unpublished data) and suggest that the ADAR1E861A allele phenocopies the null allele in utero. To assess the requirement of A-to-I editing in adult hematopoiesis, we generated mice where we could somatically delete the wild-type ADAR1 allele and leave only ADAR1E861A expressed in HSCs (hScl-CreERAdar1fl/E861A). In comparison to hScl-CreERAdar1fl/+ controls, hScl-CreERAdar1fl/E861A mice were anemic and had severe leukopenia 20 days post tamoxifen treatment. 
Investigation of marrow hemopoiesis revealed a significant loss of all cells committed to the erythroid lineage in hScl-CreERAdar1fl/E861A mice, despite having elevated phenotypic HSCs. Upon withdrawal of tamoxifen diet, all blood parameters were restored to control levels within 12 weeks owing to strong selection against cells expressing only the ADAR1E861A allele. To understand the mechanism through which ADAR1 mediated A-to-I editing regulates hematopoiesis, RNA-seq was performed. Gene expression profiles showed that a loss of ADAR1 mediated A-to-I editing resulted in a significant upregulation of IFN signatures, consistent with the gene expression changes in ADAR1 null mice. To define substrates of ADAR1 we assessed A-to-I mismatches in the RNA-seq data sets. 3,560 previously known and 353 novel A-to-I editing sites were identified in our data set. However, no single editing substrate discovered could account for the IFN signature observed or the lethality of ADAR1E861A/E861A mice. These results demonstrate that ADAR1 mediated A-to-I editing is essential for the maintenance of both fetal and adult hemopoiesis in a cell-autonomous manner and a key suppressor of the IFN response in hematopoiesis. Furthermore the ADAR1E861A allele demonstrates the essential role of ADAR1 in vivo is A-to-I editing. Disclosures: Hartner: TaconicArtemis: Employment.


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e13032-e13032 ◽  
Author(s):  
Anton Buzdin ◽  
Andrew Garazha ◽  
Maxim Sorokin ◽  
Alex Glusker ◽  
Alexey Aleshin ◽  
...  

e13032 Background: Intracellular molecular pathways (IMPs) control all major events in the living cell. They are considered hotspots in contemporary oncology because knowledge of IMP activation is essential for understanding the mechanisms of molecular pathogenesis. Profiling IMPs requires RNA-seq data for tumors and for a collection of reference normal tissues. However, there is currently a shortage of such profiles for normal tissues from healthy human donors, uniformly profiled in a single series of experiments. GTEx, the largest dataset of normal profiles, is only partly accessible through dbGaP. In the TCGA database, normal samples are adjacent to surgically removed tumors and may be affected by tumor-linked growth factors, inflammation and altered vascularization. ENCODE datasets were obtained from autopsies of normal tissues, but they cannot form statistically significant reference groups. Methods: Tissue samples representing 20 organs were taken from healthy human donors killed in road accidents, no later than 36 hours after death; blood samples were taken from healthy volunteers. Gene expression was profiled in RNA-seq experiments using the same reagents, equipment and protocols. Bioinformatic algorithms for IMP analysis were developed and validated using experimental and public gene expression datasets. Results: From original sequencing data we constructed the biggest fully open reference expression database of normal human tissues, comprising 465 profiles and termed the Oncobox Atlas of Normal Tissue Expression (ANTE, original data: GSE120795). We next developed a method termed Oncobox for interrogating activation of IMPs in human cancers. It includes modules for expression data harmonization and comparison and an algorithm for automatic annotation of molecular pathways. The Oncobox system enables accurate scoring of thousands of molecular pathways using RNA-seq data. 
Oncobox pathway analysis is also applicable to quantitative proteomics and microRNA data in oncology. Conclusions: The Oncobox system can be used for a plethora of applications in cancer research, including finding differentially regulated genes and IMPs and discovering new pathway-related diagnostic and prognostic biomarkers.


2010 ◽  
Vol 2010 ◽  
pp. 1-19 ◽  
Author(s):  
Valerio Costa ◽  
Claudia Angelini ◽  
Italia De Feis ◽  
Alfredo Ciccodicola

In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to sequence hundreds of thousands of DNA fragments simultaneously, has dramatically changed the landscape of genetic studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions, and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing, and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biological issues. None of these attributes is readily achievable with the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of data produced by NGS platforms bring clear advantages as well as new challenges and issues. While this technology has great power to enable new biological observations and discoveries, it also requires considerable effort in developing new bioinformatics tools to deal with these massive data files. This paper surveys the RNA-Seq methodology, focusing particularly on the challenges that this application presents from both a biological and a bioinformatics point of view.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3091 ◽  
Author(s):  
Anna V. Klepikova ◽  
Artem S. Kasianov ◽  
Mikhail S. Chesnokov ◽  
Natalia L. Lazarevich ◽  
Aleksey A. Penin ◽  
...  

Background: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results: To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in samtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. Conclusion: The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout.
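A toy comparison may help show why deduplication without molecular indices hits highly expressed genes hardest. The sketch below is a simplified illustration, not the authors' scripts and not the samtools algorithm: reads are reduced to hypothetical `(gene, start_position, molecular_index)` tuples, position-only deduplication keeps one read per gene and start position, and index-aware deduplication keeps one read per gene, start position, and index.

```python
from collections import Counter


def count_per_gene(reads, mode="none"):
    """Count reads per gene under one of three deduplication modes:
    'none', 'position' (same start = duplicate), or
    'umi' (same start AND same molecular index = duplicate)."""
    seen = set()
    counts = Counter()
    for gene, pos, umi in reads:
        if mode == "position":
            key = (gene, pos)
        elif mode == "umi":
            key = (gene, pos, umi)
        else:
            key = None  # no deduplication: every read counts
        if key is not None:
            if key in seen:
                continue  # treated as a duplicate, skip it
            seen.add(key)
        counts[gene] += 1
    return counts


# Invented reads: GENE_HIGH has three distinct molecules sharing start 100
# (plus one PCR copy), GENE_LOW has one molecule sequenced twice.
reads = [
    ("GENE_HIGH", 100, "AAA"), ("GENE_HIGH", 100, "CCC"),
    ("GENE_HIGH", 100, "GGG"), ("GENE_HIGH", 100, "AAA"),
    ("GENE_HIGH", 200, "TTT"),
    ("GENE_LOW", 50, "ACG"), ("GENE_LOW", 50, "ACG"),
]
raw = count_per_gene(reads, "none")
by_position = count_per_gene(reads, "position")
by_umi = count_per_gene(reads, "umi")
```

In this toy data the index-aware count recovers the four distinct GENE_HIGH molecules, the position-only count collapses them to two, and the raw count inflates both genes with PCR copies.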


2017 ◽  
Author(s):  
Alexander Platzer ◽  
Julia Polzin ◽  
Ping Penny Han ◽  
Klaus Rembart ◽  
Thomas Nussbaumer

Abstract: Metagenomics, RNA-seq, WGS (Whole Genome Sequencing) and other types of next-generation sequencing techniques provide quantitative measurements for single strains and genes over time. To obtain a global overview of the experiment and to explore the full potential of a given dataset, intuitive and interactive visualization tools are needed. Therefore, we established BioSankey, which allows users to visualize microbial species in microbiome studies and gene expression over time as a Sankey diagram. These diagrams are embedded into a project-specific HTML page that contains all information provided during the installation process. BioSankey can be easily applied to analyse bacterial communities in time-series datasets. Furthermore, it can be used to analyse the fluctuations of differentially expressed genes (DEG). The output of BioSankey is a project-specific HTML page, which depends only on JavaScript to enable searches for species or genes of interest, without requiring a web server or a database connection to exchange results among collaboration partners. BioSankey is a tool to visualize different data elements from single and dual RNA-seq datasets as well as from metagenome studies.


Author(s):  
Naiyar Iqbal ◽  
Pradeep Kumar

Disease classification based on biological data is an important area in bioinformatics and biomedical research. It helps doctors and medical practitioners detect disease early and supports them as a computer-aided diagnostic tool for accurate diagnosis, prognosis, and treatment. Microarray gene expression data were formerly widely applied to disease classification, but next-generation sequencing (NGS) has now replaced microarray technology. In the last few years, RNA sequencing (RNA-Seq) data have become widely used for transcriptomic analysis. Nevertheless, RNA-Seq-based classification of disease is still in its infancy. In this article, we present a general framework for disease classification built on RNA-Seq data. This framework will guide researchers in processing RNA-Seq data, extracting relevant features, and applying an appropriate classifier to classify any kind of disease.
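The three stages named in this abstract (process, extract features, classify) might look roughly like the sketch below. It is not the framework from the article, whose details are not given here. It assumes a log2 transform for processing, variance ranking for feature selection, and a nearest-centroid rule as a stand-in classifier; all function names and count values are hypothetical.

```python
import math
from statistics import mean, variance


def log_transform(profile):
    """Processing step (assumed): log2(count + 1) to tame the dynamic range."""
    return [math.log2(x + 1.0) for x in profile]


def top_variable_features(profiles, k):
    """Feature step (assumed): indices of the k most variable genes."""
    n_genes = len(profiles[0])
    var_by_gene = [variance([p[j] for p in profiles]) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: var_by_gene[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(profiles, labels, features):
    """Classifier step (assumed): mean profile per class on selected genes."""
    centroids = {}
    for label in set(labels):
        rows = [p for p, lab in zip(profiles, labels) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return centroids


def predict(profile, centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    point = [profile[j] for j in features]
    return min(centroids, key=lambda lab: math.dist(point, centroids[lab]))


# Invented read counts for four genes in four training samples.
raw_counts = [[10, 400, 100, 50], [12, 380, 105, 55],
              [300, 20, 98, 52], [350, 15, 102, 49]]
labels = ["healthy", "healthy", "disease", "disease"]

train = [log_transform(p) for p in raw_counts]
features = top_variable_features(train, k=2)
centroids = fit_centroids(train, labels, features)
prediction = predict(log_transform([280, 25, 100, 50]), centroids, features)
```

Any classifier could replace the nearest-centroid rule; it is used here only because it needs nothing beyond the standard library.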


2015 ◽  
Author(s):  
Benjamin K Johnson ◽  
Matthew B Scholz ◽  
Tracy K Teal ◽  
Robert B Abramovitch

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch. Supplemental Information: Supplementary data are available online.

