Normalization benchmark of ATAC-seq datasets shows the importance of accounting for GC-content effects

2021 ◽  
Author(s):  
Koen Van den Berge ◽  
Hsin-Jung Chou ◽  
Hector Roux de Bézieux ◽  
Kelly Street ◽  
Davide Risso ◽  
...  

Abstract Modern assays have enabled high-throughput studies of epigenetic regulation of gene expression using DNA sequencing. In particular, the assay for transposase-accessible chromatin using sequencing (ATAC-seq) allows the study of chromatin configuration for an entire genome. Despite the gain in popularity of the assay, there have been limited studies investigating the analytical challenges related to ATAC-seq data, and most studies leverage tools developed for bulk transcriptome sequencing (RNA-seq). Here, we show that GC-content effects are omnipresent in ATAC-seq datasets. Since the GC-content effects are sample-specific, they can bias downstream analyses such as clustering and differential accessibility analysis. We evaluate twelve different normalization procedures on eight public ATAC-seq datasets and show that no method uniformly outperforms all others. However, our work clearly shows that accounting for GC-content effects in the normalization is crucial for common downstream ATAC-seq data analyses, such as clustering and differential accessibility analysis, leading to improved accuracy and interpretability of the results. Using two case studies, we show that exploratory data analysis is essential to guide the choice of an appropriate normalization method for a given dataset.
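As an illustration of the kind of exploratory check the abstract alludes to, the sketch below bins peaks by GC content and compares the mean log-count per bin between two samples. It is a minimal diagnostic under stated assumptions, not one of the normalization procedures evaluated in the paper: the function name `gc_bin_means`, the equal-size binning, and the toy data are all hypothetical.

```python
import math
from statistics import mean


def gc_bin_means(counts, gc, n_bins=10):
    """Mean log2(count + 1) per GC-content bin for one sample.

    counts: read counts per peak; gc: GC fraction per peak (same length).
    Peaks are sorted by GC content and split into n_bins groups of
    near-equal size; returns one mean per group, from low GC to high GC.
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")
    if not 0 < n_bins <= len(counts):
        raise ValueError("n_bins must be between 1 and the number of peaks")
    # Indices of peaks ordered by increasing GC content.
    order = sorted(range(len(gc)), key=lambda i: gc[i])
    means = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        means.append(mean(math.log2(counts[i] + 1) for i in order[start:stop]))
    return means


# Toy data (hypothetical): sample_a is flat, sample_b rises with GC content.
gc = [0.30 + 0.40 * i / 999 for i in range(1000)]
sample_a = [100] * 1000
sample_b = [round(100 * 2 ** (5 * (g - 0.5))) for g in gc]

# Per-bin difference between the two samples. A trend across bins, rather
# than a constant offset, suggests a sample-specific GC-content effect.
diff = [b - a for a, b in zip(gc_bin_means(sample_a, gc), gc_bin_means(sample_b, gc))]
```

A constant `diff` across bins would point to a plain depth difference that a global scaling factor can absorb; a slope, as in the toy data, is the pattern that would call for a GC-aware normalization.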

2020 ◽  
Vol 21 (8) ◽  
pp. 2800 ◽  
Author(s):  
Xi Wu ◽  
Yang Yang ◽  
Chaoyue Zhong ◽  
Yin Guo ◽  
Tengyu Wei ◽  
...  

Chromatin structure plays a pivotal role in maintaining the precise regulation of gene expression. Accessible chromatin regions act as the binding sites of transcription factors (TFs) and cis-elements. Therefore, information from these open regions will enhance our understanding of the relationship between TF binding, chromatin status and the regulation of gene expression. We employed an assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and RNA-seq analyses in the gonads of protogynous hermaphroditic orange-spotted groupers during sex reversal to profile open chromatin regions and TF binding sites. We focused on several crucial TFs, including ZNF263, SPIB, and KLF9, and analyzed the networks of TF-target genes. We identified numerous transcripts exhibiting sex-preferred expression among their target genes, along with their associated open chromatin regions. We then investigated the expression patterns of sex-related genes as well as the mRNA localization of certain genes during sex reversal. We found a set of sex-related genes that—upon further study—might be identified as the sex-specific or cell-specific marker genes that trigger sex reversal. Moreover, we discovered the core genes (gnas, ccnb2, and cyp21a) of several pathways related to sex reversal that provide the guideposts for future study.


2018 ◽  
Author(s):  
Yuanchao Zhang ◽  
Man S. Kim ◽  
Erin R. Reichenberger ◽  
Ben Stear ◽  
Deanne M. Taylor

Abstract In single-cell RNA-seq (scRNA-seq) experiments, the number of individual cells has increased exponentially, and the sequencing depth of each cell has decreased significantly. As a result, analyzing scRNA-seq data requires extensive considerations of program efficiency and method selection. In order to reduce the complexity of scRNA-seq data analysis, we present scedar, a scalable Python package for scRNA-seq exploratory data analysis. The package provides a convenient and reliable interface for performing visualization, imputation of gene dropouts, detection of rare transcriptomic profiles, and clustering on large-scale scRNA-seq datasets. The analytical methods are efficient, and they also do not assume that the data follow certain statistical distributions. The package is extensible and modular, which would facilitate the further development of functionalities for future requirements with the open-source development community. The scedar package is distributed under the terms of the MIT license at https://pypi.org/project/scedar.


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1070 ◽  
Author(s):  
Michael I. Love ◽  
Simon Anders ◽  
Vladislav Kim ◽  
Wolfgang Huber

Here we walk through an end-to-end gene-level RNA-Seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.


2017 ◽  
Author(s):  
Steven Xijin Ge ◽  
Eun Wo Son

Abstract The analysis and interpretation of the RNA-Seq data can be time-consuming and challenging. We aim to streamline the bioinformatic analyses of gene-level data by developing a user-friendly web application for exploratory data analysis, differential expression, and pathway analysis. iDEP (integrated Differential Expression and Pathway analysis) seamlessly connects 63 R/Bioconductor packages, 208 annotation databases for plant and animal species, and 2 web services. The workflow can be reproduced by downloading customized R code and related files. As demonstrated by two examples, iDEP (http://ge-lab.org/idep/) democratizes access to bioinformatics resources and empowers biologists to easily gain actionable insights from transcriptomic data.


2019 ◽  
Vol 97 (Supplement_2) ◽  
pp. 15-16
Author(s):  
Sylvain Foissac ◽  
Sarah Djebali ◽  
Kylie Munyard ◽  
Nathalie Vialaneix ◽  
Andrea Rau ◽  
...  

Abstract Improving the functional annotation of animal genomes is a key challenge in bridging the gap between genotype and phenotype, thus enabling predictive biology. Regarding livestock production, major outcomes are expected from a better understanding of the genetic architecture underlying quantitative traits. As part of the Functional Annotation of ANimal Genomes action (FAANG: www.faang.org), the FR-AgENCODE project generated omics data to improve the reference annotation of the cattle, pig, goat and chicken genomes. High-throughput molecular assays have been performed on tissues/cells relevant to immune and metabolic traits. From two males and two females per species (pig, cattle, goat, chicken), strand-oriented RNA-seq gene expression and ATAC-seq chromatin accessibility assays were performed on liver and two PBMC-sorted T-cell types (CD4+ and CD8+). Chromosome Conformation Capture (in situ Hi-C) was also carried out on liver samples. About 4,000 samples have been collected at the INRA biorepository and registered at the EBI BioSamples registry. More than 80% of the planned experiments could be completed, generating ~11.5 billion sequencing reads over the 3 assays. While most (50–80%) RNA-seq reads mapped to annotated exons, thousands of novel transcripts were found, with ~60K mRNAs and ~22K lncRNAs in cattle. Differentially expressed genes between cell types were enriched for immunity- or metabolism-related terms, and differentially accessible chromatin regions were identified as potential regulatory sites. Interestingly, correlations between gene expression and promoter accessibility across samples were skewed towards both positive and negative values, suggesting distinct regulatory mechanisms of gene expression. These patterns have been further investigated using human data from the Epigenome Roadmap Mapping Consortium.
Altogether, this study illustrates the value of a coordinated effort to tackle the genome-to-phenome challenge and provides a useful resource to the community. Availability: www.fragencode.org.


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Stuart Lee ◽  
Albert Y Zhang ◽  
Shian Su ◽  
Ashley P Ng ◽  
Aliaksei Z Holik ◽  
...  

Abstract RNA-seq datasets can contain millions of intron reads per library that are typically removed from downstream analysis. Only reads overlapping annotated exons are considered to be informative since mature mRNA is assumed to be the major component sequenced, especially for poly(A) RNA libraries. In this study, we show that intron reads are informative, and through exploratory data analysis of read coverage that intron signal is representative of both pre-mRNAs and intron retention. We demonstrate how intron reads can be utilized in differential expression analysis using our index method where a unique set of differentially expressed genes can be detected using intron counts. In exploring read coverage, we also developed the superintronic software that quickly and robustly calculates user-defined summary statistics for exonic and intronic regions. Across multiple datasets, superintronic enabled us to identify several genes with distinctly retained introns that had similar coverage levels to that of neighbouring exons. The work and ideas presented in this paper are the first of their kind to consider multiple biological sources for intron reads through exploratory data analysis, minimizing bias in discovery and interpretation of results. Our findings open up possibilities for further methods development for intron reads and RNA-seq data in general.


2020 ◽  
Vol 16 (4) ◽  
pp. e1007794
Author(s):  
Yuanchao Zhang ◽  
Man S. Kim ◽  
Erin R. Reichenberger ◽  
Ben Stear ◽  
Deanne M. Taylor

F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 1070 ◽  
Author(s):  
Michael I. Love ◽  
Simon Anders ◽  
Vladislav Kim ◽  
Wolfgang Huber

Here we walk through an end-to-end gene-level RNA-Seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.


2016 ◽  
Vol 3 (1) ◽  
pp. 39 ◽  
Author(s):  
Federico Marini ◽  
Harald Binder

For a proper understanding of the organization and regulation of gene expression, the computational analysis is an essential component of the scientific workflow, and this is particularly true in the fields of biostatistics and bioinformatics. Interactivity and reproducibility are two highly relevant features to consider when adopting or designing a tool, and often they cannot be provided simultaneously. In this work, we address the issue of developing a framework that can provide interactive analysis, in order to allow experimentalists to fully exploit advanced software tools, as well as reproducibility as an internal validation of the analysis steps, by providing the underlying code and data in such a way that enables the re-creation of the results, and also constitutes a didactic tool for the life scientist. We illustrate this paradigm with the help of the R/Bioconductor package pcaExplorer, designed as a practical companion for interactive and reproducible exploratory data analysis for high dimensional data (e.g. RNA-seq), and highlight some of the features that are provided in the software.

