PEPATAC: An optimized ATAC-seq pipeline with serial alignments

2020 ◽  
Author(s):  
Jason P. Smith ◽  
M. Ryan Corces ◽  
Jin Xu ◽  
Vincent P. Reuter ◽  
Howard Y. Chang ◽  
...  

Motivation: As chromatin accessibility data from ATAC-seq experiments continues to expand, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. Results: PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. Availability: BSD2-licensed code and documentation at https://pepatac.databio.org.

2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Jason P Smith ◽  
M Ryan Corces ◽  
Jin Xu ◽  
Vincent P Reuter ◽  
Howard Y Chang ◽  
...  

Abstract As chromatin accessibility data from ATAC-seq experiments continues to expand, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. BSD2-licensed code and documentation are available at https://pepatac.databio.org.
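The "standard definition format" mentioned in both abstracts is the PEP (Portable Encapsulated Projects) specification, whose metadata APIs are pepr (R) and peppy (Python). As a minimal plain-Python illustration of the underlying idea, a PEP sample table is just a CSV of sample annotations that downstream code can filter on; the column names and sample rows below are hypothetical placeholders, not taken from the PEPATAC documentation.

```python
import csv
import io

# A PEP-style sample table is a plain CSV of per-sample metadata.
# These columns and samples are illustrative only.
sample_table = """sample_name,protocol,read1
frog_1,ATAC,frog_1_R1.fastq.gz
frog_2,ATAC,frog_2_R1.fastq.gz
"""

# Parse rows into dicts keyed by column name.
samples = list(csv.DictReader(io.StringIO(sample_table)))

# Select samples by protocol, as a project-level API would let you do.
atac_samples = [s["sample_name"] for s in samples if s["protocol"] == "ATAC"]
```

In practice one would load the project configuration through peppy or pepr rather than reading the CSV directly, which also resolves derived attributes such as file paths.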


2020 ◽  
Author(s):  
Jason P. Smith ◽  
Arun B. Dutta ◽  
Kizhakke Mattada Sathyan ◽  
Michael J. Guertin ◽  
Nathan C. Sheffield

Experiments that profile nascent RNA are growing in popularity; however, there is no standard analysis pipeline to uniformly process the data and assess quality. Here, we introduce PEPPRO, a comprehensive, scalable workflow for GRO-seq, PRO-seq, and ChRO-seq data. PEPPRO produces uniform processed output files for downstream analysis, including alignment files, signal tracks, and count matrices. Furthermore, PEPPRO simplifies downstream analysis by using a standard project definition format which can be read using metadata APIs in R and Python. For quality control, PEPPRO provides several novel statistics and plots, including assessments of adapter abundance, RNA integrity, library complexity, nascent RNA purity, and run-on efficiency. PEPPRO is restartable and fault-tolerant, records copious logs, and provides a web-based project report for navigating results. It can be run on local hardware or using any cluster resource manager, using either native software or a provided modular Linux container environment. PEPPRO is thus a robust and portable first step for genomic nascent RNA analysis. Availability: BSD2-licensed code and documentation: https://peppro.databio.org.
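One of the QC statistics listed above, adapter abundance, reduces to a simple idea: the fraction of reads still carrying an adapter sequence. The sketch below is an illustration of that idea only, not PEPPRO's implementation; the adapter string (a TruSeq adapter prefix) and reads are placeholders.

```python
def adapter_fraction(reads, adapter):
    """Fraction of reads that contain the adapter subsequence.

    A high fraction suggests short inserts or incomplete trimming.
    """
    return sum(1 for r in reads if adapter in r) / len(reads)


# Hypothetical reads: one with an adapter, one clean.
example_reads = ["AAAGATCGGAAGAGC", "CCCCCC"]
frac = adapter_fraction(example_reads, "AGATCGGAAGAGC")
```

Real pipelines detect adapters with dedicated trimmers (allowing mismatches and partial matches at read ends); exact substring search is only the simplest possible version.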


2020 ◽  
Author(s):  
Shirin Moossavi ◽  
Kelsey Fehr ◽  
Theo J. Moraes ◽  
Ehsan Khafipour ◽  
Meghan B. Azad

Abstract Background Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. Results In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms, and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. Conclusion This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.


2020 ◽  
Author(s):  
Rui Hong ◽  
Yusuke Koga ◽  
Shruthi Bandyadka ◽  
Anastasia Leshchyk ◽  
Zhe Wang ◽  
...  

Abstract Performing comprehensive quality control is necessary to remove technical or biological artifacts in single-cell RNA sequencing (scRNA-seq) data. Artifacts in the scRNA-seq data, such as doublets or ambient RNA, can also hinder downstream clustering and marker selection and need to be assessed. While several algorithms have been developed to perform various quality control tasks, they are only available in different packages across various programming environments. No standardized workflow has been developed to streamline the generation and reporting of all quality control metrics from these tools. We have built an easy-to-use pipeline, named SCTK-QC, in the singleCellTK package that generates a comprehensive set of quality control metrics from a plethora of packages for quality control. We are able to import data from several preprocessing tools including CellRanger, STARSolo, BUSTools, dropEST, Optimus, and SEQC. Standard quality control metrics for each cell are calculated including the total number of UMIs, total number of genes detected, and the percentage of counts mapping to predefined gene sets such as mitochondrial genes. Doublet detection algorithms employed include scrublet, scds, doubletCells, and doubletFinder. DecontX is used to identify contamination in each individual cell. To make the data accessible in downstream analysis workflows, the results can be exported to common data structures in R and Python or to text files for use in any generic workflow. Overall, this pipeline will streamline and standardize quality control analyses for scRNA-seq data across different platforms.
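The standard per-cell metrics named above (total UMIs, genes detected, percent mitochondrial counts) can be computed directly from a genes-by-cells count matrix. This is a minimal plain-Python sketch of those definitions, not SCTK-QC's implementation, which operates on SingleCellExperiment objects in R.

```python
def per_cell_qc(counts, gene_names, mito_prefix="MT-"):
    """Compute basic per-cell QC metrics from a genes-by-cells matrix.

    counts: list of rows (one per gene), each a list of per-cell counts.
    gene_names: one name per row; mitochondrial genes matched by prefix.
    """
    n_genes, n_cells = len(counts), len(counts[0])
    metrics = []
    for c in range(n_cells):
        col = [counts[g][c] for g in range(n_genes)]
        total = sum(col)                          # total UMIs in this cell
        detected = sum(1 for v in col if v > 0)   # genes with any counts
        mito = sum(v for g, v in zip(gene_names, col)
                   if g.startswith(mito_prefix))  # counts in the mito gene set
        metrics.append({
            "total_umi": total,
            "genes_detected": detected,
            "pct_mito": 100 * mito / total if total else 0.0,
        })
    return metrics
```

In practice these metrics are thresholded (e.g. flagging cells with very high mitochondrial percentage as likely damaged), with cutoffs chosen per dataset.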


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 517
Author(s):  
Len Taing ◽  
Gali Bai ◽  
Clara Cousins ◽  
Paloma Cejas ◽  
Xintao Qiu ◽  
...  

Motivation: The chromatin profile measured by ATAC-seq, ChIP-seq, or DNase-seq experiments can identify genomic regions critical in regulating gene expression and provide insights on biological processes such as diseases and development. However, quality control and processing of chromatin profiling data involve many steps, and different bioinformatics tools are used at each step, so it can be challenging to manage the analysis. Results: We developed a Snakemake pipeline called CHIPS (CHromatin enrIchment ProcesSor) to streamline the processing of ChIP-seq, ATAC-seq, and DNase-seq data. The pipeline supports single- and paired-end data and is flexible to start with FASTQ or BAM files. It includes basic steps such as read trimming, mapping, and peak calling. In addition, it calculates quality control metrics such as contamination profiles, polymerase chain reaction bottleneck coefficient, the fraction of reads in peaks, percentage of peaks overlapping with the union of public DNaseI hypersensitivity sites, and conservation profile of the peaks. For downstream analysis, it carries out peak annotations, motif finding, and regulatory potential calculation for all genes. The pipeline ensures that the processing is robust and reproducible. Availability: CHIPS is available at https://github.com/liulab-dfci/CHIPS.
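Two of the QC metrics named above have simple definitions: the PCR bottleneck coefficient (PBC1) is the fraction of distinct genomic positions covered by exactly one read, and FRiP is the fraction of reads falling in called peaks. The sketch below illustrates those definitions on toy positions; it is not CHIPS's code, which computes them from BAM and peak files.

```python
from collections import Counter


def pbc1(read_positions):
    """PBC1: positions covered exactly once / distinct positions covered.

    Values near 1 indicate low PCR duplication; low values suggest
    a bottlenecked library.
    """
    depth = Counter(read_positions)
    once = sum(1 for d in depth.values() if d == 1)
    return once / len(depth)


def frip(read_positions, peaks):
    """Fraction of reads in peaks; peaks are (start, end) half-open intervals."""
    in_peak = sum(1 for p in read_positions
                  if any(s <= p < e for s, e in peaks))
    return in_peak / len(read_positions)
```

Real implementations work on alignment coordinates (chromosome, position, strand) and use interval trees for peak overlap; the linear scan here is only for clarity.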


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Shirin Moossavi ◽  
Kelsey Fehr ◽  
Ehsan Khafipour ◽  
Meghan B. Azad

Abstract Background Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. Results In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. Conclusion This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.
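The "differential prevalence of contaminants between batches" idea can be sketched concretely: a taxon detected in most samples of one batch but nearly absent in the other is a candidate reagent contaminant (different reagent lots carry different contaminants). The sketch below is an illustration of that heuristic, not the decontam algorithm or the authors' pipeline; the prevalence gap threshold and taxon names are hypothetical.

```python
def flag_batch_contaminants(prevalence_a, prevalence_b, gap=0.5):
    """Flag taxa whose detection prevalence differs sharply between batches.

    prevalence_a, prevalence_b: dict mapping taxon -> fraction of samples
    in that batch where the taxon was detected. `gap` is an illustrative
    threshold, not an established cutoff.
    """
    flagged = []
    for taxon in set(prevalence_a) | set(prevalence_b):
        pa = prevalence_a.get(taxon, 0.0)
        pb = prevalence_b.get(taxon, 0.0)
        if abs(pa - pb) >= gap:
            flagged.append(taxon)
    return sorted(flagged)
```

Flagged taxa would then be cross-checked against negative controls and known reagent contaminant genera before removal, since a genuine biological signal can also differ between batches.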


2020 ◽  
Author(s):  
Mathias Kuhring ◽  
Alina Eisenberger ◽  
Vanessa Schmidt ◽  
Nicolle Kränkel ◽  
David M. Leistner ◽  
...  

Abstract Targeted quantitative mass spectrometry metabolite profiling is the workhorse of metabolomics research. Robust and reproducible data is essential for confidence in analytical results and is particularly important with large-scale studies. Commercial kits are now available which use carefully calibrated and validated internal and external standards to provide such reliability. However, they are still subject to processing and technical errors in their use and should be subject to a laboratory's routine quality assurance and quality control measures to maintain confidence in the results. We discuss important systematic and random measurement errors when using these kits and suggest measures to detect and quantify them. We demonstrate how wider analysis of the entire dataset, alongside standard analyses of quality control samples, can be used to identify outliers and quantify systematic trends in order to improve downstream analysis. Finally, we present the MeTaQuaC software which implements the above concepts and methods for Biocrates kits and creates a comprehensive quality control report containing rich visualization and informative scores and summary statistics. Preliminary unsupervised multivariate analysis methods are also included to provide rapid insight into study variables and groups. MeTaQuaC is provided as an open source R package under a permissive MIT license and includes detailed user documentation.
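A routine measure for the QC-sample analyses described above is the relative standard deviation (RSD, also called coefficient of variation) of each metabolite across repeated QC injections: metabolites whose QC measurements vary too much are considered unreliable. This is a generic illustration of that convention, not MeTaQuaC's code; the 15% cutoff is a common community convention, not a MeTaQuaC-specific value.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation (%) of repeated QC measurements."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


def unstable_metabolites(qc_runs, cutoff=15.0):
    """Return metabolites whose QC-sample RSD exceeds the cutoff.

    qc_runs: dict mapping metabolite -> list of measured concentrations
    across repeated QC-sample injections.
    """
    return sorted(m for m, v in qc_runs.items() if rsd_percent(v) > cutoff)
```

A full report would additionally track systematic trends, e.g. drift of QC values over injection order, which a simple RSD does not capture.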


2021 ◽  
Author(s):  
Len Taing ◽  
Clara Cousins ◽  
Gali Bai ◽  
Paloma Cejas ◽  
Xintao Qiu ◽  
...  

Abstract Motivation The chromatin profile measured by ATAC-seq, ChIP-seq, or DNase-seq experiments can identify genomic regions critical in regulating gene expression and provide insights on biological processes such as diseases and development. However, quality control and processing of chromatin profiling data involve many steps, and different bioinformatics tools are used at each step. It can be challenging to manage the analysis. Results We developed a Snakemake pipeline called CHIPS (CHromatin enrIchment ProcesSor) to streamline the processing of ChIP-seq, ATAC-seq, and DNase-seq data. The pipeline supports single- and paired-end data and is flexible to start with FASTQ or BAM files. It includes basic steps such as read trimming, mapping, and peak calling. In addition, it calculates quality control metrics such as contamination profiles, PCR bottleneck coefficient, the fraction of reads in peaks, percentage of peaks overlapping with the union of public DNaseI hypersensitivity sites, and conservation profile of the peaks. For downstream analysis, it carries out peak annotations, motif finding, and regulatory potential calculation for all genes. The pipeline ensures that the processing is robust and reproducible. Availability CHIPS is available at https://github.com/liulab-dfci/CHIPS.


2019 ◽  
Author(s):  
Melissa Y Yan ◽  
Betsy Ferguson ◽  
Benjamin N Bimber

Abstract Summary Large-scale genomic studies produce millions of sequence variants, generating datasets far too massive for manual inspection. To ensure variant and genotype data are consistent and accurate, it is necessary to evaluate variants prior to downstream analysis using quality control (QC) reports. Variant call format (VCF) files are the standard format for representing variant data; however, generating summary statistics from these files is not always straightforward. While tools to summarize variant data exist, they generally produce simple text file tables, which still require additional processing and interpretation. VariantQC fills this gap as a user-friendly, interactive visual QC report that generates and concisely summarizes statistics from VCF files. The report aggregates and summarizes variants by dataset, chromosome, sample and filter type. The VariantQC report is useful for high-level dataset summary and quality control, and helps flag outliers. Furthermore, VariantQC operates on VCF files, so it can be easily integrated into many existing variant pipelines. Availability and implementation DISCVRSeq's VariantQC tool is freely available as a Java program, with the compiled JAR and source code available from https://github.com/BimberLab/DISCVRSeq/. Documentation and example reports are available at https://bimberlab.github.io/DISCVRSeq/.
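The aggregation described above (variants tallied by chromosome and filter type) amounts to counting over the fixed VCF columns. This is a minimal sketch of that idea on raw VCF text, not VariantQC itself, which is a Java tool built on the GATK engine; the example records are fabricated toy data.

```python
from collections import Counter


def summarize_vcf(vcf_text):
    """Tally variant records per chromosome and per FILTER value.

    Skips header lines (starting with '#'); relies on the fixed VCF
    column order: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    """
    by_chrom, by_filter = Counter(), Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        by_chrom[fields[0]] += 1   # CHROM column
        by_filter[fields[6]] += 1  # FILTER column
    return by_chrom, by_filter
```

Production tools would use a proper VCF parser (e.g. htsjdk or pysam) to handle multi-allelic records, per-sample genotype fields, and indexed access, none of which this sketch attempts.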


2019 ◽  
Author(s):  
Wenbao Yu ◽  
Yasin Uzun ◽  
Qin Zhu ◽  
Changya Chen ◽  
Kai Tan

Abstract Single-cell chromatin accessibility sequencing (scCAS) has become a powerful technology for understanding epigenetic heterogeneity of complex tissues. The development of several experimental protocols has led to a rapid accumulation of scCAS data. In contrast, there is a lack of open-source software tools for comprehensive processing, analysis and visualization of scCAS data generated using all existing experimental protocols. Here we present scATAC-pro for quality assessment, analysis and visualization of scCAS data. scATAC-pro provides flexible choice of methods for different data processing and analytical tasks, with carefully curated default parameters. A range of quality control metrics are computed for several key steps of the experimental protocol. scATAC-pro generates summary reports for both quality assessment and downstream analysis. It also provides additional utility functions for generating input files for various types of downstream analyses and data visualization. With the rapid accumulation of scCAS data, scATAC-pro will facilitate studies of epigenomic heterogeneity in healthy and diseased tissues.
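One early QC step common to scCAS workflows is counting fragments per cell barcode and discarding barcodes with too few fragments to represent a real cell. The sketch below illustrates that step on fragment-file-style records (tab-separated chrom, start, end, barcode); it is a generic illustration, not scATAC-pro's implementation, and the depth threshold is a placeholder.

```python
from collections import Counter


def barcode_depths(fragment_lines):
    """Count fragments per cell barcode from fragment-file records."""
    depths = Counter()
    for line in fragment_lines:
        chrom, start, end, barcode = line.split("\t")[:4]
        depths[barcode] += 1
    return depths


def pass_barcodes(depths, min_fragments=500):
    """Keep barcodes at or above a minimum fragment count.

    min_fragments is illustrative; real pipelines often pick the cutoff
    from the knee of the barcode-rank curve instead of a fixed number.
    """
    return {b for b, n in depths.items() if n >= min_fragments}
```

Downstream metrics such as TSS enrichment and fraction of fragments in peaks are then computed only for the passing barcodes.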

