PEPATAC: An optimized ATAC-seq pipeline with serial alignments

2020 ◽  
Author(s):  
Jason P. Smith ◽  
M. Ryan Corces ◽  
Jin Xu ◽  
Vincent P. Reuter ◽  
Howard Y. Chang ◽  
...  

Motivation: As chromatin accessibility data from ATAC-seq experiments continues to expand, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. Results: PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. Availability: BSD2-licensed code and documentation at https://pepatac.databio.org.

2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Jason P Smith ◽  
M Ryan Corces ◽  
Jin Xu ◽  
Vincent P Reuter ◽  
Howard Y Chang ◽  
...  

Abstract As chromatin accessibility data from ATAC-seq experiments continues to expand, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. It is restartable, fault-tolerant, and can be run on local hardware, using any cluster resource manager, or in provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. BSD2-licensed code and documentation are available at https://pepatac.databio.org.
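The "standard definition format" mentioned in both abstracts is the PEP (Portable Encapsulated Projects) specification, whose metadata APIs are pepr (R) and peppy (Python). As a minimal plain-Python illustration of the underlying idea, a PEP sample table is just a CSV of sample annotations that downstream code can filter on; the column names and sample rows below are hypothetical placeholders, not taken from the PEPATAC documentation.

```python
import csv
import io

# A PEP-style sample table is a plain CSV of per-sample metadata.
# These columns and samples are illustrative only.
sample_table = """sample_name,protocol,read1
frog_1,ATAC,frog_1_R1.fastq.gz
frog_2,ATAC,frog_2_R1.fastq.gz
"""

# Parse rows into dicts keyed by column name.
samples = list(csv.DictReader(io.StringIO(sample_table)))

# Select samples by protocol, as a project-level API would let you do.
atac_samples = [s["sample_name"] for s in samples if s["protocol"] == "ATAC"]
```

In practice one would load the project configuration through peppy or pepr rather than reading the CSV directly, which also resolves derived attributes such as file paths.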


2020 ◽  
Author(s):  
Jason P. Smith ◽  
Arun B. Dutta ◽  
Kizhakke Mattada Sathyan ◽  
Michael J. Guertin ◽  
Nathan C. Sheffield

Experiments that profile nascent RNA are growing in popularity; however, there is no standard analysis pipeline to uniformly process the data and assess quality. Here, we introduce PEPPRO, a comprehensive, scalable workflow for GRO-seq, PRO-seq, and ChRO-seq data. PEPPRO produces uniform processed output files for downstream analysis, including alignment files, signal tracks, and count matrices. Furthermore, PEPPRO simplifies downstream analysis by using a standard project definition format which can be read using metadata APIs in R and Python. For quality control, PEPPRO provides several novel statistics and plots, including assessments of adapter abundance, RNA integrity, library complexity, nascent RNA purity, and run-on efficiency. PEPPRO is restartable and fault-tolerant, records copious logs, and provides a web-based project report for navigating results. It can be run on local hardware or using any cluster resource manager, using either native software or a provided modular Linux container environment. PEPPRO is thus a robust and portable first step for genomic nascent RNA analysis. Availability: BSD2-licensed code and documentation: https://peppro.databio.org.
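One of the QC statistics listed above, adapter abundance, reduces to a simple idea: the fraction of reads still carrying an adapter sequence. The sketch below is an illustration of that idea only, not PEPPRO's implementation; the adapter string (a TruSeq adapter prefix) and reads are placeholders.

```python
def adapter_fraction(reads, adapter):
    """Fraction of reads that contain the adapter subsequence.

    A high fraction suggests short inserts or incomplete trimming.
    """
    return sum(1 for r in reads if adapter in r) / len(reads)


# Hypothetical reads: one with an adapter, one clean.
example_reads = ["AAAGATCGGAAGAGC", "CCCCCC"]
frac = adapter_fraction(example_reads, "AGATCGGAAGAGC")
```

Real pipelines detect adapters with dedicated trimmers (allowing mismatches and partial matches at read ends); exact substring search is only the simplest possible version.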


2020 ◽  
Author(s):  
Shirin Moossavi ◽  
Kelsey Fehr ◽  
Theo J. Moraes ◽  
Ehsan Khafipour ◽  
Meghan B. Azad

Abstract Background Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. Results In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms, and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. Conclusion This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.


2020 ◽  
Author(s):  
Rui Hong ◽  
Yusuke Koga ◽  
Shruthi Bandyadka ◽  
Anastasia Leshchyk ◽  
Zhe Wang ◽  
...  

Abstract Performing comprehensive quality control is necessary to remove technical or biological artifacts in single-cell RNA sequencing (scRNA-seq) data. Artifacts in the scRNA-seq data, such as doublets or ambient RNA, can also hinder downstream clustering and marker selection and need to be assessed. While several algorithms have been developed to perform various quality control tasks, they are only available in different packages across various programming environments. No standardized workflow has been developed to streamline the generation and reporting of all quality control metrics from these tools. We have built an easy-to-use pipeline, named SCTK-QC, in the singleCellTK package that generates a comprehensive set of quality control metrics from a plethora of packages for quality control. We are able to import data from several preprocessing tools including CellRanger, STARSolo, BUSTools, dropEST, Optimus, and SEQC. Standard quality control metrics for each cell are calculated including the total number of UMIs, total number of genes detected, and the percentage of counts mapping to predefined gene sets such as mitochondrial genes. Doublet detection algorithms employed include scrublet, scds, doubletCells, and doubletFinder. DecontX is used to identify contamination in each individual cell. To make the data accessible in downstream analysis workflows, the results can be exported to common data structures in R and Python or to text files for use in any generic workflow. Overall, this pipeline will streamline and standardize quality control analyses for scRNA-seq data across different platforms.
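The standard per-cell metrics named above (total UMIs, genes detected, percent mitochondrial counts) can be computed directly from a genes-by-cells count matrix. This is a minimal plain-Python sketch of those definitions, not SCTK-QC's implementation, which operates on SingleCellExperiment objects in R.

```python
def per_cell_qc(counts, gene_names, mito_prefix="MT-"):
    """Compute basic per-cell QC metrics from a genes-by-cells matrix.

    counts: list of rows (one per gene), each a list of per-cell counts.
    gene_names: one name per row; mitochondrial genes matched by prefix.
    """
    n_genes, n_cells = len(counts), len(counts[0])
    metrics = []
    for c in range(n_cells):
        col = [counts[g][c] for g in range(n_genes)]
        total = sum(col)                          # total UMIs in this cell
        detected = sum(1 for v in col if v > 0)   # genes with any counts
        mito = sum(v for g, v in zip(gene_names, col)
                   if g.startswith(mito_prefix))  # counts in the mito gene set
        metrics.append({
            "total_umi": total,
            "genes_detected": detected,
            "pct_mito": 100 * mito / total if total else 0.0,
        })
    return metrics
```

In practice these metrics are thresholded (e.g. flagging cells with very high mitochondrial percentage as likely damaged), with cutoffs chosen per dataset.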


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 517
Author(s):  
Len Taing ◽  
Gali Bai ◽  
Clara Cousins ◽  
Paloma Cejas ◽  
Xintao Qiu ◽  
...  

Motivation: The chromatin profile measured by ATAC-seq, ChIP-seq, or DNase-seq experiments can identify genomic regions critical in regulating gene expression and provide insights on biological processes such as diseases and development. However, quality control and processing of chromatin profiling data involve many steps, and different bioinformatics tools are used at each step, so it can be challenging to manage the analysis. Results: We developed a Snakemake pipeline called CHIPS (CHromatin enrIchment ProcesSor) to streamline the processing of ChIP-seq, ATAC-seq, and DNase-seq data. The pipeline supports single- and paired-end data and is flexible to start with FASTQ or BAM files. It includes basic steps such as read trimming, mapping, and peak calling. In addition, it calculates quality control metrics such as contamination profiles, polymerase chain reaction bottleneck coefficient, the fraction of reads in peaks, percentage of peaks overlapping with the union of public DNaseI hypersensitivity sites, and conservation profile of the peaks. For downstream analysis, it carries out peak annotations, motif finding, and regulatory potential calculation for all genes. The pipeline ensures that the processing is robust and reproducible. Availability: CHIPS is available at https://github.com/liulab-dfci/CHIPS.
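Two of the QC metrics named above have simple definitions: the PCR bottleneck coefficient (PBC1) is the fraction of distinct genomic positions covered by exactly one read, and FRiP is the fraction of reads falling in called peaks. The sketch below illustrates those definitions on toy positions; it is not CHIPS's code, which computes them from BAM and peak files.

```python
from collections import Counter


def pbc1(read_positions):
    """PBC1: positions covered exactly once / distinct positions covered.

    Values near 1 indicate low PCR duplication; low values suggest
    a bottlenecked library.
    """
    depth = Counter(read_positions)
    once = sum(1 for d in depth.values() if d == 1)
    return once / len(depth)


def frip(read_positions, peaks):
    """Fraction of reads in peaks; peaks are (start, end) half-open intervals."""
    in_peak = sum(1 for p in read_positions
                  if any(s <= p < e for s, e in peaks))
    return in_peak / len(read_positions)
```

Real implementations work on alignment coordinates (chromosome, position, strand) and use interval trees for peak overlap; the linear scan here is only for clarity.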


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Shirin Moossavi ◽  
Kelsey Fehr ◽  
Ehsan Khafipour ◽  
Meghan B. Azad

Abstract Background Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. Results In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. Conclusion This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.
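The "differential prevalence of contaminants between batches" idea can be sketched concretely: a taxon detected in most samples of one batch but nearly absent in the other is a candidate reagent contaminant (different reagent lots carry different contaminants). The sketch below is an illustration of that heuristic, not the decontam algorithm or the authors' pipeline; the prevalence gap threshold and taxon names are hypothetical.

```python
def flag_batch_contaminants(prevalence_a, prevalence_b, gap=0.5):
    """Flag taxa whose detection prevalence differs sharply between batches.

    prevalence_a, prevalence_b: dict mapping taxon -> fraction of samples
    in that batch where the taxon was detected. `gap` is an illustrative
    threshold, not an established cutoff.
    """
    flagged = []
    for taxon in set(prevalence_a) | set(prevalence_b):
        pa = prevalence_a.get(taxon, 0.0)
        pb = prevalence_b.get(taxon, 0.0)
        if abs(pa - pb) >= gap:
            flagged.append(taxon)
    return sorted(flagged)
```

Flagged taxa would then be cross-checked against negative controls and known reagent contaminant genera before removal, since a genuine biological signal can also differ between batches.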


2020 ◽  
Author(s):  
Mathias Kuhring ◽  
Alina Eisenberger ◽  
Vanessa Schmidt ◽  
Nicolle Kränkel ◽  
David M. Leistner ◽  
...  

Abstract Targeted quantitative mass spectrometry metabolite profiling is the workhorse of metabolomics research. Robust and reproducible data is essential for confidence in analytical results and is particularly important with large-scale studies. Commercial kits are now available which use carefully calibrated and validated internal and external standards to provide such reliability. However, they are still subject to processing and technical errors in their use and should be subject to a laboratory's routine quality assurance and quality control measures to maintain confidence in the results. We discuss important systematic and random measurement errors when using these kits and suggest measures to detect and quantify them. We demonstrate how wider analysis of the entire dataset, alongside standard analyses of quality control samples, can be used to identify outliers and quantify systematic trends in order to improve downstream analysis. Finally, we present the MeTaQuaC software which implements the above concepts and methods for Biocrates kits and creates a comprehensive quality control report containing rich visualization and informative scores and summary statistics. Preliminary unsupervised multivariate analysis methods are also included to provide rapid insight into study variables and groups. MeTaQuaC is provided as an open source R package under a permissive MIT license and includes detailed user documentation.
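A routine measure for the QC-sample analyses described above is the relative standard deviation (RSD, also called coefficient of variation) of each metabolite across repeated QC injections: metabolites whose QC measurements vary too much are considered unreliable. This is a generic illustration of that convention, not MeTaQuaC's code; the 15% cutoff is a common community convention, not a MeTaQuaC-specific value.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation (%) of repeated QC measurements."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


def unstable_metabolites(qc_runs, cutoff=15.0):
    """Return metabolites whose QC-sample RSD exceeds the cutoff.

    qc_runs: dict mapping metabolite -> list of measured concentrations
    across repeated QC-sample injections.
    """
    return sorted(m for m, v in qc_runs.items() if rsd_percent(v) > cutoff)
```

A full report would additionally track systematic trends, e.g. drift of QC values over injection order, which a simple RSD does not capture.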


2021 ◽  
Author(s):  
Len Taing ◽  
Clara Cousins ◽  
Gali Bai ◽  
Paloma Cejas ◽  
Xintao Qiu ◽  
...  

Abstract Motivation The chromatin profile measured by ATAC-seq, ChIP-seq, or DNase-seq experiments can identify genomic regions critical in regulating gene expression and provide insights on biological processes such as diseases and development. However, quality control and processing of chromatin profiling data involve many steps, and different bioinformatics tools are used at each step. It can be challenging to manage the analysis. Results We developed a Snakemake pipeline called CHIPS (CHromatin enrIchment ProcesSor) to streamline the processing of ChIP-seq, ATAC-seq, and DNase-seq data. The pipeline supports single- and paired-end data and is flexible to start with FASTQ or BAM files. It includes basic steps such as read trimming, mapping, and peak calling. In addition, it calculates quality control metrics such as contamination profiles, PCR bottleneck coefficient, the fraction of reads in peaks, percentage of peaks overlapping with the union of public DNaseI hypersensitivity sites, and conservation profile of the peaks. For downstream analysis, it carries out peak annotations, motif finding, and regulatory potential calculation for all genes. The pipeline ensures that the processing is robust and reproducible. Availability CHIPS is available at https://github.com/liulab-dfci/CHIPS.


2019 ◽  
Author(s):  
Melissa Y Yan ◽  
Betsy Ferguson ◽  
Benjamin N Bimber

Abstract Summary Large-scale genomic studies produce millions of sequence variants, generating datasets far too massive for manual inspection. To ensure variant and genotype data are consistent and accurate, it is necessary to evaluate variants prior to downstream analysis using quality control (QC) reports. Variant call format (VCF) files are the standard format for representing variant data; however, generating summary statistics from these files is not always straightforward. While tools to summarize variant data exist, they generally produce simple text file tables, which still require additional processing and interpretation. VariantQC fills this gap as a user-friendly, interactive visual QC report that generates and concisely summarizes statistics from VCF files. The report aggregates and summarizes variants by dataset, chromosome, sample and filter type. The VariantQC report is useful for high-level dataset summary and quality control, and helps flag outliers. Furthermore, VariantQC operates on VCF files, so it can be easily integrated into many existing variant pipelines. Availability and implementation DISCVRSeq's VariantQC tool is freely available as a Java program, with the compiled JAR and source code available from https://github.com/BimberLab/DISCVRSeq/. Documentation and example reports are available at https://bimberlab.github.io/DISCVRSeq/.
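The aggregation described above (variants tallied by chromosome and filter type) amounts to counting over the fixed VCF columns. This is a minimal sketch of that idea on raw VCF text, not VariantQC itself, which is a Java tool built on the GATK engine; the example records are fabricated toy data.

```python
from collections import Counter


def summarize_vcf(vcf_text):
    """Tally variant records per chromosome and per FILTER value.

    Skips header lines (starting with '#'); relies on the fixed VCF
    column order: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    """
    by_chrom, by_filter = Counter(), Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        by_chrom[fields[0]] += 1   # CHROM column
        by_filter[fields[6]] += 1  # FILTER column
    return by_chrom, by_filter
```

Production tools would use a proper VCF parser (e.g. htsjdk or pysam) to handle multi-allelic records, per-sample genotype fields, and indexed access, none of which this sketch attempts.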


2019 ◽  
Author(s):  
Wenbao Yu ◽  
Yasin Uzun ◽  
Qin Zhu ◽  
Changya Chen ◽  
Kai Tan

Abstract Single-cell chromatin accessibility sequencing (scCAS) has become a powerful technology for understanding epigenetic heterogeneity of complex tissues. The development of several experimental protocols has led to a rapid accumulation of scCAS data. In contrast, there is a lack of open-source software tools for comprehensive processing, analysis and visualization of scCAS data generated using all existing experimental protocols. Here we present scATAC-pro for quality assessment, analysis and visualization of scCAS data. scATAC-pro provides flexible choice of methods for different data processing and analytical tasks, with carefully curated default parameters. A range of quality control metrics are computed for several key steps of the experimental protocol. scATAC-pro generates summary reports for both quality assessment and downstream analysis. It also provides additional utility functions for generating input files for various types of downstream analyses and data visualization. With the rapid accumulation of scCAS data, scATAC-pro will facilitate studies of epigenomic heterogeneity in healthy and diseased tissues.
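One early QC step common to scCAS workflows is counting fragments per cell barcode and discarding barcodes with too few fragments to represent a real cell. The sketch below illustrates that step on fragment-file-style records (tab-separated chrom, start, end, barcode); it is a generic illustration, not scATAC-pro's implementation, and the depth threshold is a placeholder.

```python
from collections import Counter


def barcode_depths(fragment_lines):
    """Count fragments per cell barcode from fragment-file records."""
    depths = Counter()
    for line in fragment_lines:
        chrom, start, end, barcode = line.split("\t")[:4]
        depths[barcode] += 1
    return depths


def pass_barcodes(depths, min_fragments=500):
    """Keep barcodes at or above a minimum fragment count.

    min_fragments is illustrative; real pipelines often pick the cutoff
    from the knee of the barcode-rank curve instead of a fixed number.
    """
    return {b for b, n in depths.items() if n >= min_fragments}
```

Downstream metrics such as TSS enrichment and fraction of fragments in peaks are then computed only for the passing barcodes.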

