ngsReports: a Bioconductor package for managing FastQC reports and other NGS related log files

Christopher M Ward; Thu-Hien To; Stephen M Pederson

doi:10.1093/bioinformatics/btz937

ngsReports: a Bioconductor package for managing FastQC reports and other NGS related log files

Bioinformatics ◽

10.1093/bioinformatics/btz937 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2587-2588 ◽

Cited By ~ 10

Author(s):

Christopher M Ward ◽

Thu-Hien To ◽

Stephen M Pederson

Keyword(s):

Quality Control ◽

R Package ◽

Supplementary Information ◽

Bioconductor Package ◽

Supplementary Data ◽

Large Sample ◽

Log Files ◽

Shiny App ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Motivation High throughput next generation sequencing (NGS) has become exceedingly cheap, facilitating studies to be undertaken containing large sample numbers. Quality control (QC) is an essential stage during analytic pipelines and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power when carrying out QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC reports along with outputs from other tools. Visualization can be carried out across many samples using default, highly customizable plots with options to perform hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive shiny app or HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

ngsReports: An R Package for managing FastQC reports and other NGS related log files

10.1101/313148 ◽

2018 ◽

Cited By ~ 1

Author(s):

Christopher M. Ward ◽

Hein To ◽

Stephen M Pederson

Keyword(s):

Quality Control ◽

Gc Content ◽

R Package ◽

Dependent Manner ◽

Batch Effects ◽

Large Sample ◽

Log Files ◽

Shiny App ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

AbstractMotivationHigh throughput next generation sequencing (NGS) has become exceedingly cheap facilitating studies to be undertaken containing large sample numbers. Quality control (QC) is an essential stage during analytic pipelines and can be found in the outputs of popular bioinformatics tools such as FastQC and Picard. Although these tools provide considerable power when carrying out QC, large sample numbers can make identification of systemic bias a challenge.ResultsWe present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC output as well as that from aligners such as HISAT2, STAR and Bowtie2. Visualization can be carried out across many samples using heatmaps rendered using ggplot2 and plotly. Moreover, these can be displayed in an interactive shiny app or a HTML report. We also provide methods to assess observed GC content in an organism dependent manner for both transcriptomic and genomic datasets. Importantly, hierarchical clustering can be carried out on heatmaps with large sample sizes to quickly identify outliers and batch effects.Availability and ImplementationngsReports is available at https://github.com/UofABioinformaticsHub/ngsReports.

Download Full-text

EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing Ngs ◽

Adapter Trimming ◽

Generation Sequencing

Abstract Motivation Cross-sample comparisons or large-scale meta-analyses based on the next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). While modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA), large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate its speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach comparable accuracy and higher throughput than that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated in any sequence analysis pipelines in all scales. Availability and implementation EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

GladiaTOX: GLobal Assessment of Dose-IndicAtor in TOXicology

Bioinformatics ◽

10.1093/bioinformatics/btz187 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4190-4192 ◽

Cited By ~ 4

Author(s):

Vincenzo Belcastro ◽

Stephane Cano ◽

Diego Marescotti ◽

Stefano Acali ◽

Carine Poussin ◽

...

Keyword(s):

Quality Control ◽

Data Processing ◽

Web Service ◽

Biomedical Research ◽

Global Assessment ◽

R Package ◽

Supplementary Information ◽

High Content Screening ◽

Supplementary Data ◽

Severity Scores

Abstract Summary GladiaTOX R package is an open-source, flexible solution to high-content screening data processing and reporting in biomedical research. GladiaTOX takes advantage of the ‘tcpl’ core functionalities and provides a number of extensions: it provides a web-service solution to fetch raw data; it computes severity scores and exports ToxPi formatted files; furthermore it contains a suite of functionalities to generate PDF reports for quality control and data processing. Availability and implementation GladiaTOX R package (bioconductor). Also available via: git clone https://github.com/philipmorrisintl/GladiaTOX.git. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Issues of Z-factor and an approach to avoid them for quality control in high-throughput screening studies

Bioinformatics ◽

10.1093/bioinformatics/btaa1049 ◽

2020 ◽

Author(s):

Xiaohua Douglas Zhang ◽

Dandan Wang ◽

Shixue Sun ◽

Heping Zhang

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Screening ◽

Theoretical Basis ◽

Sampling Error ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Automation Technology ◽

General Public License

Abstract Motivation High-throughput screening (HTS) is a vital automation technology in biomedical research in both industry and academia. The well-known Z-factor has been widely used as a gatekeeper to assure assay quality in an HTS study. However, many researchers and users may not have realized that Z-factor has major issues. Results In this article, the following four major issues are explored and demonstrated so that researchers may use the Z-factor appropriately. First, the Z-factor violates the Pythagorean theorem of statistics. Second, there is no adjustment of sampling error in the application of the Z-factor for quality control (QC) in HTS studies. Third, the expectation of the sample-based Z-factor does not exist. Fourth, the thresholds in the Z-factor-based criterion lack a theoretical basis. Here, an approach to avoid these issues was proposed and new QC criteria under homoscedasticity were constructed so that researchers can choose a statistically grounded criterion for QC in the HTS studies. We implemented this approach in an R package and demonstrated its utility in multiple CRISPR/CAS9 or siRNA HTS studies. Availability and implementation The R package qcSSMDhomo is freely available from GitHub: https://github.com/Karena6688/qcSSMDhomo. The file qcSSMDhomo_1.0.0.tar.gz (for Windows) containing qcSSMDhomo is also available at Bioinformatics online. qcSSMDhomo is distributed under the GNU General Public License. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

gpart: human genome partitioning and visualization of high-density SNP data by identifying haplotype blocks

Bioinformatics ◽

10.1093/bioinformatics/btz308 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4419-4421 ◽

Cited By ~ 3

Author(s):

Sun Ah Kim ◽

Myriam Brossard ◽

Delnaz Roshandel ◽

Andrew D Paterson ◽

Shelley B Bull ◽

...

Keyword(s):

Clustering Algorithms ◽

R Package ◽

Supplementary Information ◽

Visualization Tool ◽

Sequencing Data ◽

Haplotype Blocks ◽

Snp Data ◽

Computing Environments ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Summary For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package gpart which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments. Availability and implementation The R package is available at https://bioconductor.org/packages/gpart. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

ensembldb: an R package to create and use Ensembl-based annotation resources

Bioinformatics ◽

10.1093/bioinformatics/btz031 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3151-3153 ◽

Cited By ~ 13

Author(s):

Johannes Rainer ◽

Laurent Gatto ◽

Christian X Weichenberger

Keyword(s):

Coordinate System ◽

Positional Information ◽

R Package ◽

Research Data ◽

Supplementary Information ◽

Reproducible Research ◽

Bioconductor Package ◽

Supplementary Data ◽

Reference Coordinate System ◽

Associated Data

Abstract Summary Bioinformatics research frequently involves handling gene-centric data such as exons, transcripts, proteins and their positions relative to a reference coordinate system. The ensembldb Bioconductor package retrieves and stores Ensembl-based genetic annotations and positional information, and furthermore offers identifier conversion and coordinates mappings for gene-associated data. In support of reproducible research, data are tied to Ensembl releases and are kept separately from the software. Premade data packages are available for a variety of genomes and Ensembl releases. Three examples demonstrate typical use cases of this software. Availability and implementation ensembldb is part of Bioconductor (https://bioconductor.org/packages/ensembldb). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

MsPAC: a tool for haplotype-phased structural variant detection

Bioinformatics ◽

10.1093/bioinformatics/btz618 ◽

2019 ◽

Vol 36 (3) ◽

pp. 922-924 ◽

Cited By ~ 3

Author(s):

Oscar L Rodriguez ◽

Anna Ritz ◽

Andrew J Sharp ◽

Ali Bashir

Keyword(s):

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Structural Variant ◽

Long Read ◽

One Step ◽

Variant Detection ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Summary While next-generation sequencing (NGS) has dramatically increased the availability of genomic data, phased genome assembly and structural variant (SV) analyses are limited by NGS read lengths. Long-read sequencing from Pacific Biosciences and NGS barcoding from 10x Genomics hold the potential for far more comprehensive views of individual genomes. Here, we present MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software) and convert assemblies into high-quality, phased SV predictions. MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved, diploid genomes. Availability and implementation https://github.com/oscarlr/MsPAC. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses

10.1101/165191 ◽

2017 ◽

Author(s):

Claire Rioualen ◽

Lucie Charbonnier-Khamvongsa ◽

Jacques van Helden

Keyword(s):

Next Generation Sequencing ◽

Life Sciences ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Genome Wide ◽

Domains Of Life ◽

Supplementary Material ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

AbstractSummaryNext-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules enabling to compose modular and user-configurable workflows, and show its usage with analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data.AvailabilityThe code is freely available (github.com/SnakeChunks/SnakeChunks), and documented with tutorials and illustrative demos (snakechunks.readthedocs.io)[email protected], [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

NACHO: an R package for quality control of NanoString nCounter data

Bioinformatics ◽

10.1093/bioinformatics/btz647 ◽

2019 ◽

Author(s):

Mickaël Canouil ◽

Gerard A Bouland ◽

Amélie Bonnefond ◽

Philippe Froguel ◽

Leen M ’t Hart ◽

...

Keyword(s):

Quality Control ◽

R Package ◽

Third Parties ◽

Supplementary Information ◽

Expression Data ◽

Supplementary Data ◽

Scaling Factors ◽

Comprehensive Method ◽

Normalization Methods

Abstract Summary The NanoStringTM nCounter® is a platform for the targeted quantification of expression data in biofluids and tissues. While software by the manufacturer is available in addition to third parties packages, they do not provide a complete quality control (QC) pipeline. Here, we present NACHO (‘NAnostring quality Control dasHbOard’), a comprehensive QC R-package. The package consists of three subsequent steps: summarize, visualize and normalize. The summarize function collects all the relevant data and stores it in a tidy format, the visualize function initiates a dashboard with plots of the relevant QC outcomes. It contains QC metrics that are measured by default by the manufacturer, but also calculates other insightful measures, including the scaling factors that are needed in the normalization step. In this normalization step, different normalization methods can be chosen to optimally preprocess data. Together, NACHO is a comprehensive method that optimizes insight and preprocessing of nCounter® data. Availability and implementation NACHO is available as an R-package on CRAN and the development version on GitHub https://github.com/mcanouil/NACHO. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

DEsingle for detecting three types of differential expression in single-cell RNA-seq data

10.1101/173997 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zhun Miao ◽

Ke Deng ◽

Xiaowo Wang ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Differential Expression ◽

Negative Binomial ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Binomial Model ◽

Supplementary Data ◽

Rna Seq ◽

Real Zeros

AbstractSummaryThe excessive amount of zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package DEsingle which employed Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy.Availability and ImplementationThe R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] informationSupplementary data are available at bioRxiv online.

Download Full-text