hichipper: A preprocessing pipeline for assessing library quality and DNA loops from HiChIP data

Mapping Intimacies ◽

10.1101/192302 ◽

2017 ◽

Cited By ~ 1

Author(s):

Caleb Lareau ◽

Martin Aryee

Keyword(s):

Quality Control ◽

Open Source ◽

Data Preprocessing ◽

Chromatin Conformation ◽

Peak Calling ◽

Dna Loops ◽

Dna Loop ◽

Downstream Analysis ◽

Novel Protein

Mumbach et al. recently described HiChIP, a novel protein-mediated chromatin conformation assay that lowers cellular input requirements while simultaneously increasing the yield of informative reads compared to previous methods. To facilitate the dissemination and adoption of this assay, we introduce hichipper (http://aryeelab.org/hichipper), an open-source HiChIP data preprocessing tool, with features that include bias-corrected peak calling, library quality control, DNA loop calling, and output of processed data for downstream analysis and visualization.

Download Full-text

fastp: an ultra-fast all-in-one FASTQ preprocessor

10.1101/274100 ◽

2018 ◽

Cited By ~ 12

Author(s):

Shifu Chen ◽

Yanqing Zhou ◽

Yaru Chen ◽

Jia Gu

Keyword(s):

Quality Control ◽

Open Source ◽

Programming Languages ◽

Source Code ◽

Data Filtering ◽

Quality Filtering ◽

Downstream Analysis ◽

High Level ◽

Unique Molecular Identifier ◽

Adapter Trimming

AbstractMotivationQuality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming, and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g., Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient.ResultsWe developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FASTQ data. It also supports unique molecular identifier preprocessing, poly tail trimming, output splitting, and base correction for paired-end data. It can automatically detect adapters for single-end and paired-end FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools.Availability and ImplementationThe open-source code and corresponding instructions are available at https://github.com/OpenGene/[email protected]

Download Full-text

CHIPS: A Snakemake pipeline for quality control and reproducible processing of chromatin profiling data

10.1101/2021.03.09.434676 ◽

2021 ◽

Author(s):

Len Taing ◽

Clara Cousins ◽

Gali Bai ◽

Paloma Cejas ◽

Xintao Qiu ◽

...

Keyword(s):

Gene Expression ◽

Quality Control ◽

Motif Finding ◽

Biological Processes ◽

Peak Calling ◽

Quality Control Metrics ◽

Regulatory Potential ◽

Downstream Analysis ◽

Chromatin Profiling ◽

Genomic Regions

AbstractMotivationThe chromatin profile measured by ATAC-seq, ChIP-seq, or DNase-seq experiments can identify genomic regions critical in regulating gene expression and provide insights on biological processes such as diseases and development. However, quality control and processing chromatin profiling data involve many steps, and different bioinformatics tools are used at each step. It can be challenging to manage the analysis.ResultsWe developed a Snakemake pipeline called CHIPS (CHromatin enrichment Processor) to streamline the processing of ChIP-seq, ATAC-seq, and DNase-seq data. The pipeline supports single- and paired-end data and is flexible to start with FASTQ or BAM files. It includes basic steps such as read trimming, mapping, and peak calling. In addition, it calculates quality control metrics such as contamination profiles, PCR bottleneck coefficient, the fraction of reads in peaks, percentage of peaks overlapping with the union of public DNaseI hypersensitivity sites, and conservation profile of the peaks. For downstream analysis, it carries out peak annotations, motif finding, and regulatory potential calculation for all genes. The pipeline ensures that the processing is robust and reproducible.AvailabilityCHIPS is available at https://github.com/liulab-dfci/CHIPS

Download Full-text

MCR: Open-Source Software to Automate Compilation of Health Study Report-Back

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18116104 ◽

2021 ◽

Vol 18 (11) ◽

pp. 6104

Author(s):

Erin Polka ◽

Ellen Childs ◽

Alexa Friedman ◽

Kathryn S. Tomsho ◽

Birgit Claus Henn ◽

...

Keyword(s):

Quality Control ◽

Open Source ◽

Open Source Software ◽

Research Team ◽

Self Determination ◽

Health Study ◽

Equitable Access ◽

Team Members ◽

Study Participants ◽

The Impact

Sharing individualized results with health study participants, a practice we and others refer to as “report-back,” ensures participant access to exposure and health information and may promote health equity. However, the practice of report-back and the content shared is often limited by the time-intensive process of personalizing reports. Software tools that automate creation of individualized reports have been built for specific studies, but are largely not open-source or broadly modifiable. We created an open-source and generalizable tool, called the Macro for the Compilation of Report-backs (MCR), to automate compilation of health study reports. We piloted MCR in two environmental exposure studies in Massachusetts, USA, and interviewed research team members (n = 7) about the impact of MCR on the report-back process. Researchers using MCR created more detailed reports than during manual report-back, including more individualized numerical, text, and graphical results. Using MCR, researchers saved time producing draft and final reports. Researchers also reported feeling more creative in the design process and more confident in report-back quality control. While MCR does not expedite the entire report-back process, we hope that this open-source tool reduces the barriers to personalizing health study reports, promotes more equitable access to individualized data, and advances self-determination among participants.

Download Full-text

Open-source toolbox for analysis and spectra quality control of magnetic resonance spectroscopic imaging

Medical Imaging 2021: Biomedical Applications in Molecular, Structural, and Functional Imaging ◽

10.1117/12.2582186 ◽

2021 ◽

Author(s):

Danilo R. Pereira ◽

Larissa Ganaha ◽

Simone Appenzeller ◽

Leticia Rittner

Keyword(s):

Quality Control ◽

Magnetic Resonance ◽

Open Source ◽

Spectroscopic Imaging ◽

Magnetic Resonance Spectroscopic Imaging

Download Full-text

Developers, Quality Control and Download Volume in Open Source Software (OSS) Projects

Journal of Organizational and End User Computing ◽

10.4018/joeuc.2017040103 ◽

2017 ◽

Vol 29 (2) ◽

pp. 43-66

Author(s):

Geoffrey Hill ◽

Pratim Datta ◽

Candice Vander Weerdt

Keyword(s):

Quality Control ◽

Public Good ◽

Open Source ◽

Open Source Software ◽

Inflection Point ◽

Returns To Scale ◽

Tragedy Of The Commons ◽

Curvilinear Relationship ◽

Segmented Regression ◽

The Commons

The open-source software (OSS) movement is often analogized as a commons, where products are developed by and consumed in an open community. However, does a larger commons automatically beget success or does the phenomenon fall prey to the tragedy of the commons? This research forwards and empirically investigates the curvilinear relationship between developers and OSS project quality and a project's download volume. Using segmented regression on over 12,000 SourceForge OSS projects, findings suggest an inflection point in the number of contributing developers on download volume – suggesting increasing and diminishing returns to scale from adding developers to OSS projects. Findings support the economic principle of the tragedy of the commons, a concept where an over-allocated (large number) of developers, even in an open-source environment, can lead to resource mismanagement and reduce the benefit of a public good, i.e. the OSS project.

Download Full-text

DEBoost: A Python Library for Weighted Distance Ensembling in Machine Learning

10.20944/preprints202005.0354.v1 ◽

2020 ◽

Author(s):

Wei Hao Khoong

Keyword(s):

Machine Learning ◽

Open Source ◽

Data Preprocessing ◽

Weighted Distance ◽

Open Source License ◽

Classification Tasks ◽

Python Package

In this paper, we introduce deboost, a Python library devoted to weighted distance ensembling of predictions for regression and classification tasks. Its backbone resides on the scikit-learn library for default models and data preprocessing functions. It offers flexible choices of models for the ensemble as long as they contain the predict method, like the models available from scikit-learn. deboost is released under the MIT open-source license and can be downloaded from the Python Package Index (PyPI) at https://pypi.org/project/deboost. The source scripts are also available on a GitHub repository at https://github.com/weihao94/DEBoost.

Download Full-text

A major role for Eco1 in regulating cohesin-mediated mitotic chromosome folding

10.1101/589101 ◽

2019 ◽

Cited By ~ 3

Author(s):

Lise Dauban ◽

Rémi Montagne ◽

Agnès Thierry ◽

Luciana Lazar-Stefanita ◽

Olivier Gadal ◽

...

Keyword(s):

Mitotic Chromosome ◽

Dna Looping ◽

Loop Formation ◽

Dna Loops ◽

Sister Chromatids ◽

Topologically Associating Domains ◽

Dna Loop ◽

Main Barrier ◽

Structural Maintenance Of Chromosomes ◽

Mitotic Chromatin

AbstractUnderstanding how chromatin organizes spatially into chromatid and how sister chromatids are maintained together during mitosis is of fundamental importance in chromosome biology. Cohesin, a member of the Structural Maintenance of Chromosomes (SMC) complex family, holds sister chromatids together 1–3 and promotes long-range intra-chromatid DNA looping 4,5. These cohesin-mediated DNA loops are important for both higher-order mitotic chromatin compaction6,7 and, in some organisms, compartmentalization of chromosomes during interphase into topologically associating domains (TADs) 8,9. Our understanding of the mechanism(s) by which cohesin generates large DNA loops remains incomplete. It involves a combination of molecular partners and active expansion/extrusion of DNA loops. Here we dissect the roles on loop formation of three partners of the cohesin complex: Pds5 10, Wpl1 11 and Eco1 acetylase 12, during yeast mitosis. We identify a new function for Eco1 in negatively regulating cohesin translocase activity, which powers loop extrusion. In the absence of negative regulation, the main barrier to DNA loop expansion appears to be the centromere. Those results provide new insights on the mechanisms regulating cohesin dependent DNA looping.

Download Full-text

A (fire)cloud-based DNA methylation data preprocessing and quality control platform

BMC Bioinformatics ◽

10.1186/s12859-019-2750-4 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Divy Kangeyan ◽

Andrew Dunford ◽

Sowmya Iyer ◽

Chip Stewart ◽

Megan Hanna ◽

...

Keyword(s):

Dna Methylation ◽

Quality Control ◽

Data Preprocessing ◽

Methylation Data

Download Full-text

HistoQC: An Open-Source Quality Control Tool for Digital Pathology Slides

JCO Clinical Cancer Informatics ◽

10.1200/cci.18.00157 ◽

2019 ◽

pp. 1-7 ◽

Cited By ~ 13

Author(s):

Andrew Janowczyk ◽

Ren Zuo ◽

Hannah Gilmore ◽

Michael Feldman ◽

Anant Madabhushi

Keyword(s):

Quality Control ◽

Open Source ◽

Computational Analysis ◽

Digital Pathology ◽

Control Process ◽

The Cancer Genome Atlas ◽

Quality Control Process ◽

Quality Control Tool ◽

Edge Detectors ◽

Active Research

PURPOSE Digital pathology (DP), referring to the digitization of tissue slides, is beginning to change the landscape of clinical diagnostic workflows and has engendered active research within the area of computational pathology. One of the challenges in DP is the presence of artefacts and batch effects, unintentionally introduced during both routine slide preparation (eg, staining, tissue folding) and digitization (eg, blurriness, variations in contrast and hue). Manual review of glass and digital slides is laborious, qualitative, and subject to intra- and inter-reader variability. Therefore, there is a critical need for a reproducible automated approach of precisely localizing artefacts to identify slides that need to be reproduced or regions that should be avoided during computational analysis. METHODS Here we present HistoQC, a tool for rapidly performing quality control to not only identify and delineate artefacts but also discover cohort-level outliers (eg, slides stained darker or lighter than others in the cohort). This open-source tool employs a combination of image metrics (eg, color histograms, brightness, contrast), features (eg, edge detectors), and supervised classifiers (eg, pen detection) to identify artefact-free regions on digitized slides. These regions and metrics are presented to the user via an interactive graphical user interface, facilitating artefact detection through real-time visualization and filtering. These same metrics afford users the opportunity to explicitly define acceptable tolerances for their workflows. RESULTS The output of HistoQC on 450 slides from The Cancer Genome Atlas was reviewed by two pathologists and found to be suitable for computational analysis more than 95% of the time. CONCLUSION These results suggest that HistoQC could provide an automated, quantifiable, quality control process for identifying artefacts and measuring slide quality, in turn helping to improve both the repeatability and robustness of DP workflows.

Download Full-text

Offline Next Generation Metagenomics Sequence Analysis Using MinION Detection Software (MINDS)

Genes ◽

10.3390/genes10080578 ◽

2019 ◽

Vol 10 (8) ◽

pp. 578 ◽

Cited By ~ 6

Author(s):

Deshpande ◽

Reed ◽

Sullivan ◽

Kerkhof ◽

Beigel ◽

...

Keyword(s):

Open Source ◽

Cost Effective ◽

Ease Of Use ◽

Remote Locations ◽

Resource Limited ◽

Field Samples ◽

Sample Characterization ◽

Internet Availability ◽

Detection Software ◽

Downstream Analysis

Field laboratories interested in using the MinION often need the internet to perform sample analysis. Thus, the lack of internet connectivity in resource-limited or remote locations renders downstream analysis problematic, resulting in a lack of sample identification in the field. Due to this dependency, field samples are generally transported back to the lab for analysis where internet availability for downstream analysis is available. These logistics problems and the time lost in sample characterization and identification, pose a significant problem for field scientists. To address this limitation, we have developed a stand-alone data analysis packet using open source tools developed by the Nanopore community that does not depend on internet availability. Like Oxford Nanopore Technologies’ (ONT) cloud-based What’s In My Pot (WIMP) software, we developed the offline MinION Detection Software (MINDS) based on the Centrifuge classification engine for rapid species identification. Several online bioinformatics applications have been developed surrounding ONT’s framework for analysis of long reads. We have developed and evaluated an offline real time classification application pipeline using open source tools developed by the Nanopore community that does not depend on internet availability. Our application has been tested on ATCC’s 20 strain even mix whole cell (ATCC MSA-2002) sample. Using the Rapid Sequencing Kit (SQK-RAD004), we were able to identify all 20 organisms at species level. The analysis was performed in 15 min using a Dell Precision 7720 laptop. Our offline downstream bioinformatics application provides a cost-effective option as well as quick turn-around time when analyzing samples in the field, thus enabling researchers to fully utilize ONT’s MinION portability, ease-of-use, and identification capability in remote locations.

Download Full-text