pyBedGraph: a python package for fast operations on 1D genomic signal tracks

Henry B Zhang; Minji Kim; Jeffrey H Chuang; Yijun Ruan

doi:10.1093/bioinformatics/btaa061


Bioinformatics ◽

10.1093/bioinformatics/btaa061 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3234-3235

Author(s):

Henry B Zhang ◽

Minji Kim ◽

Jeffrey H Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Supplementary Information ◽

Summary Statistics ◽

RNA-Seq ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing

Abstract Motivation Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop. Availability and implementation pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
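
The core task described above, looking up the exact mean signal over many intervals, can be illustrated with a prefix-sum table. The following is a minimal sketch under stated assumptions, not pyBedGraph's API or its actual implementation: the function names `build_prefix_sums` and `interval_mean` and the toy `coverage` list are hypothetical, and the signal is assumed to be already expanded to one value per base.

```python
from itertools import accumulate


def build_prefix_sums(signal):
    """Return a list P where P[i] is the sum of signal[0:i]."""
    return [0.0] + list(accumulate(signal))


def interval_mean(prefix, start, end):
    """Exact mean of the signal over the half-open interval [start, end)."""
    if not 0 <= start < end <= len(prefix) - 1:
        raise ValueError("interval is empty or out of bounds")
    return (prefix[end] - prefix[start]) / (end - start)


# Hypothetical per-base coverage for a 10 bp toy chromosome.
coverage = [0, 0, 2, 2, 2, 5, 5, 1, 0, 0]
prefix = build_prefix_sums(coverage)

# Each query costs O(1) once the prefix sums are built.
regions = [(2, 5), (0, 10), (5, 7)]
means = [interval_mean(prefix, s, e) for s, e in regions]
print(means)
```

Building the table costs one pass over the signal; afterwards every interval query is a single subtraction and division, which is one common way to make repeated region lookups fast.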


pyBedGraph: a Python package for fast operations on 1-dimensional genomic signal tracks

10.1101/709683 ◽

2019 ◽

Author(s):

Henry B. Zhang ◽

Minji Kim ◽

Jeffrey H. Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Summary Statistics ◽

Text Format ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing ◽

Genomic Signal

Abstract Motivation Modern genomic research relies heavily on next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph file. When tested on 8 ChIP-seq and ATAC-seq datasets, pyBedGraph is on average 245 times faster than the existing program. Notably, pyBedGraph can look up the exact mean signal of 1 million regions in ~0.26 seconds on a conventional laptop. An approximate mean for 10,000 regions can be computed in ~0.0012 seconds with an error rate of less than 5 percent. Availability pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.


BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis

Bioinformatics ◽

10.1093/bioinformatics/btaa051 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2705-2711 ◽

Cited By ~ 2

Author(s):

Gianvito Urgese ◽

Emanuele Parisi ◽

Orazio Scicolone ◽

Santa Di Cataldo ◽

Elisa Ficarra

Keyword(s):

Sequence Analysis ◽

Supplementary Information ◽

Sorting Algorithm ◽

RNA-Seq ◽

Compact Sets ◽

Analysis Pipeline ◽

Alignment Algorithms ◽

External Sorting ◽

Computational Resources ◽

Generation Sequencing

Abstract Motivation High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the twofold advantage of reducing file size and speeding up the alignment, since the same sequence is not mapped multiple times. Method BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. Results Our extensive experiments on RNA-Seq data show that BioSeqZip considerably reduces the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. Availability and implementation BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information Supplementary data are available at Bioinformatics online.
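
The idea of collapsing redundant reads while tracking their occurrences can be sketched in a few lines. This is an illustrative in-memory toy, not BioSeqZip's implementation: it does not use external sorting, it ignores quality scores, and the names `collapse_reads` and `expand_reads` are hypothetical.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse reads into a sorted list of (sequence, occurrences) pairs."""
    counts = Counter(reads)
    return sorted(counts.items())


def expand_reads(collapsed):
    """Re-expand collapsed pairs into one entry per original occurrence.

    The original read order is not preserved in this sketch; only the
    multiset of sequences is restored.
    """
    return [seq for seq, n in collapsed for _ in range(n)]


# Hypothetical toy input: six reads, three distinct sequences.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "GGCC", "TTGA"]
collapsed = collapse_reads(reads)
print(collapsed)
print(len(collapsed), "unique sequences from", len(reads), "reads")
```

An aligner then needs to process each distinct sequence only once, and the stored counts allow per-read results to be recovered afterwards.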


CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

Bioinformatics ◽

10.1093/bioinformatics/btz433 ◽

2019 ◽

Vol 35 (23) ◽

pp. 5039-5047 ◽

Cited By ~ 6

Author(s):

Gabrielle Deschamps-Francoeur ◽

Vincent Boivin ◽

Sherif Abou Elela ◽

Michelle S Scott

Keyword(s):

Supplementary Information ◽

RNA-Seq ◽

Non-Coding RNA ◽

Abundance Estimates ◽

Gene Coverage ◽

Nested Genes ◽

Quantification Accuracy ◽

Whole Transcriptome Analysis ◽

Whole Transcriptome ◽

Generation Sequencing

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA, as validated by PCR and bedGraph comparisons. Availability and implementation The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information Supplementary data are available at Bioinformatics online.
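
Proportional distribution of multimapped reads can be sketched as follows. The abstract does not state what the shares are proportional to, so this example assumes, purely for illustration, that each gene's share follows its uniquely mapped read count, with an even split when no candidate gene has unique reads. The function `distribute_multimapped` and the gene names are hypothetical and are not part of CoCo.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """Add multimapped reads to per-gene counts in proportion to unique counts.

    unique_counts: dict mapping gene -> number of uniquely mapped reads.
    multimapped_groups: list of (genes, n_reads) pairs, where `genes` is a
        tuple of candidate genes and `n_reads` the reads shared among them.
    """
    totals = {gene: float(count) for gene, count in unique_counts.items()}
    for genes, n_reads in multimapped_groups:
        weight_sum = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if weight_sum > 0:
                share = unique_counts.get(g, 0) / weight_sum
            else:
                # Assumption: split evenly when no gene has unique reads.
                share = 1 / len(genes)
            totals[g] = totals.get(g, 0.0) + n_reads * share
    return totals


# Hypothetical counts: 8 reads map equally well to geneA and geneB.
unique = {"geneA": 30, "geneB": 10}
groups = [(("geneA", "geneB"), 8)]
corrected = distribute_multimapped(unique, groups)
print(corrected)
```

Under this assumption the total read count is conserved: the 8 shared reads are split 6 to 2, rather than being discarded or assigned to a single gene.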


Holistic optimization of an RNA-seq workflow for multi-threaded environments

Bioinformatics ◽

10.1093/bioinformatics/btz169 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4173-4175 ◽

Cited By ~ 3

Author(s):

Ling-Hong Hung ◽

Wes Lloyd ◽

Radhika Agumbe Sridhar ◽

Saranya Devi Athmalingam Ravishankar ◽

Yuguang Xiong ◽

...

Keyword(s):

Parallel Implementation ◽

Reference Sequence ◽

Supplementary Information ◽

RNA-Seq ◽

Computationally Intensive ◽

Holistic Optimization ◽

Parallel Workflow ◽

Unique Molecular Identifier ◽

Generation Sequencing ◽

Alignment Step

Abstract Summary For many next-generation sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed, especially when using many threads. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of three steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads. Availability and implementation Code (M.I.T. license), supporting scripts and Dockerfiles are available at https://github.com/BioDepot/LINCS_RNAseq_cpp and Docker images at https://hub.docker.com/r/biodepot/rnaseq-umi-cpp/. Supplementary information Supplementary data are available at Bioinformatics online.


Metric Learning on Expression Data for Gene Function Prediction

10.1101/651042 ◽

2019 ◽

Author(s):

Stavros Makrodimitris ◽

Marcel J.T. Reinders ◽

Roeland C.H.J. van Ham

Keyword(s):

Pearson Correlation ◽

Metric Learning ◽

Specific Weight ◽

Supplementary Information ◽

Expression Data ◽

RNA-Seq ◽

Experimental Conditions ◽

Expression Of Genes ◽

Guilt By Association ◽

Python Package

Abstract Motivation Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that, when the purpose is to find genes with similar functions, the co-expression of genes should be determined not on all samples but only on those samples informative for the GO term of interest. Results To address both types of effects, we developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression, and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Availability MLC is available as a Python package at www.github.com/stamakro/[email protected] Supplementary information Supplementary data are available online.
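
A weighted co-expression measure of the kind described above can be illustrated with the standard weighted Pearson correlation. This sketch only evaluates the correlation for given sample weights; it does not implement MLC's learning of the weights, and the function name `weighted_pearson`, the toy expression vectors and the hand-picked weights are all hypothetical.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of two genes over four samples; the last sample
# behaves like an artefact that distorts the co-expression signal.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 100.0]

r_unweighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1])
r_weighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 0])  # artefact down-weighted
print(r_unweighted, r_weighted)
```

With uniform weights the function reduces to the ordinary Pearson correlation; giving the uninformative sample zero weight recovers the linear relationship seen in the remaining samples.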


RAISS: robust and accurate imputation from summary statistics

Bioinformatics ◽

10.1093/bioinformatics/btz466 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4837-4839 ◽

Cited By ~ 1

Author(s):

Hanna Julienne ◽

Huwenbo Shi ◽

Bogdan Pasaniuc ◽

Hugues Aschard

Keyword(s):

Effect Size ◽

Association Studies ◽

Real Data ◽

Supplementary Information ◽

P Value ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Small Effect Size ◽

Python Package

Abstract Motivation Multi-trait analyses using public summary statistics from genome-wide association studies (GWASs) are becoming increasingly popular. A constraint of multi-trait methods is that they require complete summary data for all traits. Although methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect size. This is benign for univariate analyses where only variants with large effect size are selected a posteriori. However, it can lead to strong p-value inflation in multi-trait testing. Here we present a new approach that improves the existing imputation methods and reaches a precision suitable for multi-trait analyses. Results We fine-tuned parameters to obtain a very high accuracy imputation from summary statistics. We demonstrate this accuracy for variants of all effect sizes on real data of 28 GWASs. We implemented the resulting methodology in a Python package specially designed to efficiently impute multiple GWASs in parallel. Availability and implementation The Python package is available at https://gitlab.pasteur.fr/statistical-genetics/raiss; its accompanying documentation is available at http://statistical-genetics.pages.pasteur.fr/raiss/. Supplementary information Supplementary data are available at Bioinformatics online.


BYASE: a Python library for estimating gene and isoform level allele-specific expression

Bioinformatics ◽

10.1093/bioinformatics/btaa636 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4955-4956

Author(s):

Lili Dong ◽

Jianan Wang ◽

Guohua Wang

Keyword(s):

Graphical User Interface ◽

Supplementary Information ◽

RNA-Seq ◽

Biological Mechanisms ◽

Specific Expression ◽

Allele Specific Expression ◽

Source Codes ◽

Gene Level ◽

Allele Specific ◽

Python Package

Abstract Summary Allele-specific expression (ASE) is involved in many important biological mechanisms. We present a Python package, BYASE, and its graphical user interface (GUI) tool, BYASE-GUI, for the identification of ASE from single-end and paired-end RNA-seq data based on Bayesian inference, which can simultaneously report differences in gene-level and isoform-level expression. BYASE uses both phased SNPs and non-phased SNPs, and supports polyploid organisms. Availability and implementation The source code of BYASE and BYASE-GUI is freely available at https://github.com/ncjllld/byase and https://github.com/ncjllld/byase_gui. Supplementary information Supplementary data are available at Bioinformatics online.


SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses

10.1101/165191 ◽

2017 ◽

Author(s):

Claire Rioualen ◽

Lucie Charbonnier-Khamvongsa ◽

Jacques van Helden

Keyword(s):

Next Generation Sequencing ◽

Life Sciences ◽

Supplementary Information ◽

Supplementary Data ◽

RNA-Seq ◽

Genome Wide ◽

Domains Of Life ◽

Supplementary Material ◽

Next Generation Sequencing (NGS) ◽

Generation Sequencing

Abstract Summary Next-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules that enable users to compose modular and user-configurable workflows, and show its usage with analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data. Availability The code is freely available (github.com/SnakeChunks/SnakeChunks), and documented with tutorials and illustrative demos (snakechunks.readthedocs.io). Supplementary information Supplementary data are available at Bioinformatics online.


BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

RNA-Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract Summary Mislabeling in the process of next-generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation BAMixChecker is available at https://github.com/heinc1010/BAMixChecker. Supplementary information Supplementary data are available at Bioinformatics online.


smallrnaseq: short non coding RNA-Seq analysis with Python

10.1101/110585 ◽

2017 ◽

Cited By ~ 1

Author(s):

Damien Farrell

Keyword(s):

Novel Species ◽

Command Line ◽

RNA-Seq ◽

Absolute Abundance ◽

Command Line Interface ◽

Novel miRNA ◽

Non-Coding RNA ◽

Regulatory Functions ◽

Python Package ◽

Generation Sequencing

Abstract The use of next-generation sequencing is now a standard approach to elucidate the small non-coding RNA species (sncRNAs) present in tissue and biofluid samples. This has revealed the wide variety of RNAs with regulatory functions, the best studied of which are microRNAs. Profiling of sncRNAs by deep sequencing allows measurement of absolute abundance and the discovery of novel species that have eluded previous methods. Specific considerations must be made when quantifying and cataloging sncRNAs, and multiple algorithms are now available, mostly focused on miRNA analysis. smallrnaseq is a Python package that implements some of the standard approaches for quantification and analysis of sncRNAs. This includes miRNA quantification and novel miRNA prediction. A command line interface makes the software accessible for general users.
