scholarly journals pyBedGraph: a python package for fast operations on 1D genomic signal tracks

2020 ◽  
Vol 36 (10) ◽  
pp. 3234-3235
Author(s):  
Henry B Zhang ◽  
Minji Kim ◽  
Jeffrey H Chuang ◽  
Yijun Ruan

Abstract Motivation Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop. Availability and implementation pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Henry B. Zhang ◽  
Minji Kim ◽  
Jeffrey H. Chuang ◽  
Yijun Ruan

AbstractMotivationModern genomic research relies heavily on next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed.ResultsWe developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph file. When tested on 8 ChIP-seq and ATAC-seq datasets, pyBedGraph is on average 245 times faster than the existing program. Notably, pyBedGraph can look up the exact mean signal of 1 million regions in ~0.26 second on a conventional laptop. An approximate mean for 10,000 regions can be computed in ~0.0012 second with an error rate of less than 5 percent.AvailabilitypyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.


2020 ◽  
Vol 36 (9) ◽  
pp. 2705-2711 ◽  
Author(s):  
Gianvito Urgese ◽  
Emanuele Parisi ◽  
Orazio Scicolone ◽  
Santa Di Cataldo ◽  
Elisa Ficarra

Abstract Motivation High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the 2-fold advantage of reducing file size and speeding-up the alignment, avoiding to map the same sequence multiple times. Method BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. Results Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion of reads into 963 million of unique tags reducing the size of sequence files up to 70% and speeding-up the alignment by 50% at least. Availability and implementation BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (23) ◽  
pp. 5039-5047 ◽  
Author(s):  
Gabrielle Deschamps-Francoeur ◽  
Vincent Boivin ◽  
Sherif Abou Elela ◽  
Michelle S Scott

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons. Availability and implementation The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (20) ◽  
pp. 4173-4175 ◽  
Author(s):  
Ling-Hong Hung ◽  
Wes Lloyd ◽  
Radhika Agumbe Sridhar ◽  
Saranya Devi Athmalingam Ravishankar ◽  
Yuguang Xiong ◽  
...  

Abstract Summary For many next generation-sequencing pipelines, the most computationally intensive step is the alignment of reads to a reference sequence. As a result, alignment software such as the Burrows-Wheeler Aligner is optimized for speed and is often executed in parallel on the cloud. However, there are other less demanding steps that can also be optimized to significantly increase the speed especially when using many threads. We demonstrate this using a unique molecular identifier RNA-sequencing pipeline consisting of 3 steps: split, align, and merge. Optimization of all three steps yields a 40% increase in speed when executed using a single thread. However, when executed using 16 threads, we observe a 4-fold improvement over the original parallel implementation and more than an 8-fold improvement over the original single-threaded implementation. In contrast, optimizing only the alignment step results in just a 13% improvement over the original parallel workflow using 16 threads. Availability and implementation Code (M.I.T. license), supporting scripts and Dockerfiles are available at https://github.com/BioDepot/LINCS_RNAseq_cpp and Docker images at https://hub.docker.com/r/biodepot/rnaseq-umi-cpp/ Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Stavros Makrodimitris ◽  
Marcel J.T. Reinders ◽  
Roeland C.H.J. van Ham

AbstractMotivationCo-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similar functioning genes that the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest.ResultsTo address both types of effects, we developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression, and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance.AvailabilityMLC is available as a Python package at www.github.com/stamakro/[email protected] informationSupplementary data are available online.


2019 ◽  
Vol 35 (22) ◽  
pp. 4837-4839 ◽  
Author(s):  
Hanna Julienne ◽  
Huwenbo Shi ◽  
Bogdan Pasaniuc ◽  
Hugues Aschard

Abstract Motivation Multi-trait analyses using public summary statistics from genome-wide association studies (GWASs) are becoming increasingly popular. A constraint of multi-trait methods is that they require complete summary data for all traits. Although methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect size. This is benign for univariate analyses where only variants with large effect size are selected a posteriori. However, it can lead to strong p-value inflation in multi-trait testing. Here we present a new approach that improve the existing imputation methods and reach a precision suitable for multi-trait analyses. Results We fine-tuned parameters to obtain a very high accuracy imputation from summary statistics. We demonstrate this accuracy for variants of all effect sizes on real data of 28 GWAS. We implemented the resulting methodology in a python package specially designed to efficiently impute multiple GWAS in parallel. Availability and implementation The python package is available at: https://gitlab.pasteur.fr/statistical-genetics/raiss, its accompanying documentation is accessible here http://statistical-genetics.pages.pasteur.fr/raiss/. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (19) ◽  
pp. 4955-4956
Author(s):  
Lili Dong ◽  
Jianan Wang ◽  
Guohua Wang

Abstract Summary Allele-specific expression (ASE) is involved in many important biological mechanisms. We present a python package BYASE and its graphical user interface (GUI) tool BYASE-GUI for the identification of ASE from single-end and paired-end RNA-seq data based on Bayesian inference, which can simultaneously report differences in gene-level and isoform-level expression. BYASE uses both phased SNPs and non-phased SNPs, and supports polyploid organisms. Availability and implementation The source codes of BYASE and BYASE-GUI are freely available at https://github.com/ncjllld/byase and https://github.com/ncjllld/byase_gui. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Claire Rioualen ◽  
Lucie Charbonnier-Khamvongsa ◽  
Jacques van Helden

AbstractSummaryNext-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules enabling to compose modular and user-configurable workflows, and show its usage with analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data.AvailabilityThe code is freely available (github.com/SnakeChunks/SnakeChunks), and documented with tutorials and illustrative demos (snakechunks.readthedocs.io)[email protected], [email protected] informationSupplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (22) ◽  
pp. 4806-4808 ◽  
Author(s):  
Hein Chun ◽  
Sangwoo Kim

Abstract Summary Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation BAMixChecker is available at https://github.com/heinc1010/BAMixChecker Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Damien Farrell

ABSTRACTThe use of next generation sequencing is now a standard approach to elucidate the small non-coding RNA species (sncRNAs) present in tissue and biofluid samples. This has revealed the wide variety of RNAs with regulatory functions the best studied of which are microRNAs. Profiling of sncRNAs by deep sequencing allows measures of absolute abundance and for the discovery of novel species that have eluded previous methods. Specific considerations must be made when quantifying and cataloging sncRNAs and multiple algorithms are now available, mostly focused on miRNA analysis. smallrnaseq is a Python package that implements some of the standard approaches for quantification and analysis of sncRNAs. This includes miRNA quantification and novel miRNA prediction. A command line interface makes the software accessible for general users.


Sign in / Sign up

Export Citation Format

Share Document