Bedshift: perturbation of genomic interval sets

Functional genomics experiments such as ChIP-Seq or ATAC-Seq produce data summarized as a region set. Many tools have been developed to analyze region sets, including computing similarity metrics to compare them. However, there is no way to objectively evaluate the effectiveness of region set similarity metrics. In this paper we present bedshift, a command-line tool and Python API to generate new BED files by making random perturbations to an original BED file. Perturbed files have known similarity to the original file and are therefore useful to benchmark similarity metrics. To demonstrate, we used bedshift to create an evaluation dataset of 3,600 perturbed files generated by shifting, adding, and dropping regions from a reference BED file. Then, we compared four similarity metrics: Jaccard score, coverage score, Euclidean distance, and cosine similarity. The results show that the Jaccard score is most sensitive to added and dropped regions, while the coverage score is more sensitive to shifted regions. Availability: BSD2-licensed source code and documentation can be found at https://bedshift.databio.org.
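
As a rough illustration of one of the metrics named above, the following sketch computes a base-pair Jaccard score (shared bases divided by bases covered by either set) between two interval lists on a single chromosome. It is not bedshift code and may not match the exact definition used in the paper; the function names and the half-open `(start, end)` representation are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumed layout)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Number of bases covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a: List[Interval], b: List[Interval]) -> int:
    """Bases covered by both merged, sorted interval lists (two-pointer sweep)."""
    i = j = overlap = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            overlap += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Base-pair Jaccard score: |A and B| / |A or B|; 1.0 if both sets are empty."""
    a, b = merge(a), merge(b)
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 1.0


original = [(100, 200), (300, 400)]
shifted = [(150, 250), (300, 400)]   # first region moved by 50 bp
dropped = [(100, 200)]               # second region removed

print(jaccard(original, shifted))  # 0.6
print(jaccard(original, dropped))  # 0.5
```

In this toy case, dropping one of two equal-length regions lowers the score more than shifting one region by half its length, which is the kind of behavior a perturbation benchmark is meant to expose.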

Alview: Portable Software for Viewing Sequence Reads in BAM Formatted Files

Cancer Informatics ◽

10.4137/cin.s26470 ◽

2015 ◽

Vol 14 ◽

pp. CIN.S26470 ◽

Cited By ~ 2

Author(s):

Richard P. Finney ◽

Qing-Rong Chen ◽

Cu V. Nguyen ◽

Chih Hao Hsu ◽

Chunhua Yan ◽

...

Keyword(s):

Graphical User Interface ◽

Reference Genome ◽

Source Code ◽

Software Tool ◽

Command Line ◽

Sequencing Data ◽

Genome Data ◽

Command Line Tool ◽

Portable Software ◽

Microsoft Windows

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.

Bedshift: perturbation of genomic interval sets

Genome Biology ◽

10.1186/s13059-021-02440-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Aaron Gu ◽

Hyun Jae Cho ◽

Nathan C. Sheffield

Keyword(s):

Functional Genomics ◽

Similarity Metrics ◽

Reference File ◽

Genomic Interval ◽

Coverage Score

Abstract. Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.
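
The three perturbations described above (shifting, adding, and dropping regions) can be sketched in a few lines. This is an illustrative stand-in, not the Bedshift API: the `perturb` function, its parameters, and the fixed length of added regions are all hypothetical choices made for this example.

```python
import random
from typing import Dict, List, Optional, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open coordinates


def perturb(
    regions: List[Region],
    chrom_sizes: Dict[str, int],
    shift_rate: float = 0.0,
    drop_rate: float = 0.0,
    add_rate: float = 0.0,
    max_shift: int = 100,
    add_length: int = 200,
    seed: Optional[int] = None,
) -> List[Region]:
    """Return a randomly perturbed copy of `regions` (hypothetical helper).

    Each region is dropped with probability `drop_rate`; each surviving region
    is shifted with probability `shift_rate`; finally round(add_rate * n) new
    regions of `add_length` bp are added at uniformly random positions.
    Assumes every chromosome is at least `add_length` bp long.
    """
    rng = random.Random(seed)
    perturbed: List[Region] = []
    for chrom, start, end in regions:
        if rng.random() < drop_rate:
            continue
        if rng.random() < shift_rate:
            delta = rng.randint(-max_shift, max_shift)
            # Clamp the shift so the region stays inside the chromosome.
            delta = max(-start, min(delta, chrom_sizes[chrom] - end))
            start, end = start + delta, end + delta
        perturbed.append((chrom, start, end))

    chroms = sorted(chrom_sizes)
    for _ in range(round(add_rate * len(regions))):
        chrom = rng.choice(chroms)
        start = rng.randint(0, chrom_sizes[chrom] - add_length)
        perturbed.append((chrom, start, start + add_length))
    return sorted(perturbed)


reference = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 0, 50), ("chr2", 900, 1000)]
sizes = {"chr1": 1000, "chr2": 1000}
print(perturb(reference, sizes, shift_rate=0.5, drop_rate=0.25, add_rate=0.5, seed=42))
```

Because the perturbation rates are set by the caller, the expected similarity between the output and the reference is known in advance, which is what makes such files usable as a benchmark.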

SgTiler: a fast method to design tiling sgRNAs for CRISPR/Cas9 mediated screening

10.1101/217166 ◽

2017 ◽

Author(s):

Musaddeque Ahmed ◽

Housheng Hansen He

Keyword(s):

Source Code ◽

Regions Of Interest ◽

Fast Method ◽

Command Line ◽

Regulatory Regions ◽

Guide Rnas ◽

Command Line Tool ◽

Genomic Regions ◽

The Web ◽

Target Effects

Abstract. Summary: Screening of genomic regions of interest using CRISPR/Cas9 is becoming increasingly popular. The system requires the design of single guide RNAs (sgRNAs) that can efficiently guide the Cas9 endonuclease to the targeted region with minimal off-target effects. Tiling sgRNAs is the most effective way to perturb regulatory regions, such as promoters and enhancers. sgTiler is the first tool that provides a fast method for designing tiling sgRNAs. Availability and Implementation: sgTiler is a command line tool that requires only one command to execute. Its source code is freely available on the web at https://github.com/HansenHeLab/sgTiler. sgTiler is implemented in Python and supported on any platform with Python and Bowtie.

BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa928 ◽

2020 ◽

Author(s):

Aziz Khan ◽

Rafael Riudavets Puig ◽

Paul Boddie ◽

Anthony Mathelier

Keyword(s):

Dna Sequences ◽

Source Code ◽

Web Server ◽

Enrichment Analysis ◽

Nucleotide Composition ◽

Supplementary Information ◽

Command Line ◽

Sequence Composition ◽

Command Line Tool ◽

Gc Bias

Abstract. Motivation: Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC-bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. Results: We developed BiasAway, a command line tool and its dedicated easy-to-use web server to generate synthetic sequences matching any k-mer nucleotide composition or select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. Availability and implementation: BiasAway source code is freely available from Bitbucket (https://bitbucket.org/CBGR/biasaway) and can be easily installed using bioconda or pip. The web server is available at https://biasaway.uio.no and detailed documentation is available at https://biasaway.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics online.
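
To make the idea of a composition-matched synthetic background concrete, the sketch below handles the simplest case, k = 1: shuffling a foreground sequence yields a background with exactly the same mononucleotide counts, and therefore the same GC content. It is not BiasAway code and does not reproduce its models; the function names are made up for this example.

```python
import random
from collections import Counter
from typing import Optional


def mononucleotide_shuffle(sequence: str, seed: Optional[int] = None) -> str:
    """Return a random permutation of `sequence` (same base counts, new order)."""
    rng = random.Random(seed)
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in `sequence` (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


foreground = "ACGTGGCCGCGCATAT"
background = mononucleotide_shuffle(foreground, seed=0)

# The shuffled background keeps the foreground's base composition.
print(Counter(foreground) == Counter(background))
print(gc_content(foreground), gc_content(background))
```

Preserving higher-order composition (dinucleotides and beyond) needs a different shuffling scheme, since a plain permutation only guarantees the single-base counts.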

era5cli: The command line tool to download ERA5 data

10.5194/egusphere-egu2020-21619 ◽

2020 ◽

Author(s):

Jaro Camphuijsen ◽

Ronald van Haren ◽

Yifat Dzigan ◽

Niels Drost ◽

Fakhareh Alidoost ◽

...

Keyword(s):

Source Code ◽

Reanalysis Data ◽

Command Line ◽

Climate Data ◽

Web Interface ◽

Command Line Interface ◽

Short Introduction ◽

Data Store ◽

Command Line Tool ◽

Advanced Knowledge

With the release of the ERA5 dataset, worldwide high resolution reanalysis data became available with open access for public use. The Copernicus CDS (Climate Data Store) offers two options for accessing the data: a web interface and a Python API. Consequently, automated downloading of the data requires advanced knowledge of Python and a lot of work. To make this process easier, we developed era5cli. The command line interface tool era5cli enables automated downloading of ERA5 using a single command. All variables and options available in the CDS web form are now available for download in an efficient way. Both the monthly and hourly datasets are supported. Besides automation, era5cli adds several useful functionalities to the download pipeline. One of the key options in era5cli is to spread one download command over multiple CDS requests, resulting in higher download speeds. Files can be saved in both GRIB and NetCDF format with automatic, yet customizable file names. The `info` command lists correct names of the available variables and pressure levels for 3D variables. For debugging and testing, the `--dryrun` option can be selected to return only the CDS request. An overview of all available options, including instructions on how to configure your CDS account, is available in our documentation. Source code is available on https://github.com/eWaterCycle/era5cli. In this PICO presentation we will provide an overview of era5cli, as well as a short introduction on how to use era5cli.

FAN-C: A Feature-rich Framework for the Analysis and Visualisation of C data

10.1101/2020.02.03.932517 ◽

2020 ◽

Cited By ~ 6

Author(s):

Kai Kruse ◽

Clemens B. Hug ◽

Juan M. Vaquerizas

Keyword(s):

High Throughput ◽

Matrix Analysis ◽

Set Covering ◽

Command Line ◽

Chromosome Conformation ◽

C Storage ◽

Data Formats ◽

Analysis Tools ◽

Command Line Tool ◽

Broad Feature

Chromosome conformation capture data, particularly from high-throughput approaches such as Hi-C and its derivatives, are typically very complex to analyse. Existing analysis tools are often single-purpose, or limited in compatibility to a small number of data formats, frequently making Hi-C analyses tedious and time-consuming. Here, we present FAN-C, an easy-to-use command-line tool and powerful Python API with a broad feature set covering matrix generation, analysis, and visualisation for C-like data (https://github.com/vaquerizaslab/fanc). Due to its comprehensiveness and compatibility with the most prevalent Hi-C storage formats, FAN-C can be used in combination with a large number of existing analysis tools, thus greatly simplifying Hi-C matrix analysis.

Vargas: heuristic-free alignment for assessing linear and graph read aligners

10.1101/2019.12.20.884676 ◽

2019 ◽

Author(s):

Charlotte A. Darby ◽

Ravi Gaddipati ◽

Michael C. Schatz ◽

Ben Langmead

Keyword(s):

Gold Standard ◽

Source Code ◽

Alignment Accuracy ◽

Local Alignment ◽

Maximum Speed ◽

Command Line ◽

Scoring Functions ◽

Large Numbers ◽

Computationally Intensive ◽

Optimal Alignments

Abstract. Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.
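
For readers unfamiliar with what "heuristic-free" and "cell updates" mean here, the sketch below fills the full dynamic-programming matrix for a local alignment of a read against a linear reference, so the best score is guaranteed to be found. It is a deliberately simplified stand-in and not Vargas code: it uses a linear gap penalty rather than the affine gaps and quality-scaled mismatches described above, ignores graph genomes, and has no SIMD or multi-core parallelism. The scoring values are arbitrary defaults chosen for this example.

```python
from typing import Tuple


def local_alignment_score(
    read: str,
    ref: str,
    match: int = 2,
    mismatch: int = -2,
    gap: int = -3,
) -> Tuple[int, int]:
    """Exhaustive Smith-Waterman scoring with a linear gap penalty.

    Returns (best local alignment score, number of DP cells updated).
    Every cell of the len(read) x len(ref) matrix is computed, so no
    candidate alignment is skipped.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best, len(read) * len(ref)


score, cells = local_alignment_score("ACGTACGT", "ACGTTACGT")
print(score, cells)  # 13 72
```

The cell count grows with the product of read and reference length, which is why exhaustive alignment is computationally intensive and why throughput is reported in cell updates per second.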

Associating Natural Language Comment and Source Code Entities

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6382 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8592-8599

Author(s):

Sheena Panthaplackel ◽

Milos Gligoric ◽

Raymond J. Mooney ◽

Junyi Jessy Li

Keyword(s):

Software Development ◽

Natural Language ◽

Open Source ◽

Source Code ◽

Initial Step ◽

Binary Classifier ◽

Sequence Labeling ◽

Evaluation Dataset ◽

Revision Histories

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.

idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

Abstract. idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make calls equivalent to those annotated by Nextstrain.org on all three common clade systems, using user-uploaded FastQ files directly. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computers, HPC systems, and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.

ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees

Bioinformatics ◽

10.1093/bioinformatics/btaa219 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3930-3931 ◽

Cited By ~ 1

Author(s):

Oliver B Scott ◽

A W Edith Chan

Keyword(s):

Open Source ◽

High Throughput Screening ◽

Chemical Space ◽

Diversity Analysis ◽

Graph Analysis ◽

Command Line ◽

Molecular Scaffold ◽

Large Sets ◽

Command Line Tool ◽

Scaffold Diversity

Abstract. Summary: ScaffoldGraph (SG) is an open-source Python library and command-line tool for the generation and analysis of molecular scaffold networks and trees, with the capability of processing large sets of input molecules. With the increase in high-throughput screening data, scaffold graphs have proven useful for the navigation and analysis of chemical space, being used for visualization, clustering, scaffold-diversity analysis and active-series identification. Built on RDKit and NetworkX, SG integrates scaffold graph analysis into the growing scientific/cheminformatics Python stack, increasing the flexibility and extendibility of the tool compared to existing software. Availability and implementation: SG is freely available and released under the MIT licence at https://github.com/UCLCheminformatics/ScaffoldGraph.
