scholarly journals Bedshift: perturbation of genomic interval sets

2020 ◽  
Author(s):  
Aaron Gu ◽  
Hyun Jae Cho ◽  
Nathan C. Sheffield

Results of functional genomics experiments such as ChIP-Seq or ATAC-Seq produce data summarized as a region set. Many tools have been developed to analyze region sets, including computing similarity metrics to compare them. However, there is no way to objectively evaluate the effectiveness of region set similarity metrics. In this paper we present bedshift, a command-line tool and Python API to generate new BED files by making random perturbations to an original BED file. Perturbed files have known similarity to the original file and are therefore useful to benchmark similarity metrics. To demonstrate, we used bedshift to create an evaluation dataset of 3,600 perturbed files generated by shifting, adding, and dropping regions from a reference BED file. Then, we compared four similarity metrics: Jaccard score, coverage score, Euclidean distance, and cosine similarity. The results show that the Jaccard score is most sensitive to detecting adding and dropping regions, while the coverage score is more sensitive to shifted regions.AvailabilityBSD2-licensed source code and documentation can be found at https://bedshift.databio.org.

2015 ◽  
Vol 14 ◽  
pp. CIN.S26470 ◽  
Author(s):  
Richard P. Finney ◽  
Qing-Rong Chen ◽  
Cu V. Nguyen ◽  
Chih Hao Hsu ◽  
Chunhua Yan ◽  
...  

The name Alview is a contraction of the term Alignment Viewer. Alview is a compiled to native architecture software tool for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native, GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview . The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview .


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Aaron Gu ◽  
Hyun Jae Cho ◽  
Nathan C. Sheffield

AbstractFunctional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.


2017 ◽  
Author(s):  
Musaddeque Ahmed ◽  
Housheng Hansen He

AbstractSummaryScreening of genomic regions of interest using CRISPR/Cas9 is getting increasingly popular. The system requires designing of single guide RNAs (sgRNAs) that can efficiently guide the Cas9 endonuclease to the targeted region with minimal off-target effects. Tiling sgRNAs is the most effective way to perturb regulatory regions, such as promoters and enhancers. sgTiler is the first tool that provides a fast method for designing tiling sgRNAs.Availability and ImplementationsgTiler is a command line tool that requires only one command to execute. Its source code is freely available on the web at https://github.com/HansenHeLab/sgTiler. sgTiler is implemented in Python and supported on any platform with Python and Bowtie.


Author(s):  
Aziz Khan ◽  
Rafael Riudavets Puig ◽  
Paul Boddie ◽  
Anthony Mathelier

Abstract Motivation Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC-bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. Results We developed BiasAway, a command line tool and its dedicated easy-to-use web server to generate synthetic sequences matching any k-mer nucleotide composition or select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. Availability and implementation BiasAway source code is freely available from Bitbucket (https://bitbucket.org/CBGR/biasaway) and can be easily installed using bioconda or pip. The web server is available at https://biasaway.uio.no and a detailed documentation is available at https://biasaway.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Jaro Camphuijsen ◽  
Ronald van Haren ◽  
Yifat Dzigan ◽  
Niels Drost ◽  
Fakhareh Alidoost ◽  
...  

<p>With the release of the ERA5 dataset, worldwide high resolution reanalysis data became available with open access for public use. The Copernicus CDS (Climate Data Store) offers two options for accessing the data: a web interface and a Python API. Consequently, automated downloading of the data requires advanced knowledge of Python and a lot of work. To make this process easier, we developed era5cli. </p><p>The command line interface tool era5cli enables automated downloading of ERA5 using a single command. All variables and options available in the CDS web form are now available for download in an efficient way. Both the monthly and hourly dataset are supported. Besides automation, era5cli adds several useful functionalities to the download pipeline.</p><p>One of the key options in era5cli is to spread one download command over multiple CDS requests, resulting in higher download speeds. Files can be saved in both GRIB and NETCDF format with automatic, yet customizable file names. The `info` command lists correct names of the available variables and pressure levels for 3D variables. For debugging purposes and testing the `--dryrun` option can be selected to return only the CDS request. An overview of all available options, including instructions on how to configure your CDS account, is available in our documentation. Source code is available on https://github.com/eWaterCycle/era5cli.</p><p>In this PICO presentation we will provide an overview of era5cli, as well as a short introduction on how to use era5cli.</p>


Author(s):  
Kai Kruse ◽  
Clemens B. Hug ◽  
Juan M. Vaquerizas

Chromosome conformation capture data, particularly from high-throughput approaches such as Hi-C and its derivatives, are typically very complex to analyse. Existing analysis tools are often single-purpose, or limited in compatibility to a small number of data formats, frequently making Hi-C analyses tedious and time-consuming. Here, we present FAN-C, an easy-to-use command-line tool and powerful Python API with a broad feature set covering matrix generation, analysis, and visualisation for C-like data (https://github.com/vaquerizaslab/fanc). Due to its comprehensiveness and compatibility with the most prevalent Hi-C storage formats, FAN-C can be used in combination with a large number of existing analysis tools, thus greatly simplifying Hi-C matrix analysis.


2019 ◽  
Author(s):  
Charlotte A. Darby ◽  
Ravi Gaddipati ◽  
Michael C. Schatz ◽  
Ben Langmead

AbstractRead alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.


2020 ◽  
Vol 34 (05) ◽  
pp. 8592-8599
Author(s):  
Sheena Panthaplackel ◽  
Milos Gligoric ◽  
Raymond J. Mooney ◽  
Junyi Jessy Li

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.


2020 ◽  
Author(s):  
Xun Zhu ◽  
Ti-Cheng Chang ◽  
Richard Webby ◽  
Gang Wu

AbstractidCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make equivalent calls as annotated by Nextstrain.org on all three common clade systems using user uploaded FastQ files directly. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computer, HPC and the cloud. The source code is available at https://github.com/xz-stjude/idcov. A documentation for installation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.


2020 ◽  
Vol 36 (12) ◽  
pp. 3930-3931 ◽  
Author(s):  
Oliver B Scott ◽  
A W Edith Chan

Abstract Summary ScaffoldGraph (SG) is an open-source Python library and command-line tool for the generation and analysis of molecular scaffold networks and trees, with the capability of processing large sets of input molecules. With the increase in high-throughput screening data, scaffold graphs have proven useful for the navigation and analysis of chemical space, being used for visualization, clustering, scaffold-diversity analysis and active-series identification. Built on RDKit and NetworkX, SG integrates scaffold graph analysis into the growing scientific/cheminformatics Python stack, increasing the flexibility and extendibility of the tool compared to existing software. Availability and implementation SG is freely available and released under the MIT licence at https://github.com/UCLCheminformatics/ScaffoldGraph.


Sign in / Sign up

Export Citation Format

Share Document