BioCantor: a Python library for genomic feature arithmetic in arbitrarily related coordinate systems

Mapping Intimacies ◽

10.1101/2021.07.09.451743 ◽

2021 ◽

Author(s):

Pamela H Russell ◽

Ian T Fiddes

Keyword(s):

Genomic Feature ◽

Software Library ◽

Coordinate Systems ◽

Minimal Set ◽

Genomic Features ◽

File Formats ◽

Extensive Documentation ◽

Library Support ◽

Python Package

Motivation: Bioinformaticians frequently navigate among a diverse set of coordinate systems: for example, converting between genomic, transcript, and protein coordinates. The abstraction of coordinate systems and feature arithmetic allows genomic workflows to be expressed more elegantly and succinctly. However, no publicly available software library offers fully featured interoperable support for multiple coordinate systems. As such, bioinformatics programmers must either implement custom solutions, or make do with existing utilities, which may lack the full functionality they require. Results: We present BioCantor, a Python library that provides integrated library support for arbitrarily related coordinate systems and rich operations on genomic features, with I/O support for a variety of file formats. Availability and implementation: BioCantor is implemented as a Python 3 library with a minimal set of external dependencies. The library is freely available under the MIT license at https://github.com/InscriptaLabs/BioCantor and on the Python Package Index at https://pypi.org/project/BioCantor/. BioCantor has extensive documentation and vignettes available on ReadTheDocs at https://biocantor.readthedocs.io/en/latest/.
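The coordinate conversions the abstract mentions can be illustrated with a minimal pure-Python sketch. It does not use BioCantor's API: the function names, the `Exon` alias and the 0-based half-open interval convention are all assumptions made for this example. It maps positions between a spliced transcript and the chromosome it sits on.

```python
from typing import List, Tuple

# Hypothetical convention for this sketch: 0-based, half-open [start, end)
# exon intervals on the chromosome.
Exon = Tuple[int, int]


def _in_transcript_order(exons: List[Exon], strand: str) -> List[Exon]:
    """Return exons in 5'-to-3' order of the transcript."""
    ordered = sorted(exons)
    return ordered if strand == "+" else ordered[::-1]


def transcript_to_genomic(pos: int, exons: List[Exon], strand: str = "+") -> int:
    """Map a 0-based transcript position to a 0-based genomic position."""
    if pos < 0:
        raise ValueError("transcript position must be non-negative")
    remaining = pos
    for start, end in _in_transcript_order(exons, strand):
        length = end - start
        if remaining < length:
            # On the minus strand the transcript runs from high to low coordinates.
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


def genomic_to_transcript(gpos: int, exons: List[Exon], strand: str = "+") -> int:
    """Map a 0-based genomic position inside an exon to a transcript position."""
    offset = 0  # number of transcript bases in the exons already walked past
    for start, end in _in_transcript_order(exons, strand):
        if start <= gpos < end:
            within = gpos - start if strand == "+" else end - 1 - gpos
            return offset + within
        offset += end - start
    raise ValueError("genomic position is intronic or outside the transcript")
```

For a two-exon transcript `[(100, 110), (200, 205)]` on the plus strand, `transcript_to_genomic(10, exons)` returns `200`, the first base of the second exon. A library such as BioCantor generalizes this idea to arbitrarily nested coordinate systems.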

Topoly: Python package to analyze topology of polymers

Briefings in Bioinformatics ◽

10.1093/bib/bbaa196 ◽

2020 ◽

Author(s):

Pawel Dabrowski-Tumanski ◽

Pawel Rubach ◽

Wanda Niemyska ◽

Bartosz Ambrozy Gren ◽

Joanna Ida Sulkowska

Keyword(s):

Test Cases ◽

Sample Analysis ◽

Polynomial Invariants ◽

Python Programming Language ◽

Spatial Graphs ◽

File Formats ◽

Extensive Documentation ◽

Python Programming ◽

Python Package

Abstract The increasing role of topology in (bio)physical properties of matter creates a need for an efficient method of detecting the topology of a (bio)polymer. However, the existing tools allow one to classify only the simplest knots and cannot be used in automated sample analysis. To answer this need, we created the Topoly Python package. This package enables the distinguishing of knots, slipknots, links and spatial graphs through the calculation of different topological polynomial invariants. It also enables one to create the minimal spanning surface on a given loop, e.g. to detect a lasso motif or to generate random closed polymers. It is capable of reading various file formats, including PDB. The extensive documentation along with test cases and the simplicity of the Python programming language make it a very simple to use yet powerful tool, suitable even for inexperienced users. Topoly can be obtained from https://topoly.cent.uw.edu.pl.

SynBiopython: an open-source software library for Synthetic Biology

Synthetic Biology ◽

10.1093/synbio/ysab001 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Jing Wui Yeoh ◽

Neil Swainston ◽

Peter Vegh ◽

Valentin Zulkower ◽

Pablo Carbonell ◽

...

Keyword(s):

Synthetic Biology ◽

Open Source ◽

Open Source Software ◽

Development Projects ◽

Software Library ◽

Current State ◽

Starting Point ◽

Common Problems ◽

Data Tracking ◽

Python Package

Abstract Advances in hardware automation in synthetic biology laboratories are not yet fully matched by those of their software counterparts. Such automated laboratories, now commonly called biofoundries, require software solutions that would help with many specialized tasks such as batch DNA design, sample and data tracking, and data analysis, among others. Typically, many of the challenges facing biofoundries are shared, yet there is frequent wheel-reinvention where many labs develop similar software solutions in parallel. In this article, we present the first attempt at creating a standardized, open-source Python package. A number of tools will be integrated and developed that we envisage will become the obvious starting point for software development projects within biofoundries globally. Specifically, we describe the current state of available software, present usage scenarios and case studies for common problems, and finally describe plans for future development. SynBiopython is publicly available at the following address: http://synbiopython.org.

Genomic Feature Analysis of Betacoronavirus Provides Insights Into SARS and COVID-19 Pandemics

Frontiers in Microbiology ◽

10.3389/fmicb.2021.614494 ◽

2021 ◽

Vol 12 ◽

Author(s):

Xin Li ◽

Jia Chang ◽

Shunmei Chen ◽

Liangge Wang ◽

Tung On Yau ◽

...

Keyword(s):

Feature Analysis ◽

Genomic Feature ◽

Cleavage Sites ◽

Furin Cleavage ◽

Genomic Features ◽

Future Studies ◽

Origin And Evolution ◽

The World ◽

Main Factors ◽

Similar Risk

In December 2019, the world awoke to a new betacoronavirus strain named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). Betacoronavirus consists of A, B, C and D subgroups. Both SARS-CoV and SARS-CoV-2 belong to betacoronavirus subgroup B. In the present study, we divided betacoronavirus subgroup B into the SARS1 and SARS2 classes by six key insertions and deletions (InDels) in betacoronavirus genomes, and identified a recently detected betacoronavirus strain, RmYN02, as a recombinant strain across the SARS1 and SARS2 classes, which has the potential to generate a new strain with a risk similar to that of SARS-CoV and SARS-CoV-2. By analyzing genomic features of betacoronavirus, we concluded: (1) the jumping transcription and recombination of CoVs share the same molecular mechanism, which inevitably causes CoV outbreaks; (2) recombination, receptor binding abilities, junction furin cleavage sites (FCSs), first hairpins and ORF8s are the main factors contributing to the extraordinary transmission, virulence and host adaptability of betacoronavirus; and (3) the strong recombination ability of CoVs integrated other main factors to generate multiple recombinant strains, two of which evolved into SARS-CoV and SARS-CoV-2, resulting in the SARS and COVID-19 pandemics. As the most important genomic features of SARS-CoV and SARS-CoV-2, an enhanced ORF8 and a novel junction FCS, respectively, are indispensable clues for future studies of their origin and evolution. The WIV1 strain without the enhanced ORF8 and the RaTG13 strain without the junction FCS “RRAR” may contribute to, but are not the immediate ancestors of, SARS-CoV and SARS-CoV-2, respectively.

COPAR: A ChIP-Seq Optimal Peak Analyzer

BioMed Research International ◽

10.1155/2017/5346793 ◽

2017 ◽

Vol 2017 ◽

pp. 1-4

Author(s):

Binhua Tang ◽

Xihan Wang ◽

Victor X. Jin

Keyword(s):

High Throughput ◽

Genomic Feature ◽

Data Sets ◽

Sequencing Data ◽

Genomic Features ◽

Peak Alignment ◽

Chip Sequencing ◽

Quality Check ◽

User Friendly ◽

High Throughput Experiments

Sequencing data quality and peak alignment efficiency of ChIP-sequencing profiles are directly related to the reliability and reproducibility of NGS experiments. Until now, there has been no tool specifically designed for optimal peak alignment estimation and quality-related genomic feature extraction for ChIP-sequencing profiles. We developed COPAR, an open-source, user-friendly package, to statistically investigate, quantify, and visualize the optimal peak alignment and inherent genomic features using ChIP-seq data from NGS experiments. It provides a versatile perspective for biologists to perform quality checks for high-throughput experiments and optimize their experiment design. COPAR can process mapped ChIP-seq read files in BED format and output statistically sound results for multiple high-throughput experiments. We have deposited COPAR on GitHub under a GNU GPL license, together with three public ChIP-seq data sets verified with the package.

pyCancerSig: subclassifying human cancer with comprehensive single nucleotide, structural and microsatellite mutational signature deconstruction from whole genome sequencing

10.1101/785410 ◽

2019 ◽

Author(s):

Jessada Thutkawkorapin ◽

Jesper Eisfeldt ◽

Emma Tham ◽

Daniel Nilsson

Keyword(s):

Human Cancer ◽

The Cancer Genome Atlas ◽

Driver Mutations ◽

Learning Technology ◽

Single Nucleotide ◽

File Formats ◽

Learning Technique ◽

Cancer Genome Atlas ◽

Python Package ◽

Non Negative Matrix Factorization

Abstract Background: DNA damage accumulates over the course of cancer development. The often-substantial number of somatic mutations in cancer poses a challenge to traditional methods that characterize tumors based on driver mutations. However, advances in machine learning technology can take advantage of this substantial amount of data. Results: We developed a command line interface Python package, pyCancerSig, to perform sample profiling by integrating single nucleotide variation (SNV), structural variation (SV) and microsatellite instability (MSI) profiles into a unified profile. It also provides a command to decipher underlying cancer processes, employing an unsupervised learning technique, Non-negative Matrix Factorization, and a command to visualize the results. The package accepts common standard file formats (vcf, bam). The program was evaluated using a cohort of breast and colorectal cancers from The Cancer Genome Atlas project (TCGA). The results showed that by integrating multiple mutation modes, the tool can correctly identify cases with known clear mutational signatures and can strengthen signatures in cases with an unclear signal from an SNV-only profile. Conclusions: pyCancerSig has demonstrated its capability in identifying known and unknown cancer processes and, at the same time, illuminates the associations within and between the mutation modes.
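The Non-negative Matrix Factorization step named in the abstract can be sketched in pure Python as follows. This is not pyCancerSig's implementation: it uses the classic multiplicative-update rules for the squared-error objective, and the function names, the matrix orientation (rows as mutation features, columns as samples) and the parameters `rank`, `iterations`, `seed` and `eps` are assumptions chosen for the example.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def nmf(v: Matrix, rank: int, iterations: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m).

    In a mutational-signature setting (an assumption of this sketch), columns
    of W would play the role of signatures and H their per-sample exposures.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Frobenius norm of V - WH, the quantity the updates try to reduce."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw)) ** 0.5
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the factors interpretable as additive contributions. A production tool would typically rely on an optimized numerical library rather than nested lists.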

Pydigree: a python library for manipulation and forward-time simulation of genetic datasets

10.1101/213413 ◽

2017 ◽

Author(s):

James E. Hicks

Keyword(s):

Population Genetics ◽

Data Structures ◽

Genetic Epidemiology ◽

Genetic Data ◽

Link Type ◽

File Formats ◽

Time Simulation ◽

Cross Platform ◽

User Friendly ◽

Python Package

Abstract The development of software for working with data from population genetics or genetic epidemiology often requires substantial time spent implementing common procedures. Pydigree is a cross-platform Python 3 library that contains efficient, user-friendly implementations for many of these common functions, and support for input from common file formats. Developers can combine the functions and data structures to rapidly implement programs handling genetic data. Pydigree presents a useful environment for development of applications for genetic data or rapid prototyping before reimplementation in a higher-performance language. Pydigree is freely available under an open source license. Stable sources can be found in the Python Package Index at https://pypi.python.org/pypi/pydigree/, and development sources can be downloaded at https://github.com/jameshicks/pydigree/

Explore, edit and leverage genomic annotations using Python GTF toolkit

Bioinformatics ◽

10.1093/bioinformatics/btz116 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3487-3488 ◽

Cited By ~ 3

Author(s):

F Lopez ◽

G Charbonnier ◽

Y Kermezli ◽

M Belhocine ◽

Q Ferré ◽

...

Keyword(s):

External Information ◽

Read Coverage ◽

Site Location ◽

Transcription Start ◽

Command Line Interface ◽

Genomic Features ◽

The Core ◽

Dynamic Library ◽

Go Terms ◽

Python Package

Abstract Motivation: While Python has become very popular in bioinformatics, a limited number of libraries exist for fast manipulation of gene coordinates in Ensembl GTF format. Results: We have developed the GTF toolkit Python package (pygtftk), which aims at providing easy and powerful manipulation of gene coordinates in GTF format. For optimal performance, the core engine of pygtftk is a C dynamic library (libgtftk) while the Python API provides usability and readability for developing scripts. Based on this Python package, we have developed the gtftk command line interface that contains 57 sub-commands (v0.9.10) to ease handling of GTF files. These commands may be used to (i) perform basic tasks (e.g. selections, insertions, updates or deletions of features/keys), (ii) select genes/transcripts based on various criteria (e.g. size, exon number, transcription start site location, intron length, GO terms) or (iii) carry out more advanced operations such as coverage analyses of genomic features using bigWig files to create faceted read-coverage diagrams. In conclusion, the pygtftk package greatly simplifies the annotation of GTF files with external information while providing advanced tools to perform gene analyses. Availability and implementation: pygtftk and gtftk have been tested on Linux and MacOSX and are available from https://github.com/dputhier/pygtftk under the MIT license. The libgtftk dynamic library written in C is available from https://github.com/dputhier/libgtftk.
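To make the kind of selection described in (ii) concrete, here is a minimal pure-Python sketch that counts exons per transcript in GTF lines and keeps transcripts above a threshold. It does not use the pygtftk API or the gtftk sub-commands: `parse_gtf_line`, `exon_counts` and `select_by_exon_number` are hypothetical helper names, and the parser assumes the common nine-column, tab-separated layout with quoted attribute values.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches attribute pairs of the form: key "value"
# (unquoted values, which some GTF files use for numbers, are ignored here).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Split one GTF record into its nine fields and parse the attributes."""
    (seqname, source, feature, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split("\t")
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTR_RE.findall(attrs)),
    }


def exon_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Count exon records per transcript_id, skipping comments and blanks."""
    counts: Dict[str, int] = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        record = parse_gtf_line(line)
        if record["feature"] == "exon":
            counts[record["attributes"]["transcript_id"]] += 1
    return dict(counts)


def select_by_exon_number(lines: Iterable[str], min_exons: int) -> List[str]:
    """Return sorted transcript IDs that have at least min_exons exons."""
    return sorted(t for t, n in exon_counts(lines).items() if n >= min_exons)
```

Calling `select_by_exon_number(open("annotation.gtf"), 2)` on a hypothetical file would list multi-exon transcripts. A compiled engine such as libgtftk exists precisely because this line-by-line approach becomes slow on full genome annotations.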

Unified methods for feature selection in large-scale genomic studies with censored survival outcomes

Bioinformatics ◽

10.1093/bioinformatics/btaa161 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3409-3417

Author(s):

Lauren Spirko-Burns ◽

Karthik Devarajan

Keyword(s):

Feature Selection ◽

Large Scale ◽

Proportional Hazards ◽

Cox Model ◽

Supplementary Information ◽

Genomic Feature ◽

Prognostic Impact ◽

Genomic Features ◽

Special Cases ◽

Genomic Studies

Abstract Motivation: One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback–Leibler information divergence and the Yang–Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R² index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements. Results: We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
Availability and implementation: R code for the proposed methods is available at github.com/lburns27/Feature-Selection. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
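For readers unfamiliar with the two special cases named in the abstract, the standard textbook forms of the PH and PO models for a single feature are written out below. The notation (baseline hazard $h_0$, baseline distribution function $F_0$, coefficient $\beta$, feature value $x$) is chosen for this note and is not taken from the paper.

```latex
% Proportional hazards (Cox) model: the hazard ratio between two
% feature values does not depend on time t.
h(t \mid x) = h_0(t)\, e^{\beta x}
\quad\Longrightarrow\quad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = e^{\beta (x_1 - x_2)}

% Proportional odds model: the odds of having experienced the event
% by time t are scaled by a time-independent factor.
\frac{F(t \mid x)}{1 - F(t \mid x)} = \frac{F_0(t)}{1 - F_0(t)}\, e^{\beta x}
```

Non-proportional hazards arise whenever the first ratio varies with $t$, for example when the hazard curves of two groups cross, which is the situation the proposed screening measures are designed to accommodate.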

Genomic profiling of Chinese patients with urothelial carcinoma

BMC Cancer ◽

10.1186/s12885-021-07829-1 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Bo Yang ◽

Xiao Zhao ◽

Chong Wan ◽

Xin Ma ◽

Shaoxi Niu ◽

...

Keyword(s):

Urothelial Carcinoma ◽

Tumor Tissue ◽

Genetic Alterations ◽

Chinese Patients ◽

Study Cohort ◽

Genomic Feature ◽

Genomic Alterations ◽

Tissue Samples ◽

Genomic Features ◽

Somatic Alterations

Abstract Background: Urothelial carcinoma (UC) is the most common genitourinary malignancy in China. In this study, we surveyed the genomic features in Chinese UC patients and investigated the concordance of genetic alterations between circulating tumor DNA (ctDNA) in plasma and matched tumor tissue. Materials and methods: A total of 112 UC patients were enrolled, of which 31 were upper tract UC (UTUC) and 81 were UC of bladder (UCB). Genomic alterations in 92 selected genes were analyzed by targeted next-generation sequencing. Results: In the study cohort, 94.64, 86.61 and 62.50% of patients were identified as having valid somatic, oncogenic and actionable somatic alterations, respectively. The most frequently altered genes included TP53, KMT2D, KDM6A, FAT4, FAT1, CREBBP and ARID1A. A higher prevalence of HRAS (22.0% vs 3.7%) and KMT2D (59.26% vs 34.57%) alterations was identified in UTUC than in UCB. Comparisons of somatic alterations of UCB and UTUC between the study cohort and western cohorts revealed significant differences in mutant prevalence. Notably, 28.57, 17.86 and 47.32% of the cases harbored alterations in FGFRs, ERBBs and DNA damage repair genes, respectively. Furthermore, 75% of the patients carried non-benign germline variants, but only two (1.79%) were pathogenic. The overall concordance for genomic alterations in ctDNA and matched tumor tissue was 42.97% (0–100%). Notably, 47.25% of alterations detected in ctDNA were not detected in the matched tissue, of which 54.14% were oncogenic mutations. Conclusions: We found a unique genomic feature of Chinese UC patients. A reasonably good concordance of genomic features between ctDNA and tissue samples was identified.

Unified Methods for Feature Selection in Large-Scale Genomic Studies with Censored Survival Outcomes

10.1101/2020.02.14.944314 ◽

2020 ◽

Author(s):

Lauren Spirko-Burns ◽

Karthik Devarajan

Keyword(s):

Feature Selection ◽

Large Scale ◽

Proportional Hazards ◽

Genomic Feature ◽

Data Sets ◽

Prognostic Impact ◽

Genomic Features ◽

Special Cases ◽

Information Divergence ◽

Genomic Studies

Abstract One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease’s process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² indices for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R² measure that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature expression. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
The proposed information divergence, R² and pseudo-R² measures were implemented in R (www.R-project.org) and code is available upon request.
