Explore, edit and leverage genomic annotations using Python GTF toolkit

2019 ◽  
Vol 35 (18) ◽  
pp. 3487-3488 ◽  
Author(s):  
F Lopez ◽  
G Charbonnier ◽  
Y Kermezli ◽  
M Belhocine ◽  
Q Ferré ◽  
...  

Abstract Motivation: While Python has become very popular in bioinformatics, a limited number of libraries exist for fast manipulation of gene coordinates in Ensembl GTF format. Results: We have developed the GTF toolkit Python package (pygtftk), which aims at providing easy and powerful manipulation of gene coordinates in GTF format. For optimal performance, the core engine of pygtftk is a C dynamic library (libgtftk) while the Python API provides usability and readability for developing scripts. Based on this Python package, we have developed the gtftk command line interface that contains 57 sub-commands (v0.9.10) to ease handling of GTF files. These commands may be used to (i) perform basic tasks (e.g. selections, insertions, updates or deletions of features/keys), (ii) select genes/transcripts based on various criteria (e.g. size, exon number, transcription start site location, intron length, GO terms) or (iii) carry out more advanced operations such as coverage analyses of genomic features using bigWig files to create faceted read-coverage diagrams. In conclusion, the pygtftk package greatly simplifies the annotation of GTF files with external information while providing advanced tools to perform gene analyses. Availability and implementation: pygtftk and gtftk have been tested on Linux and MacOSX and are available from https://github.com/dputhier/pygtftk under the MIT license. The libgtftk dynamic library written in C is available from https://github.com/dputhier/libgtftk.
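
To make the "basic tasks" in the abstract above concrete, here is a minimal sketch of feature selection on GTF records using only the Python standard library. It does not use or reproduce the pygtftk API; the helper names (`parse_gtf_line`, `select_features`, `gtf_row`) and the sample records are hypothetical, and the attribute regex assumes that values contain no embedded semicolons or quotes.

```python
import re
from collections import Counter

# One `key "value";` pair from the attribute column; quotes are optional
# because some GTF producers leave numeric values unquoted.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"?([^";]*)"?\s*;')


def parse_gtf_line(line):
    """Turn one tab-separated GTF record into a dict."""
    seqid, source, feature, start, end, score, strand, frame, attrs = (
        line.rstrip("\n").split("\t")
    )
    return {
        "seqid": seqid,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE_RE.findall(attrs)),
    }


def select_features(lines, feature=None, **wanted):
    """Yield records of the given feature type whose attributes match `wanted`."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        record = parse_gtf_line(line)
        if feature is not None and record["feature"] != feature:
            continue
        if all(record["attributes"].get(k) == v for k, v in wanted.items()):
            yield record


def gtf_row(feature, start, end, attrs):
    """Build a made-up GTF line for the demonstration below."""
    return "\t".join(
        ["chr1", "example", feature, str(start), str(end), ".", "+", ".", attrs]
    )


# Hypothetical records, not taken from any real annotation.
GTF_LINES = [
    "#!genome-build hypothetical",
    gtf_row("gene", 100, 900, 'gene_id "G1"; gene_biotype "protein_coding";'),
    gtf_row("exon", 100, 200, 'gene_id "G1"; transcript_id "T1"; exon_number 1;'),
    gtf_row("exon", 400, 500, 'gene_id "G1"; transcript_id "T1"; exon_number 2;'),
    gtf_row("exon", 700, 900, 'gene_id "G1"; transcript_id "T2"; exon_number 1;'),
]

# Count exons per transcript, one of the selection criteria the abstract lists.
exon_counts = Counter(
    rec["attributes"]["transcript_id"]
    for rec in select_features(GTF_LINES, feature="exon")
)
```

A pure-Python parser like this is convenient for small files; the abstract's point is that a compiled core is preferable when whole-genome annotations have to be processed quickly.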

2019 ◽  
Vol 214 ◽  
pp. 06027
Author(s):  
Adrian Bevan ◽  
Thomas Charman ◽  
Jonathan Hays

HIPSTER (Heavily Ionising Particle Standard Toolkit for Event Recognition) is an open source Python package designed to facilitate the use of TensorFlow in a high energy physics analysis context. The core functionality of the software is presented, with images from the MoEDAL experiment Nuclear Track Detectors (NTDs) serving as an example dataset. Convolutional neural networks are selected as the classification algorithm for this dataset and the process of training a variety of models with different hyper-parameters is detailed. Next the results are shown for the MoEDAL problem demonstrating the rich information output by HIPSTER that enables the user to probe the performance of their model in detail.


2014 ◽  
Vol 2 (1) ◽  
pp. 76-95
Author(s):  
Martin Oja

Abstract The main purpose of the article is to bring more clarity to the concept of art film, shedding light on the mechanisms of subjective reception and evaluating the presence of subjectivity-inducing segments as the grounds for defining art film. The second aim is to take a fresh look at the little-discussed Estonian art cinema, drawing on a framework of cognitive film studies in order to analyse its borders and characteristics. I will evaluate the use of darkness as a device for creating meaning, both independently of and combined with other visual or auditory devices. The dark screen, although not always a major factor in the creation of subjectivity, accompanies the core problem both directly and metaphorically: what happens to the viewer when external information is absent? I will look at the subjectivity-inducing devices in the films of two Estonian directors, Sulev Keedus and Veiko Õunpuu. For the theoretical background, I rely mostly on Torben Grodal’s idea about the subjective mode as a main characteristic of art film, and the disruption of character simulation as the basis for the film viewer’s subjectivity.


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Quentin Ferré ◽  
Cécile Capponi ◽  
Denis Puthier

Abstract Most epigenetic marks, such as Transcriptional Regulators or histone marks, are biological objects known to work together in n-wise complexes. A suitable way to infer such functional associations between them is to study the overlaps of the corresponding genomic regions. However, the problem of the statistical significance of n-wise overlaps of genomic features is seldom tackled, which prevents rigorous studies of n-wise interactions. We introduce OLOGRAM-MODL, which considers overlaps between n ≥ 2 sets of genomic regions, and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, resulting in more resolutive P-values. An optional machine learning method is proposed to find complexes of interest, using a new itemset mining algorithm based on dictionary learning which is resistant to noise inherent to biological assays. The overall approach is implemented through an easy-to-use command line interface for workflow integration, and a visual tree-based representation of the results suited for explicability. The viability of the method is experimentally studied using both artificial and biological data. This approach is accessible through the command line interface of the pygtftk toolkit, available on Bioconda and from https://github.com/dputhier/pygtftk
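
The general idea of fitting a Negative Binomial to Monte Carlo null counts and reading a P-value off the fitted tail can be sketched as follows. This is an illustration under stated assumptions, not the OLOGRAM-MODL implementation: the method-of-moments fit, the function names and the null counts are all chosen here for the example, and the abstract does not say how the tool performs its fit.

```python
import math
from statistics import mean, variance


def fit_negative_binomial(null_counts):
    """Method-of-moments fit; returns (r, p) with mean = r * (1 - p) / p."""
    m = mean(null_counts)
    v = variance(null_counts)  # sample variance
    if v <= m:
        raise ValueError("variance must exceed the mean for a Negative Binomial")
    p = m / v
    r = m * m / (v - m)
    return r, p


def nb_log_pmf(k, r, p):
    """log P(X = k) for a Negative Binomial with real-valued r."""
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )


def upper_tail_pvalue(observed, r, p):
    """P(X >= observed) under the fitted distribution."""
    below = sum(math.exp(nb_log_pmf(k, r, p)) for k in range(observed))
    return max(0.0, 1.0 - below)


# Hypothetical overlap counts obtained from ten shuffles of the regions.
null_counts = [3, 5, 2, 8, 4, 6, 1, 9, 3, 7]
r, p = fit_negative_binomial(null_counts)

# An observed overlap count far above every shuffled count.
p_value = upper_tail_pvalue(20, r, p)
```

With only ten shuffles, an empirical P-value cannot go below roughly 1/10, whereas the fitted tail can, which is one reading of the "more resolutive P-values" mentioned in the abstract. Computing the tail as one minus a sum loses precision for extremely small probabilities, so a production version would need a more careful tail computation.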


2021 ◽  
Author(s):  
Pamela H Russell ◽  
Ian T Fiddes

Motivation: Bioinformaticians frequently navigate among a diverse set of coordinate systems: for example, converting between genomic, transcript, and protein coordinates. The abstraction of coordinate systems and feature arithmetic allows genomic workflows to be expressed more elegantly and succinctly. However, no publicly available software library offers fully featured interoperable support for multiple coordinate systems. As such, bioinformatics programmers must either implement custom solutions, or make do with existing utilities, which may lack the full functionality they require. Results: We present BioCantor, a Python library that provides integrated library support for arbitrarily related coordinate systems and rich operations on genomic features, with I/O support for a variety of file formats. Availability and implementation: BioCantor is implemented as a Python 3 library with a minimal set of external dependencies. The library is freely available under the MIT license at https://github.com/InscriptaLabs/BioCantor and on the Python Package Index at https://pypi.org/project/BioCantor/. BioCantor has extensive documentation and vignettes available on ReadTheDocs at https://biocantor.readthedocs.io/en/latest/.
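
The kind of coordinate conversion the abstract above refers to can be illustrated with a small standalone sketch that maps positions between transcript and genomic coordinates for a spliced transcript. It does not use the BioCantor API; the function names are hypothetical, and exons are assumed to be non-overlapping, 0-based, half-open intervals.

```python
def transcript_to_genomic(exons, strand, tx_pos):
    """Map a 0-based transcript offset to a 0-based genomic position.

    `exons` is a list of (start, end) half-open genomic intervals.
    """
    if tx_pos < 0:
        raise ValueError("transcript position must be non-negative")
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = tx_pos
    for start, end in ordered:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


def genomic_to_transcript(exons, strand, pos):
    """Inverse mapping; raises if the genomic position is not exonic."""
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    raise ValueError("position is not exonic")


# Hypothetical two-exon transcript: 10 bp and 20 bp exons with an intron between.
EXONS = [(100, 110), (200, 220)]
```

For example, `transcript_to_genomic(EXONS, "+", 10)` lands on the first base of the second exon (position 200), while on the minus strand transcript offset 0 is the last base of the rightmost exon (position 219). A library with first-class coordinate systems spares users from re-deriving this strand and intron bookkeeping in every script.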


2010 ◽  
Vol 30 (14) ◽  
pp. 3471-3479 ◽  
Author(s):  
Joshua W. M. Theisen ◽  
Chin Yan Lim ◽  
James T. Kadonaga

ABSTRACT The RNA polymerase II core promoter is a diverse and complex regulatory element. To gain a better understanding of the core promoter, we examined the motif 10 element (MTE), which is located downstream of the transcription start site and acts in conjunction with the initiator (Inr). We found that the MTE promotes the binding of purified TFIID to the core promoter and that the TAF6 and TAF9 subunits of TFIID appear to be in close proximity to the MTE. To identify the specific nucleotides that contribute to MTE activity, we performed a detailed mutational analysis and determined a functional MTE consensus sequence. These studies identified favored as well as disfavored nucleotides and demonstrated the previously unrecognized importance of nucleotides in the subregion of nucleotides 27 to 29 (+27 to +29 relative to A+1 in the Inr consensus) for MTE function. Further analysis led to the identification of three downstream subregions (nucleotides 18 to 22, 27 to 29, and 30 to 33) that contribute to core promoter activity. The three binary combinations of these subregions lead to the MTE (nucleotides 18 to 22 and 27 to 29), a downstream core promoter element (nucleotides 27 to 29 and 30 to 33), and a novel “bridge” core promoter motif (nucleotides 18 to 22 and 30 to 33). These studies have thus revealed a tripartite organization of key subregions in the downstream core promoter.


2021 ◽  
Vol 12 ◽  
Author(s):  
Kexin Wang ◽  
Kai Li ◽  
Yupeng Chen ◽  
Genxia Wei ◽  
Hailang Yu ◽  
...  

Traditional Chinese medicine (TCM) usually exerts its therapeutic effects on complex diseases in the form of formulas. However, the multicomponent and multitarget characteristics of formulas bring great challenges to the mechanism analysis and secondary development of TCM in treating complex diseases. Modern bioinformatics provides a new opportunity for the optimization of TCM formulas. In this report, a new bioinformatics analysis based on a computational network pharmacology model was designed, taking the Chai-Hu-Shu-Gan-San (CHSGS) treatment of depression as a case study. In this model, an effective intervention space was constructed to depict the core network of the intervention effect transferred from component targets to pathogenic genes based on a novel node importance calculation method. The intervention-response proteins were selected from the effective intervention space, and the core group of functional components (CGFC) was selected based on these intervention-response proteins. Results show that the enriched pathways and GO terms of intervention-response proteins in the effective intervention space could cover 95.3 and 95.7% of the common pathways and GO terms that respond to the major functional therapeutic effects. Additionally, 71 of 1,012 components were predicted as the CGFC; the targets of the CGFC were enriched in 174 pathways, which cover 86.19% of the enriched pathways of pathogenic genes. Based on the CGFC, two major mechanism chains were inferred and validated. Finally, the core components in the CGFC were evaluated by in vitro experiments. These results indicate that the proposed model has good accuracy in screening the CGFC and inferring potential mechanisms in TCM formulas, which provides a reference for the optimization and mechanism analysis of TCM formulas.


2016 ◽  
Author(s):  
Elena D. Stavrovskaya ◽  
Tejasvi Niranjan ◽  
Elana J. Fertig ◽  
Sarah J. Wheelan ◽  
Alexander Favorov ◽  
...  

Abstract Motivation: Genomic features with similar genomewide distributions are generally hypothesized to be functionally related; for example, co-localization of histones and transcription start sites indicates chromatin regulation of transcription factor activity. Therefore, statistical algorithms to perform spatial, genomewide correlation among genomic features are required. Results: Here, we propose a method, StereoGene, that rapidly estimates genomewide correlation among pairs of genomic features. These features may represent high throughput data mapped to a reference genome or sets of genomic annotations in that reference genome. StereoGene enables correlation of continuous data directly, avoiding the data binarization and subsequent data loss. Correlations are computed among neighboring genomic positions using kernel correlation. Representing the correlation as a function of the genome position, StereoGene outputs the local correlation track as part of the analysis. StereoGene also accounts for confounders such as input DNA by partial correlation. We apply our method to numerous comparisons of ChIP-Seq datasets from the Human Epigenome Atlas and FANTOM CAGE to demonstrate its wide applicability. We observe the changes in the correlation between epigenomic features across developmental trajectories of several tissue types consistent with known biology, and find a novel spatial correlation of CAGE clusters with donor splice sites and with poly(A) sites. These analyses provide examples for the broad applicability of StereoGene for regulatory genomics. Availability: The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available from the project homepage http://stereogene.bioinf.fbb.msu.ru/ Contact: [email protected] Supplementary information: Supplementary data are available online.


2019 ◽  
Vol 12 (8) ◽  
pp. 3795-3803 ◽  
Author(s):  
Loïc Huder ◽  
Nicolas Gillet ◽  
Franck Thollard

Abstract. The pygeodyn package is a sequential geomagnetic data assimilation tool written in Python. It gives access to the core surface dynamics, controlled by geomagnetic observations, by means of a stochastic model anchored to geodynamo simulation statistics. The pygeodyn package aims to give access to a user-friendly and flexible data assimilation algorithm. It is designed to be tunable by the community by different means, including the following: the possibility to use embedded data and priors or to supply custom ones; tunable parameters through configuration files; and adapted documentation for several user profiles. In addition, output files are directly supported by the package webgeodyn that provides a set of visualization tools to explore the results of computations.


2018 ◽  
Author(s):  
Hong-Dong Li

Abstract Summary: Gene-centric bioinformatics studies frequently involve calculation or extraction of various features of genes such as gene ID mapping, GC content calculation and different types of gene lengths, through manipulation of gene models that are often annotated in GTF format and available from the ENSEMBL or GENCODE databases. Such computation is essential for subsequent analysis such as intron retention detection where independent introns may need to be identified, converting RNA-seq read counts to FPKM where gene length is required, and obtaining flanking regions around transcription start sites. However, to our knowledge, a software package that is dedicated to analyzing various modes of gene models directly from a GTF file is not publicly available. In this work, GTFtools (implemented in Python and not dependent on any non-Python third-party software), a stand-alone command-line software that provides a set of functions to analyze various modes of gene models, is provided for facilitating routine bioinformatics studies where information about gene models needs to be calculated. Availability: GTFtools is freely available at www.genemine.org/[email protected].
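
The link between gene length and FPKM mentioned in the abstract above can be shown with a short worked sketch. It is not GTFtools code: the function names are hypothetical, and the length of the union of exons is assumed here as the gene length, which is only one of the several gene length definitions the abstract alludes to.

```python
def merged_exon_length(exons):
    """Length in bp of the union of 1-based, inclusive exon intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping exon: extend the current block if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene whose transcripts contribute overlapping exons.
gene_length = merged_exon_length([(100, 200), (150, 300), (400, 450)])

# 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example_fpkm = fpkm(500, 2000, 10_000_000)
```

Here the first two exons merge into 100 to 300 (201 bp) and the third adds 51 bp, giving 252 bp; the FPKM example works out to 500 × 10^9 / (2000 × 10^7) = 25.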


2021 ◽  
Author(s):  
Hongxiang Cao ◽  
Zhangshuai Yang ◽  
Shu Song ◽  
Jiazong Liu ◽  
Ning Li

Abstract Banded leaf and sheath blight (BLSB) caused by the necrotrophic fungus Rhizoctonia solani is a devastating disease of maize worldwide, especially in China and Southeast Asia. To explore the maize defense mechanisms against R. solani expansion, the expression profile of maize infected by a low-virulence strain (LVS) and a high-virulence strain (HVS) of R. solani for 3 and 5 d was analyzed by RNA-sequencing. A total of 3015 and 1628 differentially expressed genes (DEGs) were identified under LVS and HVS infection, respectively. Meanwhile, these DEGs were classified by Gene Ontology (GO) for biological process analysis. Only defense-related GO terms were commonly enriched in LVS- and HVS-regulated genes. Furthermore, a core set of 388 up-regulated genes that are involved in maize response to R. solani infection were identified. Additionally, among the core genes, overexpressing ZmNAC41 and ZmBAK1 enhanced rice resistance to R. solani. Taken together, our study provides additional insight into maize defense mechanisms against R. solani, and the core genes identified in this study will be important resources for improving BLSB resistance in the future.

