bioSyntax: Syntax Highlighting For Computational Biology

Abstract

Computational biology requires the reading and comprehension of biological data files. Plain-text formats such as SAM, VCF, GTF, PDB and FASTA often contain critical information that is obfuscated by the complexity of the data structures. bioSyntax (http://bioSyntax.org) is a freely available suite of syntax highlighting packages for vim, gedit, Sublime, and less, which helps computational scientists parse and work with their data more efficiently.
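To illustrate the general idea of highlighting a plain-text biological format, here is a minimal sketch that colours a FASTA record with ANSI escape codes. It is not bioSyntax's implementation, which ships as editor-specific syntax definitions; the function name `highlight_fasta` and the colour choices are assumptions made for this example.

```python
# Hypothetical colour scheme: one ANSI colour per nucleotide (not bioSyntax's palette).
ANSI = {"A": "\033[32m", "C": "\033[34m", "G": "\033[33m", "T": "\033[31m"}
BOLD = "\033[1m"
RESET = "\033[0m"


def highlight_fasta(text):
    """Return FASTA text with bold header lines and coloured A/C/G/T bases."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: make the whole line bold.
            out.append(BOLD + line + RESET)
            continue
        coloured = []
        for ch in line:
            colour = ANSI.get(ch.upper())
            # Characters without a colour (N, gaps, and so on) are left unchanged.
            coloured.append(colour + ch + RESET if colour else ch)
        out.append("".join(coloured))
    return "\n".join(out)


if __name__ == "__main__":
    print(highlight_fasta(">seq1 example\nACGTNNACGT"))
```

Run in a terminal that understands ANSI escapes, this prints the header in bold and each base in its own colour, which makes runs of `N` or unexpected characters easy to spot.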

AGEpy: a Python package for computational biology

10.1101/450890 ◽

2018 ◽

Cited By ~ 1

Author(s):

Franziska Metge ◽

Robert Sehlke ◽

Jorge Boucas

Keyword(s):

Computational Biology ◽

Open Source ◽

High Throughput ◽

Biological Data ◽

Command Line ◽

High Throughput Analysis ◽

Throughput Analysis ◽

Link Type ◽

Biological Meaning ◽

Python Package

Abstract

Summary: AGEpy is a Python package focused on the transformation of interpretable data into biological meaning. It is designed to support high-throughput analysis of pre-processed biological data using either local Python-based processing or Python-based API calls to local or remote servers. In this application note we describe its different Python modules as well as its command-line accessible tools aDiff, abed, blasto, david, and obo2tsv.

Availability: The open source AGEpy Python package is freely available at: https://github.com/mpg-age-bioinformatics/AGEpy.

Contact: [email protected]

LevioSAM: Fast lift-over of alternate reference alignments

10.1101/2021.02.05.429867 ◽

2021 ◽

Author(s):

Taher Mun ◽

Nae-Chyun Chen ◽

Ben Langmead

Keyword(s):

Population Genetics ◽

Coordinate System ◽

Data Structures ◽

Succinct Data Structures ◽

Reference Coordinate System ◽

Link Type ◽

A Chain ◽

Time Required ◽

Effective Use

Abstract

Motivation: As more population genetics datasets and population-specific references become available, the task of translating (“lifting”) read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck.

Results: LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM completes in less than 13% of the time required by an aligner when both are run with 16 threads.

Availability: https://github.com/alshai/[email protected], [email protected]
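The idea of lifting a coordinate using variants can be sketched as follows. This is a toy illustration, not levioSAM's algorithm (which, per the abstract, relies on succinct data structures). It assumes variants are given as sorted, non-overlapping `(pos, ref, alt)` tuples with 1-based VCF-style positions, and the names `build_offsets` and `lift` are hypothetical.

```python
from bisect import bisect_right


def build_offsets(variants):
    """Collect indel positions and the cumulative length shift after each one.

    variants: sorted list of (pos, ref, alt) on the source reference (1-based).
    Substitutions (same ref/alt length) do not move coordinates and are skipped.
    """
    positions, shifts = [], []
    total = 0
    for pos, ref, alt in variants:
        delta = len(alt) - len(ref)
        if delta != 0:
            total += delta
            positions.append(pos)
            shifts.append(total)
    return positions, shifts


def lift(pos, positions, shifts):
    """Map a source coordinate to the variant-containing reference.

    Assumption: pos does not fall inside a deleted span; such bases have no
    image in the target reference and this sketch does not detect them.
    """
    # Count indels whose anchor base lies strictly before pos.
    i = bisect_right(positions, pos - 1)
    return pos + (shifts[i - 1] if i else 0)


# Toy variants: a 1-base insertion after position 5, a 2-base deletion
# after position 10, and a SNP at position 20.
variants = [(5, "A", "AT"), (10, "ATG", "A"), (20, "C", "T")]
positions, shifts = build_offsets(variants)

for p in (3, 5, 8, 13, 25):
    print(p, "->", lift(p, positions, shifts))
```

Precomputing the cumulative shifts lets each lookup run as a single binary search, which is the property a real tool would want when lifting millions of alignments.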

2017 ISCB Accomplishment by a Senior Scientist Award: Pavel Pevzner

F1000Research ◽

10.12688/f1000research.11588.1 ◽

2017 ◽

Vol 6 ◽

pp. 1001

Author(s):

Christiana N. Fogg ◽

Diane E. Kovats ◽

Bonnie Berger

Keyword(s):

Computational Biology ◽

Intelligent Systems ◽

University Of California ◽

Joint Meeting ◽

Massachusetts Institute ◽

Massachusetts Institute Of Technology ◽

Senior Scientist ◽

Link Type ◽

Institute Of Technology ◽

Research Service

The International Society for Computational Biology (ISCB) recognizes an established scientist each year with the Accomplishment by a Senior Scientist Award for significant contributions he or she has made to the field. This award honors scientists who have contributed to the advancement of computational biology and bioinformatics through their research, service, and education work. Pavel Pevzner, PhD, Ronald R. Taylor Professor of Computer Science and Director of the NIH Center for Computational Mass Spectrometry at University of California, San Diego, has been selected as the winner of the 2017 Accomplishment by a Senior Scientist Award. The ISCB awards committee, chaired by Dr. Bonnie Berger of the Massachusetts Institute of Technology, selected Pevzner as the 2017 winner. Pevzner will receive his award and deliver a keynote address at the 2017 Intelligent Systems for Molecular Biology-European Conference on Computational Biology joint meeting (ISMB/ECCB 2017) held in Prague, Czech Republic from July 21-July 25, 2017. ISMB/ECCB is a biennial joint meeting that brings together leading scientists in computational biology and bioinformatics from around the globe.

gprofiler2 -- an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler

F1000Research ◽

10.12688/f1000research.24956.2 ◽

2020 ◽

Vol 9 ◽

pp. 709 ◽

Cited By ~ 1

Author(s):

Liis Kolberg ◽

Uku Raudvere ◽

Ivan Kuzmin ◽

Jaak Vilo ◽

Hedi Peterson

Keyword(s):

Gene List ◽

Enrichment Analysis ◽

Functional Enrichment Analysis ◽

Automated Analysis ◽

R Package ◽

Biological Data ◽

Functional Enrichment ◽

Link Type ◽

Functional Profiling ◽

Rest Api

g:Profiler (https://biit.cs.ut.ee/gprofiler) is a widely used gene list functional profiling and namespace conversion toolset that has been contributing to reproducible biological data analysis since 2007. Here we introduce the accompanying R package, gprofiler2, developed to facilitate programmatic access to g:Profiler computations and databases via REST API. The gprofiler2 package provides easy-to-use functionality that enables researchers to incorporate functional enrichment analysis into automated analysis pipelines written in R. The package also implements interactive visualisation methods to help interpret the enrichment results and to illustrate them for publications. In addition, gprofiler2 gives access to the versatile gene/protein identifier conversion functionality in g:Profiler, enabling mapping between hundreds of different identifier types or orthologous species. The gprofiler2 package is freely available at the CRAN repository.

Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions

PeerJ Computer Science ◽

10.7717/peerj-cs.90 ◽

2016 ◽

Vol 2 ◽

pp. e90 ◽

Cited By ~ 24

Author(s):

Ranko Gacesa ◽

David J. Barlow ◽

Paul F. Long

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Biological Data ◽

Biological Databases ◽

Web Based ◽

Physiological Functions ◽

Link Type ◽

Venom Toxins ◽

Venomous Animals ◽

Toxin Protein

Ascribing function to sequence in the absence of biological data is an ongoing challenge in bioinformatics. Differentiating the toxins of venomous animals from homologues having other physiological functions is particularly problematic as there are no universally accepted methods by which to attribute toxin function using sequence data alone. Bioinformatics tools that do exist are difficult to implement for researchers with little bioinformatics training. Here we announce a machine learning tool called ‘ToxClassifier’ that enables simple and consistent discrimination of toxins from non-toxin sequences with >99% accuracy and compare it to commonly used toxin annotation methods. ‘ToxClassifier’ also reports the best-hit annotation, allowing placement of a toxin into the most appropriate toxin protein family, or relates it to a non-toxic protein having the closest homology, giving enhanced curation of existing biological databases and new venomics projects. ‘ToxClassifier’ is available for free, either to download (https://github.com/rgacesa/ToxClassifier) or to use on a web-based server (http://bioserv7.bioinfo.pbf.hr/ToxClassifier/).

New Trends in Graph Mining

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/jkdb.2010100206 ◽

2010 ◽

Vol 1 (1) ◽

pp. 81-99 ◽

Cited By ~ 6

Author(s):

Francesco Bruno ◽

Luigi Palopoli ◽

Simona E. Rombo

Keyword(s):

Computational Biology ◽

Biological Networks ◽

Graph Mining ◽

Building Blocks ◽

Biological Data ◽

Future Research ◽

Motif Extraction

Searching for repeated features characterizing biological data is fundamental in computational biology. When biological networks are under analysis, the presence of repeated modules across the same network (or several distinct ones) is shown to be very relevant. Indeed, several studies show that biological networks can often be understood in terms of coalitions of basic repeated building blocks, often referred to as network motifs. This work provides a review of the main techniques proposed for motif extraction from biological networks. In particular, the main intrinsic difficulties related to the problem are pointed out, along with solutions proposed in the literature to overcome them. Open challenges and directions for future research are finally discussed.
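As a concrete picture of a repeated building block, the sketch below counts occurrences of one small pattern, the feed-forward loop (edges a→b, b→c and a→c), in a directed network. It is a brute-force illustration under stated assumptions, not one of the techniques surveyed in the review: the function name `count_feed_forward_loops` and the toy edge list are invented for the example, and real motif extraction typically also compares the counts against randomised networks.

```python
from itertools import permutations


def count_feed_forward_loops(edges):
    """Count node triples (a, b, c) with directed edges a->b, b->c and a->c.

    Brute force over all ordered triples, so only suitable for small networks.
    """
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    count = 0
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set:
            count += 1
    return count


# Toy regulatory network: x regulates y and z, y regulates z, z regulates w.
EDGES = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
ffl_count = count_feed_forward_loops(EDGES)
print("feed-forward loops:", ffl_count)
```

The cubic cost of enumerating every triple hints at one of the intrinsic difficulties the abstract mentions: exhaustive pattern search becomes expensive quickly as networks and patterns grow.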

An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study

F1000Research ◽

10.12688/f1000research.9110.1 ◽

2016 ◽

Vol 5 ◽

pp. 1574 ◽

Cited By ~ 19

Author(s):

Zichen Wang ◽

Avi Ma'ayan

Keyword(s):

Small Molecules ◽

Zika Virus ◽

Principal Component ◽

Global Gene Expression ◽

Brain Morphology ◽

Rna Seq ◽

Link Type ◽

Neuronal Progenitors ◽

Global Gene Expression Profiling ◽

Data Files

RNA-seq analysis is becoming a standard method for global gene expression profiling. However, open and standard pipelines to perform RNA-seq analysis by non-experts remain challenging due to the large size of the raw data files and the hardware requirements for running the alignment step. Here we introduce a reproducible open source RNA-seq pipeline delivered as an IPython notebook and a Docker image. The pipeline uses state-of-the-art tools and can run on various platforms with minimal configuration overhead. The pipeline enables the extraction of knowledge from typical RNA-seq studies by generating interactive principal component analysis (PCA) and hierarchical clustering (HC) plots, performing enrichment analyses against over 90 gene set libraries, and obtaining lists of small molecules that are predicted to either mimic or reverse the observed changes in mRNA expression. We apply the pipeline to a recently published RNA-seq dataset collected from human neuronal progenitors infected with the Zika virus (ZIKV). In addition to confirming the presence of cell cycle genes among the genes that are downregulated by ZIKV, our analysis uncovers significant overlap with upregulated genes that when knocked out in mice induce defects in brain morphology. This result potentially points to the molecular processes associated with the microcephaly phenotype observed in newborns from pregnant mothers infected with the virus. In addition, our analysis predicts small molecules that can either mimic or reverse the expression changes induced by ZIKV. The IPython notebook and Docker image are freely available at: http://nbviewer.jupyter.org/github/maayanlab/Zika-RNAseq-Pipeline/blob/master/Zika.ipynb and https://hub.docker.com/r/maayanlab/zika/.

jFuzzyMachine – An Open–source Fuzzy Logic–based Regulatory Inference Engine for High–throughput Biological Data

10.1101/2020.10.06.315994 ◽

2020 ◽

Author(s):

Paul Aiyetan

Keyword(s):

Fuzzy Logic ◽

High Throughput ◽

Regulatory Network ◽

Network Inference ◽

Hct116 Cell ◽

Inference Engine ◽

Biological Data ◽

Inference System ◽

Apparent Lack ◽

Link Type

Abstract

Elucidating mechanistic relationships between and among intracellular macromolecules is fundamental to understanding the molecular basis of normal and diseased processes. Here, we introduce jFuzzyMachine, a fuzzy logic-based regulatory network inference engine for high-throughput biological data. We describe its design and implementation. We demonstrate its functions on a sampled expression profile of the vorinostat-resistant HCT116 cell line. We compared jFuzzyMachine’s inferred regulatory network to that inferred by the ARACNe (an Algorithm for the Reconstruction of Gene Regulatory Networks) tool. Potentially more sensitive, jFuzzyMachine showed a slight increase in identified regulatory edges compared to ARACNe. A significant overlap was also observed in the identified edges between the two inference methods: over 70 percent of edges identified by ARACNe were also identified by jFuzzyMachine. Beyond identifying edges, jFuzzyMachine shows the direction of interactions, including bidirectional interactions, specifying regulatory inputs and outputs of inferred relationships. jFuzzyMachine addresses an apparent lack of a freely available community tool implementing a fuzzy logic regulatory network inference method, mitigating a limitation to applying and extending the benefits of the fuzzy inference system to understanding biological data. jFuzzyMachine’s source code and precompiled binaries are freely available at the GitHub repository locations: https://github.com/paiyetan/jfuzzymachine and https://github.com/paiyetan/jfuzzymachine/releases/tag/v1.7.21.

A fully featured COMBINE archive of a simulation study on syncytial mitotic cycles in Drosophila embryos

F1000Research ◽

10.12688/f1000research.9379.1 ◽

2016 ◽

Vol 5 ◽

pp. 2421 ◽

Cited By ~ 5

Author(s):

Martin Scharm ◽

Dagmar Waltemath

Keyword(s):

Computational Biology ◽

Simulation Study ◽

Scientific Community ◽

Original Work ◽

Original Publication ◽

Valuable Resource ◽

Data Files ◽

Drosophila Embryos ◽

Archived Data

COMBINE archives are standardised containers for data files related to a simulation study in computational biology. This manuscript describes a fully featured archive of a previously published simulation study, including (i) the original publication, (ii) the model, (iii) the analyses, and (iv) metadata describing the files and their origin. With the archived data at hand, it is possible to reproduce the results of the original work. The archive can be used for both educational and research purposes. Anyone may reuse, extend and update the archive to make it a valuable resource for the scientific community.

Data organization in spreadsheets

10.7287/peerj.preprints.3183 ◽

2018 ◽

Cited By ~ 1

Author(s):

Karl W Broman ◽

Kara H. Woo

Keyword(s):

Data Entry ◽

Data Organization ◽

Data Dictionary ◽

Basic Principles ◽

Plain Text ◽

A Cell ◽

Data Files ◽

And Storage ◽

Font Color ◽

Practical Recommendations

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this paper offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, don't leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, don't include calculations in the raw data files, don't use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.
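A few of these principles can be checked mechanically once the data are saved as plain text. The sketch below is one possible illustration, not code from the paper: it assumes a comma-separated file with a single header row, and the function name `check_table`, the sample column names, and the sample values are all hypothetical.

```python
import csv
import io
import re

# Dates are expected in the YYYY-MM-DD form recommended above.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_table(text, date_columns=()):
    """Return (row_number, column, problem) tuples for a CSV string.

    Checks three things: every row has as many cells as the header
    (a single rectangle), no cell is empty, and the named date columns
    match the YYYY-MM-DD pattern.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    problems = []
    for number, row in enumerate(body, start=2):  # row 1 is the header
        if len(row) != len(header):
            problems.append((number, None, "row length differs from header"))
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append((number, name, "empty cell"))
            elif name in date_columns and not ISO_DATE.match(value):
                problems.append((number, name, "date not YYYY-MM-DD"))
    return problems


# Made-up sample data with one badly formatted date and one empty cell.
SAMPLE = """subject,visit_date,weight_kg
s01,2018-03-01,71.2
s02,3/2/2018,68.0
s03,2018-03-03,
"""

problems = check_table(SAMPLE, date_columns=("visit_date",))
for problem in problems:
    print(problem)
```

The pattern only checks the shape of a date, not whether it is a valid calendar day; a stricter check could parse each value with `datetime.date.fromisoformat`.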
