Machine Boss: rapid prototyping of bioinformatic automata

Bioinformatics ◽

10.1093/bioinformatics/btaa633 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jordi Silvestre-Ryan ◽

Yujie Wang ◽

Mehak Sharma ◽

Stephen Lin ◽

Yolanda Shen ◽

...

Keyword(s):

Data Storage ◽

Markov Models ◽

Software Tool ◽

Supplementary Information ◽

Time Saving ◽

Parameter Fitting ◽

Software Libraries ◽

Calculation Parameter ◽

Report Data ◽

DNA Alignment

Abstract Motivation Many software libraries for using Hidden Markov Models in bioinformatics focus on inference tasks, such as likelihood calculation, parameter-fitting and alignment. However, construction of the state machines can be a laborious task, automation of which would be time-saving and less error-prone. Results We present Machine Boss, a software tool implementing not just inference and parameter-fitting algorithms, but also a set of operations for manipulating and combining automata. The aim is to make prototyping of bioinformatics HMMs as quick and easy as the construction of regular expressions, with one-line ‘recipes’ for many common applications. We report data from several illustrative examples involving protein-to-DNA alignment, DNA data storage and nanopore sequence analysis. Availability and implementation Machine Boss is released under the BSD-3 open source license and is available from http://machineboss.org/. Supplementary information Supplementary data are available at Bioinformatics online.
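
The "likelihood calculation" inference task mentioned in the abstract can be illustrated with the standard forward algorithm for a Hidden Markov Model. The following is a minimal sketch in plain Python, not Machine Boss's own API or file format. The two-state toy model (`AT_rich`, `GC_rich`) and all of its probabilities are hypothetical values chosen only for illustration, and the model has no explicit end state.

```python
def forward_likelihood(obs, init, trans, emit):
    """Return P(obs) under an HMM using the forward algorithm.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][x]  : probability that state s emits symbol x
    """
    if not obs:
        return 1.0
    states = list(init)
    # alpha[s] = P(symbols seen so far, current state = s)
    alpha = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s].get(symbol, 0.0)
            * sum(alpha[r] * trans[r].get(s, 0.0) for r in states)
            for s in states
        }
    # Sum over the final state to marginalise out the hidden path.
    return sum(alpha.values())


# Hypothetical toy model: AT-rich versus GC-rich sequence composition.
INIT = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

if __name__ == "__main__":
    print(forward_likelihood("ACGT", INIT, TRANS, EMIT))
```

For long sequences a practical implementation would work in log space or rescale `alpha` at each step to avoid numerical underflow; that is omitted here for brevity.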


Machine Boss: Rapid Prototyping of Bioinformatic Automata

10.1101/2020.02.13.945071 ◽

2020 ◽

Cited By ~ 1

Author(s):

J. Silvestre-Ryan ◽

Y. Wang ◽

M. Sharma ◽

S. Lin ◽

Y. Shen ◽

...

Keyword(s):

Data Storage ◽

Markov Models ◽

Hidden Markov ◽

Software Tool ◽

Regular Expressions ◽

Time Saving ◽

Parameter Fitting ◽

Calculation Parameter ◽

Report Data ◽

DNA Alignment

Abstract Motivation Many C++ libraries for using Hidden Markov Models in bioinformatics focus on inference tasks, such as likelihood calculation, parameter-fitting, and alignment. However, construction of the state machines can be a laborious task, automation of which would be time-saving and less error-prone. Results We present Machine Boss, a software tool implementing not just inference and parameter-fitting algorithms, but also a set of operations for manipulating and combining automata. The aim is to make prototyping of bioinformatics HMMs as quick and easy as the construction of regular expressions, with one-line “recipes” for many common applications. We report data from several illustrative examples involving protein-to-DNA alignment, DNA data storage, and nanopore sequence analysis. Availability and Implementation Machine Boss is released under the BSD-3 open source license and is available from http://machineboss.org/. Contact Ian Holmes, [email protected]


MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Bioinformatics ◽

10.1093/bioinformatics/btaa045 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2690-2696

Author(s):

Jarkko Toivonen ◽

Pratyush K Das ◽

Jussi Taipale ◽

Esko Ukkonen

Keyword(s):

Markov Models ◽

Expectation Maximization Algorithm ◽

Software Tool ◽

Specific Weight ◽

Training Data ◽

Supplementary Information ◽

Markov Modeling ◽

Binding Motifs ◽

The Difference ◽

Probability Matrices

Abstract Motivation Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. Results We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Availability and implementation Software implementation is available from https://github.com/jttoivon/moder2. Supplementary information Supplementary data are available at Bioinformatics online.
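
The difference between a PPM and a first-order model (ADM) can be made concrete with a small sketch: instead of one independent distribution per position, each position after the first stores a distribution conditioned on the preceding nucleotide. The code below is an illustrative assumption, not the MODER2 algorithm. It fits a single monomer ADM by simple counting from already-aligned sites, with no mixture, no dimers, and no expectation maximization. The function names, the pseudocount, and the example sites are all hypothetical.

```python
ALPHABET = "ACGT"


def fit_adm(sites, pseudocount=1.0):
    """Fit a position-specific first-order Markov model from aligned sites.

    Returns (first, cond) where first[b] = P(b at position 0) and
    cond[i - 1][a][b] = P(b at position i | a at position i - 1).
    """
    length = len(sites[0])
    counts0 = {b: pseudocount for b in ALPHABET}
    for site in sites:
        counts0[site[0]] += 1
    total0 = sum(counts0.values())
    first = {b: c / total0 for b, c in counts0.items()}

    cond = []
    for i in range(1, length):
        counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        # Normalise each row so that it is a conditional distribution.
        cond.append({
            a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()
        })
    return first, cond


def adm_probability(seq, model):
    """Probability of seq under the fitted ADM (chain rule over positions)."""
    first, cond = model
    p = first[seq[0]]
    for i in range(1, len(seq)):
        p *= cond[i - 1][seq[i - 1]][seq[i]]
    return p


if __name__ == "__main__":
    # Hypothetical aligned binding sites, for illustration only.
    model = fit_adm(["ACG", "ACG", "TTG", "ACA"])
    print(adm_probability("ACG", model), adm_probability("ATG", model))
```

In the example sites, `C` follows `A` far more often than `T` does, so the model assigns `ACG` a higher probability than `ATG`; a PPM cannot express that kind of adjacent-position dependency.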


The Use of a Simulation Model for High-Runner Strategy Implementation in Warehouse Logistics

Sustainability ◽

10.3390/su12239818 ◽

2020 ◽

Vol 12 (23) ◽

pp. 9818

Author(s):

Gabriel Fedorko ◽

Vieroslav Molnár ◽

Nikoleta Mikušová

Keyword(s):

Genetic Algorithm ◽

Simulation Model ◽

Optimization Problem ◽

Software Tool ◽

Time Saving ◽

General Validity ◽

Resistance Function ◽

Genetic Algorithm Method ◽

Working Day ◽

Enterprise Logistics

This paper examines the use of computer simulation methods to streamline the process of picking materials within warehouse logistics. The article describes the use of a genetic algorithm to optimize the storage of materials in shelving positions, in accordance with the method of High-Runner Strategy. The goal is to minimize the time needed for picking. The presented procedure enables the creation of a software tool in the form of an optimization model that can be used for the needs of the optimization of warehouse logistics processes within various types of production processes. There is a defined optimization problem in the form of a resistance function, which is of general validity. The optimization is represented using the example of 400 types of material items in 34 categories, stored in six rack rows. Using a simulation model, a comparison of a normal and an optimized state is realized, while a time saving of 48 min 36 s is achieved. The mentioned saving was achieved within one working day. However, the application of an approach based on the use of optimization using a genetic algorithm is not limited by the number of material items or the number of categories and shelves. The acquired knowledge demonstrates the application possibilities of the genetic algorithm method, even for the lowest levels of enterprise logistics, where the application of this approach is not yet a matter of course but, rather, a rarity.


TriPOINT: a software tool to prioritize important genes in pathways and their non-coding regulators

Bioinformatics ◽

10.1093/bioinformatics/bty998 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2686-2689

Author(s):

Asa Thibodeau ◽

Dong-Guk Shin

Keyword(s):

Gene Expression ◽

Software Tool ◽

Supplementary Information ◽

Analysis Tool ◽

Graph Representations ◽

Expression Levels ◽

Conducting Pathway ◽

Pathway Analysis Tool ◽

Pathway Analyses ◽

Gene Expression Levels

Abstract Summary Current approaches for pathway analyses focus on representing gene expression levels on graph representations of pathways and conducting pathway enrichment among differentially expressed genes. However, gene expression levels by themselves do not reflect the overall picture, as non-coding factors play an important role in regulating gene expression. To incorporate these non-coding factors into pathway analyses and to systematically prioritize genes in a pathway, we introduce a new software tool, Triangulation of Perturbation Origins and Identification of Non-Coding Targets (TriPOINT). TriPOINT is a pathway analysis tool, implemented in Java, that identifies the significance of a gene under a condition (e.g. a disease phenotype) by studying graph representations of pathways, analyzing upstream and downstream gene interactions and integrating non-coding regions that may be regulating gene expression levels. Availability and implementation The TriPOINT open source software is freely available at https://github.uconn.edu/ajt06004/TriPOINT under the GPL v3.0 license. Supplementary information Supplementary data are available at Bioinformatics online.


VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Abstract Summary Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and to understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows the users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source codes of VisFeature are freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Genetic association testing using the GENESIS R/Bioconductor package

Bioinformatics ◽

10.1093/bioinformatics/btz567 ◽

2019 ◽

Cited By ~ 20

Author(s):

Stephanie M Gogarten ◽

Tamar Sofer ◽

Han Chen ◽

Chaoyu Yu ◽

Jennifer A Brody ◽

...

Keyword(s):

Data Storage ◽

Genomic Analysis ◽

Supplementary Information ◽

Storage And Retrieval ◽

Association Testing ◽

Link Functions ◽

Efficient Storage ◽

Genetic Association Testing ◽

Analysis Workflow ◽

Complete Genomic

Abstract Summary The Genomic Data Storage (GDS) format provides efficient storage and retrieval of genotypes measured by microarrays and sequencing. We developed GENESIS to perform various single- and aggregate-variant association tests using genotype data stored in GDS format. GENESIS implements highly flexible mixed models, allowing for different link functions, multiple variance components and phenotypic heteroskedasticity. GENESIS integrates cohesively with other R/Bioconductor packages to build a complete genomic analysis workflow entirely within the R environment. Availability and implementation https://bioconductor.org/packages/GENESIS; vignettes included. Supplementary information Supplementary data are available at Bioinformatics online.


Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

DNA Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
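
The general idea of classifying a fragment with k-th order Markov models can be sketched as follows: train one model per reference genome from counts of each length-k context and the base that follows it, then assign a fragment to the genome whose model gives it the highest log-likelihood. This is a minimal illustration under stated assumptions, not the authors' C++ implementation. The function names, the additive pseudocount smoothing, the tiny order of 2, and the synthetic "genomes" are all hypothetical.

```python
import math
from collections import defaultdict


def train_markov(genome, order, pseudocount=1.0):
    """Count, for every length-`order` context, which base follows it."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1
    return {"order": order, "counts": counts, "pseudocount": pseudocount}


def log_likelihood(fragment, model):
    """Log-probability of the fragment given its first `order` bases."""
    k = model["order"]
    pc = model["pseudocount"]
    total = 0.0
    for i in range(k, len(fragment)):
        # .get avoids creating new entries for contexts unseen in training.
        row = model["counts"].get(fragment[i - k:i], {})
        numerator = row.get(fragment[i], 0.0) + pc
        denominator = sum(row.values()) + 4 * pc  # 4 = DNA alphabet size
        total += math.log(numerator / denominator)
    return total


def classify(fragment, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name]))


if __name__ == "__main__":
    # Hypothetical synthetic references, for illustration only.
    models = {
        "taxonA": train_markov("ACGT" * 50, order=2),
        "taxonB": train_markov("AATT" * 50, order=2),
    }
    print(classify("ACGTACGTAC", models))
```

The abstract's higher-order models use contexts of nine or more bases; with 4^k possible contexts, the count tables grow quickly, which is why training data volume and memory were historically the limiting factors.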


SeqEditor: an application for primer design and sequence analysis with or without GTF/GFF files

Bioinformatics ◽

10.1093/bioinformatics/btaa903 ◽

2020 ◽

Author(s):

Ahmed Hafez ◽

Ricardo Futami ◽

Amir Arastehfar ◽

Farnaz Daneshnia ◽

Ana Miguel ◽

...

Keyword(s):

Protein Sequences ◽

Software Tool ◽

Primer Design ◽

Reference Sequence ◽

Supplementary Information ◽

Interactive Software ◽

RNA Sequences ◽

Content Mining ◽

Species Specific ◽

Flexible Application

Abstract Motivation Sequence analyses oriented to investigate specific features, patterns and functions of protein and DNA/RNA sequences usually require tools based on graphic interfaces whose main characteristic is their intuitiveness and interactivity with the user’s expertise, especially when curation or primer design tasks are required. However, interface-based tools usually pose certain computational limitations when managing large sequences or complex datasets, such as genome and transcriptome assemblies. With these requirements in mind, we have developed SeqEditor, an interactive software tool for nucleotide and protein sequence analysis. Results SeqEditor is a cross-platform desktop application for the analysis of nucleotide and protein sequences. It is managed through a Graphical User Interface and can work either as a graphical sequence browser or as a fasta task manager for multi-fasta files. SeqEditor has been optimized for the management of large sequences, such as contigs, scaffolds or even chromosomes, and includes a GTF/GFF viewer to visualize and manage annotation files. In turn, this allows for content mining from reference genomes and transcriptomes with similar efficiency to that of command line tools. SeqEditor also incorporates a set of tools for singleplex and multiplex PCR primer design and pooling that uses a newly optimized and validated search strategy for target and species-specific primers. All these features make SeqEditor a flexible application that can be used to analyze complex sequences, design primers in PCR assays oriented for diagnosis, and/or manage, edit and personalize reference sequence datasets. Availability and implementation SeqEditor was developed in Java using Eclipse Rich Client Platform and is publicly available at https://gpro.biotechvana.com/download/SeqEditor as binaries for Windows, Linux and Mac OS. The user manual and tutorials are available online at https://gpro.biotechvana.com/tool/seqeditor/manual. Supplementary information Supplementary data are available at Bioinformatics online.


Jasmine: a Java pipeline for isomiR characterization in miRNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz806 ◽

2019 ◽

Cited By ~ 2

Author(s):

Xiangfu Zhong ◽

Albert Pla ◽

Simon Rayner

Keyword(s):

Population Structure ◽

Software Tool ◽

Supplementary Information ◽

Supplementary Data ◽

Analysis Pipeline ◽

Detailed Characterization ◽

Fasta Format ◽

Java Application

Abstract Motivation The existence of complex subpopulations of miRNA isoforms, or isomiRs, is well established. While many tools exist for investigating isomiR populations, they differ in how they characterize an isomiR, making it difficult to compare results across different tools. Thus, there is a need for a more comprehensive and systematic standard for defining isomiRs. Such a standard would allow investigation of isomiR population structure in progressively more refined sub-populations, permitting the identification of more subtle changes between conditions and leading to an improved understanding of the processes that generate these differences. Results We developed Jasmine, a software tool that incorporates a hierarchical framework for characterizing isomiR populations. Jasmine is a Java application that can process raw read data in fastq/fasta format, or mapped reads in SAM format, to produce a detailed characterization of isomiR populations. Thus, Jasmine can reveal structure not apparent in a standard miRNA-Seq analysis pipeline. Availability and implementation Jasmine is implemented in Java and R and is freely available on Bitbucket at https://bitbucket.org/bipous/jasmine/src/master/. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


geneCo: a visualized comparative genomic method to analyze multiple genome structures

Bioinformatics ◽

10.1093/bioinformatics/btz596 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5303-5305 ◽

Cited By ~ 4

Author(s):

Jaehee Jung ◽

Jong Im Kim ◽

Gangman Yi

Keyword(s):

Genome Structure ◽

Software Tool ◽

Detailed Comparison ◽

Supplementary Information ◽

Comparative Genomic ◽

Web Based ◽

Computational Environment ◽

Gene Comparison ◽

User Data ◽

Gain Loss

Abstract Summary In comparative and evolutionary genomics, a detailed comparison of common features between organisms is essential to evaluate genetic distance. However, identifying differences in matched and mismatched genes among multiple genomes is difficult using current comparative genomic approaches due to complicated methodologies or the generation of meager information from obtained results. This study describes a visualized software tool, geneCo (gene Comparison), for comparing genome structure and gene arrangements between various organisms. User data are aligned, gene information is recognized, and genome structures are compared based on user-defined GenBank files. Information regarding inversion, gain, loss, duplication and gene rearrangement among multiple organisms being compared is provided by geneCo, which uses a web-based interface that users can easily access without any need to consider the computational environment. Availability and implementation Users can freely use the software, which is accessible at https://bigdata.dongguk.edu/geneCo. The main module of geneCo is implemented in Python, and the web-based user interface is built with PHP, HTML and CSS to support all browsers. Supplementary information Supplementary data are available at Bioinformatics online.
