MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Jarkko Toivonen; Pratyush K Das; Jussi Taipale; Esko Ukkonen

doi:10.1093/bioinformatics/btaa045

Bioinformatics ◽

10.1093/bioinformatics/btaa045 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2690-2696

Author(s):

Jarkko Toivonen ◽

Pratyush K Das ◽

Jussi Taipale ◽

Esko Ukkonen

Keyword(s):

Markov Models ◽

Expectation Maximization Algorithm ◽

Software Tool ◽

Specific Weight ◽

Training Data ◽

Supplementary Information ◽

Markov Modeling ◽

Binding Motifs ◽

The Difference ◽

Probability Matrices

Abstract Motivation Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. Results We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Availability and implementation Software implementation is available from https://github.com/jttoivon/moder2. Supplementary information Supplementary data are available at Bioinformatics online.
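The difference between a PPM and a first-order (adjacent dinucleotide) model can be illustrated with a minimal sketch. This is not the MODER2 implementation: it omits the mixture model, the EM learning and the dimer structure, and simply fits both model types to a few aligned toy sites. The function names, the pseudocount and the toy sites are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_ppm(sites, pseudo=1.0):
    """Zeroth-order model: one nucleotide distribution per position."""
    ppm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = len(sites) + pseudo * len(ALPHABET)
        ppm.append({b: (counts[b] + pseudo) / total for b in ALPHABET})
    return ppm

def fit_adm(sites, pseudo=1.0):
    """First-order model: P(base at i | base at i-1) for each position i >= 1."""
    adm = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        cond = {}
        for prev in ALPHABET:
            total = sum(pairs[(prev, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
            for b in ALPHABET:
                cond[(prev, b)] = (pairs[(prev, b)] + pseudo) / total
        adm.append(cond)
    return adm

def log_prob_ppm(ppm, seq):
    return sum(math.log(col[b]) for col, b in zip(ppm, seq))

def log_prob_adm(ppm, adm, seq):
    # The first position has no predecessor, so it uses the PPM column.
    lp = math.log(ppm[0][seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(adm[i - 1][(seq[i - 1], seq[i])])
    return lp

# Hypothetical aligned sites in which positions 0 and 1 co-vary (AC or TG).
sites = ["ACGT", "ACGT", "TGGT", "TGGT"]
ppm = fit_ppm(sites)
adm = fit_adm(sites)

for seq in ("ACGT", "AGGT"):
    print(seq, round(log_prob_ppm(ppm, seq), 3), round(log_prob_adm(ppm, adm, seq), 3))
```

In this toy data the PPM cannot tell the observed site ACGT from the unobserved combination AGGT, because it treats positions independently. The first-order model assigns AGGT a lower probability, since G never follows A in the training sites.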

Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

Dna Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
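Classification with a fixed-order Markov model can be sketched as follows. This is an illustrative toy and not the SMM software: it trains one k-th order model per reference sequence from context counts and assigns a fragment to the reference with the highest log-likelihood. The function names, the pseudocount smoothing, the order of 3 and the toy genomes are all assumptions; the abstract refers to orders of 9 and above on real genomes.

```python
import math
from collections import defaultdict

def train_markov(genome, order, pseudo=1.0):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1
    return {"order": order, "counts": counts, "pseudo": pseudo}

def log_likelihood(model, fragment):
    """Sum of log P(base | preceding context) over the fragment."""
    k, pseudo = model["order"], model["pseudo"]
    ll = 0.0
    for i in range(k, len(fragment)):
        ctx = model["counts"].get(fragment[i - k:i], {})
        total = sum(ctx.values()) + 4 * pseudo  # 4 = size of the DNA alphabet
        ll += math.log((ctx.get(fragment[i], 0.0) + pseudo) / total)
    return ll

def classify(models, fragment):
    """Return the name of the model under which the fragment is most likely."""
    return max(models, key=lambda name: log_likelihood(models[name], fragment))

# Hypothetical toy "genomes" with very different k-mer composition.
models = {
    "genome_a": train_markov("ACGT" * 50, order=3),
    "genome_b": train_markov("AATT" * 50, order=3),
}

print(classify(models, "CGTACGTACG"))
print(classify(models, "TTAATTAATT"))
```

Higher orders multiply the number of contexts by four per added position, which is why such models need much more training data and memory than lower-order ones.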

A framework for space-efficient variable-order Markov models

Bioinformatics ◽

10.1093/bioinformatics/btz268 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4607-4616

Author(s):

Fabio Cunial ◽

Jarno Alanko ◽

Djamal Belazzougui

Keyword(s):

Language Processing ◽

Data Structures ◽

Markov Models ◽

Biological Properties ◽

Specific Model ◽

Suffix Array ◽

Training Data ◽

Supplementary Information ◽

Variable Order ◽

Scoring Functions

Abstract Motivation Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. Results We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to four times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to ten times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on very repetitive datasets, and making them become up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are a hundred times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be used with bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. Availability and implementation https://github.com/jnalanko/VOMM Supplementary information Supplementary data are available at Bioinformatics online.

Machine Boss: rapid prototyping of bioinformatic automata

Bioinformatics ◽

10.1093/bioinformatics/btaa633 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jordi Silvestre-Ryan ◽

Yujie Wang ◽

Mehak Sharma ◽

Stephen Lin ◽

Yolanda Shen ◽

...

Keyword(s):

Data Storage ◽

Markov Models ◽

Software Tool ◽

Supplementary Information ◽

Time Saving ◽

Parameter Fitting ◽

Software Libraries ◽

Calculation Parameter ◽

Report Data ◽

Dna Alignment

Abstract Motivation Many software libraries for using Hidden Markov Models in bioinformatics focus on inference tasks, such as likelihood calculation, parameter-fitting and alignment. However, construction of the state machines can be a laborious task, automation of which would be time-saving and less error-prone. Results We present Machine Boss, a software tool implementing not just inference and parameter-fitting algorithms, but also a set of operations for manipulating and combining automata. The aim is to make prototyping of bioinformatics HMMs as quick and easy as the construction of regular expressions, with one-line ‘recipes’ for many common applications. We report data from several illustrative examples involving protein-to-DNA alignment, DNA data storage and nanopore sequence analysis. Availability and implementation Machine Boss is released under the BSD-3 open source license and is available from http://machineboss.org/. Supplementary information Supplementary data are available at Bioinformatics online.

Semi-supervised learning of Hidden Markov Models for biological sequence analysis

Bioinformatics ◽

10.1093/bioinformatics/bty910 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2208-2215 ◽

Cited By ~ 5

Author(s):

Ioannis A Tamposis ◽

Konstantinos D Tsirigos ◽

Margarita C Theodoropoulou ◽

Panagiota I Kontou ◽

Pantelis G Bagos

Keyword(s):

Sequence Analysis ◽

Supervised Learning ◽

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Transmembrane Protein ◽

Training Data ◽

Supplementary Information ◽

Training Procedure ◽

Partially Labeled Data

Abstract Motivation Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. Using such sequences alongside the labeled ones is called semi-supervised learning and could be very helpful in many applications. Results Here we propose a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers. Supplementary information Supplementary data are available at Bioinformatics online.

An Equivariant Bayesian Convolutional Network predicts recombination hotspots and accurately resolves binding motifs

10.1101/351254 ◽

2018 ◽

Author(s):

Richard Brown ◽

Gerton Lunter

Keyword(s):

High Resolution ◽

Predictive Accuracy ◽

Training Data ◽

Supplementary Information ◽

Mathematical Framework ◽

Convolutional Network ◽

Binding Motifs ◽

Recombination Hotspots ◽

Reverse Complement ◽

Bayesian Approximation

Abstract Motivation Convolutional neural networks (CNNs) have been tremendously successful in many contexts, particularly where training data is abundant and signal-to-noise ratios are large. However, when predicting noisily observed biological phenotypes from DNA sequence, each training instance is only weakly informative, and the amount of training data is often fundamentally limited, emphasizing the need for methods that make optimal use of training data and any structure inherent in the model. Results Here we show how to combine equivariant networks, a general mathematical framework for handling exact symmetries in CNNs, with Bayesian dropout, a version of MC dropout suggested by a reinterpretation of dropout as a variational Bayesian approximation, to develop a model that exhibits exact reverse-complement symmetry and is more resistant to overtraining. We find that this model has increased power and generalizability, resulting in significantly better predictive accuracy compared to standard CNN implementations and state-of-art deep-learning-based motif finders. We use our network to predict recombination hotspots from sequence, and identify high-resolution binding motifs for the recombination-initiation protein PRDM9, which were recently validated by high-resolution assays. The network achieves a predictive accuracy comparable to that attainable by a direct assay of the H3K4me3 histone mark, a proxy for PRDM9 binding. Availability https://github.com/luntergroup/[email protected], [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

An equivariant Bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs

Bioinformatics ◽

10.1093/bioinformatics/bty964 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2177-2184 ◽

Cited By ~ 3

Author(s):

Richard C Brown ◽

Gerton Lunter

Keyword(s):

Predictive Accuracy ◽

Histone Mark ◽

Training Data ◽

Supplementary Information ◽

Mathematical Framework ◽

Convolutional Network ◽

Binding Motifs ◽

Recombination Hotspots ◽

Reverse Complement ◽

Bayesian Approximation

Abstract Motivation Convolutional neural networks (CNNs) have been tremendously successful in many contexts, particularly where training data are abundant and signal-to-noise ratios are large. However, when predicting noisily observed phenotypes from DNA sequence, each training instance is only weakly informative, and the amount of training data is often fundamentally limited, emphasizing the need for methods that make optimal use of training data and any structure inherent in the process. Results Here we show how to combine equivariant networks, a general mathematical framework for handling exact symmetries in CNNs, with Bayesian dropout, a version of Monte Carlo dropout suggested by a reinterpretation of dropout as a variational Bayesian approximation, to develop a model that exhibits exact reverse-complement symmetry and is more resistant to overtraining. We find that this model combines improved prediction consistency with better predictive accuracy compared to standard CNN implementations and state-of-art motif finders. We use our network to predict recombination hotspots from sequence, and identify binding motifs for the recombination–initiation protein PRDM9 previously unobserved in this data, which were recently validated by high-resolution assays. The network achieves a predictive accuracy comparable to that attainable by a direct assay of the H3K4me3 histone mark, a proxy for PRDM9 binding. Availability and implementation https://github.com/luntergroup/EquivariantNetworks Supplementary information Supplementary data are available at Bioinformatics online.
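Reverse-complement symmetry can be illustrated with a single shared motif filter. The sketch below is a much simpler construction than the equivariant network described in the abstract and involves no dropout or training: it scans a sequence with a filter and with that filter's reverse-complemented copy, so the resulting score is identical for a sequence and its reverse complement. All names, the integer weights and the toy sequence are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def scan(seq, weights):
    """Slide a filter (one weight dict per motif position) along seq."""
    width = len(weights)
    return [
        sum(weights[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def rc_filter(weights):
    """Same parameters, rearranged to respond to the reverse-complement motif."""
    return [{COMPLEMENT[b]: w for b, w in col.items()} for col in reversed(weights)]

def symmetric_score(seq, weights):
    # Best match over all positions and both strands, with shared weights.
    return max(scan(seq, weights) + scan(seq, rc_filter(weights)))

# Hypothetical integer filter that prefers the motif GAA.
weights = [
    {"A": 0, "C": 0, "G": 2, "T": 0},
    {"A": 2, "C": 0, "G": 0, "T": 0},
    {"A": 2, "C": 0, "G": 0, "T": 0},
]
seq = "CCGAACC"
rc = reverse_complement(seq)

print(scan(seq, weights), scan(rc, weights))
print(symmetric_score(seq, weights), symmetric_score(rc, weights))
```

Scanning the reverse complement with the original filter gives the same values as scanning the original sequence with the reverse-complemented filter, in reversed order. That relationship is the equivariance property, and taking the maximum over both scans turns it into an invariant score.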

maTE: discovering expressed interactions between microRNAs and their targets

Bioinformatics ◽

10.1093/bioinformatics/btz204 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4020-4028 ◽

Cited By ~ 4

Author(s):

Malik Yousef ◽

Loai Abdallah ◽

Jens Allmer

Keyword(s):

Gene Expression ◽

Target Genes ◽

Training Data ◽

Supplementary Information ◽

Protein Abundance ◽

Expression Data ◽

Mirna Regulation ◽

Novel Approach ◽

The Difference ◽

And Control

Abstract Motivation Disease is often manifested via changes in transcript and protein abundance. MicroRNAs (miRNAs) are instrumental in regulating protein abundance and may measurably influence transcript levels. miRNAs often target more than one mRNA (for humans, the average is three), and mRNAs are often targeted by more than one miRNA (for the genes considered in this study, the average is also three). Therefore, it is difficult to determine the miRNAs that may cause the observed differential gene expression. We present a novel approach, maTE, which is based on machine learning, that integrates information about miRNA target genes with gene expression data. maTE depends on the availability of a sufficient amount of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per miRNA basis. Multiple high scoring miRNAs are used to build a final classifier to improve separation. Results The aim of the study is to find a set of miRNAs causing the regulation of their target genes that best explains the difference between groups (e.g. cancer versus control). maTE provides a list of significant groups of genes where each group is targeted by a specific miRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. Also, the results show that when the accuracy is much lower (e.g. ∼50%), the set of miRNAs provided is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, this approach allows new avenues for exploring miRNA regulation and may enable the development of miRNA-based biomarkers and drugs. Availability and implementation The KNIME workflow, implementing maTE, is available at Bioinformatics online. Supplementary information Supplementary data are available at Bioinformatics online.

ANALYTICAL INDICATOR. ANALYSIS OF THE REAL STATE OF DYNAMICS OF MORTALITY OF THE POPULATION OF RUSSIA FROM MALIGNANT TUMORS AND CHANGES IN ITS STRUCTURE

Problems in oncology ◽

10.37469/0507-3758-2019-65-2-205-219 ◽

2019 ◽

Vol 65 (2) ◽

pp. 205-219 ◽

Cited By ~ 1

Author(s):

V. Merabishvili

Keyword(s):

Malignant Tumors ◽

Specific Weight ◽

Mortality Rates ◽

Retirement Age ◽

Medium Term ◽

Age Composition ◽

The Real ◽

Indicator Analysis ◽

The Difference ◽

Real State

The mortality rate is one of the most important criteria for assessing the health of a population. However, analytical indicators must be used correctly, especially when evaluating time series. The value of the “gross” (crude) mortality rate is closely linked to the proportion (specific weight) of elderly and senile persons in the population. All international publications (WHO, IARC, territorial cancer registries) assess the dynamics of morbidity and mortality only by standardized indicators, which eliminate the difference in the age composition of the compared population groups. In Russia, from 1960 to 2017, the share of people of retirement age more than doubled. The structure of mortality from malignant tumors has changed dramatically. The paper presents the dynamics of gross and standardized mortality rates from malignant tumors in Russia and in all administrative territories, and shows the real success of the oncology service. A medium-term interval forecast until 2025 has been calculated.
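The effect of age composition on the gross rate can be shown with a small worked example of direct standardization, in which age-specific rates are weighted by a fixed standard population. All figures below are hypothetical and are not taken from the paper; the three age groups and the standard weights are assumptions chosen to keep the arithmetic easy to follow.

```python
def crude_rate(deaths, population):
    """Gross (crude) mortality per 100,000: total deaths over total population."""
    return 100_000 * sum(deaths) / sum(population)

def standardized_rate(deaths, population, standard_weights):
    """Directly standardized rate per 100,000: age-specific rates weighted
    by the age shares of a fixed standard population."""
    rates = [d / p for d, p in zip(deaths, population)]
    return 100_000 * sum(w * r for w, r in zip(standard_weights, rates))

# Hypothetical age groups: young, middle-aged, elderly.
# Both periods have identical age-specific rates (10, 100 and 1000 per 100,000);
# only the age structure differs.
earlier_pop = [600_000, 300_000, 100_000]
earlier_deaths = [60, 300, 1000]
later_pop = [400_000, 350_000, 250_000]
later_deaths = [40, 350, 2500]

standard = [0.5, 0.3, 0.2]  # assumed standard population shares, summing to 1

print(crude_rate(earlier_deaths, earlier_pop), crude_rate(later_deaths, later_pop))
print(standardized_rate(earlier_deaths, earlier_pop, standard),
      standardized_rate(later_deaths, later_pop, standard))
```

In this example the gross rate rises from 136 to 289 per 100,000 purely because the elderly share grows, while the standardized rate stays at about 235 in both periods because the age-specific risks are unchanged.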

Blinking statistics and molecular counting in direct stochastic reconstruction microscopy (dSTORM)

Bioinformatics ◽

10.1093/bioinformatics/btab136 ◽

2021 ◽

Author(s):

Lekha Patel ◽

David Williamson ◽

Dylan M Owen ◽

Edward A K Cohen

Keyword(s):

Probability Distribution ◽

Single Molecule ◽

Immunological Synapse ◽

Training Data ◽

Supplementary Information ◽

Cellular Structures ◽

Exact Probability ◽

Stochastic Optical Reconstruction Microscopy ◽

Exact Probability Distribution ◽

Optical Reconstruction

Abstract Motivation Many recent advancements in single-molecule localization microscopy exploit the stochastic photoswitching of fluorophores to reveal complex cellular structures beyond the classical diffraction limit. However, this same stochasticity makes counting the number of molecules to high precision extremely challenging, preventing key insight into the cellular structures and processes under observation. Results Modelling the photoswitching behaviour of a fluorophore as an unobserved continuous time Markov process transitioning between a single fluorescent and multiple dark states, and fully mitigating for missed blinks and false positives, we present a method for computing the exact probability distribution for the number of observed localizations from a single photoswitching fluorophore. This is then extended to provide the probability distribution for the number of localizations in a direct stochastic optical reconstruction microscopy experiment involving an arbitrary number of molecules. We demonstrate that when training data are available to estimate photoswitching rates, the unknown number of molecules can be accurately recovered from the posterior mode of the number of molecules given the number of localizations. Finally, we demonstrate the method on experimental data by quantifying the number of adapter protein linker for activation of T cells on the cell surface of the T-cell immunological synapse. Availability and implementation Software and data available at https://github.com/lp1611/mol_count_dstorm. Supplementary information Supplementary data are available at Bioinformatics online.

CATH functional families predict functional sites in proteins

Bioinformatics ◽

10.1093/bioinformatics/btaa937 ◽

2020 ◽

Author(s):

Sayoni Das ◽

Harry M Scholes ◽

Neeladri Sen ◽

Christine Orengo

Keyword(s):

Functional Characterization ◽

Functional Site ◽

Training Data ◽

Supplementary Information ◽

Conserved Residues ◽

Functional Sites ◽

Protein Protein Interaction ◽

Evolutionary Features ◽

Functional Families

Abstract Motivation Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein–protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). Results FunSite’s prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite’s performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyze which structural and evolutionary features are most predictive for functional sites. Availability and implementation https://github.com/UCL/cath-funsite-predictor. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
