An equivariant Bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs

Richard C Brown; Gerton Lunter

doi:10.1093/bioinformatics/bty964

An equivariant Bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs

Bioinformatics ◽

10.1093/bioinformatics/bty964 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2177-2184 ◽

Cited By ~ 3

Author(s):

Richard C Brown ◽

Gerton Lunter

Keyword(s):

Predictive Accuracy ◽

Histone Mark ◽

Training Data ◽

Supplementary Information ◽

Mathematical Framework ◽

Convolutional Network ◽

Binding Motifs ◽

Recombination Hotspots ◽

Reverse Complement ◽

Bayesian Approximation

Abstract Motivation Convolutional neural networks (CNNs) have been tremendously successful in many contexts, particularly where training data are abundant and signal-to-noise ratios are large. However, when predicting noisily observed phenotypes from DNA sequence, each training instance is only weakly informative, and the amount of training data is often fundamentally limited, emphasizing the need for methods that make optimal use of training data and any structure inherent in the process. Results Here we show how to combine equivariant networks, a general mathematical framework for handling exact symmetries in CNNs, with Bayesian dropout, a version of Monte Carlo dropout suggested by a reinterpretation of dropout as a variational Bayesian approximation, to develop a model that exhibits exact reverse-complement symmetry and is more resistant to overtraining. We find that this model combines improved prediction consistency with better predictive accuracy compared to standard CNN implementations and state-of-art motif finders. We use our network to predict recombination hotspots from sequence, and identify binding motifs for the recombination–initiation protein PRDM9 previously unobserved in this data, which were recently validated by high-resolution assays. The network achieves a predictive accuracy comparable to that attainable by a direct assay of the H3K4me3 histone mark, a proxy for PRDM9 binding. Availability and implementation https://github.com/luntergroup/EquivariantNetworks Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

An Equivariant Bayesian Convolutional Network predicts recombination hotspots and accurately resolves binding motifs

10.1101/351254 ◽

2018 ◽

Author(s):

Richard Brown ◽

Gerton Lunter

Keyword(s):

High Resolution ◽

Predictive Accuracy ◽

Training Data ◽

Supplementary Information ◽

Mathematical Framework ◽

Convolutional Network ◽

Binding Motifs ◽

Recombination Hotspots ◽

Reverse Complement ◽

Bayesian Approximation

AbstractMotivationConvolutional neural networks (CNNs) have been trememdously successful in many contexts, particularly where training data is abundant and signal-to-noise ratios are large. However, when predicting noisily observed biological phenotypes from DNA sequence, each training instance is only weakly informative, and the amount of training data is often fundamentally limited, emphasizing the need for methods that make optimal use of training data and any structure inherent in the model.ResultsHere we show how to combine equivariant networks, a general mathematical framework for handling exact symmetries in CNNs, with Bayesian dropout, a version of MC dropout suggested by a reinterpretation of dropout as a variational Bayesian approximation, to develop a model that exhibits exact reverse-complement symmetry and is more resistant to overtraining. We find that this model has increased power and generalizability, resulting in significantly better predictive accuracy compared to standard CNN implementations and state-of-art deep-learning-based motif finders. We use our network to predict recombination hotspots from sequence, and identify high-resolution binding motifs for the recombination-initiation protein PRDM9, which were recently validated by high-resolution assays. The network achieves a predictive accuracy comparable to that attainable by a direct assay of the H3K4me3 histone mark, a proxy for PRDM9 binding.Availabilityhttps://github.com/luntergroup/[email protected], [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Bioinformatics ◽

10.1093/bioinformatics/btaa045 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2690-2696

Author(s):

Jarkko Toivonen ◽

Pratyush K Das ◽

Jussi Taipale ◽

Esko Ukkonen

Keyword(s):

Markov Models ◽

Expectation Maximization Algorithm ◽

Software Tool ◽

Specific Weight ◽

Training Data ◽

Supplementary Information ◽

Markov Modeling ◽

Binding Motifs ◽

The Difference ◽

Probability Matrices

Abstract Motivation Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. Results We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Availability and implementation Software implementation is available from https://github.com/jttoivon/moder2. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Predicting Individual Task Contrasts From Resting-state Functional Connectivity using a Surface-based Convolutional Network

10.1101/2021.04.19.440523 ◽

2021 ◽

Author(s):

Gia Ngo ◽

Meenakshi Khosla ◽

Keith Jamison ◽

Amy Kuceyeski ◽

Mert R Sabuncu

Keyword(s):

Resting State ◽

Predictive Accuracy ◽

Functional Neuroimaging ◽

Resting State Fmri ◽

Training Data ◽

Resting State Functional Connectivity ◽

Convolutional Network ◽

Individual Level ◽

Human Connectome Project ◽

Individual Task

Task-based and resting-state represent the two most common experimental paradigms of functional neuroimaging. While resting-state offers a flexible and scalable approach for characterizing brain function, task-based techniques provide superior localization. In this paper, we build on recent deep learning methods to create a model that predicts task-based contrast maps from resting-state fMRI scans. Specifically, we propose BrainSurfCNN, a surface-based fully-convolutional neural network model that works with a representation of the brain's cortical sheet. Our model achieves state of the art predictive accuracy on independent test data from the Human Connectome Project and yields individual-level predicted maps that are on par with the target-repeat reliability of the measured contrast maps. We also demonstrate that BrainSurfCNN can generalize remarkably well to novel domains with limited training data.

Download Full-text

Blinking statistics and molecular counting in direct stochastic reconstruction microscopy (dSTORM)

Bioinformatics ◽

10.1093/bioinformatics/btab136 ◽

2021 ◽

Author(s):

Lekha Patel ◽

David Williamson ◽

Dylan M Owen ◽

Edward A K Cohen

Keyword(s):

Probability Distribution ◽

Single Molecule ◽

Immunological Synapse ◽

Training Data ◽

Supplementary Information ◽

Cellular Structures ◽

Exact Probability ◽

Stochastic Optical Reconstruction Microscopy ◽

Exact Probability Distribution ◽

Optical Reconstruction

Abstract Motivation Many recent advancements in single-molecule localization microscopy exploit the stochastic photoswitching of fluorophores to reveal complex cellular structures beyond the classical diffraction limit. However, this same stochasticity makes counting the number of molecules to high precision extremely challenging, preventing key insight into the cellular structures and processes under observation. Results Modelling the photoswitching behaviour of a fluorophore as an unobserved continuous time Markov process transitioning between a single fluorescent and multiple dark states, and fully mitigating for missed blinks and false positives, we present a method for computing the exact probability distribution for the number of observed localizations from a single photoswitching fluorophore. This is then extended to provide the probability distribution for the number of localizations in a direct stochastic optical reconstruction microscopy experiment involving an arbitrary number of molecules. We demonstrate that when training data are available to estimate photoswitching rates, the unknown number of molecules can be accurately recovered from the posterior mode of the number of molecules given the number of localizations. Finally, we demonstrate the method on experimental data by quantifying the number of adapter protein linker for activation of T cells on the cell surface of the T-cell immunological synapse. Availability and implementation Software and data available at https://github.com/lp1611/mol_count_dstorm. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

CATH functional families predict functional sites in proteins

Bioinformatics ◽

10.1093/bioinformatics/btaa937 ◽

2020 ◽

Author(s):

Sayoni Das ◽

Harry M Scholes ◽

Neeladri Sen ◽

Christine Orengo

Keyword(s):

Functional Characterization ◽

Functional Site ◽

Training Data ◽

Supplementary Information ◽

Conserved Residues ◽

Functional Sites ◽

Protein Protein Interaction ◽

Evolutionary Features ◽

Functional Families

Abstract Motivation Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein–protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). Results FunSite’s prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite’s performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyze which structural and evolutionary features are most predictive for functional sites. Availabilityand implementation https://github.com/UCL/cath-funsite-predictor. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Structure-aware protein–protein interaction site prediction using deep graph convolutional network

Bioinformatics ◽

10.1093/bioinformatics/btab643 ◽

2021 ◽

Author(s):

Qianmu Yuan ◽

Jianwen Chen ◽

Huiying Zhao ◽

Yaoqi Zhou ◽

Yuedong Yang

Keyword(s):

Protein Interactions ◽

Spatial Information ◽

Screening Tools ◽

Supplementary Information ◽

Protein Protein Interactions ◽

Convolutional Network ◽

Source Codes ◽

Site Prediction ◽

Protein Protein Interaction ◽

Mapping Techniques

Abstract Motivation Protein–protein interactions (PPI) play crucial roles in many biological processes, and identifying PPI sites is an important step for mechanistic understanding of diseases and design of novel drugs. Since experimental approaches for PPI site identification are expensive and time-consuming, many computational methods have been developed as screening tools. However, these methods are mostly based on neighbored features in sequence, and thus limited to capture spatial information. Results We propose a deep graph-based framework deep Graph convolutional network for Protein–Protein-Interacting Site prediction (GraphPPIS) for PPI site prediction, where the PPI site prediction problem was converted into a graph node classification task and solved by deep learning using the initial residual and identity mapping techniques. We showed that a deeper architecture (up to eight layers) allows significant performance improvement over other sequence-based and structure-based methods by more than 12.5% and 10.5% on AUPRC and MCC, respectively. Further analyses indicated that the predicted interacting sites by GraphPPIS are more spatially clustered and closer to the native ones even when false-positive predictions are made. The results highlight the importance of capturing spatially neighboring residues for interacting site prediction. Availability and implementation The datasets, the pre-computed features, and the source codes along with the pre-trained models of GraphPPIS are available at https://github.com/biomed-AI/GraphPPIS. The GraphPPIS web server is freely available at https://biomed.nscc-gz.cn/apps/GraphPPIS. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

DeepIsoFun: a deep domain adaptation approach to predict isoform functions

Bioinformatics ◽

10.1093/bioinformatics/bty1017 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2535-2544 ◽

Cited By ~ 4

Author(s):

Dipan Shaw ◽

Hao Chen ◽

Tao Jiang

Keyword(s):

Network Architecture ◽

Domain Adaptation ◽

Multiple Instance Learning ◽

Training Data ◽

Supplementary Information ◽

Operating Characteristics ◽

Neural Network Architecture ◽

Receiver Operating Characteristics Curve ◽

Gene Functions ◽

Better Than

AbstractMotivationIsoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from being desirable primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms.ResultsWe evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method acquired at least 26% improvement and in terms of area under the precision-recall curve, it acquired at least 10% improvement over the state-of-the-art methods. In addition, we also study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions.Availability and implementationhttps://github.com/dls03/DeepIsoFun/Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text

Evaluation of Power Insulator Detection Efficiency with the Use of Limited Training Dataset

Applied Sciences ◽

10.3390/app10062104 ◽

2020 ◽

Vol 10 (6) ◽

pp. 2104

Author(s):

Michał Tomaszewski ◽

Paweł Michalski ◽

Jakub Osuchowski

Keyword(s):

Neural Network ◽

Neural Networks ◽

Object Detection ◽

Convolutional Neural Network ◽

Deep Neural Networks ◽

Detection Efficiency ◽

Training Data ◽

Training Dataset ◽

Training Set ◽

Convolutional Network

This article presents an analysis of the effectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article presents comparisons of results from detecting the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented papier is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. Conducted research suggests that the deep neural networks will achieve different levels of effectiveness depending on the amount of training data. The most beneficial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the relationship between the number of input samples and the obtained results has a significantly lower influence than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.

Download Full-text

EVALUATION OF SEMANTIC SEGMENTATION METHODS FOR DEFORESTATION DETECTION IN THE AMAZON

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xliii-b3-2020-1497-2020 ◽

2020 ◽

Vol XLIII-B3-2020 ◽

pp. 1497-1505

Author(s):

R. B. Andrade ◽

G. A. O. P. Costa ◽

G. L. A. Mota ◽

M. X. Ortega ◽

R. Q. Feitosa ◽

...

Keyword(s):

Global Climate ◽

Biodiversity Loss ◽

Semantic Segmentation ◽

Training Sample ◽

Training Data ◽

Detection Methods ◽

Amazon Forest ◽

Convolutional Network ◽

Detection Approach ◽

Segmentation Methods

Abstract. Deforestation is a wide-reaching problem, responsible for serious environmental issues, such as biodiversity loss and global climate change. Containing approximately ten percent of all biomass on the planet and home to one tenth of the known species, the Amazon biome has faced important deforestation pressure in the last decades. Devising efficient deforestation detection methods is, therefore, key to combat illegal deforestation and to aid in the conception of public policies directed to promote sustainable development in the Amazon. In this work, we implement and evaluate a deforestation detection approach which is based on a Fully Convolutional, Deep Learning (DL) model: the DeepLabv3+. We compare the results obtained with the devised approach to those obtained with previously proposed DL-based methods (Early Fusion and Siamese Convolutional Network) using Landsat OLI-8 images acquired at different dates, covering a region of the Amazon forest. In order to evaluate the sensitivity of the methods to the amount of training data, we also evaluate them using varying training sample set sizes. The results show that all tested variants of the proposed method significantly outperform the other DL-based methods in terms of overall accuracy and F1-score. The gains in performance were even more substantial when limited amounts of samples were used in training the evaluated methods.

Download Full-text

Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

Dna Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text