Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics

2021 ◽  
Author(s):  
Eva Prakash ◽  
Avanti Shrikumar ◽  
Anshul Kundaje

Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of "reference"/"baseline", and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.
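The contrast between Integrated Gradients and gradient-times-input drawn in this abstract can be illustrated with a minimal numpy sketch. The toy scoring function, weights, steps count, and zero baseline below are illustrative assumptions, not taken from the paper or its models.

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=50):
    """Approximate Integrated Gradients: average the gradient along a
    straight-line path from the baseline to the input, then scale by
    the (input - baseline) difference."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    avg_grad = np.mean(
        [grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

# Hypothetical differentiable score: f(x) = relu(w . x) with known weights.
w = np.array([2.0, -1.0, 0.5])
f = lambda x: max(np.dot(w, x), 0.0)
grad_f = lambda x: w * (np.dot(w, x) > 0)

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)

ig = integrated_gradients(f, grad_f, x, baseline)
gxi = x * grad_f(x)  # gradient-times-input, for comparison
```

For this piecewise-linear toy model the two attributions coincide and Integrated Gradients satisfies its completeness property (attributions sum to f(x) - f(baseline)); the paper's point is that on real regulatory models the linear interpolation path can make them diverge.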

Energies ◽  
2018 ◽  
Vol 11 (12) ◽  
pp. 3408 ◽  
Author(s):  
Muhammad Ahmad ◽  
Anthony Mouraud ◽  
Yacine Rezgui ◽  
Monjur Mourshed

Predictive analytics play a significant role in ensuring the optimal and secure operation of power systems, reducing energy consumption, detecting and diagnosing faults, and improving grid resilience. However, accurate energy consumption prediction is challenging because of system nonlinearities, delays, and the many influencing factors (e.g., climate, occupants’ behaviour, occupancy pattern, building type). This paper investigates the accuracy and generalisation capabilities of deep highway networks (DHN) and extremely randomized trees (ET) for predicting the hourly heating, ventilation and air conditioning (HVAC) energy consumption of a hotel building. Their performance was compared with support vector regression (SVR), one of the most widely used supervised machine learning algorithms. Results showed that both the ET and DHN models marginally outperform the SVR algorithm. The paper also details the impact of increasing the deep highway network’s complexity on its performance and concludes that all developed models are equally applicable for predicting hourly HVAC energy consumption. Possible reasons for the minimal impact of DHN complexity, along with future research directions, are also highlighted.
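As a rough illustration of the kind of model comparison described above, the sketch below fits extremely randomized trees and SVR on synthetic hourly features. The features, target function, and hyperparameters are invented for the example and are not the paper's hotel dataset or settings; it assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical hourly features: outdoor temperature, hour of day, occupancy.
X = np.column_stack([
    rng.uniform(-5, 35, n),   # outdoor temperature (deg C)
    rng.integers(0, 24, n),   # hour of day
    rng.uniform(0, 1, n),     # occupancy fraction
])
# Synthetic HVAC load: nonlinear in temperature, modulated by occupancy,
# with a daily cycle and noise.
y = (np.abs(X[:, 0] - 21) * (0.5 + X[:, 2])
     + 2 * np.sin(X[:, 1] * np.pi / 12)
     + rng.normal(0, 0.5, n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svr = SVR(C=10.0).fit(X_tr, y_tr)
print("ET  R^2:", round(r2_score(y_te, et.predict(X_te)), 3))
print("SVR R^2:", round(r2_score(y_te, svr.predict(X_te)), 3))
```

On data like this the tree ensemble typically handles the unscaled, mixed-range features better than an untuned SVR, echoing the abstract's finding that ET models were competitive with or better than SVR.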


Life ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 716 ◽  
Author(s):  
Yunhe Liu ◽  
Aoshen Wu ◽  
Xueqing Peng ◽  
Xiaona Liu ◽  
Gang Liu ◽  
...  

Despite the many scRNA-seq analytic algorithms that have been developed, their performance for cell clustering cannot be quantified because the “true” clusters are unknown. Referencing the transcriptomic heterogeneity of cell clusters, a “true” mRNA number matrix of individual cells was defined as the ground truth. Based on this matrix and the actual data-generation procedure, a simulation program (SSCRNA) for raw data was developed, and the consistency between the simulated and real data was evaluated. The impact of sequencing depth and of the analysis algorithms on clustering accuracy was then quantified. The simulation results were highly consistent with the actual data. Among the normalization methods, Gaussian normalization was the most recommended; among the clustering algorithms, K-means clustering was more stable than K-means plus Louvain clustering. In conclusion, the scRNA simulation algorithm developed here reproduces the actual data-generation process, reveals the impact of parameters on classification, compares normalization/clustering algorithms, and provides novel insight into scRNA analyses.
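The abstract's core idea, starting from a known "true" mRNA matrix and simulating the sequencing step, can be caricatured in a few lines. The gamma/Poisson generative choices, the two-cluster setup, and the minimal K-means below are illustrative assumptions, not the SSCRNA program itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(depth, n_cells=100, n_genes=50):
    """Toy simulation: draw a 'true' per-cluster mRNA mean matrix, assign
    cells to clusters, then model sequencing as Poisson sampling whose
    rate scales with sequencing depth."""
    means = rng.gamma(2.0, 2.0, size=(2, n_genes))  # two ground-truth clusters
    labels = rng.integers(0, 2, n_cells)
    return rng.poisson(means[labels] * depth), labels

def kmeans2(X, iters=30):
    """Minimal 2-means with farthest-point initialization (numpy only)."""
    far = ((X - X[0]) ** 2).sum(1).argmax()
    centers = np.stack([X[0], X[far]]).astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(2):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

counts, labels = simulate(depth=5.0)
pred = kmeans2(np.log1p(counts))
# Clustering accuracy is invariant to label permutation.
acc = max((pred == labels).mean(), (pred != labels).mean())
```

Sweeping `depth` in `simulate` is one way to reproduce, in miniature, the paper's question of how sequencing depth affects cluster accuracy.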


2018 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract
Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach, inspired by the method of Integrated Gradients, for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding the pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating the predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos are available at http://bit.ly/gkmexplainvids.
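Of the interpretation methods this abstract compares against, in-silico mutagenesis (ISM) is the simplest to sketch: score every single-base substitution and report the change relative to the original sequence. The scoring function below is a hypothetical motif counter standing in for a trained gkm-SVM, which is not reproduced here.

```python
import numpy as np

BASES = "ACGT"

def ism_scores(seq, score_fn):
    """In-silico mutagenesis: for each position and each possible base,
    return score(mutant) - score(original)."""
    ref = score_fn(seq)
    out = np.zeros((len(seq), len(BASES)))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            mutant = seq[:i] + b + seq[i + 1:]
            out[i, j] = score_fn(mutant) - ref
    return out

# Hypothetical scorer: number of occurrences of the motif "GATA".
score = lambda s: float(s.count("GATA"))
effects = ism_scores("TTGATATT", score)  # motif at positions 2-5
```

Positions inside the motif show a score drop when mutated away from the motif base, while positions outside it have no effect, which is the signal an interpretation method is expected to recover. ISM's cost grows with sequence length times alphabet size per prediction, which is part of the scaling problem gkmexplain addresses.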


2016 ◽  
Author(s):  
Yujin Chung ◽  
Jody Hey

Abstract
We present a new Bayesian method for estimating demographic and phylogenetic history using population genomic data. Several key innovations are introduced that allow the study of diverse models within an Isolation with Migration framework. For the Markov chain Monte Carlo (MCMC) phase of the analysis, we use a reduced state space, consisting of simple coalescent trees without migration paths, and a simple importance sampling distribution without demography. Migration paths are analytically integrated using a Markov chain as a representation of genealogy. The new method is scalable to a large number of loci with excellent MCMC mixing properties. Once obtained, a single sample of trees is used to calculate the joint posterior density for model parameters under multiple diverse demographic models, without having to repeat MCMC runs. As implemented in the computer program MIST, we demonstrate the accuracy, scalability and other advantages of the new method using simulated data and DNA sequences of two common chimpanzee subspecies: Pan troglodytes (P. t.) troglodytes and P. t. verus.


2021 ◽  
Author(s):  
AYAN CHATTERJEE ◽  
Ram Bajpai ◽  
Martin W. Gerdes

Abstract Background: Lifestyle diseases are the leading cause of death worldwide. The gradual increase of negative behavior in humans, driven by physical inactivity, unhealthy habits, and improper nutrition, expedites the growth of lifestyle diseases. Proper lifestyle management in the obesity context may help to reach a personal weight goal, or to maintain a normal weight range, by optimizing health behaviors (physical activity, diet, and habits). Objective: In this study, we develop a mathematical model to analyze the impact of regular physical activity, a proper diet, and healthy habits on weight change, targeting obesity as a case study. We then design an algorithm to verify the proposed model with simulated data and to compare it with related proven models under the defined constraints. Methods: We propose a weight-change mathematical model as a function of activity, habit, and nutrition, built on the first law of thermodynamics, basal metabolic rate (BMR), total daily energy expenditure (TDEE), and body mass index (BMI), to establish a relationship between health behavior and weight change. We then verify the model with simulated data and compare it with related established models, using the revised Harris-Benedict equations for BMR and TDEE calculation. Results: The proposed mathematical model showed a strong relationship between health behavior and weight change. We verified the model with the proposed algorithm using simulated data under the defined constraints. The adoption of the revised Harris-Benedict equations for BMR and TDEE calculation outperformed the classical Wishnofsky's rule (3500 kcal ≈ 1 lb) and the models proposed by Toumasis et al., Azzeh et al., and Mickens et al., with standard deviations of ±1.829, ±2.006, ±1.85, and ±1.80, respectively.
Conclusions: This study helped us understand, in mathematical terms, the impact of healthy behavior on weight change and the importance of a healthy lifestyle. As future work, we plan to use this model in a health eCoach system to generate personalized lifestyle recommendations that optimize health behaviors and help users accomplish personal weight goals.
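The revised Harris-Benedict equations the study uses for BMR, with TDEE obtained via the commonly published physical-activity multipliers, can be written down directly. The equations and activity factors below are the standard published values (Roza and Shizgal's 1984 revision); the example inputs are hypothetical, and the paper's full weight-change model is not reproduced here.

```python
def bmr_revised_harris_benedict(weight_kg, height_cm, age_yr, sex):
    """Revised Harris-Benedict equations for basal metabolic rate (kcal/day)."""
    if sex == "male":
        return 88.362 + 13.397 * weight_kg + 4.799 * height_cm - 5.677 * age_yr
    return 447.593 + 9.247 * weight_kg + 3.098 * height_cm - 4.330 * age_yr

# TDEE scales BMR by a physical-activity multiplier.
ACTIVITY = {
    "sedentary": 1.2,
    "light": 1.375,
    "moderate": 1.55,
    "active": 1.725,
    "very_active": 1.9,
}

def tdee(bmr, level):
    return bmr * ACTIVITY[level]

# Hypothetical subject: 80 kg, 180 cm, 30-year-old male.
bmr = bmr_revised_harris_benedict(80, 180, 30, "male")  # ≈ 1854 kcal/day
daily_energy = tdee(bmr, "moderate")
```

A weight-change model of the kind described then compares energy intake against TDEE to estimate the daily energy surplus or deficit.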


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Carl-Magnus Svensson ◽  
Ron Hübler ◽  
Marc Thilo Figge

Application of personalized medicine requires the integration of different data to determine each patient’s unique clinical constitution. The automated analysis of medical data is a growing field in which different machine learning techniques are used to minimize the time-consuming task of manual analysis. The evaluation, and often the training, of automated classifiers requires manually labelled data as ground truth. In many cases such labelling is not perfect, either because the data are ambiguous even for a trained expert or because of mistakes. Here we investigated the interobserver variability of image data comprising fluorescently stained circulating tumor cells and its effect on the performance of two automated classifiers, a random forest and a support vector machine. We found that uncertainty in annotation between observers limited the performance of the automated classifiers, especially when it was included in the test set on which classifier performance was measured. The random forest classifier turned out to be resilient to uncertainty in the training data, while the support vector machine’s performance was highly dependent on the amount of uncertainty in the training data. Finally, we introduced the consensus data set as a possible solution for the evaluation of automated classifiers that minimizes the penalty of interobserver variability.
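The consensus-data-set idea, keeping only samples on which all observers agree, can be sketched in a few lines. The two-observer labels below are invented for illustration.

```python
def consensus_set(annotations):
    """Return (indices, labels) of the samples on which every observer
    agrees; disagreements are dropped rather than resolved."""
    idx, labels = [], []
    for i, votes in enumerate(zip(*annotations)):
        if len(set(votes)) == 1:  # unanimous
            idx.append(i)
            labels.append(votes[0])
    return idx, labels

# Hypothetical labels from two observers for four candidate cells.
obs1 = ["cell", "cell", "debris", "cell"]
obs2 = ["cell", "debris", "debris", "cell"]
idx, labels = consensus_set([obs1, obs2])  # sample 1 is dropped
```

Evaluating a classifier only on the consensus subset removes the ambiguous cases from the test set, which is exactly how the abstract proposes to avoid penalizing classifiers for interobserver disagreement.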


Author(s):  
Nipon Theera-Umpon ◽  
Udomsak Boonprasert

This paper demonstrates an application of the support vector machine (SVM) to oceanic disaster search and rescue operations. Support vector regression (SVR) for the system identification of a nonlinear black-box model is utilized in this research. The SVR-based ocean model helps the search and rescue unit by predicting the target’s position at any given time. The closer the predicted location is to the actual location, the shorter the search time and the smaller the loss. One of the most popular ocean models, the Princeton ocean model, is applied to provide the ground truth of the target leeway. In the experiments, the results on simulated data show that the proposed SVR-based ocean model provides good predictions compared to the Princeton ocean model. Moreover, experimental results on real data collected by the Royal Thai Navy also show that the proposed model can be used as an auxiliary tool in search and rescue operations.
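As a toy version of SVR-based black-box system identification, the sketch below fits an SVR to a synthetic one-dimensional drift trajectory as a function of time. The trajectory, hyperparameters, and the 1-D simplification are assumptions for illustration (the actual work predicts 2-D leeway against the Princeton ocean model); it assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical training data: observed target position (1-D for brevity)
# as a nonlinear function of elapsed time, plus measurement noise.
t = rng.uniform(0, 10, 200)[:, None]
pos = 2.0 * t[:, 0] + 5.0 * np.sin(t[:, 0]) + rng.normal(0, 0.1, 200)

# Treat the drift dynamics as a black box and fit SVR to (time -> position).
model = SVR(C=100.0, gamma=0.5).fit(t, pos)
pred = model.predict([[4.0]])[0]  # predicted position at t = 4
```

Once fitted, the model can be queried at any time instant, which is the property the abstract relies on for estimating where to search.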


2021 ◽  
Author(s):  
Jose Llanes-Jurado ◽  
Lucía Amalia Carrasco-Ribelles ◽  
Mariano Alcañiz ◽  
Javier Marín-Morales

Abstract Scholars are increasingly using electrodermal activity (EDA) to assess cognitive-emotional states in laboratory environments, while recent applications have recorded EDA in uncontrolled settings, such as daily-life and virtual reality (VR) contexts, in which users can freely walk and move their hands. However, these recordings can be affected by major artifacts stemming from movement that can obscure valuable information. Previous work has analyzed signal correction methods to improve the quality of the signal or proposed artifact recognition models based on time windows. Despite these efforts, the correction of EDA signals in uncontrolled environments is still limited, and no existing research has used a signal manually corrected by an expert as a benchmark. This work investigates different machine learning and deep learning architectures, including support vector machines, recurrent neural networks (RNNs), and convolutional neural networks, for the automatic artifact recognition of EDA signals. Data from 44 subjects performing an immersive VR task were collected and cleaned by two experts to serve as ground truth. The best model, an RNN fed with the raw signal, recognized 72% of the artifacts and had an accuracy of 87%. An automatic correction was performed on the detected artifacts through a combination of linear interpolation and a high-degree polynomial. Evaluation of this correction showed that the automatically and manually corrected signals did not differ in terms of phasic components, while both differed from the raw signal. This work provides a tool to automatically correct artifacts in EDA signals recorded in uncontrolled conditions, allowing the development of intelligent systems based on EDA monitoring without human intervention.
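The linear-interpolation part of the correction step can be sketched with numpy: artifact samples are replaced by interpolating between the nearest clean samples on either side. The signal values and artifact mask below are invented, and the paper's additional high-degree polynomial smoothing is omitted.

```python
import numpy as np

def interpolate_artifacts(signal, artifact_mask):
    """Replace samples flagged as artifacts by linear interpolation from
    the surrounding clean samples (a simplified stand-in for the paper's
    interpolation + polynomial correction)."""
    signal = np.asarray(signal, dtype=float).copy()
    clean = ~np.asarray(artifact_mask, dtype=bool)
    idx = np.arange(len(signal))
    signal[~clean] = np.interp(idx[~clean], idx[clean], signal[clean])
    return signal

# Hypothetical EDA trace (microsiemens) with a movement spike at samples 3-4.
eda = np.array([0.30, 0.31, 0.32, 2.50, 2.40, 0.35, 0.36])
mask = np.array([0, 0, 0, 1, 1, 0, 0], dtype=bool)
fixed = interpolate_artifacts(eda, mask)
```

In the pipeline described by the abstract, `mask` would come from the trained artifact-recognition model rather than being hand-specified.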


2019 ◽  
Vol 35 (14) ◽  
pp. i173-i182 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract
Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines.
Availability and implementation: Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 29 (4) ◽  
pp. 2097-2108
Author(s):  
Robyn L. Croft ◽  
Courtney T. Byrd

Purpose: The purpose of this study was to identify levels of self-compassion in adults who do and do not stutter and to determine whether self-compassion predicts the impact of stuttering on quality of life in adults who stutter.
Method: Participants included 140 adults who do and do not stutter, matched for age and gender. All participants completed the Self-Compassion Scale. Adults who stutter also completed the Overall Assessment of the Speaker's Experience of Stuttering. Data were analyzed for self-compassion differences between and within adults who do and do not stutter and to assess whether self-compassion predicts the impact of stuttering on quality of life.
Results: Adults who do and do not stutter exhibited no significant differences in total self-compassion, regardless of gender. A simple linear regression of the total Overall Assessment of the Speaker's Experience of Stuttering score on the total self-compassion score showed a significant negative linear relationship, with self-compassion predicting the impact of stuttering on quality of life.
Conclusions: The data suggest that higher levels of self-kindness, mindfulness, and social connectedness (i.e., self-compassion) are related to reduced negative reactions to stuttering, increased participation in daily communication situations, and improved overall quality of life. Future research should replicate the current findings and identify moderators of the self-compassion–quality of life relationship.

