Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics

2021 ◽  
Author(s):  
Eva Prakash ◽  
Avanti Shrikumar ◽  
Anshul Kundaje

Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of "reference"/"baseline", and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.
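The contrast between Integrated Gradients and gradient-times-input drawn in this abstract can be illustrated with a minimal numpy sketch. The toy scoring function, weights, steps count, and zero baseline below are illustrative assumptions, not taken from the paper or its models.

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=50):
    """Approximate Integrated Gradients: average the gradient along a
    straight-line path from the baseline to the input, then scale by
    the (input - baseline) difference."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    avg_grad = np.mean(
        [grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

# Hypothetical differentiable score: f(x) = relu(w . x) with known weights.
w = np.array([2.0, -1.0, 0.5])
f = lambda x: max(np.dot(w, x), 0.0)
grad_f = lambda x: w * (np.dot(w, x) > 0)

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)

ig = integrated_gradients(f, grad_f, x, baseline)
gxi = x * grad_f(x)  # gradient-times-input, for comparison
```

For this piecewise-linear toy model the two attributions coincide and Integrated Gradients satisfies its completeness property (attributions sum to f(x) - f(baseline)); the paper's point is that on real regulatory models the linear interpolation path can make them diverge.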

Energies ◽  
2018 ◽  
Vol 11 (12) ◽  
pp. 3408 ◽  
Author(s):  
Muhammad Ahmad ◽  
Anthony Mouraud ◽  
Yacine Rezgui ◽  
Monjur Mourshed

Predictive analytics play a significant role in ensuring the optimal and secure operation of power systems, reducing energy consumption, detecting and diagnosing faults, and improving grid resilience. However, accurate energy consumption prediction is challenging because of system nonlinearities, delays, and the many influencing factors (e.g., climate, occupants’ behaviour, occupancy pattern, building type). This paper investigates the accuracy and generalisation capabilities of deep highway networks (DHN) and extremely randomized trees (ET) for predicting the hourly heating, ventilation and air conditioning (HVAC) energy consumption of a hotel building. Their performance was compared with support vector regression (SVR), one of the most widely used supervised machine learning algorithms. Results showed that both the ET and DHN models marginally outperform the SVR algorithm. The paper also details the impact of increasing the deep highway network’s complexity on its performance and concludes that all developed models are equally applicable for predicting hourly HVAC energy consumption. Possible reasons for the minimal impact of DHN complexity, along with future research directions, are also highlighted.
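As a rough illustration of the kind of model comparison described above, the sketch below fits extremely randomized trees and SVR on synthetic hourly features. The features, target function, and hyperparameters are invented for the example and are not the paper's hotel dataset or settings; it assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical hourly features: outdoor temperature, hour of day, occupancy.
X = np.column_stack([
    rng.uniform(-5, 35, n),   # outdoor temperature (deg C)
    rng.integers(0, 24, n),   # hour of day
    rng.uniform(0, 1, n),     # occupancy fraction
])
# Synthetic HVAC load: nonlinear in temperature, modulated by occupancy,
# with a daily cycle and noise.
y = (np.abs(X[:, 0] - 21) * (0.5 + X[:, 2])
     + 2 * np.sin(X[:, 1] * np.pi / 12)
     + rng.normal(0, 0.5, n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svr = SVR(C=10.0).fit(X_tr, y_tr)
print("ET  R^2:", round(r2_score(y_te, et.predict(X_te)), 3))
print("SVR R^2:", round(r2_score(y_te, svr.predict(X_te)), 3))
```

On data like this the tree ensemble typically handles the unscaled, mixed-range features better than an untuned SVR, echoing the abstract's finding that ET models were competitive with or better than SVR.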


Life ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 716 ◽  
Author(s):  
Yunhe Liu ◽  
Aoshen Wu ◽  
Xueqing Peng ◽  
Xiaona Liu ◽  
Gang Liu ◽  
...  

Despite the many scRNA-seq analytic algorithms that have been developed, their performance for cell clustering cannot be quantified because the “true” clusters are unknown. Referencing the transcriptomic heterogeneity of cell clusters, a “true” mRNA number matrix of individual cells was defined as the ground truth. Based on this matrix and the actual data-generation procedure, a simulation program (SSCRNA) for raw data was developed, and the consistency between the simulated and real data was evaluated. The impact of sequencing depth and of the analysis algorithms on clustering accuracy was then quantified. The simulation results were highly consistent with the actual data. Among the normalization methods, Gaussian normalization was the most recommended; among the clustering algorithms, K-means clustering was more stable than K-means plus Louvain clustering. In conclusion, the scRNA simulation algorithm developed here reproduces the actual data-generation process, reveals the impact of parameters on classification, compares normalization/clustering algorithms, and provides novel insight into scRNA analyses.
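The abstract's core idea, starting from a known "true" mRNA matrix and simulating the sequencing step, can be caricatured in a few lines. The gamma/Poisson generative choices, the two-cluster setup, and the minimal K-means below are illustrative assumptions, not the SSCRNA program itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(depth, n_cells=100, n_genes=50):
    """Toy simulation: draw a 'true' per-cluster mRNA mean matrix, assign
    cells to clusters, then model sequencing as Poisson sampling whose
    rate scales with sequencing depth."""
    means = rng.gamma(2.0, 2.0, size=(2, n_genes))  # two ground-truth clusters
    labels = rng.integers(0, 2, n_cells)
    return rng.poisson(means[labels] * depth), labels

def kmeans2(X, iters=30):
    """Minimal 2-means with farthest-point initialization (numpy only)."""
    far = ((X - X[0]) ** 2).sum(1).argmax()
    centers = np.stack([X[0], X[far]]).astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(2):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

counts, labels = simulate(depth=5.0)
pred = kmeans2(np.log1p(counts))
# Clustering accuracy is invariant to label permutation.
acc = max((pred == labels).mean(), (pred != labels).mean())
```

Sweeping `depth` in `simulate` is one way to reproduce, in miniature, the paper's question of how sequencing depth affects cluster accuracy.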


2018 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract
Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach, inspired by the method of Integrated Gradients, for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding the pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating the predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos are available at http://bit.ly/gkmexplainvids.
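Of the interpretation methods this abstract compares against, in-silico mutagenesis (ISM) is the simplest to sketch: score every single-base substitution and report the change relative to the original sequence. The scoring function below is a hypothetical motif counter standing in for a trained gkm-SVM, which is not reproduced here.

```python
import numpy as np

BASES = "ACGT"

def ism_scores(seq, score_fn):
    """In-silico mutagenesis: for each position and each possible base,
    return score(mutant) - score(original)."""
    ref = score_fn(seq)
    out = np.zeros((len(seq), len(BASES)))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            mutant = seq[:i] + b + seq[i + 1:]
            out[i, j] = score_fn(mutant) - ref
    return out

# Hypothetical scorer: number of occurrences of the motif "GATA".
score = lambda s: float(s.count("GATA"))
effects = ism_scores("TTGATATT", score)  # motif at positions 2-5
```

Positions inside the motif show a score drop when mutated away from the motif base, while positions outside it have no effect, which is the signal an interpretation method is expected to recover. ISM's cost grows with sequence length times alphabet size per prediction, which is part of the scaling problem gkmexplain addresses.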


2016 ◽  
Author(s):  
Yujin Chung ◽  
Jody Hey

Abstract
We present a new Bayesian method for estimating demographic and phylogenetic history using population genomic data. Several key innovations are introduced that allow the study of diverse models within an Isolation with Migration framework. For the Markov chain Monte Carlo (MCMC) phase of the analysis, we use a reduced state space, consisting of simple coalescent trees without migration paths, and a simple importance sampling distribution without demography. Migration paths are analytically integrated using a Markov chain as a representation of genealogy. The new method is scalable to a large number of loci with excellent MCMC mixing properties. Once obtained, a single sample of trees is used to calculate the joint posterior density for model parameters under multiple diverse demographic models, without having to repeat MCMC runs. As implemented in the computer program MIST, we demonstrate the accuracy, scalability and other advantages of the new method using simulated data and DNA sequences of two common chimpanzee subspecies: Pan troglodytes (P. t.) troglodytes and P. t. verus.


2021 ◽  
Author(s):  
AYAN CHATTERJEE ◽  
Ram Bajpai ◽  
Martin W. Gerdes

Abstract Background: Lifestyle diseases are the leading cause of death worldwide. The gradual increase of negative behavior in humans, driven by physical inactivity, unhealthy habits, and improper nutrition, expedites the growth of lifestyle diseases. Proper lifestyle management in the obesity context may help to reach a personal weight goal, or to maintain a normal weight range, by optimizing health behaviors (physical activity, diet, and habits). Objective: In this study, we develop a mathematical model to analyze the impact of regular physical activity, a proper diet, and healthy habits on weight change, targeting obesity as a case study. We then design an algorithm to verify the proposed model with simulated data and to compare it with related proven models under the defined constraints. Methods: We propose a weight-change mathematical model as a function of activity, habit, and nutrition, built on the first law of thermodynamics, basal metabolic rate (BMR), total daily energy expenditure (TDEE), and body mass index (BMI), to establish a relationship between health behavior and weight change. We then verify the model with simulated data and compare it with related established models, using the revised Harris-Benedict equations for BMR and TDEE calculation. Results: The proposed mathematical model showed a strong relationship between health behavior and weight change. We verified the model with the proposed algorithm using simulated data under the defined constraints. The adoption of the revised Harris-Benedict equations for BMR and TDEE calculation outperformed the classical Wishnofsky's rule (3500 kcal ≈ 1 lb) and the models proposed by Toumasis et al., Azzeh et al., and Mickens et al., with standard deviations of ±1.829, ±2.006, ±1.85, and ±1.80, respectively.
Conclusions: This study helped us understand, in mathematical terms, the impact of healthy behavior on weight change and the importance of a healthy lifestyle. As future work, we plan to use this model in a health eCoach system to generate personalized lifestyle recommendations that optimize health behaviors and help users accomplish personal weight goals.
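The revised Harris-Benedict equations the study uses for BMR, with TDEE obtained via the commonly published physical-activity multipliers, can be written down directly. The equations and activity factors below are the standard published values (Roza and Shizgal's 1984 revision); the example inputs are hypothetical, and the paper's full weight-change model is not reproduced here.

```python
def bmr_revised_harris_benedict(weight_kg, height_cm, age_yr, sex):
    """Revised Harris-Benedict equations for basal metabolic rate (kcal/day)."""
    if sex == "male":
        return 88.362 + 13.397 * weight_kg + 4.799 * height_cm - 5.677 * age_yr
    return 447.593 + 9.247 * weight_kg + 3.098 * height_cm - 4.330 * age_yr

# TDEE scales BMR by a physical-activity multiplier.
ACTIVITY = {
    "sedentary": 1.2,
    "light": 1.375,
    "moderate": 1.55,
    "active": 1.725,
    "very_active": 1.9,
}

def tdee(bmr, level):
    return bmr * ACTIVITY[level]

# Hypothetical subject: 80 kg, 180 cm, 30-year-old male.
bmr = bmr_revised_harris_benedict(80, 180, 30, "male")  # ≈ 1854 kcal/day
daily_energy = tdee(bmr, "moderate")
```

A weight-change model of the kind described then compares energy intake against TDEE to estimate the daily energy surplus or deficit.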


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Carl-Magnus Svensson ◽  
Ron Hübler ◽  
Marc Thilo Figge

Application of personalized medicine requires the integration of different data to determine each patient’s unique clinical constitution. The automated analysis of medical data is a growing field in which different machine learning techniques are used to minimize the time-consuming task of manual analysis. The evaluation, and often the training, of automated classifiers requires manually labelled data as ground truth. In many cases such labelling is not perfect, either because the data are ambiguous even for a trained expert or because of mistakes. Here we investigated the interobserver variability of image data comprising fluorescently stained circulating tumor cells and its effect on the performance of two automated classifiers, a random forest and a support vector machine. We found that uncertainty in annotation between observers limited the performance of the automated classifiers, especially when it was included in the test set on which classifier performance was measured. The random forest classifier turned out to be resilient to uncertainty in the training data, while the support vector machine’s performance was highly dependent on the amount of uncertainty in the training data. Finally, we introduced the consensus data set as a possible solution for the evaluation of automated classifiers that minimizes the penalty of interobserver variability.
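The consensus-data-set idea, keeping only samples on which all observers agree, can be sketched in a few lines. The two-observer labels below are invented for illustration.

```python
def consensus_set(annotations):
    """Return (indices, labels) of the samples on which every observer
    agrees; disagreements are dropped rather than resolved."""
    idx, labels = [], []
    for i, votes in enumerate(zip(*annotations)):
        if len(set(votes)) == 1:  # unanimous
            idx.append(i)
            labels.append(votes[0])
    return idx, labels

# Hypothetical labels from two observers for four candidate cells.
obs1 = ["cell", "cell", "debris", "cell"]
obs2 = ["cell", "debris", "debris", "cell"]
idx, labels = consensus_set([obs1, obs2])  # sample 1 is dropped
```

Evaluating a classifier only on the consensus subset removes the ambiguous cases from the test set, which is exactly how the abstract proposes to avoid penalizing classifiers for interobserver disagreement.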


Author(s):  
Nipon Theera-Umpon ◽  
Udomsak Boonprasert

This paper demonstrates an application of the support vector machine (SVM) to oceanic disaster search and rescue operations. Support vector regression (SVR) for the system identification of a nonlinear black-box model is utilized in this research. The SVR-based ocean model helps the search and rescue unit by predicting the target’s position at any given time. The closer the predicted location is to the actual location, the shorter the search time and the smaller the loss. One of the most popular ocean models, the Princeton ocean model, is applied to provide the ground truth of the target leeway. In the experiments, the results on simulated data show that the proposed SVR-based ocean model provides good predictions compared to the Princeton ocean model. Moreover, experimental results on real data collected by the Royal Thai Navy also show that the proposed model can be used as an auxiliary tool in search and rescue operations.
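As a toy version of SVR-based black-box system identification, the sketch below fits an SVR to a synthetic one-dimensional drift trajectory as a function of time. The trajectory, hyperparameters, and the 1-D simplification are assumptions for illustration (the actual work predicts 2-D leeway against the Princeton ocean model); it assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical training data: observed target position (1-D for brevity)
# as a nonlinear function of elapsed time, plus measurement noise.
t = rng.uniform(0, 10, 200)[:, None]
pos = 2.0 * t[:, 0] + 5.0 * np.sin(t[:, 0]) + rng.normal(0, 0.1, 200)

# Treat the drift dynamics as a black box and fit SVR to (time -> position).
model = SVR(C=100.0, gamma=0.5).fit(t, pos)
pred = model.predict([[4.0]])[0]  # predicted position at t = 4
```

Once fitted, the model can be queried at any time instant, which is the property the abstract relies on for estimating where to search.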


2021 ◽  
Author(s):  
Jose Llanes-Jurado ◽  
Lucía Amalia Carrasco-Ribelles ◽  
Mariano Alcañiz ◽  
Javier Marín-Morales

Abstract Scholars are increasingly using electrodermal activity (EDA) to assess cognitive-emotional states in laboratory environments, while recent applications have recorded EDA in uncontrolled settings, such as daily-life and virtual reality (VR) contexts, in which users can freely walk and move their hands. However, these recordings can be affected by major artifacts stemming from movement that can obscure valuable information. Previous work has analyzed signal correction methods to improve the quality of the signal or proposed artifact recognition models based on time windows. Despite these efforts, the correction of EDA signals in uncontrolled environments is still limited, and no existing research has used a signal manually corrected by an expert as a benchmark. This work investigates different machine learning and deep learning architectures, including support vector machines, recurrent neural networks (RNNs), and convolutional neural networks, for the automatic artifact recognition of EDA signals. Data from 44 subjects performing an immersive VR task were collected and cleaned by two experts to serve as ground truth. The best model, an RNN fed with the raw signal, recognized 72% of the artifacts and had an accuracy of 87%. An automatic correction was performed on the detected artifacts through a combination of linear interpolation and a high-degree polynomial. Evaluation of this correction showed that the automatically and manually corrected signals did not differ in terms of phasic components, while both differed from the raw signal. This work provides a tool to automatically correct artifacts in EDA signals recorded in uncontrolled conditions, allowing the development of intelligent systems based on EDA monitoring without human intervention.
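The linear-interpolation part of the correction step can be sketched with numpy: artifact samples are replaced by interpolating between the nearest clean samples on either side. The signal values and artifact mask below are invented, and the paper's additional high-degree polynomial smoothing is omitted.

```python
import numpy as np

def interpolate_artifacts(signal, artifact_mask):
    """Replace samples flagged as artifacts by linear interpolation from
    the surrounding clean samples (a simplified stand-in for the paper's
    interpolation + polynomial correction)."""
    signal = np.asarray(signal, dtype=float).copy()
    clean = ~np.asarray(artifact_mask, dtype=bool)
    idx = np.arange(len(signal))
    signal[~clean] = np.interp(idx[~clean], idx[clean], signal[clean])
    return signal

# Hypothetical EDA trace (microsiemens) with a movement spike at samples 3-4.
eda = np.array([0.30, 0.31, 0.32, 2.50, 2.40, 0.35, 0.36])
mask = np.array([0, 0, 0, 1, 1, 0, 0], dtype=bool)
fixed = interpolate_artifacts(eda, mask)
```

In the pipeline described by the abstract, `mask` would come from the trained artifact-recognition model rather than being hand-specified.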


2019 ◽  
Vol 35 (14) ◽  
pp. i173-i182 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract
Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines.
Availability and implementation: Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 29 (4) ◽  
pp. 2097-2108
Author(s):  
Robyn L. Croft ◽  
Courtney T. Byrd

Purpose: The purpose of this study was to identify levels of self-compassion in adults who do and do not stutter and to determine whether self-compassion predicts the impact of stuttering on quality of life in adults who stutter.
Method: Participants included 140 adults who do and do not stutter, matched for age and gender. All participants completed the Self-Compassion Scale. Adults who stutter also completed the Overall Assessment of the Speaker's Experience of Stuttering. Data were analyzed for self-compassion differences between and within adults who do and do not stutter and to assess whether self-compassion predicts the impact of stuttering on quality of life.
Results: Adults who do and do not stutter exhibited no significant differences in total self-compassion, regardless of gender. A simple linear regression of the total Overall Assessment of the Speaker's Experience of Stuttering score on the total self-compassion score showed a significant negative linear relationship, with self-compassion predicting the impact of stuttering on quality of life.
Conclusions: The data suggest that higher levels of self-kindness, mindfulness, and social connectedness (i.e., self-compassion) are related to reduced negative reactions to stuttering, increased participation in daily communication situations, and improved overall quality of life. Future research should replicate the current findings and identify moderators of the self-compassion–quality of life relationship.

