GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs

Avanti Shrikumar; Eva Prakash; Anshul Kundaje

doi:10.1093/bioinformatics/btz322

Bioinformatics ◽

10.1093/bioinformatics/btz322 ◽

2019 ◽

Vol 35 (14) ◽

pp. i173-i182 ◽

Cited By ~ 12

Author(s):

Avanti Shrikumar ◽

Eva Prakash ◽

Anshul Kundaje

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Chromatin Accessibility ◽

Supplementary Information ◽

Support Vector ◽

Computationally Efficient ◽

Sequence Patterns ◽

Mutation Impact ◽

Regulatory Dna

Abstract. Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines. Availability and implementation: Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain. Supplementary information: Supplementary data are available at Bioinformatics online.
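As a rough illustration of in-silico mutagenesis (ISM), one of the baseline methods named in the abstract above, the sketch below scores every single-base substitution by the change it causes in a model's output. The function names and the toy scoring function are assumptions made for illustration only; this is not the GkmExplain code, and a real use would pass a trained model's decision function as `score_fn`.

```python
def in_silico_mutagenesis(seq, score_fn, alphabet="ACGT"):
    """Return, per position, the score change for each alternative base.

    score_fn is any callable mapping a sequence string to a float
    (hypothetical stand-in for a trained model's decision function).
    """
    base_score = score_fn(seq)
    effects = []
    for i, ref in enumerate(seq):
        row = {}
        for alt in alphabet:
            if alt == ref:
                continue  # only score true substitutions
            mutant = seq[:i] + alt + seq[i + 1:]
            row[alt] = score_fn(mutant) - base_score
        effects.append(row)
    return effects


def toy_score(seq):
    """Toy stand-in for a model: counts occurrences of the motif GATA."""
    return float(seq.count("GATA"))


effects = in_silico_mutagenesis("CGATAC", toy_score)
```

Each sequence of length L requires 3L extra model evaluations under this scheme, which is one reason ISM scales poorly on long sequences.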

Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs Using Integrated Gradients

10.1101/457606 ◽

2018 ◽

Cited By ~ 1

Author(s):

Avanti Shrikumar ◽

Eva Prakash ◽

Anshul Kundaje

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Chromatin Accessibility ◽

Support Vector ◽

Computationally Efficient ◽

Link Type ◽

Novel Approach ◽

Mutation Impact ◽

Regulatory Dna

Abstract: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos available at http://bit.ly/gkmexplainvids.

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

10.1101/302711 ◽

2018 ◽

Cited By ~ 2

Author(s):

Peyton Greenside ◽

Tyler Shimko ◽

Polly Fordyce ◽

Anshul Kundaje

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Chromatin Accessibility ◽

Feature Interaction ◽

Core Motif ◽

Feature Interactions ◽

Binding Models ◽

Regulatory Dna Sequences ◽

Regulatory Dna

Abstract. Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability: Code is available at: https://github.com/kundajelab/dfim. Contact: [email protected]

FastSK: fast sequence analysis with gapped string kernels

Bioinformatics ◽

10.1093/bioinformatics/btaa817 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i857-i865

Author(s):

Derrick Blakely ◽

Eamon Collins ◽

Ritambhara Singh ◽

Andrew Norton ◽

Jack Lanchantin ◽

...

Keyword(s):

Sequence Analysis ◽

Dna Sequences ◽

English Language ◽

Computation Time ◽

Entity Recognition ◽

Supplementary Information ◽

Support Vector ◽

Homology Detection ◽

Scalable Algorithm ◽

String Kernels

Abstract. Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary information: Supplementary data are available at Bioinformatics online.
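The idea of splitting a gapped k-mer kernel into independent counting passes, one per choice of gap positions, can be sketched as follows. This is a simplified, unnormalized, exact variant written for illustration; the function name, the parameters `g` (window length) and `m` (number of gap positions), and the formulation itself are assumptions and are not taken from the FastSK package, which additionally uses a Monte Carlo approximation over position sets.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_kernel(x, y, g=4, m=1):
    """Unnormalized gapped k-mer kernel between strings x and y.

    For every way of choosing m gap positions inside a length-g window,
    count the gapped features (the g - m kept characters) in each string
    and add up the products of matching counts.
    """
    total = 0
    for gaps in combinations(range(g), m):
        keep = [i for i in range(g) if i not in gaps]
        # Each position set is an independent counting pass.
        cx = Counter(tuple(x[s + i] for i in keep) for s in range(len(x) - g + 1))
        cy = Counter(tuple(y[s + i] for i in keep) for s in range(len(y) - g + 1))
        total += sum(n * cy[feat] for feat, n in cx.items())
    return total


same = gapped_kmer_kernel("ACGT", "ACGT")       # all 4 gap choices match
one_off = gapped_kmer_kernel("ACGT", "ACGA")    # only the gap at the last position matches
disjoint = gapped_kmer_kernel("AAAA", "CCCC")   # no shared features
```

Because the passes are independent, a sampled subset of position sets gives an estimate of the full sum, which is the kind of approximation the abstract alludes to.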

Characterizing Longitudinal Changes in the Impedance Spectra of In-Vivo Peripheral Nerve Electrodes

Micromachines ◽

10.3390/mi9110587 ◽

2018 ◽

Vol 9 (11) ◽

pp. 587 ◽

Cited By ~ 11

Author(s):

Malgorzata Straka ◽

Benjamin Shafer ◽

Srikanth Vasudevan ◽

Cristin Welle ◽

Loren Rieth

Keyword(s):

Electrochemical Impedance ◽

Iridium Oxide ◽

Support Vector ◽

Impedance Spectra ◽

Computationally Efficient ◽

Wide Range ◽

Tissue Interface ◽

Computationally Efficient Algorithms ◽

Physical Changes

Characterizing the aging processes of electrodes in vivo is essential in order to elucidate the changes of the electrode–tissue interface and the device. However, commonly used impedance measurements at 1 kHz are insufficient for determining electrode viability, with measurements being prone to false positives. We implanted cohorts of five iridium oxide (IrOx) and six platinum (Pt) Utah arrays into the sciatic nerve of rats, and collected the electrochemical impedance spectroscopy (EIS) up to 12 weeks or until array failure. We developed a method to classify the shapes of the magnitude and phase spectra, and correlated the classifications to circuit models and electrochemical processes at the interface likely responsible. We found categories of EIS characteristic of iridium oxide tip metallization, platinum tip metallization, tip metal degradation, encapsulation degradation, and wire breakage in the lead. We also fitted the impedance spectra as features to a fine-Gaussian support vector machine (SVM) algorithm for both IrOx and Pt tipped arrays, with a prediction accuracy for categories of 95% and 99%, respectively. Together, this suggests that these simple and computationally efficient algorithms are sufficient to explain the majority of variance across a wide range of EIS data describing Utah arrays. These categories were assessed over time, providing insights into the degradation and failure mechanisms for both the electrode–tissue interface and wire bundle. Methods developed in this study will allow for a better understanding of how EIS can characterize the physical changes to electrodes in vivo.

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
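To make the storage concern above concrete, the following minimal sketch computes k-mer frequencies for a sequence and reports how many entries a dense frequency vector needs for a given k. The function names are illustrative assumptions and have no connection to the CRAFT code base.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each k-mer observed in seq (sparse dict)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def dense_vector_size(k, alphabet_size=4):
    """Number of entries in a dense k-mer frequency vector."""
    return alphabet_size ** k


freqs = kmer_frequencies("ACGTAC", 2)
size_k12 = dense_vector_size(12)
```

For DNA the dense vector grows as 4^k, so k = 12 already needs about 16.8 million entries per sequence, which is the growth that motivates a compact embedding.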

Enhanced TF binding site maps improve regulatory networks learned from accessible chromatin data

10.1101/545780 ◽

2019 ◽

Author(s):

Shubhada R. Kulkarni ◽

D. Marc Jones ◽

Klaas Vandepoele

Keyword(s):

Binding Site ◽

Dna Sequences ◽

Regulatory Networks ◽

Transcriptional Control ◽

Chromatin Accessibility ◽

Systematic Evaluation ◽

Open Chromatin ◽

Ensemble Approach ◽

Complete Cell ◽

Regulatory Dna

Abstract: Determining where transcription factors (TF) bind in genomes provides insights into which transcriptional programs are active across organs, tissue types, and environmental conditions. Recent advances in high-throughput profiling of regulatory DNA have yielded large amounts of information about chromatin accessibility. Interpreting the functional significance of these datasets requires knowledge of which regulators are likely to bind these regions. This can be achieved by using information about TF binding preferences, or motifs, to identify TF binding events that are likely to be functional. Although different approaches exist to map motifs to DNA sequences, a systematic evaluation of these tools in plants is missing. Here we compare four motif mapping tools widely used in the Arabidopsis research community and evaluate their performance using chromatin immunoprecipitation datasets for 40 TFs. Downstream gene regulatory network (GRN) reconstruction was found to be sensitive to the motif mapper used. We further show that the low recall of FIMO, one of the most frequently used motif mapping tools, can be overcome by using an Ensemble approach, which combines results from different mapping tools. Several examples are provided demonstrating how the Ensemble approach extends our view on transcriptional control for TFs active in different biological processes. Finally, a new protocol is presented to efficiently derive more complete cell type-specific GRNs through the integrative analysis of open chromatin regions, known binding site information, and expression datasets.

FOCUS2: agile and sensitive classification of metagenomics data using a reduced database

10.1101/046425 ◽

2016 ◽

Cited By ~ 2

Author(s):

Genivaldo Gueiros Z. Silva ◽

Bas E. Dutilh ◽

Robert A. Edwards

Keyword(s):

Microbial Community ◽

Dna Sequences ◽

Computational Method ◽

Environmental Research ◽

Supplementary Information ◽

Sequence Classification ◽

Computationally Efficient ◽

Link Type ◽

Metagenomics Data

Abstract. Summary: Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability: The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information: available at Bioinformatics online.

Complex Relationships between Chromatin Accessibility, Sequence Divergence, and Gene Expression in Arabidopsis thaliana

Molecular Biology and Evolution ◽

10.1093/molbev/msx326 ◽

2017 ◽

Vol 35 (4) ◽

pp. 837-854 ◽

Cited By ~ 10

Author(s):

Cristina M Alexandre ◽

James R Urton ◽

Ken Jean-Baptiste ◽

John Huddleston ◽

Michael W Dorrity ◽

...

Keyword(s):

Gene Expression ◽

Arabidopsis Thaliana ◽

Transcription Factors ◽

Dna Sequences ◽

Sequence Variation ◽

Expression Patterns ◽

Sequence Divergence ◽

Chromatin Accessibility ◽

Regulatory Dna ◽

Regulatory Sites

Abstract: Variation in regulatory DNA is thought to drive phenotypic variation, evolution, and disease. Prior studies of regulatory DNA and transcription factors across animal species highlighted a fundamental conundrum: Transcription factor binding domains and cognate binding sites are conserved, while regulatory DNA sequences are not. It remains unclear how conserved transcription factors and dynamic regulatory sites produce conserved expression patterns across species. Here, we explore regulatory DNA variation and its functional consequences within Arabidopsis thaliana, using chromatin accessibility to delineate regulatory DNA genome-wide. Unlike in previous cross-species comparisons, the positional homology of regulatory DNA is maintained among A. thaliana ecotypes and less nucleotide divergence has occurred. Of the ∼50,000 regulatory sites in A. thaliana, we found that 15% varied in accessibility among ecotypes. Some of these accessibility differences were associated with extensive, previously unannotated sequence variation, encompassing many deletions and ancient hypervariable alleles. Unexpectedly, for the majority of such regulatory sites, nearby gene expression was unaffected. Nevertheless, regulatory sites with high levels of sequence variation and differential chromatin accessibility were the most likely to be associated with differential gene expression. Finally, and most surprising, we found that the vast majority of differentially accessible sites show no underlying sequence variation. We argue that these surprising results highlight the necessity to consider higher-order regulatory context in evaluating regulatory variation and predicting its phenotypic consequences.

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. Availability and implementation: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information: Supplementary data are available at Bioinformatics online.

ChimeraUGEM: unsupervised gene expression modeling in any given organism

Bioinformatics ◽

10.1093/bioinformatics/btz080 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3365-3371 ◽

Cited By ~ 4

Author(s):

Alon Diament ◽

Iddo Weiner ◽

Noam Shahar ◽

Shira Landman ◽

Yael Feldman ◽

...

Keyword(s):

Gene Expression ◽

Target Gene ◽

Source Code ◽

Software Tool ◽

Supplementary Information ◽

Host Organism ◽

Protein Levels ◽

Commercial Use ◽

Sequence Patterns

Abstract. Motivation: Regulation of the amount of protein that is synthesized from genes has proved to be a serious challenge in terms of analysis and prediction, and in terms of engineering and optimization, due to the large diversity in expression machinery across species. Results: To address this challenge, we developed a methodology and a software tool (ChimeraUGEM) for predicting gene expression as well as adapting the coding sequence of a target gene to any host organism. We demonstrate these methods by predicting protein levels in seven organisms, in seven human tissues, and by increasing in vivo the expression of a synthetic gene up to 26-fold in the single-cell green alga Chlamydomonas reinhardtii. The underlying model is designed to capture sequence patterns and regulatory signals with minimal prior knowledge on the host organism and can be applied to a multitude of species and applications. Availability and implementation: Source code (MATLAB, C) and binaries are freely available for download for non-commercial use at http://www.cs.tau.ac.il/~tamirtul/ChimeraUGEM/, and supported on macOS, Linux and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.