A similarity study of I/O traces via string kernels

Abstract

Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences with modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation, because their running time grows exponentially with the sub-sequence feature length, the number of mismatch positions, and the task's alphabet size.

Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines.

Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK

Supplementary information: Supplementary data are available at Bioinformatics online.
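The decomposition described in the abstract can be illustrated with a small sketch. It assumes a window length g = k + m, where m positions are treated as mismatch (ignored) positions: each choice of kept positions yields one independent counting operation, and the Monte Carlo variant samples only some of those choices. The function names below are hypothetical and are not the API of the FastSK package, and the exact normalization used by FastSK may differ.

```python
from collections import Counter
from itertools import combinations
from math import comb
import random


def masked_counts(seq, g, keep):
    """Count length-g windows of seq, looking only at the positions in `keep`."""
    return Counter(
        tuple(seq[i + p] for p in keep)
        for i in range(len(seq) - g + 1)
    )


def position_term(x, y, g, keep):
    """One independent counting operation: dot product of masked window counts."""
    cx = masked_counts(x, g, keep)
    cy = masked_counts(y, g, keep)
    return sum(c * cy[key] for key, c in cx.items())


def gapped_kernel_exact(x, y, g, m):
    """Sum the counting term over every way to keep k = g - m positions."""
    k = g - m
    return sum(position_term(x, y, g, keep)
               for keep in combinations(range(g), k))


def gapped_kernel_mc(x, y, g, m, n_samples, seed=0):
    """Monte Carlo estimate: average over randomly sampled position sets,
    rescaled by the number of possible sets (an assumed, simple estimator)."""
    rng = random.Random(seed)
    k = g - m
    total = 0
    for _ in range(n_samples):
        keep = tuple(sorted(rng.sample(range(g), k)))
        total += position_term(x, y, g, keep)
    return comb(g, k) * total / n_samples
```

The exact version enumerates all C(g, m) position sets, which is what becomes expensive for large g and m; the sampled version trades that cost for an estimate whose accuracy depends on `n_samples`.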

All fingers are not the same: Handling variable-length sequences in a discriminative setting using conformal multi-instance kernels

10.1101/139618 ◽ 2017

Author(s): Sarvesh Nikumbh ◽ Peter Ebert ◽ Nico Pfeifer

Keyword(s): Binary Classification ◽ Positional Information ◽ Weight Vector ◽ Variable Length ◽ Genomic Sequences ◽ String Kernels ◽ Promoter Sequences ◽ Novel Approach ◽ The Individual ◽ Visualization Techniques

Abstract

Most string kernels for comparing genomic sequences are tied to the (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, the exact position-wise occurrence of signals in sequences may not be as important as in the analysis of promoter sequences, which typically have a transcription start site as reference. Existing position-aware string kernels have been shown to be useful for the latter scenario.

In this work, we propose a novel approach for sequence comparison that enables larger positional freedom than most of the existing approaches, can identify a possibly dispersed set of features in comparing variable-length sequences, and can handle both the aforementioned scenarios. Our approach, CoMIK, identifies not just the features useful towards classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels.
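To make the multi-instance idea concrete, here is a minimal sketch of a plain normalized set kernel: each sequence is cut into segments (instances), and two sequences are compared by averaging a k-mer spectrum kernel over all pairs of segments. This is not CoMIK itself; the conformal transformation and the multiple kernel learning described in the abstract are omitted, and the function names, segment length, and step size are illustrative assumptions.

```python
from collections import Counter


def segments(seq, length, step):
    """Cut seq into overlapping windows; a short sequence is its own instance.
    A trailing remainder that does not fill a whole window is dropped."""
    if len(seq) <= length:
        return [seq]
    return [seq[i:i + length]
            for i in range(0, len(seq) - length + 1, step)]


def spectrum(seq, k):
    """Counts of all contiguous k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Dot product of the k-mer count vectors of a and b."""
    ca, cb = spectrum(a, k), spectrum(b, k)
    return sum(c * cb[kmer] for kmer, c in ca.items())


def set_kernel(x, y, k=3, length=8, step=4):
    """Average the instance-level kernel over all pairs of segments."""
    xs, ys = segments(x, length, step), segments(y, length, step)
    total = sum(spectrum_kernel(a, b, k) for a in xs for b in ys)
    return total / (len(xs) * len(ys))
```

Because the comparison runs over segment pairs rather than aligned positions, sequences of different lengths can be compared directly, for example `set_kernel("ACGTACGTACGT", "ACGT")`.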

A Clustering Algorithm of Trajectories for Behaviour Understanding Based on String Kernels

2012 Eighth International Conference on Signal Image Technology and Internet Based Systems ◽ 10.1109/sitis.2012.47 ◽ 2012 ◽ Cited By ~ 7

Author(s): L. Brun ◽ A. Saggese ◽ M. Vento

Keyword(s): Clustering Algorithm ◽ String Kernels

String Kernels for Polarity Classification: A Study Across Different Languages

Natural Language Processing and Information Systems - Lecture Notes in Computer Science ◽ 10.1007/978-3-319-91947-8_50 ◽ 2018 ◽ pp. 489-493 ◽ Cited By ~ 1

Author(s): Rosa M. Giménez-Pérez ◽ Marc Franco-Salvador ◽ Paolo Rosso

Keyword(s): String Kernels ◽ Polarity Classification

String kernels for the classification of speech data

The Journal of the Acoustical Society of America ◽ 10.1121/1.4779275 ◽ 2002 ◽ Vol 112 (5) ◽ pp. 2304-2304

Author(s): John Ch. Goddard Close ◽ Fabiola M. Martinez Licona ◽ Alma E. Martinez Licona ◽ H. Leonardo Rufiner

Keyword(s): String Kernels ◽ Speech Data

Using string kernels to identify famous performers from their playing style

Intelligent Data Analysis ◽ 10.3233/ida-2008-12408 ◽ 2008 ◽ Vol 12 (4) ◽ pp. 425-440 ◽ Cited By ~ 5

Author(s): Craig Saunders ◽ David R. Hardoon ◽ John Shawe-Taylor ◽ Gerhard Widmer

Keyword(s): String Kernels

Protocol anomaly detection based on string kernels

2010 International Conference on Optics, Photonics and Energy Engineering (OPEE) ◽ 10.1109/opee.2010.5508146 ◽ 2010

Author(s): Jing Zhao ◽ Houkuan Huang ◽ Shengfeng Tian ◽ Chuanhuan Yin

Keyword(s): Anomaly Detection ◽ String Kernels

Exploiting physico-chemical properties in string kernels

BMC Bioinformatics ◽ 10.1186/1471-2105-11-s8-s7 ◽ 2010 ◽ Vol 11 (S8) ◽ Cited By ~ 14

Author(s): Nora C Toussaint ◽ Christian Widmer ◽ Oliver Kohlbacher ◽ Gunnar Rätsch

Keyword(s): Chemical Properties ◽ String Kernels ◽ Physico Chemical

Profile-based string kernels for remote homology detection and motif extraction

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s021972000500120x ◽ 2005 ◽ Vol 03 (03) ◽ pp. 527-550 ◽ Cited By ~ 77

Author(s): Rui Kuang ◽ Eugene Ie ◽ Ke Wang ◽ Kai Wang ◽ Mahira Siddiqi ◽ ...

Keyword(s): Structural Features ◽ Support Vector ◽ Protein Classification ◽ SVM Classifier ◽ Sequence Motifs ◽ Homology Detection ◽ String Kernels ◽ Remote Homology ◽ Structure Information ◽ Remote Homology Detection

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" — short regions of the original profile that contribute almost all the weight of the SVM classification score — and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets. Supplementary website:.
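The position-dependent mutation neighborhoods mentioned in the abstract can be sketched as follows. The sketch assumes a profile is a list of per-position dictionaries mapping each letter to its probability, and that a k-mer belongs to the neighborhood of a profile window when its total negative log-probability stays below a threshold `sigma`. It enumerates all k-mers by brute force instead of using the efficient data structure the abstract refers to, and the names and toy profiles are illustrative, not PSI-BLAST output.

```python
from collections import Counter
from itertools import product
from math import log


def neighborhood(window, alphabet, sigma):
    """k-mers whose negative log-probability under the profile window is < sigma."""
    members = []
    for beta in product(alphabet, repeat=len(window)):
        cost = 0.0
        for column, letter in zip(window, beta):
            p = column.get(letter, 0.0)
            if p <= 0.0:
                cost = float("inf")  # impossible letter at this position
                break
            cost -= log(p)
        if cost < sigma:
            members.append("".join(beta))
    return members


def profile_features(profile, k, alphabet, sigma):
    """Count every k-mer that falls in the neighborhood of some profile window."""
    feats = Counter()
    for i in range(len(profile) - k + 1):
        feats.update(neighborhood(profile[i:i + k], alphabet, sigma))
    return feats


def profile_kernel(pa, pb, k, alphabet, sigma):
    """Dot product of the two neighborhood-count feature vectors."""
    fa = profile_features(pa, k, alphabet, sigma)
    fb = profile_features(pb, k, alphabet, sigma)
    return sum(c * fb[kmer] for kmer, c in fa.items())
```

With a profile that puts probability 1 on a single letter per position, each neighborhood contains only the exact k-mer, so the kernel reduces to a plain k-mer spectrum kernel; softer profiles or a larger `sigma` admit inexact matches.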
