scholarly journals A similarity study of I/O traces via string kernels

2018 ◽  
Vol 75 (12) ◽  
pp. 7814-7826
Author(s):  
Raul Torres ◽  
Julian M. Kunkel ◽  
Manuel F. Dolz ◽  
Thomas Ludwig
Keyword(s):  
2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract Motivation Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Sarvesh Nikumbh ◽  
Peter Ebert ◽  
Nico Pfeifer

AbstractMost string kernels for comparison of genomic sequences are generally tied to using (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, exact position-wise occurrence of signals in sequences may not be as important as in the scenario of analysis of the promoter sequences, that typically have a transcription start site as reference. Existing position-aware string kernels have been shown to be useful for the latter scenario.In this work, we propose a novel approach for sequence comparison that enables larger positional freedom than most of the existing approaches, can identify a possibly dispersed set of features in comparing variable-length sequences, and can handle both the aforementioned scenarios. Our approach, CoMIK, identifies not just the features useful towards classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels.


2002 ◽  
Vol 112 (5) ◽  
pp. 2304-2304
Author(s):  
John Ch. Goddard Close ◽  
Fabiola M. Martinez Licona ◽  
Alma E. Martinez Licona ◽  
H. Leonardo Rufiner
Keyword(s):  

2008 ◽  
Vol 12 (4) ◽  
pp. 425-440 ◽  
Author(s):  
Craig Saunders ◽  
David R. Hardoon ◽  
John Shawe-Taylor ◽  
Gerhard Widmer
Keyword(s):  

2010 ◽  
Vol 11 (S8) ◽  
Author(s):  
Nora C Toussaint ◽  
Christian Widmer ◽  
Oliver Kohlbacher ◽  
Gunnar Rätsch

2005 ◽  
Vol 03 (03) ◽  
pp. 527-550 ◽  
Author(s):  
RUI KUANG ◽  
EUGENE IE ◽  
KE WANG ◽  
KAI WANG ◽  
MAHIRA SIDDIQI ◽  
...  

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" — short regions of the original profile that contribute almost all the weight of the SVM classification score — and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets. Supplementary website:.


Sign in / Sign up

Export Citation Format

Share Document