String Kernels: Latest Research Papers

Computer-Supported Collaborative Learning tools are increasingly popular in education, as they allow multiple participants to easily communicate, share knowledge, solve problems collaboratively, or seek advice. Nevertheless, multi-participant conversation logs are often hard for teachers to follow because they mix multiple, often concurrent, discussion threads with different interaction patterns between participants. Automated guidance can be provided with the help of Natural Language Processing techniques that identify topic mixtures and semantic links between utterances, making it possible to follow the debate and the continuation of ideas. This paper introduces a method for discovering such semantic links embedded within chat conversations using string kernels, word embeddings, and neural networks. Our approach was validated on two datasets and obtained state-of-the-art results on both. Trained on a relatively small set of conversations, our models relying on string kernels are very effective at detecting such semantic links, with a matching accuracy larger than 50%, and represent a better alternative to the complex deep neural networks frequently employed in Natural Language Processing tasks where large datasets are available.
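As a rough illustration of how a string kernel can score a candidate link between two chat utterances, the sketch below uses a character n-gram intersection kernel. The abstract does not say which kernel, n-gram range, or candidate window the authors use, so the kernel choice, the 3 to 5 character range, the window of 5 previous utterances, and all function names here are assumptions made for the example. The word embeddings and the neural network mentioned in the abstract are not modelled.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams of length n_min..n_max (assumed range)."""
    text = text.lower()
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )


def intersection_kernel(a, b, n_min=3, n_max=5):
    """Sum, over shared n-grams, of the smaller of the two counts."""
    ca = char_ngrams(a, n_min, n_max)
    cb = char_ngrams(b, n_min, n_max)
    return sum(min(count, cb[gram]) for gram, count in ca.items())


def most_similar_previous(utterances, index, window=5):
    """Return the index of the earlier utterance (within `window`) that
    scores highest against utterances[index], or None if there is none."""
    candidates = range(max(0, index - window), index)
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda j: intersection_kernel(utterances[index], utterances[j]),
    )


chat = [
    "how do string kernels work",
    "lunch anyone?",
    "string kernels compare substrings",
]
print(most_similar_previous(chat, 2))
```

In this toy conversation the third utterance shares many character n-grams with the first and none with the second, so the first is picked as the likely antecedent. A real system would treat such kernel scores as features for a trained model rather than as a final decision.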

FastSK: fast sequence analysis with gapped string kernels

Bioinformatics ◽

10.1093/bioinformatics/btaa817 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i857-i865

Author(s):

Derrick Blakely ◽

Eamon Collins ◽

Ritambhara Singh ◽

Andrew Norton ◽

Jack Lanchantin ◽

...

Keyword(s):

Sequence Analysis ◽

DNA Sequences ◽

English Language ◽

Computation Time ◽

Entity Recognition ◽

Supplementary Information ◽

Support Vector ◽

Homology Detection ◽

Scalable Algorithm ◽

String Kernels

Abstract

Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size.

Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines.

Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK.

Supplementary information: Supplementary data are available at Bioinformatics online.
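The idea of splitting a gapped k-mer kernel into independent counting operations, and then sampling those operations, can be sketched in a few lines of Python. This is not the FastSK implementation or its API. It is a simplified illustration that assumes the kernel is the inner product of gapped k-mer count vectors, where each length-g window contributes the k characters kept after masking g - k positions. The function names, the uniform sampling scheme, and the parameters `g`, `k`, and `n_samples` are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def masked_counts(seq, g, keep):
    """Count the k-mers left when only positions `keep` of each g-mer are kept."""
    return Counter(
        tuple(seq[i + p] for p in keep)
        for i in range(len(seq) - g + 1)
    )


def partial_kernel(x, y, g, keep):
    """One independent counting operation: matching window pairs for one mask."""
    cx = masked_counts(x, g, keep)
    cy = masked_counts(y, g, keep)
    return sum(count * cy[key] for key, count in cx.items())


def gapped_kernel_exact(x, y, g, k):
    """Sum the partial kernels over every choice of k kept positions out of g."""
    return sum(
        partial_kernel(x, y, g, keep)
        for keep in combinations(range(g), k)
    )


def gapped_kernel_mc(x, y, g, k, n_samples, seed=0):
    """Monte Carlo estimate: average partial kernels over randomly sampled
    masks, then rescale by the total number of masks, C(g, k)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        keep = tuple(sorted(rng.sample(range(g), k)))
        total += partial_kernel(x, y, g, keep)
    return comb(g, k) * total / n_samples


x, y = "ACGTACGTTG", "ACGAACGTCG"
print(gapped_kernel_exact(x, y, g=5, k=3))
print(gapped_kernel_mc(x, y, g=5, k=3, n_samples=200))
```

The exact version enumerates all C(g, k) masks, which grows quickly with g. The sampled version evaluates a fixed number of masks, which is the kind of trade-off the abstract attributes to the Monte Carlo approximation.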

FastSK: Fast Sequence Analysis with Gapped String Kernels

10.1101/2020.04.21.053975 ◽

2020 ◽

Author(s):

Derrick Blakely ◽

Eamon Collins ◽

Ritambhara Singh ◽

Andrew Norton ◽

Jack Lanchantin ◽

...

Keyword(s):

Sequence Analysis ◽

DNA Sequences ◽

English Language ◽

Computation Time ◽

Predictive Performance ◽

Entity Recognition ◽

Support Vector ◽

Homology Detection ◽

Scalable Algorithm ◽

String Kernels

Abstract: Gapped k-mer kernels with Support Vector Machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly-sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On 10 DNA transcription factor binding site (TFBS) prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in AUC, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks across all 10 TFBS tasks. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Our algorithm is available as a Python package and as C++ source code.

Accelerating Legacy String Kernels via Bounded Automata Learning

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems ◽

10.1145/3373376.3378503 ◽

2020 ◽

Author(s):

Kevin Angstadt ◽

Jean-Baptiste Jeannin ◽

Westley Weimer

Keyword(s):

String Kernels ◽

Automata Learning

Sparse Bayesian learning for predicting phenotypes and ranking influential markers in yeast

10.1101/489245 ◽

2018 ◽

Author(s):

Maryam Ayat ◽

Michael Domaratzki

Keyword(s):

Genomic Selection ◽

Bayesian Learning ◽

Association Studies ◽

Kernel Functions ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Sparse Bayesian Learning ◽

Negative Effects ◽

String Kernels ◽

Genome Wide

Genomic selection and genome-wide association studies are two related problems that can be applied to the plant breeding industry. Genomic selection is a method to predict phenotypes (i.e., traits) such as yield and drought resistance in crops from high-density markers positioned throughout the genome of the varieties. In this paper, we employ sparse Bayesian learning as a technique for genomic selection and for ranking markers based on their relevance to a trait, which can aid in genome-wide association studies. We define and explore two different forms of sparse Bayesian learning for predicting phenotypes and identifying the most influential markers of a trait, respectively. In particular, we introduce a new framework based on sparse Bayesian and ensemble learning for ranking influential markers of a trait. Then, we apply our methods to a real-world Saccharomyces cerevisiae dataset, and analyse our results with respect to existing related works, trait heritability, as well as the accuracies obtained from the use of different kernel functions including linear, Gaussian, and string kernels. We find that sparse Bayesian methods are not only as good as other machine learning methods in predicting yeast growth in different environments, but are also capable of identifying the most important markers, including both positive and negative effects on the growth, from which biologists can get insight. This attribute can make our proposed ensemble of sparse Bayesian learners favourable in ranking markers based on their relevance to a trait.

TriPepSVM - de novo prediction of RNA-binding proteins based on short amino acid motifs

10.1101/466151 ◽

2018 ◽

Cited By ~ 2

Author(s):

Annkatrin Bressin ◽

Roman Schulte-Sasse ◽

Davide Figini ◽

Erika C Urdaneta ◽

Benedikt M Beckmann ◽

...

Keyword(s):

Binding Proteins ◽

RNA Binding ◽

De Novo ◽

RNA Binding Proteins ◽

Low Complexity ◽

Structural Features ◽

Support Vector ◽

String Kernels ◽

RNA Binders

In recent years, hundreds of novel RNA-binding proteins (RBPs) have been identified, leading to the discovery of novel RNA-binding domains (RBDs). Furthermore, unstructured or disordered low-complexity regions of RBPs have been found to play an important role in interactions with nucleic acids. However, these advances in understanding RBPs are limited mainly to eukaryotic species, and we have only limited tools to faithfully predict RNA-binders from bacteria. Here, we describe a support vector machine (SVM)-based method, called TriPepSVM, for the classification of RNA-binding proteins and non-RBPs. TriPepSVM applies string kernels to directly handle protein sequences using tri-peptide frequencies. Testing the method in human and bacteria, we find that several RBP-enriched tripeptides occur more often in structurally disordered regions of RBPs. TriPepSVM outperforms existing applications, which consider classical structural features of RNA-binding or homology, in the task of RBP prediction in both human and bacteria. Finally, we predict 66 novel RBPs in Salmonella Typhimurium and validate the bacterial proteins ClpX, DnaJ and UbiG to associate with RNA in vivo.
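To make "string kernels over tri-peptide frequencies" concrete, the sketch below computes a spectrum kernel with k = 3, which is the inner product of two sequences' tripeptide count vectors. The abstract does not name the exact kernel or normalization TriPepSVM uses, so the spectrum kernel, the cosine-style normalization, and the function names are assumptions for illustration. The two short sequences are made-up toy strings, not real proteins.

```python
from collections import Counter
from math import sqrt


def tripeptide_counts(protein, k=3):
    """Count every overlapping k-residue substring (tripeptides for k=3)."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-mer count vectors."""
    ca, cb = tripeptide_counts(a, k), tripeptide_counts(b, k)
    return sum(count * cb[kmer] for kmer, count in ca.items())


def normalized_spectrum_kernel(a, b, k=3):
    """Scale the kernel to [0, 1] so that sequence length matters less."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Toy sequences (hypothetical, for illustration only).
seq_a = "MKRRK"
seq_b = "KRRKA"
print(spectrum_kernel(seq_a, seq_b))
print(normalized_spectrum_kernel(seq_a, seq_b))
```

A matrix of such kernel values between all training sequences could then be handed to any SVM implementation that accepts a precomputed kernel. The classifier learns which tripeptides separate RBPs from non-RBPs without needing structural annotations.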
