A Survey of State-of-the-Art Short Text Matching Algorithms

Self-attention mechanisms have recently been adopted for a broad range of text-matching applications. A self-attention model takes a single sentence as input with no extra information, i.e., one can use the final hidden state or pooling to represent it. However, text-matching problems can be either symmetrical or asymmetrical. For instance, paraphrase detection is a symmetrical task, while textual entailment classification and question-answer matching are asymmetrical tasks. In this article, we leverage the attractive properties of the self-attention mechanism and propose an attention-based network that incorporates three key components for inter-sequence attention: global pointwise features, preceding attentive features, and contextual features, while updating the rest of the components. We evaluate our model on two benchmark datasets covering the tasks of textual entailment and question-answer matching. The proposed efficient Self-attention-driven Network for Text Matching outperforms the state of the art on the Stanford Natural Language Inference and WikiQA datasets with far fewer parameters.
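To make the single-sentence setting concrete, the following is a minimal sketch of scaled dot-product self-attention followed by mean pooling. It is not the network proposed in the abstract: it assumes that queries, keys, and values are the raw token vectors (no learned projections), and the toy embeddings and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over one sentence.

    Assumption: queries, keys, and values are the token vectors themselves
    (no learned projections), which keeps the sketch dependency-free.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors.
    outputs = [[sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
               for row in weights]
    return outputs, weights


def mean_pool(vectors):
    """Collapse per-token vectors into one fixed-size sentence vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


# Toy 3-token sentence with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = self_attention(sentence)
sentence_vector = mean_pool(attended)
```

Pooling yields one vector per sentence regardless of its length, which is what allows two sentences to be compared afterwards. Inter-sequence attention, as described in the abstract, would additionally let the tokens of one sentence attend to the other.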

Filtering and Classifying Relevant Short Text with a Few Seed Words

Data and Information Management ◽ 10.2478/dim-2019-0011 ◽ 2019 ◽ Vol 3 (3) ◽ pp. 165-186 ◽ Cited By ~ 1

Author(s): Chenliang Li ◽ Shiqian Chen ◽ Yan Qi

Keyword(s): Latent Dirichlet Allocation ◽ Topic Model ◽ State Of The Art ◽ Superior Performance ◽ Support Vector ◽ Short Text ◽ Text Filtering ◽ Supervised Classifiers ◽ Real World Datasets ◽ Weakly Supervised

Abstract. Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extensive human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest and covers the meaning of that category. In SSCF, we devise a novel word relevance estimation process, based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post-inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.
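The mapping from short texts to per-word pseudo-documents can be sketched as follows. The abstract does not state the exact construction rule, so this sketch assumes that a word's pseudo-document collects every word that co-occurs with it in the same short text; the function name and the toy corpus are hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(short_texts):
    """Map a short-text collection to one pseudo-document per word.

    Assumption: a word's pseudo-document is the bag of words that co-occur
    with it in the same short text; the abstract does not give the exact rule.
    """
    pseudo_docs = defaultdict(list)
    for text in short_texts:
        words = text.lower().split()
        for i, word in enumerate(words):
            # Add every other word of the same short text as context.
            pseudo_docs[word].extend(words[:i] + words[i + 1:])
    return dict(pseudo_docs)


corpus = [
    "stock market falls",
    "market rally lifts stock",
    "team wins final match",
]
docs = build_pseudo_documents(corpus)
```

Under this assumption, a frequent word accumulates context from every short text it appears in, so its pseudo-document is much longer than any single short text. That is one plausible way such a mapping eases data sparsity.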

CapsTM: capsule network for Chinese medical text matching

BMC Medical Informatics and Decision Making ◽ 10.1186/s12911-021-01442-9 ◽ 2021 ◽ Vol 21 (S2)

Author(s): Xiaoming Yu ◽ Yedan Shen ◽ Yuan Ni ◽ Xiaowei Huang ◽ Xiaolong Wang ◽ ...

Keyword(s): Neural Network ◽ Neural Networks ◽ Deep Learning ◽ Network Architecture ◽ State Of The Art ◽ Interaction Matrix ◽ Application Systems ◽ Input Layer ◽ Art Methods ◽ Text Matching

Abstract. Background: Text matching (TM) is a fundamental task in natural language processing, widely used in application systems such as information retrieval, automatic question answering, machine translation, dialogue systems, and reading comprehension. In recent years, a large number of deep neural networks have been applied to TM and have repeatedly set new benchmark results. Among them, the convolutional neural network (CNN) is one of the most popular, but it has difficulty dealing with small samples and preserving the relative structure of features. In this paper, we propose a novel deep learning architecture for TM based on the capsule network, called CapsTM; the capsule network is a new type of neural network architecture proposed to address some of the shortcomings of CNNs, and it shows great potential in many tasks. Methods: CapsTM is a five-layer neural network comprising an input layer, a representation layer, an aggregation layer, a capsule layer, and a prediction layer. In CapsTM, two pieces of text are first individually converted into sequences of embeddings and then transformed by a highway network in the input layer. Next, in the representation layer, a Bidirectional Long Short-Term Memory (BiLSTM) network represents each piece of text, and an attention-based interaction matrix represents the interactive information between the two pieces of text. Subsequently, the two kinds of representations are fused by a BiLSTM in the aggregation layer and are further represented as capsules (vectors) in the capsule layer. Finally, the prediction layer is a connected network used for classification. CapsTM extends ESIM by adding a capsule layer before the prediction layer. Results: We construct a corpus of Chinese medical question matching that contains 36,360 question pairs. This corpus is randomly split into three parts: a training set of 32,360 question pairs, a development set of 2,000 question pairs, and a test set of 2,000 question pairs. On this corpus, we conduct a series of experiments to evaluate the proposed CapsTM and compare it with other state-of-the-art methods. CapsTM achieves the highest F-score, 0.8666. Conclusion: The experimental results demonstrate that CapsTM is effective for Chinese medical question matching and outperforms the other state-of-the-art methods in the comparison.
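The random three-way split described in the results can be sketched as below. The partition sizes come from the abstract; the fixed seed, the function name, and the integer IDs standing in for the question pairs are assumptions made for illustration.

```python
import random


def split_pairs(pairs, dev_size, test_size, seed=0):
    """Randomly split question pairs into train/dev/test partitions."""
    shuffled = list(pairs)
    # The fixed seed is an assumption, used only to make the split repeatable.
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# Placeholder IDs stand in for the 36,360 question pairs of the corpus.
pair_ids = range(36360)
train, dev, test = split_pairs(pair_ids, dev_size=2000, test_size=2000)
```

The three partitions are disjoint and together cover all 36,360 pairs, leaving 32,360 pairs for training.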

Hybrid Algorithm for Anomaly Removal in Time Series Data Mining

10.20944/preprints202111.0440.v1 ◽ 2021

Author(s): Abdul Razaque ◽ Marzhan Abenova ◽ Munif Alotaibi ◽ Bandar Alotaibi ◽ Hamoud Alshammari ◽ ...

Keyword(s): Data Mining ◽ Time Series ◽ Hybrid Algorithm ◽ Time Series Data ◽ State Of The Art ◽ Large Data ◽ Series Data ◽ Multidimensional Data ◽ Search Problem ◽ Short Text

Time series data are significant and are derived from temporal data, which involve real numbers representing values collected regularly over time. Time series have a great impact on many types of data. However, time series contain anomalies. We introduce a hybrid algorithm named novel matrix profile (NMP) to solve the all-pairs similarity search problem for time series data. The proposed NMP inherits features from two state-of-the-art algorithms: scalable time series anytime matrix profile (STAMP) and scalable time series ordered-search matrix profile (STOMP). The proposed algorithm caches the output in an easy-to-access fashion for single- and multidimensional data. The proposed NMP algorithm can be used on large data sets and generates approximate solutions of high quality in a reasonable time. The proposed NMP can also handle several data mining tasks. It is implemented in Python. To determine its effectiveness, it is compared with the state-of-the-art matrix profile algorithms, i.e., STAMP and STOMP. The results confirm that the proposed NMP provides higher accuracy than the compared algorithms.
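For readers unfamiliar with the all-pairs similarity search problem, here is a brute-force sketch of a matrix profile. It is not NMP, STAMP, or STOMP: it simply compares every subsequence with every other one using z-normalized Euclidean distance. The exclusion-zone width of half the window length and the toy series are assumptions.

```python
import math


def z_normalize(window):
    """Rescale a subsequence to zero mean and unit variance."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)  # constant window: treat as all zeros
    return [(x - mean) / std for x in window]


def matrix_profile(series, m):
    """Brute-force all-pairs similarity search over length-m subsequences.

    Returns, for each subsequence, the z-normalized Euclidean distance to its
    nearest non-trivial neighbour and that neighbour's start index.
    """
    count = len(series) - m + 1
    windows = [z_normalize(series[i:i + m]) for i in range(count)]
    exclusion = max(1, m // 2)  # skip overlapping "trivial" matches (assumed width)
    profile, index = [], []
    for i in range(count):
        best_dist, best_j = math.inf, -1
        for j in range(count):
            if abs(i - j) <= exclusion:
                continue
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(windows[i], windows[j])))
            if dist < best_dist:
                best_dist, best_j = dist, j
        profile.append(best_dist)
        index.append(best_j)
    return profile, index


# Toy series: the pattern 1, 3, 2 appears at positions 0 and 6.
series = [1.0, 3.0, 2.0, 9.0, 4.0, 7.0, 1.0, 3.0, 2.0, 8.0]
profile, index = matrix_profile(series, m=3)
```

Low values in `profile` mark repeated patterns and high values mark subsequences unlike anything else in the series, which is why the matrix profile is commonly used for anomaly detection. This naive version costs time quadratic in the number of subsequences; faster algorithms exist precisely to avoid that cost.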

A Sparse Topic Model for Bursty Topic Discovery in Social Networks

The International Arab Journal of Information Technology ◽ 10.34028/iajit/17/5/15 ◽ 2020 ◽ Vol 17 (5) ◽ pp. 816-824

Author(s): Lei Shi ◽ Junping Du ◽ Feifei Kou

Keyword(s): Topic Model ◽ State Of The Art ◽ Sina Weibo ◽ Short Text ◽ Qualitative And Quantitative ◽ Topic Discovery ◽ Spike And Slab Prior ◽ Automatic Discovery ◽ The Common ◽ Sparse Topic Model

Bursty topic discovery aims to automatically identify bursty events and continuously keep track of known events. Existing methods focus on topic models. However, the sparsity of short text poses a challenge for traditional topic models, because each text contains too few words to learn from in the original corpus. To tackle this problem, we propose a Sparse Topic Model (STM) for bursty topic discovery. First, we model the bursty topic and the common topic separately in order to detect how words change over time and to discover the bursty words. Second, we introduce a “Spike and Slab” prior to decouple the sparsity and smoothness of a distribution. The bursty words are then leveraged to discover the bursty topics automatically. Finally, to evaluate the effectiveness of our proposed algorithm, we collect a Sina Weibo dataset and conduct various experiments. Both qualitative and quantitative evaluations demonstrate that the proposed STM algorithm performs favorably against several state-of-the-art methods.
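The idea of decoupling sparsity from smoothness can be illustrated with a generic spike-and-slab draw. This is a sketch of the general construction only, not STM's generative process: a Bernoulli selector (the spike) decides which words a topic may use, and a symmetric Dirichlet (the slab) smooths probabilities over the selected words. The function name, parameter values, and vocabulary are hypothetical.

```python
import random


def sample_spike_and_slab_topic(vocab, keep_prob, slab_alpha, rng):
    """Draw one sparse topic-word distribution.

    Spike: a Bernoulli selector switches each word on or off for the topic.
    Slab: a symmetric Dirichlet smooths probabilities over the selected words.
    """
    selected = [w for w in vocab if rng.random() < keep_prob]
    if not selected:
        selected = [rng.choice(vocab)]  # keep at least one word so the topic is valid
    # A Dirichlet draw is a set of Gamma draws normalized to sum to one.
    gammas = {w: rng.gammavariate(slab_alpha, 1.0) for w in selected}
    total = sum(gammas.values())
    # Words switched off by the spike get exactly zero probability.
    return {w: gammas.get(w, 0.0) / total for w in vocab}


rng = random.Random(7)
vocab = ["quake", "rescue", "the", "today", "concert", "ticket"]
topic = sample_spike_and_slab_topic(vocab, keep_prob=0.5, slab_alpha=1.0, rng=rng)
```

Here `keep_prob` controls how many words a topic uses (sparsity), while `slab_alpha` controls how evenly probability is spread among them (smoothness), so the two properties can be tuned independently.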

Entity Linking for Short Text Using Structured Knowledge Graph via Multi-Grained Text Matching

10.21437/interspeech.2020-1934 ◽ 2020

Author(s): Binxuan Huang ◽ Han Wang ◽ Tong Wang ◽ Yue Liu ◽ Yang Liu

Keyword(s): Knowledge Graph ◽ Entity Linking ◽ Short Text ◽ Structured Knowledge ◽ Text Matching

Pseudo-siamese networks with lexicon for Chinese short text matching

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-202592 ◽ 2021 ◽ pp. 1-13

Author(s): Jiawen Shi ◽ Hong Li ◽ Chiyu Wang ◽ Zhicheng Pang ◽ Jiale Zhou

Keyword(s): Language Processing ◽ Chinese Text ◽ Experimental Studies ◽ Word Segmentation ◽ Chinese Word Segmentation ◽ Lexical Information ◽ Short Text ◽ Single Sentence ◽ Word Sequence ◽ Text Matching

Short text matching is one of the fundamental technologies in natural language processing. Most previous text matching networks were originally designed for English text. The common approach to applying them to Chinese is to segment each sentence into words and then take these words as input. However, this method often introduces word segmentation errors. Chinese short text matching faces the challenges of constructing effective features and understanding the semantic relationship between two sentences. In this work, we propose a novel lexicon-based pseudo-siamese model (CL2N), which can fully mine the information expressed in Chinese text. Instead of using a character sequence or a single word sequence, CL2N augments the text representation with multi-granularity information from characters and lexicons. Additionally, it integrates sentence-level features through single-sentence features as well as interactive features. Experimental studies on two Chinese text matching datasets show that our model performs better than state-of-the-art short text matching models and that the proposed method can solve the error propagation problem of Chinese word segmentation. In particular, incorporating single-sentence features and interactive features allows the network to capture contextual semantics and co-attentive lexical information, which contributes to our best result.
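The contrast between a single word segmentation and multi-granularity input can be sketched with a simple lexicon lookup. This is only an illustration of the idea, not the CL2N model: it lists every lexicon word found in a sentence alongside the characters. The lexicon, the `max_len` limit, and the function name are hypothetical.

```python
def match_lexicon(sentence, lexicon, max_len=4):
    """List every lexicon word found in the sentence as (start, end, word)."""
    matches = []
    for start in range(len(sentence)):
        # Try every span of 2..max_len characters that begins at `start`.
        for end in range(start + 2, min(len(sentence), start + max_len) + 1):
            candidate = sentence[start:end]
            if candidate in lexicon:
                matches.append((start, end, candidate))
    return matches


sentence = "南京市长江大桥"
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
characters = list(sentence)                # character-level granularity
words = match_lexicon(sentence, lexicon)   # lexicon-level granularity
```

In this example the candidates 市长 ("mayor") and 长江 ("Yangtze River") overlap, so a single segmentation must commit to one reading and may pick the wrong one. Keeping the characters together with all matched lexicon words defers that decision to the model.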

Text-based Person Search via Multi-Granularity Embedding Learning

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/148 ◽ 2021

Author(s): Chengji Wang ◽ Zhiming Luo ◽ Yaojin Lin ◽ Shaozi Li

Keyword(s): State Of The Art ◽ Spatial Scales ◽ Learning Model ◽ Embedding Problem ◽ Partial Knowledge ◽ Search Methods ◽ Person Search ◽ Art Performance ◽ Coarse To Fine ◽ Text Matching

Most existing text-based person search methods depend heavily on exploring the correspondences between regions of the image and words in the sentence. However, these methods correlate image regions and words at the same semantic granularity. This (1) results in irrelevant correspondences between image and text and (2) causes an ambiguous embedding problem. In this study, we propose a novel multi-granularity embedding learning model for text-based person search. It generates multi-granularity embeddings of partial person bodies in a coarse-to-fine manner by revisiting the person image at different spatial scales. Specifically, we distill partial knowledge from image strips to guide the model in selecting the semantically relevant words from the text description. The model can thus learn discriminative and modality-invariant visual-textual embeddings. In addition, we integrate the partial embeddings at each granularity and perform multi-granularity image-text matching. Extensive experiments validate the effectiveness of our method, which achieves new state-of-the-art performance with the learned discriminative partial embeddings.
