Semantic Similarity Metric and its Application in Text Classification

2012 ◽  
Vol 170-173 ◽  
pp. 3711-3714 ◽  
Author(s):  
Pei Ying Zhang

Text classification is the task of assigning natural language textual documents to predefined categories based on their context. The main concern of this paper is to improve the accuracy of a text classification system by combining an improved CHI method with a semantic similarity metric. First, an improved CHI method is used to select features from the raw features in order to reduce the dimensionality of the feature space. Second, the semantic distance between the text feature vector and the category feature vector is calculated to determine the document's category. Finally, we carried out a series of experiments comparing our method with other methods using the F1-measure. Experimental results show that our new method yields an important improvement in all categories.
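
The feature selection step can be illustrated with the standard chi-square (CHI) statistic. The sketch below is not the improved CHI variant the abstract refers to, whose details are not given here. The function names `chi_square` and `select_features`, and the representation of each document as a set of terms, are assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, top_k):
    """Keep the top_k terms ranked by their best chi-square score over all categories.

    docs is a list of term sets, labels the category of each document.
    """
    categories = set(labels)
    vocab = set(term for doc in docs for term in doc)
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in categories:
            a = sum(1 for doc, y in zip(docs, labels) if y == cat and term in doc)
            b = sum(1 for doc, y in zip(docs, labels) if y != cat and term in doc)
            c = sum(1 for y in labels if y == cat) - a
            d = sum(1 for y in labels if y != cat) - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:top_k]
```

Taking the maximum score over categories is only one possible way to turn per-category scores into a single ranking; averaging is another common choice.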

2012 ◽  
Vol 524-527 ◽  
pp. 3866-3869
Author(s):  
Pei Ying Zhang

Text classification is the task of assigning natural language textual documents to predefined categories based on their context. The main concern of this paper is to improve the accuracy of a text classification system by combining an improved CHI method with a category relevance factor. First, an improved CHI method is used to select features from the raw features in order to reduce the dimensionality of the feature space. Second, feature weights are calculated with the TF-CRF method, which takes into account that features have different distributions in different categories. Finally, we carried out a series of experiments comparing our method with other methods using the F1-measure. Experimental results show that our new method yields an important improvement in all categories.
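
The F1-measure used for the comparison is the harmonic mean of precision and recall, computed per category. The following is a minimal sketch of that evaluation; the function names and the macro-averaging step are assumptions, since the abstract does not say how the per-category scores were aggregated.

```python
def f1_per_category(y_true, y_pred):
    """Return a dict mapping each category to its F1 score."""
    scores = {}
    for cat in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cat and p == cat)
        fp = sum(1 for t, p in pairs if t != cat and p == cat)
        fn = sum(1 for t, p in pairs if t == cat and p != cat)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        # F1 is defined as 0 when both precision and recall are 0.
        scores[cat] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores."""
    scores = f1_per_category(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```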


2013 ◽  
Vol 346 ◽  
pp. 141-144
Author(s):  
Pei Ying Zhang

Text classification is a challenging problem that aims to automatically assign unlabeled documents to one or more predefined classes according to their contents. The major problem of text classification is the high dimensionality of the feature space. This paper proposes an approach based on the semantic similarity between title vectors and category vectors using the tf*rf weighting method. Experiments show that a text classifier based on semantic similarity helps dimension-sensitive learning algorithms such as KNN to eliminate the “curse of dimensionality” and, as a result, yields an important improvement in all categories.
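
As a rough illustration of tf*rf weighting, the sketch below uses the commonly cited relevance frequency definition rf = log2(2 + a / max(1, c)), where a and c are the numbers of documents containing the term inside and outside the category. Cosine similarity stands in for the paper's semantic similarity, which the abstract does not define; all function names and the dictionary-based vector format are assumptions.

```python
import math


def rf(a, c):
    """Relevance frequency of a term for one category.

    a: documents of the category that contain the term
    c: documents of all other categories that contain the term
    """
    return math.log2(2 + a / max(1, c))


def tf_rf_vector(term_counts, df_in_category, df_outside):
    """Weight raw term frequencies by rf; all arguments are dicts keyed by term."""
    return {
        term: tf * rf(df_in_category.get(term, 0), df_outside.get(term, 0))
        for term, tf in term_counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, a title would be assigned to the category whose vector gives the highest `cosine` score against the title's tf*rf vector.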


Author(s):  
Shijie Qiu ◽  
Yan Niu ◽  
Jun Li ◽  
Xing Li

Research on the semantic similarity of short texts plays an important role in machine translation, emotion analysis, information retrieval and other AI business applications. However, in existing short text similarity research, the characteristics of ambiguous vocabulary are difficult to analyze effectively, and the handling of problems caused by word order needs further optimization. This paper proposes a short text semantic similarity calculation method that combines BERT with a time warping distance algorithm in order to solve the problem of vocabulary ambiguity. The model first uses the pre-trained BERT model to extract the semantic features of the short text as a whole, obtaining a 768-dimensional short text feature vector. It then transforms the extracted feature vector into a point sequence in space and uses the CTW algorithm to calculate the time warping distance between the curves formed by connecting the point sequences. Finally, it calculates the similarity between short texts using a weight function designed so that the smaller the time warping distance, the higher the similarity. The experimental results show that this model can mine the feature information of ambiguous words and effectively calculate the similarity of short texts with lexical ambiguity. Compared with other models, it distinguishes the semantic features of ambiguous words more accurately.
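
The pipeline from feature vector to similarity score can be sketched as follows. This is an illustration under several assumptions: classic dynamic time warping (DTW) is used in place of the CTW algorithm named in the abstract, the reshaping in `to_points` is a hypothetical way of turning a flat vector into a point sequence, and the weight function 1 / (1 + d) is only one possible choice that grows as the distance shrinks. The BERT encoding step is omitted; any list of floats can stand in for the 768-dimensional vector.

```python
import math


def to_points(vector, dim=2):
    """Hypothetical reshaping: split a flat vector into consecutive dim-sized points."""
    return [tuple(vector[i:i + dim]) for i in range(0, len(vector) - dim + 1, dim)]


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two point sequences (Euclidean local cost)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def similarity(vec_a, vec_b):
    """Assumed weight function: smaller warping distance gives higher similarity."""
    return 1.0 / (1.0 + dtw_distance(to_points(vec_a), to_points(vec_b)))
```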


2021 ◽  
Vol 14 ◽  
pp. 1-11
Author(s):  
Suraya Alias

In an age where conversation merely involves online chatting and texting one another, an automated conversational agent is needed to support certain repetitive tasks such as providing FAQs, customer service and product recommendations. One of the key challenges is to identify and discover the user’s intention in a social conversation; our work focuses on the academic domain. Our unsupervised text feature extraction method for Intent Pattern Discovery is developed by applying text feature constraints to the FP-Growth technique. The academic corpus was developed from a chat messages dataset in which conversations between students and academicians regarding undergraduate and postgraduate queries were extracted as text features for our model. We compared our new Constrained Frequent Intent Pattern (cFIP) model with the N-gram model in terms of feature-vector size reduction, descriptive intent discovery, and analysis of cFIP rules. Our findings show that significant and descriptive intent patterns were discovered, with a rule confidence value of 0.9 for cFIP of 3-sequence. We report an average feature-vector size reduction of 76% compared to the Bigram model using both undergraduate and postgraduate conversation datasets. The usability testing results showed an overall mean user satisfaction score of 4.30 out of 5 for the Academic chatbot, which supports our cFIP intent discovery approach.
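
The rule confidence reported above follows the usual association rule definition: the support of the whole pattern divided by the support of its antecedent. The sketch below shows only that calculation on tokenized messages. It does not implement FP-Growth or the cFIP constraints, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base
```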


2014 ◽  
Vol 1046 ◽  
pp. 444-448 ◽  
Author(s):  
Lu Chen ◽  
Tao Zhang ◽  
Yuan Yuan Ma ◽  
Cheng Zhou

With the rapid development of Internet and information technology, a large amount of document data has emerged, and text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. The method addresses the text size limits and inadequate performance encountered when processing massive text collections, and it offers a new way of thinking for research on text feature extraction.
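
To make the MapReduce idea concrete, the following single-process sketch computes document frequencies, a typical input to feature extraction, as a map step, a shuffle step and a reduce step. It only mimics the programming model and is not the paper's method: there is no cluster or framework involved, and every function name is an assumption.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, 1) once for each distinct term in a document."""
    return [(term, 1) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(term, values):
    """Reduce step: sum the counts collected for one term."""
    return term, sum(values)


def document_frequency(corpus):
    """Run the three steps over a dict mapping document ids to raw text."""
    pairs = [pair for doc_id, text in corpus.items() for pair in map_document(doc_id, text)]
    return dict(reduce_counts(term, values) for term, values in shuffle(pairs).items())
```

In a real deployment the map and reduce functions would run on separate workers over partitions of the corpus, which is what removes the single-machine size limit.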


2021 ◽  
pp. 2150151
Author(s):  
Dasong Sun

By clustering feature words, we can not only reduce the dimension of feature subsets but also eliminate feature redundancy. However, for a feature set with very large dimensions, it is difficult to accurately estimate the value of k for the traditional k-medoids algorithm. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be used directly for text classification. To overcome the limitations of AL and k-medoids, in this paper we combine the two algorithms so that they complement each other. In particular, for the purpose of text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistics to obtain the approximate number of clusters. Finally, the central feature words are preserved, and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates feature redundancy. Compared with the traditional TF-IDF algorithms, the new algorithm improves text classification performance.
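
The general idea of clustering feature words and keeping only the central word of each cluster can be sketched with plain average-linkage clustering followed by a medoid selection. This is not the improved AL algorithm or the testing statistic from the paper: the number of clusters is simply passed in, and the function names and the `dist` callback are assumptions.

```python
def average_linkage(points, n_clusters, dist):
    """Agglomerative clustering that merges the pair with the smallest average distance.

    Returns clusters as lists of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist(points[a], points[b]) for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def medoid(cluster, points, dist):
    """Index of the member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda a: sum(dist(points[a], points[b]) for b in cluster))
```

Keeping `medoid(c, points, dist)` for every cluster `c` and discarding the other members corresponds to preserving the central feature words.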


2013 ◽  
Vol 718-720 ◽  
pp. 2248-2251
Author(s):  
Pei Ying Zhang

An FAQ system is a question answering system that finds the matching question sentence in a question-answer collection and then returns its corresponding answer to the user. The task of matching questions to corresponding question-answer pairs has become a major challenge in FAQ systems. This paper proposes a sentence similarity metric between questions based on their semantic similarity as well as question length. Experiments show that this method can improve the accuracy and intelligence of the answering system and has some practical value.


2009 ◽  
Vol 08 (02) ◽  
pp. 239-248 ◽  
Author(s):  
XIAO-YING TAI ◽  
LI-DONG WANG ◽  
QIN CHEN ◽  
REN FUJI ◽  
KITA KENJI

This paper presents a method for endoscopic image retrieval based on the color–texture correlogram and the Generalized Tversky's Index (GTI) model. First, we define a new image feature named the color–texture correlogram, which is an extension of the color correlogram. The texture image extracted by the texture spectrum algorithm is combined with the color feature vector, and then we calculate the spatial correlation of the color–texture feature vector. The similarity metric is also a key technology in the domain of image retrieval: the GTI model is used as the similarity metric for medical image retrieval, and relevance feedback is used in the algorithm to enhance the efficiency of retrieval. Experimental results show that the method discussed in this paper is much more effective.
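
For orientation, the classic set-based Tversky index on which the GTI model builds is S(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|). The sketch below implements only this classic form. The generalization to correlogram feature vectors used in the paper is not reproduced, and the default weights are an assumption.

```python
def tversky_index(a, b, alpha=0.5, beta=0.5):
    """Classic Tversky index between two collections treated as sets.

    alpha weights the items found only in a, beta those found only in b.
    Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    """
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    denom = common + alpha * len(set_a - set_b) + beta * len(set_b - set_a)
    return common / denom if denom else 0.0
```

With alpha = beta = 1 the index reduces to the Jaccard coefficient, and with alpha = beta = 0.5 to the Dice coefficient.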

