Semantic Similarity Metric and its Application in Text Classification

Text classification is the task of assigning natural language documents to predefined categories based on their content. The main concern of this paper is to improve the accuracy of a text classification system by combining an improved CHI method with a semantic similarity metric. First, an improved CHI method selects features from the raw feature set in order to reduce its dimensionality. Second, the semantic distance between the text feature vector and each category feature vector is calculated to determine the document's category. Finally, a series of experiments compares the method with other methods using the F1-measure. Experimental results show that the new method yields an important improvement in all categories.
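
The CHI feature selection step mentioned in this abstract can be illustrated with a minimal Python sketch. It uses the standard chi-square statistic for a term and a category, not the paper's improved variant, whose details the abstract does not give. The function names and the `(a, b, c, d)` count layout are assumptions made for illustration.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate contingency table: no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def select_features(counts: dict, k: int) -> list:
    """Keep the k terms with the highest chi-square score.

    counts maps each term to its (a, b, c, d) tuple (hypothetical layout).
    """
    ranked = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)
    return ranked[:k]
```

A term that appears only inside the category scores high, while a term spread evenly across categories scores zero and is dropped first.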

Text Classification Combined an Improved CHI and Category Relevance Factor

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.524-527.3866 ◽

2012 ◽

Vol 524-527 ◽

pp. 3866-3869

Author(s):

Pei Ying Zhang

Keyword(s):

Natural Language ◽

Text Classification ◽

Classification System ◽

Experimental Results ◽

New Method ◽

Main Concern ◽

Important Improvement ◽

Feature Weight ◽

Series Of Experiments

Text classification is the task of assigning natural language documents to predefined categories based on their content. The main concern of this paper is to improve the accuracy of a text classification system by combining an improved CHI method with a category relevance factor. First, an improved CHI method selects features from the raw feature set in order to reduce its dimensionality. Second, feature weights are calculated with the TF-CRF method, which takes into account that features are distributed differently across categories. Finally, a series of experiments compares the method with other methods using the F1-measure. Experimental results show that the new method yields an important improvement in all categories.

The Application of Semantic Similarity in Text Classification

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.346.141 ◽

2013 ◽

Vol 346 ◽

pp. 141-144

Author(s):

Pei Ying Zhang

Keyword(s):

Semantic Similarity ◽

Text Classification ◽

Learning Algorithms ◽

Feature Space ◽

Curse Of Dimensionality ◽

High Dimensionality ◽

Challenging Problem ◽

Weighting Method ◽

Important Improvement

Text classification is a challenging problem that aims to automatically assign unlabeled documents to one or more predefined classes according to their contents. The major problem of text classification is the high dimensionality of the feature space. This paper proposes an approach based on the semantic similarity between title vectors and category vectors, weighted using the tf*rf method. Experiments show that a text classifier based on semantic similarity helps dimension-sensitive learning algorithms such as KNN avoid the “curse of dimensionality” and, as a result, yields an important improvement in all categories.
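
As a rough illustration of the weighting and matching idea in this abstract, the Python sketch below computes a tf*rf weight using the commonly cited relevance frequency form rf = log2(2 + a / max(1, c)), then assigns a title to the most similar category vector. Plain cosine similarity stands in for the paper's semantic similarity, which the abstract does not define; all names and the sparse-dict vector format are assumptions.

```python
import math


def tf_rf(tf: int, pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """tf*rf weight: term frequency times relevance frequency.

    pos_docs_with_term: documents of the target category containing the term
    neg_docs_with_term: documents of other categories containing the term
    """
    rf = math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))
    return tf * rf


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as term -> weight."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(title_vec: dict, category_vecs: dict) -> str:
    """Return the category whose vector is most similar to the title vector."""
    return max(category_vecs, key=lambda c: cosine(title_vec, category_vecs[c]))
```

Because only the title terms are compared against a handful of category vectors, the comparison space stays small, which is the dimensionality benefit the abstract describes.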

Research on Semantic Similarity of Short Text Based on Bert and Time Warping Distance

Journal of Web Engineering ◽

10.13052/jwe1540-9589.20814 ◽

2021 ◽

Author(s):

Shijie Qiu ◽

Yan Niu ◽

Jun Li ◽

Xing Li

Keyword(s):

Semantic Similarity ◽

Feature Vector ◽

Semantic Features ◽

Time Warping ◽

Short Text ◽

Business Applications ◽

Ambiguous Words ◽

Text Feature ◽

Feature Information ◽

Short Text Similarity

Research on the semantic similarity of short texts plays an important role in machine translation, emotion analysis, information retrieval and other AI business applications. However, in existing short text similarity research, the characteristics of ambiguous vocabulary are difficult to analyze effectively, and the handling of problems caused by word order needs further optimization. This paper proposes a short text semantic similarity calculation method that combines BERT with a time warping distance algorithm in order to address vocabulary ambiguity. The model first uses a pre-trained BERT model to extract the semantic features of the short text as a whole, obtaining a 768-dimensional short text feature vector. It then transforms the extracted feature vector into a point sequence in space and uses the CTW algorithm to calculate the time warping distance between the curves formed by connecting the point sequences. Finally, it calculates the similarity between short texts using a weight function designed so that the smaller the time warping distance, the higher the degree of similarity. The experimental results show that this model can mine the feature information of ambiguous words and effectively calculate the similarity of short texts with lexical ambiguity. Compared with other models, it distinguishes the semantic features of ambiguous words more accurately.
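
The distance-to-similarity step can be sketched in Python as follows. This is not the CTW algorithm named in the abstract: classic dynamic time warping (DTW) is swapped in as a simpler relative, each feature vector is treated as a one-dimensional sequence of scalars, and the weight function `1 / (1 + d)` is an assumed placeholder that only preserves the stated property that a smaller distance means a higher similarity.

```python
import math


def dtw_distance(xs: list, ys: list) -> float:
    """Classic dynamic time warping distance between two scalar sequences."""
    n, m = len(xs), len(ys)
    # cost[i][j] = minimal cumulative cost of aligning xs[:i] with ys[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(xs[i - 1] - ys[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # xs advances, ys[j-1] is reused
                cost[i][j - 1],      # ys advances, xs[i-1] is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def similarity(distance: float) -> float:
    """Assumed weight function: maps distance 0 to 1.0 and decreases with distance."""
    return 1.0 / (1.0 + distance)
```

In a full pipeline the two input sequences would be the 768-dimensional vectors produced by a pre-trained BERT encoder; that step is omitted here to keep the sketch self-contained.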

Unsupervised Text Feature Extraction for Academic Chatbot using Constrained FP-Growth

ASM Science Journal ◽

10.32802/asmscj.2020.576 ◽

2021 ◽

Vol 14 ◽

pp. 1-11

Author(s):

Suraya Alias

Keyword(s):

Feature Extraction ◽

Customer Service ◽

User Satisfaction ◽

Feature Vector ◽

Size Reduction ◽

Growth Technique ◽

Feature Extraction Method ◽

Text Feature ◽

Text Features ◽

N Gram

In an age where conversation merely involves online chatting and texting one another, an automated conversational agent is needed to support certain repetitive tasks such as providing FAQs, customer service and product recommendations. One of the key challenges is to identify and discover the user’s intention in a social conversation; our work focuses on the academic domain. Our unsupervised text feature extraction method for Intent Pattern Discovery is developed by applying text feature constraints to the FP-Growth technique. The academic corpus was developed from a chat messages dataset in which conversations between students and academicians regarding undergraduate and postgraduate queries were extracted as text features for our model. We compared our new Constrained Frequent Intent Pattern (cFIP) model with the N-gram model in terms of feature-vector size reduction, descriptive intent discovery, and analysis of cFIP rules. Our findings show that significant and descriptive intent patterns were discovered, with a rule confidence value of 0.9 for cFIP of 3-sequence. We report an average feature-vector size reduction of 76% compared to the Bigram model using both the undergraduate and postgraduate conversation datasets. Usability testing showed an overall mean user satisfaction score of 4.30 out of 5 for the Academic chatbot, which supports our cFIP intent discovery approach.
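
The rule confidence figure reported in this abstract can be made concrete with a small Python sketch of support and confidence for association rules. It covers only rule scoring, not FP-tree construction or the cFIP constraints, and the function names and sample messages are invented for illustration.

```python
def support(transactions: list, itemset) -> float:
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions: list, antecedent, consequent) -> float:
    """Confidence of the rule antecedent -> consequent."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0  # rule never fires, so no confidence can be claimed
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / s_antecedent


# Hypothetical tokenized chat messages, one list of terms per message.
messages = [
    ["how", "apply", "postgraduate"],
    ["how", "apply", "undergraduate"],
    ["how", "apply", "postgraduate"],
    ["fee", "postgraduate"],
]
rule_confidence = confidence(messages, ["how", "apply"], ["postgraduate"])
```

Here the rule "how, apply -> postgraduate" holds in two of the three messages that contain its antecedent, so a confidence threshold such as 0.9 would reject it.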

Semantic Similarity Metric Learning for Sketch-Based 3D Shape Retrieval

Computational Science – ICCS 2021 - Lecture Notes in Computer Science ◽

10.1007/978-3-030-77977-1_5 ◽

2021 ◽

pp. 59-69

Author(s):

Yu Xia ◽

Shuangbu Wang ◽

Lihua You ◽

Jianjun Zhang

Keyword(s):

Semantic Similarity ◽

Metric Learning ◽

Shape Retrieval ◽

3D Shape Retrieval ◽

Similarity Metric ◽

3D Shape

Applied-Information Technology with Distributed Text Feature Extraction Method Based on MapReduce

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.1046.444 ◽

2014 ◽

Vol 1046 ◽

pp. 444-448 ◽

Cited By ~ 1

Author(s):

Lu Chen ◽

Tao Zhang ◽

Yuan Yuan Ma ◽

Cheng Zhou

Keyword(s):

Information Technology ◽

Feature Extraction ◽

Text Classification ◽

Extraction Method ◽

Text Processing ◽

Rapid Development ◽

Internet Technology ◽

Feature Extraction Method ◽

Computing Model ◽

Text Feature

With the rapid development of Internet technology and information technology, large volumes of document data have emerged, and text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. In mass text processing, the method addresses the limits on processable text size and inadequate performance, and offers a new way of thinking for research on text feature extraction methods.
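
To show the shape of a MapReduce-style feature extraction job, the Python sketch below computes document frequency per term with explicit map, shuffle and reduce phases. It runs in a single process and is only a conceptual model; the paper's actual job design is not described in the abstract, and every function name here is an assumption.

```python
from collections import defaultdict


def map_phase(doc_id: str, text: str):
    """Map: emit one (term, doc_id) pair per token in the document."""
    for term in text.lower().split():
        yield term, doc_id


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term: str, doc_ids: list):
    """Reduce: count the distinct documents that contain the term."""
    return term, len(set(doc_ids))


def document_frequency(docs: dict) -> dict:
    """Run the three phases in sequence over a dict of doc_id -> text."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run on different nodes over document splits, which is what removes the single-machine limit on text size.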

MateTee: A Semantic Similarity Metric Based on Translation Embeddings for Knowledge Graphs

Lecture Notes in Computer Science - Web Engineering ◽

10.1007/978-3-319-60131-1_14 ◽

2017 ◽

pp. 246-263 ◽

Cited By ~ 4

Author(s):

Camilo Morales ◽

Diego Collarana ◽

Maria-Esther Vidal ◽

Sören Auer

Keyword(s):

Semantic Similarity ◽

Similarity Metric ◽

Knowledge Graphs

Efficient text feature extraction by integrating the average linkage and K-medoids clustering

Modern Physics Letters B ◽

10.1142/s0217984921501517 ◽

2021 ◽

pp. 2150151

Author(s):

Dasong Sun

Keyword(s):

Feature Extraction ◽

Text Classification ◽

Experimental Results ◽

The Other ◽

Central Feature ◽

Number Of Clusters ◽

Average Linkage ◽

Text Feature

By clustering feature words, we can not only reduce the dimension of feature subsets but also eliminate redundancy among the features. However, for a feature set with very large dimensions, it is difficult to accurately estimate the value of K for the traditional K-medoids algorithm. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be used directly for text classification. To overcome the limitations of AL and K-medoids, in this paper we combine the two algorithms so that they complement each other. In particular, to meet the purpose of text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistics to obtain the approximate number of clusters. Finally, the central feature words are preserved, and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates the redundancy of the features. Compared with the traditional TF-IDF algorithm, the new algorithm improves text classification performance.
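
Two building blocks named in this abstract, the average linkage distance between clusters and the medoid (central member) of a cluster, can be sketched in Python as below. This is not the paper's combined algorithm or its test statistic; the function names are assumptions, and the caller supplies the distance function.

```python
def average_linkage(cluster_a: list, cluster_b: list, dist) -> float:
    """Average linkage: mean pairwise distance between two non-empty clusters."""
    total = sum(dist(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def medoid(cluster: list, dist):
    """Member with the smallest total distance to all members of the cluster."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))
```

Keeping only the medoid of each cluster of feature words corresponds to the step in which central feature words are preserved and the rest are deleted.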

Sentence Similarity Metric and its Application in FAQ System

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.718-720.2248 ◽

2013 ◽

Vol 718-720 ◽

pp. 2248-2251

Author(s):

Pei Ying Zhang

Keyword(s):

Semantic Similarity ◽

Question Answering ◽

Similarity Metric ◽

Question Answering System ◽

Sentence Similarity

An FAQ system is a question answering system that finds the matching question sentence in a question-answer collection and returns its corresponding answer to the user. Matching questions to the corresponding question-answer pairs has become a major challenge in FAQ systems. This paper proposes a sentence similarity metric between questions based on their semantic similarity as well as question length. Experiments show that this method can improve the accuracy and intelligence of the answering system and has some practical value.

A NEW METHOD OF MEDICAL IMAGE RETRIEVAL BASED ON COLOR–TEXTURE CORRELOGRAM AND GTI MODEL

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622009003363 ◽

2009 ◽

Vol 08 (02) ◽

pp. 239-248 ◽

Cited By ~ 9

Author(s):

XIAO-YING TAI ◽

LI-DONG WANG ◽

QIN CHEN ◽

REN FUJI ◽

KITA KENJI

Keyword(s):

Image Retrieval ◽

Medical Image ◽

Feature Vector ◽

Texture Feature ◽

Image Feature ◽

Similarity Metric ◽

Medical Image Retrieval ◽

Endoscopic Image ◽

Color Correlogram ◽

Color Texture

This paper presents a method for endoscopic image retrieval based on a color–texture correlogram and the Generalized Tversky's Index (GTI) model. First, we define a new image feature named the color–texture correlogram, which is an extension of the color correlogram. The texture image extracted by the texture spectrum algorithm is combined with the color feature vector, and then we calculate the spatial correlation of the color–texture feature vector. The similarity metric is also a key technology in the domain of image retrieval; the GTI model is used as the similarity metric in medical image retrieval, and relevance feedback is used in the algorithm to enhance retrieval efficiency. Experimental results show that the method discussed in this paper is much more effective.
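
For readers unfamiliar with the similarity model mentioned here, the Python sketch below implements the classic set-based Tversky index, S(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|). The generalized form used in the paper is not specified in the abstract, so this is only the textbook version; the function name and default weights are assumptions.

```python
def tversky_index(a, b, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Classic Tversky index between two collections treated as feature sets.

    alpha weights features found only in a, beta weights features found only in b.
    alpha = beta = 1 gives the Jaccard index; alpha = beta = 0.5 gives Dice.
    """
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    denom = common + alpha * len(set_a - set_b) + beta * len(set_b - set_a)
    if denom == 0:
        return 1.0  # both sets empty: treated as identical in this sketch
    return common / denom
```

Choosing unequal `alpha` and `beta` makes the measure asymmetric, which can be useful when the query image should be weighted differently from the database image.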
