Relaxing Orthogonality Assumption in Conceptual Text Document Similarity

2020 ◽

Author(s):

ThanhThuong T. Huynh ◽

TruongAn Phamnguyen ◽

Nhon V. Do

Keyword(s):

Structural Information ◽

Knowledge Bases ◽

Similarity Measurement ◽

Document Similarity ◽

Fine Grained ◽

Text Document ◽

Structured Representations ◽

Popular Knowledge ◽

Relevance Evaluation ◽

To Come

To represent text documents more expressively, a graph-based semantic model is proposed that incorporates richer semantic information among keyphrases as well as the structural information of the text. The method produces structured representations of texts by using common, popular knowledge bases (e.g. DBpedia, Wikipedia) to acquire fine-grained information about concepts, entities, and their semantic relations, resulting in a knowledge-rich interpretation. We demonstrate the benefits of these representations in the task of document similarity measurement. The relevance of two documents is evaluated by calculating the semantic similarity between the two keyphrase graphs that represent them. Experimental results show that our approach outperforms standard baselines based on traditional document representations and comes close in performance to specialized methods tuned to this task on the specific dataset.
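The title's "orthogonality assumption" refers to treating distinct keyphrases as unrelated dimensions, as in plain cosine similarity. The following minimal sketch illustrates one common way to relax it, a soft cosine in which a relatedness score links different keyphrases. It is not the authors' keyphrase-graph method; the function names, the toy documents, and the 0.9 relatedness score are all assumptions made for illustration, standing in for a real knowledge-base lookup.

```python
import math

def soft_cosine(a, b, relatedness):
    """Soft cosine between two weighted keyphrase bags.

    a, b: dict mapping keyphrase -> weight.
    relatedness: function (k1, k2) -> score in [0, 1], equal to 1 when k1 == k2.
    """
    def bilinear(x, y):
        # Sum over all keyphrase pairs, weighted by how related they are.
        return sum(wx * wy * relatedness(kx, ky)
                   for kx, wx in x.items()
                   for ky, wy in y.items())

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

# Hypothetical relatedness table (an assumption, not data from the paper).
RELATED = {frozenset(("car", "automobile")): 0.9}

def toy_relatedness(k1, k2):
    if k1 == k2:
        return 1.0
    return RELATED.get(frozenset((k1, k2)), 0.0)

def orthogonal(k1, k2):
    # Classic assumption: different keyphrases share nothing.
    return 1.0 if k1 == k2 else 0.0

doc_a = {"car": 1.0, "engine": 1.0}
doc_b = {"automobile": 1.0, "engine": 1.0}

plain = soft_cosine(doc_a, doc_b, orthogonal)
relaxed = soft_cosine(doc_a, doc_b, toy_relatedness)
print(round(plain, 2), round(relaxed, 2))
```

Under the orthogonal assumption only the shared keyphrase "engine" contributes, whereas the relaxed version also credits the related pair "car" and "automobile", so the second score is higher.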


Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

International Journal of Applied Information Systems ◽

10.5120/ijais2017451724 ◽

2017 ◽

Vol 12 (9) ◽

pp. 1-7 ◽

Cited By ~ 1

Author(s):

Ifeanyi-Reuben Nkechi J. ◽

Ugwu Chidiebere ◽

Nwachukwu E. O.

Keyword(s):

Comparative Analysis ◽

Text Representation ◽

Document Similarity ◽

Text Document ◽

N Gram


Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

IEEE/WIC/ACM International Conference on Web Intelligence (WI'07) ◽

10.1109/wi.2007.11 ◽

2007 ◽

Cited By ~ 20

Author(s):

James Z. Wang ◽

William Taylor

Keyword(s):

Measurement Method ◽

Similarity Measurement ◽

Document Similarity ◽

Text Document


An novel cluster based feature selection and document classification model on high dimension trec data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i1.1.10146 ◽

2017 ◽

Vol 7 (1.1) ◽

pp. 466

Author(s):

Lalitha Kumari ◽

Ch. Satyanarayana

Keyword(s):

Feature Selection ◽

Error Rate ◽

Mean Squared Error ◽

Similarity Measures ◽

Latent Semantic Indexing ◽

Classification Model ◽

Document Similarity ◽

Clinical Text ◽

Text Document ◽

Computational Memory

TREC text documents are difficult to analyze for features and relevant similar documents using traditional document similarity measures. As the TREC repository grows, finding relevant clustered documents in a large collection of unstructured documents becomes a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find essential features for document entities that are similar to the TREC documents. Also, most traditional models are applicable only to limited text document sets. The main issues with traditional text mining models on the TREC repository include: 1) each document is represented as a vector with many sparse values; 2) failure to find the semantic similarity of documents within and between clusters; 3) a high mean squared error rate. In this paper, a novel feature selection based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on large numbers of TREC clinical text document sets because of computational memory and time constraints. The proposed document feature selection and cluster-based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than existing models in terms of computational memory, accuracy, and error rate.


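Pembobotan Berdasarkan Tingkat Kesamaan Semantik pada Metode Fuzzy Semi-Supervised Co-Clustering untuk Pengelompokkan Dokumen Teks [Weighting Based on the Level of Semantic Similarity in the Fuzzy Semi-Supervised Co-Clustering Method for Text Document Clustering]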

Jurnal ULTIMATICS ◽

10.31937/ti.v6i2.333 ◽

2014 ◽

Vol 6 (2) ◽

pp. 46-51

Author(s):

Galang Amanda Dwi P. ◽

Gregorius Edwadr ◽

Agus Zainal Arifin

Keyword(s):

Supervised Learning ◽

Semantic Similarity ◽

The Other ◽

Classification Result ◽

Document Similarity ◽

The Matrix ◽

Index Terms ◽

Membership Value ◽

Degree Of Similarity

Nowadays, a large amount of information cannot be reached by readers because text-based documents are misclassified. Misclassified data can also lead readers to the wrong information. The method proposed in this paper aims to classify documents into the correct group. Each document has a membership value in several different classes. The degree of similarity between two documents is measured by their semantic similarity. In fact, no document is entirely unrelated to another, although the relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of their words and synonyms. Once all inter-document similarity values are obtained, a matrix is created and then used as a semi-supervised factor. The output of the method is the membership value of each document in each class; the greatest membership value indicates the group to which the document is assigned. The classification result computed by the method is good, at 90%. Index Terms - Fuzzy co-clustering, Heuristic, Semantic Similarity, Semi-supervised learning.
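The similarity and assignment steps described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the synonym dictionary, the 0.8 synonym score, the averaging scheme, and all function names are hypothetical, and the fuzzy semi-supervised co-clustering step itself is not implemented.

```python
SYNONYM_SCORE = 0.8  # assumed score for a synonym match (not from the paper)

def word_sim(w1, w2, synonyms):
    """1.0 for identical words, SYNONYM_SCORE for listed synonyms, else 0.0."""
    if w1 == w2:
        return 1.0
    if w2 in synonyms.get(w1, ()) or w1 in synonyms.get(w2, ()):
        return SYNONYM_SCORE
    return 0.0

def doc_sim(d1, d2, synonyms):
    """Symmetric average of each word's best match in the other document."""
    def directed(x, y):
        return sum(max(word_sim(w, v, synonyms) for v in y) for w in x) / len(x)
    return (directed(d1, d2) + directed(d2, d1)) / 2

def similarity_matrix(docs, synonyms):
    """Inter-document similarity values collected into a square matrix."""
    return [[doc_sim(a, b, synonyms) for b in docs] for a in docs]

def assign(memberships):
    """Pick, for each document, the class with the greatest membership value."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

# Toy data: non-empty documents as word lists, plus a hypothetical synonym table.
docs = [["big", "house"], ["large", "home"], ["fast", "car"]]
synonyms = {"big": {"large"}, "house": {"home"}}

matrix = similarity_matrix(docs, synonyms)
groups = assign([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
print(matrix[0], groups)
```

In this toy run the first two documents share no words but are linked through synonyms, so their similarity is well above 0, while the third document scores 0 against both.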


Comparision of Different Distance Measure Methods in Text Document Clustering

INTERNATIONAL JOURNAL OF RESEARCH AND ENGINEERING ◽

10.21276/ijre.2018.5.7.2 ◽

2018 ◽

Vol 5 (7) ◽

Author(s):

Yin Min Tun ◽

Keyword(s):

Distance Measure ◽

Document Clustering ◽

Text Document ◽

Measure Methods


An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging Formerly Current Medical Imaging Reviews ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: Given the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes highly complex. Text clustering is a common optimization problem in which a large amount of text information is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, which solves the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced to strike a balance between local and global search. Local search methods such as k-medoid and k-means have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
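A minimal sketch of how a β operator might be added to hill climbing for clustering is shown below. It assumes a solution is a list of cluster labels, a neighbourhood move that reassigns one item, a β step that resets each label at random with probability `beta`, and a sum-of-squared-distances objective. These choices, the parameter values, and the toy points are assumptions for illustration and are not taken from the paper.

```python
import random

def cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]
    current_cost = cost(points, current, k)
    initial_cost = current_cost
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood move: reassign one randomly chosen item.
        candidate[rng.randrange(len(points))] = rng.randrange(k)
        # Beta operator: reset each label at random with probability beta.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        candidate_cost = cost(points, candidate, k)
        # Keep the candidate only if it is at least as good as the current solution.
        if candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
    return current, current_cost, initial_cost

# Toy 2-D points standing in for document feature vectors (hypothetical data).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
labels, final_cost, initial_cost = beta_hill_climbing(points, k=2)
print(labels, final_cost, initial_cost)
```

Setting `beta=0` reduces the loop to ordinary hill climbing with single-item moves, which makes it easy to compare the two variants on the same data.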
