CBER: An Effective Classification Approach Based on Enrichment Representation for Short Text Documents

2017 ◽  
Vol 26 (2) ◽  
pp. 233-241
Author(s):  
Eman Ismail ◽  
Walaa Gad

Abstract: In this paper, we propose a novel approach called Classification Based on Enrichment Representation (CBER) for short text documents. The proposed approach extracts the concepts occurring in short text documents and uses them to calculate the weight of each concept's synonyms. Concepts with the same meaning increase the weights of their shared synonyms. Because a short text document rarely repeats its concepts, we capture the semantic relationships among concepts and solve the disambiguation problem. The experimental results show that the proposed CBER is valuable in annotating short text documents with their best labels (classes). We used precision and recall measures to evaluate the proposed approach; CBER reached 93% precision and 94% recall.
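The paper does not publish its implementation, but the core enrichment idea can be pictured as spreading each term's weight over the synonyms of its candidate concepts, so that concepts sharing a meaning reinforce the same synonym even when no term is repeated. The sketch below is illustrative only; the `enrich` function and its uniform weighting scheme are assumptions, not CBER's exact algorithm, and WordNet via NLTK stands in as the concept source.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # needs: pip install nltk; nltk.download('wordnet')

def enrich(short_text):
    """Hypothetical enrichment: spread each term's weight over the synonyms
    of its candidate concepts, so shared meanings reinforce each other."""
    weights = Counter()
    for word in short_text.lower().split():
        synsets = wn.synsets(word)               # candidate concepts for the term
        for synset in synsets:
            for lemma in synset.lemma_names():   # the concept's synonyms
                weights[lemma] += 1.0 / len(synsets)
    return weights

print(enrich("heart attack symptoms").most_common(5))
```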

Author(s):  
Brinardi Leonardo ◽  
Seng Hansun

Plagiarism is an act that universities regard as fraud: taking someone else's ideas or writings without citing the source and claiming them as one's own. A plagiarism detection system generally applies a string matching algorithm to a text document to search for words common between documents. Among the algorithms used for string matching are Rabin-Karp and Jaro-Winkler Distance. The Rabin-Karp algorithm is well suited to matching multiple string patterns, while the Jaro-Winkler Distance algorithm has advantages in terms of running time. A plagiarism detection application was developed and tested on different document types, i.e. doc, docx, pdf, and txt. The experimental results show that both algorithms can be used to detect plagiarism in these documents, but the Rabin-Karp algorithm is more effective and faster when processing documents larger than 1000 KB.
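For reference, minimal self-contained versions of the two algorithms compared above can be sketched as follows; these are textbook formulations, not the authors' implementation.

```python
def rabin_karp_count(text, pattern, base=256, mod=101):
    """Count occurrences of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    h = pow(base, m - 1, mod)                    # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (base * p_hash + ord(pattern[i])) % mod
        t_hash = (base * t_hash + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # Verify on hash match to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            count += 1
        if i < n - m:                            # roll the window one char right
            t_hash = (base * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % mod
    return count

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                   # find matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                  # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost Jaro similarity for strings sharing a common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(rabin_karp_count("plagiarism is bad, plagiarism is fraud", "plagiarism"))  # 2
print(round(jaro_winkler("martha", "marhta"), 3))                                # 0.961
```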


2018 ◽  
Vol 6 (1) ◽  
pp. 1-10 ◽  
Author(s):  
Mohamed K. Elhadad ◽  
Khaled M. Badran ◽  
Gouda I. Salama

The task of extracting the feature vector used in mining tasks (classification, clustering, etc.) is the most important step toward enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in classifying web text documents by adding semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology in two ways: it eliminates meaningless words that have no semantic relation to any WordNet lexical category, which reduces the feature vector size without losing information about the text, and it enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For the mining tasks, the Vector Space Model (VSM) represents text documents, and Term Frequency-Inverse Document Frequency (TFIDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against Principal Component Analysis (PCA) and against an ontology-based reduction technique without the semantic enrichment step, in several experiments with five different classifiers (SVM, JRip, J48, Naive Bayes, and kNN). The experimental results reveal the effectiveness of the authors' proposed approach over the traditional approaches in classification accuracy, F-measure, precision, and recall.
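The enrichment step can be illustrated with NLTK's WordNet interface, whose `lexname()` returns the lexical category of a sense (e.g. `noun.animal`). The sketch below is an assumed rendering of the two steps (drop words with no WordNet sense, tag the rest with a lexical category) rather than the authors' code; it naively takes each word's most frequent sense, where the paper's pipeline would disambiguate.

```python
from nltk.corpus import wordnet as wn                     # needs the wordnet corpus
from sklearn.feature_extraction.text import TfidfVectorizer

def enrich_tokens(doc):
    """Drop tokens with no WordNet sense; concatenate the rest with the
    lexical category (lexname) of their first (most frequent) sense."""
    out = []
    for tok in doc.lower().split():
        synsets = wn.synsets(tok)
        if synsets:                  # no sense -> treated as meaningless, dropped
            out.append(f"{tok}_{synsets[0].lexname()}")
    return out

docs = ["The doctor treated the patient", "The river bank was flooded"]
vec = TfidfVectorizer(analyzer=enrich_tokens)             # VSM with TFIDF weights
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
```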


2020 ◽  
Vol 10 (11) ◽  
pp. 3831 ◽  
Author(s):  
Sang-Min Park ◽  
Sung Joon Lee ◽  
Byung-Won On

Detecting the main aspects of a particular product from a collection of review documents is challenging in real applications. To address this problem, we focus on utilizing existing topic models that can briefly summarize large text documents. Unlike existing approaches, which are limited to modifying a particular topic model or using seed opinion words as prior knowledge, we propose a novel approach that (1) identifies starting points for learning, (2) cleans dirty topic results through word embedding and unsupervised clustering, and (3) automatically generates the right aspects using topic and head-word embedding. Experimental results show that the proposed methods create cleaner topics, improving Rouge-1 by about 25% compared to the baseline method. In addition, the three proposed methods automatically detect the main aspects suitable for the given data.
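Step (2) can be pictured as follows: embed a topic's words, cluster them, and discard words far from their cluster center. This is a hedged sketch under assumed names (`clean_topic`, `keep_ratio`) and simplifications, not the paper's exact procedure; `embed` can be any word-to-vector mapping, such as a gensim `KeyedVectors` model.

```python
import numpy as np
from sklearn.cluster import KMeans

def clean_topic(words, embed, n_clusters=2, keep_ratio=0.7):
    """Cluster a topic's words in embedding space and keep only the words
    closest to their centroid, discarding outlier ('dirty') words."""
    vecs = np.array([embed[w] for w in words])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vecs)
    dists = np.linalg.norm(vecs - km.cluster_centers_[km.labels_], axis=1)
    cutoff = np.quantile(dists, keep_ratio)      # distance threshold for keeping
    return [w for w, d in zip(words, dists) if d <= cutoff]
```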


2021 ◽  
Vol 33 (4) ◽  
pp. 147-162
Author(s):  
Aleksey Yur'evich Yakushev ◽  
Yury Vital'evich Markin ◽  
Stanislav Alexandrovich Fomin ◽  
Dmitry Olegovich Obydenkov ◽  
Boris Vladimirovich Kondrat’ev

One of the most common ways documents leak is by taking a picture of a document displayed on a screen. To investigate such cases, data leakage prevention technologies, including screen watermarking, are used. The article gives a short review of the screen-shooting watermarking problem and existing research results. A novel approach for watermarking text images displayed on the screen is proposed. The watermark is embedded as slight luminance changes in the interline spacing of the marked text. It is designed to be invisible to the human eye yet still detectable by a digital camera. An algorithm for extracting the watermark from a screen photo is presented; it does not need the original document image for successful extraction. The experimental results show that the approach is robust against screen-cam attacks, meaning the watermark persists after a photo is taken of the document displayed on the screen. A criterion for watermark message extraction accuracy without knowledge of the original message is proposed; it represents the probability that the watermark was extracted correctly.
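The embedding side of such a scheme can be sketched in a few lines: shift the luminance of each interline band up or down by a small delta to encode one bit. This is an illustrative sketch (the function name and one-bit-per-gap layout are assumptions), not the authors' algorithm, which additionally has to survive camera capture.

```python
import numpy as np
from PIL import Image

def embed_bits(img_path, gap_rows, bits, delta=2):
    """gap_rows: list of (row_start, row_end) pixel bands between text lines,
    one band per bit; delta: luminance shift small enough to stay invisible."""
    img = np.asarray(Image.open(img_path).convert("L")).astype(np.int16)
    for (r0, r1), bit in zip(gap_rows, bits):
        img[r0:r1, :] += delta if bit else -delta    # brighten for 1, darken for 0
    return Image.fromarray(np.clip(img, 0, 255).astype(np.uint8))
```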


2003 ◽  
Vol 01 (02) ◽  
pp. 307-342 ◽  
Author(s):  
Mathew Palakal ◽  
Matthew Stephens ◽  
Snehasis Mukhopadhyay ◽  
Rajeev Raje ◽  
Simon Rhodes

The biological literature databases continue to grow rapidly with vital information that is important for conducting sound biomedical research and development. The current practice of manually searching for information and extracting pertinent knowledge is a tedious, time-consuming task even for motivated biological researchers. Accurate and computationally efficient approaches to discovering relationships between biological objects in text documents are important for biologists developing biological models. The term "object" refers to any biological entity, such as a protein, gene, or cell cycle, and "relationship" refers to any dynamic action one object has on another (e.g., a protein inhibiting another protein) or one object belonging to another (e.g., the cells composing an organ). This paper presents a novel approach to extract relationships between multiple biological objects present in a text document. The approach involves object identification, reference resolution, ontology and synonym discovery, and extraction of object-object relationships. Hidden Markov Models (HMMs), dictionaries, and N-gram models set the framework for the complex task of extracting object-object relationships. Experiments were carried out using a corpus of one thousand Medline abstracts. Intermediate results were obtained for the object identification process and synonym discovery, and finally for relationship extraction. From the thousand abstracts, 53 relationships were extracted, of which 43 were correct, giving a specificity of 81 percent. These results are promising for multi-object identification and relationship finding in biological documents.
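A toy version of the dictionary-based part of such a pipeline is sketched below; the object and relation lists are hypothetical, and the paper's full approach (HMMs, reference resolution, synonym discovery) is far richer than this simple co-occurrence rule.

```python
OBJECTS = {"p53", "mdm2", "cyclin"}             # hypothetical object dictionary
RELATIONS = {"inhibits", "activates", "binds"}  # hypothetical relation verbs

def extract(sentence):
    """Pair every known object left of a relation verb with every known
    object right of it, yielding (object, relation, object) triples."""
    tokens = sentence.lower().split()
    triples = []
    for i, tok in enumerate(tokens):
        if tok in RELATIONS:
            left = [t for t in tokens[:i] if t in OBJECTS]
            right = [t for t in tokens[i + 1:] if t in OBJECTS]
            triples += [(a, tok, b) for a in left for b in right]
    return triples

print(extract("MDM2 inhibits p53 in stressed cells"))  # [('mdm2', 'inhibits', 'p53')]
```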


Author(s):  
Daniel Morariu ◽  
Radu Crețulescu ◽  
Lucian Vințan

Text categorization is the problem of classifying text documents into a set of predefined classes. In this paper, we investigated two approaches: (a) developing a classifier for text documents based on Naive Bayes theory, and (b) integrating this classifier into a meta-classifier in order to increase classification accuracy. The basic idea is to learn a meta-classifier that optimally selects the best component classifier for each data point. The experimental results show that combining classifiers can significantly improve classification accuracy, and that our improved meta-classification strategy gives better results than each individual classifier. For Reuters2000 text documents we obtained classification accuracies of up to 93.87%.
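The selection idea can be sketched with scikit-learn: out-of-fold predictions reveal which base classifier is correct for each training document, and a meta-model learns to route new documents accordingly. This is a minimal sketch under assumed components (TFIDF features, a Naive Bayes selector, an SVM as a second base model), not the authors' exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

class SelectionMeta:
    """Meta-classifier that picks, per document, which base model to trust."""

    def fit(self, texts, y):
        y = np.asarray(y)
        self.vec = TfidfVectorizer()
        X = self.vec.fit_transform(texts)
        self.bases = [MultinomialNB(), LinearSVC()]
        # Out-of-fold predictions show which base model is right where.
        oof = [cross_val_predict(b, X, y, cv=5) for b in self.bases]
        best = np.argmax([p == y for p in oof], axis=0)  # first correct base (or 0)
        self.selector = MultinomialNB().fit(X, best)     # meta: route to a base
        for b in self.bases:
            b.fit(X, y)
        return self

    def predict(self, texts):
        X = self.vec.transform(texts)
        pick = self.selector.predict(X)                  # chosen base per document
        preds = np.array([b.predict(X) for b in self.bases])
        return preds[pick, np.arange(preds.shape[1])]
```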


2018 ◽  
Vol 16 (2) ◽  
pp. 107-119
Author(s):  
Supavit KONGWUDHIKUNAKORN ◽  
Kitsana WAIYAMAI

This paper presents a method for clustering short text documents, such as instant messages, SMS, or news headlines. Vocabularies in the texts are expanded using external knowledge sources and represented by a distributed word representation. Clustering is done using the K-means algorithm with Word Mover's Distance as the distance metric. Experiments compared the clustering quality of this method with that of several leading methods, using large datasets from BBC headlines, SearchSnippets, StackExchange, and Twitter. For all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that cluster descriptions can be inferred from the keywords represented in each cluster.
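Because Word Mover's Distance provides no natural centroid for the K-means update step, a common workaround is a k-medoids-style variant over a precomputed WMD matrix; the sketch below uses that swap (named plainly, since the paper itself reports K-means) with gensim's `wmdistance`, and is an illustration rather than the authors' code. The full pairwise matrix makes this O(n²) WMD calls, so it only suits small corpora.

```python
import numpy as np

def wmd_kmedoids(docs, kv, k=3, iters=10, seed=0):
    """docs: list of token lists; kv: gensim KeyedVectors (has .wmdistance)."""
    n = len(docs)
    D = np.array([[kv.wmdistance(a, b) for b in docs] for a in docs])
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)        # nearest-medoid assignment
        for c in range(k):                               # recompute each medoid
            members = np.flatnonzero(labels == c)
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                medoids[c] = members[np.argmin(within)]
    return labels
```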


Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Considering the increasing volume of text documents on Internet pages, dealing with such a tremendous amount of knowledge becomes complex because of its sheer size. Text clustering is a common optimization problem used to organize a large amount of text into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, β-hill climbing, to solve the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique; it was introduced to balance local and global search. Local search methods such as k-medoid and k-means have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed β-hill climbing achieves better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
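A minimal sketch of β-hill climbing applied to clustering is shown below; the objective (sum of distances to cluster centroids over document vectors `X`) and the move operators are simplified illustrations, not the paper's exact formulation. The key line is the β-operator: after the usual local move, each assignment is re-randomized with probability β, giving the global exploration that plain hill climbing lacks.

```python
import numpy as np

def cost(X, labels, k):
    """Sum of distances from each document vector to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            total += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return total

def beta_hill_climb(X, k=3, beta=0.05, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, len(X))
    best = cost(X, labels, k)
    for _ in range(iters):
        cand = labels.copy()
        cand[rng.integers(len(X))] = rng.integers(k)   # N-operator: local move
        mask = rng.random(len(X)) < beta               # beta-operator: random reset
        cand[mask] = rng.integers(0, k, mask.sum())
        c = cost(X, cand, k)
        if c < best:                                   # greedy acceptance
            labels, best = cand, c
    return labels
```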


2021 ◽  
Vol 40 (1) ◽  
pp. 551-563
Author(s):  
Liqiong Lu ◽  
Dong Wu ◽  
Ziwei Tang ◽  
Yaohua Yi ◽  
Faliang Huang

This paper focuses on script identification in natural scene images. Traditional CNNs (Convolutional Neural Networks) cannot solve this problem well for two reasons: first, the arbitrary aspect ratios of scene images cause difficulty for traditional CNNs, which take a fixed-size image as input; second, scripts with minor differences are easily confused because they share a subset of characters with the same shapes. We propose a novel approach combining a Score CNN, an Attention CNN, and patches. The Attention CNN determines whether a patch is discriminative and calculates the discriminative patch's contribution weight to script identification of the whole image. The Score CNN takes a discriminative patch as input and predicts the score of each script type. First, patches of the same size are extracted from the scene images. Second, these patches are used as inputs to the Score CNN and the Attention CNN to train two patch-level classifiers. Finally, the results of multiple discriminative patches extracted from the same image via the two classifiers are fused to obtain the image's script type. Using same-sized patches as CNN inputs avoids the problems caused by the arbitrary aspect ratios of scene images, and the trained classifiers can mine discriminative patches to accurately identify confusable scripts. The experimental results show the good performance of our approach on four public datasets.
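The fusion step can be pictured as an attention-weighted average of patch-level script scores; the shapes and names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse(score_out, attn_out):
    """score_out: (n_patches, n_scripts) scores from the Score CNN;
    attn_out: (n_patches,) contribution weights from the Attention CNN."""
    w = attn_out / attn_out.sum()                  # normalize patch weights
    image_scores = (w[:, None] * score_out).sum(axis=0)
    return int(np.argmax(image_scores))            # predicted script type

scores = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]])
attn = np.array([0.9, 0.1])                        # first patch is more discriminative
print(fuse(scores, attn))                          # -> 0
```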

