A Novel Approach for Ontology-Based Dimensionality Reduction for Web Text Document Classification

2017 · Vol 5 (4) · pp. 44-58
Author(s): Mohamed K. Elhadad, Khaled M. Badran, Gouda I. Salama

Dimensionality reduction of the feature vector plays a vital role in enhancing text processing capabilities; it aims to reduce the size of the feature vector used in mining tasks (classification, clustering, etc.). This paper proposes an efficient approach to reducing the feature vector size for the web text document classification process. The approach uses the WordNet ontology, exploiting its hierarchical structure to eliminate from the generated feature vector any words that have no relation to a WordNet lexical category; this reduces the feature vector size without losing information about the text. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting method. The proposed ontology-based approach was evaluated against principal component analysis (PCA) in several experiments. The experimental results reveal the effectiveness of the authors' proposed approach over traditional approaches, achieving better classification accuracy, F-measure, precision, and recall.
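As a rough illustration of the kind of filtering and weighting this abstract describes, the sketch below uses a small hand-made word-to-category map (a hypothetical stand-in for a real WordNet lookup) to drop uncategorized tokens before computing TF-IDF weights; it is a minimal sketch, not the authors' implementation:

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lexical-category lookup;
# a real system would query WordNet synsets instead.
LEXICAL_CATEGORY = {
    "dog": "noun.animal", "cat": "noun.animal",
    "run": "verb.motion", "house": "noun.artifact",
}

def filter_terms(tokens):
    """Keep only tokens that map to some lexical category."""
    return [t for t in tokens if t in LEXICAL_CATEGORY]

def tfidf(docs):
    """Compute TF-IDF weights for a list of token lists."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [filter_terms("the dog can run".split()),
        filter_terms("the cat in the house".split())]
w = tfidf(docs)
```

Uncategorized words such as "the" never reach the vector, so the dimensionality shrinks before any weighting is done.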

2018 · Vol 6 (1) · pp. 1-10
Author(s): Mohamed K. Elhadad, Khaled M. Badran, Gouda I. Salama

The task of extracting the feature vector used in mining tasks (classification, clustering, etc.) is the most important step in enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in the web text document classification process by adding semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology to eliminate meaningless words, those with no semantic relation to any WordNet lexical category, from the generated feature vector; this reduces the feature vector size without losing information about the text. It also enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated in several experiments with five different classifiers (SVM, JRip, J48, Naive Bayes, and kNN), against principal component analysis (PCA) and against an ontology-based reduction technique without the semantic enrichment step. The experimental results reveal the effectiveness of the authors' proposed approach over traditional approaches, achieving better classification accuracy, F-measure, precision, and recall.
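The semantic enrichment step, concatenating each surviving word with its lexical category, might look like the sketch below; the category map is again a hypothetical placeholder for a WordNet query, and the `word#category` separator is an assumption for illustration:

```python
# Hypothetical word-to-category map standing in for WordNet.
LEXICAL_CATEGORY = {"dog": "noun.animal", "run": "verb.motion"}

def enrich(tokens):
    """Drop uncategorized words, then tag each survivor with its
    WordNet-style lexical category, as the paper's approach does."""
    return [f"{t}#{LEXICAL_CATEGORY[t]}"
            for t in tokens if t in LEXICAL_CATEGORY]

features = enrich("the dog can run fast".split())
```

Tagging disambiguates surface forms: a noun sense and a verb sense of the same string become distinct feature-vector dimensions.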


2020 · Vol 38 (02)
Author(s): TẠ DUY CÔNG CHIẾN

Ontologies have been applied in many applications in recent years, such as information retrieval, information extraction, and text document classification. The purpose of a domain-specific ontology is to enrich the identification of concepts and their interrelationships. In our research, we use an ontology to specify a set of generic subjects (concepts) that characterize the domain, as well as their definitions and interrelationships. This paper introduces a system for labeling the subjects of a text document based on the differential layers of a domain-specific ontology, which contains the information and vocabulary related to the computer domain. A document can contain several subjects, such as data science, databases, and machine learning. The subjects in text document classification are determined from the differential layers of the domain-specific ontology. We combine natural language processing methodologies with the domain ontology to determine the subjects in a text document. To increase performance, we use a graph database to store and access the ontology. The paper also evaluates our proposed algorithm against some other methods. Experimental results show that our proposed algorithm yields significantly better performance.
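One plausible reading of layer-based subject labeling is sketched below: document terms are matched to leaf concepts and votes are propagated up the ontology's layers to the subject level. The toy ontology and its names are illustrative assumptions, not the paper's actual graph (which lives in a graph database):

```python
from collections import Counter

# Toy layered ontology: each concept points to its parent layer.
# Names are illustrative, not taken from the paper.
PARENT = {
    "sql": "database", "index": "database",
    "regression": "machine learning", "classifier": "machine learning",
    "database": "computer science", "machine learning": "computer science",
}

def subject_of(concept):
    """Climb the layers until one step below the root subject."""
    while PARENT.get(PARENT.get(concept)) is not None:
        concept = PARENT[concept]
    return concept if concept in PARENT else None

def label_subjects(tokens, top=2):
    """Count which subjects the document's concepts vote for."""
    votes = Counter(s for s in (subject_of(t) for t in tokens) if s)
    return [s for s, _ in votes.most_common(top)]

labels = label_subjects("sql index classifier".split())
```

In a production system the `PARENT` lookup would be a traversal query against the graph database rather than an in-memory dictionary.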


2020 · Vol 39 (2) · pp. 528-535
Author(s): I.I. Ayogu

The recent increase in the emergence of Nigerian-language text online motivates this paper, in which the problem of classifying text documents written in the Yorùbá language into one of a few pre-designated classes is considered. Text document classification/categorization research is well established for English and many other languages; this is not so for Nigerian languages. This paper evaluates the performance of a multinomial Naive Bayes model learned on a research dataset consisting of 100 text samples each from the business, sporting, entertainment, technology, and political domains, separately on unigram, bigram, and trigram features obtained using the bag-of-words representation approach. Results show that the performance of the model over unigram and bigram features is comparable, but significantly better than that of a model learned on trigram features. The results generally indicate the feasibility of practically applying the NB algorithm to the classification of text documents written in the Yorùbá language.
Keywords: supervised learning, text classification, Yorùbá language, text mining, BoW representation
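The pipeline this abstract evaluates, bag-of-words n-gram extraction feeding a multinomial Naive Bayes classifier with Laplace smoothing, can be sketched in a few lines; the toy training texts below are invented placeholders, not the paper's Yorùbá dataset:

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Bag-of-words n-gram features (n=1 gives unigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""
    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.prior = Counter(labels)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def score(y):
            total = sum(self.counts[y].values())
            s = math.log(self.prior[y] / sum(self.prior.values()))
            for f in doc:                    # add-one smoothed likelihood
                s += math.log((self.counts[y][f] + 1) /
                              (total + len(self.vocab)))
            return s
        return max(self.prior, key=score)

train = [("goal scored in the match", "sport"),
         ("market shares rise", "business"),
         ("striker scored a late goal", "sport")]
docs = [ngrams(t.split(), 1) for t, _ in train]
model = MultinomialNB().fit(docs, [y for _, y in train])
pred = model.predict(ngrams("the striker scored".split(), 1))
```

Switching the `n` argument from 1 to 2 or 3 reproduces the unigram/bigram/trigram comparison the paper reports.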


Author(s): Emmanuel Buabin

The objective is the design of an intelligent recommender system classification unit using hybrid neural techniques. In particular, a neuroscience-based hybrid neural model by Buabin (2011a) is introduced, explained, and examined for its potential in real-world text document classification on the ModApte version of the Reuters news text corpus. The described neuroscience model (termed Hy-RNC) is fully integrated with a novel boosting algorithm to augment text document classification. Hy-RNC outperforms existing works and opens up an entirely new research field in the area of machine learning. The main contribution of this book chapter is the provision of a step-by-step approach to modeling the hybrid system using underlying concepts such as boosting algorithms, recurrent neural networks, and hybrid neural systems. Results attained in the experiments show impressive performance by the hybrid neural classifier, even with a minimal number of neurons in the constituent structures.


2018 · Vol 5 (4) · pp. 1-31
Author(s): Shalini Puri, Satya Prakash Singh

In recent years, many information retrieval, character recognition, and feature extraction methodologies for Devanagari, and especially for Hindi, have been proposed for different domain areas. Owing to the availability of enormous amounts of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new Hindi printed and handwritten document classification system using support vector machines and fuzzy logic is introduced. The system first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article presents a feasibility study of such systems with respect to Hindi, a survey of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare the many parameters and to estimate the contents, forms, and classifiers used in various existing techniques.


2014 · Vol 905 · pp. 528-532
Author(s): Hoan Manh Dau, Ning Xu

Text document classification is the task of analyzing the content of a text document and then deciding (or predicting) to which of a set of given groups the document belongs. There are many classification techniques, such as Naive Bayes, decision trees, k-nearest neighbors (kNN), neural networks, and the Support Vector Machine (SVM). Among these, SVM is considered a popular and powerful technique, and it is particularly well suited to classifying large, high-dimensional data. Because text document classification involves very high-dimensional data, the features selected before classification strongly affect the results. The Support Vector Machine is a very effective method in this field. This article studies the Support Vector Machine and applies it to the text document classification problem. The study shows that the SVM method with features chosen by the singular value decomposition (SVD) method outperforms other methods, including decision trees.
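The SVD-based feature selection step can be illustrated without a linear-algebra library: power iteration on AᵀA approximates the leading right singular vector of a term-document matrix, onto which documents are projected before classification. This is a minimal rank-1 sketch under invented toy data, not the article's full SVD pipeline:

```python
import math

def top_singular_direction(A, iters=100):
    """Power iteration on A^T A: approximates the leading right
    singular vector of the docs-by-terms count matrix A."""
    m = len(A[0])
    v = [1.0] * m
    for _ in range(iters):
        # w = A^T (A v)
        Av = [sum(row[j] * v[j] for j in range(m)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(A, v):
    """Reduce each document to a single SVD-derived feature."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in A]

# Toy docs-by-terms count matrix (terms: dog, cat, stock)
A = [[2, 1, 0],
     [1, 2, 0],
     [0, 0, 1]]
v = top_singular_direction(A)
reduced = project(A, v)
```

A full pipeline would keep the top-k singular directions rather than one, then train the SVM on the projected coordinates instead of the raw term counts.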

