AUTOMATIC SUBJECT LABELING IN DOCUMENTS BY USING ONTOLOGY AND GRAPH DATABASES

2020 ◽  
Vol 38 (02) ◽  
Author(s):  
TẠ DUY CÔNG CHIẾN

Ontologies have been applied in many areas in recent years, such as information retrieval, information extraction, and text document classification. The purpose of a domain-specific ontology is to enrich the identification of concepts and their interrelationships. In our research, we use an ontology to specify a set of generic subjects (concepts) that characterizes the domain, together with their definitions and interrelationships. This paper introduces a system for labeling the subjects of a text document based on the differential layers of a domain-specific ontology, which contains the information and vocabularies related to the computer domain. A document can contain several subjects, such as data science, database, and machine learning. The subjects in text document classification are determined based on the differential layers of the domain-specific ontology. We combine methodologies of Natural Language Processing with the domain ontology to determine the subjects in a text document. To increase performance, we use a graph database to store and access the ontology. In addition, the paper focuses on evaluating our proposed algorithm against some other methods. Experimental results show that our proposed algorithm yields significantly better performance.
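The labeling idea described above can be sketched in miniature: concepts at different ontology layers carry associated vocabularies, and a document is labeled with the concepts whose vocabularies it overlaps. The ontology, layers, and vocabularies below are illustrative stand-ins, not the paper's actual computer-domain ontology or graph-database storage.

```python
# Toy ontology: concept -> (layer depth, associated vocabulary).
# In the paper this structure lives in a graph database; a dict stands in here.
ontology = {
    "machine learning": (2, {"model", "training", "classifier", "accuracy"}),
    "database":         (2, {"query", "table", "index", "transaction"}),
    "data science":     (1, {"dataset", "analysis", "statistics", "model"}),
}

def label_subjects(text, min_matches=2):
    """Return concepts whose vocabulary overlaps the document's tokens."""
    tokens = set(text.lower().split())
    labels = []
    for concept, (layer, vocab) in ontology.items():
        hits = len(tokens & vocab)
        if hits >= min_matches:
            labels.append((concept, layer, hits))
    # Prefer deeper (more specific) layers, then stronger overlap.
    labels.sort(key=lambda t: (-t[1], -t[2]))
    return [concept for concept, _, _ in labels]

doc = "We train a classifier model and report accuracy on the dataset"
print(label_subjects(doc))  # → ['machine learning', 'data science']
```

A graph database earns its keep here because resolving a concept's full vocabulary means traversing its descendant layers, which is a native graph operation rather than a join.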

2021 ◽  
Vol 21 (3) ◽  
pp. 3-10
Author(s):  
Petr ŠALOUN ◽  
Barbora CIGÁNKOVÁ ◽  
David ANDREŠIČ ◽  
Lenka KRHUTOVÁ ◽  
...  

For a long time, both professionals and the lay public showed little interest in informal carers. Yet these people deal with multiple common issues in their everyday lives. As the population ages, we can observe a change in this attitude. And thanks to advances in computer science, we can offer them effective assistance and support by providing the necessary information and connecting them with both the professional and lay communities. In this work we describe a project called “Research and development of support networks and information systems for informal carers for persons after stroke”, which produces an information system visible to the public as a web portal. It does not provide just a simple set of information: using means of artificial intelligence, text document classification, and crowdsourcing that further improves its accuracy, it also provides effective visualization and navigation over content made mostly by the community itself and personalized to the informal carer's phase of the care-taking timeline. It can be beneficial for informal carers as it allows them to find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. Its goal is to test a text document classifier based on document similarity measured by the N-grams method and to design an evaluation and crowdsourcing-based classification improvement mechanism. The interface for crowdsourcing was created using the CMS WordPress. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which extends the classifier's test data set and thus makes the classification more successful.
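The N-gram similarity classification mentioned above can be sketched as follows: each document becomes a set of character trigrams, and a new text takes the label of its most similar labeled document under Jaccard similarity. The categories and texts are made up for illustration; the project's actual classifier and data are not shown in the abstract.

```python
# Minimal sketch of N-gram-based document similarity classification.

def ngrams(text, n=3):
    """Set of character n-grams (default trigrams) of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical labeled examples (not the project's real categories).
labeled = [
    ("care",   "daily care routines for stroke patients at home"),
    ("rights", "social benefits and legal rights of informal carers"),
]

def classify(text):
    """Label of the most similar labeled document."""
    grams = ngrams(text)
    return max(labeled, key=lambda item: jaccard(grams, ngrams(item[1])))[0]

print(classify("home care routine after a stroke"))  # → 'care'
```

Crowdsourced corrections slot in naturally: a confirmed (text, label) pair is simply appended to `labeled`, which is how extending the test data set improves subsequent classifications.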


2018 ◽  
Vol 6 (1) ◽  
pp. 1-10 ◽  
Author(s):  
Mohamed K. Elhadad ◽  
Khaled M. Badran ◽  
Gouda I. Salama

The task of extracting the feature vector used in mining tasks (classification, clustering, etc.) is considered the most important task for enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in the web text document classification process by adding semantics to the generated feature vector. This approach utilizes the hierarchical structure of the WordNet ontology to eliminate meaningless words from the generated feature vector that have no semantic relation to any of WordNet's lexical categories; this reduces the feature vector size without losing information about the text. It also enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach, and against an ontology-based reduction technique without the process of adding semantics to the generated feature vector, using several experiments with five different classifiers (SVM, JRip, J48, Naive Bayes, and kNN). The experimental results reveal the effectiveness of the authors' proposed approach in achieving better classification accuracy, F-measure, precision, and recall than the other traditional approaches.
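The enrichment step described above can be sketched in a few lines: tokens with no lexical category are pruned, and the rest are concatenated with their category. The `lexical_category` mapping below is a toy stand-in for WordNet's lexicographer files (in practice one would look these up via NLTK's wordnet corpus, e.g. `synset.lexname()`).

```python
# Sketch of semantic enrichment: prune uncovered tokens, tag the rest with
# their lexical category. The mapping is illustrative, not real WordNet data.
lexical_category = {
    "dog": "noun.animal",
    "run": "verb.motion",
    "red": "adj.all",
}

def enrich(tokens):
    """Keep only tokens that have a lexical category, concatenated with it."""
    return [f"{t}#{lexical_category[t]}" for t in tokens if t in lexical_category]

print(enrich(["the", "red", "dog", "xyzzy", "run"]))
# → ['red#adj.all', 'dog#noun.animal', 'run#verb.motion']
```

Both effects of the paper's approach are visible at once: "the" and "xyzzy" drop out (dimensionality reduction), while the surviving terms carry their category, so "run#verb.motion" and a hypothetical "run#noun.event" would count as distinct features.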


2017 ◽  
Vol 5 (4) ◽  
pp. 44-58 ◽  
Author(s):  
Mohamed K. Elhadad ◽  
Khaled M. Badran ◽  
Gouda I. Salama

Dimensionality reduction of the feature vector plays a vital role in enhancing text processing capabilities; it aims at reducing the size of the feature vector used in mining tasks (classification, clustering, etc.). This paper proposes an efficient approach to reducing the size of the feature vector for the web text document classification process. This approach uses the WordNet ontology, utilizing its hierarchical structure to eliminate words from the generated feature vector that have no relation to any of WordNet's lexical categories; this reduces the feature vector size without losing information about the text. For mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting method. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach using several experiments. The experimental results reveal the effectiveness of the authors' proposed approach in achieving better classification accuracy, F-measure, precision, and recall than other traditional approaches.
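The VSM/TF-IDF weighting named above is standard; a minimal sketch over an already-reduced vocabulary (the documents here are illustrative, not the paper's web corpus):

```python
# TF-IDF weight of a term in one document of a small corpus:
# tf = relative frequency in the document, idf = log(N / document frequency).
import math

docs = [
    ["database", "query", "index"],
    ["query", "optimizer"],
    ["index", "index", "database"],
]

def tfidf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc` relative to `corpus`."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "index" occurs twice in docs[2] but in only 2 of the 3 documents.
print(round(tfidf("index", docs[2], docs), 3))
```

In the VSM, each document is then the vector of these weights over the (WordNet-reduced) vocabulary, which is exactly where a smaller vocabulary pays off.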


2020 ◽  
Vol 39 (2) ◽  
pp. 528-535
Author(s):  
I.I. Ayogu

The recent increase in the emergence of Nigerian-language text online motivates this paper, in which the problem of classifying text documents written in the Yorùbá language into one of a few pre-designated classes is considered. Text document classification/categorization research is well established for English and many other languages; this is not so for Nigerian languages. This paper evaluates the performance of a multinomial Naive Bayes model learned on a research dataset consisting of 100 text samples each from the business, sporting, entertainment, technology, and political domains, separately on unigram, bigram, and trigram features obtained using the bag-of-words representation approach. Results show that the performance of the model over unigram and bigram features is comparable but significantly better than that of a model learned on trigram features. The results generally indicate a possibility for the practical application of the NB algorithm to the classification of text documents written in the Yorùbá language. Keywords: Supervised learning, text classification, Yorùbá language, text mining, BoW representation
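The multinomial Naive Bayes model evaluated above can be sketched over unigram bag-of-words counts: each class gets a log prior plus Laplace-smoothed log likelihoods of the document's tokens. The training texts and classes below are made-up English stand-ins, not the Yorùbá dataset.

```python
# Minimal multinomial Naive Bayes over unigram counts with add-one smoothing.
import math
from collections import Counter, defaultdict

# Hypothetical tiny training set (two of the paper's five class names reused).
train = [
    ("sport",    "match goal team win"),
    ("sport",    "team score match"),
    ("business", "market trade profit"),
    ("business", "trade market stock"),
]

class_words = defaultdict(list)
for label, text in train:
    class_words[label].extend(text.split())

vocab = {w for words in class_words.values() for w in words}

def predict(text):
    """Class with the highest log posterior under multinomial NB."""
    tokens = text.split()
    best, best_lp = None, -math.inf
    for label, words in class_words.items():
        counts = Counter(words)
        # log prior ...
        lp = math.log(sum(1 for l, _ in train if l == label) / len(train))
        # ... plus Laplace-smoothed log likelihood of each token.
        for t in tokens:
            lp += math.log((counts[t] + 1) / (len(words) + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("team match today"))  # → 'sport'
```

Switching to the paper's bigram or trigram features only changes tokenization: the "words" become adjacent token pairs or triples, with the same counting and smoothing.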


Author(s):  
Emmanuel Buabin

The objective is the design of an intelligent recommender system classification unit using hybrid neural techniques. In particular, a neuroscience-based hybrid neural model by Buabin (2011a) is introduced, explained, and examined for its potential in real-world text document classification on the ModApte version of the Reuters news text corpus. The described neuroscience model (termed Hy-RNC) is fully integrated with a novel boosting algorithm to augment text document classification. Hy-RNC outperforms existing works and opens up an entirely new research field in the area of machine learning. The main contribution of this book chapter is the provision of a step-by-step approach to modeling the hybrid system using underlying concepts such as boosting algorithms, recurrent neural networks, and hybrid neural systems. Results attained in the experiments show impressive performance by the hybrid neural classifier even with a minimal number of neurons in the constituent structures.
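Of the underlying concepts listed above, boosting is the easiest to show in miniature. This is the classic AdaBoost sample-weight update (the textbook concept, not Hy-RNC's own novel boosting algorithm, whose details are not given in the abstract): misclassified samples gain weight so the next weak learner focuses on them.

```python
# Classic AdaBoost reweighting step for one round of boosting.
import math

def adaboost_update(weights, correct):
    """Return (new normalized weights, alpha) after one boosting round."""
    # Weighted error of the current weak learner.
    eps = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - eps) / eps)  # the learner's vote weight
    # Down-weight correctly classified samples, up-weight the mistakes.
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)  # normalization constant
    return [w / z for w in new], alpha

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # one sample misclassified
new_w, alpha = adaboost_update(weights, correct)
print([round(w, 3) for w in new_w])  # → [0.167, 0.167, 0.167, 0.5]
```

After normalization the single mistake carries half the total weight, which is exactly the mechanism a boosted ensemble of neural weak learners exploits.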


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S480-S480
Author(s):  
Robert Lucero ◽  
Ragnhildur Bjarnadottir

Abstract Two hundred and fifty thousand older adults die annually in United States hospitals because of iatrogenic conditions (ICs). Clinicians, aging experts, patient advocates and federal policy makers agree that there is a need to enhance the safety of hospitalized older adults through improved identification and prevention of ICs. To this end, we are building a research program with the goal of enhancing the safety of hospitalized older adults by reducing ICs through an effective learning health system. Leveraging unique electronic data and healthcare system and human resources at the University of Florida, we are applying a state-of-the-art practice-based data science approach to identify risk factors of ICs (e.g., falls) from structured (i.e., nursing, clinical, administrative) and unstructured or text (i.e., registered nurse’s progress notes) data. Our interdisciplinary academic-clinical partnership includes scientific and clinical experts in patient safety, care quality, health outcomes, nursing and health informatics, natural language processing, data science, aging, standardized terminology, clinical decision support, statistics, machine learning, and hospital operations. Results to date have uncovered previously unknown fall risk factors within nursing (i.e., physical therapy initiation), clinical (i.e., number of fall risk increasing drugs, hemoglobin level), and administrative (i.e., Charlson Comorbidity Index, nurse skill mix, and registered nurse staffing ratio) structured data as well as patient cognitive, environmental, workflow, and communication factors in text data. The application of data science methods (i.e., machine learning and text-mining) and findings from this research will be used to develop text-mining pipelines to support sustained data-driven interdisciplinary aging studies to reduce ICs.


2018 ◽  
Vol 5 (4) ◽  
pp. 1-31 ◽  
Author(s):  
Shalini Puri ◽  
Satya Prakash Singh

In recent years, many information retrieval, character recognition, and feature extraction methodologies in Devanagari, and especially in Hindi, have been proposed for different domain areas. Due to the enormous availability of scanned data, and to provide an advanced improvement of existing Hindi automated systems beyond optical character recognition, a new idea of a Hindi printed and handwritten document classification system using support vector machines and fuzzy logic is introduced. This system first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article depicts a feasibility study of such systems with the relevance of Hindi, a survey report of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare many parameters and to estimate the contents, forms, and classifiers used in various existing techniques.

