RISING OF THE TEXT DOCUMENTS SEARCH PRECISION BY USING THE ADAPTIVE ONTOLOGY

2014 ◽  
pp. 51-58
Author(s):  
Romana Darevych

Conceptual graphs are an effective tool for representing both the semantic content of text documents and domain ontologies. This article proposes a new method for evaluating the content similarity of text documents. The method represents the compared texts as weighted conceptual graphs, supplemented with related context from the domain ontology, and estimates the distance between the semantic weight centers of these graphs. It is shown that the method satisfies the axioms of a metric. Procedures for automatically tuning the ontology to a specified domain and to the information needs of the user are also developed. Experimental results show that taking into account the semantics of the concepts used, the assertions, and the significance coefficients from the adaptive ontology during text processing raises search precision by 20% on average.
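A minimal sketch of the distance computation the abstract describes. It assumes each concept in a graph carries a weight and coordinates in a shared concept space; the concepts, weights, and coordinates below are illustrative, not taken from the paper. The distance between two documents is then the Euclidean distance between the weighted centers of their graphs, which is symmetric and zero for identical graphs, consistent with the metric axioms the paper mentions.

```python
import math

def semantic_center(graph):
    """Weighted mean of concept coordinates; graph maps concept -> (weight, coords)."""
    total = sum(w for w, _ in graph.values())
    dim = len(next(iter(graph.values()))[1])
    center = [0.0] * dim
    for w, coords in graph.values():
        for i, c in enumerate(coords):
            center[i] += w * c / total
    return center

def center_distance(g1, g2):
    """Euclidean distance between the semantic weight centers of two graphs."""
    return math.dist(semantic_center(g1), semantic_center(g2))

# Hypothetical two-concept documents in a 2-D concept space.
doc_a = {"ontology": (2.0, (1.0, 0.0)), "search": (1.0, (0.0, 1.0))}
doc_b = {"ontology": (1.0, (1.0, 0.0)), "index":  (1.0, (0.0, 1.0))}
```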

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 275
Author(s):  
Igor A. Bessmertny ◽  
Xiaoxi Huang ◽  
Aleksei V. Platonov ◽  
Chuqiao Yu ◽  
Julia A. Koroleva

Search engines are able to find documents containing patterns from a query. This approach works for alphabetic languages such as English. Chinese, however, is highly dependent on context. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other action. Algorithms for Chinese text segmentation should consider context; that is, the word segmentation process depends on the surrounding ideograms. As the existing segmentation algorithms are imperfect, we have considered an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to rank Chinese text documents by their relevancy to the query. In particular, the approach uses Bell's test, which measures the quantum entanglement of two words within the context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments conducted in three domains demonstrated that the proposed approach provides acceptable results.
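The HAL step the abstract mentions can be sketched as a directed co-occurrence matrix in which nearer tokens receive higher weights. This is a simplified illustration of HAL only (the weighting scheme shown, window minus distance plus one, is one common variant); the Bell's-test ranking itself is beyond this sketch, and the example tokens are invented.

```python
from collections import defaultdict

def hal_matrix(tokens, window=3):
    """HAL-style matrix: weight of (w, following word) decreases with distance."""
    m = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                m[(w, tokens[i + d])] += window - d + 1
    return m

# Pre-segmented toy token sequence (segmentation itself is the hard step).
toks = ["搜索", "引擎", "查询", "搜索", "文档"]
M = hal_matrix(toks, window=2)
```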


2020 ◽  
Vol 38 (02) ◽  
Author(s):  
TẠ DUY CÔNG CHIẾN

Question answering systems have been applied to many different fields in recent years, such as education, business, and surveys. The purpose of these systems is to automatically answer users' questions or queries about some problem. This paper introduces a question answering system built on a domain-specific ontology. This ontology, which contains the data and vocabulary related to the computing domain, is built from text documents of the ACM Digital Library. Consequently, the system only answers questions pertaining to information technology domains such as databases, networks, machine learning, etc. We use natural language processing methodologies together with the domain ontology to build this system. To increase performance, we store the computing ontology in a graph database and use a NoSQL database for querying it.
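A toy illustration of the lookup pattern such a system relies on, with a plain dictionary standing in for the graph database; the node names, relations, and `answer` helper are all hypothetical, not the paper's schema or API.

```python
# In-memory stand-in for the graph database holding the computing ontology.
# Concepts map to relation -> list of related concepts.
ontology = {
    "machine_learning": {"is_a": ["computing"], "includes": ["supervised_learning"]},
    "database": {"is_a": ["computing"], "includes": ["relational_database"]},
}

def answer(concept, relation):
    """Mimic a graph-database traversal: follow one relation from a concept."""
    return ontology.get(concept, {}).get(relation, [])
```

In a real deployment the dictionary lookup would be replaced by a query against the graph store, but the traversal shape is the same.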


2021 ◽  
Vol 12 (3) ◽  
pp. 1483-1491
Author(s):  
Syopiansyah Jaya Putra et al.

Text categorization plays an important role in clustering the rapidly growing, yet unstructured, Indonesian text in digital format. It is deemed even more important now that access to digital-format text has become more necessary and widespread. There are many clustering algorithms used for text categorization. Unfortunately, they cannot easily cluster Indonesian texts because of the imperfect stemming and stopword removal for the language. This paper presents an intelligent system that categorizes Indonesian text documents into meaningful cluster labels. The Label Induction Grouping Algorithm (LINGO) and Bisecting K-means are applied in five phases: pre-processing, frequent phrase extraction, cluster label induction, content discovery, and final cluster formation. Experimental results showed that the system could categorize Indonesian text, reaching 93%. Furthermore, clustering quality evaluation indicates that categorization with LINGO has high precision and recall, with values of 0.85 and 1 respectively, compared to Bisecting K-means with 0.78 and 0.99. The results therefore show that LINGO is suitable for categorizing Indonesian text. The main contribution of this study is optimizing the clustering results by applying and maximizing text processing with an Indonesian stemmer and stopword list.
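As a sketch of the baseline algorithm the paper compares against, here is a minimal bisecting k-means: repeatedly split the largest cluster with a plain 2-means (Lloyd) step. The data points and the largest-cluster split rule are illustrative assumptions, not the paper's configuration.

```python
import random

def two_means(points, iters=20, seed=0):
    """Plain 2-means (Lloyd's algorithm) on a list of coordinate tuples."""
    rng = random.Random(seed)
    c = rng.sample(points, 2)          # two distinct points as initial centroids
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, c[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, c[1]))
            groups[0 if d0 <= d1 else 1].append(p)
        c = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c[i]
             for i, g in enumerate(groups)]
    return groups

def bisecting_kmeans(points, k):
    """Split the largest cluster in two until k clusters remain."""
    clusters = [points]
    while len(clusters) < k:
        big = max(clusters, key=len)
        clusters.remove(big)
        clusters.extend(two_means(big))
    return clusters

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
out = bisecting_kmeans(data, 3)
```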


2011 ◽  
Vol 7 (3) ◽  
pp. 11-26 ◽  
Author(s):  
Ulrich Reimer ◽  
Edith Maier ◽  
Stephan Streit ◽  
Thomas Diggelmann ◽  
Manfred Hoffleisch

The paper introduces a web-based eHealth platform currently being developed that will assist patients with certain chronic diseases. The ultimate aim is behavioral change. This is supported by online assessment and feedback which visualizes actual behavior in relation to target behavior. Disease-specific information is provided through an information portal that utilizes lightweight ontologies (associative networks) in combination with text mining. The paper argues that classical word-based information retrieval is often not sufficient for providing patients with relevant information, but that their information needs are better addressed by concept-based retrieval. The focus of the paper is on the semantic retrieval component and the learning of a lightweight ontology from text documents, which is achieved by using a biologically inspired neural network. The paper concludes with preliminary results of the evaluation of the proposed approach in comparison with traditional approaches.
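Concept-based retrieval over an associative network is often realized by spreading activation from the query concepts to their neighbours; the following is a minimal sketch under that assumption (the network, the decay factor, and the health-related concept names are invented for illustration, and this is not the paper's biologically inspired neural network).

```python
def spread_activation(network, seeds, decay=0.5, steps=2):
    """Propagate activation from seed concepts; network maps concept -> neighbours."""
    act = {s: 1.0 for s in seeds}
    frontier = dict(act)
    for _ in range(steps):
        nxt = {}
        for node, a in frontier.items():
            for nb in network.get(node, []):
                nxt[nb] = max(nxt.get(nb, 0.0), a * decay)  # decay per hop
        for n, a in nxt.items():
            if a > act.get(n, 0.0):
                act[n] = a
        frontier = nxt
    return act

net = {"diabetes": ["insulin", "diet"], "diet": ["exercise"]}
act = spread_activation(net, ["diabetes"])
```

Documents indexed under "exercise" would then match a "diabetes" query with a lower but nonzero score, which is the advantage over word-based retrieval the paper argues for.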


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Rapeeporn Chamchong ◽  
Chun Che Fung

Challenges for text processing in ancient document images are mainly due to the high degree of variation in foreground and background. Image binarization is an image segmentation technique used to separate the image into text and background components. Although several techniques for binarizing text documents have been proposed, their performance varies and depends on the image characteristics. Therefore, selecting among binarization techniques can be key to achieving improved results. This paper proposes a framework for selecting binarization techniques for palm leaf manuscripts using Support Vector Machines (SVMs). The overall process is divided into three steps: (i) feature extraction: feature patterns are extracted from grayscale images based on global intensity, local contrast, and intensity; (ii) treatment of imbalanced data: the imbalanced dataset is balanced using the Synthetic Minority Oversampling Technique to improve prediction performance; and (iii) selection: SVM is applied to select the appropriate binarization technique. The proposed framework has been evaluated on palm leaf manuscript images and on a benchmarking dataset from the DIBCO series, comparing prediction performance between the imbalanced and balanced datasets. Experimental results showed that the proposed framework can be used as an integral part of an automatic selection process.
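For context on what the candidate techniques look like, here is one classic global binarization method, Otsu's thresholding, which picks the threshold maximizing between-class variance. This is a generic illustration of a binarization candidate, not one of the specific techniques the paper's SVM selects among; the pixel data is synthetic.

```python
def otsu_threshold(pixels):
    """Return the gray level maximizing between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_b, sum_b = 0, -1.0, 0, 0.0
    for t in range(256):
        w_b += hist[t]                       # background weight up to t
        if w_b == 0:
            continue
        w_f = total - w_b                    # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b                    # background mean
        m_f = (total_sum - sum_b) / w_f      # foreground mean
        var = w_b * w_f * (m_b - m_f) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

img = [10] * 50 + [200] * 50                 # dark text, bright background
t = otsu_threshold(img)
binary = [0 if p <= t else 255 for p in img]
```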


2013 ◽  
Vol 336-338 ◽  
pp. 2217-2220
Author(s):  
Cai Yun Xie ◽  
Xiao Rong Hu

This paper proposes a classification algorithm for news pages based on a domain ontology. To address the shortcoming of current classification algorithms, which consider only content similarity, it presents a semantic classification method that considers both content similarity and structural correlation. First, the algorithm parses the ontology to obtain an ontology category vector, extracts keywords from the news page texts, and reduces the semantic dimensionality; the vocabulary shared between the page texts and the ontology category vector then constitutes the text expectation vector, and the content similarity between the ontology category vector and the text expectation vector is calculated using cosine similarity. Second, the common vocabulary is mapped onto the ontology hierarchy chart, and the structural relevancy is obtained by calculating a weighted path over this directed acyclic graph. Finally, the algorithm combines both measures to calculate the correlation degree between a news page and the ontology, and determines the category of the news page by comparing the result with an initial threshold value.
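The content-similarity step reduces to the standard cosine similarity between two term-weight vectors; a minimal sketch follows, with invented three-dimensional vectors standing in for the ontology category vector and the text expectation vector.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

category = [1.0, 2.0, 0.0]   # hypothetical ontology category vector
text_vec = [2.0, 4.0, 0.0]   # hypothetical text expectation vector
sim = cosine_similarity(category, text_vec)
```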


2020 ◽  
Vol 24 ◽  
pp. 3-14
Author(s):  
Adrian Fonseca Bruzón ◽  
Aurelio López-López ◽  
José E. Medina Pagola

Humans tend to organize information in documents in a logical and intentional way. This organization, which we call textual structure, is commonly expressed in sections, chapters, paragraphs, or sentences. It facilitates the understanding of the content we want to transmit to readers. However, this structure, in which we usually encode the semantic content of information, is rarely exploited by filtering methods when constructing a user profile. In this work, we propose using term relations at different context levels to enhance document filtering. We propose methods for obtaining this representation that account for the imbalance between the documents that satisfy users' information needs, as well as for the cold-start problem (having scarce information) during the initial construction of the user profile. The experiments carried out allowed us to assess the impact of the proposed representation on the filtering task in terms of the T11SU measure.
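One simple way to realize "term relations at different context levels" is to count term co-occurrence separately within sentences and within paragraphs; the helper and toy units below are illustrative assumptions, not the paper's method.

```python
from itertools import combinations
from collections import Counter

def pair_counts(units):
    """Count co-occurring term pairs within each text unit (sentence, paragraph...)."""
    c = Counter()
    for terms in units:
        for a, b in combinations(sorted(set(terms)), 2):
            c[(a, b)] += 1
    return c

# Same document viewed at two context levels.
sent_units = [["user", "profile"], ["profile", "filter"]]
para_units = [["user", "profile", "filter"]]
sc = pair_counts(sent_units)   # sentence-level relations
pc = pair_counts(para_units)   # paragraph-level relations
```

Note that ("filter", "user") is related at the paragraph level but not at the sentence level, which is exactly the extra signal the multi-level representation captures.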

