Selecting effective index terms using a decision tree

2002 ◽  
Vol 8 (2-3) ◽  
pp. 193-207 ◽  
Author(s):  
Tokunaga Takenobu ◽  
Kimura Kenji ◽  
Ogibayashi Hironori ◽  
Tanaka Hozumi

This paper explores the effectiveness of index terms more complex than the single words used in conventional information retrieval systems. Retrieval is done in two phases: in the first, a conventional retrieval method (the Okapi system) is used; in the second, complex index terms such as syntactic relations and single words with part-of-speech information are introduced to rerank the results of the first phase. We evaluated the effectiveness of the different types of index terms through experiments using the TREC-7 test collection and 50 queries. The retrieval effectiveness was improved for 32 out of 50 queries. Based on this investigation, we then introduce a method to select effective index terms by using a decision tree. Further experiments with the same test collection showed that retrieval effectiveness was improved in 25 of the 50 queries.
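
As a rough illustration of the second phase described above, the following Python sketch trains a decision tree to predict whether a candidate complex index term will help reranking. It is not the authors' code: the features (document frequency in the top-ranked list, idf, a syntactic-relation flag) and the toy training data are assumptions made purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row describes a candidate complex index term:
# [doc frequency in the top-ranked list, idf, is_syntactic_relation]
X_train = [
    [12, 2.1, 1],
    [ 3, 5.0, 1],
    [40, 0.8, 0],
    [ 7, 3.2, 0],
]
y_train = [1, 1, 0, 0]  # 1 = the term improved retrieval effectiveness

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Keep only the candidates the tree predicts to be effective, then use them
# to rerank the first-phase (Okapi) result list.
candidates = {
    "retrieval+system (syntactic relation)": [10, 2.5, 1],
    "system/NOUN": [35, 0.9, 0],
}
selected = [term for term, feats in candidates.items()
            if tree.predict([feats])[0] == 1]
print("index terms used for reranking:", selected)
```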

2014 ◽  
Vol 6 (1) ◽  
pp. 9-14
Author(s):  
Stefanie Sirapanji ◽  
Seng Hansun

Beauty is a precious asset for everyone, and everyone wants a healthy face. Unfortunately, problems such as acne, freckles, wrinkles, dullness, and oily or dry skin appear on their own. Many beauty clinics are therefore available to help people solve these problems, but not everyone can enjoy their facilities, for example people living in the suburbs; the uneven distribution of doctors and the high cost of treatments are among the reasons. In this research, a system is built that helps patients find solutions to their facial problems. The decision tree method is used to make decisions based on the presented scheme. In the system's experiments, the average accuracy reached 100%. Index Terms–Acne, Decision Tree, Dry Skin, Dullness, Facial Problems, Freckles, Wrinkles, Oily Skin, Expert System.
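
As a minimal sketch of how a decision-tree expert system of this kind can be structured, the following Python example encodes a tiny hand-built tree that maps reported symptoms to one of the facial problems listed above. The questions, their order, and the resulting diagnoses are assumptions for illustration only, not the rules of the actual system.

```python
def diagnose(inflamed_spots: bool, brown_spots: bool,
             shiny_skin: bool, flaky_skin: bool) -> str:
    """Walk a tiny hand-built decision tree over yes/no symptom answers."""
    if inflamed_spots:
        return "acne"
    if brown_spots:
        return "freckles"
    if shiny_skin:
        return "oily skin"
    if flaky_skin:
        return "dry skin"
    return "dull skin"

# Example consultation: shiny skin only -> "oily skin"
print(diagnose(inflamed_spots=False, brown_spots=False,
               shiny_skin=True, flaky_skin=False))
```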


1985 ◽  
Vol 8 (2) ◽  
pp. 253-267
Author(s):  
S.K.M. Wong ◽  
Wojciech Ziarko

In information retrieval, it is common to model index terms and documents as vectors in a suitably defined vector space. The main difficulty with this approach is that the explicit representation of term vectors is not known a priori. For this reason, the vector space model adopted by Salton for the SMART system treats the terms as a set of orthogonal vectors. In such a model it is often necessary to adopt a separate, corrective procedure to take into account the correlations between terms. In this paper, we propose a systematic method (the generalized vector space model) to compute term correlations directly from the automatic indexing scheme. We also demonstrate how such correlations can be incorporated, with minimal modification, into existing vector-based information retrieval systems.
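
The following numpy sketch illustrates the core idea: a term correlation matrix is computed directly from the term-document (indexing) matrix rather than assumed to be the identity, and it is then inserted into the usual inner-product scoring. The matrix values are toy data, and the formulation is a simplified illustration rather than the paper's exact construction.

```python
import numpy as np

# Rows = index terms, columns = documents (e.g. tf weights from automatic indexing)
D = np.array([
    [2, 0, 1, 0],   # term "retrieval"
    [1, 0, 2, 0],   # term "index"
    [0, 3, 0, 1],   # term "clinic"
], dtype=float)

G = D @ D.T                     # term-term correlation (Gram) matrix
q = np.array([1.0, 0.0, 0.0])   # query mentions only "retrieval"

scores = q @ G @ D              # document scores with term correlations;
print(scores)                   # with G = I this reduces to the orthogonal model
```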


2021 ◽  
Vol 20 (2) ◽  
pp. 20-31
Author(s):  
István Pozsgai

The aim of this work is to examine the system of syntactic relations between cardinal numerals and the words that belong to them in the Kievo-Pecherskiy Paterikon, which was compiled in the 13th century. The manuscript on which the published text is based was copied in the late 15th or early 16th century. I mainly search for phenomena that can give information about the conditions of the genesis and development of the numerals as a new, independent part of speech, paying particular attention to phenomena that can be connected with the unification of the several types of syntactic relations between cardinal numerals and their associated words. All quantitative constructions are examined except for constructions containing the numeral 1 as a prime numeral. The quantitative constructions found are grouped according to the type of combination of cardinal numerals with nominals or participles. Particular attention is paid to combinations of cardinal numerals with associated words that differ from the norms of other monuments, such as Old Church Slavonic monuments of the Russian recension, Old Russian monuments, and early Russian Church Slavonic monuments, since it is precisely these phenomena that can indicate the process by which cardinal numerals acquired common morphological and syntactic properties. On the basis of the quantitative constructions that do not correspond to the above-mentioned norms, three important grammatical phenomena are distinguished that can indicate the process of replacing old norms with new ones. For contrast, data from other manuscripts are also presented.


Author(s):  
Tomoki Takada ◽  
Mizuki Arai ◽  
Tomohiro Takagi

Nowadays, an increasingly large amount of information exists on the web, so a method is needed that enables users to find the information they need quickly, which is becoming increasingly difficult. To solve this problem, information retrieval systems like Google and recommendation systems like Amazon's are used. In this paper, we focus on information retrieval systems. These retrieval systems require index terms, which affect the precision of retrieval. Index terms are generally decided in one of two ways. One is to analyze a text using natural language processing and decide index terms using various statistics. The other is to have a person choose document keywords as index terms. However, the latter method requires too much time and effort and becomes more impractical as the amount of information grows. We therefore propose the Nikkei annotator system, which is based on a model of the human brain, learns patterns of past keyword annotation, and automatically outputs the keywords that users prefer. The purposes of the proposed method are to automate manual keyword annotation and to achieve high-speed, high-accuracy keyword annotation. Experimental results showed that the proposed method is more accurate than TF-IDF and Naive Bayes in P@5 and P@10, and that it can annotate about 19 times faster than Naive Bayes.
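
As a small illustration of the evaluation measure mentioned above, the following helper (not from the paper) computes precision at k for a ranked list of suggested keywords against a gold annotation set; the example keywords are invented.

```python
def precision_at_k(ranked_keywords, gold_keywords, k):
    """Fraction of the top-k suggested keywords that appear in the gold set."""
    top_k = ranked_keywords[:k]
    hits = sum(1 for kw in top_k if kw in gold_keywords)
    return hits / k

ranked = ["economy", "bank", "yen", "policy", "export", "sports"]
gold = {"economy", "yen", "export", "trade"}
print(precision_at_k(ranked, gold, 5))   # 3 of the top 5 are correct -> 0.6
```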


2017 ◽  
pp. 35-46 ◽  
Author(s):  
Irene Doval

This paper reviews the author’s experiences of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, consisting mostly of German and Spanish fictional texts. This is part of an ongoing process of annotating the corpus with part-of-speech information. The study discusses the specific problems encountered so far: on the one hand, tagger performance degrades significantly on fictional data, and on the other, pre-existing annotation schemes are all language-specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified the major error patterns.
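
A minimal sketch of the common-tagset idea follows: language-specific tags (German STTS-style and Spanish EAGLES-style in this example) are mapped onto shared coarse categories during post-editing. The particular mapping entries are assumptions chosen for illustration and are not the author's actual tagset.

```python
COMMON_TAGSET = {
    # German (STTS-style)                  # Spanish (EAGLES-style)
    "NN": "NOUN", "NE": "PROPN",           "NC": "NOUN", "NP": "PROPN",
    "VVFIN": "VERB", "ADJA": "ADJ",        "VM": "VERB", "AQ": "ADJ",
}

def to_common(tag: str) -> str:
    """Map a language-specific tag to the common tagset; 'X' flags manual review."""
    return COMMON_TAGSET.get(tag, "X")

print(to_common("VVFIN"), to_common("AQ"), to_common("XYZ"))   # VERB ADJ X
```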


2020 ◽  
Vol 11 (1) ◽  
pp. 1-8
Author(s):  
Aqib Ali ◽  
Jamal Abdul Nasir ◽  
Muhammad Munawar Ahmed ◽  
Samreen Naeem ◽  
Sania Anam ◽  
...  

Background: Humans convey many emotions during a conversation, and facial expressions carry information about those emotions. Objectives: This study proposes a Machine Learning (ML) approach to emotion recognition based on a statistical analysis of facial expressions in digital images. Methodology: A dataset of 600 digital images divided into 6 classes (Anger, Happy, Fear, Surprise, Sad, and Normal) was collected from the publicly available Taiwan Facial Expression Images Database. In the first step, all images are converted to grayscale and 4 Regions of Interest (ROIs) are created on each image, so the dataset is divided into 2,400 (600 × 4) sub-images. In the second step, 3 types of statistical features, namely texture, histogram, and binary features, are extracted from each ROI. The third step is statistical feature optimization using the best-first search algorithm. Lastly, the optimized statistical feature dataset is fed to various ML classifiers. Results: The analysis was divided into two phases. First, boosting-based ML classifiers (LogitBoost, AdaBoostM1, and Stacking) obtained 94.11%, 92.15%, and 89.21% accuracy, respectively. Second, decision tree algorithms (J48, Random Forest, and Random Committee) obtained 97.05%, 93.14%, and 92.15% accuracy, respectively. Conclusion: The decision-tree-based J48 classifier gave the best result, with 97.05% classification accuracy.
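
A rough Python sketch of this pipeline is shown below, with synthetic data standing in for the face images. The quadrant ROI layout, the simple histogram statistics, and the use of a random forest in place of J48 are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def roi_features(img):
    """Split a grayscale image into 4 quadrant ROIs and take simple histogram stats."""
    h, w = img.shape
    rois = [img[:h//2, :w//2], img[:h//2, w//2:],
            img[h//2:, :w//2], img[h//2:, w//2:]]
    feats = []
    for roi in rois:
        feats += [roi.mean(), roi.std(), np.median(roi)]
    return feats

# Synthetic stand-in dataset: 60 fake 64x64 grayscale images, 6 emotion classes
images = rng.integers(0, 256, size=(60, 64, 64))
labels = rng.integers(0, 6, size=60)

X = np.array([roi_features(img) for img in images])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print("training accuracy on synthetic data:", clf.score(X, labels))
```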


2016 ◽  
Vol 105 (1) ◽  
pp. 63-76
Author(s):  
Theresa Guinard

Abstract Morphological analysis (finding the component morphemes of a word and tagging morphemes with part-of-speech information) is a useful preprocessing step in many natural language processing applications, especially for synthetic languages. Compound words from the constructed language Esperanto are formed by straightforward agglutination, but for many words, there is more than one possible sequence of component morphemes. However, one segmentation is usually more semantically probable than the others. This paper presents a modified n-gram Markov model that finds the most probable segmentation of any Esperanto word, where the model’s states represent morpheme part-of-speech and semantic classes. The overall segmentation accuracy was over 98% for a set of presegmented dictionary words.
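
The following simplified Python sketch (not the paper's model) shows the general mechanism: a bigram Markov model over morpheme classes combined with dynamic programming to pick the most probable segmentation of a word. The toy lexicon and log probabilities are invented for illustration.

```python
LEXICON = {                 # morpheme -> part-of-speech / semantic class
    "san": "ADJ_ROOT",      # 'healthy'
    "ig": "CAUSATIVE",      # causative suffix
    "o": "NOUN_END",        # noun ending
    "a": "ADJ_END",         # adjective ending
    "n": "ACCUSATIVE",      # accusative ending
}
TRANS = {                   # log P(class2 | class1), toy values
    ("START", "ADJ_ROOT"): -0.5,
    ("ADJ_ROOT", "CAUSATIVE"): -0.7,
    ("ADJ_ROOT", "ADJ_END"): -0.9,
    ("CAUSATIVE", "NOUN_END"): -0.3,
    ("NOUN_END", "ACCUSATIVE"): -0.4,
}

def best_segmentation(word):
    # best[i]: (log prob, morpheme list, class of last morpheme) for word[:i]
    best = {0: (0.0, [], "START")}
    for i in range(1, len(word) + 1):
        for j in range(i):
            morph = word[j:i]
            if j not in best or morph not in LEXICON:
                continue
            prob, segs, prev = best[j]
            step = TRANS.get((prev, LEXICON[morph]))
            if step is None:            # disallowed class transition
                continue
            cand = (prob + step, segs + [morph], LEXICON[morph])
            if i not in best or cand[0] > best[i][0]:
                best[i] = cand
    return best.get(len(word), (None, None, None))[1]

print(best_segmentation("sanigon"))   # ['san', 'ig', 'o', 'n']
print(best_segmentation("sana"))      # ['san', 'a']
```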


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Xiaoqiang Chi ◽  
Yang Xiang

Paraphrase generation is an essential yet challenging task in natural language processing. Neural-network-based approaches towards paraphrase generation have achieved remarkable success in recent years. Previous neural paraphrase generation approaches ignore linguistic knowledge such as part-of-speech information, regardless of its availability. The underlying assumption is that neural nets could learn such information implicitly when given sufficient data. However, it would be difficult for neural nets to learn such information properly when data are scarce. In this work, we probe the efficacy of explicit part-of-speech information for the task of paraphrase generation in low-resource scenarios. To this end, we devise three mechanisms to fuse part-of-speech information under the framework of sequence-to-sequence learning. We demonstrate the utility of part-of-speech information in low-resource paraphrase generation through extensive experiments on multiple datasets of varying sizes and genres.
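
One possible fusion mechanism can be sketched as follows: a POS-tag embedding is concatenated with the word embedding at the encoder input of a sequence-to-sequence model. This PyTorch sketch illustrates the general idea only; the dimensions and architecture are assumptions and do not reproduce the paper's three mechanisms.

```python
import torch
import torch.nn as nn

class PosFusedEncoder(nn.Module):
    def __init__(self, vocab_size=1000, pos_size=20, word_dim=128, pos_dim=16, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_size, pos_dim)
        # Encoder RNN consumes the concatenated word + POS representation
        self.rnn = nn.GRU(word_dim + pos_dim, hidden, batch_first=True)

    def forward(self, word_ids, pos_ids):
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        return self.rnn(x)   # (encoder outputs, final hidden state) for the decoder

enc = PosFusedEncoder()
words = torch.randint(0, 1000, (2, 7))   # batch of 2 sentences, 7 tokens each
pos = torch.randint(0, 20, (2, 7))       # corresponding POS-tag ids
outputs, h = enc(words, pos)
print(outputs.shape)                      # torch.Size([2, 7, 256])
```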

