Exploiting semantic resources for large scale text categorization

2012 ◽  
Vol 39 (3) ◽  
pp. 763-788 ◽  
Author(s):  
Jian Qiang Li ◽  
Yu Zhao ◽  
Bo Liu

Technometrics ◽  
2007 ◽  
Vol 49 (3) ◽  
pp. 291-304 ◽  
Author(s):  
Alexander Genkin ◽  
David D Lewis ◽  
David Madigan

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Lin Guo ◽  
Wanli Zuo ◽  
Tao Peng ◽  
Lin Yue

The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic, unsupervised method of text categorization in which tree-shaped structures represent semantic knowledge and implicit information is explored by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which largely enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm remarkably reduces the time and effort spent on training and classification, and that it outperforms established competitors in correctness and effectiveness.


2019 ◽  
Vol 10 (3) ◽  
pp. 17-32 ◽  
Author(s):  
Rujuan Wang ◽  
Gang Wang

In modern information technology, finding the information that users really need quickly, accurately, and comprehensively has become a focus of research. In this article, a feature selection method based on a complex network is proposed that targets the structure and content characteristics of large-scale web text. The preprocessed web text is converted into a complex network: the nodes correspond to the entries (terms) in the text, the edges correspond to the links between those entries, and the degree of the nodes and their aggregation coefficient are used for feature selection. Second, text classification is studied from the point of view of data sampling, and a classification method based on density statistics is proposed. This method uses not only the density information of the text feature set during classification, but also statistical merging criteria to obtain the difference information of each feature, which gives a better classification effect on large text collections.
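The network construction described in this abstract can be illustrated with a minimal sketch. The abstract does not say how links between entries are defined or how the two node measures are combined, so everything here is an assumption made for illustration: terms are linked when they co-occur within a small window, the aggregation measure is read as the standard local clustering coefficient, and the function names (`build_cooccurrence_graph`, `clustering_coefficient`, `rank_terms`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(tokens, window=1):
    """Build an undirected term graph (assumed link rule: co-occurrence).

    Each distinct term is a node; a term is linked to each of the
    `window` terms that follow it in the token sequence.
    """
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:  # ignore self-links
                graph[term].add(other)
                graph[other].add(term)
    return graph


def clustering_coefficient(graph, node):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


def rank_terms(tokens, window=1):
    """Rank terms by degree, then by clustering coefficient (assumed ordering)."""
    graph = build_cooccurrence_graph(tokens, window)
    scores = {
        term: (len(graph[term]), clustering_coefficient(graph, term))
        for term in list(graph)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1][0], -kv[1][1], kv[0]))


if __name__ == "__main__":
    sample = "text network text mining network analysis".split()
    for term, (degree, coefficient) in rank_terms(sample):
        print(term, degree, round(coefficient, 3))
```

Ranking by degree first and clustering coefficient second is only one plausible way to turn the two measures into a feature ordering; a weighted combination would fit the description equally well.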


2015 ◽  
pp. 269-292 ◽  
Author(s):  
Paweł Kędzia ◽  
Maciej Piasecki ◽  
Marlena Orlińska

Word Sense Disambiguation Based on Large Scale Polish CLARIN Heterogeneous Lexical Resources

Lexical resources can be applied in many different Natural Language Engineering tasks, but the most fundamental task is the recognition of word senses used in text contexts. The problem is difficult and not yet fully solved, and different lexical resources provide varied support for it. Polish CLARIN lexical semantic resources are based on plWordNet, a very large wordnet for Polish, as a central structure that serves as a basis for linking together several resources of different types. In this paper, several Word Sense Disambiguation (henceforth WSD) methods developed for Polish that utilise plWordNet are discussed. Textual sense descriptions in a traditional lexicon can be compared with text contexts using Lesk's algorithm in order to find the best matching senses. In the case of a wordnet, lexico-semantic relations provide the main description of word senses. Thus, first, we adapted and applied to Polish a WSD method based on PageRank. In this method, text words are mapped onto their senses in the plWordNet graph and the PageRank algorithm is run to find the senses with the highest scores. The method yields results lower than, but comparable to, those reported for English. The error analysis showed that the main problems are fine-grained sense distinctions in plWordNet and the limited number of connections between words of different parts of speech. In the second approach, plWordNet expanded with a mapping onto SUMO ontology concepts was used. Two scenarios for WSD were investigated: two-step disambiguation and disambiguation based on combined networks of plWordNet and SUMO. In the former scenario, words are first assigned SUMO concepts and then plWordNet senses are disambiguated. In the latter, plWordNet and SUMO are combined into one large network that is then used for the disambiguation of senses. The additional knowledge sources used in WSD improved the performance. The obtained results and potential further lines of development are discussed.
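The graph-based step in this abstract (map context words onto their senses, run PageRank over the sense graph, keep the highest-scoring sense per word) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not specify the PageRank variant, so a personalized PageRank that teleports to the candidate senses of the context words is assumed, and the English sense labels, `TOY_GRAPH`, `TOY_CANDIDATES`, and both function names are hypothetical stand-ins for plWordNet data.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration PageRank that teleports only to `seeds`.

    `graph` maps each node to a list of neighbours; `seeds` must be a
    non-empty collection of nodes present in `graph`.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # Dangling node: hand its mass back to the seed senses.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
                continue
            share = damping * rank[n] / len(neighbours)
            for m in neighbours:
                new_rank[m] += share
        rank = new_rank
    return rank


def disambiguate(candidates, graph):
    """Pick, for every context word, its candidate sense with the highest rank."""
    seeds = {sense for senses in candidates.values() for sense in senses}
    rank = personalized_pagerank(graph, seeds)
    return {
        word: max(senses, key=lambda sense: rank[sense])
        for word, senses in candidates.items()
    }


# Hypothetical sense graph: edges stand in for lexico-semantic relations.
TOY_GRAPH = {
    "bank#finance": ["money#currency", "deposit#payment"],
    "money#currency": ["bank#finance", "deposit#payment"],
    "deposit#payment": ["bank#finance", "money#currency"],
    "bank#river": ["water#liquid"],
    "water#liquid": ["bank#river"],
}

# Hypothetical context: the words "bank" and "money" with their candidate senses.
TOY_CANDIDATES = {
    "bank": ["bank#finance", "bank#river"],
    "money": ["money#currency"],
}

if __name__ == "__main__":
    print(disambiguate(TOY_CANDIDATES, TOY_GRAPH))
```

In this toy graph the financial sense of "bank" wins because it shares a densely connected neighbourhood with a sense of the other context word. The same mechanism suggests why sparse connections between parts of speech, which the abstract names as a main source of errors, would hurt the method.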


Author(s):  
DEJAN GJORGJEVIKJ ◽  
GJORGJI MADJAROV ◽  
SAŠO DŽEROSKI

Multi-label learning (MLL) problems abound in many areas, including text categorization, protein function classification, and semantic annotation of multimedia. An issue that severely limits the applicability of many current machine learning approaches to MLL is the large scale of these problems, which has a strong impact on the computational complexity of learning. The difficulty is especially pronounced for approaches that transform an MLL problem into a set of binary classification problems solved with Support Vector Machines (SVMs). On the other hand, the most efficient approaches to MLL, which are based on decision trees, have clearly lower predictive performance. We propose a hybrid decision tree architecture in which the leaves do not give multi-label predictions directly, but instead use local SVM-based classifiers to produce them. A binary relevance architecture is employed in the leaves, where a binary SVM classifier is built for each of the labels relevant to that particular leaf. We use a broad range of multi-label datasets and a variety of evaluation measures to evaluate the proposed method against related and state-of-the-art methods, in terms of both predictive performance and time complexity. On almost every large classification problem, our hybrid architecture outperforms the competing approaches in predictive performance, while its computational efficiency is significantly improved as a result of the integrated decision tree.


2017 ◽  
Vol 22 (3) ◽  
pp. 291-302 ◽  
Author(s):  
Zewen Xu ◽  
Jianqiang Li ◽  
Bo Liu ◽  
Jing Bi ◽  
Rong Li ◽  
...  
