Large Scale Text Categorization Based on Density Statistics Merging

Author(s):  
Rujuan Wang ◽  
Suhua Wang

Technometrics ◽
2007 ◽  
Vol 49 (3) ◽  
pp. 291-304 ◽  
Author(s):  
Alexander Genkin ◽  
David D Lewis ◽  
David Madigan

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Lin Guo ◽  
Wanli Zuo ◽  
Tao Peng ◽  
Lin Yue

The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic and unsupervised method of text categorization in which tree-shaped structures are used to represent semantic knowledge and to uncover implicit information by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which greatly enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm markedly reduces the time and effort spent on training and classification and outperforms established competitors in correctness and effectiveness.
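To make the idea of classifying by shared tree structures concrete, here is a minimal Python sketch, not the authors' implementation: documents are toy trees of semantic units, frequent parent-child patterns are mined per group as a simple stand-in for the paper's richer hidden-structure mining, and a new document is matched to the group whose patterns it shares most. Node, min_support, and the toy group labels are all illustrative assumptions.

# Hypothetical sketch, not the paper's algorithm: frequent parent-child patterns
# stand in for the mined hidden tree structures.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def edges(tree):
    """Yield parent-child label pairs of a document tree."""
    for child in tree.children:
        yield (tree.label, child.label)
        yield from edges(child)

def mine_frequent_patterns(trees, min_support=2):
    """Keep the parent-child patterns that occur in at least `min_support` trees."""
    support = Counter()
    for t in trees:
        support.update(set(edges(t)))
    return {p for p, c in support.items() if c >= min_support}

def match_group(tree, group_patterns):
    """Assign the group whose frequent patterns overlap most with the document tree."""
    doc = set(edges(tree))
    return max(group_patterns, key=lambda g: len(doc & group_patterns[g]))

# Toy usage: two groups of structurally similar documents.
sports = [Node("doc", [Node("team", [Node("player")]), Node("score")]) for _ in range(3)]
finance = [Node("doc", [Node("market", [Node("stock")]), Node("price")]) for _ in range(3)]
patterns = {"sports": mine_frequent_patterns(sports),
            "finance": mine_frequent_patterns(finance)}
print(match_group(Node("doc", [Node("team", [Node("player")])]), patterns))  # -> sports

A full frequent-subtree miner would also consider larger embedded subtrees, which is what allows both direct and indirect semantic relations to surface; the sketch above only shows the matching principle.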


2019 ◽  
Vol 10 (3) ◽  
pp. 17-32 ◽  
Author(s):  
Rujuan Wang ◽  
Gang Wang

In modern information technology, finding the information that users really need quickly, accurately, and comprehensively has become a central research focus. This article proposes a feature selection method based on complex networks that is tailored to the structural and content characteristics of large-scale web text. The preprocessed web text is converted into a complex network in which the nodes correspond to terms in the text and the edges correspond to links between those terms; node degree and the clustering coefficient are then used to select features. Second, text classification is studied from the point of view of data sampling, and a classification method based on density statistics is proposed. During classification, this method uses not only the density information of the text feature set but also statistical merging criteria to capture the difference information of each feature, which yields a better classification effect on large text collections.
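As an illustration of the network-based feature selection step only (the density statistics classifier is not shown), the following Python sketch builds a term co-occurrence network from tokenized documents and ranks terms by a weighted combination of node degree and clustering coefficient. The use of networkx, the window size, and the weight alpha are assumptions for the sketch, not the paper's code.

# Illustrative sketch only; networkx, window, and alpha are assumptions.
import networkx as nx

def build_term_network(tokenized_docs, window=3):
    """Link terms that co-occur within a sliding window inside each document."""
    g = nx.Graph()
    for tokens in tokenized_docs:
        for i, term in enumerate(tokens):
            for other in tokens[i + 1:i + window]:
                if term != other:
                    g.add_edge(term, other)
    return g

def select_features(g, top_k=10, alpha=0.5):
    """Score terms by a weighted mix of normalized degree and clustering coefficient."""
    degree = nx.degree_centrality(g)
    clustering = nx.clustering(g)
    score = {t: alpha * degree[t] + (1 - alpha) * clustering[t] for t in g.nodes}
    return sorted(score, key=score.get, reverse=True)[:top_k]

docs = [["web", "text", "classification", "feature", "selection"],
        ["complex", "network", "feature", "selection", "text"],
        ["density", "statistics", "text", "classification"]]
print(select_features(build_term_network(docs), top_k=5))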


Author(s):  
Dejan Gjorgjevikj ◽
Gjorgji Madjarov ◽
Sašo Džeroski

Multi-label learning (MLL) problems abound in many areas, including text categorization, protein function classification, and semantic annotation of multimedia. An issue that severely limits the applicability of many current machine learning approaches to MLL is the large scale of these problems, which has a strong impact on the computational complexity of learning. This is especially pronounced for approaches that transform MLL problems into a set of binary classification problems solved with Support Vector Machines (SVMs). On the other hand, the most efficient approaches to MLL, based on decision trees, have clearly lower predictive performance. We propose a hybrid decision tree architecture in which the leaves do not give multi-label predictions directly but instead use local SVM-based classifiers to do so. A binary relevance architecture is employed in the leaves, where a binary SVM classifier is built for each of the labels relevant to that particular leaf. We use a broad range of multi-label datasets with a variety of evaluation measures to compare the proposed method against related and state-of-the-art methods, both in terms of predictive performance and time complexity. On almost every large classification problem, our hybrid architecture outperforms the competing approaches in predictive performance, while the integrated decision tree significantly improves its computational efficiency.
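A minimal sketch of the hybrid architecture, under stated assumptions rather than the authors' implementation: scikit-learn's DecisionTreeClassifier partitions the instance space (the paper grows a multi-label tree; here a single-label proxy target is used purely for illustration), and each leaf stores one binary LinearSVC per label in binary relevance fashion, trained only on the instances routed to that leaf. HybridTreeBR and its parameters are hypothetical names.

# Minimal sketch under stated assumptions, not the authors' code.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

class HybridTreeBR:
    def __init__(self, max_depth=3):
        self.tree = DecisionTreeClassifier(max_depth=max_depth)
        self.leaf_models = {}  # leaf id -> {label index: fitted SVM or constant prediction}

    def fit(self, X, Y):
        self.n_labels = Y.shape[1]
        # Grow the tree on a proxy single-label target (most frequent relevant label).
        self.tree.fit(X, Y.argmax(axis=1))
        leaves = self.tree.apply(X)
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            models = {}
            for j in range(self.n_labels):
                y = Y[idx, j]
                # Binary relevance in the leaf: one linear SVM per label that varies here.
                models[j] = LinearSVC().fit(X[idx], y) if len(np.unique(y)) > 1 else int(y[0])
            self.leaf_models[leaf] = models
        return self

    def predict(self, X):
        leaves = self.tree.apply(X)
        out = np.zeros((len(X), self.n_labels), dtype=int)
        for i, leaf in enumerate(leaves):
            for j, m in self.leaf_models[leaf].items():
                out[i, j] = m if isinstance(m, int) else int(m.predict(X[i:i + 1])[0])
        return out

Because the relevant labels and training instances are determined per leaf, each local SVM sees a smaller and more homogeneous binary problem, which is where the efficiency gain of the integrated decision tree comes from.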


2017 ◽  
Vol 22 (3) ◽  
pp. 291-302 ◽  
Author(s):  
Zewen Xu ◽  
Jianqiang Li ◽  
Bo Liu ◽  
Jing Bi ◽  
Rong Li ◽  
...  

2011 ◽  
Vol 32 (2) ◽  
pp. 101-106 ◽  
Author(s):  
Sujeevan Aseervatham ◽  
Anestis Antoniadis ◽  
Eric Gaussier ◽  
Michel Burlet ◽  
Yves Denneulin
