On Document Representation and Term Weights in Text Classification

Author(s):  
Ying Liu

In automated text classification, a bag-of-words representation followed by tf-idf weighting is the most popular approach for converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level. The salient semantic information is searched for using a frequent word sequence method. Unlike the classic tf-idf weighting scheme, the proposed probability-based term weighting scheme directly reflects a term's strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme has shown a significant improvement over the classic approach, i.e., bag-of-words plus tf-idf, in terms of F-score. This study encourages us to further investigate the possibility of applying the semantically enriched document representation to a wide range of text-based mining tasks.
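
The classic bag-of-words plus tf-idf baseline mentioned above can be sketched in plain Python (a minimal illustration only; the tokenized toy documents and the log-based idf form are assumptions, not the chapter's exact setup):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents (bag-of-words)."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # raw term frequency scaled by inverse document frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["price", "drop", "stock"],
        ["stock", "market", "rally"],
        ["team", "wins", "match"]]
w = tfidf(docs)
```

Terms shared across documents (here "stock") receive lower idf than terms unique to one document, which is the property the probability-based scheme in the chapter aims to sharpen per category.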

Author(s):  
Ying Liu ◽  
Han Tong Loh ◽  
Wen Feng Lu

This chapter introduces an approach for deriving a taxonomy from documents using a novel document profile model that enriches document representations with semantic information systematically generated at the sentence level. A frequent word sequence method is proposed to search for the salient semantic information and has been integrated into the document profile model. An experimental study of taxonomy generation using hierarchical agglomerative clustering has shown a significant improvement in F-score based on the document profile model. A close examination reveals that the integration of semantic information makes a clear contribution compared to the classic bag-of-words approach. This study encourages us to further investigate the possibility of applying the document profile model to a wide range of text-based mining tasks.
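
The sentence-level frequent word sequence search can be illustrated with a minimal sketch (treating a "sequence" as a contiguous n-gram and counting support per sentence is an assumption here; the chapter's exact mining algorithm may differ):

```python
from collections import Counter

def frequent_sequences(sentences, min_support=2, max_len=3):
    """Return word sequences (contiguous n-grams, n >= 2) that occur
    in at least `min_support` distinct sentences."""
    counts = Counter()
    for sent in sentences:
        seen = set()  # count each sequence at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                seen.add(tuple(sent[i:i + n]))
        counts.update(seen)
    return {seq for seq, c in counts.items() if c >= min_support}

sentences = [
    ["text", "classification", "task"],
    ["a", "text", "classification", "method"],
    ["document", "clustering"],
]
seqs = frequent_sequences(sentences)
```

Only "text classification" clears the support threshold in this toy input; such surviving sequences are the kind of semantic units that would enrich the document profile beyond single words.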


2020 ◽  
Vol 14 (4) ◽  
pp. 101076 ◽  
Author(s):  
Turgut Dogan ◽  
Alper Kursat Uysal

2021 ◽  
Vol 168 ◽  
pp. 114438
Author(s):  
Long Chen ◽  
Liangxiao Jiang ◽  
Chaoqun Li

The term weighting scheme (TWS) is a key component of the matching mechanism when the vector space model is used for information retrieval (IR) from text documents. This paper describes a new term weighting approach intended to improve classification performance. We propose an effective term weighting scheme that achieves the highest accuracy compared with existing text classification methods. We compared the performance of the k-nearest neighbors (KNN) and naïve Bayes classifiers under different weighting methods, including information gain and SVM-based weighting, against the proposed method. Many term weighting methods (TWMs) were implemented on Amazon data collections in combination with information gain and the SVM, KNN, and naïve Bayes algorithms.
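
The information gain criterion used for term weighting above can be sketched as follows (a minimal sketch assuming a binary term-presence feature and set-of-words documents; the toy data is illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the presence/absence of `term` with respect to the class labels."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = (len(with_t) / n) * entropy(with_t) \
         + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - cond

docs = [{"ball", "goal"}, {"ball", "match"}, {"stock", "price"}, {"stock", "market"}]
labels = ["sport", "sport", "finance", "finance"]
ig_ball = information_gain(docs, labels, "ball")    # perfectly separates the classes
ig_match = information_gain(docs, labels, "match")  # weaker, class-partial term
```

Terms whose presence cleanly splits the classes ("ball" here) get the highest gain, which is why IG-weighted features tend to help discriminative classifiers such as SVM and KNN.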


2018 ◽  
Vol 110 ◽  
pp. 23-29 ◽  
Author(s):  
Guozhong Feng ◽  
Shaoting Li ◽  
Tieli Sun ◽  
Bangzuo Zhang

MATICS ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 30
Author(s):  
Syahroni Wahyu Iriananda ◽  
Muhammad Aziz Muslim ◽  
Harry Soekotjo Dachlan

Report handling on the "LAPOR!" system depends on the system administrator, who manually reads every incoming report [3]. Manual reading can lead to errors in handling complaints [4]; when the data flow is very large and grows rapidly, processing can take at least three days and is prone to inconsistencies [3]. In this study, the authors propose a model that can computationally measure and identify the similarity between a query (incoming report) and archived documents. The authors employ a class-based indexing term weighting scheme and cosine similarity to analyze document similarity. The CoSimTFIDF, CoSimTFICF, and CoSimTFIDFICF values are defined as feature sets for text classification using the k-nearest neighbors (K-NN) method. The best evaluation result, with stemming applied during preprocessing, is obtained with a 75% training / 25% test data split on the CoSimTFIDF feature, at 84%. The value k = 5 yields a high accuracy of 84.12%.
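
The cosine-similarity matching between a query report and archived documents can be sketched as follows (the sparse dict-of-weights vector representation and the toy weights are illustrative assumptions; in the study the weights come from the TFIDF/TFICF schemes):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# incoming report (query) vs. two archived reports
query = {"road": 1.0, "damage": 2.0}
archive = [{"road": 1.0, "damage": 1.0},   # overlapping topic
           {"water": 1.0, "supply": 1.0}]  # disjoint topic
sims = [cosine(query, d) for d in archive]
```

A K-NN classifier would then label the query by majority vote over the k archived reports with the highest similarity scores, e.g. k = 5 as in the reported best configuration.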

