On Document Representation and Term Weights in Text Classification

Author(s):  
Ying Liu

In automated text classification, a bag-of-words representation followed by tf-idf weighting is the most popular approach for converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level. The salient semantic information is searched for using a frequent word sequence method. Unlike the classic tf-idf weighting scheme, the proposed probability-based term weighting scheme directly reflects a term's strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme has shown a significant improvement over the classic approach, i.e., bag-of-words plus tf-idf, in terms of F-score. This study encourages us to further investigate the possibility of applying the semantically enriched document representation to a wide range of text-based mining tasks.
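
The classic bag-of-words plus tf-idf baseline mentioned above can be sketched in plain Python (a minimal illustration only; the tokenized toy documents and the log-based idf form are assumptions, not the chapter's exact setup):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents (bag-of-words)."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # raw term frequency scaled by inverse document frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["price", "drop", "stock"],
        ["stock", "market", "rally"],
        ["team", "wins", "match"]]
w = tfidf(docs)
```

Terms shared across documents (here "stock") receive lower idf than terms unique to one document, which is the property the probability-based scheme in the chapter aims to sharpen per category.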

Author(s):  
Ying Liu ◽  
Han Tong Loh ◽  
Wen Feng Lu

This chapter introduces an approach for deriving a taxonomy from documents using a novel document profile model that enriches document representations with semantic information systematically generated at the sentence level. A frequent word sequence method is proposed to search for the salient semantic information and has been integrated into the document profile model. An experimental study of taxonomy generation using hierarchical agglomerative clustering has shown a significant improvement in F-score based on the document profile model. A close examination reveals that the integration of semantic information makes a clear contribution compared to the classic bag-of-words approach. This study encourages us to further investigate the possibility of applying the document profile model to a wide range of text-based mining tasks.
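
The sentence-level frequent word sequence search can be illustrated with a minimal sketch (treating a "sequence" as a contiguous n-gram and counting support per sentence is an assumption here; the chapter's exact mining algorithm may differ):

```python
from collections import Counter

def frequent_sequences(sentences, min_support=2, max_len=3):
    """Return word sequences (contiguous n-grams, n >= 2) that occur
    in at least `min_support` distinct sentences."""
    counts = Counter()
    for sent in sentences:
        seen = set()  # count each sequence at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                seen.add(tuple(sent[i:i + n]))
        counts.update(seen)
    return {seq for seq, c in counts.items() if c >= min_support}

sentences = [
    ["text", "classification", "task"],
    ["a", "text", "classification", "method"],
    ["document", "clustering"],
]
seqs = frequent_sequences(sentences)
```

Only "text classification" clears the support threshold in this toy input; such surviving sequences are the kind of semantic units that would enrich the document profile beyond single words.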


2020 ◽  
Vol 14 (4) ◽  
pp. 101076 ◽  
Author(s):  
Turgut Dogan ◽  
Alper Kursat Uysal

2021 ◽  
Vol 168 ◽  
pp. 114438
Author(s):  
Long Chen ◽  
Liangxiao Jiang ◽  
Chaoqun Li

The term weighting scheme (TWS) is a key component of the matching mechanism when the vector space model is used for information retrieval (IR) from text documents. This paper describes a new term weighting approach intended to improve classification performance. We propose an effective term weighting scheme that achieves the highest accuracy compared with existing text classification methods. We compared the performance of the k-nearest neighbors (KNN) and naïve Bayes classifiers under different weighting methods, including information gain and SVM-based weighting, against the proposed method. Many term weighting methods (TWMs) were implemented on Amazon data collections in combination with information gain and the SVM, KNN, and naïve Bayes algorithms.
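
The information gain criterion used for term weighting above can be sketched as follows (a minimal sketch assuming a binary term-presence feature and set-of-words documents; the toy data is illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the presence/absence of `term` with respect to the class labels."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = (len(with_t) / n) * entropy(with_t) \
         + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - cond

docs = [{"ball", "goal"}, {"ball", "match"}, {"stock", "price"}, {"stock", "market"}]
labels = ["sport", "sport", "finance", "finance"]
ig_ball = information_gain(docs, labels, "ball")    # perfectly separates the classes
ig_match = information_gain(docs, labels, "match")  # weaker, class-partial term
```

Terms whose presence cleanly splits the classes ("ball" here) get the highest gain, which is why IG-weighted features tend to help discriminative classifiers such as SVM and KNN.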


2018 ◽  
Vol 110 ◽  
pp. 23-29 ◽  
Author(s):  
Guozhong Feng ◽  
Shaoting Li ◽  
Tieli Sun ◽  
Bangzuo Zhang

MATICS ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 30
Author(s):  
Syahroni Wahyu Iriananda ◽  
Muhammad Aziz Muslim ◽  
Harry Soekotjo Dachlan

Report handling on the "LAPOR!" system depends on the system administrator, who manually reads every incoming report [3]. Manual reading can lead to errors in handling complaints [4]; when the data flow is very large and grows rapidly, processing can take at least three days and is prone to inconsistencies [3]. In this study, the authors propose a model that can computationally measure and identify the similarity between a query (incoming report) and archived documents. The authors employ a class-based indexing term weighting scheme and cosine similarity to analyze document similarity. The CoSimTFIDF, CoSimTFICF, and CoSimTFIDFICF values are defined as feature sets for text classification using the k-nearest neighbors (K-NN) method. The best evaluation result, with stemming applied during preprocessing, is obtained with a 75% training / 25% test data split on the CoSimTFIDF feature, at 84%. The value k = 5 yields a high accuracy of 84.12%.
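
The cosine-similarity matching between a query report and archived documents can be sketched as follows (the sparse dict-of-weights vector representation and the toy weights are illustrative assumptions; in the study the weights come from the TFIDF/TFICF schemes):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# incoming report (query) vs. two archived reports
query = {"road": 1.0, "damage": 2.0}
archive = [{"road": 1.0, "damage": 1.0},   # overlapping topic
           {"water": 1.0, "supply": 1.0}]  # disjoint topic
sims = [cosine(query, d) for d in archive]
```

A K-NN classifier would then label the query by majority vote over the k archived reports with the highest similarity scores, e.g. k = 5 as in the reported best configuration.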

