Label-Aware Document Representation via Hybrid Attention for Extreme Multi-Label Text Classification

In the automated text classification, a bag-of-words representation followed by the tfidf weighting is the most popular approach to convert the textual documents into various numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with the semantic information systematically discovered at the document sentence level. The salient semantic information is searched using a frequent word sequence method. Different from the classic tfidf weighting scheme, a probability based term weighting scheme which directly reflect the term’s strength in representing a specific category has been proposed. The experimental study based on the semantic enriched document representation and the newly proposed probability based term weighting scheme has shown a significant improvement over the classic approach, i.e., bag-of-words plus tfidf, in terms of Fscore. This study encourages us to further investigate the possibility of applying the semantic enriched document representation over a wide range of text based mining tasks.

Download Full-text

A Novel Neural Network-Based Method for Medical Text Classification

Future Internet ◽

10.3390/fi11120255 ◽

2019 ◽

Vol 11 (12) ◽

pp. 255 ◽

Cited By ~ 2

Author(s):

Li Qing ◽

Weng Linhong ◽

Ding Xuehai

Keyword(s):

Neural Network ◽

Text Classification ◽

Text Categorization ◽

Medical Literature ◽

Clinical Information ◽

Specific Area ◽

Attention Mechanism ◽

Document Representation ◽

Medical Texts ◽

Medical Text

Medical text categorization is a specific area of text categorization. Classification for medical texts is considered a special case of text classification. Medical text includes medical records and medical literature, both of which are important clinical information resources. However, medical text contains complex medical vocabularies, medical measures, which has problems with high-dimensionality and data sparsity, so text classification in the medical domain is more challenging than those in other general domains. In order to solve these problems, this paper proposes a unified neural network method. In the sentence representation, the convolutional layer extracts features from the sentence and a bidirectional gated recurrent unit (BIGRU) is used to access both the preceding and succeeding sentence features. An attention mechanism is employed to obtain the sentence representation with the important word weights. In the document representation, the method uses the BIGRU to encode the sentences, which is obtained in sentence representation and then decode it through the attention mechanism to get the document representation with important sentence weights. Finally, a category of medical text is obtained through a classifier. Experimental verifications are conducted on four medical text datasets, including two medical record datasets and two medical literature datasets. The results clearly show that our method is effective.

Download Full-text

Bag-of-Concepts Document Representation for Bayesian Text Classification

2016 IEEE International Conference on Computer and Information Technology (CIT) ◽

10.1109/cit.2016.50 ◽

2016 ◽

Cited By ~ 1

Author(s):

Marcos Mourino-Garcia ◽

Roberto Perez-Rodriguez ◽

Luis Anido-Rifon ◽

Miguel Gomez-Carballa

Keyword(s):

Text Classification ◽

Document Representation

Download Full-text

Emotionally charged text classification with deep learning and sentiment semantic

Neural Computing and Applications ◽

10.1007/s00521-021-06542-1 ◽

2021 ◽

Author(s):

Jeow Li Huan ◽

Arif Ahmed Sekh ◽

Chai Quek ◽

Dilip K. Prasad

Keyword(s):

Language Processing ◽

Text Classification ◽

Classification Accuracy ◽

State Of The Art ◽

Document Representation ◽

Classical Technique ◽

Text Classifiers ◽

Vector Sequences ◽

Fully Connected ◽

Better Than

AbstractText classification is one of the widely used phenomena in different natural language processing tasks. State-of-the-art text classifiers use the vector space model for extracting features. Recent progress in deep models, recurrent neural networks those preserve the positional relationship among words achieve a higher accuracy. To push text classification accuracy even higher, multi-dimensional document representation, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as a sequence of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can be easily attached as a fully connected layer to the word vectors to further improve classification accuracy. On the UCI sentiment labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, which is better than 80.7% from ridge regression classifier—the best among the classical technique we tested. Additional sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique—the Naíve Bayes Bernoulli classifier, achieves accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds that at 75% accuracy.

Download Full-text

A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification

Mathematical Problems in Engineering ◽

10.1155/2018/7987691 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Jianming Zheng ◽

Yupu Guo ◽

Chong Feng ◽

Honghui Chen

Keyword(s):

Neural Network ◽

Text Classification ◽

Network Models ◽

Document Classification ◽

Neural Representation ◽

Hierarchical Architecture ◽

Document Representation ◽

Neural Network Models ◽

Hierarchical Neural Network ◽

Public Datasets

Document representation is widely used in practical application, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on the statistics and the neural networks, which suffer from data sparsity and model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially less) expense of time consumption. In addition, we find that the long documents benefit more from the hierarchical architecture than the short ones as the improvement in terms of accuracy on long documents is greater than that on short documents.

Download Full-text

Least information document representation for automated text classification

Proceedings of the American Society for Information Science and Technology ◽

10.1002/meet.14504901118 ◽

2012 ◽

Vol 49 (1) ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Weimao Ke

Keyword(s):

Text Classification ◽

Document Representation ◽

Automated Text Classification

Download Full-text

Label-Specific Document Representation for Multi-Label Text Classification

10.18653/v1/d19-1044 ◽

2019 ◽

Cited By ~ 1

Author(s):

Lin Xiao ◽

Xin Huang ◽

Boli Chen ◽

Liping Jing

Keyword(s):

Text Classification ◽

Document Representation

Download Full-text

A Topical Word Embeddings for Text Classification

10.5753/eniac.2018.4401 ◽

2018 ◽

Author(s):

João Marcos Carvalho Lima ◽

José Everardo Bessa Maia

Keyword(s):

Text Classification ◽

Text Categorization ◽

Topic Model ◽

State Of The Art ◽

Bag Of Words ◽

Word Embeddings ◽

Document Representation ◽

Linear Classifier ◽

Low Dimensionality ◽

Nonlinear Classifier

This paper presents an approach that uses topic models based on LDA to represent documents in text categorization problems. The document representation is achieved through the cosine similarity between document embeddings and embeddings of topic words, creating a Bag-of-Topics (BoT) variant. The performance of this approach is compared against those of two other representations: BoW (Bag-of-Words) and Topic Model, both based on standard tf-idf. Also, to reveal the effect of the classifier, we compared the performance of the nonlinear classifier SVM against that of the linear classifier Naive Bayes, taken as baseline. To evaluate the approach we use two bases, one multi-label (RCV-1) and another single-label (20 Newsgroup). The model presents significant results with low dimensionality when compared to the state of the art.

Download Full-text