Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: the case of gluten bibliome

Document classification is an essential task in many real world applications. Existing approaches adopt both text semantics and document structure to obtain the document representation. However, these models usually require a large collection of annotated training instances, which are not always feasible, especially in low-resource settings. In this paper, we propose a multi-task learning framework to jointly train multiple related document classification tasks. We devise a hierarchical architecture to make use of the shared knowledge from all tasks to enhance the document representation of each task. We further propose an inter-attention approach to improve the task-specific modeling of documents with global information. Experimental results on 15 public datasets demonstrate the benefits of our proposed model.

Download Full-text

A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification

Mathematical Problems in Engineering ◽

10.1155/2018/7987691 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Jianming Zheng ◽

Yupu Guo ◽

Chong Feng ◽

Honghui Chen

Keyword(s):

Neural Network ◽

Text Classification ◽

Network Models ◽

Document Classification ◽

Neural Representation ◽

Hierarchical Architecture ◽

Document Representation ◽

Neural Network Models ◽

Hierarchical Neural Network ◽

Public Datasets

Document representation is widely used in practical application, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on the statistics and the neural networks, which suffer from data sparsity and model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially less) expense of time consumption. In addition, we find that the long documents benefit more from the hierarchical architecture than the short ones as the improvement in terms of accuracy on long documents is greater than that on short documents.

Download Full-text

The Influence of Feature Representation of Text on the Performance of Document Classification

Applied Sciences ◽

10.3390/app9040743 ◽

2019 ◽

Vol 9 (4) ◽

pp. 743 ◽

Cited By ~ 4

Author(s):

Sanda Martinčić-Ipšić ◽

Tanja Miličić ◽

and Todorovski

Keyword(s):

Comparative Analysis ◽

Document Classification ◽

Feature Representation ◽

Superior Performance ◽

Bag Of Words ◽

Document Representation ◽

Text Documents ◽

Continuous Space ◽

The One ◽

Low Dimensional

In this paper we perform a comparative analysis of three models for a feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-word models have been extensively used for the document classification task, the performance of the other two models for the same task have not been well understood. This is especially true for the network-based models that have been rarely considered for the representation of text documents for classification. In this study, we measure the performance of the document classifiers trained using the method of random forests for features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. Finally, the results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model of doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. The results finally point out that doc2vec shows a superior performance in the tasks of classifying large documents.

Download Full-text