A Note on Document Classification with Small Training Data

Data acquisition is a major concern in text classification. The considerable human effort that conventional methods require to build a quality training collection is not always available to researchers. In this paper, the authors explore ways to collect training data automatically by sampling the Web with a set of given class names. The basic idea is to generate appropriate keywords and submit them as queries to search engines in order to acquire training data. The first of the two methods presented in this paper samples the concepts common to all classes; the second samples the discriminative concepts of each class. A series of experiments was carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors’ strategy for retrieving Web samples substantially improves conventional document classification in terms of accuracy and efficiency.
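
The abstract does not say how the keywords are chosen, so the following is only a minimal sketch of the two sampling ideas under stated assumptions: each class comes with a short seed text (a hypothetical input), a "common concept" is a term that occurs in every class, and a "discriminative concept" is a term that occurs in only one class. The function name build_queries, the whitespace tokenizer, and the top_k parameter are illustrative and not taken from the paper. Submitting the queries to a search engine is left out.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_queries(seed_texts, top_k=2):
    """Build search queries per class from hypothetical seed texts.

    seed_texts maps a class name to a short seed text (an assumption of
    this sketch; the paper starts from class names only).
    """
    counts = {c: Counter(tokenize(t)) for c, t in seed_texts.items()}

    # Common concepts: terms that occur in the seed text of every class.
    common = set.intersection(*(set(cnt) for cnt in counts.values()))

    queries = {}
    for cls, cnt in counts.items():
        # Vocabulary of all other classes.
        others = set().union(*(set(o) for k, o in counts.items() if k != cls))
        # Rank terms by frequency, breaking ties alphabetically.
        ranked = sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))
        # Discriminative concepts: frequent terms unseen in other classes.
        discriminative = [w for w, _ in ranked if w not in others][:top_k]
        queries[cls] = {
            "common": [f"{cls} {w}" for w in sorted(common)],
            "discriminative": [f"{cls} {w}" for w in discriminative],
        }
    return queries


if __name__ == "__main__":
    seeds = {
        "sports": "team match score team league",
        "finance": "market score stock market bank",
    }
    for cls, q in build_queries(seeds).items():
        print(cls, q)
```

In this toy input the only shared term is "score", so it feeds the common-concept queries, while terms such as "team" or "market" feed the discriminative ones.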

On the influence of training data quality on text document classification using machine learning methods

International Journal of Knowledge Engineering and Data Mining ◽

10.1504/ijkedm.2015.071284 ◽

2015 ◽

Vol 3 (2) ◽

pp. 143 ◽

Cited By ~ 1

Author(s):

Jyri Saarikoski ◽

Henry Joutsijoki ◽

Kalervo Järvelin ◽

Jorma Laurikkala ◽

Martti Juhola

Keyword(s):

Machine Learning ◽

Data Quality ◽

Document Classification ◽

Training Data ◽

Learning Methods ◽

Machine Learning Methods ◽

Text Document ◽

Text Document Classification

Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6500 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9547-9554

Author(s):

Mozhi Zhang ◽

Yoshinari Fujinuma ◽

Jordan Boyd-Graber

Keyword(s):

Knowledge Transfer ◽

Text Classification ◽

Document Classification ◽

Training Data ◽

Target Language ◽

Source Language ◽

Low Resource ◽

Classification Framework ◽

Related Language ◽

Cross Lingual

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
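
To illustrate why a joint character representation lets knowledge carry over between words with similar written forms, here is a minimal sketch. It is not the CACO embedder: the paper trains its embedder jointly with the classifier, whereas this stand-in averages fixed, hash-derived vectors of character trigrams. The dimension, the trigram choice, and the example words are all assumptions made for illustration.

```python
import hashlib
import math

DIM = 256  # vector size; an arbitrary choice for this sketch


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_vector(gram):
    """Fixed pseudo-random vector for an n-gram, derived from a hash.

    The same n-gram gets the same vector in every language, which is the
    'joint character representation' idea in its simplest form.
    """
    digest = hashlib.shake_256(gram.encode("utf-8")).digest(DIM)
    return [b / 127.5 - 1.0 for b in digest]  # map bytes to [-1, 1]


def embed(word):
    """Word vector as the average of its character n-gram vectors."""
    vectors = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Similar spellings share trigrams, so their vectors lie closer together.
    print(cosine(embed("nation"), embed("nacion")))
    print(cosine(embed("nation"), embed("perro")))
```

Because "nation" and "nacion" share several trigrams, their vectors are closer to each other than to the vector of an unrelated word, so a word-based classifier trained on one spelling receives a similar input for the other.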

Classifying Patient and Professional Voice in Social Media Health Posts

10.21203/rs.3.rs-422198/v1 ◽

2021 ◽

Author(s):

Beatrice Alex ◽

Donald Whyte ◽

Daniel Duma ◽

Roma English Owen ◽

Elizabeth A.L. Fairley

Keyword(s):

Social Media ◽

Skin Diseases ◽

Document Classification ◽

Research Field ◽

Training Data ◽

Combined Training ◽

Patient Voice ◽

Starting Point ◽

Data Source ◽

Professional Voice

Abstract Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a Convolutional Neural Network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases). Results: We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only. Conclusion: The main conclusion resulting from this work is that using more data for training a classifier does not necessarily result in best possible performance. In the context of classifying social media posts by patient and professional voice, we showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy).
Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.

Web Document Classification Using Changing Training Data Set

Computational Science and Its Applications - ICCSA 2006 - Lecture Notes in Computer Science ◽

10.1007/11751649_62 ◽

2006 ◽

pp. 565-574

Author(s):

Gilcheol Park ◽

Seoksoo Kim

Keyword(s):

Document Classification ◽

Training Data ◽

Data Set ◽

Web Document

Classifying patient and professional voice in social media health posts

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01577-9 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Beatrice Alex ◽

Donald Whyte ◽

Daniel Duma ◽

Roma English Owen ◽

Elizabeth A. L. Fairley

Keyword(s):

Social Media ◽

Skin Diseases ◽

Document Classification ◽

Training Data ◽

Social Media Data ◽

Patient Voice ◽

Starting Point ◽

Data Source ◽

Professional Voice ◽

Media Data

Abstract Background: Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a convolutional neural network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases). Results: We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only. Conclusion: The main conclusion resulting from this work is that combining social media data from platforms with different characteristics for training a patient and professional voice classifier does not result in best possible performance. We showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy).
Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.
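
The abstract reports overall accuracy, macro-average F1, and a per-class F1-score side by side. The sketch below shows how these three figures relate for a three-way classification. The labels are toy values invented for illustration and are not data from the study; the function names are likewise assumptions of this sketch.

```python
def accuracy(gold, pred):
    """Fraction of posts whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1-score of one class, treating that class as the positive label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    labels = ["patient", "professional", "other"]
    # Toy data: one professional post is misclassified as patient voice.
    gold = ["patient"] * 6 + ["professional"] * 2 + ["other"] * 2
    pred = ["patient"] * 6 + ["professional", "patient"] + ["other"] * 2
    print("accuracy:", accuracy(gold, pred))
    print("macro F1:", macro_f1(gold, pred, labels))
    print("patient F1:", f1_for_label(gold, pred, "patient"))
```

Macro-averaging weights every class equally, so a single error in a small class lowers it more than it lowers accuracy. That is one way an accuracy figure can sit above the macro-average F1.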
