Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Author(s):  
Tomasz Walkowiak ◽  
Szymon Datko ◽  
Henryk Maciejewski
Author(s):  
Hau-Wen Chang ◽  
Hung-sik Kim ◽  
Shuyang Li ◽  
Jeongkyu Lee ◽  
Dongwon Lee

PeerJ ◽  
2015 ◽  
Vol 3 ◽  
pp. e1279 ◽  
Author(s):  
Marcos Antonio Mouriño García ◽  
Roberto Pérez Rodríguez ◽  
Luis E. Anido Rifón

Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as particularly important. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that gives simple and effective access to documents of interest would be useful; for that, the documents must be sorted according to some criteria, that is, they must be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are the words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning”, and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with one of the commonly used corpora for evaluating classification and retrieval of biomedical information, OHSUMED, and with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem for the UVigoMED corpus.
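To make the contrast between the two representations concrete, here is a minimal Python sketch. It is not the authors' method: the abstract describes Wikipedia-derived concepts weighted by semantic relevance, whereas this toy uses a small hand-written word-to-concept dictionary (`TOY_CONCEPTS`, a hypothetical name and mapping) and plain counts. It only illustrates why synonyms that look unrelated in a bag of words can coincide in a bag of concepts; a dictionary lookup of this kind does nothing for polysemy.

```python
from collections import Counter
import re

# Hypothetical toy mapping from surface words to concept labels.
# An assumption for illustration only; the paper derives concepts from Wikipedia.
TOY_CONCEPTS = {
    "tumor": "Neoplasm",
    "tumour": "Neoplasm",
    "neoplasm": "Neoplasm",
    "heart": "Heart",
    "cardiac": "Heart",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Bag of words: each distinct token weighted by its raw frequency."""
    return Counter(tokenize(text))


def bag_of_concepts(text, concepts=TOY_CONCEPTS):
    """Bag of concepts: tokens mapped to concept labels, then counted.

    Tokens without an entry in the mapping are dropped. Raw counts stand in
    for the semantic-relevance weights mentioned in the abstract.
    """
    return Counter(concepts[t] for t in tokenize(text) if t in concepts)


doc_a = "Cardiac tumor detected"
doc_b = "Heart neoplasm detected"

# The only word the two documents share is "detected" ...
shared_words = set(bag_of_words(doc_a)) & set(bag_of_words(doc_b))
# ... but their concept bags are identical.
same_concepts = bag_of_concepts(doc_a) == bag_of_concepts(doc_b)
```

In this toy case the two documents share a single word yet map to the same concepts, which is the kind of synonymy a concept-level representation is meant to absorb.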


2021 ◽  
Vol 503 (2) ◽  
pp. 1828-1846
Author(s):  
Burger Becker ◽  
Mattia Vaccari ◽  
Matthew Prescott ◽  
Trienko Grobler

ABSTRACT The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists, and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is convolutional neural networks (CNNs). The literature, however, lacks two major things when it comes to CNNs and radio galaxy morphological classification. First, a proper analysis is needed of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set. Second, a good comparative study of the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper. Multiple performance metrics are used for the latter comparative study, such as inference time, model complexity, computational complexity, and mean per class accuracy. As part of this study, we also investigate the effect that receptive field, stride length, and coverage have on recognition performance. For the sake of completeness, we also investigate the recognition performance gains that we can obtain by employing classification ensembles. A ranking system based upon recognition and computational performance is proposed. MCRGNet, Radio Galaxy Zoo, and ConvXpress (novel classifier) are the architectures that best balance computational requirements with recognition performance.
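The abstract names mean per class accuracy as one of its metrics without defining it. The sketch below assumes the common definition, the unweighted average of each class's own accuracy (equivalently, macro-averaged recall). The function name and the class labels "A" and "B" are illustrative and do not come from the paper.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average, over classes present in y_true, of the fraction of that
    class's samples that were predicted correctly.

    Assumed definition (macro-averaged recall); the abstract gives none.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)  # correctly classified samples per true class
    total = defaultdict(int)    # all samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy data: 8 samples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = mean_per_class_accuracy(y_true, y_pred)
```

On this toy data the overall accuracy is 0.9, while the per-class mean is 0.75, because the rare class "B" is only half right. That gap is why a per-class average is often preferred when a training set is small and classes are unevenly represented.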


Author(s):  
Yu. A. Sakhno

This article studies the structural and semantic features of tactile verbs (hereinafter TVs) in English, German and Russian. Particular attention is paid to the comparative study of TVs, which allows us to identify structural and semantic similarities and differences among the linguistic units studied. A structural and semantic classification of TVs in the compared languages is also provided.

