Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Comparative Study on Subject Classification of Academic Videos Using Noisy Transcripts

2010 IEEE Fourth International Conference on Semantic Computing ◽

10.1109/icsc.2010.91 ◽

2010 ◽

Author(s):

Hau-Wen Chang ◽

Hung-sik Kim ◽

Shuyang Li ◽

Jeongkyu Lee ◽

Dongwon Lee

Keyword(s):

Comparative Study ◽

Subject Classification

Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor

Artificial Intelligence and Soft Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-030-20915-5_41 ◽

2019 ◽

pp. 455-463

Author(s):

Tomasz Walkowiak ◽

Szymon Datko ◽

Henryk Maciejewski

Keyword(s):

Subject Classification ◽

Text Documents ◽

Local Outlier Factor ◽

Open Set ◽

Local Outlier

Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

PeerJ ◽

10.7717/peerj.1279 ◽

2015 ◽

Vol 3 ◽

pp. e1279 ◽

Cited By ~ 10

Author(s):

Marcos Antonio Mouriño García ◽

Roberto Pérez Rodríguez ◽

Luis E. Anido Rifón

Keyword(s):

Classification Problem ◽

Automatic Classification ◽

Important Application ◽

Biomedical Literature ◽

Daily Activities ◽

Bag Of Words ◽

Text Documents ◽

Semantic Relevance ◽

Automatic Document Classification

Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important application of automatic document classification strategies. Biomedical staff and researchers have to deal with a lot of literature in their daily activities, so a system that gives simple and effective access to documents of interest would be useful; this requires that the documents be sorted by some criteria, that is, classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are words in the text (and thus suffer from synonymy and polysemy), and their weights are based only on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning”, and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. For the evaluation of the proposal, empirical experiments were conducted with one of the commonly used corpora for evaluating classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and by up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and by up to 155% in the multi-label problem for the UVigoMED corpus.
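The bag-of-words limitation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function name `bag_of_words`, and the two example sentences are assumptions made only for illustration. The sketch shows that raw word counts treat synonyms as unrelated features.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to {word: raw frequency}.

    Hypothetical helper: a simple lowercase, letters-only tokenizer
    stands in for whatever preprocessing a real system would use.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two illustrative sentences that use different words for one concept.
doc_a = "The tumor was malignant."
doc_b = "The neoplasm was malignant."

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Words that the two representations have in common.
shared = set(bow_a) & set(bow_b)

# "tumor" and "neoplasm" never match, so the synonymy is invisible
# to a classifier that only sees these word-level features.
only_a = set(bow_a) - set(bow_b)
only_b = set(bow_b) - set(bow_a)
```

A concept-based representation, as proposed in the paper, would aim to map both words to the same feature; how that mapping is built from Wikipedia is not specified in this abstract and is not attempted here.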

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Lecture Notes in Networks and Systems - Reliability and Statistics in Transportation and Communication ◽

10.1007/978-3-030-12450-2_15 ◽

2019 ◽

pp. 159-167 ◽

Cited By ~ 2

Author(s):

Tomasz Walkowiak ◽

Szymon Datko ◽

Henryk Maciejewski

Keyword(s):

Subject Classification ◽

Text Documents ◽

Feature Vectors ◽

Reduction Of Dimensionality

Feature Extraction in Subject Classification of Text Documents in Polish

Artificial Intelligence and Soft Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-319-91262-2_40 ◽

2018 ◽

pp. 445-452 ◽

Cited By ~ 5

Author(s):

Tomasz Walkowiak ◽

Szymon Datko ◽

Henryk Maciejewski

Keyword(s):

Feature Extraction ◽

Subject Classification ◽

Text Documents

A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text

Applied Computing and Information Technology - Studies in Computational Intelligence ◽

10.1007/978-3-030-25217-5_7 ◽

2019 ◽

pp. 81-93 ◽

Cited By ~ 1

Author(s):

Rangsipan Marukatat

Keyword(s):

Comparative Study ◽

Word Embedding ◽

Bag Of Words

CNN architecture comparison for radio galaxy classification

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stab325 ◽

2021 ◽

Vol 503 (2) ◽

pp. 1828-1846

Author(s):

Burger Becker ◽

Mattia Vaccari ◽

Matthew Prescott ◽

Trienko Grobler

Keyword(s):

Comparative Study ◽

Galaxy Evolution ◽

Radio Galaxy ◽

Recognition Performance ◽

Model Complexity ◽

Great Promise ◽

Radio Sources ◽

Morphological Classification ◽

Computational Performance

The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists, and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is convolutional neural networks (CNNs). The literature, however, lacks two major things when it comes to CNNs and radio galaxy morphological classification. First, a proper analysis is needed of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set. Second, a good comparative study of the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper. Multiple performance metrics are used for the latter comparative study, such as inference time, model complexity, computational complexity, and mean per class accuracy. As part of this study, we also investigate the effect that receptive field, stride length, and coverage have on recognition performance. For the sake of completeness, we also investigate the recognition performance gains that we can obtain by employing classification ensembles. A ranking system based upon recognition and computational performance is proposed. MCRGNet, Radio Galaxy Zoo, and ConvXpress (novel classifier) are the architectures that best balance computational requirements with recognition performance.
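One of the metrics named in this abstract, mean per class accuracy, can be sketched as follows. This assumes the common definition (the unweighted average of each class's own accuracy); the abstract does not spell out the paper's exact formula. The function name and the toy labels "A" and "B" are hypothetical.

```python
def mean_per_class_accuracy(y_true, y_pred):
    """Average the accuracy obtained within each true class.

    Assumed definition: for every class, the fraction of its samples
    predicted correctly, then the unweighted mean over classes.
    """
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        # Indices of the samples whose true label is c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)


# Toy, imbalanced example: four samples of class "A", two of class "B".
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "B", "A"]

score = mean_per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case class "A" is fully correct and class "B" is half correct, so the per-class mean is lower than the overall accuracy. That is why the per-class metric is often preferred when class sizes differ.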

A Comparative Study of Feature Metrics for Classification of Human Passport Photos

2007 IEEE 15th Signal Processing and Communications Applications ◽

10.1109/siu.2007.4298721 ◽

2007 ◽

Author(s):

Sinem Aslan ◽

Turhan Tunalı ◽

Muhammed Cinsdikici

Keyword(s):

Comparative Study

Computing Correlative Association of Terms for Automatic Classification of Text Documents

Proceedings of the Third International Symposium on Computer Vision and the Internet - VisionNet'16 ◽

10.1145/2983402.2983424 ◽

2016 ◽

Cited By ~ 3

Author(s):

Deepak Agnihotri ◽

Kesari Verma ◽

Priyanka Tripathi

Keyword(s):

Automatic Classification ◽

Text Documents

Structure and Semantics of Tactile Verbs in a Comparative Aspect (On the Material of the English, German and Russian Languages)

Uchenye zapiski St. Petersburg University of Management Technologies and Economics ◽

10.35854/2541-8106-2021-3-58-63 ◽

2021 ◽

pp. 58-63

Author(s):

Yu. A. Sakhno

Keyword(s):

Comparative Study ◽

Semantic Features ◽

Semantic Classification ◽

Comparative Aspect ◽

Similarities And Differences ◽

The Comparative Study ◽

Linguistic Units

This article studies the structural and semantic features of tactile verbs (hereinafter TVs) in English, German and Russian. Particular attention is paid to the comparative study of TVs, which allows us to identify structural and semantic similarities and differences among the linguistic units studied. A structural and semantic classification of TVs in the compared languages is also provided.
