Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

PeerJ ◽  
2015 ◽  
Vol 3 ◽  
pp. e1279 ◽  
Author(s):  
Marcos Antonio Mouriño García ◽  
Roberto Pérez Rodríguez ◽  
Luis E. Anido Rifón

Automatic classification of text documents into a set of categories has many applications; among them, the classification of biomedical literature stands out. Biomedical staff and researchers must deal with a large volume of literature in their daily activities, so a system that provides simple and effective access to documents of interest would be valuable; this requires that documents be sorted according to some criterion, that is, classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are the words in the text, which makes them suffer from synonymy and polysemy, and their weights are based solely on frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning” and thus tackling synonymy and polysemy. In addition, concepts are weighted according to their semantic relevance in the text. The proposal was evaluated empirically on OHSUMED, a corpus commonly used to evaluate the classification and retrieval of biomedical information, and on UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem on the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem on the UVigoMED corpus.
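To make the distinction concrete, the following Python sketch contrasts the two representations. The concept lexicon and the weighting are illustrative assumptions: the actual system maps terms to Wikipedia concepts and weights them by semantic relevance, while here a tiny hand-made lexicon and raw frequency stand in for both.

```python
# A minimal sketch of bag-of-words vs. bag-of-concepts. A hypothetical
# lexicon maps surface forms to concept identifiers; synonyms ("heart
# attack", "myocardial infarction") collapse onto one concept, which is
# the property that lets BoC sidestep synonymy.
from collections import Counter

CONCEPT_LEXICON = {
    "heart attack": "C:Myocardial_infarction",
    "myocardial infarction": "C:Myocardial_infarction",
    "hypertension": "C:Hypertension",
    "high blood pressure": "C:Hypertension",
}

def bag_of_words(text: str) -> Counter:
    """Classical BoW: features are raw tokens, weighted by frequency."""
    return Counter(text.lower().split())

def bag_of_concepts(text: str) -> Counter:
    """Toy BoC: greedily match multi-word phrases against the lexicon."""
    tokens = text.lower().split()
    concepts: Counter = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest phrase first (here, up to 3 tokens).
        for n in (3, 2, 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPT_LEXICON:
                concepts[CONCEPT_LEXICON[phrase]] += 1
                i += n
                break
        else:
            i += 1
    return concepts

doc = "Myocardial infarction risk rises with high blood pressure"
print(bag_of_words(doc))      # distinct tokens, synonymy unresolved
print(bag_of_concepts(doc))   # both phrases map to shared concept IDs
```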

2017 ◽  
Vol 8 (1) ◽  
pp. 01-12
Author(s):  
Abdelrahman M. Arab ◽  
Ahmed M. Gadallah ◽  
Akram Salah

2020 ◽  
Vol 1 (2) ◽  
pp. 114-119
Author(s):  
Nur Aniq Syafiq Rodzuan ◽  
Shahreen Kasim ◽  
Mohanavali Sithambranathan ◽  
Muhammad Zaki Hassan

Textual information, presented in words and characters, is easy for humans to understand. To extract this kind of information, text mining was introduced: the process of extracting non-trivial patterns or knowledge from text documents or textual databases. The purpose of this paper is to perform and compare keyword extraction using statistical and linguistic extraction tools on 120 text documents related to hypertension and diabetes. To draw this comparison, RStudio, a statistics-based tool, and TerMine, a linguistics-based tool, were used to extract keywords from the biomedical literature. A classification evaluation using a Naïve Bayes classifier was then carried out to assess and compare the performance of the statistical and linguistic approaches as implemented by these tools. Experimental results show the differences between the two tools in keyword extraction.
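A minimal sketch of the evaluation step, under the assumption that each document has already been reduced to its extracted keywords; the toy corpus and labels below are illustrative stand-ins for the actual RStudio and TerMine outputs, not the paper's data.

```python
# Train a Naive Bayes classifier on keyword representations of documents,
# so two keyword-extraction approaches can be compared by accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents already reduced to their extracted keywords.
docs_keywords = [
    "blood pressure systolic hypertension",
    "insulin glucose diabetes mellitus",
    "hypertension arterial stiffness",
    "glucose tolerance insulin resistance diabetes",
    "hypertension stroke risk blood pressure",
    "diabetes glycemic control insulin",
]
labels = ["hypertension", "diabetes", "hypertension",
          "diabetes", "hypertension", "diabetes"]

X = CountVectorizer().fit_transform(docs_keywords)
scores = cross_val_score(MultinomialNB(), X, labels, cv=3)
print(f"mean accuracy: {scores.mean():.2f}")
# Running the same pipeline on each tool's keyword output yields the
# accuracy figures used to compare the two extraction approaches.
```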


Author(s):  
Jordi Cicres ◽  
Sheila Queralt

This article focuses on the analysis of schoolchildren’s writing (throughout the whole primary school period) using sets of morphological labels (n-grams). We analyzed the sets of bigrams and trigrams from a group of literary texts written by Catalan schoolchildren in order to identify which bigrams and trigrams can help discriminate between texts from the three cycles into which the Spanish primary education system is divided: lower cycle (6- and 7-year-olds), middle cycle (8- and 9-year-olds) and upper cycle (10- and 11-year-olds). The results obtained are around 70% correct classifications (77.5% with bigrams and 68.6% with trigrams), making this technique useful for automatic document classification by age.
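A minimal sketch of the feature extraction, assuming each text has already been converted to a sequence of morphological labels; the tag sequences below are hand-made illustrations, whereas the study derives them with a morphological tagger for Catalan.

```python
# Reduce each text to its sequence of morphological labels (POS tags),
# then count bigrams/trigrams over that sequence as classification features.
from collections import Counter

def tag_ngrams(tags: list[str], n: int) -> Counter:
    """Count n-grams over a sequence of morphological labels."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical tag sequences for texts from two different school cycles.
lower_cycle = ["DET", "NOUN", "VERB", "DET", "NOUN"]
upper_cycle = ["DET", "ADJ", "NOUN", "VERB", "ADV", "CONJ", "PRON", "VERB"]

print(tag_ngrams(lower_cycle, 2))   # bigram features
print(tag_ngrams(upper_cycle, 3))   # trigram features
# Feeding these counts into a standard classifier is what yields the
# roughly 70% correct-classification rates reported above.
```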


2017 ◽  
Vol 29 (3) ◽  
pp. 245-255
Author(s):  
Elias OLIVEIRA ◽  
Delermando BRANQUINHO FILHO

Online journalism grows every day: many news agencies, newspapers, and magazines publish digitally on the global network. Documents published online are available to users, who find them through search engines. To deliver documents relevant to a search, they must be indexed and classified. Given the vast number of documents published online every day, much research has been carried out on facilitating automatic document classification. The objective of the present study is to describe an experimental approach to the automatic classification of journalistic documents published on the Internet, using the Vector Space Model for document representation. The model was tested on a real journalism database, using algorithms widely reported in the literature. The article also describes the metrics used to assess the performance of these algorithms and their required configurations. The results show the efficiency of the method and justify further research on facilitating the automatic classification of documents.
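A minimal sketch of the Vector Space Model pipeline under discussion: documents become TF-IDF-weighted term vectors that a standard classifier operates on. The toy headlines, labels, and the choice of a linear SVM are illustrative assumptions, not the study's actual database or algorithms.

```python
# VSM representation (TF-IDF vectors) feeding a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical journalistic documents and their categories.
headlines = [
    "Central bank raises interest rates again",
    "Local team wins the championship final",
    "Parliament approves the new budget bill",
    "Star striker transfers to rival club",
]
labels = ["economy", "sports", "politics", "sports"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(headlines, labels)
print(model.predict(["Government announces new budget reform"]))
```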


2020 ◽  
Vol 10 (20) ◽  
pp. 7285
Author(s):  
Valentino Santucci ◽  
Filippo Santarelli ◽  
Luciana Forti ◽  
Stefania Spina

This work introduces an automatic classification system for measuring the complexity level of a given Italian text from a linguistic point of view. The task of measuring text complexity is cast as a supervised classification problem, exploiting a dataset of texts purposely produced by linguistic experts for second-language teaching and assessment. The commonly adopted Common European Framework of Reference for Languages (CEFR) levels were used as the target classes, texts were represented by a large set of numeric linguistic features, and an experimental comparison among ten widely used machine learning models was conducted. The results show that the proposed approach obtains good prediction accuracy; a further analysis identifies the categories of features that influenced the predictions.
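A minimal sketch of such an experimental comparison, assuming texts have already been reduced to numeric linguistic features. The three features, the synthetic dataset, and the three models shown are illustrative stand-ins for the study's much larger feature set and its ten models.

```python
# Score several classifiers on CEFR-level prediction from numeric
# linguistic features, using cross-validation for a fair comparison.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Features per text: mean sentence length, mean word length, lexical variety.
X = rng.normal(loc=[[12, 4.2, 0.4]] * 30 + [[18, 5.1, 0.55]] * 30
               + [[25, 5.8, 0.7]] * 30, scale=1.0)
y = ["A2"] * 30 + ["B1"] * 30 + ["C1"] * 30

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.2f}")
```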


2017 ◽  
Vol 50 (3) ◽  
pp. 549-572 ◽  
Author(s):  
Deepak Agnihotri ◽  
Kesari Verma ◽  
Priyanka Tripathi
