Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

PeerJ ◽  
2015 ◽  
Vol 3 ◽  
pp. e1279 ◽  
Author(s):  
Marcos Antonio Mouriño García ◽  
Roberto Pérez Rodríguez ◽  
Luis E. Anido Rifón

Automatic classification of text documents into a set of categories has many applications; among them, the classification of biomedical literature stands out. Biomedical staff and researchers must deal with a large volume of literature in their daily activities, so a system that provides simple and effective access to documents of interest would be valuable; this requires that documents be sorted according to some criterion, that is, classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are the words in the text, which makes them suffer from synonymy and polysemy, and their weights are based solely on frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning” and thus tackling synonymy and polysemy. In addition, concepts are weighted according to their semantic relevance in the text. The proposal was evaluated empirically on OHSUMED, a corpus commonly used to evaluate the classification and retrieval of biomedical information, and on UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem on the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem on the UVigoMED corpus.
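To make the distinction concrete, the following Python sketch contrasts the two representations. The concept lexicon and the weighting are illustrative assumptions: the actual system maps terms to Wikipedia concepts and weights them by semantic relevance, while here a tiny hand-made lexicon and raw frequency stand in for both.

```python
# A minimal sketch of bag-of-words vs. bag-of-concepts. A hypothetical
# lexicon maps surface forms to concept identifiers; synonyms ("heart
# attack", "myocardial infarction") collapse onto one concept, which is
# the property that lets BoC sidestep synonymy.
from collections import Counter

CONCEPT_LEXICON = {
    "heart attack": "C:Myocardial_infarction",
    "myocardial infarction": "C:Myocardial_infarction",
    "hypertension": "C:Hypertension",
    "high blood pressure": "C:Hypertension",
}

def bag_of_words(text: str) -> Counter:
    """Classical BoW: features are raw tokens, weighted by frequency."""
    return Counter(text.lower().split())

def bag_of_concepts(text: str) -> Counter:
    """Toy BoC: greedily match multi-word phrases against the lexicon."""
    tokens = text.lower().split()
    concepts: Counter = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest phrase first (here, up to 3 tokens).
        for n in (3, 2, 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPT_LEXICON:
                concepts[CONCEPT_LEXICON[phrase]] += 1
                i += n
                break
        else:
            i += 1
    return concepts

doc = "Myocardial infarction risk rises with high blood pressure"
print(bag_of_words(doc))      # distinct tokens, synonymy unresolved
print(bag_of_concepts(doc))   # both phrases map to shared concept IDs
```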

2017 ◽  
Vol 8 (1) ◽  
pp. 01-12
Author(s):  
Abdelrahman M. Arab ◽  
Ahmed M. Gadallah ◽  
Akram Salah

2020 ◽  
Vol 1 (2) ◽  
pp. 114-119
Author(s):  
Nur Aniq Syafiq Rodzuan ◽  
Shahreen Kasim ◽  
Mohanavali Sithambranathan ◽  
Muhammad Zaki Hassan

Textual information, presented in words and characters, is easy for humans to understand. To extract this kind of information, text mining was introduced: the process of extracting non-trivial patterns or knowledge from text documents or textual databases. The purpose of this paper is to perform and compare keyword extraction using statistical and linguistic extraction tools on 120 text documents related to hypertension and diabetes. To draw this comparison, RStudio, a statistics-based tool, and TerMine, a linguistics-based tool, were used to extract keywords from the biomedical literature. A classification evaluation using a Naïve Bayes classifier was then carried out to assess and compare the performance of the statistical and linguistic approaches as implemented by these tools. Experimental results show the differences between the two tools in keyword extraction.
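A minimal sketch of the evaluation step, under the assumption that each document has already been reduced to its extracted keywords; the toy corpus and labels below are illustrative stand-ins for the actual RStudio and TerMine outputs, not the paper's data.

```python
# Train a Naive Bayes classifier on keyword representations of documents,
# so two keyword-extraction approaches can be compared by accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents already reduced to their extracted keywords.
docs_keywords = [
    "blood pressure systolic hypertension",
    "insulin glucose diabetes mellitus",
    "hypertension arterial stiffness",
    "glucose tolerance insulin resistance diabetes",
    "hypertension stroke risk blood pressure",
    "diabetes glycemic control insulin",
]
labels = ["hypertension", "diabetes", "hypertension",
          "diabetes", "hypertension", "diabetes"]

X = CountVectorizer().fit_transform(docs_keywords)
scores = cross_val_score(MultinomialNB(), X, labels, cv=3)
print(f"mean accuracy: {scores.mean():.2f}")
# Running the same pipeline on each tool's keyword output yields the
# accuracy figures used to compare the two extraction approaches.
```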


Author(s):  
Jordi Cicres ◽  
Sheila Queralt

This article focuses on the analysis of schoolchildren’s writing (throughout the whole primary school period) using sets of morphological labels (n-grams). We analyzed the sets of bigrams and trigrams from a group of literary texts written by Catalan schoolchildren in order to identify which bigrams and trigrams can help discriminate between texts from the three cycles into which the Spanish primary education system is divided: lower cycle (6- and 7-year-olds), middle cycle (8- and 9-year-olds) and upper cycle (10- and 11-year-olds). The results obtained are around 70% correct classifications (77.5% with bigrams and 68.6% with trigrams), making this technique useful for automatic document classification by age.
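A minimal sketch of the feature extraction, assuming each text has already been converted to a sequence of morphological labels; the tag sequences below are hand-made illustrations, whereas the study derives them with a morphological tagger for Catalan.

```python
# Reduce each text to its sequence of morphological labels (POS tags),
# then count bigrams/trigrams over that sequence as classification features.
from collections import Counter

def tag_ngrams(tags: list[str], n: int) -> Counter:
    """Count n-grams over a sequence of morphological labels."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical tag sequences for texts from two different school cycles.
lower_cycle = ["DET", "NOUN", "VERB", "DET", "NOUN"]
upper_cycle = ["DET", "ADJ", "NOUN", "VERB", "ADV", "CONJ", "PRON", "VERB"]

print(tag_ngrams(lower_cycle, 2))   # bigram features
print(tag_ngrams(upper_cycle, 3))   # trigram features
# Feeding these counts into a standard classifier is what yields the
# roughly 70% correct-classification rates reported above.
```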


2017 ◽  
Vol 29 (3) ◽  
pp. 245-255
Author(s):  
Elias OLIVEIRA ◽  
Delermando BRANQUINHO FILHO

Online journalism grows every day: many news agencies, newspapers, and magazines publish digitally on the global network. Documents published online are available to users, who find them through search engines. To deliver documents relevant to a search, they must be indexed and classified. Given the vast number of documents published online every day, much research has been carried out on facilitating automatic document classification. The objective of the present study is to describe an experimental approach to the automatic classification of journalistic documents published on the Internet, using the Vector Space Model for document representation. The model was tested on a real journalism database, using algorithms widely reported in the literature. The article also describes the metrics used to assess the performance of these algorithms and their required configurations. The results show the efficiency of the method and justify further research on facilitating the automatic classification of documents.
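A minimal sketch of the Vector Space Model pipeline under discussion: documents become TF-IDF-weighted term vectors that a standard classifier operates on. The toy headlines, labels, and the choice of a linear SVM are illustrative assumptions, not the study's actual database or algorithms.

```python
# VSM representation (TF-IDF vectors) feeding a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical journalistic documents and their categories.
headlines = [
    "Central bank raises interest rates again",
    "Local team wins the championship final",
    "Parliament approves the new budget bill",
    "Star striker transfers to rival club",
]
labels = ["economy", "sports", "politics", "sports"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(headlines, labels)
print(model.predict(["Government announces new budget reform"]))
```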


2020 ◽  
Vol 10 (20) ◽  
pp. 7285
Author(s):  
Valentino Santucci ◽  
Filippo Santarelli ◽  
Luciana Forti ◽  
Stefania Spina

This work introduces an automatic classification system for measuring the complexity level of a given Italian text from a linguistic point of view. The task of measuring text complexity is cast as a supervised classification problem, exploiting a dataset of texts purposely produced by linguistic experts for second-language teaching and assessment. The commonly adopted Common European Framework of Reference for Languages (CEFR) levels were used as the target classes, texts were represented by a large set of numeric linguistic features, and an experimental comparison among ten widely used machine learning models was conducted. The results show that the proposed approach obtains good prediction accuracy; a further analysis identifies the categories of features that influenced the predictions.
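A minimal sketch of such an experimental comparison, assuming texts have already been reduced to numeric linguistic features. The three features, the synthetic dataset, and the three models shown are illustrative stand-ins for the study's much larger feature set and its ten models.

```python
# Score several classifiers on CEFR-level prediction from numeric
# linguistic features, using cross-validation for a fair comparison.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Features per text: mean sentence length, mean word length, lexical variety.
X = rng.normal(loc=[[12, 4.2, 0.4]] * 30 + [[18, 5.1, 0.55]] * 30
               + [[25, 5.8, 0.7]] * 30, scale=1.0)
y = ["A2"] * 30 + ["B1"] * 30 + ["C1"] * 30

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.2f}")
```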


2017 ◽  
Vol 50 (3) ◽  
pp. 549-572 ◽  
Author(s):  
Deepak Agnihotri ◽  
Kesari Verma ◽  
Priyanka Tripathi
