scholarly journals Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Buzhou Tang ◽  
Hongxin Cao ◽  
Xiaolong Wang ◽  
Qingcai Chen ◽  
Hua Xu

Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial step of natural language processing in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and showed good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER, including clustering-based representation, distributional representation, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all the three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all the three types of WR features, the improvements inF-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, when compared with the systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.

Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, being some of which difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.


2019 ◽  
pp. 1-8 ◽  
Author(s):  
Tomasz Oliwa ◽  
Steven B. Maron ◽  
Leah M. Chase ◽  
Samantha Lomnicki ◽  
Daniel V.T. Catenacci ◽  
...  

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.


2021 ◽  
Vol 75 (3) ◽  
pp. 94-99
Author(s):  
A.M. Yelenov ◽  
◽  
A.B. Jaxylykova ◽  

This research focuses on a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing could be considered as one of the cornerstones in the machine learning area which devotes its attention to the problems connected with the understanding of different natural languages and linguistic analysis. It was already shown that current deep learning techniques have a good performance and accuracy in such areas as image recognition, pattern recognition, computer vision, that could mean that such technology probably would be successful in the neuro-linguistic programming area too and lead to a dramatic increase on the research interest on this topic. For a very long time, quite trivial algorithms have been used in this area, such as support vector machines or various types of regression, basic encoding on text data was also used, which did not provide high results. The following dataset was used to process the experiment models: Dataset Scientific Entity Relation Core. The algorithms used were Long short-term memory, Random Forest Classifier with Conditional Random Fields, and Named-entity recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metrics scores of all models were compared to each other to make a comparison. This research is devoted to the processing of scientific articles, concerning the machine learning area, because the subject is not investigated on enough properly level.The consideration of this task can help machines to understand natural languages better, so that they can solve other neuro-linguistic programming tasks better, enhancing scores in common sense.


2021 ◽  
Vol 7 ◽  
pp. e384
Author(s):  
Rigo E. Ramos-Vargas ◽  
Israel Román-Godínez ◽  
Sulema Torres-Ramos

Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards; while the fourth evaluates the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better in the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; while in the contextualized word embeddings the best results are presented in the specific ones. We conclude, therefore, when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.


2020 ◽  
Author(s):  
Xie-Yuan Xie

Abstract Named Entity Recognition (NER) is a key task which automatically extracts Named Entities (NE) from the text. Names of persons, places, date and time are examples of NEs. We are applying Conditional Random Fields (CRFs) for NER in biomedical domain. Examples of NEs in biomedical texts are gene, proteins. We used a minimal set of features to train CRF algorithm and obtained a good results for biomedical texts.


2019 ◽  
Vol 26 (1) ◽  
pp. 21-33 ◽  
Author(s):  
Qiang Wei ◽  
Yaoyun Zhang ◽  
Muhammad Amith ◽  
Rebecca Lin ◽  
Jenay Lapeyrolerie ◽  
...  

Software tools now are essential to research and applications in the biomedical domain. However, existing software repositories are mainly built using manual curation, which is time-consuming and unscalable. This study took the initiative to manually annotate software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning–based named entity recognition systems for biomedical software. Specifically, two strategies were proposed for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features of clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then created a biomedical software catalog with 19,557 entries using the developed system. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.


Data ◽  
2018 ◽  
Vol 3 (4) ◽  
pp. 53 ◽  
Author(s):  
Maria Mitrofan ◽  
Verginica Barbu Mititelu ◽  
Grigorina Mitrofan

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are mainly in English. Little effort has been reported for other languages including Romanian and, thus, access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains: cardiology, diabetes and endocrinology, extracted from books, journals and blogposts. In order to automatically annotate the corpus with POS tags, we used a Romanian tag set which has 715 labels, while diseases, anatomy, procedures and chemicals and drugs labels were manually annotated for bioNER with a Cohen Kappa coefficient of 92.8% and revealed the occurrence of 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.


Sign in / Sign up

Export Citation Format

Share Document