NERO: A Biomedical Named-entity (Recognition) Ontology with a Large, Annotated Corpus Reveals Meaningful Associations Through Text Embedding

10.1101/2020.11.05.368969 ◽

2020 ◽

Author(s):

Kanix Wang ◽

Robert Stevens ◽

Halima Alachram ◽

Yu Li ◽

Larisa Soldatova ◽

...

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Analysis Tool ◽

Automated Extraction ◽

Named Entities ◽

Named Entity ◽

Automated Knowledge ◽

Biomedical Texts ◽

Machine Reading ◽

Biomedical Named Entity Recognition

Machine reading is essential for unlocking valuable knowledge contained in the millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in machine reading have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in machine reading methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named-entity analysis tool for biomedicine: (a) a new Named-Entity Recognition Ontology (NERO) developed specifically for describing entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named-entity classes; (c) pictographs for all named entities, to ease the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named-entity recognition extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Conditional Random Fields for Biomedical Named Entity Recognition Revisited

10.21203/rs.3.rs-36431/v1 ◽

2020 ◽

Author(s):

Xie-Yuan Xie

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biomedical Domain ◽

Minimal Set ◽

Named Entities ◽

Named Entity ◽

Biomedical Texts ◽

Biomedical Named Entity Recognition

Abstract: Named entity recognition (NER) is a key task that automatically extracts named entities (NEs) from text. Names of persons and places, dates, and times are examples of NEs. We apply conditional random fields (CRFs) to NER in the biomedical domain. Examples of NEs in biomedical texts are genes and proteins. We used a minimal set of features to train the CRF algorithm and obtained good results for biomedical texts.
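The abstract does not list which features were used, so the following is only an illustrative sketch of what a small per-token feature set for a CRF tagger might look like. The feature names, the `<BOS>`/`<EOS>` padding markers, and the example sentence are all assumptions made for illustration; the output would still need to be passed to a separate CRF trainer.

```python
def token_features(tokens, i):
    """Return a small, hypothetical feature dict for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),                  # normalized surface form
        "suffix3": word[-3:].lower(),           # last three characters
        "is_capitalized": word[:1].isupper(),   # first character is upper case
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        # Neighboring words give the tagger local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Build one feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Illustrative sentence (not taken from the paper).
sentence = ["The", "BRCA1", "gene", "encodes", "a", "protein", "."]
features = sentence_features(sentence)
```

Gene and protein names often mix capitals, digits, and hyphens, which is why orthographic features of this kind are a common starting point for biomedical taggers.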

Improving biomedical named entity recognition with syntactic information

BMC Bioinformatics ◽

10.1186/s12859-020-03834-6 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Yuanhe Tian ◽

Wang Shen ◽

Yan Song ◽

Fei Xia ◽

Min He ◽

...

Keyword(s):

Named Entity Recognition ◽

Model Performance ◽

Entity Recognition ◽

Input Word ◽

Named Entity ◽

Improve Model ◽

Syntactic Information ◽

Biomedical Texts ◽

Improve Model Performance ◽

Biomedical Named Entity Recognition

Abstract: Background: Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, which can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address the challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that auto-processed syntactic information can be a useful resource to improve model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Such syntactic information is therefore leveraged in an inflexible way, so inaccurate syntactic information may hurt model performance. Results: In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks (KVMN) to incorporate auto-processed syntactic information. We evaluate BioKMNER on six English biomedical datasets, where our method with KVMN outperforms the strong baseline method, namely, BioBERT, from the previous study on all datasets. Specifically, the F1 scores of our best performing model are 85.29% on BC2GM, 77.83% on JNLPBA, 94.22% on BC5CDR-chemical, 90.08% on NCBI-disease, 89.24% on LINNAEUS, and 76.33% on Species-800, where state-of-the-art performance is obtained on four of them (i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800). Conclusion: The experimental results on six English benchmark datasets demonstrate that auto-processed syntactic information can be a useful resource for BioNER and that our method with KVMN can appropriately leverage such information to improve model performance.

A System for Identifying Named Entities in Biomedical Text: How Results from Two Evaluations Reflect on Both the System and the Evaluations

Comparative and Functional Genomics ◽

10.1002/cfg.457 ◽

2005 ◽

Vol 6 (1-2) ◽

pp. 77-85 ◽

Cited By ~ 17

Author(s):

Shipra Dingare ◽

Malvina Nissim ◽

Jenny Finkel ◽

Christopher Manning ◽

Claire Grover

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Rapid Adaptation ◽

Exact Match ◽

Web Searches ◽

Named Entities ◽

Data Annotation ◽

Knowledge Resources ◽

Named Entity ◽

Biomedical Named Entity Recognition

We present a maximum entropy-based system for identifying named entities (NEs) in biomedical abstracts and present its performance in the only two biomedical named entity recognition (NER) comparative evaluations that have been held to date, namely BioCreative and Coling BioNLP. Our system obtained an exact match F-score of 83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. We discuss our system in detail, including its rich use of local features, attention to correct boundary identification, innovative use of external knowledge resources, including parsing and web searches, and rapid adaptation to new NE sets. We also discuss in depth the problems with data annotation in the evaluations, which caused the final performance to be lower than optimal.
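As a rough illustration of what an exact match F-score measures, the sketch below counts a predicted entity as correct only when its boundaries and type both match a gold entity exactly. The `(start, end, type)` tuple representation and the example spans are assumptions for illustration, and the official BioCreative and BioNLP scorers may differ in detail.

```python
def exact_match_scores(gold, predicted):
    """Compute precision, recall, and F-score over exactly matching entities.

    Each entity is a (start, end, type) tuple; a prediction is a true
    positive only if the identical tuple appears in the gold set.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical spans: the second prediction is off by one token at its
# right boundary, so it counts as both a false positive and a false negative.
gold = [(0, 2, "gene"), (5, 6, "protein"), (9, 12, "gene")]
pred = [(0, 2, "gene"), (5, 7, "protein"), (9, 12, "gene")]

precision, recall, f_score = exact_match_scores(gold, pred)
```

Under this criterion a single boundary error is penalized twice, which is one reason boundary identification and annotation consistency weigh so heavily on reported scores.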

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Data ◽

10.3390/data3040053 ◽

2018 ◽

Vol 3 (4) ◽

pp. 53 ◽

Cited By ~ 1

Author(s):

Maria Mitrofan ◽

Verginica Barbu Mititelu ◽

Grigorina Mitrofan

Keyword(s):

Language Processing ◽

Gold Standard ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Resources ◽

Named Entities ◽

Named Entity ◽

Pos Tagging ◽

Part Of Speech ◽

Biomedical Named Entity Recognition

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes, and endocrinology), extracted from books, journals, and blog posts. To automatically annotate the corpus with POS tags, we used a Romanian tag set that has 715 labels, while labels for diseases, anatomy, procedures, and chemicals and drugs were manually annotated for bioNER with a Cohen's kappa coefficient of 92.8%, revealing 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
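Cohen's kappa corrects raw inter-annotator agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement derived from each annotator's label distribution. The sketch below shows the calculation on a toy example; the label names and the two annotation sequences are hypothetical and are not taken from MoNERo.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical token-level labels from two annotators (they disagree on item 5).
annotator_1 = ["DISO", "O", "CHEM", "O", "ANAT", "O", "DISO", "O", "O", "PROC"]
annotator_2 = ["DISO", "O", "CHEM", "O", "O", "O", "DISO", "O", "O", "PROC"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 9 of 10 items (p_o = 0.9) and the chance agreement is 0.36, so kappa comes out lower than the raw agreement rate.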

ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition

Complexity ◽

10.1155/2021/6633213 ◽

2021 ◽

Vol 2021 ◽

pp. 1-6

Author(s):

Nada Boudjellal ◽

Huaping Zhang ◽

Asif Khan ◽

Arshad Ahmad ◽

Rashid Naseem ◽

...

Keyword(s):

Named Entity Recognition ◽

Model Performance ◽

Recognition Task ◽

Entity Recognition ◽

Small Scale ◽

Text Data ◽

Named Entities ◽

Named Entity ◽

Textual Data ◽

Biomedical Named Entity Recognition

The web is being loaded daily with a huge volume of data, mainly unstructured textual data, which significantly increases the need for information extraction and NLP systems. The named-entity recognition task is a key step towards efficiently understanding text data and saving time and effort. Being a widely used language globally, English is taking over most of the research conducted in this field, especially in the biomedical domain. Unlike other languages, Arabic suffers from a lack of resources. This work presents a BERT-based model to identify biomedical named entities in Arabic text data (specifically disease and treatment named entities) and investigates the effectiveness of pretraining a monolingual BERT model with a small-scale biomedical dataset on enhancing the model's understanding of Arabic biomedical text. The model's performance was compared with two state-of-the-art models (namely, AraBERT and multilingual BERT cased), and it outperformed both models with an 85% F1-score.

Improving Biomedical Named Entity Recognition with Syntactic Information

10.21203/rs.3.rs-21994/v1 ◽

2020 ◽

Author(s):

Yuanhe Tian ◽

Wang Shen ◽

Yan Song ◽

Fei Xia ◽

Min He ◽

...

Keyword(s):

Domain Knowledge ◽

Large Scale ◽

State Of The Art ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Named Entity ◽

Syntactic Information ◽

Biomedical Texts ◽

Biomedical Named Entity Recognition

Abstract: Background: Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts. The task can be challenging due to the lack of large-scale labeled training data and domain knowledge. Previous studies have shown that syntactic information can be useful for named entity recognition; however, most of them fail to weigh that information with respect to its contribution, as they treat the syntactic information as gold reference. Results: In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks to incorporate syntactic information, which is extracted from syntactic structures automatically generated by existing toolkits. Our approach outperforms baselines without memories and achieves new state-of-the-art results on four biomedical datasets compared with previous studies, i.e., 85.67% on BC2GM, 94.22% on BC5CDR-chemical, 90.11% on NCBI-disease, and 76.33% on Species-800. Conclusion: Experimental results on four benchmark datasets demonstrate the effectiveness of our method, where state-of-the-art performance is achieved on all of them.

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary, applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
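A reader may notice that the harmonic mean of the reported mean precision (0.808) and mean recall (0.722) is about 0.763, not the reported F1-score of 0.733. One plausible reading, which the abstract does not confirm, is that the F1-score was averaged over entity classes rather than computed from the averaged precision and recall. The sketch below shows how the two aggregation orders can give different numbers; the per-class values and class names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 computed directly from the mean precision and recall in the abstract.
f1_from_reported_means = f1(0.808, 0.722)

# Hypothetical per-class (precision, recall) pairs, for illustration only.
per_class = {
    "PERSON": (0.9, 0.9),
    "PLACE": (0.9, 0.5),
    "LICENSE_PLATE": (0.6, 0.8),
}

n_classes = len(per_class)
mean_precision = sum(p for p, _ in per_class.values()) / n_classes
mean_recall = sum(r for _, r in per_class.values()) / n_classes

# Option 1: F1 of the averaged precision and recall.
f1_of_means = f1(mean_precision, mean_recall)
# Option 2: average of the per-class F1 scores (macro-averaged F1).
mean_of_f1s = sum(f1(p, r) for p, r in per_class.values()) / n_classes
```

In this toy example the two options give different values, so a reported F1 need not equal the harmonic mean of the reported precision and recall.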

TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus

Language Resources and Evaluation ◽

10.1007/s10579-020-09516-2 ◽

2021 ◽

Cited By ~ 1

Author(s):

Elena Álvarez-Mellado ◽

María Luisa Díez-Platas ◽

Pablo Ruiz-Fabo ◽

Helena Bermúdez ◽

Salvador Ros ◽

...

Keyword(s):

Historical Data ◽

Named Entity Recognition ◽

Rich Source ◽

Entity Recognition ◽

Historical Evidence ◽

Annotation Scheme ◽

Named Entities ◽

General Domain ◽

Named Entity ◽

Entity Annotation

Abstract: Medieval documents are a rich source of historical data. Performing named-entity recognition (NER) on this genre of texts can provide us with valuable historical evidence. However, traditional NER categories and schemes are usually designed with modern documents in mind (i.e., journalistic text), and the general-domain NER annotation schemes fail to capture the nature of medieval entities. In this paper we explore the challenges of performing named-entity annotation on a corpus of Spanish medieval documents: we discuss the mismatches that arise when applying traditional NER categories to such a corpus, and we propose a novel humanist-friendly, TEI-compliant annotation scheme and guidelines intended to capture the particular nature of medieval entities.

Biomedical Named Entity Recognition with Tri-Training Learning

2009 2nd International Conference on Biomedical Engineering and Informatics ◽

10.1109/bmei.2009.5304799 ◽

2009 ◽

Cited By ~ 3

Author(s):

YueHong Cai ◽

XianYi Cheng

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Biomedical Named Entity Recognition
