Improving classification of low-resource COVID-19 literature by using Named Entity Recognition

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) Clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named-entity recognition (NER) to improve the classification. We processed the literature with OntoGene’s Biomedical Entity Recogniser (OGER) and used the identified Named Entities (NEs) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the origin of an NE was useful for classifying document types, and its type was useful for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. To estimate this benefit accurately, further experiments with a larger dataset would be needed.
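The idea of feeding entity information to the classifier alongside the text can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `build_features` helper, the tuple layout of `entities`, and the example type and origin values are all assumptions, and they do not reproduce OGER's actual output format.

```python
from collections import Counter


def build_features(text, entities):
    """Combine bag-of-words counts with counts of entity types and origins.

    `entities` is a hypothetical list of (surface, entity_type, origin_db)
    tuples; it is not OGER's real output format.
    """
    # Baseline features: lowercase whitespace tokens.
    features = Counter(f"word={tok}" for tok in text.lower().split())
    # Extra features: one count per entity type and per source database.
    for _surface, entity_type, origin_db in entities:
        features[f"ne_type={entity_type}"] += 1
        features[f"ne_origin={origin_db}"] += 1
    return dict(features)


# Illustrative values only.
example = build_features(
    "Remdesivir trial in hospitalised patients",
    [("Remdesivir", "chemical", "CHEBI"), ("patients", "organism", "NCBITaxon")],
)
print(example)
```

The resulting dictionary could be passed to any classifier that accepts sparse named features; dropping the `ne_type=` and `ne_origin=` keys gives the text-only baseline for comparison.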

A Probability based Classification of Named Entities for Malayalam Language combining Word, Part of Speech and Lexicalized features

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.a1968.078219 ◽ 2019 ◽ Vol 8 (2) ◽ pp. 839-842

Keyword(s): Named Entity Recognition ◽ Entity Recognition ◽ Supervised Machine Learning ◽ Named Entities ◽ Named Entity ◽ Domain Specific ◽ Part Of Speech ◽ Classification Probability ◽ Malayalam Language

Named Entity Recognition is the process of identifying named entities, which are the designators of a sentence. Designators of a sentence are domain specific. The proposed system identifies named entities in the Malayalam language belonging to the tourism domain, which generally include names of persons, places, organizations, dates, etc. The system uses word, part-of-speech and lexicalized features to find the probability of a word belonging to a named entity category and to classify it accordingly. The probability is calculated by supervised machine learning from the word and part-of-speech features present in a tagged training corpus, combined with rules based on lexicalized features.
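The abstract does not give the probability formula, so the following is only a sketch of one simple way such a classifier could work: relative-frequency estimates of a category given the word and its part-of-speech tag, backing off to the tag alone for unseen words. The function names, the back-off strategy, and the toy corpus (transliterated place names with invented labels) are assumptions.

```python
from collections import Counter, defaultdict


def train(tagged_corpus):
    """Count category labels per (word, pos) pair and per pos tag."""
    by_word_pos = defaultdict(Counter)
    by_pos = defaultdict(Counter)
    for word, pos, label in tagged_corpus:
        by_word_pos[(word, pos)][label] += 1
        by_pos[pos][label] += 1
    return by_word_pos, by_pos


def label_probabilities(word, pos, model):
    """Estimate P(label | word, pos), backing off to P(label | pos)."""
    by_word_pos, by_pos = model
    counts = by_word_pos.get((word, pos)) or by_pos.get(pos) or Counter()
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy tagged corpus, invented for illustration.
corpus = [
    ("Kochi", "NNP", "PLACE"),
    ("Kochi", "NNP", "PLACE"),
    ("Kochi", "NNP", "ORG"),
    ("Munnar", "NNP", "PLACE"),
    ("visit", "VB", "O"),
]
model = train(corpus)
seen = label_probabilities("Kochi", "NNP", model)
unseen = label_probabilities("Alappuzha", "NNP", model)
print(seen, unseen)
```

A word is then assigned the category with the highest estimated probability; rules based on lexicalized features, as the abstract describes, would be applied on top of these estimates.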

Indirectly Named Entity Recognition

Journal of Computer-Assisted Linguistic Research ◽ 10.4995/jclr.2021.15922 ◽ 2021 ◽ Vol 5 (1) ◽ pp. 27-46

Author(s): Alexis Kauffmann ◽ François-Claude Rey ◽ Iana Atanassova ◽ Arnaud Gaudinat ◽ Peter Greenfield ◽ ...

Keyword(s): Language Processing ◽ Named Entity Recognition ◽ Entity Recognition ◽ Future Research ◽ Proof Of Concept ◽ Future Perspectives ◽ Named Entities ◽ Multiword Expressions ◽ Named Entity ◽ French Texts

We define indirectly named entities as multiword expressions that refer to known named entities by means of periphrasis. While named entity recognition is a classical task in natural language processing, little attention has been paid to indirectly named entities and their treatment. In this paper, we try to address this gap, describing issues related to the detection and understanding of indirectly named entities in texts. We introduce a proof of concept for retrieving both lexicalised and non-lexicalised indirectly named entities in French texts. We also show example cases where this proof of concept is applied, and discuss future perspectives. We have initiated the creation of a first lexicon of 712 indirectly named entity entries that is available for future research.
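For the lexicalised case, retrieval can be pictured as a lexicon lookup. The sketch below is an assumption about how such a lookup might be written, not the authors' proof of concept; the two-entry `LEXICON` and the function name are hypothetical and do not come from the paper's 712-entry lexicon.

```python
# Hypothetical mini-lexicon mapping a periphrasis to the entity it denotes.
LEXICON = {
    "la ville lumière": "Paris",
    "l'hexagone": "France",
}


def find_indirect_entities(text, lexicon=LEXICON):
    """Return (offset, matched text, referent) for each lexicon hit."""
    lowered = text.lower()
    hits = []
    for expression, referent in lexicon.items():
        start = lowered.find(expression)
        while start != -1:
            matched = text[start:start + len(expression)]
            hits.append((start, matched, referent))
            start = lowered.find(expression, start + 1)
    return sorted(hits)


hits = find_indirect_entities(
    "La Ville Lumière attire des visiteurs de tout l'Hexagone."
)
print(hits)
```

Non-lexicalised expressions cannot be handled by a fixed lookup of this kind, which is why the abstract treats them as a separate problem.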

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽ 10.3390/data6070071 ◽ 2021 ◽ Vol 6 (7) ◽ pp. 71

Author(s): Gonçalo Carnaz ◽ Mário Antunes ◽ Vitor Beires Nogueira

Keyword(s): Machine Learning ◽ Language Processing ◽ Named Entity Recognition ◽ Entity Recognition ◽ Automatic Identification ◽ Named Entities ◽ Related Data ◽ Named Entity ◽ Chain Of Custody ◽ Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
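The reported F1-score (0.733) is lower than the harmonic mean of the reported mean precision and mean recall (about 0.763). One possible explanation, which is an assumption and is not stated in the abstract, is that F1 was computed per entity class and then averaged. A minimal sketch with invented per-class values shows how the two quantities can differ:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented per-class (precision, recall) pairs, for illustration only.
per_class = {"PERSON": (0.9, 0.9), "LICENSE_PLATE": (0.7, 0.3)}

n_classes = len(per_class)
mean_precision = sum(p for p, _ in per_class.values()) / n_classes
mean_recall = sum(r for _, r in per_class.values()) / n_classes

# Average of per-class F1 scores (macro-averaged F1).
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / n_classes
# F1 computed from the already averaged precision and recall.
f1_of_means = f1(mean_precision, mean_recall)

print(round(macro_f1, 3), round(f1_of_means, 3))
# Harmonic mean of the figures reported in the abstract.
print(round(f1(0.808, 0.722), 3))
```

With these toy values the averaged per-class F1 is lower than the F1 of the averaged precision and recall, the same direction as the gap in the reported figures.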

TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus

Language Resources and Evaluation ◽ 10.1007/s10579-020-09516-2 ◽ 2021 ◽ Cited By ~ 1

Author(s): Elena Álvarez-Mellado ◽ María Luisa Díez-Platas ◽ Pablo Ruiz-Fabo ◽ Helena Bermúdez ◽ Salvador Ros ◽ ...

Keyword(s): Historical Data ◽ Named Entity Recognition ◽ Rich Source ◽ Entity Recognition ◽ Historical Evidence ◽ Annotation Scheme ◽ Named Entities ◽ General Domain ◽ Named Entity ◽ Entity Annotation

Medieval documents are a rich source of historical data. Performing named-entity recognition (NER) on this genre of texts can provide us with valuable historical evidence. However, traditional NER categories and schemes are usually designed with modern documents in mind (i.e. journalistic text) and the general-domain NER annotation schemes fail to capture the nature of medieval entities. In this paper we explore the challenges of performing named-entity annotation on a corpus of Spanish medieval documents: we discuss the mismatches that arise when applying traditional NER categories to a corpus of Spanish medieval documents and we propose a novel humanist-friendly TEI-compliant annotation scheme and guidelines intended to capture the particular nature of medieval entities.

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽ 10.1162/coli_a_00178 ◽ 2014 ◽ Vol 40 (2) ◽ pp. 469-510 ◽ Cited By ~ 62

Author(s): Khaled Shaalan

Keyword(s): Language Processing ◽ Named Entity Recognition ◽ Relevant Information ◽ Arabic Language ◽ Entity Recognition ◽ Named Entities ◽ Linguistic Resources ◽ Named Entity ◽ To Receive ◽ Made In

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Improving the Classification of Q&A Content for Android Fragmentation Using Named Entity Recognition

Progress in Artificial Intelligence - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-30244-3_60 ◽ 2019 ◽ pp. 731-743

Author(s): Adriano Mendonça Rocha ◽ Marcelo de Almeida Maia

Keyword(s): Named Entity Recognition ◽ Entity Recognition ◽ Named Entity

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

npj Systems Biology and Applications ◽ 10.1038/s41540-021-00200-x ◽ 2021 ◽ Vol 7 (1)

Author(s): Kanix Wang ◽ Robert Stevens ◽ Halima Alachram ◽ Yu Li ◽ Larisa Soldatova ◽ ...

Keyword(s): Named Entity Recognition ◽ Entity Recognition ◽ Analysis Tool ◽ Automated Extraction ◽ Named Entities ◽ Named Entity ◽ Automated Knowledge ◽ Biomedical Texts ◽ Machine Reading ◽ Biomedical Named Entity Recognition

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Recursively Binary Modification Model for Nested Named Entity Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i05.6329 ◽ 2020 ◽ Vol 34 (05) ◽ pp. 8164-8171

Author(s): Bing Li ◽ Shifeng Liu ◽ Yifang Sun ◽ Wei Wang ◽ Xiang Zhao

Keyword(s): Strong Evidence ◽ State Of The Art ◽ Named Entity Recognition ◽ Bayesian Framework ◽ Entity Recognition ◽ Named Entities ◽ Named Entity ◽ Nested Structures ◽ Benchmark Datasets ◽ Head Component

Recently, there has been an increasing interest in identifying named entities with nested structures. Existing models only make independent typing decisions on the entire entity span while ignoring strong modification relations between sub-entity types. In this paper, we present a novel Recursively Binary Modification model for nested named entity recognition. Our model utilizes the modification relations among sub-entity types to infer the head component on top of a Bayesian framework and uses the entity head as strong evidence to determine the type of the entity span. The process is recursive, allowing lower-level entities to help better model those on the outer level. To the best of our knowledge, our work is the first effort that uses modification relations in the nested NER task. Extensive experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art models in nested NER tasks, and delivers competitive results with state-of-the-art models in the flat NER task, without relying on any extra annotations or NLP tools.

Named Entity Recognition and transliteration in Bengali

Lingvisticae Investigationes ◽ 10.1075/li.30.1.07ekb ◽ 2007 ◽ Vol 30 (1) ◽ pp. 95-114 ◽ Cited By ~ 13

Author(s): Asif Ekbal ◽ Sudip Kumar Naskar ◽ Sivaji Bandyopadhyay

Keyword(s): Channel Model ◽ Named Entity Recognition ◽ Entity Recognition ◽ Linguistic Knowledge ◽ Linguistic Features ◽ Named Entities ◽ Named Entity ◽ The Third ◽ News Corpus ◽ Model C

The paper reports on the development of a Named Entity Recognition (NER) system in Bengali using a tagged Bengali news corpus and the subsequent transliteration of the recognized Bengali Named Entities (NEs) into English. Three different models of the NER system have been developed. A semi-supervised learning method has been adopted to develop the first two models, one without linguistic features (Model A) and the other with linguistic features (Model B). The third one (Model C) is based on a statistical Hidden Markov Model. A modified joint source-channel model has been used along with a number of alternatives to generate the English transliterations of Bengali NEs and vice-versa. The transliteration models learn the mappings from the bilingual training sets, optionally guided by linguistic knowledge in the form of conjuncts and diphthongs in Bengali and their representations in English. The NER system achieved its highest average Recall, Precision and F-Score values, 89.62%, 78.67% and 83.79% respectively, with Model C. Evaluation of the proposed transliteration models demonstrated that the modified joint source-channel model performs best in terms of evaluation metrics for person and location names for both Bengali to English (B2E) and English to Bengali (E2B) transliteration. The use of linguistic knowledge during training of the transliteration models improves performance.
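As a quick consistency check on the Model C figures, and assuming the F-Score is the balanced harmonic mean of precision and recall (the abstract does not say which variant was used), the reported values agree:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 78.67 \times 89.62}{78.67 + 89.62}
  = \frac{14100.81}{168.29}
  \approx 83.79\%
\]
```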

Using Nanoinformatics Methods for Automatically Identifying Relevant Nanotoxicology Entities from the Literature

BioMed Research International ◽ 10.1155/2013/410294 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-9 ◽ Cited By ~ 12

Author(s): Miguel García-Remesal ◽ Alejandro García-Ruiz ◽ David Pérez-Rey ◽ Diana de la Iglesia ◽ Víctor Maojo

Keyword(s): Scientific Literature ◽ Named Entity Recognition ◽ Research Field ◽ Entity Recognition ◽ Toxic Effects ◽ Proof Of Concept ◽ Named Entity ◽ Potential Applications ◽ Dependent Tasks ◽ Fold Cross Validation

Nanoinformatics is an emerging research field that uses informatics techniques to collect, process, store, and retrieve data, information, and knowledge on nanoparticles, nanomaterials, and nanodevices and their potential applications in health care. In this paper, we focus on the solutions that nanoinformatics can provide to facilitate nanotoxicology research. To this end, we have taken a computational approach to automatically recognize and extract nanotoxicology-related entities from the scientific literature. The desired entities belong to four different categories: nanoparticles, routes of exposure, toxic effects, and targets. The entity recognizer was trained using a corpus that we specifically created for this purpose and was validated by two nanomedicine/nanotoxicology experts. We evaluated the performance of our entity recognizer using 10-fold cross-validation. Precision values range from 87.6% (targets) to 93.0% (routes of exposure), while recall values range from 82.6% (routes of exposure) to 87.4% (toxic effects). These results show the feasibility of using computational approaches to reliably perform different named entity recognition (NER)-dependent tasks, such as augmented reading or semantic searches. This research is a “proof of concept” that can be expanded to stimulate further developments that could assist researchers in managing data, information, and knowledge at the nanolevel, thus accelerating research in nanomedicine.
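The 10-fold cross-validation mentioned above partitions the annotated corpus into ten folds and uses each fold once for testing while training on the other nine. The sketch below illustrates only the splitting step; the function name is hypothetical, the corpus size of 25 is invented, and in practice the items would normally be shuffled first.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# Illustrative corpus of 25 annotated items.
folds = list(k_fold_indices(25, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Per-fold precision and recall would then be computed on each test fold and summarized across the ten runs.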
