entity identification
Recently Published Documents


TOTAL DOCUMENTS: 84 (FIVE YEARS: 21)
H-INDEX: 9 (FIVE YEARS: 3)

2021 ◽  
Vol 22 (S1) ◽  
Author(s):  
Renzo M. Rivera-Zavala ◽  
Paloma Martínez

Abstract Background The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when dealing with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish, and even more so for the biomedical domain. Methods In this work, we develop several biomedical Spanish word representations, and we introduce two deep learning approaches for recognizing pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts: one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture. Results Several Spanish biomedical embeddings together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds, and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification, and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drug, gene and protein, enzyme, and anatomy. The Bi-LSTM-CRF and BERT models obtain F-measures of 78.23% and 78.86%, respectively, on entity identification and classification on the CORD-19 dataset. Conclusion These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance. Moreover, contextualized representations help capture the complexity and ambiguity inherent in biomedical texts. Embeddings based on words, concepts, senses, etc., in languages other than English are required to improve NER tasks in those languages.
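
To make the Bi-LSTM-CRF setup concrete, below is a minimal sequence-labeling sketch in PyTorch. The vocabulary size, dimensions, and three-tag BIO inventory are illustrative assumptions rather than the authors' settings, and the CRF layer comes from the third-party pytorch-crf package; the extended models in the abstract additionally use richer embeddings, omitted here.

```python
# Minimal Bi-LSTM-CRF tagger sketch (illustrative, not the paper's model).
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                            batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, tokens):
        out, _ = self.lstm(self.embedding(tokens))
        return self.hidden2tag(out)

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(tokens), tags, mask=mask)

    def predict(self, tokens, mask):
        # Viterbi decoding: best tag sequence per sentence.
        return self.crf.decode(self._emissions(tokens), mask=mask)


# Toy usage: a batch of two 7-token sentences already mapped to integer ids.
# Tags would follow a BIO scheme, e.g. O=0, B-CHEMICAL=1, I-CHEMICAL=2.
model = BiLSTMCRFTagger(vocab_size=1000, num_tags=3)
tokens = torch.randint(1, 1000, (2, 7))
mask = torch.ones(2, 7, dtype=torch.bool)
print(model.predict(tokens, mask))
```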


Author(s):  
Yukun Jiang ◽  
Xin Gao ◽  
Wenxin Su ◽  
Jinrong Li

Construction safety standards (CSS) have knowledge characteristics, but few studies have introduced knowledge graphs (KG) as a tool for CSS management. To improve CSS knowledge management, this paper first analyzed the knowledge structure of 218 standards and identified three knowledge levels of CSS. Second, a concept layer was designed consisting of five levels of concepts and eight types of relationships. Third, an entity layer containing 147 entities was constructed via entity identification, attribute extraction, and entity extraction. Finally, 177 nodes and 11 types of attributes were collected, and the construction of a knowledge graph of construction safety standards (KGCSS) was completed using knowledge storage. Furthermore, we implemented knowledge inference and obtained CSS planning, i.e., the list of standard work plans used to guide the development and revision of CSS. In addition, we conducted CSS knowledge retrieval, a process which supports interrogative input. The construction of KGCSS thus facilitates the analysis, querying, and sharing of safety standards knowledge.
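
As an illustration of the layered design described above, here is a minimal sketch of storing a concept layer and an entity layer as a property graph with networkx. The standard, hazard, and relations are invented toy examples, not entries from the actual KGCSS.

```python
# Toy concept layer + entity layer stored as a property graph.
import networkx as nx

kg = nx.MultiDiGraph()

# Concept layer: node types and the relations allowed between them.
kg.add_node("Standard", layer="concept")
kg.add_node("Hazard", layer="concept")

# Entity layer: concrete entities with attributes, linked to their concepts.
kg.add_node("STD-001", layer="entity", title="Example scaffolding standard",
            year=2018)
kg.add_node("fall-from-height", layer="entity")
kg.add_edge("STD-001", "Standard", relation="instance_of")
kg.add_edge("fall-from-height", "Hazard", relation="instance_of")
kg.add_edge("STD-001", "fall-from-height", relation="addresses")

# Simple retrieval: which hazards does a given standard address?
hazards = [v for _, v, d in kg.out_edges("STD-001", data=True)
           if d["relation"] == "addresses"]
print(hazards)  # ['fall-from-height']
```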


2020 ◽  
Vol 9 (12) ◽  
pp. 712
Author(s):  
Agung Dewandaru ◽  
Dwi Hendratmo Widyantoro ◽  
Saiful Akbar

The geoparser is a fundamental component of a Geographic Information Retrieval (GIR) system; it performs toponym recognition, disambiguation, and geographic coordinate resolution from unstructured text. However, news articles that report several events across many place mentions in a document are not yet adequately handled by regular geoparsers, whose scope of resolution is either toponym-level or document-level. The capacity to detect multiple events and geolocate their true coordinates along with their numerical arguments is still missing from modern geoparsers, much less for Indonesian news corpora. We propose an event geoparser model with three stages of processing, which tightly integrates an event extraction model into geoparsing and provides a precise event-level resolution scope. The model casts geotagging and event extraction as sequence labeling and uses an LSTM-CRF inferencer equipped with features derived using an Aggregated Topic Model from a large corpus to increase generalizability. With the proposed workflow and features, the geoparser significantly improves the identification of pseudo-location entities, resulting in a 23.43% increase in weighted F1 score compared to baseline gazetteer and POS tag features. As a side effect of event extraction, various numerical arguments are also extracted, and the output is easily projected to a rich choropleth map from a single news document.
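
To illustrate how geotagging and event extraction can be cast jointly as sequence labeling, the toy example below shows a BIO-tagged Indonesian sentence and a small decoder that recovers typed spans. The tag inventory (EVENT/LOCATION/QUANTITY) is an assumed simplification of the paper's scheme, not its actual label set.

```python
# Toy joint geotagging + event extraction as BIO sequence labeling.
tokens = ["Banjir", "melanda", "Kota", "Bandung", ",", "500", "rumah",
          "terendam"]
# English gloss: "Floods hit the city of Bandung, 500 houses submerged."
tags = ["B-EVENT", "O", "B-LOCATION", "I-LOCATION", "O", "B-QUANTITY",
        "I-QUANTITY", "O"]

def spans(tokens, tags):
    """Group BIO tags back into (label, text) spans."""
    out, cur = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            cur = [tag[2:], [tok]]
            out.append(cur)
        elif tag.startswith("I-") and cur and cur[0] == tag[2:]:
            cur[1].append(tok)
        else:
            cur = None
    return [(label, " ".join(ws)) for label, ws in out]

print(spans(tokens, tags))
# [('EVENT', 'Banjir'), ('LOCATION', 'Kota Bandung'), ('QUANTITY', '500 rumah')]
```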


2020 ◽  
Author(s):  
Feichen Shen ◽  
Sijia Liu ◽  
Sunyang Fu ◽  
Yanshan Wang ◽  
Samuel Henry ◽  
...  

BACKGROUND As a risk factor for many diseases, family history captures both shared genetic variations and living environments among family members. Although several systems focus on family history extraction (FHE) using natural language processing (NLP) techniques, the evaluation protocol for such systems has not been standardized. OBJECTIVE The 2019 n2c2/OHNLP FHE track aims to encourage community efforts toward standard evaluation and system development for FHE from synthetic clinical narratives. METHODS We organized the first BioCreative/OHNLP FHE shared task in 2018 and continued it in 2019 in collaboration with the n2c2 and OHNLP consortia as the 2019 n2c2/OHNLP FHE track. The shared task is composed of two subtasks. Subtask 1 focuses on identifying family member entities and clinical observations (diseases); Subtask 2 expects the association of living status, side of the family, and clinical observations with family members to be extracted. Subtask 2 is an end-to-end task that builds on the results of Subtask 1. We manually curated the first de-identified corpus of clinical narratives from family history sections of clinical notes at Mayo Clinic Rochester, the content of which is highly relevant to patients' family history. RESULTS Seventeen teams from all over the world participated in the n2c2/OHNLP FHE shared task, submitting 38 runs for Subtask 1 and 21 runs for Subtask 2. For Subtask 1, the top three runs were generated by Harbin Institute of Technology; ezDI, Inc.; and the Medical University of South Carolina, with F1 scores of 0.8745, 0.8225, and 0.8130, respectively. For Subtask 2, the top three runs were from Harbin Institute of Technology; ezDI, Inc.; and the University of Florida, with F1 scores of 0.6810, 0.6586, and 0.6544, respectively. The workshop was held in conjunction with the AMIA 2019 Fall Symposium. A wide variety of methods were used by different teams in both subtasks, such as BERT, CNN, Bi-LSTM, CRF, SVM, and rule-based strategies. System performances show that relation extraction from family history is a more challenging task than entity identification. CONCLUSIONS We summarize the 2019 n2c2/OHNLP FHE shared task in this overview. For this task, we developed a corpus using de-identified family history data stored at Mayo Clinic. The corpus, along with the shared task, has encouraged participants internationally to develop FHE systems for understanding clinical narratives. We compared the performance of valid systems on two subtasks: entity identification and relation extraction. The optimal F1 scores for Subtask 1 and Subtask 2 were 0.8745 and 0.6810, respectively. We also observed that most of the typical errors made by the submitted systems relate to co-reference resolution. The corpus can be viewed as a valuable resource for researchers seeking to improve systems for family history analysis.
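
For readers unfamiliar with how such runs are scored, the sketch below computes the span-level precision, recall, and F1 typically used for entity identification; the gold and predicted annotations are invented toy data, not actual task submissions or the track's official scoring script.

```python
# Span-level precision/recall/F1 over exact-match entity annotations.
def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span + type matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Each annotation: (document id, start offset, end offset, entity type).
gold = [("doc1", 0, 6, "FamilyMember"), ("doc1", 25, 33, "Observation")]
pred = [("doc1", 0, 6, "FamilyMember"), ("doc1", 25, 31, "Observation")]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5): the second span's offsets differ
```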


2020 ◽  
Author(s):  
Youngjun Kim ◽  
Paul M Heider ◽  
Isabel R H Lally ◽  
Stéphane M Meystre

BACKGROUND Family history information is important for assessing the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision-making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 n2c2/OHNLP shared task. OBJECTIVE This task involves identifying mentions of family members and observations in electronic health record text notes and recognizing the relations between family members, observations, and living status. Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving performance on the two subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained two models: the first determined the living status of each family member, and the second identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of post-challenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models on the union of the two data sets. We also pre-trained language models using clinical notes from the MIMIC-III clinical database. RESULTS The voting ensemble achieved better performance than the individual classifiers. In the entity identification task, our best-performing system reached a precision of 78.90% and a recall of 83.84%. Our NLP system ranked 3rd on entity identification and 4th on relation extraction in the challenge. Our end-to-end pipeline substantially benefited from the combination of the two data sets: compared to our official submission, the revised system yielded significantly better performance (p < 0.05), with F1-scores of 86.02% and 72.48% for entity identification and relation extraction, respectively. CONCLUSIONS We demonstrated that a hybrid model can successfully extract family history information recorded in unstructured free-text notes. In this study, our approach of treating entity identification as a sequence labeling problem produced satisfactory results. Our post-challenge efforts significantly improved performance by leveraging additional labeled data and word vector representations learned from large collections of clinical notes.
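
The voting-ensemble step lends itself to a compact illustration: each tagger emits one tag per token, and the ensemble keeps the majority tag. The three model outputs below are invented toy predictions, not the team's actual system outputs.

```python
# Toy per-token majority voting over several taggers' predictions.
from collections import Counter

def vote(predictions):
    """predictions: one tag sequence per model, all of equal length."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

model_a = ["B-FamilyMember", "O", "B-Observation"]
model_b = ["B-FamilyMember", "O", "O"]
model_c = ["B-FamilyMember", "I-FamilyMember", "B-Observation"]
print(vote([model_a, model_b, model_c]))
# ['B-FamilyMember', 'O', 'B-Observation']
```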


Author(s):  
Agung Dewandaru ◽  
Dwi Hendratmo Widyantoro ◽  
Saiful Akbar

One of the most important components of a Geographic Information Retrieval (GIR) system is the geoparser, which performs toponym recognition, disambiguation, and geographic coordinate resolution from unstructured text. However, news articles that report several events across many place references mentioned in the document are not yet adequately modeled by regular geoparser types, whose scope of resolution is either toponym-level or document-level. The capacity to detect multiple events and geolocate their true locations and coordinates along with their numerical arguments is still missing from modern geoparsers, much less for Indonesian news corpora. We propose a novel type of event geoparser which integrates an ACE-based event extraction model and provides precise event-level scope resolution. The geoparser casts geotagging and event extraction as sequence labeling and uses a Conditional Random Field with a keyword feature obtained using an Aggregated Topic Model, a form of semantic exploration over a large corpus, which increases the generalizability of the model. The geoparser also uses a Smallest Administrative Level feature along with a Spatial Minimality-derived algorithm to improve the identification of pseudo-location entities, resulting in a 19.4% increase in weighted F1 score. As a side effect of event extraction, the geoparser also extracts various numerical arguments and is able to generate a thematic choropleth map from a single news story.
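
A feature-based CRF in this spirit can be sketched with sklearn-crfsuite, as below. The topic-keyword lexicon and administrative-level gazetteer are tiny invented stand-ins for the paper's Aggregated Topic Model keywords and Smallest Administrative Level feature, and the one-sentence training set is only to show the API shape.

```python
# Feature-based CRF geotagger sketch using sklearn-crfsuite.
import sklearn_crfsuite  # pip install sklearn-crfsuite

TOPIC_KEYWORDS = {"banjir", "gempa"}               # e.g. flood, earthquake
ADMIN_LEVEL = {"Kota": "city", "Desa": "village"}  # toy gazetteer

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "topic_kw": w.lower() in TOPIC_KEYWORDS,    # topic-model keyword cue
        "admin_level": ADMIN_LEVEL.get(w, "none"),  # administrative-level cue
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
    }

sents = [["Banjir", "melanda", "Kota", "Bandung"]]
labels = [["B-EVENT", "O", "B-LOC", "I-LOC"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```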

