scholarly journals Improving clinical named entity recognition in Chinese using the graphical and phonetic feature

2019 ◽  
Vol 19 (S7) ◽  
Author(s):  
Yifei Wang ◽  
Sophia Ananiadou ◽  
Jun’ichi Tsujii

Abstract Background Clinical Named Entity Recognition is to find the name of diseases, body parts and other related terms from the given text. Because Chinese language is quite different with English language, the machine cannot simply get the graphical and phonetic information form Chinese characters. The method for Chinese should be different from that for English. Chinese characters present abundant information with the graphical features, recent research on Chinese word embedding tries to use graphical information as subword. This paper uses both graphical and phonetic features to improve Chinese Clinical Named Entity Recognition based on the presence of phono-semantic characters. Methods This paper proposed three different embedding models and tested them on the annotated data. The data have been divided into two sections for exploring the effect of the proportion of phono-semantic characters. Results The model using primary radical and pinyin can improve Clinical Named Entity Recognition in Chinese and get the F-measure of 0.712. More phono-semantic characters does not give a better result. Conclusions The paper proves that the use of the combination of graphical and phonetic features can improve the Clinical Named Entity Recognition in Chinese.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Irene Pérez-Díez ◽  
Raúl Pérez-Moraga ◽  
Adolfo López-Cerdán ◽  
Jose-Maria Salinas-Serrano ◽  
María de la Iglesia-Vayá

Abstract Background Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although currently there are several anonymization strategies for the English language, they are also language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages. Results We tested 4 neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. Alongside, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. Conclusions The strategy proposed, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports. It does not require a big training corpus, thus it could be easily extended to other languages and medical texts, such as electronic health records.


2020 ◽  
Vol 34 (05) ◽  
pp. 9274-9281
Author(s):  
Qianhui Wu ◽  
Zijia Lin ◽  
Guoxin Wang ◽  
Hui Chen ◽  
Börje F. Karlsson ◽  
...  

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer from source-learned model to a target language, in this paper, we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that could fast adapt to the given test case and propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.


Author(s):  
Edgar Casasola Murillo ◽  
Raquel Fonseca

Abstract: One of the major consequences of the growth of social networks has been the generation of huge volumes of content. The text that is generated in social networks constitutes a new type of content, that is short, informal, lacking grammar in some cases, and noise prone. Given the volume of information that is produced every day, a manual processing of this data is unpractical, causing the need of exploring and applying automatic processing strategies, like Entity Recognition (ER). It becomes necessary to evaluate the performance of traditional ER algorithms in corpus with those characteristics. This paper presents the results of applying AlchemyAPI y Dandelion API algorithms in a corpus provided by The SemEval-2015 Aspect Based Sentiment Analysis Conference. The entities recognized by each algorithm were compared against the ones annotated in the collection in order to calculate their precision and recall. Dandelion API got better results than AlchemyAPI with the given corpus.  Spanish Abstract: Una de las principales consecuencias del auge actual de las redes sociales es la generación de grandes volúmenes de información. El texto generado en estas redes corresponde a un nuevo género de texto: corto, informal, gramaticalmente deficiente y propenso a ruido. Debido a la tasa de producción de la información, el procesamiento manual resulta poco práctico, surgiendo así la necesidad de aplicar estrategias de procesamiento automático, como Reconocimiento de Entidades (RE). Debido a las características del contenido, surge además la necesidad de evaluar el desempeño de los algoritmos tradicionales, en corpus extraídos de estas redes sociales. Este trabajo presenta los resultados obtenidos al aplicar los algoritmos de AlchemyAPI y Dandelion API en un corpus provisto por la conferencia The SemEval-2015 Aspect Based Sentiment Analysis. Las entidades reconocidas por cada algoritmo fueron comparadas con las anotadas en la colección, para calcular su precisión y exhaustividad. Dandelion API obtuvo mejores resultados que AlchemyAPI en el corpus dado.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Qiuli Qin ◽  
Shuang Zhao ◽  
Chunmei Liu

Because of difficulty processing the electronic medical record data of patients with cerebrovascular disease, there is little mature recognition technology capable of identifying the named entity of cerebrovascular disease. Excellent research results have been achieved in the field of named entity recognition (NER), but there are several problems in the pre processing of Chinese named entities that have multiple meanings, of which neglecting the combination of contextual information is one. Therefore, to extract five categories of key entity information for diseases, symptoms, body parts, medical examinations, and treatment in electronic medical records, this paper proposes the use of a BERT-BiGRU-CRF named entity recognition method, which is applied to the field of cerebrovascular diseases. The BERT layer first converts the electronic medical record text into a low-dimensional vector, then uses this vector as the input to the BiGRU layer to capture contextual features, and finally uses conditional random fields (CRFs) to capture the dependency between adjacent tags. The experimental results show that the F1 score of the model reaches 90.38%.


2020 ◽  
Author(s):  
Christian E. Lopez ◽  
Caleb Gallemore

Abstract We present an openly available dataset to facilitate researchers’ exploration of popular discourse about the COVID-19 pandemic. The dataset, whose collection is ongoing, currently consists of over 780 million tweets, from all over the world, in multiple languages. Tweets start from 22 January 2020, when the total cases of reported COVID-19 were below 600 worldwide. The dataset was collected using the Twitter API and by rehydrating tweets from another openly available database. To facilitate access for other researchers, the English-language tweet data has been augmented by state-of-the-art Twitter sentiment and named entity recognition algorithms. The dataset and the summary files we provide allow researchers to avoid some computationally intensive analyses, facilitating more widespread use of social media data to gain insights on issues such as (mis)information diffusion, semantic networks, sentiment, and the evolution of COVID-19 discussions. The insights extracted from such analyses could help inform policy and advocacy work amid the current and future pandemics.


2020 ◽  
Vol 34 (10) ◽  
pp. 13921-13922
Author(s):  
Chan Hee Song ◽  
Arijit Sehanobish

Most Named Entity Recognition (NER) systems use additional features like part-of-speech (POS) tags, shallow parsing, gazetteers, etc. Adding these external features to NER systems have been shown to have a positive impact. However, creating gazetteers or taggers can take a lot of time and may require extensive data cleaning. In this work instead of using these traditional features we use lexicographic features of Chinese characters. Chinese characters are composed of graphical components called radicals and these components often have some semantic indicators. We propose CNN based models that incorporate this semantic information and use them for NER. Our models show an improvement over the baseline BERT-BiLSTM-CRF model. We present one of the first studies on Chinese OntoNotes v5.0 and show an improvement of + .64 F1 score over the baseline. We present a state-of-the-art (SOTA) F1 score of 71.81 on the Weibo dataset, show a competitive improvement of + 0.72 over baseline on the ResumeNER dataset, and a SOTA F1 score of 96.49 on the MSRA dataset.


Knowledge ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 1-25
Author(s):  
Michalis Mountantonakis ◽  
Yannis Tzitzikas

There is a high increase in approaches that receive as input a text and perform named entity recognition (or extraction) for linking the recognized entities of the given text to RDF Knowledge Bases (or datasets). In this way, it is feasible to retrieve more information for these entities, which can be of primary importance for several tasks, e.g., for facilitating manual annotation, hyperlink creation, content enrichment, for improving data veracity and others. However, current approaches link the extracted entities to one or few knowledge bases, therefore, it is not feasible to retrieve the URIs and facts of each recognized entity from multiple datasets and to discover the most relevant datasets for one or more extracted entities. For enabling this functionality, we introduce a research prototype, called LODsyndesisIE, which exploits three widely used Named Entity Recognition and Disambiguation tools (i.e., DBpedia Spotlight, WAT and Stanford CoreNLP) for recognizing the entities of a given text. Afterwards, it links these entities to the LODsyndesis knowledge base, which offers data enrichment and discovery services for millions of entities over hundreds of RDF datasets. We introduce all the steps of LODsyndesisIE, and we provide information on how to exploit its services through its online application and its REST API. Concerning the evaluation, we use three evaluation collections of texts: (i) for comparing the effectiveness of combining different Named Entity Recognition tools, (ii) for measuring the gain in terms of enrichment by linking the extracted entities to LODsyndesis instead of using a single or a few RDF datasets and (iii) for evaluating the efficiency of LODsyndesisIE.


2020 ◽  
Author(s):  
Irene Pérez-Díez ◽  
Raúl Pérez-Moraga ◽  
Adolfo López-Cerdán ◽  
Jose-Maria Salinas-Serrano ◽  
María de la Iglesia-Vayá

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although currently there are several anonymization strategies for the English language, they are also language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages. We tested 4 neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. Along-side, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. The strategy proposed, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports. It does not require a big training corpus, thus it can be easily extended to other languages and medical texts, such as electronic health records.


Sign in / Sign up

Export Citation Format

Share Document