Named Entity System for Tweets in Hindi Language

2018 ◽  
Vol 14 (4) ◽  
pp. 55-76 ◽  
Author(s):  
Arti Jain ◽  
Anuja Arora

Due to the growing need for smart-health applications in the Hindi language, there is rising demand for a health-related Named Entity Recognition (NER) system for Hindi. To this end, this research uses the Twitter social network to extract tweets dated 1st October 2016 to 15th October 2017 from Patanjali, Dabur and other Hindi language-oriented Twitter-based health sites, considering four NE types: Person, Disease, Consumable and Organization. To the best of the authors' knowledge, this Twitter dataset and set of NE types is one of the first such resources for the Hindi language. This article introduces a three-stage NER system for tweets in Hindi (the HinTwtNER system): a pre-processing stage; a machine learning stage (Hyperspace Analogue to Language (HAL) and Conditional Random Field (CRF)); and a post-processing stage. HinTwtNER uses binary features and achieves an overall F-score of 49.87%, which is comparable to Twitter-based NER systems for English and other languages.

Author(s):  
Erdenebileg Batbaatar ◽  
Keun Ho Ryu

Named Entity Recognition (NER) in the healthcare domain involves identifying and categorizing disease, drugs, and symptoms for biosurveillance, extracting their related properties and activities, and identifying adverse drug events appearing in texts. These tasks are important challenges in healthcare. Analyzing user messages in social media networks such as Twitter can provide opportunities to detect and manage public health events. Twitter provides a broad range of short messages that contain interesting information for information extraction. In this paper, we present a Health-Related Named Entity Recognition (HNER) task using healthcare-domain ontology that can recognize health-related entities from large numbers of user messages from Twitter. For this task, we employ a deep learning architecture which is based on a recurrent neural network (RNN) with little feature engineering. To achieve our goal, we collected a large number of Twitter messages containing health-related information, and detected biomedical entities from the Unified Medical Language System (UMLS). A bidirectional long short-term memory (BiLSTM) model learned rich context information, and a convolutional neural network (CNN) was used to produce character-level features. The conditional random field (CRF) model predicted a sequence of labels that corresponded to a sequence of inputs, and the Viterbi algorithm was used to detect health-related entities from Twitter messages. We provide comprehensive results giving valuable insights for identifying medical entities in Twitter for various applications. The BiLSTM-CRF model achieved a precision of 93.99%, recall of 73.31%, and F1-score of 81.77% for disease or syndrome HNER; a precision of 90.83%, recall of 81.98%, and F1-score of 87.52% for sign or symptom HNER; and a precision of 94.85%, recall of 73.47%, and F1-score of 84.51% for pharmacologic substance named entities. 
The ontology-based manual annotation results show that it is possible to perform high-quality annotation despite the complexity of medical terminology and the lack of context in tweets.
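
The CRF decoding step mentioned in the abstract above can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the function name `viterbi_decode`, the list-of-lists score layout, and the toy label set (0 = O, 1 = B, 2 = I) are assumptions made for illustration, and the emission scores stand in for whatever a BiLSTM layer would produce.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][j]   : score of label j at position t (assumed to come
                        from an upstream encoder such as a BiLSTM)
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    scores = list(emissions[0])          # best score ending in each label
    backpointers: List[List[int]] = []   # best previous label per step
    for t in range(1, len(emissions)):
        new_scores = []
        pointers = []
        for j in range(num_labels):
            # Pick the previous label that maximizes the path score into j.
            best_prev = max(range(num_labels),
                            key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j]
                              + emissions[t][j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best_last = max(range(num_labels), key=lambda j: scores[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: labels 0 = O, 1 = B, 2 = I; the O -> I transition is penalized.
toy_transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_emissions = [[0.1, 2.0, 0.0], [0.5, 0.0, 1.0], [2.0, 0.0, 0.1]]
print(viterbi_decode(toy_emissions, toy_transitions))
```

In this sketch the large negative O -> I score plays the role of a learned transition constraint, so a sequence cannot jump into the inside of an entity without a preceding B label.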


2019 ◽  
Vol 9 (1) ◽  
pp. 15 ◽  
Author(s):  
Runyu Fan ◽  
Lizhe Wang ◽  
Jining Yan ◽  
Weijing Song ◽  
Yingqian Zhu ◽  
...  

Constructing a knowledge graph of geological hazards literature can facilitate the reuse of that literature and provide a reference for geological hazard governance. Named entity recognition (NER), as a core technology for constructing a geological hazard knowledge graph, has to face the challenges that named entities in geological hazard literature are diverse in form, ambiguous in semantics, and uncertain in context. This makes it difficult to design practical features for NER classification. To address this problem, this paper proposes a deep learning-based NER model, namely the deep, multi-branch BiGRU-CRF model, which combines a multi-branch bidirectional gated recurrent unit (BiGRU) layer and a conditional random field (CRF) model. In an end-to-end, supervised process, the proposed model automatically learns and transforms features with the multi-branch bidirectional GRU layer and enhances the output with a CRF layer. Besides the model itself, we also propose a pattern-based corpus construction method to build the corpus needed for training it. Experimental results indicated that the proposed deep, multi-branch BiGRU-CRF model outperformed state-of-the-art models. The proposed model was used to construct a large-scale geological hazard literature knowledge graph containing 34,457 entity nodes and 84,561 relations.


Information ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 45 ◽  
Author(s):  
Shardrom Johnson ◽  
Sherlock Shen ◽  
Yuanchen Liu

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding that considers three aspects, namely character-embedding, word-embedding, and pos-embedding, stitched in the order given, and thus captures their dependencies. Based on it, we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) model for the Chinese NER task. Passing the comprehensive-embedding through the Bidirectional Long Short-Term Memory (BiLSTM) layer captures the connection between historical and future information, and the attention mechanism then captures the connection between the content of the sentence at the current position and that at any other location. Finally, we utilize a Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that the proposed CWPC_BiAtt model is well qualified for the NER task on the Microsoft Research Asia (MSRA) dataset and the Weibo NER corpus. High precision and recall were obtained, which verified the stability of the model. Position-embedding in the comprehensive-embedding can compensate for the attention mechanism by providing position information for the disordered sequence, which shows that the comprehensive-embedding has completeness. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. Our proposed CWPC_BiAtt model achieved the highest F-score, reaching state-of-the-art performance on the MSRA dataset and the Weibo NER corpus.


2019 ◽  
Vol 9 (18) ◽  
pp. 3658 ◽  
Author(s):  
Jianliang Yang ◽  
Yuenan Liu ◽  
Minghui Qian ◽  
Chenghua Guan ◽  
Xiangfei Yuan

Clinical named entity recognition is an essential task for humans to analyze large-scale electronic medical records efficiently. Traditional rule-based solutions need considerable human effort to build rules and dictionaries; machine learning-based solutions need laborious feature engineering. For the moment, deep learning solutions like Long Short-term Memory with Conditional Random Field (LSTM–CRF) achieved considerable performance in many datasets. In this paper, we developed a multitask attention-based bidirectional LSTM–CRF (Att-biLSTM–CRF) model with pretrained Embeddings from Language Models (ELMo) in order to achieve better performance. In the multitask system, an additional task named entity discovery was designed to enhance the model’s perception of unknown entities. Experiments were conducted on the 2010 Informatics for Integrating Biology & the Bedside/Veterans Affairs (I2B2/VA) dataset. Experimental results show that our model outperforms the state-of-the-art solution both on the single model and ensemble model. Our work proposes an approach to improve the recall in the clinical named entity recognition task based on the multitask mechanism.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Han Huang ◽  
Hongyu Wang ◽  
Dawei Jin

Named entity recognition (NER) is an indispensable part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent Q & A. This paper describes the development of the AL-CRF model, which is a NER approach based on active learning (AL). The AL-CRF model proceeds as follows: first, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the resulting clusters in order to obtain initial samples, which are used to train the basic conditional random field (CRF) classifier. Next, a selection process based on the entropy criterion begins: samples having the highest entropy values are added to the training set. Afterwards, the learning process is repeated, and the CRF classifier is retrained on the updated training set. The learning and selection processes of the AL run iteratively until the harmonic mean F stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases in order to validate the AL-CRF performance. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that our proposed algorithm has better recognition accuracy and recall rate compared to the conventional CRF model. Moreover, the main advantage of our approach is that it requires fewer manually labelled training samples while being more effective. This can result in a more cost-effective and more reliable process.
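
The entropy-based selection step described above can be sketched roughly as follows. This is an illustrative sketch only, not the AL-CRF code: the helper names `prediction_entropy` and `select_most_uncertain` are hypothetical, and it assumes each unlabelled sample has already been reduced to a single probability distribution over labels, whereas a real CRF would need some aggregation over token-level or sequence-level probabilities.

```python
import math
from typing import List, Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs: Sequence[Sequence[float]],
                          k: int) -> List[int]:
    """Return indices of the k pool samples with the highest entropy.

    pool_probs[i] is the (assumed) label distribution the current
    classifier predicts for unlabelled sample i.
    """
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: prediction_entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]


# Toy pool: the second sample is the most uncertain, the first the least.
toy_pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(select_most_uncertain(toy_pool, k=2))
```

In an active learning loop, the returned indices would be sent for manual labelling, added to the training set, and the classifier retrained before the next selection round.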


2020 ◽  
Vol 8 ◽  
pp. 605-620 ◽  
Author(s):  
Takashi Shibuya ◽  
Eduard Hovy

When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent entity. In addition, we provide the decoding method for inference that extracts entities iteratively from outermost ones to inner ones in an outside-to-inside way. Our method has no additional hyperparameters to the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method performs better than or at least as well as existing methods capable of handling nested entities, achieving F1-scores of 85.82%, 84.34%, and 77.36% on ACE-2004, ACE-2005, and GENIA datasets, respectively.


2018 ◽  
Author(s):  
Yudi Wibisono ◽  
Masayu Leylia Khodra

Named-entity recognition (NER) is the automatic process of extracting named entities considered important in a text and assigning them to predefined categories. For example, for news text, NER can extract person names, organization names, and location names. NER is useful in various text analysis applications, such as search, question answering systems, text summarization, and machine translation. The main challenge of NER is handling ambiguity of meaning that arises from the context of a word in a sentence; for example, the word “Cendana” can be a location name (Jalan Cendana), an organization name (Keluarga Cendana), or a plant name. Another challenge is determining entity boundaries, for example “[Istora Senayan] [Jakarta]”. Various NER tools have been developed for many languages, especially English, with good performance, but NER tools for Indonesian still perform poorly. This paper discusses a machine learning-based approach to producing an Indonesian NER model. This approach depends heavily on the corpus used as the learning source and on the machine learning technique used. The technique used is LSTM-CRF (Long Short-Term Memory – Conditional Random Field). The best result (F-measure = 0.72) was obtained using GloVe word embeddings trained on Indonesian Wikipedia.

