named entity
Recently Published Documents

Mohammad Sadegh Sheikhaei ◽  
Hasan Zafari ◽  
Yuan Tian

In this article, we propose a new encoding scheme for named entity recognition (NER) called Joined Type-Length encoding (JoinedTL). Unlike most existing named entity encoding schemes, which focus on flat entities, JoinedTL can label nested named entities in a single sequence. JoinedTL uses a packed encoding to represent both the type and the span of a named entity, which not only results in fewer tagged tokens than existing encoding schemes but also enables it to support nested NER. We evaluate the effectiveness of JoinedTL for nested NER on three nested NER datasets: GENIA in English, GermEval in German, and PerNest, our newly created nested NER dataset in Persian. We apply CharLSTM+WordLSTM+CRF, a three-layer sequence tagging model, to the three datasets encoded using JoinedTL and two existing nested NE encoding schemes, JoinedBIO and JoinedBILOU. Our experimental results show that CharLSTM+WordLSTM+CRF trained on JoinedTL-encoded datasets achieves F1 scores competitive with those of the models trained on datasets encoded with the other two schemes, but with 27%–48% fewer tagged tokens. To leverage the strengths of the three encodings (JoinedTL, JoinedBIO, and JoinedBILOU), we propose an encoding-based ensemble method for nested NER. Evaluation results show that the ensemble method achieves higher F1 scores on all datasets than any of the three models trained with a single encoding. By using nested NE encodings, including JoinedTL, with CharLSTM+WordLSTM+CRF, we establish new state-of-the-art performance with an F1 score of 83.7 on PerNest, 74.9 on GENIA, and 70.5 on GermEval, surpassing two recent neural models specially designed for nested NER.
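
For readers unfamiliar with span encodings, the following is a minimal Python sketch of the standard flat BIO and BILOU tagging schemes, which the names JoinedBIO and JoinedBILOU suggest as a starting point. The abstract does not describe how JoinedTL, JoinedBIO, or JoinedBILOU construct their labels, so none of them is implemented here. The function names, the `(start, end, type)` span format, and the example sentence are illustrative assumptions, and the sketch handles only non-overlapping (flat) entities.

```python
from typing import List, Tuple

# A span is (start, end_exclusive, entity_type); this layout is an assumption
# made for the sketch, not a format taken from the paper.
Span = Tuple[int, int, str]


def encode_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flat BIO: B- marks the first token of an entity, I- the rest."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def encode_bilou(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flat BILOU: adds L- (last token) and U- (single-token entity)."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"U-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"L-{etype}"
    return tags


tokens = ["Barack", "Obama", "visited", "Berlin", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]

bio = encode_bio(len(tokens), spans)
bilou = encode_bilou(len(tokens), spans)

for token, bio_tag, bilou_tag in zip(tokens, bio, bilou):
    print(f"{token:10s} {bio_tag:8s} {bilou_tag}")
```

In both flat schemes every token inside an entity receives a non-`O` tag, so the number of tagged tokens equals the number of entity tokens; the schemes differ only in how much boundary information each label carries.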

2022 ◽  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract: The use of language corpora for many purposes has increased significantly. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web's rapid growth has significantly improved access to thousands of online documents, highly specialized texts, and comparable texts on the same subject covering several languages in electronic form. However, research has continued to concentrate on corpus annotation rather than on corpus creation tools. Consequently, many researchers create their own corpora, solve the same problems independently, and build project-specific systems. Corpus construction supports many NLP applications, including machine translation, information retrieval, and question answering. This paper presents a new NLP Corpus and Services in the Cloud called HULTIG-C. HULTIG-C covers various languages and includes unique annotations such as keyword sets, sentence sets, named entity recognition sets, and multiword sets. Moreover, a framework incorporates the main components for license detection, language identification, boilerplate removal, and document deduplication to process HULTIG-C. Furthermore, this paper presents some potential issues related to constructing multilingual corpora from the Web.

2022 ◽  
Vol 23 (1) ◽  
Zhaoying Chai ◽  
Han Jin ◽  
Shenghui Shi ◽  
Siyan Zhan ◽  
Lin Zhuo ◽  

Abstract. Background: Biomedical named entity recognition (BioNER) is a basic and important medical information extraction task that extracts medical entities with special meaning from medical texts. In recent years, deep learning has become the main research direction of BioNER because of its excellent data-driven context encoding ability. However, in the BioNER task, deep learning suffers from poor generalization and instability. Results: We propose hierarchical shared transfer learning, which combines multi-task learning and fine-tuning and realizes multi-level information fusion between the underlying entity features and the upper data features. We select 14 datasets containing 4 types of entities for training and evaluate the model. The experimental results show that the F1-scores on the six gold standard datasets BC5CDR-chemical, BC5CDR-disease, BC2GM, BC4CHEMD, NCBI-disease, and LINNAEUS changed by 0.57, 0.90, 0.42, 0.77, 0.98, and −2.16, respectively, compared with the single-task XLNet-CRF model. BC5CDR-chemical, BC5CDR-disease, and BC4CHEMD achieved state-of-the-art results. The reasons why the multi-task results on LINNAEUS are lower than the single-task results are discussed at the dataset level. Conclusion: Compared with using multi-task learning or fine-tuning alone, the model recognizes medical entities more accurately and has higher generalization and stability.

Customer feedback provides an alternative and important source of knowledge that helps marketers and customers make better decisions. However, the manual process of extracting useful information depends on domain experts. This paper focuses on improving the performance of automatic sentiment information extraction from customer feedback. The article proposes a new extraction method that considers multiple dimensions of feedback information: the aspect, word, contrast, sentence or phrase, and document levels. The aspect-based sentiment extraction uses a named entity recognition technique to extract the desired aspects of a target product. The aspect-based sentiment is combined with sentiment information from multiple levels of feedback context, and the resulting fused sentiment information improves extraction performance. We validate the effectiveness of the method by measuring the accuracy of the sentiment and aspect recognition methods in comparison with SentiStrength and Word-Count. This information gives some insight into customer satisfaction and can be applied in an alerting tool.
