named entity recognition
Recently Published Documents


TOTAL DOCUMENTS

2386
(FIVE YEARS 1250)

H-INDEX

47
(FIVE YEARS 10)

Author(s):  
Mohammad Sadegh Sheikhaei ◽  
Hasan Zafari ◽  
Yuan Tian

In this article, we propose a new encoding scheme for named entity recognition (NER) called Joined Type-Length encoding (JoinedTL). Unlike most existing named entity encoding schemes, which focus on flat entities, JoinedTL can label nested named entities in a single sequence. JoinedTL uses a packed encoding to represent both type and span of a named entity, which not only results in less tagged tokens compared to existing encoding schemes, but also enables it to support nested NER. We evaluate the effectiveness of JoinedTL for nested NER on three nested NER datasets: GENIA in English, GermEval in German, and PerNest, our newly created nested NER dataset in Persian. We apply CharLSTM+WordLSTM+CRF, a three-layer sequence tagging model on three datasets encoded using JoinedTL and two existing nested NE encoding schemes, i.e., JoinedBIO and JoinedBILOU. Our experiment results show that CharLSTM+WordLSTM+CRF trained with JoinedTL encoded datasets can achieve competitive F1 scores as the ones trained with datasets encoded by two other encodings, but with 27%–48% less tagged tokens. To leverage the power of three different encodings, i.e., JoinedTL, JoinedBIO, and JoinedBILOU, we propose an encoding-based ensemble method for nested NER. Evaluation results show that the ensemble method achieves higher F1 scores on all datasets than the three models each trained using one of the three encodings. By using nested NE encodings including JoinedTL with CharLSTM+WordLSTM+CRF, we establish new state-of-the-art performance with an F1 score of 83.7 on PerNest, 74.9 on GENIA, and 70.5 on GermEval, surpassing two recent neural models specially designed for nested NER.


Author(s):  
Xianwen Liao ◽  
Yongzhong Huang ◽  
Peng Yang ◽  
Lei Chen

By defining the computable word segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework in this article. The proposed SLM is an optimization model, and its objective is to maximize the total binding force of all candidate word segmentation units in sentences under the condition of no annotated datasets and vocabularies. To solve SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with the popular sequence labeling models, Vietnamese word segmentation, part-of-speech tagging and named entity recognition experiments are performed. The experimental results show that our SLM can effectively promote the performance of sequence labeling tasks. Just using less than 10% of training data and without using a dictionary, the performance of our sequence labeling framework is better than the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. SLM has no hyper-parameter to be tuned, and it is completely unsupervised and applicable to any other analytic language. Thus, it has good domain adaptability.


Information ◽  
2022 ◽  
Vol 13 (1) ◽  
pp. 26
Author(s):  
Nestor Suat-Rojas ◽  
Camilo Gutierrez-Osorio ◽  
Cesar Pedraza

Traffic accident detection is an important strategy governments can use to implement policies intended to reduce accidents. They usually use techniques such as image processing, RFID devices, among others. Social network mining has emerged as a low-cost alternative. However, social networks come with several challenges such as informal language and misspellings. This paper proposes a method to extract traffic accident data from Twitter in Spanish. The method consists of four phases. The first phase establishes the data collection mechanisms. The second consists of vectorially representing the messages and classifying them as accidents or non-accidents. The third phase uses named entity recognition techniques to detect the location. In the fourth phase, locations pass through a geocoder that returns their geographic coordinates. This method was applied to Bogota city and the data on Twitter were compared with the official traffic information source; comparisons showed some influence of Twitter on the commercial and industrial area of the city. The results reveal how effective the information on accidents reported on Twitter can be. It should therefore be considered as a source of information that may complement existing detection methods.


2022 ◽  
Author(s):  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract Nowadays, the use of language corpora for many purposes has increased significantly. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web’s rapid growth has significantly improved access to thousands of online documents, highly specialized texts and comparable texts on the same subject covering several languages in electronic form. However, research has continued to concentrate on corpus annotation instead of corpus creation tools. Consequently, many researchers create their corpora, independently solve problems, and generate project-specific systems. The corpus construction is used for many NLP applications, including machine translation, information retrieval, and question-answering. This paper presents a new NLP Corpus and Services in the Cloud called HULTIG-C. HULTIG-C is characterized by various languages that include unique annotations such as keywords set, sentences set, named entity recognition set, and multiword set. Moreover, a framework incorporates the main components for license detection, language identification, boilerplate removal and document deduplication to process the HULTIG-C. Furthermore, this paper presents some potential issues related to constructing multilingual corpora from the Web.


Author(s):  
Judita Preiss

AbstractWe exploit the Twitter platform to create a dataset of news articles derived from tweets concerning COVID-19, and use the associated tweets to define a number of popularity measures. The focus on (potentially) biomedical news articles allows the quantity of biomedically valid information (as extracted by biomedical relation extraction) to be included in the list of explored features. Aside from forming part of a systematic correlation exploration, the features – ranging from the semantic relations through readability measures to the article’s digital content – are used within a number of machine learning classifier and regression algorithms. Unsurprisingly, the results support that for more complex articles (as determined by a readability measure) more sophisticated syntactic structure may be expected. A weak correlation is found with information within an article suggesting that other factors, such as numbers of videos, have a notable impact on the popularity of a news article. The best popularity prediction performance is obtained using a random forest machine learning algorithm, and the feature describing the quantity of biomedical information is in the top 3 most important features in almost a third of the experiments performed. Additionally, this feature is found to be more valuable than the widely used named entity recognition.


2022 ◽  
Vol 6 (1) ◽  
pp. 4
Author(s):  
Dmitry Soshnikov ◽  
Tatiana Petrova ◽  
Vickie Soshnikova ◽  
Andrey Grunin

Since the beginning of the COVID-19 pandemic almost two years ago, there have been more than 700,000 scientific papers published on the subject. An individual researcher cannot possibly get acquainted with such a huge text corpus and, therefore, some help from artificial intelligence (AI) is highly needed. We propose the AI-based tool to help researchers navigate the medical papers collections in a meaningful way and extract some knowledge from scientific COVID-19 papers. The main idea of our approach is to get as much semi-structured information from text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and Text Analytics for Health service, then store the data into NoSQL database for further fast processing and insights generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Application of NLP and text-based emotion detection (TBED) methods to COVID-19 text corpus allows us to gain insights on important issues of diagnosis and treatment (such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus, etc.).


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Zhaoying Chai ◽  
Han Jin ◽  
Shenghui Shi ◽  
Siyan Zhan ◽  
Lin Zhuo ◽  
...  

Abstract Background Biomedical named entity recognition (BioNER) is a basic and important medical information extraction task to extract medical entities with special meaning from medical texts. In recent years, deep learning has become the main research direction of BioNER due to its excellent data-driven context coding ability. However, in BioNER task, deep learning has the problem of poor generalization and instability. Results we propose the hierarchical shared transfer learning, which combines multi-task learning and fine-tuning, and realizes the multi-level information fusion between the underlying entity features and the upper data features. We select 14 datasets containing 4 types of entities for training and evaluate the model. The experimental results showed that the F1-scores of the five gold standard datasets BC5CDR-chemical, BC5CDR-disease, BC2GM, BC4CHEMD, NCBI-disease and LINNAEUS were increased by 0.57, 0.90, 0.42, 0.77, 0.98 and − 2.16 compared to the single-task XLNet-CRF model. BC5CDR-chemical, BC5CDR-disease and BC4CHEMD achieved state-of-the-art results.The reasons why LINNAEUS’s multi-task results are lower than single-task results are discussed at the dataset level. Conclusion Compared with using multi-task learning and fine-tuning alone, the model has more accurate recognition ability of medical entities, and has higher generalization and stability.


Sign in / Sign up

Export Citation Format

Share Document