Tibetan-Chinese Named Entity Extraction Based on Comparable Corpus

2014 ◽  
Vol 571-572 ◽  
pp. 1202-1205
Author(s):  
Yuan Sun ◽  
Qian Zhao

Tibetan-Chinese named entity extraction is a foundation of cross-language information processing and provides a basis for research on machine translation and cross-language information retrieval. In this paper, we use the multi-language links of Wikipedia to obtain a Tibetan-Chinese comparable corpus, and combine sentence length, word matching and entity boundary words to obtain parallel sentences. We then extract Tibetan-Chinese named entities from the comparable corpus in three ways: (1) extracting natural labeling information; (2) acquiring the links between Tibetan entries and Chinese entries; (3) using a sequence intersection method, which includes sentence representation, Chinese named entity recognition and intersection of the corresponding Tibetan sentences. Finally, the results show that the extraction method based on a comparable corpus is effective.
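The abstract does not spell out the sequence intersection step, so the following is only a rough sketch of one plausible reading: when several Chinese sentences contain the same recognized entity, the token sequence shared by all of the corresponding Tibetan sentences is a candidate translation. The function names and the placeholder tokens are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def longest_shared_ngram(sentences: List[Sequence[str]]) -> Tuple[str, ...]:
    """Longest contiguous token sequence present in every sentence.

    Assumption: each sentence is already segmented into tokens and is
    aligned with a Chinese sentence containing the same named entity.
    Ties are broken by taking the smallest n-gram in sort order.
    """
    if not sentences:
        return ()
    shortest = min(len(s) for s in sentences)
    for n in range(shortest, 0, -1):
        shared = set.intersection(*(ngrams(s, n) for s in sentences))
        if shared:
            return min(shared)
    return ()


# Placeholder tokens stand in for segmented Tibetan text.
tibetan_sentences = [
    ["a", "x", "y", "b"],
    ["c", "x", "y"],
    ["x", "y", "d", "e"],
]
candidate = longest_shared_ngram(tibetan_sentences)
print(candidate)
```

In practice the shared sequence would also contain frequent function words, so a real system would need filtering (for example with the entity boundary words mentioned above); that filtering is left out of this sketch.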

Author(s):  
Hsu Myat Mo ◽  
Khin Mar Soe

Myanmar is a low-resource language, and this is one of the main reasons why Myanmar Natural Language Processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for the Myanmar language was developed and is proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than the baseline CRF model. This work also aims to assess the effectiveness of neural network approaches to text processing for the Myanmar language and to promote future research on this understudied language.
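For readers unfamiliar with the evaluation protocol, the sketch below shows how a 10-fold cross-validation split can be produced over sentence indices. The helper name `k_fold_splits`, the shuffling seed and the toy corpus size are assumptions for illustration; the abstract does not describe how the authors partitioned their corpus.

```python
import random
from typing import List, Tuple


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering n_items items.

    Each item lands in exactly one test fold; fold sizes differ by at most one.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# Toy example: 25 sentences instead of a full corpus.
splits = k_fold_splits(25, k=10)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

A model is trained on each `train_idx` set and scored on the matching `test_idx` set, and the ten scores are then averaged.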


Author(s):  
Krešimir Baksa ◽  
Dino Golović ◽  
Goran Glavaš ◽  
Jan Šnajder

Named entity extraction tools designed for recognizing named entities in texts written in standard language (e.g., news stories or legal texts) have been shown to be inadequate for user-generated textual content (e.g., tweets, forum posts). In this work, we propose a supervised approach to named entity recognition and classification for Croatian tweets. We compare two sequence labelling models: a hidden Markov model (HMM) and conditional random fields (CRF). Our experiments reveal that CRF is the best model for the task, achieving a very good performance of over 87% micro-averaged F1 score. We analyse the contributions of different feature groups and influence of the training set size on the performance of the CRF model.
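As a reminder of what the micro-averaged F1 score measures, the sketch below pools true positives, false positives and false negatives over all entity classes before computing precision and recall. The class labels and counts are made-up illustrative numbers, not results from the paper.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-class (TP, FP, FN) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class counts: (true positives, false positives, false negatives).
example_counts = {"PER": (8, 2, 1), "LOC": (5, 1, 3)}
print(round(micro_f1(example_counts), 4))
```

Because the counts are pooled, frequent classes weigh more heavily in the micro average than in a macro average, which averages the per-class F1 scores instead.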


2020 ◽  
Vol 9 (6) ◽  
pp. 1-22
Author(s):  
Omar ASBAYOU

This article describes our rule-based Arabic named entity recognition (NER) and classification system. It is based on lists of classified proper names (PN) and, in particular, on syntactico-semantic patterns that yield a fine-grained classification of Arabic named entities (NE). These patterns use syntactico-semantic combinations of morpho-syntactic and syntactic entities. The system also uses a lexical classification of trigger words and NE extensions. These linguistic data are essential not only to named entity extraction but also to taxonomic classification and to determining NE boundaries. Our method is also based on contextualisation and on the notion of NE class attributes and values. Inspired by X-bar theory and immediate constituents, we built a rule-based NER system composed of five levels of syntactico-semantic combination. We also show how the fine-grained NE annotations in our system output (an XML database) are exploited in information retrieval and information extraction.


2015 ◽  
Vol 22 (3) ◽  
pp. 423-456
Author(s):  
MENA B. HABIB ◽  
MAURICE VAN KEULEN

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.


2021 ◽  
pp. 724-732
Author(s):  
Zeqi Ma ◽  
Lingwei Ma ◽  
Dongmei Fu ◽  
Guangxuan Song ◽  
Dawei Zhang

Author(s):  
Sergey Berezin ◽  
Ivan Bondarenko

Named Entity Recognition (NER) is the task of extracting information from text data that belongs to predefined categories, such as organization names, place names and people's names. In this work, an approach was developed for the additional training of deep neural networks with the attention mechanism (BERT architecture). It is shown that preliminary training of the language model on the tasks of recovering a masked word and determining the semantic relatedness of two sentences can significantly improve NER quality. One of the best results has been achieved in the task of extracting named entities on the RuREBus dataset. A key feature of the described solution is that the problem formulation is close to real business problems and that the entities selected are not general in nature but specific to the economic domain.

