Domain-Specific Entity Linking via Fake Named Entity Detection

The Web makes it possible to retrieve concise information about specific entities, including people, organizations, movies and their features. However, a large share of Web resources is unstructured, which makes it hard to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to identify entity mentions and link them to the relevant entities in a given knowledge base. Many general-purpose benchmark datasets exist for evaluating these approaches. However, domain-specific approaches are difficult to evaluate because evaluation datasets for specific domains are lacking. This study presents WeDGeM, a multilingual evaluation set generator for specific domains that exploits Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated texts. Using the generated test data, well-known Entity Linking systems that support Turkish texts are evaluated in the movie domain as a use case.


Robust named entity detection from optical character recognition output

International Journal on Document Analysis and Recognition (IJDAR) ◽

10.1007/s10032-011-0150-z ◽

2011 ◽

Vol 14 (2) ◽

pp. 189-200 ◽

Cited By ~ 3

Author(s):

Krishna Subramanian ◽

Rohit Prasad ◽

Prem Natarajan

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Named Entity ◽

Optical Character ◽

Named Entity Detection


Variable-Span out-of-vocabulary named entity detection

10.21437/interspeech.2013-594 ◽

2013 ◽

Author(s):

Wei Chen ◽

Sankaranarayanan Ananthakrishnan ◽

Rohit Prasad ◽

Prem Natarajan

Keyword(s):

Named Entity ◽

Variable Span ◽

Named Entity Detection


WeDGeM: A Domain-Specific Evaluation Dataset Generator for Multilingual Entity Linking Systems

Lecture Notes in Computer Science - Web Information Systems Engineering – WISE 2017 ◽

10.1007/978-3-319-68786-5_18 ◽

2017 ◽

pp. 221-228

Author(s):

Emrah Inan ◽

Oguz Dikenelli

Keyword(s):

Entity Linking ◽

Domain Specific ◽

Evaluation Dataset ◽

Specific Evaluation


MicroNeel: Combining NLP Tools to Perform Named Entity Detection and Linking on Microposts

EVALITA. Evaluation of NLP and Speech Tools for Italian ◽

10.4000/books.aaccademia.1948 ◽

2016 ◽

pp. 60-65 ◽

Cited By ~ 2

Author(s):

Francesco Corcoglioniti ◽

Alessio Palmero Aprosio ◽

Yaroslav Nechaev ◽

Claudio Giuliano

Keyword(s):

Named Entity ◽

Named Entity Detection


A Domain Specific Entity Linking Approach Consuming Multistore Environment

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201805016 ◽

2018 ◽

pp. 46-52

Author(s):

Emrah Inan ◽

Burak Yonyul ◽

Fatih Tekbacak

Keyword(s):

Big Data ◽

Data Models ◽

Use Cases ◽

Unstructured Data ◽

Entity Linking ◽

Domain Specific ◽

Storage Technology ◽

Integrated Environment ◽

Data Environment ◽

The Web

Most of the data on the web is unstructured and must be transformed into a machine-operable structure. It is therefore appropriate to convert the unstructured data into a structured form according to the requirements and to store it in different data models depending on the use cases. As the requirements and their types grow, no single approach can serve them all, so a single storage technology is not suitable for meeting all storage requirements. Managing stores with various types of schemas in a joint and integrated manner is called 'multistore' or 'polystore' in the database literature. In this paper, the Entity Linking task is leveraged to transform texts into well-formed data, and this data is managed by an integrated environment of different data models. Finally, the method is presented by querying and examining this integrated big data environment.


A Probability based Classification of Named Entities for Malayalam Language combining Word, Part of Speech and Lexicalized features

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.a1968.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 839-842

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Supervised Machine Learning ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Part Of Speech ◽

Classification Probability ◽

Malayalam Language

Named Entity Recognition is the process of identifying named entities, which are the designators of a sentence. Designators of a sentence are domain specific. The proposed system identifies named entities in the Malayalam language belonging to the tourism domain, which generally include names of persons, places, organizations, dates, etc. The system uses word, part-of-speech and lexicalized features to find the probability of a word belonging to a named entity category and to classify it accordingly. The probability is calculated with supervised machine learning from the word and part-of-speech features present in a tagged training corpus, together with rules based on lexicalized features.
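The abstract does not say which probabilistic model is used, so the following is only a minimal sketch of the general idea: estimating the probability that a word belongs to an entity category from word and part-of-speech counts in a tagged corpus. It assumes a naive Bayes model with add-one smoothing, which is a swapped-in technique rather than the authors' method. The toy corpus, the label names and the function names are all hypothetical, and the rule-based lexicalized features are left out.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, part-of-speech tag, entity label) triples.
# English tokens stand in for Malayalam ones purely for illustration.
TAGGED_CORPUS = [
    ("Kochi", "NNP", "PLACE"),
    ("Munnar", "NNP", "PLACE"),
    ("Kochi", "NNP", "PLACE"),
    ("Ravi", "NNP", "PERSON"),
    ("visited", "VBD", "O"),
    ("the", "DT", "O"),
    ("beach", "NN", "O"),
]


def train(corpus):
    """Count how often each word and each POS tag occurs with each label."""
    label_counts = Counter(label for _, _, label in corpus)
    word_counts = defaultdict(Counter)
    pos_counts = defaultdict(Counter)
    for word, pos, label in corpus:
        word_counts[label][word] += 1
        pos_counts[label][pos] += 1
    vocab = {word for word, _, _ in corpus}
    tags = {pos for _, pos, _ in corpus}
    return label_counts, word_counts, pos_counts, vocab, tags


def label_probabilities(model, word, pos):
    """Return P(label | word, pos) under the naive Bayes assumption."""
    label_counts, word_counts, pos_counts, vocab, tags = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        prior = n / total
        # Add-one smoothing; the extra +1 reserves mass for unseen values.
        p_word = (word_counts[label][word] + 1) / (n + len(vocab) + 1)
        p_pos = (pos_counts[label][pos] + 1) / (n + len(tags) + 1)
        scores[label] = prior * p_word * p_pos
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


def classify(model, word, pos):
    """Pick the entity category with the highest estimated probability."""
    probs = label_probabilities(model, word, pos)
    return max(probs, key=probs.get)


model = train(TAGGED_CORPUS)
print(classify(model, "Kochi", "NNP"))
print(label_probabilities(model, "visited", "VBD"))
```

A real system along the lines described would add the lexicalized rules on top of these estimates, for example as a post-processing step that overrides or re-weights the statistical decision.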


BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

10.21203/rs.3.rs-90025/v1 ◽

2020 ◽

Author(s):

Usman Naseem ◽

Matloob Khushi ◽

Vinay Reddy ◽

Sakthivel Rajendran ◽

Imran Razzak ◽

...

Keyword(s):

State Of The Art ◽

Language Model ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Future Research ◽

Named Entity ◽

Domain Specific ◽

Context Dependent ◽

Biomedical Named Entity Recognition

Background: In recent years, with the growing number of biomedical documents and advances in natural language processing algorithms, research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging because NER in the biomedical domain (i) is often restricted by the limited amount of training data, (ii) must handle entities that can refer to multiple types and concepts depending on context, and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), bioALBERT, an effective domain-specific pre-trained language model trained on a huge biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter reduction strategies to minimise memory usage and improve training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. Performance increased for (i) disease type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene-protein type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that it is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models, which are available to the research community for use in future research.


HerCulB: content-based information extraction and retrieval for cultural heritage of the Balkans

The Electronic Library ◽

10.1108/el-03-2020-0052 ◽

2020 ◽

Vol 38 (5/6) ◽

pp. 905-918

Author(s):

Ivana Tanasijević ◽

Gordana Pavlović-Lažetić

Keyword(s):

Cultural Heritage ◽

Language Processing ◽

Digital Libraries ◽

Intangible Cultural Heritage ◽

Automatic Annotation ◽

Content Type ◽

The Balkans ◽

Named Entity ◽

Domain Specific ◽

Rule Based Approach

Purpose: The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage, mostly in the form of interviews. The assigned annotations provide a way to search the collection. Design/methodology/approach: Annotation is based on automatic extraction of metadata and is conducted by named entity and topic extraction from textual descriptions, using a rule-based approach supported by vocabulary resources, a compiled domain-specific classification scheme and domain-oriented corpus analysis. Findings: The proposed methodology for automatic annotation of a collection of intangible cultural heritage, applied to the cultural heritage of the Balkans, achieves very good results: an F-measure of 0.87 for named entity annotation and 0.90 for topic annotation. The overall methodology enables encapsulating domain-specific and language-specific knowledge into collections of finite state transducers and allows further improvements. Originality/value: Although cultural heritage has a significant role in the development of the identity of a group or an individual, it is one of those specific domains that have not yet been fully explored for many languages. A methodology is proposed that can be used for incorporating natural language processing techniques into digital libraries of cultural heritage.
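For readers unfamiliar with the metric, the reported F measure can be illustrated with a short sketch. It assumes the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F1 score: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct annotations, 2 spurious, 2 missed.
print(f_measure(8, 2, 2))
```

With these made-up counts, precision and recall are both 0.8, so the harmonic mean is also 0.8.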
