Combining the NER-OCR methods to improve information retrieval efficiency in the Indonesian posters

2020 ◽  
Vol 8 (4) ◽  
pp. 263-269
Author(s):  
Ahmad Syarif Rosidy ◽  
Tubagus Mohammad Akhriza ◽  
Mochammad Husni

Event organizers in Indonesia often use websites to disseminate information about their events through digital posters. However, manually transferring information from posters to websites is time-consuming, given the increasing number of posters uploaded. Moreover, information retrieval methods such as Named Entity Recognition (NER) for Indonesian posters are still rarely discussed in the literature, and applying NER to an Indonesian corpus is challenging because Indonesian is a low-resource language, with little corpus material available as a reference. This study proposes a solution to improve the time efficiency of information extraction from digital posters: a combination of the NER method with Optical Character Recognition (OCR) to recognize the text on posters, developed with the support of a relevant training-data corpus to improve accuracy. The experimental results show that the system increases time efficiency by 94% with 82-92% accuracy for several extracted information entities across 50 test digital posters.
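The combination described can be pictured as a two-stage pipeline: an OCR engine reads the poster image, and an NER step labels the recognised text. A minimal Python sketch of that structure, with stand-ins for the OCR engine and the trained Indonesian NER model (the patterns, entity names and sample text below are illustrative, not the authors'):

```python
import re

def ocr_poster(image_path):
    # Stand-in for an OCR engine such as Tesseract; in the real
    # pipeline this would return the text recognised from the image.
    raise NotImplementedError

def extract_entities(text):
    # Stand-in for a trained Indonesian NER model: simple patterns
    # for the date and fee entities a seminar poster typically carries.
    entities = {}
    date = re.search(r"\b(\d{1,2}\s+\w+\s+\d{4})\b", text)
    if date:
        entities["DATE"] = date.group(1)
    fee = re.search(r"Rp\s?[\d.]+", text)
    if fee:
        entities["FEE"] = fee.group(0)
    return entities

sample = "Seminar Nasional, 12 Agustus 2020, HTM Rp 50.000"
print(extract_entities(sample))  # {'DATE': '12 Agustus 2020', 'FEE': 'Rp 50.000'}
```

In the full system, `extract_entities` would be applied to the output of `ocr_poster`, so each uploaded poster is converted to structured fields without manual transcription.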

2021 ◽  
pp. 016555152110006
Author(s):  
Houssem Menhour ◽  
Hasan Basri Şahin ◽  
Ramazan Nejdet Sarıkaya ◽  
Medine Aktaş ◽  
Rümeysa Sağlam ◽  
...  

The newspaper emerged as a distinct cultural form in early 17th-century Europe and is bound up with the early modern period of history. Historical newspapers are of utmost importance to nations and their people, and researchers from different disciplines rely on these papers to improve our understanding of the past. To satisfy this need, the Istanbul University Head Office of Library and Documentation provides access to a large database of scanned historical newspapers. To take this a step further and make the documents more accessible, we run optical character recognition (OCR) and named entity recognition (NER) on the whole database and index the results to enable full-text search. We design and implement a system encompassing the whole pipeline, from scraping the dataset from the original website to providing a graphical user interface for running search queries, and it does so successfully. The proposed system supports searching for people-, culture- and security-related keywords and visualising the results.
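The final stage of such a pipeline, indexing the OCR/NER output so the archive becomes searchable, can be sketched with a toy inverted index (illustrative only; a production system would likely use a search engine such as Elasticsearch, and the documents below are invented):

```python
from collections import defaultdict

def build_index(docs):
    # Map each token to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    # Conjunctive query: return documents containing every query token.
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "Istanbul newspaper archive 1928",
    2: "Ankara newspaper archive 1930",
}
index = build_index(docs)
print(search(index, "newspaper istanbul"))  # {1}
```

The same structure extends naturally to NER output: indexing recognised entity strings instead of raw tokens yields the people/culture/security keyword search the abstract describes.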


2021 ◽  
Vol 27 (5) ◽  
pp. 631-645
Author(s):  
Kenneth Ward Church ◽  
Xiaopeng Yuan ◽  
Sheng Guo ◽  
Zewu Wu ◽  
Yehua Yang ◽  
...  

Abstract: Deep nets have done well with early adopters, but the future will soon depend on crossing the chasm. The goal of this paper is to make deep nets more accessible to a broader audience, including people with little or no programming skills and people with little interest in training new models. A GitHub repository is provided with simple implementations of image classification, optical character recognition, sentiment analysis, named entity recognition, question answering (QA/SQuAD), machine translation, text to speech (TTS), and speech to text (STT). The emphasis is on instant gratification: non-programmers should be able to install these programs and use them in 15 minutes or less (per program). Programs are short (10–100 lines each) and readable by users with modest programming skills. Much of the complexity is hidden behind abstractions such as pipelines and auto classes, and behind pretrained models and datasets provided by hubs: PaddleHub, PaddleNLP, HuggingFaceHub, and Fairseq. Hubs have different priorities than research: research focuses on training models from corpora and fine-tuning them for tasks, but users are already overwhelmed with an embarrassment of riches (13k models and 1k datasets). Do they want more? We believe the broader market is more interested in inference (how to run pretrained models on novel inputs) and less interested in training (how to create even more models).
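The abstraction style the paper advocates, where a pipeline object hides model loading and inference behind a single call, can be illustrated with a toy stand-in (the real programs use hub-provided pipelines such as Hugging Face's `pipeline("sentiment-analysis")`, which download pretrained models; the class and word list below are invented purely to show the interface shape):

```python
class Pipeline:
    # Toy stand-in for a hub-provided pipeline: the constructor hides
    # "model loading", and calling the object runs "inference".
    def __init__(self, task):
        self.task = task
        self.positive = {"good", "great", "excellent"}  # toy "model"

    def __call__(self, text):
        score = sum(w in self.positive for w in text.lower().split())
        return {"label": "POSITIVE" if score else "NEGATIVE"}

nlp = Pipeline("sentiment-analysis")
print(nlp("a great movie"))  # {'label': 'POSITIVE'}
```

The point of the interface is exactly the paper's: a user who never trains anything can still get a labelled prediction in one constructor call and one function call.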


Author(s):  
Minlong Peng ◽  
Qi Zhang ◽  
Xiaoyu Xing ◽  
Tao Gui ◽  
Jinlan Fu ◽  
...  

Word representation is a key component in neural-network-based sequence labeling systems. However, representations of unseen or rare words trained on the end task are usually too poor for appreciable performance. This is commonly referred to as the out-of-vocabulary (OOV) problem. In this work, we address the OOV problem in sequence labeling using only the training data of the task. To this end, we propose a novel method to predict representations for OOV words from their surface forms (e.g., character sequences) and contexts. The method is specifically designed to avoid the error-propagation problem suffered by existing approaches in the same paradigm. To evaluate its effectiveness, we performed extensive empirical studies on four part-of-speech (POS) tagging tasks and four named entity recognition (NER) tasks. Experimental results show that the proposed method achieves better or competitive performance on the OOV problem compared with existing state-of-the-art methods.
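One common way to build a representation from a word's surface form, which the paper's method refines, is to compose a vector from hashed character n-grams, in the style of fastText (this sketch is not the authors' model; the dimensions and the deterministic toy embedding table are arbitrary choices for illustration):

```python
import hashlib

DIM, BUCKETS = 8, 1000

def ngram_ids(word, n=3):
    # Hash each character trigram of "<word>" into a bucket id.
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) % BUCKETS
            for g in grams]

# One embedding row per bucket (learned in practice; deterministic here).
table = [[(b * 31 + d) % 7 / 7.0 for d in range(DIM)] for b in range(BUCKETS)]

def oov_vector(word):
    # Compose an OOV word's vector as the mean of its n-gram embeddings.
    ids = ngram_ids(word)
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(DIM)]

v = oov_vector("unseenword")
print(len(v))  # 8
```

Because the vector depends only on the character sequence, any word, seen or not, gets a representation; the paper's contribution is predicting such representations jointly from surface form and context without the error propagation of prior two-step approaches.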


2020 ◽  
Author(s):  
Usman Naseem ◽  
Matloob Khushi ◽  
Vinay Reddy ◽  
Sakthivel Rajendran ◽  
Imran Razzak ◽  
...  

Abstract. Background: In recent years, with the growing volume of biomedical documents and advances in natural language processing algorithms, research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging because NER in the biomedical domain is: (i) often restricted by the limited amount of training data; (ii) ambiguous, as an entity can refer to multiple types and concepts depending on its context; and (iii) heavily reliant on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), bioALBERT, an effective domain-specific pre-trained language model trained on a large biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence to better learn context-dependent representations, and incorporated parameter-reduction strategies to minimise memory usage and reduce training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets covering four entity types. Performance increased for: (i) disease-type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug/chemical-type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene/protein-type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species-type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that it is robust and generalizable in recognizing biomedical entities in text.
We trained four variants of BioALBERT, which are available for the research community to use in future research.
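One of the ALBERT parameter-reduction strategies mentioned above is factorising the embedding matrix: instead of a single V×H table, a V×E table followed by an E×H projection, with E much smaller than H. The saving is simple arithmetic (the sizes below, a 30k vocabulary, H=768, E=128, are standard ALBERT-base figures, used here only for illustration):

```python
def embedding_params(vocab, hidden, factor=None):
    # Without factorisation: a vocab x hidden table.
    # With ALBERT-style factorisation: vocab x E plus E x hidden.
    if factor is None:
        return vocab * hidden
    return vocab * factor + factor * hidden

full = embedding_params(30000, 768)            # 23,040,000 parameters
factored = embedding_params(30000, 768, 128)   # 3,938,304 parameters
print(full, factored)
```

Roughly a six-fold reduction in embedding parameters, which (together with cross-layer parameter sharing) is what lets the model cut memory usage and training time while keeping the hidden size large.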


2018 ◽  
Vol 45 (3) ◽  
pp. 398-415 ◽  
Author(s):  
Ignacio Lizarralde ◽  
Cristian Mateos ◽  
Juan Manuel Rodriguez ◽  
Alejandro Zunino

Web Services have become essential to the software industry as they represent reusable, remotely accessible functionality and data. Since Web Services must be discovered before being consumed, many discovery approaches applying classic Information Retrieval techniques, which store and process textual service descriptions, have arisen. These efforts are affected by term mismatch: a description relevant to a query can be retrieved only if the two share many words. We present an approach to improve Web Service discoverability that automatically augments Web Service descriptions and can be used on top of existing syntactic-based approaches. We exploit Named Entity Recognition to identify entities in descriptions and expand them with information from public text corpora, for example Wikidata, mitigating term mismatch since the expansion exploits both synonyms and hypernyms. We evaluated our approach together with classical syntactic-based service discovery approaches on a real 1274-service dataset, achieving up to 15.06% better Recall, and improvements of up to 17% in Precision-at-1, 8% in Precision-at-2 and 4% in Precision-at-3.
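The expansion idea can be sketched simply: entities recognised in a service description are augmented with synonyms and hypernyms from an external resource, so a query can match the description even when the two share no surface words (the lexicon below is a toy stand-in for Wikidata lookups; the terms are invented):

```python
def expand(tokens, lexicon):
    # Augment description tokens with synonyms/hypernyms of
    # recognised entities, as retrieved from a resource like Wikidata.
    expanded = set(tokens)
    for t in tokens:
        expanded.update(lexicon.get(t, []))
    return expanded

lexicon = {"car": ["automobile", "vehicle"]}  # toy stand-in for Wikidata
desc = {"rent", "car", "service"}
query = {"automobile", "rental"}

print(query & desc)                   # no overlap before expansion
print(query & expand(desc, lexicon))  # overlap after expansion
```

Before expansion the query and description share no terms; after expansion they share "automobile", which is exactly the term-mismatch case the approach targets.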


This paper studies Kokborok named entity recognition (NER) using a rule-based approach. NER is an application of natural language processing and is considered a subtask of information extraction: it identifies named entities, such as names of persons, organizations and locations, for a specific task. Kokborok is the official language of the state of Tripura, in the north-eastern part of India; it is also widely spoken in other parts of north-eastern India and in adjoining areas of Bangladesh. NER systems are built using machine learning, rule-based approaches, or hybrid approaches combining the two. Rule-based NER draws on linguistic knowledge of the language, whereas machine learning requires a large amount of training data, and Kokborok, being a low-resource language, has very limited training data available. The rule-based approach requires linguistic rules, and its results do not depend on the amount of data available. We framed heuristic rules for identifying named entities based on linguistic knowledge of the language, and obtained encouraging results when testing our data against them. We also studied and framed rules for the counting system in Kokborok. The rule-based approach to NER is found suitable for a low-resource language with little digital material and no named-entity-tagged data, and we framed a suitable algorithm using these rules to obtain the desired results on the NER task.
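A rule-based NER system of this kind typically combines gazetteer lookups with contextual cue rules. A toy sketch (the gazetteer, cue words and example sentence below are illustrative inventions, not the paper's actual Kokborok rules):

```python
GAZETTEER = {"Tripura": "LOCATION", "Agartala": "LOCATION"}
PERSON_CUES = {"Shri", "Smt."}

def rule_based_ner(tokens):
    # Rule 1: a token listed in the gazetteer gets its listed tag.
    # Rule 2: a capitalised token following a person cue word
    #         (an honorific) is tagged as a person name.
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            tags[i] = GAZETTEER[tok]
        elif i > 0 and tokens[i - 1] in PERSON_CUES and tok[:1].isupper():
            tags[i] = "PERSON"
    return tags

print(rule_based_ner("Shri Debbarma visited Agartala".split()))
# ['O', 'PERSON', 'O', 'LOCATION']
```

The appeal for a low-resource setting is visible here: nothing is learned from data, so the system's quality depends only on the linguistic rules and word lists the authors can frame.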


2021 ◽  
Author(s):  
Robert Barnett ◽  
Christian Faggionato ◽  
Marieke Meelen ◽  
Sargai Yunshaab ◽  
Tsering Samdrup ◽  
...  

Modern Tibetan and Vertical (Traditional) Mongolian are scripts used by c.11m people, mostly within the People’s Republic of China. In terms of publicly available tools for NLP, these languages and their scripts are extremely low-resourced and under-researched. We set out firstly to survey the state of NLP for these languages, and secondly to facilitate research by historians and policy analysts working on Tibetan newspapers. Their primary need is to be able to carry out Named Entity Recognition (NER) in Modern Tibetan, a script which has no word or sentence boundaries and for which no segmenters have been developed. Working on LightTag, an online tagger using character-based modelling, we were able to produce gold-standard training data for NER for use with Modern Tibetan.
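For a script with no word or sentence boundaries, gold-standard NER data can be represented at the character level with BIO tags, which is what character-based modelling consumes. A minimal sketch of that encoding (Latin placeholder text stands in for Tibetan script; the `(start, end, label)` span format is an assumption for illustration, not LightTag's export format):

```python
def char_bio(text, spans):
    # spans: list of (start, end, label) character offsets for entities.
    # Emit one BIO tag per character, so no word segmentation is needed.
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(text, tags))

pairs = char_bio("lhasa", [(0, 5, "LOC")])
print(pairs)
```

Tagging each character rather than each word sidesteps the missing-segmenter problem entirely: the model learns entity boundaries directly from character sequences.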

