Information Extraction Tasks based on BERT and SpaCy on Tourism Domain

Author(s):  
Chantana Chantrapornchai ◽  
Aphisit Tunsakul

In this paper, we present two methodologies for extracting particular information from the full text returned by a search engine, in order to assist users. The approaches cover three tasks: named entity recognition (NER), text classification, and text summarization. The first step is building the training data and cleansing it. We consider tourism-domain data sets, such as restaurant, hotel, shopping, and tourism data crawled from websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps include sentence extraction, and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, a recognition model for a given entity type can be built. In the experiments, given review texts, we demonstrate building models to extract the desired entities, i.e., name, location, and facility, as well as relation type, to classify the reviews, and to summarize them. Two tools, spaCy and BERT, are used to compare performance on these tasks.
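The tagging step described above, marking known vocabulary terms in review text to create NER training data, can be sketched as follows. This is a minimal illustration: the vocabulary, labels, and sample sentence are assumptions, not data from the paper. The output uses the character-offset tuple format that spaCy's NER trainer consumes.

```python
# Sketch: turn a vocabulary of known tourism entities into spaCy-style
# NER training examples (text, {"entities": [(start, end, label)]}).
# VOCAB entries and labels below are illustrative assumptions.
VOCAB = {
    "Blue Lagoon Hotel": "NAME",
    "Phuket": "LOCATION",
    "pool": "FACILITY",
}

def tag_sentence(sentence):
    """Mark each vocabulary term found in the sentence with its
    character offsets, producing one spaCy-format training example."""
    entities = []
    for term, label in VOCAB.items():
        start = sentence.find(term)
        if start != -1:
            entities.append((start, start + len(term), label))
    return (sentence, {"entities": sorted(entities)})

example = tag_sentence("Blue Lagoon Hotel in Phuket has a large pool.")
```

A real pipeline would also handle overlapping and repeated matches; this sketch only records the first occurrence of each term.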

Author(s):  
Nadhia Salsabila Azzahra ◽  
Muhammad Okky Ibrohim ◽  
Junaedi Fahmi ◽  
Bagus Fajar Apriyanto ◽  
Oskar Riandi

2021 ◽  
pp. 107558
Author(s):  
Zhao Fang ◽  
Qiang Zhang ◽  
Stanley Kok ◽  
Ling Li ◽  
Anning Wang ◽  
...  

2019 ◽  
Vol 76 (8) ◽  
pp. 6399-6420 ◽  
Author(s):  
Qing Zhao ◽  
Dan Wang ◽  
Jianqiang Li ◽  
Faheem Akhtar

2015 ◽  
Vol 7 (1) ◽  
Author(s):  
Carla Abreu ◽  
Jorge Teixeira ◽  
Eugénio Oliveira

This work aims at defining and evaluating different techniques to automatically build temporal news sequences. The proposed approach is composed of three steps: (i) near-duplicate document detection; (ii) keyword extraction; (iii) news sequence creation. It is based on natural language processing, information extraction, named entity recognition, and supervised learning algorithms. The proposed methodology achieved a precision of 93.1% for news sequence creation.
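Step (i), near-duplicate document detection, can be approximated with word-shingle Jaccard similarity. This is a minimal sketch under an assumed shingle size and threshold, not the authors' implementation:

```python
# Near-duplicate detection via word shingles and Jaccard similarity.
# Shingle size k=3 and threshold 0.5 are illustrative assumptions.
def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.5):
    """Return index pairs of documents whose similarity meets the threshold."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(shingles(docs[i]), shingles(docs[j])) >= threshold:
                pairs.append((i, j))
    return pairs
```

For large corpora, the pairwise loop is usually replaced with MinHash/LSH to avoid quadratic cost.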


2021 ◽  
Author(s):  
Dao-Ling Huang ◽  
Quanlei Zeng ◽  
Yun Xiong ◽  
Shuixia Liu ◽  
Chaoqun Pang ◽  
...  

A study combining high-quality manual annotation with deep-learning natural language processing is reported, aimed at accurate named entity recognition (NER) in the biomedical literature. A home-made set of entity annotation guidelines for biomedical literature was constructed. Our manual annotations show an overall consistency of over 92% for all four entity types (gene, variant, disease, and species) with publicly available corpora previously annotated by other experts. A total of 400 full biomedical articles from PubMed were annotated following these guidelines. Both a BERT-based large model and a DistilBERT-based simplified model were constructed, trained, and optimized, for offline and online inference respectively. The NER F1-scores for gene, variant, disease, and species are 97.28%, 93.52%, 92.54%, and 95.76% for the BERT-based model, and 95.14%, 86.26%, 91.37%, and 89.92% for the DistilBERT-based model. The DistilBERT-based NER model thus retains 97.8%, 92.2%, 98.7%, and 93.9% of the BERT-based model's F1-scores for gene, variant, disease, and species, respectively. Moreover, both our BERT-based and DistilBERT-based NER models outperform the state-of-the-art model, BioBERT, indicating the value of training an NER model on biomedical-domain literature together with high-quality annotated datasets.
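The retention percentages quoted in the abstract follow directly from the per-entity F1-scores; the check below recomputes them from the reported values:

```python
# Recompute the DistilBERT-vs-BERT F1 retention ratios from the
# per-entity F1-scores reported in the abstract above.
bert_f1 = {"gene": 97.28, "variant": 93.52, "disease": 92.54, "species": 95.76}
distil_f1 = {"gene": 95.14, "variant": 86.26, "disease": 91.37, "species": 89.92}

retention = {k: round(100 * distil_f1[k] / bert_f1[k], 1) for k in bert_f1}
print(retention)
# expect {'gene': 97.8, 'variant': 92.2, 'disease': 98.7, 'species': 93.9}
```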

