Corpus Generation: Latest Research Papers

Semantic word similarity is a quantitative measure of how similar two words are in context. Evaluating semantic word similarity models requires a benchmark corpus. However, despite Urdu's millions of speakers and the large volume of digital Urdu text on the Internet, benchmark corpora are lacking for the Cross-lingual Semantic Word Similarity task for Urdu. This article reports our efforts to develop such a corpus. The newly developed corpus is based on the SemEval-2017 Task 2 English dataset and contains 1,945 cross-lingual English–Urdu word pairs. For each word pair, semantic similarity scores were assigned by 11 native Urdu speakers. In addition to corpus generation, this article reports evaluation results for a baseline approach, "Translation Plus Monolingual Analysis", for automatically identifying semantic similarity between English–Urdu word pairs. The results showed that the path length similarity measure performs better for words translated with Google and Bing. The newly created corpus and the evaluation results are freely available online for further research and development.
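The "Translation Plus Monolingual Analysis" baseline named in the abstract can be illustrated with a minimal sketch: translate the Urdu word into English, then score the resulting English pair with a path length measure. Everything below is an assumption made for illustration, not the authors' implementation: the toy hypernym taxonomy, the `TOY_TRANSLATIONS` lookup (standing in for a machine translation service), and the common definition of path similarity as 1 / (1 + shortest path length).

```python
from collections import deque

# Toy hypernym links (child -> parent); a stand-in for a real lexical taxonomy.
HYPERNYMS = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "car": "vehicle",
    "vehicle": "artifact",
    "animal": "entity",
    "artifact": "entity",
}

# Hypothetical Urdu -> English lookup standing in for a translation service.
TOY_TRANSLATIONS = {"کتا": "dog", "بلی": "cat", "گاڑی": "car"}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, word_a, word_b):
    """Path length similarity, assumed here to be 1 / (1 + shortest path)."""
    dist = shortest_path_length(graph, word_a, word_b)
    return None if dist is None else 1.0 / (1.0 + dist)


def cross_lingual_similarity(english_word, urdu_word, graph, translations):
    """Translate the Urdu word, then apply the monolingual measure."""
    translated = translations.get(urdu_word)
    if translated is None:
        return None  # no translation available for this word
    return path_similarity(graph, english_word, translated)


GRAPH = build_graph(HYPERNYMS)
for english, urdu in [("dog", "کتا"), ("dog", "بلی"), ("dog", "گاڑی")]:
    score = cross_lingual_similarity(english, urdu, GRAPH, TOY_TRANSLATIONS)
    print(english, urdu, score)
```

In this toy setup, identical concepts score 1.0 and the score shrinks as the two words sit further apart in the taxonomy. A real evaluation would compare such scores against the human-assigned similarity scores in the corpus.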


Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

2021. DOI: 10.18653/v1/2021.naacl-main.278

Author(s): Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou

Keyword(s): Language Model, Knowledge Graph, Corpus Generation


Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets

2021, pp. 86-99. DOI: 10.1007/978-3-030-89363-7_7

Author(s): Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John Velasco, Charibeth Cheng

Keyword(s): News Article, Corpus Generation


AI-driven Approach for Automatic Synthetic Patient Status Corpus Generation

2020 4th International Conference on Artificial Intelligence and Virtual Reality, 2020. DOI: 10.1145/3439133.3439141

Author(s): Boris Velichkov, Kristina Ivanova, Valeri Hristov, Ivan Borisov, Alexander Peychev, ...

Keyword(s): Patient Status, Corpus Generation


Machine Learning Approach for Multi-Layered Detection of Chemical Named Entities in Text

Cognitive Analytics, 2020, pp. 1496-1512. DOI: 10.4018/978-1-7998-2460-2.ch076

Author(s): Usha B. Biradar, Harsha Gurulingappa, Lokanath Khamari, Shashikala Giriyan

Keyword(s): Machine Learning, System Performance, Conditional Random Fields, Learning Approach, Named Entities, Machine Learning Approach, Feature Optimization, Multi Level, Corpus Generation, Substantial Effort

Identifying chemical named entities in text and then linking that information to biological events is of immense value in meeting the knowledge needs of pharmaceutical and chemical R&D. Over the past decade, a significant amount of research has addressed the identification of chemical named entities at the morphological level. However, a barrier remains in terms of the value offered to scientists at the chemistry level. The work described here therefore aims to overcome this information barrier by adapting a Conditional Random Fields-based approach to identify chemical named entities at several levels, namely the generic chemical level, the morphological level, and the chemistry level. Substantial effort has been invested in generating suitable multi-level annotated corpora. Recommended machine learning practices, such as active learning-based training corpus generation and feature optimization, have been applied systematically. Evaluation of system performance and benchmarking against other state-of-the-art approaches showed improved results.
