machine reading Latest Research Papers

This paper provides an empirical study of various techniques for information retrieval and machine reading comprehension in the context of an online education platform. More specifically, our application deals with answering conceptual students questions on technology courses. To that end we explore a pipeline consisting of a document retriever and a document reader. We find that using TF-IDF document representations for retrieving documents and RoBERTa deep learning model for reading documents and answering questions yields the best performance with respect to F-Score. In overall, without a fine-tuning step, deep learning models have a significant performance gap with comparison to previously reported F-scores on other datasets.

Download Full-text

Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

Wireless Communications and Mobile Computing ◽

10.1155/2021/5375334 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Changchang Zeng ◽

Shaobo Li

Keyword(s):

Reading Comprehension ◽

Language Processing ◽

Question Answering ◽

Multiple Choice ◽

Length Distribution ◽

Research Field ◽

Evaluation Framework ◽

Language Models ◽

Training Objective ◽

Machine Reading

Machine reading comprehension (MRC) is a challenging natural language processing (NLP) task. It has a wide application potential in the fields of question answering robots, human-computer interactions in mobile virtual reality systems, etc. Recently, the emergence of pretrained models (PTMs) has brought this research field into a new era, in which the training objective plays a key role. The masked language model (MLM) is a self-supervised training objective widely used in various PTMs. With the development of training objectives, many variants of MLM have been proposed, such as whole word masking, entity masking, phrase masking, and span masking. In different MLMs, the length of the masked tokens is different. Similarly, in different machine reading comprehension tasks, the length of the answer is also different, and the answer is often a word, phrase, or sentence. Thus, in MRC tasks with different answer lengths, whether the length of MLM is related to performance is a question worth studying. If this hypothesis is true, it can guide us on how to pretrain the MLM with a relatively suitable mask length distribution for MRC tasks. In this paper, we try to uncover how much of MLM’s success in the machine reading comprehension tasks comes from the correlation between masking length distribution and answer length in the MRC dataset. In order to address this issue, herein, (1) we propose four MRC tasks with different answer length distributions, namely, the short span extraction task, long span extraction task, short multiple-choice cloze task, and long multiple-choice cloze task; (2) four Chinese MRC datasets are created for these tasks; (3) we also have pretrained four masked language models according to the answer length distributions of these datasets; and (4) ablation experiments are conducted on the datasets to verify our hypothesis. The experimental results demonstrate that our hypothesis is true. On four different machine reading comprehension datasets, the performance of the model with correlation length distribution surpasses the model without correlation.

Download Full-text

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

10.36227/techrxiv.16924255 ◽

2021 ◽

Author(s):

Samreen Ahmed ◽

shakeel khoja

Keyword(s):

Reading Comprehension ◽

Machine Translation ◽

Large Scale ◽

Question Answering ◽

Training Data ◽

Significant Progress ◽

Rule Based ◽

Low Resource ◽

Machine Reading ◽

Answer Format

In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models getting remarkable performance on various language datasets. However, none of these models have been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension tasks consisting of 49k question Answers pairs in question, passage, and answer format. In UQuAD1.0, 45000 pairs of QA were generated by machine translation of the original SQuAD1.0 and approximately 4000 pairs via crowdsourcing. In this study, we used two types of MRC models: rule-based baseline and advanced Transformer-based models. However, we have discovered that the latter outperforms the others; thus, we have decided to concentrate solely on Transformer-based architectures. Using XLMRoBERTa and multi-lingual BERT, we acquire an F1 score of 0.66 and 0.63, respectively.

Download Full-text

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

10.36227/techrxiv.16924255.v1 ◽

2021 ◽

Author(s):

Samreen Ahmed ◽

shakeel khoja

Keyword(s):

Reading Comprehension ◽

Machine Translation ◽

Large Scale ◽

Question Answering ◽

Training Data ◽

Significant Progress ◽

Rule Based ◽

Low Resource ◽

Machine Reading ◽

Answer Format

In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models getting remarkable performance on various language datasets. However, none of these models have been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension tasks consisting of 49k question Answers pairs in question, passage, and answer format. In UQuAD1.0, 45000 pairs of QA were generated by machine translation of the original SQuAD1.0 and approximately 4000 pairs via crowdsourcing. In this study, we used two types of MRC models: rule-based baseline and advanced Transformer-based models. However, we have discovered that the latter outperforms the others; thus, we have decided to concentrate solely on Transformer-based architectures. Using XLMRoBERTa and multi-lingual BERT, we acquire an F1 score of 0.66 and 0.63, respectively.

Download Full-text

MRC4BioER: Joint extraction of biomedical entities and relations in the machine reading comprehension framework

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2021.103956 ◽

2021 ◽

pp. 103956

Author(s):

Cong Sun ◽

Zhihao Yang ◽

Lei Wang ◽

Yin Zhang ◽

Hongfei Lin ◽

...

Keyword(s):

Reading Comprehension ◽

Machine Reading

Download Full-text

Ensemble learning-based approach for improving generalization capability of machine reading comprehension systems

Neurocomputing ◽

10.1016/j.neucom.2021.08.095 ◽

2021 ◽

Vol 466 ◽

pp. 229-242

Author(s):

Razieh Baradaran ◽

Hossein Amirkhani

Keyword(s):

Reading Comprehension ◽

Ensemble Learning ◽

Generalization Capability ◽

Machine Reading

Download Full-text

machine reading
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Cross-Domain Slot Filling as Machine Reading Comprehension: A New Perspective

Data Augmentation Methods for Improving the Performance of Machine Reading Comprehension

Contrastive heterogeneous graphs learning for multi-hop machine reading comprehension

A Knowledge-aware Machine Reading Comprehension Framework for Dialogue Symptom Diagnosis

An Empirical Study of Information Retrieval and Machine Reading Comprehension Algorithms for an Online Education Platform

Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

MRC4BioER: Joint extraction of biomedical entities and relations in the machine reading comprehension framework

Ensemble learning-based approach for improving generalization capability of machine reading comprehension systems

Export Citation Format

machine readingRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Cross-Domain Slot Filling as Machine Reading Comprehension: A New Perspective

Data Augmentation Methods for Improving the Performance of Machine Reading Comprehension

Contrastive heterogeneous graphs learning for multi-hop machine reading comprehension

A Knowledge-aware Machine Reading Comprehension Framework for Dialogue Symptom Diagnosis

An Empirical Study of Information Retrieval and Machine Reading Comprehension Algorithms for an Online Education Platform

Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

MRC4BioER: Joint extraction of biomedical entities and relations in the machine reading comprehension framework

Ensemble learning-based approach for improving generalization capability of machine reading comprehension systems

machine reading
Recently Published Documents