Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

Machine reading comprehension (MRC) is a challenging natural language processing (NLP) task with wide application potential in question-answering robots, human-computer interaction in mobile virtual reality systems, and related fields. Recently, the emergence of pretrained models (PTMs) has brought this research field into a new era, in which the training objective plays a key role. The masked language model (MLM) is a self-supervised training objective widely used in various PTMs. As training objectives have developed, many variants of MLM have been proposed, such as whole word masking, entity masking, phrase masking, and span masking. These variants differ in the length of the masked token spans. Similarly, MRC tasks differ in answer length: the answer is often a word, a phrase, or a sentence. Thus, for MRC tasks with different answer lengths, whether the masking length of the MLM is related to performance is a question worth studying. If this hypothesis is true, it can guide us on how to pretrain the MLM with a suitable mask length distribution for MRC tasks. In this paper, we try to uncover how much of MLM’s success in MRC tasks comes from the correlation between the masking length distribution and the answer length in the MRC dataset. To address this issue, (1) we propose four MRC tasks with different answer length distributions, namely, the short span extraction task, long span extraction task, short multiple-choice cloze task, and long multiple-choice cloze task; (2) we create four Chinese MRC datasets for these tasks; (3) we pretrain four masked language models according to the answer length distributions of these datasets; and (4) we conduct ablation experiments on the datasets to verify our hypothesis. The experimental results support our hypothesis: on the four MRC datasets, the model pretrained with a masking length distribution correlated with the answer lengths outperforms the model pretrained without that correlation.
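The abstract does not describe how the masking lengths were actually drawn, so the following is only a minimal sketch of the general idea: estimate an empirical distribution over answer lengths and sample mask spans whose lengths follow it. The function names `length_distribution` and `sample_mask_spans`, the 15% mask ratio, and the rejection-style placement of spans are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import Counter


def length_distribution(answer_lengths):
    """Empirical probability of each answer length (in tokens)."""
    counts = Counter(answer_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def sample_mask_spans(num_tokens, dist, mask_ratio=0.15, rng=None):
    """Return sorted token indices to mask.

    Span lengths are drawn from `dist`; spans never overlap. The last
    span is truncated so that exactly round(num_tokens * mask_ratio)
    tokens are masked (an assumption; it slightly skews the lengths).
    """
    rng = rng or random.Random(0)
    lengths = list(dist)
    weights = [dist[length] for length in lengths]
    budget = max(1, round(num_tokens * mask_ratio))
    masked = set()
    attempts = 0
    # Cap the number of attempts so the loop always terminates.
    while len(masked) < budget and attempts < 10 * num_tokens:
        attempts += 1
        drawn = rng.choices(lengths, weights=weights)[0]
        span_len = min(drawn, budget - len(masked), num_tokens)
        start = rng.randrange(0, num_tokens - span_len + 1)
        span = range(start, start + span_len)
        if any(i in masked for i in span):
            continue  # overlaps an earlier span, so draw again
        masked.update(span)
    return sorted(masked)


if __name__ == "__main__":
    # Hypothetical answer lengths taken from some MRC training set.
    dist = length_distribution([1, 1, 2, 4])
    print(dist)
    print(sample_mask_spans(100, dist))
```

Under this sketch, a dataset dominated by short answers yields mostly short masked spans, while a dataset with sentence-length answers yields long ones, which is the kind of correlation the abstract sets out to test.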


From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project

AI Magazine ◽

10.1609/aimag.v41i4.5304 ◽

2020 ◽

Vol 41 (4) ◽

pp. 39-53

Author(s):

Peter Clark ◽

Oren Etzioni ◽

Tushar Khot ◽

Daniel Khashabi ◽

Bhavana Mishra ◽

...

Keyword(s):

New York ◽

Language Processing ◽

Question Answering ◽

Multiple Choice ◽

Language Models ◽

General Question ◽

8th Grade ◽

The Rich ◽

Full Solution ◽

Standardized Exams

AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even as recently as 2016, the best AI system could achieve merely 59.3 percent on an 8th grade science exam. This article reports success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90 percent on the exam’s nondiagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83 percent on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern natural language processing methods can result in mastery on this task. While not a full solution to general question-answering (the questions are limited to 8th grade multiple-choice science), it represents a significant milestone for the field.


Evaluation of Single-Span Models on Extractive Multi-Span Question-Answering

International journal of Web & Semantic Technology ◽

10.5121/ijwest.2021.12102 ◽

2021 ◽

Vol 12 (1) ◽

pp. 19-29

Author(s):

Marie-Anne Xu ◽

Rahul Khanna

Keyword(s):

Reading Comprehension ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Future Development ◽

Question Answering ◽

Consistent Performance ◽

Machine Reading ◽

Entire Dataset

Machine Reading Comprehension (MRC), particularly extractive closed-domain question-answering, is a prominent field in Natural Language Processing (NLP). Given a question and a passage or set of passages, a machine must be able to extract the appropriate answer from the passage(s). However, the majority of existing questions have only one answer, and more substantial testing on questions with multiple answers, or multi-span questions, has not yet been carried out. Thus, we introduce a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets. In addition, we run BERT-based models pre-trained for question-answering on our constructed dataset to evaluate their reading comprehension abilities. Runtime of base models on the entire dataset is approximately one day, while the runtime for all models on a third of the dataset is a little over two days. Among the three BERT-based models we ran, RoBERTa exhibits the highest consistent performance, regardless of size. We find that all our models perform similarly on this new, multi-span dataset compared to the single-span source datasets. While the models tested on the source datasets were slightly fine-tuned in order to return multiple answers, performance is similar enough to judge that task formulation does not drastically affect question-answering abilities. Our evaluations indicate that these models are indeed capable of adjusting to answer questions that require multiple answers. We hope that our findings will assist future development in question-answering and improve existing question-answering products and methods.


A Survey on Machine Reading Comprehension—Tasks, Evaluation Metrics and Benchmark Datasets

Applied Sciences ◽

10.3390/app10217640 ◽

2020 ◽

Vol 10 (21) ◽

pp. 7640

Author(s):

Changchang Zeng ◽

Shaobo Li ◽

Qin Li ◽

Jie Hu ◽

Jianjun Hu

Keyword(s):

Reading Comprehension ◽

Language Processing ◽

Large Scale ◽

Human Performance ◽

Research Field ◽

Evaluation Metrics ◽

Future Research ◽

Benchmark Datasets ◽

Comprehensive Survey ◽

Machine Reading

Machine Reading Comprehension (MRC) is a challenging Natural Language Processing (NLP) research field with wide real-world applications. The great progress of this field in recent years is mainly due to the emergence of large-scale datasets and deep learning. At present, many MRC models have already surpassed human performance on various benchmark datasets, despite the obvious gap between existing MRC models and genuine human-level reading comprehension. This shows the need for improving existing datasets, evaluation metrics, and models to move current MRC models toward “real” understanding. To address the current lack of a comprehensive survey of existing MRC tasks, evaluation metrics, and datasets, (1) we analyze 57 MRC tasks and datasets and propose a more precise classification method for MRC tasks with 4 different attributes; (2) we summarize 9 evaluation metrics of MRC tasks and 7 attributes and 10 characteristics of MRC datasets; and (3) we discuss key open issues in MRC research and highlight future research directions. In addition, we have collected, organized, and published our data on the companion website, where MRC researchers can directly access each MRC dataset, its papers, baseline projects, and the leaderboard.


Distill BERT to Traditional Models in Chinese Machine Reading Comprehension (Student Abstract)

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i10.7223 ◽

2020 ◽

Vol 34 (10) ◽

pp. 13901-13902

Author(s):

Xingkai Ren ◽

Ronghua Shi ◽

Fangfang Li

Keyword(s):

Reading Comprehension ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Representation Learning ◽

Original Model ◽

Language Models ◽

Machine Reading

Recently, unsupervised representation learning has been extremely successful in the field of natural language processing. More and more pre-trained language models have been proposed and have achieved state-of-the-art results, especially in machine reading comprehension. However, these pre-trained language models are huge, with hundreds of millions of parameters that have to be trained, and using them in industrial settings is quite time-consuming. Thus, we propose a method that distills the pre-trained language model into a traditional reading comprehension model, so that the distilled model has faster inference speed and higher accuracy in machine reading comprehension. We evaluate our proposed method on the Chinese machine reading comprehension dataset CMRC2018 and greatly improve the accuracy of the original model. To the best of our knowledge, we are the first to propose a method that applies distillation of a pre-trained language model to Chinese machine reading comprehension.


Keyword extraction method for machine reading comprehension based on natural language processing

Journal of Physics Conference Series ◽

10.1088/1742-6596/1955/1/012072 ◽

2021 ◽

Vol 1955 (1) ◽

pp. 012072

Author(s):

Ruiheng Li ◽

Xuan Zhang ◽

Chengdong Li ◽

Zhongju Zheng ◽

Zihang Zhou ◽

...

Keyword(s):

Reading Comprehension ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Extraction Method ◽

Keyword Extraction ◽

Machine Reading


Audio-aware Spoken Multiple-choice Question Answering with Pre-trained Language Models

IEEE/ACM Transactions on Audio Speech and Language Processing ◽

10.1109/taslp.2021.3120638 ◽

2021 ◽

pp. 1-1

Author(s):

Chia-Chih Kuo ◽

Kuan-Yu Chen ◽

Shang-Bao Luo

Keyword(s):

Question Answering ◽

Multiple Choice ◽

Multiple Choice Question ◽

Language Models


Evaluating Commonsense in Pre-Trained Language Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6523 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9733-9740 ◽

Cited By ~ 1

Author(s):

Xuhui Zhou ◽

Yue Zhang ◽

Leyang Cui ◽

Dandan Huang

Keyword(s):

Reading Comprehension ◽

Question Answering ◽

Deep Level ◽

Language Models ◽

Future Research ◽

Correct Prediction ◽

Test Cases ◽

Word Sense ◽

Training Set ◽

Text Data

Contextualized representations trained over large raw text data have given remarkable improvements for NLP tasks including question answering and reading comprehension. Prior work has shown that syntactic, semantic, and word sense knowledge is contained in such representations, which explains why they benefit such tasks. However, relatively little work has investigated the commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and a larger training set are bonuses. We additionally find that current models do poorly on tasks that require more inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to the correct prediction of the other. Interestingly, the models show confusion on these test cases, which suggests that they learn commonsense at the surface rather than the deep level. We publicly release a test set, named CATs, for future research.
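The dual-test-case idea can be made concrete with a small scoring sketch. The abstract does not say how the paired results were scored, so the function `dual_case_report` and the three quantities it returns are hypothetical choices for illustration only: plain accuracy over all samples, the share of pairs where both members are answered correctly, and the share of pairs where the two outcomes agree.

```python
def dual_case_report(pair_results):
    """Summarize model behaviour on dual (paired) test cases.

    `pair_results` is a list of (correct_a, correct_b) booleans, one
    tuple per pair of correlated test cases.
    """
    if not pair_results:
        raise ValueError("pair_results must not be empty")
    n = len(pair_results)
    # Accuracy over all 2 * n individual samples.
    accuracy = sum(int(a) + int(b) for a, b in pair_results) / (2 * n)
    # Pairs where both members were answered correctly.
    both_correct = sum(1 for a, b in pair_results if a and b) / n
    # Pairs where the two outcomes agree (both right or both wrong).
    consistent = sum(1 for a, b in pair_results if a == b) / n
    return {
        "accuracy": accuracy,
        "both_correct": both_correct,
        "consistent": consistent,
    }


if __name__ == "__main__":
    # Hypothetical outcomes for four dual pairs.
    outcomes = [(True, True), (True, False), (False, False), (False, True)]
    print(dual_case_report(outcomes))
```

A model whose plain accuracy is high but whose `both_correct` share is much lower would be answering one member of a pair without reliably answering its counterpart, which is the kind of confusion the abstract reports.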


Importance of the Single-Span Task Formulation to Extractive Question-answering

10.5121/csit.2020.101809 ◽

2020 ◽

Author(s):

Marie-Anne Xu ◽

Rahul Khanna

Keyword(s):

Reading Comprehension ◽

Future Development ◽

Recent Progress ◽

Question Answering ◽

Span Task ◽

Consistent Performance ◽

Machine Reading

Recent progress in machine reading comprehension and question-answering has allowed machines to reach and even surpass human question-answering. However, the majority of these questions have only one answer, and more substantial testing on questions with multiple answers, or multi-span questions, has not yet been carried out. Thus, we introduce a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets. In addition, we run BERT-based models pre-trained for question-answering on our constructed dataset to evaluate their reading comprehension abilities. Among the three BERT-based models we ran, RoBERTa exhibits the highest consistent performance, regardless of size. We find that all our models perform similarly on this new, multi-span dataset (21.492% F1) compared to the single-span source datasets (~33.36% F1). While the models tested on the source datasets were slightly fine-tuned, performance is similar enough to judge that task formulation does not drastically affect question-answering abilities. Our evaluations indicate that these models are indeed capable of adjusting to answer questions that require multiple answers. We hope that our findings will assist future development in question-answering and improve existing question-answering products and methods.
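For readers unfamiliar with the F1 figures quoted above, the following sketch shows the token-overlap F1 commonly used for extractive question-answering. The abstract does not state how F1 was computed for multi-span answers, so `multi_span_f1`, which pools all spans into one bag of tokens, is only one possible convention and an assumption here; whitespace tokenization and lowercasing are simplifications as well.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_span_f1(predicted_spans, gold_spans):
    """Score a multi-span answer by pooling the tokens of all spans.

    Pooling is an assumption made for this sketch; per-span matching
    is another reasonable choice.
    """
    return token_f1(" ".join(predicted_spans), " ".join(gold_spans))


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat"))
    print(multi_span_f1(["red", "blue"], ["blue", "green"]))
```

Because the score depends on how spans are pooled or matched, F1 values from single-span and multi-span settings are only comparable when the same convention is applied to both.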


Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00305 ◽

2020 ◽

Vol 8 ◽

pp. 141-155

Author(s):

Kai Sun ◽

Dian Yu ◽

Dong Yu ◽

Claire Cardie

Keyword(s):

Reading Comprehension ◽

Prior Knowledge ◽

Data Augmentation ◽

Multiple Choice ◽

Model Performance ◽

Free Form ◽

World Knowledge ◽

Domain Specific ◽

Significant Performance ◽

Machine Reading

Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C3 to present great challenges to existing systems, as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C3 can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C3 is available at https://dataset.org/c3/ .


Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016276 ◽

2019 ◽

Vol 33 ◽

pp. 6276-6283 ◽

Cited By ~ 6

Author(s):

Zhipeng Chen ◽

Yiming Cui ◽

Wentao Ma ◽

Shijin Wang ◽

Guoping Hu

Keyword(s):

Reading Comprehension ◽

Mutual Information ◽

Spatial Attention ◽

State Of The Art ◽

Multiple Choice ◽

Multiple Choice Questions ◽

Attention Model ◽

Novel Approach ◽

Proposed Model ◽

Machine Reading

Machine Reading Comprehension (MRC) with multiple-choice questions requires the machine to read a given passage and select the correct answer among several candidates. In this paper, we propose a novel approach called the Convolutional Spatial Attention (CSA) model, which can better handle MRC with multiple-choice questions. The proposed model can fully extract the mutual information among the passage, the question, and the candidates to form enriched representations. Furthermore, to merge various attention results, we propose to use a convolutional operation to dynamically summarize the attention values within regions of different sizes. Experimental results show that the proposed model gives substantial improvements over various state-of-the-art systems on both the RACE and SemEval-2018 Task11 datasets.
