Estimation of Gap Between Current Language Models and Human Performance

Attending to Entities for Better Text Understanding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6254 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7554-7561

Author(s):

Pengxiang Cheng ◽

Katrin Erk

Keyword(s):

Large Scale ◽

Human Performance ◽

State Of The Art ◽

Syntactic Structure ◽

Semantic Knowledge ◽

Training Data ◽

Language Models ◽

Long Distance ◽

Future Directions ◽

Text Understanding

Recent progress in NLP has seen the development of large-scale pre-trained language models (GPT, BERT, XLNet, etc.) based on the Transformer (Vaswani et al. 2017), and on a range of end tasks such models have achieved state-of-the-art results, approaching human performance. This demonstrates the power of the stacked self-attention architecture when paired with a sufficient number of layers and a large amount of pre-training data. However, on tasks that require complex and long-distance reasoning, where surface-level cues are not enough, there is still a large gap between the pre-trained models and human performance. Strubell et al. (2018) recently showed that it is possible to inject knowledge of syntactic structure into a model through supervised self-attention. We conjecture that a similar injection of semantic knowledge, in particular coreference information, into an existing model would improve performance on such complex problems. On the LAMBADA (Paperno et al. 2016) task, we show that a model trained from scratch with coreference as auxiliary supervision for self-attention outperforms the largest GPT-2 model, setting a new state of the art, while containing only a tiny fraction of the parameters of GPT-2. We also conduct a thorough analysis of different variants of model architectures and supervision configurations, suggesting future directions for applying similar techniques to other problems.
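The idea of using coreference as auxiliary supervision for self-attention can be illustrated with a small sketch. The abstract does not give the loss used in the paper, so everything below is an assumption made for illustration: one attention head is pushed to place probability mass on coreferent tokens through a negative log-likelihood term, and the names `coref_attention_loss`, `antecedents` and `aux_weight` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coref_attention_loss(attn_scores, antecedents):
    """Auxiliary loss for ONE attention head (illustrative assumption).

    attn_scores[i][j] is the raw score of token i attending to token j.
    antecedents maps a token index to the set of token indices it corefers
    with; tokens without an entry contribute nothing to the loss.
    """
    losses = []
    for i, row in enumerate(attn_scores):
        targets = antecedents.get(i)
        if not targets:
            continue
        probs = softmax(row)
        # Probability mass the head assigns to the coreferent tokens.
        mass = sum(probs[j] for j in targets)
        losses.append(-math.log(mass))
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(lm_loss, attn_scores, antecedents, aux_weight=1.0):
    """Language-modelling loss plus the weighted auxiliary term."""
    return lm_loss + aux_weight * coref_attention_loss(attn_scores, antecedents)


# Token 3 (say, a pronoun) corefers with token 0 (its antecedent).
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
focused = [row[:] for row in uniform]
focused[3] = [5.0, 0.0, 0.0, 0.0]
loss_uniform = coref_attention_loss(uniform, {3: {0}})
loss_focused = coref_attention_loss(focused, {3: {0}})
```

In this toy setup the loss falls as the head concentrates its attention on the antecedent, which is the behaviour the auxiliary supervision is meant to encourage.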


QASC: A Dataset for Question Answering via Sentence Composition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6319 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8082-8090

Author(s):

Tushar Khot ◽

Peter Clark ◽

Michal Guerquin ◽

Peter Jansen ◽

Ashish Sabharwal

Keyword(s):

Common Sense ◽

Human Performance ◽

Question Answering ◽

State Of The Art ◽

Multiple Choice ◽

Training Data ◽

Language Models ◽

Current State ◽

New Concepts ◽

Large Corpus

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging, as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved, as this model still lags behind human performance by 20%.
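A toy sketch may help show why two-step retrieval needs to introduce new concepts. The abstract does not specify the retrieval procedure, so the keyword-overlap scoring, the stop-word list, the function names and the example sentences below are all assumptions made for illustration, not the authors' method.

```python
# Hypothetical stop-word list, kept tiny for the toy example.
STOP = {"a", "an", "the", "of", "is", "are", "to", "can", "be",
        "used", "what", "for", "in"}


def content(text):
    """Lower-cased content words of a sentence, minus stop words."""
    words = {w.strip(".,?").lower() for w in text.split()}
    return words - STOP


def two_step_retrieve(question, answer, corpus):
    """Return the best (fact1, fact2) pair, or None if no chain is found.

    Step 1: fact1 must share words with the question.
    Step 2: fact2 must share a NEW concept introduced by fact1 (a word in
    neither the question nor the answer) and also share words with the answer.
    """
    q, a = content(question), content(answer)
    best = None
    for f1 in corpus:
        t1 = content(f1)
        if not (t1 & q):
            continue
        new_terms = t1 - q - a  # concepts that fact1 introduces
        for f2 in corpus:
            if f2 == f1:
                continue
            t2 = content(f2)
            if (t2 & new_terms) and (t2 & a):
                score = len(t1 & q) + len(t2 & new_terms) + len(t2 & a)
                if best is None or score > best[0]:
                    best = (score, f1, f2)
    return best[1:] if best else None


# Illustrative toy corpus (not taken from the dataset files).
corpus = [
    "Differential heating of air produces wind.",
    "Wind is used for producing electricity.",
    "Air is a mixture of gases.",
    "Solar panels are used for electricity production.",
]
chain = two_step_retrieve(
    "Differential heating of air can be harnessed for what?",
    "electricity production",
    corpus,
)
```

Here the bridging concept "wind" appears in neither the question nor the answer, so a single retrieval pass keyed on the question alone would not surface the second fact.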


WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6399 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8732-8740 ◽

Cited By ~ 1

Author(s):

Keisuke Sakaguchi ◽

Ronan Le Bras ◽

Chandra Bhagavatula ◽

Yejin Choi

Keyword(s):

Large Scale ◽

Human Performance ◽

State Of The Art ◽

Bias Reduction ◽

Training Data ◽

Language Models ◽

Systematic Bias ◽

Commonsense Reasoning ◽

Word Associations ◽

Key Steps

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense? To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4 – 79.1%, which is ∼15-35% (absolute) below the human performance of 94.0%, depending on the amount of training data allowed (2% – 100% respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (→ 90.1%), DPR (→ 93.1%), COPA (→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
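The gap quoted in the abstract is an absolute difference in accuracy points, and the arithmetic can be checked directly. The sketch below uses only the figures given above (94.0% human accuracy; 59.4% and 79.1% for the best methods at 2% and 100% of the training data); the variable and function names are illustrative.

```python
HUMAN_ACCURACY = 94.0  # human performance on WinoGrande, in percent

# Best reported model accuracy by share of training data allowed.
model_accuracy = {
    "2% of training data": 59.4,
    "100% of training data": 79.1,
}


def absolute_gap(human, model):
    """Gap in accuracy points (absolute, not relative)."""
    return round(human - model, 1)


gaps = {
    setting: absolute_gap(HUMAN_ACCURACY, acc)
    for setting, acc in model_accuracy.items()
}

for setting, gap in gaps.items():
    print(f"{setting}: {gap} points below human performance")
```

The two gaps come out at 34.6 and 14.9 points, matching the "∼15-35% (absolute)" range stated in the abstract.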


The limits of human performance

Essays in Biochemistry ◽

10.1042/bse0440011 ◽

2008 ◽

Vol 44 ◽

pp. 11-26 ◽

Cited By ~ 14

Author(s):

Ralph Beneke ◽

Dieter Böning

Keyword(s):

Human Performance ◽

Safety Margin ◽

Afferent Input ◽

Metabolic Energy ◽

Human Organism ◽

Mechanical Resistance ◽

As Doping ◽

Gene And Protein Expression ◽

Underlying Mechanisms ◽

Biochemical Research

Human performance, defined by mechanical resistance and distance per time, includes human, task and environmental factors, all interrelated. It requires metabolic energy provided by anaerobic and aerobic metabolic energy sources. These sources have specific limitations in the capacity and rate to provide re-phosphorylation energy, which determine individual ratios of aerobic and anaerobic metabolic power and their sustainability. In healthy athletes, the limits on providing and utilizing metabolic energy are multifactorial, carefully matched and include a safety margin imposed in order to protect the integrity of the human organism under maximal effort. Perception of afferent input associated with effort leads to conscious or unconscious decisions to modulate or terminate performance; however, the underlying mechanisms of cerebral control are not fully understood. The idea of pushing the limits of performance with the help of biochemicals is two millennia old. Biochemical findings resulted in highly effective substances widely used to increase performance in daily life, during preparation for sport events and during competition, but many of them must be considered doping and are therefore illegal. Supplements and food have ergogenic potential; however, numerous concepts remain controversial with respect to legality and, in particular, the evidence for their usefulness and risks. The effect of evidence-based nutritional strategies on adaptations in terms of gene and protein expression that occur in skeletal muscle during and after exercise training sessions is largely unknown. Biochemical research is essential for a better understanding of the basic mechanisms causing fatigue and the regulation of the dynamic adaptation to physical and mental training.


1880: Assessment of Basic Human Performance Resources Predicts Performance of Ureterorenoscopy in Human Cadavers

The Journal of Urology ◽

10.1016/s0022-5347(18)39072-4 ◽

2004 ◽

Vol 171 (4S) ◽

pp. 496-497

Author(s):

Edward D. Matsumoto ◽

George V. Kondraske ◽

Lucas Jacomides ◽

Kenneth Ogan ◽

Margaret S. Pearle ◽

...

Keyword(s):

Human Performance ◽

Human Cadavers


Short Stress State Questionnaire

European Journal of Psychological Assessment ◽

10.1027/1015-5759/a000200 ◽

2015 ◽

Vol 31 (1) ◽

pp. 20-30 ◽

Cited By ~ 28

Author(s):

William S. Helton ◽

Katharina Näswall

Keyword(s):

Stress State ◽

Factor Structure ◽

Human Performance ◽

Self Report ◽

Confirmatory Factor ◽

Stress States ◽

Related Stress ◽

Using Data ◽

Task Conditions ◽

Multiple Samples

Conscious appraisals of stress, or stress states, are an important aspect of human performance. This article presents evidence supporting the validity and measurement characteristics of a short multidimensional self-report measure of stress state, the Short Stress State Questionnaire (SSSQ; Helton, 2004). The SSSQ measures task engagement, distress, and worry. A confirmatory factor analysis of the SSSQ using data pooled from multiple samples suggests that the SSSQ has a three-factor structure and that post-task changes are due not to changes in factor structure but to mean-level changes (state changes). In addition, the SSSQ demonstrates sensitivity to task stressors in line with hypotheses. Different task conditions elicited unique patterns of stress state on the three factors of the SSSQ, in line with prior predictions. The 24-item SSSQ is a valid measure of stress state that may be useful to researchers interested in conscious appraisals of task-related stress.
