Benchmarking Text Understanding Systems to Human Performance: An Exploration.

Recent progress in NLP has seen the development of large-scale pre-trained language models (GPT, BERT, XLNet, etc.) based on the Transformer (Vaswani et al. 2017), and on a range of end tasks such models have achieved state-of-the-art results, approaching human performance. This clearly demonstrates the power of the stacked self-attention architecture when paired with a sufficient number of layers and a large amount of pre-training data. However, on tasks that require complex and long-distance reasoning, where surface-level cues are not enough, there is still a large gap between the pre-trained models and human performance. Strubell et al. (2018) recently showed that it is possible to inject knowledge of syntactic structure into a model through supervised self-attention. We conjecture that a similar injection of semantic knowledge, in particular coreference information, into an existing model would improve performance on such complex problems. On the LAMBADA (Paperno et al. 2016) task, we show that a model trained from scratch with coreference as auxiliary supervision for self-attention outperforms the largest GPT-2 model, setting a new state of the art while containing only a tiny fraction of GPT-2's parameters. We also conduct a thorough analysis of different variants of model architectures and supervision configurations, suggesting future directions for applying similar techniques to other problems.
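As a rough illustration of what "coreference as auxiliary supervision for self-attention" could look like, the sketch below computes scaled dot-product attention weights and adds a loss term that rewards each mention for attending to its antecedent. This is not the paper's implementation: the function names, the `antecedents` encoding (index of the antecedent token, or `None`), the placeholder language-model loss, and the 0.5 weighting are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights; one row per query token."""
    dim = len(queries[0])
    return [
        softmax([
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ])
        for q in queries
    ]


def coref_attention_loss(weights, antecedents):
    """Mean negative log of the attention mass each mention puts on its antecedent.

    antecedents[i] is the index of token i's antecedent, or None when token i
    has no annotated antecedent (hypothetical encoding, chosen for this sketch).
    """
    losses = [
        -math.log(weights[i][a] + 1e-12)
        for i, a in enumerate(antecedents)
        if a is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy input: three tokens with 2-dimensional query and key vectors.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = attention_weights(queries, keys)

# Assume token 2 corefers with token 0; the other tokens are unsupervised.
antecedents = [None, None, 0]
aux_loss = coref_attention_loss(weights, antecedents)

# The auxiliary term is added to the main training loss with a tunable weight.
lm_loss = 2.0  # placeholder value standing in for the language-model loss
total_loss = lm_loss + 0.5 * aux_loss
```

In a real model the supervised weights would come from one designated attention head, and the auxiliary term would be backpropagated together with the main objective.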


Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00021 ◽

2018 ◽

Vol 6 ◽

pp. 287-302 ◽

Cited By ~ 21

Author(s):

Johannes Welbl ◽

Pontus Stenetorp ◽

Sebastian Riedel

Keyword(s):

Reading Comprehension ◽

Human Performance ◽

Relevant Information ◽

Test Set ◽

Textual Evidence ◽

Single Sentence ◽

Text Understanding ◽

Query Answer ◽

Multiple Documents ◽

Combine Evidence

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence, effectively performing multi-hop (also called multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, and providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 54.5% on an annotated test set, compared to human performance at 85.0%, leaving ample room for improvement.


The limits of human performance

Essays in Biochemistry ◽

10.1042/bse0440011 ◽

2008 ◽

Vol 44 ◽

pp. 11-26 ◽

Cited By ~ 14

Author(s):

Ralph Beneke ◽

Dieter Böning

Keyword(s):

Human Performance ◽

Safety Margin ◽

Afferent Input ◽

Metabolic Energy ◽

Human Organism ◽

Mechanical Resistance ◽

As Doping ◽

Gene And Protein Expression ◽

Underlying Mechanisms ◽

Biochemical Research

Human performance, defined by mechanical resistance and distance per time, includes human, task and environmental factors, all interrelated. It requires metabolic energy provided by anaerobic and aerobic metabolic energy sources. These sources have specific limitations in the capacity and rate to provide re-phosphorylation energy, which determines individual ratios of aerobic and anaerobic metabolic power and their sustainability. In healthy athletes, the limits on providing and utilizing metabolic energy are multifactorial, carefully matched and include a safety margin imposed in order to protect the integrity of the human organism under maximal effort. Perception of afferent input associated with effort leads to conscious or unconscious decisions to modulate or terminate performance; however, the underlying mechanisms of cerebral control are not fully understood. The idea of pushing the limits of performance with the help of biochemicals is two millennia old. Biochemical findings have resulted in highly effective substances widely used to increase performance in daily life, during preparation for sport events and during competition, but many of them must be considered doping and are therefore illegal. Supplements and food have ergogenic potential; however, numerous concepts remain controversial with respect to legality and, in particular, the evidence for their usefulness and risks. The effect of evidence-based nutritional strategies on adaptations in terms of gene and protein expression that occur in skeletal muscle during and after exercise training sessions is largely unknown. Biochemical research is essential for a better understanding of the basic mechanisms causing fatigue and the regulation of the dynamic adaptation to physical and mental training.


1880: Assessment of Basic Human Performance Resources Predicts Performance of Ureterorenoscopy in Human Cadavers

The Journal of Urology ◽

10.1016/s0022-5347(18)39072-4 ◽

2004 ◽

Vol 171 (4S) ◽

pp. 496-497

Author(s):

Edward D. Matsumoto ◽

George V. Kondraske ◽

Lucas Jacomides ◽

Kenneth Ogan ◽

Margaret S. Pearle ◽

...

Keyword(s):

Human Performance ◽

Human Cadavers


Short Stress State Questionnaire

European Journal of Psychological Assessment ◽

10.1027/1015-5759/a000200 ◽

2015 ◽

Vol 31 (1) ◽

pp. 20-30 ◽

Cited By ~ 28

Author(s):

William S. Helton ◽

Katharina Näswall

Keyword(s):

Stress State ◽

Factor Structure ◽

Human Performance ◽

Self Report ◽

Confirmatory Factor ◽

Stress States ◽

Related Stress ◽

Using Data ◽

Task Conditions ◽

Multiple Samples

Conscious appraisals of stress, or stress states, are an important aspect of human performance. This article presents evidence supporting the validity and measurement characteristics of a short multidimensional self-report measure of stress state, the Short Stress State Questionnaire (SSSQ; Helton, 2004). The SSSQ measures task engagement, distress, and worry. A confirmatory factor analysis of the SSSQ using data pooled from multiple samples suggests that the SSSQ does have a three-factor structure and that post-task changes are due not to changes in factor structure but to mean-level changes (state changes). In addition, the SSSQ demonstrates sensitivity to task stressors in line with hypotheses. Different task conditions elicited unique patterns of stress state on the three factors of the SSSQ, in line with prior predictions. The 24-item SSSQ is a valid measure of stress state which may be useful to researchers interested in conscious appraisals of task-related stress.
