Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals

2021 · Vol 9 · pp. 160-175
Author(s): Yanai Elazar, Shauli Ravfogel, Alon Jacovi, Yoav Goldberg

Abstract: A growing body of work makes use of probing to investigate the workings of neural models, which are often considered black boxes. Recently, an ongoing debate has emerged about the limitations of the probing paradigm. In this work, we point out the inability to infer behavioral conclusions from probing results, and offer an alternative method that focuses on how the information is being used, rather than on what information is encoded. Our method, Amnesic Probing, follows the intuition that the utility of a property for a given task can be assessed by measuring the influence of a causal intervention that removes it from the representation. Equipped with this new analysis tool, we can ask questions that were not possible before; for example, is part-of-speech information important for word prediction? We perform a series of analyses on BERT to answer these types of questions. Our findings demonstrate that conventional probing performance is not correlated with task importance, and we call for increased scrutiny of claims that draw behavioral or causal conclusions from probing results.
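The intervention at the heart of the method can be sketched with a linear probe: train a classifier for the property on the representations, project the representations onto the probe's nullspace so the property is no longer linearly decodable, and compare downstream behavior before and after. Below is a minimal single-round sketch in Python; the paper's actual removal operation, INLP, iterates this step until the probe is at chance, and the function names here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projector(W):
    """Return P such that x @ P removes the component of x lying in
    the row space of W (the directions the probe relies on)."""
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    basis = vt[s > 1e-10]          # orthonormal basis of the row space
    return np.eye(W.shape[1]) - basis.T @ basis

def amnesic_intervention(reps, property_labels):
    """One round of property removal: fit a linear probe for the
    property, then project the representations onto its nullspace."""
    probe = LogisticRegression(max_iter=1000).fit(reps, property_labels)
    return reps @ nullspace_projector(probe.coef_)

# Usage sketch: compare a model's word-prediction accuracy on `reps`
# versus on `amnesic_intervention(reps, pos_tags)`; a large drop
# suggests the property is actually used, not merely encoded.
```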

2021 · Vol 72 · pp. 1385-1470
Author(s): Alexandra N. Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, ...

Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreement on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments that are potentially in disagreement. We systematically compare these approaches by training them on each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models, because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Second, we observed a strong dataset effect: with substantial datasets that provide many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels, under both hard and soft evaluation. When those conditions do not hold, however, leveraging both gold and soft labels generally achieved the best results under hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
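The "training directly with soft labels" option amounts to minimizing cross-entropy against each item's empirical distribution of annotator judgments instead of a single aggregated label. A minimal PyTorch sketch with hypothetical tensors:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, judgment_counts):
    """Cross-entropy against the empirical distribution of annotator
    judgments rather than a single aggregated (gold) label.

    logits:          (batch, n_classes) model outputs
    judgment_counts: (batch, n_classes) raw annotator counts per class
    """
    soft_targets = judgment_counts / judgment_counts.sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Example: 3 annotators split 2-1 on item 0 and agree 3-0 on item 1.
logits = torch.randn(2, 2, requires_grad=True)
counts = torch.tensor([[2.0, 1.0], [3.0, 0.0]])
loss = soft_label_loss(logits, counts)
loss.backward()
```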


2017 · pp. 35-46
Author(s): Irene Doval

This paper reviews the author's experience of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, consisting mostly of German and Spanish fictional texts, as part of an ongoing process of annotating the corpus with part-of-speech information. This study discusses the specific problems encountered so far: on the one hand, tagging performance degrades significantly when taggers are applied to fictional data; on the other, pre-existing annotation schemes are all language-specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.
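A common tagset of this kind amounts to a mapping from each language-specific tag inventory into shared coarse categories. The sketch below is purely illustrative: the tag names (STTS-style for German, EAGLES-style for Spanish) and the mapping are hypothetical simplifications, not the PaGeS scheme itself.

```python
# Hypothetical mappings into a shared coarse tagset; the actual PaGeS
# scheme and tag names may differ.
STTS_TO_COMMON = {        # German (STTS-style tags)
    "NN": "NOUN", "NE": "PROPN", "VVFIN": "VERB", "ADJA": "ADJ",
}
SPANISH_TO_COMMON = {     # Spanish (EAGLES-style tags)
    "NC": "NOUN", "NP": "PROPN", "VMI": "VERB", "AQ": "ADJ",
}

def to_common(tag, lang):
    table = STTS_TO_COMMON if lang == "de" else SPANISH_TO_COMMON
    return table.get(tag, "X")    # "X" flags tags needing manual review

assert to_common("NN", "de") == to_common("NC", "es") == "NOUN"
```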


2016 · Vol 105 (1) · pp. 63-76
Author(s): Theresa Guinard

Abstract: Morphological analysis (finding the component morphemes of a word and tagging the morphemes with part-of-speech information) is a useful preprocessing step in many natural language processing applications, especially for synthetic languages. Compound words in the constructed language Esperanto are formed by straightforward agglutination, but for many words there is more than one possible sequence of component morphemes; usually, one segmentation is more semantically probable than the others. This paper presents a modified n-gram Markov model that finds the most probable segmentation of any Esperanto word, where the model's states represent morpheme part-of-speech and semantic classes. The overall segmentation accuracy was over 98% on a set of presegmented dictionary words.
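The underlying dynamic program can be sketched as a Viterbi search over all splits of a word into known morphemes, with transitions scored by n-gram probabilities over morpheme classes. A simplified bigram version with a tiny hypothetical lexicon (the paper's model also distinguishes semantic classes and uses richer statistics):

```python
import math

# Hypothetical morpheme lexicon and class-bigram probabilities; the
# paper's classes combine part-of-speech and semantic information.
LEXICON = {"hund": "ROOT", "o": "NOUN_END", "ej": "SUFFIX", "mal": "PREFIX"}
BIGRAM = {("<s>", "ROOT"): 0.5, ("ROOT", "NOUN_END"): 0.6,
          ("ROOT", "SUFFIX"): 0.3, ("SUFFIX", "NOUN_END"): 0.7,
          ("<s>", "PREFIX"): 0.2, ("PREFIX", "ROOT"): 0.8}

def best_segmentation(word):
    """Viterbi over all splits of `word` into lexicon morphemes,
    scored by class-bigram log-probabilities."""
    # best[i] = (log-prob, segmentation, last class) for word[:i]
    best = {0: (0.0, [], "<s>")}
    for i in range(1, len(word) + 1):
        for j in range(i):
            if j not in best or word[j:i] not in LEXICON:
                continue
            lp, seg, prev = best[j]
            cls = LEXICON[word[j:i]]
            p = BIGRAM.get((prev, cls), 1e-6)   # smooth unseen transitions
            cand = (lp + math.log(p), seg + [word[j:i]], cls)
            if i not in best or cand[0] > best[i][0]:
                best[i] = cand
    return best.get(len(word), (float("-inf"), None, None))[1]

print(best_segmentation("hundejo"))   # -> ['hund', 'ej', 'o']
```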


2021 · Vol 2021 · pp. 1-11
Author(s): Xiaoqiang Chi, Yang Xiang

Paraphrase generation is an essential yet challenging task in natural language processing. Neural-network-based approaches to paraphrase generation have achieved remarkable success in recent years, but previous approaches ignore linguistic knowledge such as part-of-speech information, regardless of its availability. The underlying assumption is that neural nets can learn such information implicitly when given sufficient data; however, it is difficult for them to do so when data are scarce. In this work, we probe the efficacy of explicit part-of-speech information for paraphrase generation in low-resource scenarios. To this end, we devise three mechanisms for fusing part-of-speech information within the sequence-to-sequence framework. We demonstrate the utility of part-of-speech information in low-resource paraphrase generation through extensive experiments on multiple datasets of varying sizes and genres.
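One way to realize such a fusion mechanism is to embed POS tags separately and concatenate them with the word embeddings at the encoder input. The sketch below shows only this input-level variant; the dimensions and layer choices are hypothetical, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PosFusedEncoder(nn.Module):
    """Encoder that fuses POS information at the input layer by
    concatenating word and POS embeddings (one of several possible
    fusion points in a sequence-to-sequence model)."""
    def __init__(self, vocab_size, pos_size, word_dim=256, pos_dim=32,
                 hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_size, pos_dim)
        self.rnn = nn.GRU(word_dim + pos_dim, hidden, batch_first=True)

    def forward(self, word_ids, pos_ids):
        fused = torch.cat([self.word_emb(word_ids),
                           self.pos_emb(pos_ids)], dim=-1)
        return self.rnn(fused)   # (outputs, final hidden) for the decoder

# Usage sketch with hypothetical sizes:
enc = PosFusedEncoder(vocab_size=10000, pos_size=46)
out, h = enc(torch.randint(0, 10000, (8, 20)), torch.randint(0, 46, (8, 20)))
```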


2002 · Vol 8 (2-3) · pp. 193-207
Author(s): Tokunaga Takenobu, Kimura Kenji, Ogibayashi Hironori, Tanaka Hozumi

This paper explores the effectiveness of index terms more complex than the single words used in conventional information retrieval systems. Retrieval is done in two phases: in the first, a conventional retrieval method (the Okapi system) is used; in the second, complex index terms, such as syntactic relations and single words with part-of-speech information, are introduced to rerank the results of the first phase. We evaluated the effectiveness of the different types of index terms in experiments using the TREC-7 test collection and 50 queries; retrieval effectiveness improved for 32 of the 50 queries. Based on this investigation, we then introduce a method to select effective index terms using a decision tree. Further experiments with the same test collection showed that retrieval effectiveness improved for 25 of the 50 queries.
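The two-phase design is easy to sketch: phase one returns candidates with Okapi (BM25-style) scores, and phase two rescores each candidate by its overlap with the complex index terms extracted from the query. The extraction function below is a crude stand-in for real syntactic-relation and POS-tagged terms; everything here is hypothetical.

```python
def extract_complex_terms(text):
    """Placeholder for complex index-term extraction, e.g. syntactic
    head-dependent pairs and (word, POS) tokens; hypothetical here."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))   # crude word-pair stand-in

def rerank(query, first_phase, weight=0.5):
    """Rerank phase-one results (doc_id, text, bm25_score) by mixing
    the original score with complex-term overlap against the query."""
    q_terms = extract_complex_terms(query)
    rescored = []
    for doc_id, text, bm25 in first_phase:
        overlap = len(q_terms & extract_complex_terms(text))
        rescored.append((bm25 + weight * overlap, doc_id))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in rescored]
```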


Author(s): Ting Lu, Yan Xiang, Junge Liang, Li Zhang, Mingfang Zhang

The grand challenge of cross-domain sentiment analysis is that classifiers trained in a specific domain are very sensitive to the discrepancy between domains: a sentiment classifier trained on the source domain usually performs poorly on the target domain. One of the main strategies for solving this problem is the pivot-based strategy, in which the feature representation is an important component. However, previous pivot-based models did not use part-of-speech information to guide the learning of feature representations and feature mappings. We therefore present a fused part-of-speech vector and attention-based model (FAM). In our model, we fuse part-of-speech vectors and feature word embeddings as the representation of features, giving deep semantics to the mapped features, and we adopt a multi-head attention mechanism to train the cross-domain sentiment classifier and capture the connections between different features. The results of 12 groups of comparative experiments on the Amazon dataset demonstrate that our model outperforms all baseline models in this paper.
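The model's two main ingredients can be sketched directly: concatenate part-of-speech vectors with word embeddings to form the feature representation, then apply multi-head self-attention before classification. A hypothetical PyTorch sketch (dimensions, pooling, and layer choices are illustrative, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class FAMSketch(nn.Module):
    """Fused POS-and-attention classifier sketch: concatenate word and
    POS embeddings, apply multi-head self-attention, pool, classify."""
    def __init__(self, vocab, n_pos, word_dim=96, pos_dim=32, heads=4):
        super().__init__()
        dim = word_dim + pos_dim                  # 128, divisible by heads
        self.word_emb = nn.Embedding(vocab, word_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.clf = nn.Linear(dim, 2)              # positive / negative

    def forward(self, word_ids, pos_ids):
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], -1)
        attended, _ = self.attn(x, x, x)          # self-attention over tokens
        return self.clf(attended.mean(dim=1))     # pooled sentence logits

# Usage sketch with hypothetical sizes:
model = FAMSketch(vocab=5000, n_pos=46)
logits = model(torch.randint(0, 5000, (4, 30)), torch.randint(0, 46, (4, 30)))
```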

