Multi-Level Policy and Reward Reinforcement Learning for Image Captioning

Author(s):  
Anan Liu ◽  
Ning Xu ◽  
Hanwang Zhang ◽  
Weizhi Nie ◽  
Yuting Su ◽  
...  

Image captioning is one of the most challenging hallmarks of AI, due to its complexity in visual and natural language understanding. As it is essentially a sequential prediction task, recent advances in image captioning use Reinforcement Learning (RL) to better explore the dynamics of word-by-word generation. However, existing RL-based image captioning methods mainly rely on a single policy network and reward function that do not fit well with the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To this end, we propose a novel multi-level policy and reward RL framework for image captioning. It contains two modules: 1) a Multi-Level Policy Network that adaptively fuses the word-level policy and the sentence-level policy for word generation; and 2) a Multi-Level Reward Function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Further, we propose a guidance term to bridge the policy and the reward for RL optimization. Extensive experiments and analysis on MSCOCO and Flickr30k show that the proposed framework achieves competitive performance across different evaluation metrics.
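The abstract gives no implementation details, so the Python sketch below only illustrates the two ideas it names: an adaptive fusion of a word-level and a sentence-level policy, and a reward that mixes a vision-language score with a language-language score. The module shapes, the sigmoid gate, and the mixing weight `alpha` are illustrative assumptions, not the authors' design.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) fusing a word-level and a sentence-level policy with a learned gate, and
# (2) mixing a vision-language reward with a language-language reward.
import torch
import torch.nn as nn

class FusedPolicy(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.word_policy = nn.Linear(hidden_size, vocab_size)   # local, word-level context
        self.sent_policy = nn.Linear(hidden_size, vocab_size)   # global, sentence-level context
        self.gate = nn.Linear(2 * hidden_size, 1)                # adaptive fusion weight

    def forward(self, word_state, sent_state):
        g = torch.sigmoid(self.gate(torch.cat([word_state, sent_state], dim=-1)))
        logits = g * self.word_policy(word_state) + (1 - g) * self.sent_policy(sent_state)
        return torch.log_softmax(logits, dim=-1)                 # log-probs over the vocabulary

def multi_level_reward(vision_language_score, language_language_score, alpha=0.5):
    # e.g. an image-caption similarity score and a CIDEr-style caption-reference score;
    # the fixed 0.5 weighting is an assumption for illustration only
    return alpha * vision_language_score + (1 - alpha) * language_language_score
```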

2020 ◽  
Vol 201 ◽  
pp. 103068
Author(s):  
Haiyang Wei ◽  
Zhixin Li ◽  
Canlong Zhang ◽  
Huifang Ma

2020 ◽  
Vol 22 (5) ◽  
pp. 1372-1383 ◽  
Author(s):  
Ning Xu ◽  
Hanwang Zhang ◽  
An-An Liu ◽  
Weizhi Nie ◽  
Yuting Su ◽  
...  

Author(s):  
Longteng Guo ◽  
Jing Liu ◽  
Xinxin Zhu ◽  
Xingjian He ◽  
Jie Jiang ◽  
...  

Most image captioning models are autoregressive, i.e., they generate each word conditioned on the previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use a word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider sentence-level consistency, resulting in inferior generation quality from these non-autoregressive models. In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL). CMAL formulates NAIC as a multi-agent reinforcement learning system in which the positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward. In addition, we propose to utilize massive unlabeled images to boost captioning performance. Extensive experiments on the MSCOCO image captioning benchmark show that our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
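As a rough illustration of the counterfactual-baseline idea described above (each position acts as an agent, all words are sampled in parallel, and each position's credit is the sentence reward minus a counterfactual reward with only that position's word replaced), here is a hedged Python sketch. The function names, the per-position replacement strategy, and the reward interface are assumptions, not the paper's implementation.

```python
# Minimal sketch of a counterfactual-baseline policy-gradient loss for
# parallel (non-autoregressive) caption generation; all interfaces are assumed.
import torch

def cmal_loss(log_probs, sampled_words, replace_word, sentence_reward):
    """
    log_probs:       (T, V) log-probabilities emitted in parallel, one row per position
    sampled_words:   (T,)   word ids sampled from those distributions
    replace_word:    callable(position) -> alternative word id (e.g. the greedy word)
    sentence_reward: callable(word_ids) -> scalar sentence-level reward such as CIDEr
    """
    r = sentence_reward(sampled_words)
    losses = []
    for t in range(sampled_words.size(0)):
        counterfactual = sampled_words.clone()
        counterfactual[t] = replace_word(t)                # change only agent t's action
        advantage = r - sentence_reward(counterfactual)    # credit assigned to position t
        losses.append(-advantage * log_probs[t, sampled_words[t]])
    return torch.stack(losses).sum()
```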


Author(s):  
Haiyang Wei ◽  
Zhixin Li ◽  
Canlong Zhang ◽  
Tao Zhou ◽  
Yu Quan

2021 ◽  
Author(s):  
Buddhika Bellana ◽  
Abhijit Mahabal ◽  
Christopher John Honey

What we think about at any moment is shaped by what preceded it. Why do some experiences, such as reading an immersive story, feel as if they linger in mind beyond their conclusion? In this study, we hypothesize that the stream of our thinking is especially affected by "deeper" forms of processing, emphasizing the meaning and implications of a stimulus rather than its immediate physical properties or low-level semantics (e.g., reading a story vs. reading disconnected words). To test this idea, we presented participants with short stories that preserved different levels of coherence (word-level, sentence-level, or intact narrative), and we measured participants’ self-reports of lingering and spontaneous word generation. Participants reported that stories lingered in their minds after reading, but this effect was greatly reduced when the same words were read with the sentence order or word order randomly shuffled. Furthermore, the words that participants spontaneously generated after reading shared semantic meaning with the story’s central themes, particularly when the story was coherent (i.e., intact). Crucially, regardless of the objective coherence of what each participant read, lingering was strongest among participants who reported being ‘transported’ into the world of the story while reading. We further generalized this result to a non-narrative stimulus, finding that participants reported lingering after reading a list of words, especially when they had sought an underlying narrative or theme across the words. We conclude that recent experiences are most likely to provide a lasting mental context when we seek to extract and represent their deep, situation-level meaning.


2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare their sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
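The abstract describes a reward that is a mapping from contexts, learned via subgradients of a convex objective. The sketch below shows one common form such a subgradient step can take when the mapping is linear and progress is measured through feature expectations; the linear parameterization, the feature-expectation criterion, and the `solve_mdp` oracle are assumptions for illustration, not the authors' formulation.

```python
# Minimal, assumption-laden sketch of a contextual-IRL subgradient step:
# the reward weights are a linear function W of the context, and W is updated
# by comparing expert feature expectations against those of a best-response policy.
import numpy as np

def subgradient_step(W, context, expert_features, solve_mdp, lr=0.1):
    """
    W:               (d_reward, d_context) current reward mapping
    context:         (d_context,) sampled context defining this MDP
    expert_features: (d_reward,) discounted feature expectations of the expert
    solve_mdp:       callable(reward_weights) -> feature expectations of an optimal policy
    """
    reward_weights = W @ context
    agent_features = solve_mdp(reward_weights)         # best response to the current reward
    # a subgradient of the value gap w.r.t. W is an outer product of the
    # feature-expectation mismatch with the context
    subgrad = np.outer(agent_features - expert_features, context)
    return W - lr * subgrad
```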


2021 ◽  
Author(s):  
Amarildo Likmeta ◽  
Alberto Maria Metelli ◽  
Giorgia Ramponi ◽  
Andrea Tirinzoni ◽  
Matteo Giuliani ◽  
...  

In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and present three application scenarios: (1) the high-level decision-making problem in a highway driving scenario; (2) inferring user preferences in a social network (Twitter); and (3) the management of water release in Lake Como. For each of these scenarios, we provide a formalization, experiments, and a discussion to interpret the obtained results.


Minerals ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 587
Author(s):  
Joao Pedro de Carvalho ◽  
Roussos Dimitrakopoulos

This paper presents a new truck dispatching policy approach that adapts to different mining complex configurations in order to deliver the supply material extracted by the shovels to the processors. The method aims to improve adherence to the operational plan and fleet utilization in a mining complex context. Several sources of operational uncertainty arising from the loading, hauling and dumping activities can influence the dispatching strategy. Given a fixed extraction sequence of mining blocks provided by the short-term plan, a discrete event simulator emulates the interactions arising from these mining operations. Repeated runs of this simulator, combined with a reward function that assigns a score to each dispatching decision, generate sample experiences to train a deep Q-learning reinforcement learning model. The model learns from past dispatching experience, so that when a new assignment is requested, a well-informed decision can be made quickly. The approach is tested at a copper–gold mining complex characterized by uncertainties in equipment performance and geological attributes, and the results show improvements in production targets, metal production, and fleet management.
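A minimal sketch of how such a learned dispatcher can be wired up: a Q-network scores candidate destinations from the current fleet state, and an epsilon-greedy rule picks one whenever a truck requests an assignment. The network size, state encoding, and epsilon value are illustrative assumptions rather than the configuration used in the paper.

```python
# Minimal sketch (illustrative assumptions, not the authors' model) of the
# dispatching loop trained with deep Q-learning from simulator experiences.
import torch
import torch.nn as nn

class DispatchQNet(nn.Module):
    def __init__(self, state_dim, n_destinations, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_destinations),   # one Q-value per candidate shovel/dump
        )

    def forward(self, state):
        return self.net(state)

def dispatch(qnet, state, epsilon=0.1):
    # Epsilon-greedy choice of destination for the requesting truck;
    # during deployment the greedy action alone would typically be used.
    if torch.rand(1).item() < epsilon:
        return torch.randint(qnet.net[-1].out_features, (1,)).item()
    with torch.no_grad():
        return qnet(state).argmax().item()
```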

