accumulated reward
Recently Published Documents

TOTAL DOCUMENTS: 14 (five years: 5)
H-INDEX: 5 (five years: 1)

Sensors ◽ 2021 ◽ Vol 21 (4) ◽ pp. 1363
Author(s): Hailuo Song, Ao Li, Tong Wang, Minghui Wang

It is an essential capability of indoor mobile robots to avoid various kinds of obstacles. Recently, multimodal deep reinforcement learning (DRL) methods have demonstrated great capability for learning control policies in robotics by using different sensors. However, due to the complexity of indoor environments and the heterogeneity of sensor modalities, it remains an open challenge to obtain reliable and robust multimodal information for obstacle avoidance. In this work, we propose a novel multimodal DRL method with an auxiliary task (MDRLAT) for obstacle avoidance of indoor mobile robots. In MDRLAT, a powerful bilinear fusion module is proposed to fully capture the complementary information from two-dimensional (2D) laser range findings and depth images, and the generated multimodal representation is subsequently fed into a dueling double deep Q-network to output control commands for the mobile robot. In addition, an auxiliary task of velocity estimation is introduced to further improve representation learning in DRL. Experimental results show that MDRLAT achieves remarkable performance in terms of average accumulated reward, convergence speed, and success rate. Moreover, experiments in both virtual and real-world testing environments further demonstrate the outstanding generalization capability of our method.
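As a rough illustration of the kind of architecture this abstract describes, the sketch below (in PyTorch) wires a bilinear fusion of laser-range and depth-image features into a dueling Q-network head with an auxiliary velocity-estimation output. The encoders, layer sizes, and the five-action output are assumptions for the sketch, not the published MDRLAT design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a bilinear-fusion dueling Q-network with an
# auxiliary velocity head; sizes and encoders are assumed, not the
# paper's exact MDRLAT architecture.
class MultimodalDuelingQNet(nn.Module):
    def __init__(self, laser_dim: int = 360, num_actions: int = 5):
        super().__init__()
        # simple placeholder encoders for the two modalities
        self.laser_enc = nn.Sequential(nn.Linear(laser_dim, 128), nn.ReLU())
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 128), nn.ReLU(),
        )
        # bilinear fusion of the two 128-d modality embeddings
        self.fusion = nn.Bilinear(128, 128, 256)
        # dueling heads: state value V(s) and action advantages A(s, a)
        self.value = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, num_actions))
        # auxiliary head: linear and angular velocity estimation
        self.velocity = nn.Linear(256, 2)

    def forward(self, laser: torch.Tensor, depth: torch.Tensor):
        z = torch.relu(self.fusion(self.laser_enc(laser), self.depth_enc(depth)))
        v, a = self.value(z), self.advantage(z)
        q = v + a - a.mean(dim=1, keepdim=True)   # dueling aggregation
        return q, self.velocity(z)                # Q-values and auxiliary prediction

# Usage with dummy inputs: a batch of 360-beam laser scans and 64x64 depth images.
if __name__ == "__main__":
    net = MultimodalDuelingQNet()
    q, vel = net(torch.randn(8, 360), torch.randn(8, 1, 64, 64))
    print(q.shape, vel.shape)   # torch.Size([8, 5]) torch.Size([8, 2])
```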


Author(s): Antonio Sánchez Herguedas, Adolfo Crespo Márquez, Francisco Rodrigo Muñoz

Abstract: This paper describes the optimization of preventive maintenance (PM) over a finite planning horizon in a semi-Markov framework. In this framework, the asset may be operating, and providing income for the asset owner, or not operating and undergoing PM, or not operating and undergoing corrective maintenance following failure. PM is triggered when the asset has been operating for τ time units. A number m of transitions specifies the finite horizon. This system is described with a set of recurrence relations, and their z-transform is used to determine the value of τ that maximizes the average accumulated reward over the horizon. We study under what conditions a solution can be found, and for those specific cases the solution τ* is calculated. Despite the complexity of the mathematical solution, the result obtained allows the analyst to provide a quick and easy-to-use tool for practical application in many real-world cases. To demonstrate this, the method has been implemented for a case study, and its accuracy and practical implementation were tested using Monte Carlo simulation and direct calculation.
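The paper derives τ* analytically from the z-transform of the recurrence relations; as a rough illustration of the quantity being optimized, the sketch below instead grid-searches τ by simulating the accumulated reward of a three-state asset over a finite number of operating-maintenance cycles. All rates, costs, and the Weibull failure law are placeholder assumptions, not the paper's case-study values.

```python
import random

# Hypothetical simulation sketch (not the paper's z-transform derivation):
# estimate the average accumulated reward for a given PM trigger tau and
# pick the best tau by brute-force grid search.
INCOME_RATE = 10.0                         # income per unit of operating time (assumed)
PM_COST, CM_COST = 50.0, 300.0             # preventive vs. corrective maintenance cost (assumed)
WEIBULL_SCALE, WEIBULL_SHAPE = 30.0, 2.0   # assumed time-to-failure distribution
M_CYCLES = 40                              # finite horizon, in operating-maintenance cycles

def accumulated_reward(tau: float, rng: random.Random) -> float:
    """One sample path of the accumulated reward over M_CYCLES."""
    reward = 0.0
    for _ in range(M_CYCLES):
        time_to_failure = rng.weibullvariate(WEIBULL_SCALE, WEIBULL_SHAPE)
        if time_to_failure > tau:   # PM is triggered after tau operating time units
            reward += INCOME_RATE * tau - PM_COST
        else:                       # failure occurs first -> corrective maintenance
            reward += INCOME_RATE * time_to_failure - CM_COST
    return reward

def average_reward(tau: float, runs: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(accumulated_reward(tau, rng) for _ in range(runs)) / runs

if __name__ == "__main__":
    candidates = [5, 10, 15, 20, 25, 30, 35, 40]
    best_tau = max(candidates, key=average_reward)
    print("estimated tau* =", best_tau, "average reward =", round(average_reward(best_tau), 1))
```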


Electronics ◽ 2020 ◽ Vol 9 (10) ◽ pp. 1720
Author(s): Rashid Ali, Muhammad Sohail, Alaa Omran Almagrabi, Arslan Musaddiq, Byung-Seo Kim

Wireless local area networks (WLANs) have gained wide acceptance in day-to-day communication devices such as handheld smartphones, tablets, and laptops. Energy preservation plays a vital role in WLAN communication, and the efficient use of energy remains one of the most substantial challenges for WLAN devices. Industrial and academic researchers have proposed several approaches to save energy and reduce the overall power consumption of WLAN devices, focusing on static or adaptive energy-saving methods. However, most of these approaches save energy at the cost of throughput degradation, due to either increased sleep time or a reduced number of transmissions. In this paper, we recognize the potential of reinforcement learning (RL) techniques, such as Q-learning (QL), to enhance a WLAN's channel reliability for energy saving. QL is an RL technique that utilizes the accumulated reward of the actions performed in a state-action model. We propose a QL-based energy-saving MAC protocol, named greenMAC. The proposed greenMAC protocol reduces energy consumption by using the accumulated reward value to optimize channel reliability, which reduces the collision probability of the network. We use the degree of channel congestion, expressed as collision probability, as the reward function for our QL-based greenMAC protocol. The comparative results show that greenMAC achieves enhanced system throughput with additional energy savings compared to existing energy-saving mechanisms in WLANs.
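A minimal sketch of the kind of Q-learning loop the abstract describes is given below; the state discretization, the sleep/transmit action set, and the collision-probability-based reward shaping are assumptions for illustration, not the published greenMAC design.

```python
import random

# Hypothetical Q-learning sketch: states, actions and reward shaping are assumed.
ACTIONS = ["sleep", "transmit"]          # assumed MAC-level actions
STATES = range(5)                        # assumed discretised congestion levels
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # standard Q-learning hyperparameters

q_table = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(collision_prob: float, action: str) -> float:
    """Small positive reward for sleeping (energy saved); transmissions are
    rewarded when the channel is reliable and penalised when collisions are likely."""
    if action == "sleep":
        return 0.2
    return 1.0 - 2.0 * collision_prob

def choose_action(state: int) -> str:
    if random.random() < EPSILON:                            # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])   # exploit

def q_update(s: int, a: str, r: float, s_next: int) -> None:
    best_next = max(q_table[(s_next, a2)] for a2 in ACTIONS)
    q_table[(s, a)] += ALPHA * (r + GAMMA * best_next - q_table[(s, a)])

def step(state: int, action: str) -> tuple[int, float]:
    """Toy environment: collision probability grows with the congestion level."""
    p_collision = state / (len(STATES) + 1)
    next_state = random.choice(list(STATES))   # assumed random congestion drift
    return next_state, reward(p_collision, action)

if __name__ == "__main__":
    state = 0
    for _ in range(10_000):
        action = choose_action(state)
        next_state, r = step(state, action)
        q_update(state, action, r, next_state)
        state = next_state
    # greedy policy learned per congestion level
    print({s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in STATES})
```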


Mathematics ◽ 2020 ◽ Vol 8 (8) ◽ pp. 1254
Author(s): Cheng-Hung Chen, Shiou-Yun Jeng, Cheng-Jian Lin

In this study, a fuzzy logic controller with a reinforcement improved differential search algorithm (FLC_R-IDS) is proposed for solving a mobile robot wall-following control problem. The study uses the reward and punishment mechanisms of reinforcement learning to train the wall-following controller. The proposed improved differential search algorithm uses parameter adaptation to adjust the control parameters. To improve the exploration of the algorithm, the number of superorganisms is varied as they move toward a stopover site. Reinforcement learning is used to guide the behavior of the robot: when the mobile robot satisfies three reward conditions, it receives a reward of +1. The accumulated reward value is used to evaluate the controller and to guide the next round of controller training. Experimental results show that, compared with the traditional differential search algorithm and the chaos differential search algorithm, the average error of the proposed FLC_R-IDS in the three experimental environments is reduced by 12.44%, 22.54%, and 25.98%, respectively. Finally, the experimental results also show that a real mobile robot using the proposed method can effectively implement wall-following control.
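As a hedged illustration of how an accumulated reward can score a candidate wall-following controller, the sketch below sums a per-step +1 reward whenever three conditions hold; the specific conditions (distance band, forward progress, no collision) are assumptions for the sketch, not the paper's exact reward conditions.

```python
# Hypothetical per-step reward: +1 only when all three assumed conditions are met.
def step_reward(wall_distance: float, forward_speed: float, collided: bool,
                target: float = 0.5, band: float = 0.1) -> int:
    in_band = abs(wall_distance - target) <= band   # stays near the wall
    moving_forward = forward_speed > 0.0            # makes forward progress
    return 1 if (in_band and moving_forward and not collided) else 0

# Toy usage: score a short recorded trajectory (values are made up).
trajectory = [(0.48, 0.2, False), (0.62, 0.2, False), (0.50, 0.0, False), (0.53, 0.3, True)]
accumulated = sum(step_reward(d, v, c) for d, v, c in trajectory)
print(accumulated)  # -> 1 (only the first sample satisfies all three conditions)
```

A higher accumulated reward would mark a better candidate controller for the next training round.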


2020 ◽ Vol 34 (04) ◽ pp. 4328-4336
Author(s): Vishal Jain, William Fedus, Hugo Larochelle, Doina Precup, Marc G. Bellemare

Text-based games are a natural challenge domain for deep reinforcement learning algorithms. Their state and action spaces are combinatorially large, their reward function is sparse, and they are partially observable: the agent is informed of the consequences of its actions through textual feedback. In this paper we emphasize this latter point and consider the design of a deep reinforcement learning agent that can play from feedback alone. Our design recognizes and takes advantage of the structural characteristics of text-based games. We first propose a contextualisation mechanism, based on accumulated reward, which simplifies the learning problem and mitigates partial observability. We then study different methods that rely on the notion that most actions are ineffectual in any given situation, following Zahavy et al.'s idea of an admissible action. We evaluate these techniques in a series of text-based games of increasing difficulty based on the TextWorld framework, as well as the iconic game Zork. Empirically, we find that these techniques improve the performance of a baseline deep reinforcement learning agent applied to text-based games.
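A minimal sketch of a score-based contextualisation is shown below: the textual observation is paired with a coarse bucket of the accumulated reward, so that identical observations reached at different progress levels index different states. The bucketing scheme is an assumption for illustration, not the paper's exact mechanism.

```python
from dataclasses import dataclass

# Hypothetical accumulated-reward contextualisation of text observations.
SCORE_BUCKETS = (0, 5, 10, 20, 50)   # assumed progress thresholds

def score_bucket(accumulated_reward: int) -> int:
    """Map the raw accumulated reward to a coarse progress index."""
    return sum(accumulated_reward >= t for t in SCORE_BUCKETS)

@dataclass(frozen=True)
class ContextualisedState:
    observation: str     # textual feedback from the game
    context: int         # coarse accumulated-reward bucket

def contextualise(observation: str, accumulated_reward: int) -> ContextualisedState:
    return ContextualisedState(observation, score_bucket(accumulated_reward))

# Usage: two identical observations at different scores map to different states,
# which mitigates partial observability when indexing a policy or value function.
s1 = contextualise("You are in a dark room.", 0)
s2 = contextualise("You are in a dark room.", 12)
print(s1 == s2)   # False
```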


2012 ◽ Vol 6
Author(s): Karl Friston, Rick Adams, Read Montague

2011 ◽ pp. 136-152
Author(s): Charatdao Intratat

This investigation of popular computer games in comparison with language learning games was designed to offer an insight into the potential of games for the field of self-access learning. The study surveyed and analyzed common characteristics of popular computer games and then compared them with the characteristics of several language learning games. It also investigated the characteristics that participants recommended for computer games used to learn English. The data were collected from undergraduate students at King Mongkut's University of Technology Thonburi, Bangkok, Thailand. The results showed that the most conducive characteristics for attractive language learning games included animation, variety, planning strategy, virtual background, challenging action, and accumulated reward.


Author(s): D Bruneo, A Puliafito, M Scarpa

Wireless sensor networks (WSNs) are composed of a large number of tiny sensor nodes randomly distributed over a geographical region. In order to reduce power consumption, battery-operated sensors undergo cycles of sleeping and active periods that reduce their ability to send and receive data. Starting from Markov reward model theory, this paper presents a dependability model to analyse the reliability of a sensor node. A new dependability parameter is also introduced, referred to as producibility, which captures the capability of a sensor to accomplish its mission. Two different model solution techniques are proposed: one based on the evaluation of the accumulated reward distribution, and the other on an equivalent non-Markovian stochastic Petri net model. The obtained results are used to investigate the dependability of a whole WSN, taking into account the presence of redundant nodes. Topological aspects are also considered, providing a quantitative comparison among three typical network topologies: star, tree, and mesh. Numerical results are provided to highlight the advantages of the proposed technique and to demonstrate the equivalence of the proposed approaches.
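As a hedged illustration of the first solution technique (evaluating the accumulated reward distribution), the sketch below simulates a single node as a two-state Markov reward process with alternating active and sleeping periods; the rates and reward values are placeholders, and Monte Carlo sampling stands in for the paper's analytical and Petri-net solutions.

```python
import random

# Hypothetical two-state (active/sleeping) Markov reward model of a sensor node;
# all rates and reward values below are assumed placeholders.
RATE_TO_SLEEP = 1.0      # transitions per hour out of the active state (assumed)
RATE_TO_ACTIVE = 4.0     # transitions per hour out of the sleeping state (assumed)
REWARD_RATE = {"active": 1.0, "sleeping": 0.0}   # useful work accrued per hour in each state

def accumulated_reward(mission_time: float, rng: random.Random) -> float:
    """One sample of the reward accumulated over the mission time."""
    state, t, reward = "active", 0.0, 0.0
    while t < mission_time:
        rate = RATE_TO_SLEEP if state == "active" else RATE_TO_ACTIVE
        sojourn = min(rng.expovariate(rate), mission_time - t)
        reward += REWARD_RATE[state] * sojourn
        t += sojourn
        state = "sleeping" if state == "active" else "active"
    return reward

def reward_distribution(mission_time: float, runs: int = 10_000, seed: int = 1):
    rng = random.Random(seed)
    return sorted(accumulated_reward(mission_time, rng) for _ in range(runs))

if __name__ == "__main__":
    samples = reward_distribution(mission_time=24.0)
    # e.g. estimated probability that the node accrues at least 18 hours of useful work
    print(sum(s >= 18.0 for s in samples) / len(samples))
```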


2009 ◽ Vol 101 (1) ◽ pp. 437-447
Author(s): Takafumi Minamimoto, Giancarlo La Camera, Barry J. Richmond

Motivation is usually inferred from the likelihood or the intensity with which behavior is carried out. It is sensitive to external factors (e.g., the identity, amount, and timing of a rewarding outcome) and internal factors (e.g., hunger or thirst). We trained macaque monkeys to perform a nonchoice instrumental task (a sequential red-green color discrimination) while manipulating two external factors: reward size and delay-to-reward. We also inferred the state of one internal factor, level of satiation, by monitoring the accumulated reward. A visual cue indicated the forthcoming reward size and delay-to-reward in each trial. The fraction of trials completed correctly by the monkeys increased linearly with reward size and was hyperbolically discounted by delay-to-reward duration, relations that are similar to those found in free operant and choice tasks. The fraction of correct trials also decreased progressively as a function of the satiation level. Similar (albeit noisier) relations were obtained for reaction times. The combined effect of reward size, delay-to-reward, and satiation level on the proportion of correct trials is well described as a multiplication of the effects measured when each factor is examined alone. These results provide a quantitative account of the interaction of external and internal factors on instrumental behavior, and allow us to extend the concept of subjective value of a rewarding outcome, usually confined to external factors, to account also for slow changes in the internal drive of the subject.
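A minimal sketch of the multiplicative model described above is given below: predicted performance scales roughly linearly with reward size, is hyperbolically discounted by delay, and declines with satiation. The coefficients are illustrative placeholders, not fitted values from the study.

```python
# Hypothetical multiplicative model of the proportion of correct trials;
# coefficients a, k, c are placeholders, not the study's fitted parameters.
def p_correct(reward_size: float, delay: float, satiation: float,
              a: float = 0.2, k: float = 0.5, c: float = 0.8) -> float:
    """Predicted fraction of correctly completed trials (clipped to [0, 1])."""
    size_term = a * reward_size              # approximately linear in reward size
    delay_term = 1.0 / (1.0 + k * delay)     # hyperbolic delay discounting
    satiation_term = 1.0 - c * satiation     # progressive decline with satiation (0..1)
    return max(0.0, min(1.0, size_term * delay_term * satiation_term))

# Usage: a large, immediate reward early in the session vs. a small, delayed
# reward late in the session.
print(p_correct(reward_size=4, delay=0.0, satiation=0.1))   # high (~0.74)
print(p_correct(reward_size=1, delay=6.0, satiation=0.8))   # low (~0.02)
```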

