scholarly journals Transfer of Temporal Logic Formulas in Reinforcement Learning

Author(s):  
Zhe Xu ◽  
Ufuk Topcu

Transferring high-level knowledge from a source task to a target task is an effective way to expedite reinforcement learning (RL). For example, propositional logic and first-order logic have been used as representations of such knowledge. We study the transfer of knowledge between tasks in which the timing of the events matters. We call such tasks temporal tasks. We concretize similarity between temporal tasks through a notion of logical transferability, and develop a transfer learning approach between different yet similar temporal tasks. We first propose an inference technique to extract metric interval temporal logic (MITL) formulas in sequential disjunctive normal form from labeled trajectories collected in RL of the two tasks. If logical transferability is identified through this inference, we construct a timed automaton for each sequential conjunctive subformula of the inferred MITL formulas from both tasks. We perform RL on the extended state which includes the locations and clock valuations of the timed automata for the source task. We then establish mappings between the corresponding components (clocks, locations, etc.) of the timed automata from the two tasks, and transfer the extended Q-functions based on the established mappings. Finally, we perform RL on the extended state for the target task, starting with the transferred extended Q-functions. Our implementation results show, depending on how similar the source task and the target task are, that the sampling efficiency for the target task can be improved by up to one order of magnitude by performing RL in the extended state space, and further improved by up to another order of magnitude using the transferred extended Q-functions.

2021 ◽  
Vol 11 (3) ◽  
pp. 1291
Author(s):  
Bonwoo Gu ◽  
Yunsick Sung

Gomoku is a two-player board game that originated in ancient China. There are various cases of developing Gomoku using artificial intelligence, such as a genetic algorithm and a tree search algorithm. Alpha-Gomoku, Gomoku AI built with Alpha-Go’s algorithm, defines all possible situations in the Gomoku board using Monte-Carlo tree search (MCTS), and minimizes the probability of learning other correct answers in the duplicated Gomoku board situation. However, in the tree search algorithm, the accuracy drops, because the classification criteria are manually set. In this paper, we propose an improved reinforcement learning-based high-level decision approach using convolutional neural networks (CNN). The proposed algorithm expresses each state as One-Hot Encoding based vectors and determines the state of the Gomoku board by combining the similar state of One-Hot Encoding based vectors. Thus, in a case where a stone that is determined by CNN has already been placed or cannot be placed, we suggest a method for selecting an alternative. We verify the proposed method of Gomoku AI in GuPyEngine, a Python-based 3D simulation platform.


2021 ◽  
Vol 31 (3) ◽  
pp. 1-26
Author(s):  
Aravind Balakrishnan ◽  
Jaeyoung Lee ◽  
Ashish Gaurav ◽  
Krzysztof Czarnecki ◽  
Sean Sedwards

Reinforcement learning (RL) is an attractive way to implement high-level decision-making policies for autonomous driving, but learning directly from a real vehicle or a high-fidelity simulator is variously infeasible. We therefore consider the problem of transfer reinforcement learning and study how a policy learned in a simple environment using WiseMove can be transferred to our high-fidelity simulator, W ise M ove . WiseMove is a framework to study safety and other aspects of RL for autonomous driving. W ise M ove accurately reproduces the dynamics and software stack of our real vehicle. We find that the accurately modelled perception errors in W ise M ove contribute the most to the transfer problem. These errors, when even naively modelled in WiseMove , provide an RL policy that performs better in W ise M ove than a hand-crafted rule-based policy. Applying domain randomization to the environment in WiseMove yields an even better policy. The final RL policy reduces the failures due to perception errors from 10% to 2.75%. We also observe that the RL policy has significantly less reliance on velocity compared to the rule-based policy, having learned that its measurement is unreliable.


Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2534
Author(s):  
Oualid Doukhi ◽  
Deok-Jin Lee

Autonomous navigation and collision avoidance missions represent a significant challenge for robotics systems as they generally operate in dynamic environments that require a high level of autonomy and flexible decision-making capabilities. This challenge becomes more applicable in micro aerial vehicles (MAVs) due to their limited size and computational power. This paper presents a novel approach for enabling a micro aerial vehicle system equipped with a laser range finder to autonomously navigate among obstacles and achieve a user-specified goal location in a GPS-denied environment, without the need for mapping or path planning. The proposed system uses an actor–critic-based reinforcement learning technique to train the aerial robot in a Gazebo simulator to perform a point-goal navigation task by directly mapping the noisy MAV’s state and laser scan measurements to continuous motion control. The obtained policy can perform collision-free flight in the real world while being trained entirely on a 3D simulator. Intensive simulations and real-time experiments were conducted and compared with a nonlinear model predictive control technique to show the generalization capabilities to new unseen environments, and robustness against localization noise. The obtained results demonstrate our system’s effectiveness in flying safely and reaching the desired points by planning smooth forward linear velocity and heading rates.


1988 ◽  
Vol 127 ◽  
Author(s):  
Jan L. Marivoet ◽  
Geert Volckaert ◽  
Arnold A. Bonne

ABSTRACTPerformance assessment studies have been undertaken on the geological disposal of high-level waste in a clay layer in the framework of the CEC project PAGIS. The methodology applied consists of two consecutive steps : a scenario and a consequence analysis. The scenario analysis has indicated that scenarios of normal evolution, of human intrusion, of climatic change, of secondary glaciation effects and of faulting should be evaluated. For the consequence analysis as well deterministic “best estimate” as stochastic calculations, including uncertainty, risk and sensitivity analyses, have been elaborated.The calculations performed show that most radionuclides decay to negligible levels within the first fewjneters of the clay barrier. Just a few radionuclides, 99Tc, 135Cs and 237Np with its daughter nuclides 233U and 229Th can eventually reach the biosphere. The maximum dose rates arising from the geological disposal of HLW, as evaluated by the “best-estimate” approach are about 10−11 Sv/y for river pathways. If the sinking of a water well into the 150 m deep aquifer layer in the vicinity of the repository is considered together with a climatic change, the maximum calculated dose rate rises to a value of 3×10−7 Sv/y. The maximum dose rates evaluated by stochastic calculations are about one order of magnitude higher due to the considerable uncertainties in the model parameters. In the case of the Boom clay the estimated consequences of a fault scenario are of the same order of magnitude as the results obtained for the normal evolution scenario. The maximum risk is estimated from the results obtained through stochastic calculations to be about 5×10−8 per year. The sensitivity analysis has shown that the effective thickness of the clay layer, the retention factors of Tc, Cs and Np, and the Darcy velocity in the aquifer are parameters which strongly influence the calculated dose rates.


2012 ◽  
Vol 23 (04) ◽  
pp. 831-851 ◽  
Author(s):  
GUOQIANG LI ◽  
XIAOJUAN CAI ◽  
SHOJI YUEN

Timed automata are commonly recognized as a formal behavioral model for real-time systems. For compositional system design, parallel composition of timed automata as proposed by Larsen et al. [22] is useful. Although parallel composition provides a general method for system construction, in the low level behavior, components often behave sequentially by passing control via communication. This paper proposes a behavioral model, named controller automata, to combine timed automata by focusing on the control passing between components. In a controller automaton, to each state a timed automaton is assigned. A timed automaton at a state may be preempted by the control passing to another state by a global labeled transition. A controller automaton properly extends the expressive power because of the stack, but this can make the reachability problem undecidable. Given a strict partial order over states, we show that this problem can be avoided and a controller automaton can be faithfully translated into a timed automaton.


2019 ◽  
pp. 155-168
Author(s):  
Murukesan Loganathan ◽  
Thennarasan Sabapathy ◽  
Mohamed Elobaid Elshaikh ◽  
Mohamed Nasrun Osman ◽  
Rosemizi Abd Rahim ◽  
...  

Efficient collision arbitration protocol facilitates fast tag identification in radio frequency identification (RFID) systems. EPCGlobal-Class1-Generation2 (EPC-C1G2) protocol is the current standard for collision arbitration in commercial RFID systems. However, the main drawback of this protocol is that it requires excessive message exchanges between tags and the reader for its operation. This wastes energy of the already resource-constrained RFID readers. Hence, in this work, reinforcement learning based anti-collision protocol (RL-DFSA) is proposed to address the energy efficient collision arbitration problem in the RFID system. The proposed algorithm continuously learns and adapts to the changes in the environment by devising an optimal policy. The proposed RL-DFSA was evaluated through extensive simulations and compared with the variants of EPC-C1G2 algorithms that are currently being used in the commercial readers. Based on the results, it is concluded that RL-DFSA performs equal or better than EPC-C1G2 protocol in delay, throughput and time system efficiency when simulated for sparse and dense environments while requiring one order of magnitude lesser control message exchanges between the reader and the tags.


Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, which entails that exploration remains one of the key challenges of DRL. Instead of solely relying on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as exploration signal. While they hold promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration in reinforcement learning. Our curiosity signal is driven by a fast reward that deals with local exploration and a slow reward that incentivizes long-time horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct the observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.


1997 ◽  
Vol 4 (29) ◽  
Author(s):  
Luca Aceto ◽  
Augusto Burgueno ◽  
Kim G. Larsen

In this paper we develop an approach to model-checking for timed automata via reachability testing. As our specification formalism, we consider a dense-time logic with clocks. This logic may be used to express safety and bounded liveness properties of real-time systems. We show how to automatically synthesize, for every logical formula phi, a so-called test automaton T_phi in such a way that checking whether a system S satisfies the property phi can be reduced to a reachability question over the system obtained by making T_phi interact with S. <br />The testable logic we consider is both of practical and theoretical interest. On the practical side, we have used the logic, and the associated approach to model-checking via reachability testing it supports, in the specification and verification in Uppaal of a collision avoidance protocol. On the theoretical side, we show that the logic is powerful enough to permit the definition of characteristic properties, with respect to a timed version of<br />the ready simulation preorder, for nodes of deterministic, tau-free timed automata. This allows one to compute behavioural relations via our model-checking technique, therefore effectively reducing the problem of checking the existence of a behavioural relation among states of a timed automaton to a reachability problem.


Sign in / Sign up

Export Citation Format

Share Document