Sensor Control for Multi-Object Bayesian Filtering Based on Minimum Predicted Miss-Distance

2013 ◽  
Vol 336-338 ◽  
pp. 361-366
Author(s):  
Chun Xiao Jian ◽  
Wei Yang ◽  
Pei Guo Liu

This paper addresses sensor control for exploiting the multi-object filtering capability of a sensor system. As in previous work, the proposed control algorithm is formulated in the framework of partially observed Markov decision processes, but it adopts a new reward function (RF). The multi-object miss-distance jointly captures detection and estimation error in a mathematically consistent manner and is generally employed as the final performance measure for multi-object filtering; the predicted multi-object miss-distance is therefore a natural choice of RF. However, the predicted multi-object miss-distance generally has no analytical expression, so its computation is discussed in detail. Future work will concentrate on a complete comparison of different sensor control schemes.
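A common concrete instance of the multi-object miss-distance is the OSPA metric; the minimal sketch below (an assumption for illustration, not necessarily the exact metric or code used in the paper, with hypothetical function names) shows how such a distance between an estimated and a true object set can be computed.

```python
# Minimal sketch of an OSPA-style multi-object miss-distance, assuming the
# common OSPA metric with cutoff c and order p.  Names are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_distance(X, Y, c=10.0, p=2):
    """X, Y: arrays of shape (m, d) and (n, d) holding object states."""
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m > n:                               # make X the smaller set
        X, Y, m, n = Y, X, n, m
    # pairwise distances, cut off at c
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    D = np.minimum(D, c) ** p
    # optimal assignment of the m estimated objects to m of the n true objects
    row, col = linear_sum_assignment(D)
    loc_term = D[row, col].sum()            # localisation (estimation) error
    card_term = (c ** p) * (n - m)          # cardinality (detection) error
    return ((loc_term + card_term) / n) ** (1.0 / p)
```

In the control setting of the abstract, such a quantity would be evaluated on the predicted filter output for each candidate sensor action, and the action with the smallest predicted value selected.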

2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping so that the agent acts optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, comparing their sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians from recorded data of their treatment of patients diagnosed with sepsis.
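As a rough illustration of the subgradient idea, the toy sketch below assumes a tabular contextual MDP with a linear reward mapping and uses the feature-expectation difference between the current optimal policy and the expert policy as a subgradient of a margin-style loss. The loss, the linear parameterization, and all names are assumptions made for illustration, not the paper's construction.

```python
# Toy sketch: contextual IRL with a linear reward mapping r_c = F @ W @ c
# (state-only reward, tabular MDP).  Everything here is illustrative.
import numpy as np

S, A, K, DC, GAMMA = 5, 3, 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] transition kernel
F = rng.normal(size=(S, K))                    # state features
mu0 = np.ones(S) / S                           # initial state distribution

def optimal_policy(r):
    """Greedy policy from value iteration for state reward r."""
    V = np.zeros(S)
    for _ in range(500):
        V = (r[:, None] + GAMMA * P @ V).max(axis=1)
    return (r[:, None] + GAMMA * P @ V).argmax(axis=1)

def occupancy(pi):
    """Discounted state-visitation frequencies of a deterministic policy."""
    P_pi = P[np.arange(S), pi]                 # shape (S, S)
    return np.linalg.solve(np.eye(S) - GAMMA * P_pi.T, mu0)

# expert behaviour generated from a hidden "true" mapping W_true
W_true = rng.normal(size=(K, DC))
contexts = rng.normal(size=(8, DC))
experts = [optimal_policy(F @ W_true @ c) for c in contexts]

W = np.zeros((K, DC))
for step in range(200):
    g = np.zeros_like(W)
    for c, pi_E in zip(contexts, experts):
        r = F @ W @ c
        d_rho = occupancy(optimal_policy(r)) - occupancy(pi_E)
        g += np.outer(F.T @ d_rho, c)          # subgradient of the margin loss
    W -= 0.05 / np.sqrt(step + 1) * g / len(contexts)
```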


2014 ◽  
Vol 13 (6) ◽  
pp. 1261
Author(s):  
Francois Van Dyk ◽  
Gary Van Vuuren ◽  
Andre Heymans

The Sharpe ratio is widely used as a performance measure for traditional (i.e., long-only) investment funds, but because it is based on mean-variance theory, it considers only the first two moments of a return distribution. It is therefore not suited to evaluating funds characterised by complex, asymmetric, highly skewed return distributions, such as hedge funds, and it is also susceptible to manipulation and estimation error. These drawbacks have demonstrated the need for new and additional fund performance metrics. The monthly returns of 184 international long/short (equity) hedge funds from four geographical investment mandates were examined over an 11-year period. This study contributes to recent research on alternative performance measures to the Sharpe ratio and specifically assesses whether a scaled version of the classic Sharpe ratio should augment its use when evaluating hedge fund risk and in the investment decision-making process. A scaled Treynor ratio is also compared to the traditional Treynor ratio. The classic and scaled versions of the Sharpe and Treynor ratios were estimated on a 36-month rolling basis to ascertain whether the scaled ratios do indeed provide investors with useful information beyond that provided solely by the classic, non-scaled ratios.
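As a concrete illustration, the sketch below computes a classic Sharpe ratio on a 36-month rolling window of monthly returns together with a simple annualised ("scaled") variant. The study's own scaling may differ, so the scaled version here is an assumption, and the function names are hypothetical.

```python
# Sketch of a 36-month rolling Sharpe ratio on monthly hedge-fund returns.
# The "scaled" variant simply annualises by sqrt(12); treat it as illustrative.
import numpy as np
import pandas as pd

def rolling_sharpe(returns: pd.Series, rf: pd.Series, window: int = 36) -> pd.Series:
    """Classic Sharpe ratio over a rolling window of monthly excess returns."""
    excess = returns - rf
    return excess.rolling(window).mean() / excess.rolling(window).std()

def rolling_sharpe_scaled(returns, rf, window=36, periods_per_year=12):
    """Annualised ('scaled') version of the rolling Sharpe ratio."""
    return rolling_sharpe(returns, rf, window) * np.sqrt(periods_per_year)
```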


2001 ◽  
Vol 15 (4) ◽  
pp. 557-564 ◽  
Author(s):  
Rolando Cavazos-Cadena ◽  
Raúl Montes-de-Oca

This article concerns Markov decision chains with finite state and action spaces, in which a control policy is evaluated via the expected total-reward criterion associated with a nonnegative reward function. Within this framework, a classical theorem guarantees the existence of an optimal stationary policy whenever the optimal value function is finite, a result that is obtained via a limit process using the discounted criterion. The objective of this article is to present an alternative approach, based entirely on the properties of the expected total-reward index, to establish such an existence result.
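For reference, the criterion in question can be written as follows (standard notation, assumed here rather than taken from the article itself).

```latex
% Expected total reward of a policy \pi from initial state x, with R \ge 0,
% and the optimal value function:
\[
  V(x,\pi) \;=\; \mathbb{E}^{\pi}_{x}\!\left[\,\sum_{t=0}^{\infty} R(X_t, A_t)\right],
  \qquad
  V^{*}(x) \;=\; \sup_{\pi} V(x,\pi).
\]
% The classical theorem cited in the abstract: if V^{*}(x) < \infty for every
% state x, then some deterministic stationary policy f^{*} attains
% V(x, f^{*}) = V^{*}(x) for all x.
```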


Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 674
Author(s):  
Boris Belousov ◽  
Jan Peters

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss, measured by the Kullback–Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ²-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function f. On a concrete instantiation of our framework with the α-divergence, we carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence choice on standard reinforcement learning problems.
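For orientation, the standard form of the α-divergence family and the familiar KL-penalised closed-form policy improvement that this line of work builds on can be written as follows; the notation and normalisation conventions are assumptions, not necessarily those of the paper.

```latex
% One standard form of the \alpha-divergence family (conventions vary):
\[
  D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha(\alpha-1)}
    \left( \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \;-\; 1 \right),
\]
% with the KL divergence recovered in the limit \alpha \to 1.  For the KL
% case, penalising the information loss at each improvement step with
% strength 1/\eta gives the well-known closed-form update
\[
  \pi_{\mathrm{new}}(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)\,
    \exp\!\bigl( \eta^{-1} A^{\pi_{\mathrm{old}}}(s,a) \bigr),
\]
% where A^{\pi_{\mathrm{old}}} is the advantage function; other choices of the
% generating function f lead to different actor-critic pairs.
```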


2013 ◽  
Vol 45 (2) ◽  
pp. 490-519 ◽  
Author(s):  
Xianping Guo ◽  
Mantas Vykertas ◽  
Yi Zhang

In this paper we study absorbing continuous-time Markov decision processes in Polish state spaces with unbounded transition and cost rates, and with history-dependent policies. The performance measure is the expected total undiscounted cost. For the unconstrained problem, we show the existence of a deterministic stationary optimal policy, whereas for the constrained problem with N constraints, we show the existence of a mixed stationary optimal policy, where the mixture is over no more than N+1 deterministic stationary policies. Furthermore, the strong duality result is obtained for the associated linear programs.
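In standard form (notation assumed here, not taken from the paper), the constrained problem reads as follows.

```latex
% With cost rates c_0, c_1, \dots, c_N, define the expected total
% undiscounted costs
\[
  W_i(\pi) \;=\; \mathbb{E}^{\pi}\!\left[\int_{0}^{\infty}
    c_i\bigl(x_t, a_t\bigr)\, dt\right],
  \qquad i = 0, 1, \dots, N,
\]
% and solve
\[
  \min_{\pi}\; W_0(\pi)
  \quad \text{subject to} \quad
  W_i(\pi) \le d_i, \quad i = 1, \dots, N.
\]
% The result in the abstract asserts an optimal policy mixing at most
% N+1 deterministic stationary policies.
```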


2020 ◽  
Vol 69 ◽  
pp. 1203-1254
Author(s):  
Stefan Lüdtke ◽  
Thomas Kirste

We present a model for Bayesian filtering (BF) in discrete dynamic systems where multiple entities (inter)act, i.e., where the system dynamics are naturally described by a multiset rewriting system (MRS). Typically, BF in such situations is computationally expensive due to the large number of discrete states that need to be maintained explicitly. We devise a lifted state representation, based on a suitable decomposition of multiset states, such that some factors of the distribution are exchangeable and thus afford an efficient representation. Intuitively, this representation groups together similar entities whose properties follow an exchangeable joint distribution. Subsequently, we introduce a BF algorithm that works directly on lifted states, without resorting to the original, much larger ground representation. This algorithm directly lends itself to approximate versions obtained by limiting the number of explicitly represented lifted states in the posterior. We show empirically that the lifted representation can lead to a factorial reduction in the representational complexity of the distribution and, in the approximate case, to a lower variance of the estimate and a lower estimation error compared with the original, ground representation.
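The core representational idea can be illustrated with a toy multiset encoding: exchangeable entities are stored as counts of property values rather than as an ordered tuple, which collapses permutation-equivalent ground states. The toy property values and entity count below are invented for illustration and are not the paper's model.

```python
# Minimal sketch of a lifted (multiset) state representation: entities whose
# properties are exchangeable are stored as counts, not as an ordered tuple.
from collections import Counter
from itertools import product
from math import comb

PROPS = ("idle", "walking", "cooking")   # toy property values
N_ENTITIES = 6

def lift(ground_state):
    """Map an ordered ground state to its lifted (multiset) representation."""
    return frozenset(Counter(ground_state).items())

ground_states = set(product(PROPS, repeat=N_ENTITIES))
lifted_states = {lift(g) for g in ground_states}

print(len(ground_states))   # 3**6 = 729 ground states
print(len(lifted_states))   # C(6+3-1, 3-1) = 28 lifted states
assert len(lifted_states) == comb(N_ENTITIES + len(PROPS) - 1, len(PROPS) - 1)
```

The 729-to-28 collapse in this toy example illustrates the kind of reduction the abstract refers to; the actual algorithm additionally maintains distributions over such lifted states.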


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Luo Zhe ◽  
Li Xinsan ◽  
Wang Lixin ◽  
Shen Qiang

To improve the autonomy of gliding guidance for complex flight missions, this paper proposes a multiconstrained intelligent gliding guidance strategy based on optimal guidance and reinforcement learning (RL). Three-dimensional optimal guidance is introduced to meet the terminal latitude, longitude, altitude, and flight-path-angle constraints. A velocity control strategy based on lateral sinusoidal maneuvers is proposed, and an analytical terminal-velocity prediction method that accounts for maneuvering flight is studied. Because the maneuvering amplitude used for velocity control cannot be determined offline, an intelligent parameter-adjustment method based on RL is developed. This method casts parameter determination as a Markov Decision Process (MDP), defines a state space based on terminal speed and an action space based on maneuvering amplitude, constructs a reward function that combines the terminal velocity error with the gliding guidance tasks, and uses Q-Learning to adjust the maneuvering amplitude online. Simulation results show that the intelligent gliding guidance method meets various terminal constraints with high accuracy and effectively improves autonomous decision-making under complex tasks.
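A minimal sketch of the tabular Q-Learning update involved is given below; the discretisation of terminal speed and maneuvering amplitude, the hyperparameters, the hypothetical environment interface, and all names are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of a tabular Q-Learning update for choosing a manoeuvring-amplitude
# bin from a discretised terminal-speed-error state.  Everything here is
# illustrative, not the paper's exact design.
import numpy as np

N_SPEED_BINS, N_AMPLITUDE_BINS = 20, 10
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
Q = np.zeros((N_SPEED_BINS, N_AMPLITUDE_BINS))
rng = np.random.default_rng(0)

def select_amplitude(state):
    """Epsilon-greedy choice of the manoeuvre-amplitude bin."""
    if rng.random() < EPS:
        return int(rng.integers(N_AMPLITUDE_BINS))
    return int(Q[state].argmax())

def q_update(state, action, reward, next_state):
    """One-step Q-Learning backup."""
    td_target = reward + GAMMA * Q[next_state].max()
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```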


Author(s):  
Huiqiao Fu ◽  
Kaiqiang Tang ◽  
Peng Li ◽  
Wenqi Zhang ◽  
Xinpeng Wang ◽  
...  

Legged locomotion in complex environments requires careful planning of footholds. In this paper, a novel Deep Reinforcement Learning (DRL) method is proposed to implement multi-contact motion planning for hexapod robots moving on uneven plum-blossom piles. First, the motion of hexapod robots is formulated as a Markov Decision Process (MDP) with a specified reward function. Second, a transition feasibility model is proposed for hexapod robots, which describes whether a state transition is feasible under the kinematic and dynamic constraints and in turn determines the rewards. Third, foothold and Center-of-Mass (CoM) sequences are sampled from a diagonal Gaussian distribution and optimized by learning optimal policies with the designed DRL algorithm. Both simulation results and experimental results on physical systems demonstrate the feasibility and efficiency of the proposed method. Videos are shown at https://videoviewpage.wixsite.com/mcrl.
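As a rough sketch of the sampling step, the snippet below draws a foothold/CoM sequence from a diagonal Gaussian and evaluates its log-probability, as a policy-gradient-style learner would require; the dimensions, horizon, and names are illustrative assumptions, not the paper's architecture.

```python
# Sketch: sample a foothold/CoM sequence from a diagonal Gaussian and
# compute its log-probability.  Dimensions and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, DIM = 10, 3 * 6 + 3             # per step: 6 footholds (x, y, z) + CoM

mean = np.zeros((HORIZON, DIM))           # policy outputs (placeholders here)
log_std = np.full((HORIZON, DIM), -1.0)

def sample_sequence():
    std = np.exp(log_std)
    seq = mean + std * rng.standard_normal(mean.shape)
    # diagonal Gaussian log-density, summed over steps and dimensions
    logp = -0.5 * (((seq - mean) / std) ** 2 + 2 * log_std + np.log(2 * np.pi)).sum()
    return seq, logp
```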


2019 ◽  
Vol 59 (5) ◽  
pp. 518-526
Author(s):  
Michael Vetter

Finding potential security weaknesses in any complex IT system is an important and often challenging task that is best started in the early stages of the development process. We present a method that transforms this task, for FPGA designs, into a reinforcement learning (RL) problem. This paper introduces a method for generating a Markov Decision Process-based RL model from a formal, high-level description of the system under review (formulated in a domain-specific language) together with different, quantified assumptions about the system's security. Probabilistic transitions and the reward function can be used to model the varying resilience of different elements against attacks and the capabilities of an attacker. This information is then used to determine a plausible data-exfiltration strategy. An example with multiple scenarios illustrates the workflow. A discussion of supplementary techniques such as hierarchical learning and deep neural networks concludes the paper.
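A toy version of the resulting model is sketched below: states are attacker footholds, actions are attack steps with assumed success probabilities, reaching an exfiltration state is rewarded, and value iteration recovers a plausible exfiltration strategy. All component names, probabilities, and rewards are invented for illustration and do not come from the paper.

```python
# Toy sketch: data-exfiltration analysis cast as an MDP and solved by value
# iteration.  States, actions, probabilities, and rewards are invented.
states = ["outside", "soft_core", "config_port", "exfiltrated"]
actions = {  # (from_state, attack_step): [(prob, to_state, reward), ...]
    ("outside", "phish"):        [(0.3, "soft_core", 0.0), (0.7, "outside", 0.0)],
    ("soft_core", "read_bram"):  [(0.6, "exfiltrated", 1.0), (0.4, "soft_core", 0.0)],
    ("soft_core", "jtag"):       [(0.1, "config_port", 0.0), (0.9, "soft_core", 0.0)],
    ("config_port", "readback"): [(0.8, "exfiltrated", 1.0), (0.2, "config_port", 0.0)],
}
GAMMA = 0.9

V = {s: 0.0 for s in states}
for _ in range(200):                      # value iteration
    for s in states:
        qs = [sum(p * (r + GAMMA * V[s2]) for p, s2, r in outs)
              for (s0, a), outs in actions.items() if s0 == s]
        V[s] = max(qs) if qs else 0.0     # absorbing states keep value 0

policy = {}
for s in states:
    qs = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in outs)
          for (s0, a), outs in actions.items() if s0 == s}
    if qs:
        policy[s] = max(qs, key=qs.get)

print(policy)   # most plausible attack step from each foothold
```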


2020 ◽  
Vol 50 (4) ◽  
pp. 225-238
Author(s):  
Eunhye Song ◽  
Peiling Wu-Smith ◽  
Barry L. Nelson

A vehicle content portfolio refers to the complete set of vehicle-feature combinations offered for a vehicle model, subject to certain restrictions. Vehicle Content Optimization (VCO) is a simulation-based decision support system at General Motors (GM) that helps optimize a vehicle content portfolio to improve GM's business performance and customer satisfaction. VCO has been applied to most major vehicle models at GM. VCO consists of several steps that demand intensive computing power, thus requiring trade-offs between the estimation error of the simulated performance measures and the computation time. Given VCO's substantial influence on GM's content decisions, questions were raised regarding the business risk caused by uncertainty in the simulation results. This paper shows how we successfully established an uncertainty quantification procedure for VCO that can be applied to any vehicle model at GM. With this capability, GM can not only quantify the overall uncertainty in its performance-measure estimates but also identify the largest source of uncertainty and reduce it by allocating more targeted simulation effort. Moreover, we identified several opportunities to improve the efficiency of VCO by reducing its computational overhead, some of which were adopted in the development of the next generation of VCO.
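As a generic illustration of the kind of uncertainty quantification involved, the sketch below estimates a performance measure, its standard error, and an approximate confidence interval from independent simulation replications; the VCO-specific uncertainty sources and their decomposition are not reproduced here, and `run_simulation` is a hypothetical callable.

```python
# Generic sketch: quantify uncertainty in a simulated performance measure
# from independent macro-replications.  `run_simulation` is hypothetical.
import numpy as np

def estimate_with_uncertainty(run_simulation, n_reps=30, z=1.96):
    """Point estimate, standard error and ~95% CI across replications."""
    outputs = np.array([run_simulation() for _ in range(n_reps)])
    mean = outputs.mean()
    se = outputs.std(ddof=1) / np.sqrt(n_reps)
    return mean, se, (mean - z * se, mean + z * se)
```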

