optimal action
Recently Published Documents


TOTAL DOCUMENTS

81
(FIVE YEARS 39)

H-INDEX

10
(FIVE YEARS 2)

2021 ◽  
Vol 11 (23) ◽  
pp. 11162
Author(s):  
Bonwoo Gu ◽  
Yunsick Sung

A Deep Q-Network (DQN) controls a virtual agent at the level of a player using only screenshots as inputs. Replay memory selects a limited number of experience replays according to an arbitrary batch size and updates them using the associated Q-function. Hence, relatively few experience replays of different states are utilized when the number of states is fixed and the states of the randomly selected transitions become identical or similar. The DQN may not be applicable in environments where the learning process must use more experience replays than the limited batch size allows. In addition, because it is unknown whether each action can be executed, the amount of repetitive learning increases as more non-executable actions are selected. In this study, an enhanced DQN framework is proposed to resolve the batch size problem and to reduce the learning time of a DQN in an environment with numerous non-executable actions. In the proposed framework, non-executable actions are filtered out to reduce the number of selectable actions when identifying the optimal action for the current state. The proposed method was validated in Gomoku, a strategy board game in which applying a traditional DQN would be difficult.
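The filtering step described above can be illustrated with a short, hypothetical sketch in which non-executable actions (e.g. already occupied Gomoku cells) are masked out before the greedy action is taken from the Q-values. The board encoding, the mask, and the epsilon-greedy wrapper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(q_values, executable_mask, epsilon=0.1):
    """Epsilon-greedy selection restricted to executable actions.

    q_values:        1-D array of Q-values, one per action (e.g. per board cell).
    executable_mask: boolean array, True where the action can actually be executed
                     (e.g. the Gomoku cell is still empty).
    """
    if rng.random() < epsilon:
        # Explore only among executable actions.
        return int(rng.choice(np.flatnonzero(executable_mask)))
    masked_q = np.where(executable_mask, q_values, -np.inf)  # filter out invalid actions
    return int(np.argmax(masked_q))

# Toy usage on a 3x3 board flattened to 9 actions; cells 0 and 4 are already occupied.
q = rng.normal(size=9)
mask = np.ones(9, dtype=bool)
mask[[0, 4]] = False
print(select_action(q, mask))
```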


2021 ◽  
Vol 50 (3) ◽  
pp. 507-521
Author(s):  
Atif Mehmood ◽  
Inam ul Hasan Shaikh ◽  
Ahsan Ali

Deep reinforcement learning is a fast-growing technique for solving complex real-world problems within a simple mathematical framework. It involves an agent, actions, an environment, and a reward: the agent interacts with the environment and takes an optimal action aiming to maximize the total reward. This paper proposes the deep deterministic policy gradient technique for handling the complex continuous action space of three-wheeled omnidirectional mobile robots. Trajectory tracking for three-wheeled omnidirectional mobile robots is a difficult task because the orientation of the wheels makes the robot rotate around its own axis rather than follow the trajectory. A deep deterministic policy gradient (DDPG) algorithm has been designed to train in environments with a continuous action space to follow the trajectory, by training the neural networks defined for the policy and value function to maximize the reward function defined for trajectory tracking. The DDPG agent and environment are created in the Reinforcement Learning Toolbox in MATLAB 2019, while the Deep Network Designer is used for the actor and critic network design. Results illustrate the effectiveness of the technique, with the tracking error converging approximately to zero.
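The paper implements the agent with MATLAB's Reinforcement Learning Toolbox; as a language-neutral illustration of what a DDPG update involves, the sketch below shows the critic regression target, the deterministic policy-gradient actor loss, and the soft target-network updates in PyTorch. The state and action dimensions, network sizes, and hyperparameters are placeholders, not values from the paper.

```python
import copy
import torch
import torch.nn as nn

# Placeholder dimensions: state = pose/tracking errors, action = three wheel velocities.
STATE_DIM, ACTION_DIM, GAMMA, TAU = 6, 3, 0.99, 0.005

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One DDPG step on a batch of transitions (all arguments are 2-D tensors)."""
    with torch.no_grad():  # bootstrapped critic target
        y = r + GAMMA * (1 - done) * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()  # maximize Q(s, pi(s))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for target, online in ((target_actor, actor), (target_critic, critic)):
        for t, p in zip(target.parameters(), online.parameters()):
            t.data.mul_(1 - TAU).add_(TAU * p.data)  # soft target update
```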


Sensors ◽  
2021 ◽  
Vol 21 (18) ◽  
pp. 6187
Author(s):  
Yeonggul Jang ◽  
Byunghwan Jeon

Accurate identification of the coronary ostia from 3D coronary computed tomography angiography (CCTA) is an essential prerequisite for automatically tracking and segmenting the three main coronary arteries. In this paper, we propose a novel deep reinforcement learning (DRL) framework to localize the two coronary ostia from 3D CCTA. An optimal action policy is determined using a fully explicit spatial-sequential encoding policy network applied to 2.5D Markovian states with three past histories. The proposed network is trained using a dueling DRL framework on the CAT08 dataset. The experimental results show that our method is more efficient and accurate than the other methods. Floating-point operations (FLOPs) are calculated to measure computational efficiency; the proposed method requires 2.5M FLOPs, about 10 times fewer than 3D box-based methods. In terms of accuracy, the proposed method yields errors of 2.22 ± 1.12 mm and 1.94 ± 0.83 mm on the left and right coronary ostia, respectively. The proposed method can be applied to tasks that identify other target objects by changing the target locations in the ground-truth data. Further, it can be utilized as a pre-processing step for coronary artery tracking methods.
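The abstract mentions training in a dueling DRL framework; the snippet below is a generic dueling Q-network head in PyTorch, not the authors' architecture. The six discrete actions (one-voxel moves along ±x, ±y, ±z) and the feature dimension are assumptions for illustration, and the encoder over the 2.5D state is omitted.

```python
import torch
import torch.nn as nn

class DuelingQHead(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feat_dim=128, n_actions=6):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.advantage = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, features):
        v = self.value(features)                    # state value, shape (B, 1)
        a = self.advantage(features)                # advantages, shape (B, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # identifiable Q-values

# Usage: `features` would come from an encoder over the 2.5D state
# (orthogonal CCTA slices plus past history), which is not shown here.
q = DuelingQHead()(torch.randn(4, 128))
print(q.shape)  # torch.Size([4, 6])
```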


2021 ◽  
Vol 2 ◽  
Author(s):  
Ashlesha Akella ◽  
Chin-Teng Lin

In formation control, a robot (or an agent) learns to align itself in a particular spatial alignment. However, in some scenarios, it is also vital to learn temporal alignment along with spatial alignment. An effective control system encompasses flexibility, precision, and timeliness. Existing reinforcement learning algorithms excel at learning to select an action given a state; however, executing an optimal action at an appropriate time remains challenging. Building a reinforcement learning agent that can learn an optimal time to act along with an optimal action can address this challenge. Neural networks in which timing relies on dynamic changes in the activity of a population of neurons have been shown to be a more effective representation of time. In this work, we trained a reinforcement learning agent to create its own representation of time using a neural network with a population of recurrently connected nonlinear firing-rate neurons. Trained using a reward-based recursive least-squares algorithm, the agent learned to produce a neural trajectory that peaks at the “time-to-act”; thus, it learns “when” to act. A few control system applications also require the agent to temporally scale its action, so we trained the agent to temporally scale its action for different speed inputs. Furthermore, given one state, the agent could learn to plan multiple future actions, that is, multiple times to act, without needing to observe a new state.
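As a rough illustration of the kind of recurrent firing-rate dynamics the abstract describes, the sketch below Euler-integrates a standard rate model whose linear readout would, after training, peak at the desired time-to-act. The equations, random weights, and speed input are generic assumptions; this is not the paper's trained network or its reward-based recursive least-squares rule.

```python
import numpy as np

def simulate_firing_rate_network(T=200, N=100, dt=1.0, tau=10.0, speed_input=1.0, seed=0):
    """Euler simulation of a recurrent firing-rate network (illustrative only).

    Dynamics of a standard rate model:
        tau * dx/dt = -x + W @ r + w_in * speed_input,  r = tanh(x)
    A trained readout w_out would make the output z(t) peak at the time-to-act;
    here W, w_in, and w_out are random placeholders.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 1.5 / np.sqrt(N), (N, N))   # recurrent weights
    w_in = rng.normal(0, 1.0, N)                  # input weights (scaled by speed input)
    w_out = rng.normal(0, 1.0 / np.sqrt(N), N)    # linear readout
    x = rng.normal(0, 0.5, N)
    z = np.empty(T)
    for t in range(T):
        r = np.tanh(x)
        x = x + (dt / tau) * (-x + W @ r + w_in * speed_input)
        z[t] = w_out @ r                          # readout: the "neural trajectory"
    return z

print(simulate_firing_rate_network()[:5])
```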


Author(s):  
Zack Fitzsimmons ◽  
Edith Hemaspaandra

The computational study of election problems generally focuses on questions related to the winner or set of winners of an election. But social preference functions such as the Kemeny rule output a full ranking of the candidates (a consensus). We study the complexity of consensus-related questions, with a particular focus on Kemeny and its qualitative version, Slater. The simplest of these questions is the problem of determining whether a ranking is a consensus, and we show that this problem is coNP-complete. We also study the natural question of the complexity of manipulative actions that have a specific consensus as a goal. Though determining whether a ranking is a Kemeny consensus is hard, the optimal action for manipulators is simply to vote their desired consensus. We provide evidence that this simplicity is caused by the combination of election system (Kemeny), manipulative action (manipulation), and manipulative goal (consensus). In the process, we provide the first completeness results at the second level of the polynomial hierarchy for electoral manipulation and for optimal solution recognition.
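For readers unfamiliar with the Kemeny rule, the short sketch below computes the Kemeny score of a ranking (its total pairwise disagreement with the votes) and brute-forces a consensus for a toy profile. Scoring one ranking is easy; certifying that no ranking scores lower is what the hardness result above concerns. The example profile is purely illustrative.

```python
from itertools import combinations, permutations

def kemeny_score(ranking, votes):
    """Total pairwise disagreement between one ranking and a profile of votes.

    ranking, votes: sequences of candidates, most preferred first.
    A Kemeny consensus is a ranking minimizing this score.
    """
    pos = {c: i for i, c in enumerate(ranking)}
    score = 0
    for vote in votes:
        vpos = {c: i for i, c in enumerate(vote)}
        for a, b in combinations(ranking, 2):
            # Count the pair as a disagreement if the two orders differ.
            if (pos[a] < pos[b]) != (vpos[a] < vpos[b]):
                score += 1
    return score

votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
# Brute-force consensus for a tiny profile (exponential in the number of candidates).
best = min(permutations("abc"), key=lambda r: kemeny_score(r, votes))
print(best, kemeny_score(best, votes))
```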


2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Rongyao Yuan ◽  
Yang Yang ◽  
Chao Su ◽  
Shaopei Hu ◽  
Heng Zhang ◽  
...  

Magnetorheological (MR) dampers are intelligent vibration-damping devices that can change the damping of the MR material within milliseconds. Traditional semiactive control strategies cannot fully exploit the ability of MR dampers to dissipate energy and reduce vibration under different currents, and it is difficult to control the MR dampers accurately. In this paper, a semiactive control strategy based on reinforcement learning (RL) is proposed, which relies on exploration to learn the optimal action value of the MR dampers, the applied current, at each step of the operation. During damping control, the learned optimal action value for each step is input into the MR dampers so that they provide the optimal damping force to the structure. Applying this strategy to a two-story frame structure was found to provide more accurate control of the MR dampers, significantly improving their damping effect.
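As a rough sketch of what such an RL-based semiactive strategy might look like, the code below runs tabular Q-learning over a discretized set of candidate currents. The state discretization, the current levels, and the reward (penalizing structural response) are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

# Hypothetical discretization: structural-response states x candidate currents (amperes).
N_STATES = 50
CURRENTS = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = np.zeros((N_STATES, len(CURRENTS)))
rng = np.random.default_rng(0)

def choose_current(state):
    """Epsilon-greedy choice of the current applied to the MR damper."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(CURRENTS)))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    """Standard Q-learning update. The reward would penalize structural response
    (e.g. displacement and acceleration of the frame); that choice is an assumption."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])

# Toy usage: one interaction step.
a = choose_current(state=3)
update(state=3, action=a, reward=-0.2, next_state=7)
print(CURRENTS[a], Q[3, a])
```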


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 737
Author(s):  
Fengjie Sun ◽  
Xianchang Wang ◽  
Rui Zhang

An Unmanned Aerial Vehicle (UAV) can greatly reduce manpower in agricultural plant protection tasks such as watering, sowing, and pesticide spraying. It is essential to develop a Decision-making Support System (DSS) to help UAVs choose the correct action in each state according to the policy. In an unknown environment, formulating rules to help UAVs choose actions is not applicable, and obtaining the optimal policy through reinforcement learning is a feasible solution. However, experiments show that existing reinforcement learning algorithms cannot obtain the optimal policy for a UAV in the agricultural plant protection environment. In this work, we propose an improved Q-learning algorithm based on similar state matching, and we prove theoretically that a UAV has a greater probability of choosing the optimal action under the policy learned by the proposed algorithm than under the classic Q-learning algorithm in the agricultural plant protection environment. The proposed algorithm is implemented and tested on datasets that are evenly distributed based on real UAV parameters and real farm information. The performance evaluation of the algorithm is discussed in detail. Experimental results show that the proposed algorithm can efficiently learn the optimal policy for UAVs in the agricultural plant protection environment.
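The similar-state-matching idea can be sketched as follows: when the agent encounters a state, it reuses the Q-values of the most similar previously visited state instead of starting from scratch. The state features, the Euclidean similarity metric, and the threshold below are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def most_similar_state(state, known_states, threshold=1.0):
    """Index of the closest previously visited state, or None if none is close enough."""
    if not known_states:
        return None
    dists = [np.linalg.norm(np.asarray(state) - s) for s in known_states]
    i = int(np.argmin(dists))
    return i if dists[i] <= threshold else None

class SimilarStateQLearner:
    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.states, self.Q = [], []   # parallel lists: state vector -> row of Q-values
        self.n_actions, self.alpha, self.gamma = n_actions, alpha, gamma

    def _index(self, state):
        i = most_similar_state(state, self.states)
        if i is None:                  # unseen region: start a new entry,
            self.states.append(np.asarray(state, float))
            self.Q.append(np.zeros(self.n_actions))
            i = len(self.states) - 1
        return i                       # otherwise reuse the similar state's Q-values

    def act(self, state):
        return int(np.argmax(self.Q[self._index(state)]))

    def update(self, s, a, r, s2):
        i, j = self._index(s), self._index(s2)
        self.Q[i][a] += self.alpha * (r + self.gamma * self.Q[j].max() - self.Q[i][a])

# Toy usage with 2-D state features (e.g. normalized UAV position).
agent = SimilarStateQLearner(n_actions=4)
a = agent.act([0.2, 0.5])
agent.update([0.2, 0.5], a, r=1.0, s2=[0.25, 0.5])
```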


2021 ◽  
Vol 13 (2) ◽  
pp. 57-80
Author(s):  
Arunita Kundaliya ◽  
D.K. Lobiyal

In resource-constrained Wireless Sensor Networks (WSNs), enhancing the network lifetime has been one of the most significant challenges for researchers. Researchers have been exploiting machine learning techniques, in particular reinforcement learning, to achieve efficient solutions in the WSN domain. The objective of this paper is to apply Q-learning, a reinforcement learning technique, to enhance the lifetime of the network by developing distributed routing protocols. Q-learning is an attractive choice for routing due to its low computational requirements and low additional memory demands. To enable the agent running at each node to take an optimal action, the approach considers a node's residual energy, hop length to the sink, and transmission power. The residual energy and hop length parameters are used to calculate the Q-value, which in turn is used to decide the optimal next hop for routing. The proposed protocols' performance is evaluated through NS3 simulations and compared with the AODV protocol in terms of network lifetime, throughput, and end-to-end delay.
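A minimal sketch of how such a Q-value-based next-hop choice might work is given below: the reward combines a neighbour's normalized residual energy and its hop distance to the sink, and a standard Q-learning update ranks the neighbours. The weighting, normalization, and toy topology are assumptions, not the paper's exact formulation.

```python
def reward(residual_energy, max_energy, hop_length, max_hops, w_e=0.6, w_h=0.4):
    """Composite reward favouring neighbours with more remaining energy and
    fewer hops to the sink. The weights and normalization are assumptions."""
    return w_e * (residual_energy / max_energy) + w_h * (1.0 - hop_length / max_hops)

def update_q(q_table, node, neighbor, r, alpha=0.1, gamma=0.9):
    """Q-learning update for choosing `neighbor` as the next hop from `node`.

    q_table: dict mapping node -> dict of neighbor -> Q-value.
    The bootstrapped term uses the best Q-value known at the neighbor."""
    best_next = max(q_table.get(neighbor, {}).values(), default=0.0)
    q = q_table.setdefault(node, {}).get(neighbor, 0.0)
    q_table[node][neighbor] = q + alpha * (r + gamma * best_next - q)
    return q_table[node][neighbor]

def choose_next_hop(q_table, node):
    """Greedy routing decision: the neighbour with the highest learned Q-value."""
    return max(q_table[node], key=q_table[node].get)

# Toy usage: node 1 has neighbours 2 and 3.
q = {1: {2: 0.0, 3: 0.0}}
update_q(q, 1, 2, reward(0.8, 1.0, 2, 10))
update_q(q, 1, 3, reward(0.3, 1.0, 1, 10))
print(choose_next_hop(q, 1))
```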

