policy optimization
Recently Published Documents

TOTAL DOCUMENTS: 387 (254 in the last five years)
H-INDEX: 17 (4 in the last five years)

Symmetry ◽  
2022 ◽  
Vol 14 (1) ◽  
pp. 161
Author(s):  
Hyojoon Han ◽  
Hyukho Kim ◽  
Yangwoo Kim

The complexity of network intrusion detection systems (IDSs) is increasing due to continuous growth in network traffic, the variety of attacks, and the ever-changing network environment. In addition, network traffic is asymmetric, containing few attack samples, yet those samples are complex enough to be difficult to detect. Many studies have sought to improve intrusion detection performance through feature engineering. These approaches perform well on the datasets they were designed for, but they struggle to cope with a changing network environment. This paper proposes an intrusion detection hyperparameter control system (IDHCS) that controls and trains a deep neural network (DNN) feature extractor and a k-means clustering module as a reinforcement learning model based on proximal policy optimization (PPO). The IDHCS steers the DNN feature extractor toward the features most valuable in the current network environment and identifies intrusions through k-means clustering. Through iterative learning with the PPO-based reinforcement learning model, the system automatically optimizes its performance for the network environment in which it is deployed. Experiments were conducted on the CICIDS2017 and UNSW-NB15 datasets, yielding an F1-score of 0.96552 on CICIDS2017 and 0.94268 on UNSW-NB15. A further experiment merged the two datasets to build a larger and more complex test environment, making the attack types more diverse and their patterns more complex. On the merged dataset, an F1-score of 0.93567 was achieved, corresponding to 97% to 99% of the performance obtained on CICIDS2017 and UNSW-NB15 individually. These results show that the proposed IDHCS improves IDS performance by continuously learning new types of attacks and managing intrusion detection features regardless of changes in the network environment.
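
As a rough, hypothetical sketch of the control loop this abstract describes (not the authors' code), the snippet below lets a PPO-style controller choose two hyperparameters, the number of extracted features and the number of k-means clusters, and returns the resulting F1-score as the reward. The PCA stand-in for the DNN feature extractor, the majority-vote cluster labeling, and all function names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

def extract_features(X, n_features):
    # Stand-in for the DNN feature extractor described in the abstract.
    return PCA(n_components=n_features).fit_transform(X)

def idhcs_step(X, y, action):
    # `action` is the hyperparameter pair chosen by the PPO controller.
    n_features, n_clusters = action
    Z = extract_features(X, n_features)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
    # Flag a cluster as "attack" if attack samples dominate it (simplified rule).
    preds = np.zeros_like(y)
    for c in range(n_clusters):
        mask = clusters == c
        if mask.any() and y[mask].mean() > 0.5:
            preds[mask] = 1
    return f1_score(y, preds)  # reward fed back to the PPO controller

# Example call with synthetic data (200 flows, 20 raw features, ~10% attacks).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (rng.random(200) < 0.1).astype(int)
reward = idhcs_step(X, y, action=(5, 8))
```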


2022 ◽  
Author(s):  
Shuhuan Wen ◽  
Zhixin Ji ◽  
Ahmad B. Rad ◽  
Zhengzheng Guo

Abstract Exploration of unknown environments remains a great challenge for autonomous mobile robots due to the lack of a priori knowledge. Active Simultaneous Localization and Mapping (SLAM) is an effective way to realize obstacle avoidance and autonomous navigation, but traditional Active SLAM is usually complex to model and difficult to adapt automatically to new operating areas. This paper presents a novel Active SLAM algorithm based on Deep Reinforcement Learning (DRL). A Relational Proximal Policy Optimization (RPPO) model with depthwise separable convolutions and data batch processing predicts the action policy from acquired RGB images of the environment, enabling autonomous, collision-free exploration. Meanwhile, Gmapping is applied for localization and mapping. Then, using transfer learning, the Active SLAM algorithm is applied to complex unknown environments containing various dynamic and static obstacles. Finally, we present several experiments to demonstrate the advantages and feasibility of the proposed Active SLAM algorithm.
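
For readers unfamiliar with that building block, the minimal PyTorch sketch below (an implementation assumption, not code from the paper) shows a depthwise separable convolution: a per-channel depthwise convolution followed by a 1x1 pointwise convolution that mixes the channels.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   stride=stride, padding=padding, groups=in_ch)
        # Pointwise: 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Example: encode a batch of RGB observations for the policy network.
x = torch.randn(8, 3, 84, 84)
features = DepthwiseSeparableConv(3, 32)(x)  # -> shape (8, 32, 84, 84)
```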


2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been applied successfully to practical decision-making problems such as Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only past experience but also predictions of future states. It adds next-state information to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method, optimizing the policy with two components: the PPO loss and a model-based loss. The latter is used to train a latent transition model that predicts the representation of the next state. Evaluated across 49 Atari games in the Arcade Learning Environment (ALE), PPOMM outperforms the state-of-the-art PPO algorithm on most games; the experimental results show that it performs as well as or better than the original algorithm in 33 games.
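
A hedged sketch of the loss combination this abstract describes is shown below: the standard PPO clipped surrogate plus a prediction error from a latent transition model. The weighting coefficient beta, the tensor shapes, and all names are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def ppomm_loss(new_logp, old_logp, advantages,
               predicted_next_latent, target_next_latent,
               clip_eps=0.2, beta=0.5):
    # Standard PPO clipped surrogate (to be minimized, hence the negation).
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Model-based term: error of the latent transition model's prediction
    # of the next state's representation.
    model_loss = F.mse_loss(predicted_next_latent, target_next_latent)
    return policy_loss + beta * model_loss

# Example with dummy tensors (batch of 64 transitions, 32-dim latent state).
T, D = 64, 32
loss = ppomm_loss(torch.randn(T), torch.randn(T), torch.randn(T),
                  torch.randn(T, D), torch.randn(T, D))
```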


2021 ◽  
Vol 1 (2) ◽  
pp. 33-39
Author(s):  
Mónika Farsang ◽  
Luca Szegletes

Learning the optimal behavior is the ultimate goal in reinforcement learning. It can be pursued with many different approaches, the most successful of which are policy gradient methods. However, these can suffer from undesirably large policy updates, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper examines different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also ask whether the analyzed methods can handle not only low-dimensional tasks but also complex, high-dimensional problems in control and robotic domains. Analysis of the learned behavior shows that these methods can outperform the original PPO-Clip algorithm; moreover, they can also learn complex behaviors and policies in high-dimensional environments.
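
For reference, the clipped surrogate objective that PPO-Clip maximizes, and that restriction strategies of this kind modify, can be written as:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range; which particular restrictions the paper substitutes for the clip operation is not specified in the abstract.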


Author(s):  
Lingyun Mi ◽  
Tianwen Jia ◽  
Yang Yang ◽  
Lulu Jiang ◽  
Bangjun Wang ◽  
...  

Evaluating the effectiveness of ecological civilization policies is the basis on which policymakers can optimize those policies. From the perspective of the overall effectiveness of regional policies, and taking Jiangsu Province as an example, this study constructed a quantitative evaluation model of eco-civilization policy text and an eco-civilization evaluation index system. Using these tools, this paper evaluates the effectiveness of 53 ecological civilization policies issued by Jiangsu Province during 2004–2019 to promote the construction of ecological civilization in four fields: resource utilization, environmental protection, economic development, and social life. There are three key findings. (1) During 2004–2019, the effectiveness of the textual content of ecological civilization policies in Jiangsu Province generally showed a fluctuating upward trend. (2) The construction effectiveness indexes of all four fields of eco-civilization showed growth, but the effect varied greatly across fields: the economic development index grew rapidly, while the environmental protection index grew slowly. (3) Ecological civilization policies in Jiangsu Province were effective in promoting the construction of ecological civilization, but the effects of different policy dimensions on ecological civilization development in the four fields differed significantly. Finally, based on these results, practical recommendations are provided for optimizing eco-civilization policies in Jiangsu Province. Moreover, Jiangsu is the first province in China to launch a provincial-level ecological civilization construction plan, so its policy optimization to promote ecological civilization construction can also serve as an example and a practical reference for eco-civilization construction in other provinces in China.


2021 ◽  
Author(s):  
Zikai Feng ◽  
Yuanyuan Wu ◽  
Mengxing Huang ◽  
Di Wu

Abstract To protect ground users' downlink communications from malicious jamming by an intelligent unmanned aerial vehicle (UAV), this paper studies a new anti-UAV-jamming strategy based on multi-agent deep reinforcement learning. In this method, ground users aim to learn the best mobility strategies for avoiding the UAV's jamming. The problem is modeled as a Stackelberg game describing the competitive interaction between the UAV jammer (leader) and the ground users (followers). To reduce the computational cost of solving the equilibrium of this complex game with its large state space, a hierarchical multi-agent proximal policy optimization (HMAPPO) algorithm is proposed that decouples the hybrid game into several sub-Markov games and updates the actor and critic networks of the UAV jammer and the ground users at different time scales. Simulation results suggest that the HMAPPO-based anti-jamming strategy achieves performance comparable to the benchmark strategies with lower time complexity. The well-trained HMAPPO can obtain the optimal jamming strategy and the optimal anti-jamming strategies, which together approximate the Stackelberg equilibrium (SE).
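
A minimal, self-contained sketch of the two-time-scale update schedule this abstract alludes to is given below. The agents are placeholders and the leader update period is an assumption; only the scheduling structure, followers updating every episode and the leader updating less frequently, is illustrated.

```python
# Hypothetical sketch: the leader (UAV jammer) and followers (ground users)
# are separate PPO-style agents whose networks update at different time scales.
class StubAgent:
    def __init__(self, name):
        self.name = name
        self.updates = 0

    def update(self):
        # Placeholder for a PPO actor-critic update from collected rollouts.
        self.updates += 1

LEADER_PERIOD = 5  # assumed: leader updates once every 5 episodes
jammer = StubAgent("uav_jammer")
users = [StubAgent(f"user_{i}") for i in range(3)]

for episode in range(20):
    for u in users:                    # fast time scale: every episode
        u.update()
    if episode % LEADER_PERIOD == 0:   # slow time scale: leader only
        jammer.update()

print(jammer.updates, users[0].updates)  # 4 vs. 20
```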


2021 ◽  
pp. 1-10
Author(s):  
Wei Zhou ◽  
Xing Jiang ◽  
Bingli Guo (Member, IEEE) ◽  
Lingyu Meng

Quality-of-Service (QoS)-aware routing is currently one of the crucial challenges in Software-Defined Networking (SDN). QoS metrics such as latency, packet loss ratio, and throughput must be optimized to improve network performance. Traditional static routing algorithms based on Open Shortest Path First (OSPF) cannot adapt to traffic fluctuations, which may cause severe network congestion and service degradation. The central intelligence of the SDN controller and recent breakthroughs in Deep Reinforcement Learning (DRL) offer a promising way to tackle this challenge. Thus, we propose an on-policy DRL mechanism, the Proximal Policy Optimization (PPO)-based QoS-aware Routing Optimization Mechanism (PQROM), to achieve general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM handles both continuous and discrete action spaces with high-dimensional inputs and outputs. OMNeT++ simulation results show that PQROM not only converges well but also offers better stability than OSPF, less training time and simpler hyperparameter tuning than Deep Deterministic Policy Gradient (DDPG), and lower hardware consumption than Asynchronous Advantage Actor-Critic (A3C).
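
A hedged illustration of the re-customizable reward idea: the QoS metrics are combined with adjustable weights so the same PPO agent can be pointed at different optimization objectives. The specific weights and normalization constants below are assumptions for illustration only, not values from the paper.

```python
def qos_reward(latency_ms, loss_ratio, throughput_mbps,
               w_latency=0.4, w_loss=0.4, w_throughput=0.2,
               max_latency_ms=200.0, max_throughput_mbps=1000.0):
    # Each term is normalized to [0, 1]; higher is better.
    latency_term = 1.0 - min(latency_ms / max_latency_ms, 1.0)
    loss_term = 1.0 - min(max(loss_ratio, 0.0), 1.0)
    throughput_term = min(throughput_mbps / max_throughput_mbps, 1.0)
    return (w_latency * latency_term
            + w_loss * loss_term
            + w_throughput * throughput_term)

# Re-customization example: emphasize latency without touching the agent code.
r = qos_reward(35.0, 0.01, 420.0, w_latency=0.7, w_loss=0.2, w_throughput=0.1)
```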

