Reinforcement Learning in Optimizing Forest Management

Author(s):  
Pekka Malo ◽  
Olli Tahvonen ◽  
Antti Suominen ◽  
Philipp Back ◽  
Lauri Viitasaari

We solve a stochastic high-dimensional optimal harvesting problem using reinforcement learning algorithms developed for agents who learn an optimal policy in a sequential decision process through repeated experience. This approach produces optimal solutions without discretization of state and control variables. Our stand-level model includes mixed species, tree size structure, optimal harvest timing, choice between rotation and continuous cover forestry, stochasticity in stand growth, and stochasticity in the occurrence of natural disasters. The optimal solution or policy maps the system state to the set of actions, i.e., clear-cut, thinning, or no-harvest decisions and the intensity of thinning over tree species and size classes. The algorithm reproduces the solutions to deterministic problems computed earlier with time-consuming methods. The optimal policy describes harvesting choices from any initial state and reveals how the initial thinning vs. clear-cut choice depends on economic and ecological factors. Stochasticity in stand growth increases the diversity of species composition. Despite the high variability in natural regeneration, the optimal policy closely satisfies the certainty equivalence principle. The effect of natural disasters is similar to an increase in the interest rate, but in contrast to earlier results, this tends to change the management regime from rotation forestry to continuous cover management.
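The abstract does not include implementation details; the sketch below only illustrates, under assumed dimensions and names, how a stand state and a harvest action of the kind described (clear-cut, thinning with per-class intensities, or no harvest) might be represented for an RL agent.

```python
import numpy as np

# Hypothetical illustration (not the authors' model): a stand state as a matrix of
# stem counts per species and size class, and a harvest action consisting of a
# regime choice plus per-class thinning intensities.
N_SPECIES, N_SIZE_CLASSES = 2, 12

def sample_initial_state(rng):
    """Random stand: stems/ha for each (species, size class) pair."""
    return rng.uniform(0, 500, size=(N_SPECIES, N_SIZE_CLASSES))

def apply_action(state, regime, thinning_intensity):
    """regime: 'no_harvest', 'thin', or 'clear_cut'.
    thinning_intensity: fraction removed per (species, size class), in [0, 1]."""
    if regime == "clear_cut":
        return np.zeros_like(state)
    if regime == "thin":
        return state * (1.0 - thinning_intensity)
    return state  # no harvest

rng = np.random.default_rng(0)
state = sample_initial_state(rng)
intensity = np.clip(rng.normal(0.3, 0.1, size=state.shape), 0, 1)
next_state = apply_action(state, "thin", intensity)
print(next_state.shape)  # (2, 12)
```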

Author(s):  
Swetasudha Panda ◽  
Yevgeniy Vorobeychik

We propose a novel Stackelberg game model of MDP interdiction in which the defender modifies the initial state of the planner, who then responds by computing an optimal policy starting from that state. We first develop an approach for MDP interdiction in factored state spaces that allows the defender to modify the initial state. The resulting approach can be computationally expensive for large factored MDPs. To address this, we develop several interdiction algorithms that leverage variations of reinforcement learning using both linear and non-linear function approximation. Finally, we extend the interdiction framework to a Bayesian interdiction problem in which the interdictor is uncertain about some of the planner's initial state features. Extensive experiments demonstrate the effectiveness of our approaches.
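As a rough illustration of the interdiction idea (not the paper's factored or RL-based algorithms), the sketch below lets a defender pick, from a small set of candidate initial states, the one that minimizes the planner's optimal value computed by plain value iteration; all dimensions and dynamics are made up for the example.

```python
import numpy as np

# Toy tabular sketch of initial-state interdiction: the defender chooses the
# initial state, then the planner acts optimally from it; the defender therefore
# picks the candidate state with the lowest optimal value.

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P: [A, S, S] transition tensor, R: [S, A] rewards. Returns optimal V over states."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def interdict_initial_state(P, R, candidate_states):
    """Defender chooses the candidate initial state with the lowest planner value."""
    V = value_iteration(P, R)
    return min(candidate_states, key=lambda s: V[s]), V

# Toy example: 4 states, 2 actions, random dynamics and rewards.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4), size=(2, 4))      # P[a, s, :] sums to 1
R = rng.uniform(0, 1, size=(4, 2))
best_s, V = interdict_initial_state(P, R, candidate_states=[0, 1, 2])
print("defender's choice of initial state:", best_s)
```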


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Peter Morales ◽  
Rajmonda Sulo Caceres ◽  
Tina Eliassi-Rad

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low-quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and a notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.
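The abstract leaves the learning update unspecified; the following is a generic one-step advantage actor-critic update on an embedded network state, shown only as a hedged sketch. The embedding dimension, architectures, and reward are placeholders rather than NAC's actual components.

```python
import torch
import torch.nn as nn

# Illustrative only: a generic one-step actor-critic update on an embedded network
# state. NAC additionally relies on task-specific network embeddings and offline
# planning, which are not reproduced here.
EMB_DIM, N_CANDIDATES = 32, 10

actor = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, N_CANDIDATES))
critic = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(state_emb, next_emb, reward, action, gamma=0.99):
    """One advantage actor-critic step; state_emb, next_emb are [EMB_DIM] embeddings."""
    value, next_value = critic(state_emb), critic(next_emb).detach()
    advantage = reward + gamma * next_value - value
    log_prob = torch.log_softmax(actor(state_emb), dim=-1)[action]
    loss = -log_prob * advantage.detach() + advantage.pow(2)  # policy loss + value loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example call with random embeddings standing in for the embedded network state.
update(torch.randn(EMB_DIM), torch.randn(EMB_DIM), reward=1.0, action=3)
```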


Author(s):  
Ming-Sheng Ying ◽  
Yuan Feng ◽  
Sheng-Gang Ying

The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of the MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
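For intuition only, the sketch below evaluates a policy on a classical finite-horizon MDP by backward induction; the quantum ingredients of qMDPs (quantum states and superoperators) are not reproduced, and all dimensions are placeholders.

```python
import numpy as np

# Classical finite-horizon policy evaluation via backward induction (the classical
# analogue of the dynamic programming recursion, not the qMDP version).
def evaluate_policy(P, R, policy, horizon):
    """P: [A, S, S] transitions, R: [S, A] rewards, policy: [T, S] action indices.
    Returns V[t, s] = expected return from state s at time t."""
    n_states = R.shape[0]
    V = np.zeros((horizon + 1, n_states))
    for t in reversed(range(horizon)):
        for s in range(n_states):
            a = policy[t, s]
            V[t, s] = R[s, a] + P[a, s] @ V[t + 1]
    return V

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=(2, 3))   # 2 actions, 3 states
R = rng.uniform(size=(3, 2))
policy = rng.integers(0, 2, size=(4, 3))     # horizon 4
print(evaluate_policy(P, R, policy, horizon=4)[0])
```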


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 737
Author(s):  
Fengjie Sun ◽  
Xianchang Wang ◽  
Rui Zhang

An Unmanned Aerial Vehicle (UAV) can greatly reduce manpower in agricultural plant protection tasks such as watering, sowing, and pesticide spraying. It is essential to develop a Decision-making Support System (DSS) that helps UAVs choose the correct action in each state according to a policy. In an unknown environment, hand-crafted rules for choosing actions are not applicable, and obtaining the optimal policy through reinforcement learning is a feasible alternative. However, experiments show that existing reinforcement learning algorithms cannot obtain the optimal policy for a UAV in the agricultural plant protection environment. In this work we propose an improved Q-learning algorithm based on similar state matching, and we prove theoretically that a UAV following the policy learned by our algorithm chooses the optimal action with a greater probability than one following the classic Q-learning algorithm in the agricultural plant protection environment. The proposed algorithm is implemented and tested on evenly distributed datasets built from real UAV parameters and real farm information, and its performance is evaluated in detail. Experimental results show that the proposed algorithm can efficiently learn the optimal policy for UAVs in the agricultural plant protection environment.
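The abstract does not spell out the similar-state matching rule, so the sketch below assumes one plausible variant: an unseen state borrows its initial Q-values from the nearest previously visited state; the action count and the similarity measure are placeholders.

```python
import numpy as np

# Assumed variant of Q-learning with similar-state matching (illustrative only):
# when a state has never been visited, initialize its Q-values from the most
# similar visited state (Euclidean distance on the state vector).
N_ACTIONS = 4

class SimilarStateQLearning:
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.Q = {}            # maps state tuple -> array of action values
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def _q(self, state):
        key = tuple(state)
        if key not in self.Q:
            if self.Q:  # borrow values from the most similar visited state
                nearest = min(self.Q, key=lambda k: np.linalg.norm(np.array(k) - state))
                self.Q[key] = self.Q[nearest].copy()
            else:
                self.Q[key] = np.zeros(N_ACTIONS)
        return self.Q[key]

    def act(self, state, rng):
        if rng.random() < self.epsilon:           # epsilon-greedy exploration
            return int(rng.integers(N_ACTIONS))
        return int(np.argmax(self._q(state)))

    def update(self, state, action, reward, next_state):
        target = reward + self.gamma * np.max(self._q(next_state))
        self._q(state)[action] += self.alpha * (target - self._q(state)[action])

# One illustrative transition with random 3-dimensional states.
rng = np.random.default_rng(3)
agent = SimilarStateQLearning()
s, s_next = rng.random(3), rng.random(3)
a = agent.act(s, rng)
agent.update(s, a, reward=1.0, next_state=s_next)
```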


1983 ◽  
Vol 40 (7) ◽  
pp. 987-1024 ◽  
Author(s):  
Lionel Johnson

The results of investigations on the fish stocks of seven Arctic lakes covering a period of 23 yr are described. These lakes have remained largely undisturbed since their formation in late glacial times; all but one are completely autonomous and of comparatively small size. Such lakes provide a unique opportunity for the development and testing of conceptual models. In all cases the only fish species present is Arctic charr, Salvelinus alpinus. Length frequency distributions derived from gillnet catch curves are shown to be, within reasonable limits, representative of the actual populations in the lake, and not artifacts of the sampling procedure. Length frequency curves show a unimodal or bimodal distribution and this structure, in the absence of perturbation, appears to remain constant indefinitely. Individuals are of great age but age-at-length is highly variable. Age and size structure are shown to be comparable with the age and size structure of the dominant tree species in a climax forest; it is concluded that forces of great generality fashion these configurations. It is hypothesized that all species tend to move towards a state of least energy dissipation; this can be most readily seen in the dominant species at the climax in an autonomous system. The dominant species is characterized by large individual size, a high degree of uniformity, high total biomass, great mean age, indeterminate age-at-death, and a low incidence of replacement stock. After severe perturbation it is shown that the charr stock returns to a state of least dissipation without oscillation. Absence of oscillation during the return to the initial state, combined with the long-term stability shown in control lakes, indicates the presence of an effective damping mechanism; this in turn indicates the existence of organization within the stock as a whole. Organization develops through an interactive mechanism described under the doctrine of homeokinesis, which is responsible for energy equipartitioning and the maintenance of uniformity. These concepts help to explain phenomena observed in more complex systems and help our understanding of ecosystem functioning.


Author(s):  
Александр Александрович Воевода ◽  
Дмитрий Олегович Романников

Controller synthesis for multichannel systems is a relevant and difficult problem. One possible approach to synthesis is the use of neural networks. A neural controller is either trained on precomputed data or used to tune the parameters of a PID controller starting from a stable initial position of the closed-loop system. We propose using neural networks to control a two-channel plant, with training performed from an unstable (arbitrary) initial position using reinforcement learning methods. A structure for the neural network and the closed-loop system is proposed in which the set point is supplied as an input parameter of the controller's neural network.

The problem of synthesizing automatic control systems is hard, especially for multichannel objects. One of the approaches is the use of neural networks. For approaches based on reinforcement learning, there is an additional issue: supporting a range of values for the set point. We propose a method for synthesizing automatic control systems with neural networks, together with a reinforcement learning procedure that allows the network to learn to regulate over a predefined range of set points. The main steps of the method are: 1) form the neural network input from the state of the object and the system set point; 2) simulate the system with a set of set points randomly generated from the desired range; 3) perform one step of learning using the Deterministic Policy Gradient method. The originality of the proposed method is that, in contrast to existing methods of using a neural network to synthesize a controller, it allows training the controller in the closed-loop system from an unstable initial state and over a range of set points. The method was applied to the problem of stabilizing the outputs of a two-channel object, for which stabilization of both outputs is required, with the first output held near the input set point.
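A minimal sketch of the described training step, assuming placeholder dimensions, architectures, and a randomly drawn set point (not the authors' implementation): the controller network receives the plant state concatenated with the set point, and a single Deterministic Policy Gradient update is applied to the critic and then the actor.

```python
import torch
import torch.nn as nn

# Assumed dimensions for the sketch: plant state, set point, and control action.
STATE_DIM, SETPOINT_DIM, ACTION_DIM = 4, 2, 2

actor = nn.Sequential(nn.Linear(STATE_DIM + SETPOINT_DIM, 64), nn.Tanh(),
                      nn.Linear(64, ACTION_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM + SETPOINT_DIM + ACTION_DIM, 64), nn.Tanh(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def dpg_step(state, setpoint, action, reward, next_state, gamma=0.99):
    """One Deterministic Policy Gradient update (step 3) for an observed transition."""
    x = torch.cat([state, setpoint])
    with torch.no_grad():  # bootstrapped target for the critic
        next_x = torch.cat([next_state, setpoint])
        target = reward + gamma * critic(torch.cat([next_x, actor(next_x)]))
    q = critic(torch.cat([x, action]))
    critic_loss = (q - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update: ascend the critic's estimate of Q(state, set point, actor(...)).
    actor_loss = -critic(torch.cat([x, actor(x)])).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Steps 1-2: form the input from state and set point, drawing the set point at
# random from the desired range, then apply one learning step.
state, next_state = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
setpoint = torch.empty(SETPOINT_DIM).uniform_(-1.0, 1.0)
action = actor(torch.cat([state, setpoint])).detach()
dpg_step(state, setpoint, action, reward=torch.tensor([0.0]), next_state=next_state)
```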


2021 ◽  
Author(s):  
Yunfan Su

A vehicular ad hoc network (VANET) is a promising technique that improves traffic safety and transportation efficiency and provides a comfortable driving experience. However, due to the rapid growth of applications that demand channel resources, efficient channel allocation schemes are required to fully exploit the performance of vehicular networks. In this thesis, two reinforcement learning (RL)-based channel allocation methods are proposed for a cognitive-enabled VANET environment to maximize a long-term average system reward. First, we present a model-based dynamic programming method, which requires the calculation of the transition probabilities and the time intervals between decision epochs. After obtaining the transition probabilities and time intervals, a relative value iteration (RVI) algorithm is used to find the asymptotically optimal policy. Then, we propose a model-free reinforcement learning method in which an agent interacts with the environment iteratively and learns from the feedback to approximate the optimal policy. Simulation results show that our reinforcement learning method achieves a performance similar to that of dynamic programming, while both outperform the greedy method.
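As a hedged illustration of the model-based part, the sketch below runs relative value iteration on a small random average-reward MDP; the thesis's semi-Markov formulation with estimated transition probabilities and decision-epoch intervals is not modelled, and the problem sizes are made up.

```python
import numpy as np

# Relative value iteration (RVI) for an average-reward MDP: iterate the Bellman
# backup and subtract the value at a reference state so the iterates stay bounded.
def relative_value_iteration(P, R, ref_state=0, tol=1e-8, max_iter=10_000):
    """P: [A, S, S] transitions, R: [S, A] rewards.
    Returns (gain, bias values h, greedy policy)."""
    n_states = R.shape[0]
    h = np.zeros(n_states)
    for _ in range(max_iter):
        Q = R + np.einsum("ast,t->sa", P, h)
        h_new = Q.max(axis=1)
        gain = h_new[ref_state]          # estimate of the average reward per step
        h_new = h_new - gain
        if np.max(np.abs(h_new - h)) < tol:
            return gain, h_new, Q.argmax(axis=1)
        h = h_new
    return gain, h, Q.argmax(axis=1)

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(5), size=(3, 5))   # 3 actions, 5 states
R = rng.uniform(size=(5, 3))
gain, h, policy = relative_value_iteration(P, R)
print("average reward per step:", round(gain, 4))
```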


