Alpha-T: Learning to Traverse over Graphs with An AlphaZero-inspired Self-Play Framework
Abstract Combinatorial optimization problems on graphs are classic, central problems in artificial intelligence and operations research. For example, the Vehicle Routing Problem (VRP) and the Traveling Salesman Problem (TSP) are not only well-studied NP-hard problems but also matter in practice for real transportation systems. Traditional approaches such as heuristic methods, exact algorithms, and general-purpose solvers can already find approximate solutions on small-scale graphs. However, they scale poorly to large graphs and do not transfer readily to other problems with similar structure. Moreover, traditional methods often rely on hand-crafted heuristic functions to guide decision-making. In recent years, a growing body of work has applied deep learning and reinforcement learning (RL) to learn such heuristics, which makes it possible to learn the internal structure of the graph end-to-end and to search for the optimal path under the guidance of the learned heuristic. Most of these methods still need manual assistance, however, and the RL methods they use suffer from low sample efficiency and a small searchable space. In this paper, we propose a novel framework, called Alpha-T, based on AlphaZero, which requires neither expert experience nor labeled data and is instead trained through self-play. We divide the learning into two stages: in the first stage we employ a graph attention network (GAT) and a GRU to learn node representations and to memorize the history of the trajectory, and in the second stage we employ Monte Carlo tree search (MCTS) and deep RL to search the solution space and train the model.
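To make the search stage more concrete, the following is a minimal sketch of how an AlphaZero-style tree search might score candidate next nodes while building a TSP tour. It is an illustration under stated assumptions, not the Alpha-T implementation: the abstract does not specify the selection rule, so the sketch uses the PUCT rule from AlphaZero, and the names `tour_length`, `puct_select`, and `c_puct` are hypothetical. The prior probabilities, which in such a framework would plausibly come from the learned network, are passed in as plain numbers here.

```python
import math


def tour_length(tour, dist):
    """Length of the closed tour that visits nodes in the given order.

    `dist` is a square matrix of pairwise distances (assumed input format).
    """
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def puct_select(priors, visit_counts, q_values, c_puct=1.0):
    """Return the candidate node with the highest PUCT score Q + U.

    priors:       dict node -> prior probability (assumed to come from a policy)
    visit_counts: dict node -> number of times the node was tried in the search
    q_values:     dict node -> mean value of the node (e.g. negative tour length)
    c_puct:       exploration constant (hypothetical default)
    """
    total_visits = sum(visit_counts.values())

    def score(node):
        # Exploration bonus: large for high-prior, rarely visited nodes.
        u = c_puct * priors[node] * math.sqrt(total_visits) / (1 + visit_counts[node])
        return q_values[node] + u

    return max(priors, key=score)
```

For example, with priors `{1: 0.5, 2: 0.25, 3: 0.25}`, visit counts `{1: 3, 2: 0, 3: 1}`, and mean values `{1: -0.5, 2: 0.0, 3: -0.2}`, the rule selects node 2, because it has not been visited yet and so receives the largest exploration bonus.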