The development of object detection in infrared images has attracted more attention in recent years. However, there are few studies on multi-scale object detection in infrared street scene images. Additionally, the lack of high-quality infrared datasets hinders research into such algorithms. In order to solve these issues, we firstly make a series of modifications based on Faster Region-Convolutional Neural Network (R-CNN). In this paper, a double-layer region proposal network (RPN) is proposed to predict proposals of different scales on both fine and coarse feature maps. Secondly, a multi-scale pooling module is introduced into the backbone of the network to explore the response of objects on different scales. Furthermore, the inception4 module and the position sensitive region of interest (ROI) align (PSalign) pooling layer are utilized to explore richer features of the objects. Thirdly, this paper proposes instance level data augmentation, which takes into account the imbalance between categories while enlarging dataset. In the training stage, the online hard example mining method is utilized to further improve the robustness of the algorithm in complex environments. The experimental results show that, compared with baseline, our detection method has state-of-the-art performance.