Recently, the demand placed on action recognition has extended beyond high classification accuracy alone to include high accuracy in temporal action detection, and meeting both requirements simultaneously is challenging. The key to action recognition lies in the quantity and quality of the extracted features. In this paper, a two-stream convolutional network is used: a three-dimensional convolutional neural network (3D-CNN) extracts spatiotemporal features from consecutive frames, while a two-dimensional convolutional neural network (2D-CNN) extracts spatial features from key frames. Integrating the two networks markedly improves the model’s accuracy and enables the task of identifying the start and stop frames of an action. In addition, a multi-scale feature extraction method is presented to extract richer feature information, and a multi-task learning model is introduced, which further improves classification accuracy by sharing data among multiple tasks. Experimental results show that the accuracy of the modified model improves by 10%. Meanwhile, we propose the confidence gradient, which refines the method for identifying start and stop frames and thereby improves temporal action detection accuracy; experimental results show an improvement of 11%.
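To illustrate the confidence-gradient idea, the sketch below locates start and stop frames from per-frame action confidences: a steep rise in confidence marks a candidate start frame and a steep drop marks a candidate stop frame. This is a minimal interpretation, not the paper's actual algorithm; the threshold value and the per-frame scores are hypothetical.

```python
import numpy as np

def find_action_boundaries(conf, grad_thresh=0.2):
    """Sketch of confidence-gradient boundary detection.

    conf        : per-frame action confidences in [0, 1]
    grad_thresh : hypothetical threshold on the frame-to-frame
                  confidence change (not a value from the paper)
    Returns candidate start-frame and stop-frame indices.
    """
    grad = np.diff(conf)                           # frame-to-frame confidence change
    starts = np.where(grad > grad_thresh)[0] + 1   # steep rises -> action begins
    stops = np.where(grad < -grad_thresh)[0]       # steep drops -> action ends
    return starts.tolist(), stops.tolist()

# Hypothetical confidences for one clip: low, then high, then low again
conf = np.array([0.05, 0.08, 0.6, 0.9, 0.92, 0.88, 0.3, 0.1])
starts, stops = find_action_boundaries(conf)
# starts near the sharp rise (frames 2-3), stop near the sharp drop (frame 5)
```

In practice a single boundary per rise or drop would be kept (e.g. the frame with the largest gradient magnitude), but the thresholded-gradient form above conveys the core idea.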