Abstract

Visual sensor data from manual assembly operations contains rich information that can be extracted to analyze and digitalize the assembly process. The worker's interaction with tools and objects, together with the spatio-temporal nature of assembly operations, makes recognizing and classifying these operations a complex task. Classical computer vision methods therefore do not provide a sufficient solution. This paper presents a recurrent neural network for classifying manual assembly operations from visual sensor data and addresses the question of to what extent such a solution is feasible in terms of robustness and reliability. Because complex assembly operations are combinations of basic movements, four main assembly operations from the Methods-Time Measurement (MTM) base operations are classified using a machine learning approach: reach, grasp, move, and release. A dataset of these four operations containing RGB, infrared, and depth data is used. Because the data is spatio-temporal, a combined Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) architecture is investigated regarding its applicability.
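To make the CNN-LSTM idea concrete, the following is a minimal sketch in plain Python, not the architecture evaluated in this paper. It assumes that RGB, infrared, and depth are stacked into five channels per frame, replaces the convolutional stage with a simple per-channel average, and uses random, untrained weights. The names `frame_features`, `lstm_step`, `classify_clip`, and `random_params` are hypothetical and chosen for illustration only.

```python
import math
import random

# The four MTM base operations named in the abstract.
CLASSES = ["reach", "grasp", "move", "release"]
N_CHANNELS = 5  # assumption: 3 RGB + 1 infrared + 1 depth, stacked per frame


def frame_features(frame):
    """Stand-in for the CNN stage: one averaged value per channel.
    frame is a list of channels, each a 2D list of floats."""
    feats = []
    for channel in frame:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        feats.append(total / count)
    return feats


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM time step. W has 4*H rows over [x, h]; gate order i, f, g, o."""
    H = len(h)
    z = x + h  # concatenate input features and previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:H]]               # input gate
    f = [sigmoid(p) for p in pre[H:2 * H]]           # forget gate
    g = [math.tanh(p) for p in pre[2 * H:3 * H]]     # candidate cell state
    o = [sigmoid(p) for p in pre[3 * H:4 * H]]       # output gate
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def classify_clip(clip, params):
    """Run per-frame features through the LSTM, then classify the last state."""
    W, b, W_out, b_out = params
    H = len(b) // 4
    h, c = [0.0] * H, [0.0] * H
    for frame in clip:
        h, c = lstm_step(frame_features(frame), h, c, W, b)
    logits = [sum(w * v for w, v in zip(row, h)) + bias
              for row, bias in zip(W_out, b_out)]
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))], probs


def random_params(n_in, hidden, rng):
    """Untrained weights, for illustration only."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return (mat(4 * hidden, n_in + hidden), [0.0] * (4 * hidden),
            mat(len(CLASSES), hidden), [0.0] * len(CLASSES))


# Synthetic clip: 6 frames, 5 channels, 4x4 pixels (placeholder data, not the dataset).
rng = random.Random(0)
clip = [[[[rng.random() for _ in range(4)] for _ in range(4)]
         for _ in range(N_CHANNELS)] for _ in range(6)]
params = random_params(N_CHANNELS, hidden=8, rng=rng)
label, probs = classify_clip(clip, params)
```

The point of the sketch is the division of labor: a per-frame stage condenses the spatial content of each image, and the recurrent stage accumulates those per-frame features over time before a single class decision is made for the whole clip. In practice the averaging step would be replaced by trained convolutional layers and the weights would be learned from labeled clips.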