Category-level object discovery is important for many applications, such as remote sensing image classification and data mining in images and video sequences. This paper presents a novel unsupervised learning algorithm for discovering object categories and their locations in video sequences. The algorithm exploits both the appearance consistency and the motion consistency of local patches across frames. Video patches are first extracted and represented by spatial-temporal context words. A dynamic topic model is then introduced to learn object categories from these words. The proposed dynamic model can categorize and localize multiple objects in a single video. Experimental results on the CamVid dataset and the VISAT™ dataset demonstrate the effectiveness of our method.