How environmental features (e.g., people, enrichment, or other animals) affect movement is an important question in the study of animal behavior, biomechanics, and welfare. Here we present a persistent monitoring framework, based on a stationary overhead camera, for investigating the response of bottlenose dolphins (Tursiops truncatus) to environmental stimuli. Mask R-CNN, a convolutional neural network architecture, was trained to automatically detect three object types in the environment: dolphins, people, and enrichment floats that were introduced to stimulate and engage the animals. Detected objects within each video frame were linked together to create track segments across frames. The animals’ tracks were used to parameterize their response to the presence of environmental stimuli. We collected and analyzed data from 24 sessions with bottlenose dolphins in a managed lagoon environment. The sessions had an average duration of 1 h; 42% of them included enrichment and the remaining 58% did not. People were visible in the environment for 18.8% of the total time (∼4.5 h), more often when enrichment was present (∼3 h) than when it was absent (∼1.5 h). When neither enrichment nor people were present, the animals swam at an average speed of 1.2 m/s. When enrichment was added to the lagoon, average swimming speed decreased to 1.0 m/s and the animals spent more time moving at slow speeds around the enrichment. The animals’ engagement with the enrichment also decreased over time. These results indicate that the presence of enrichment and people in or around the environment attracts the animals, influencing habitat use and movement patterns as a result. This work demonstrates that the proposed framework can quantify and persistently monitor bottlenose dolphins’ movement, and it will enable new studies of individual and group animal locomotion and behavior.
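The step of linking per-frame detections into track segments and deriving a swimming speed can be illustrated with a minimal sketch. It does not reproduce the association method used in this work, which the text above does not specify. Instead it assumes a simple greedy nearest-neighbor rule, and it assumes that detections have already been reduced to centroid positions in meters. The names `link_detections`, `mean_speed`, and `max_jump_m` are hypothetical.

```python
import math


def link_detections(frames, max_jump_m=1.0):
    """Greedily link per-frame centroids into track segments.

    frames: list with one entry per video frame; each entry is a list of
            (x, y) centroid positions in meters (assumed already converted
            from pixels).
    max_jump_m: assumed largest plausible displacement between two
            consecutive frames; larger jumps end the current segment.
    Returns a list of tracks, each a list of (frame_index, x, y).
    """
    finished, active = [], []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            _, px, py = track[-1]
            # Closest unclaimed detection to the track's last position.
            best = min(
                unmatched,
                key=lambda d: math.hypot(d[0] - px, d[1] - py),
                default=None,
            )
            if best is not None and math.hypot(best[0] - px, best[1] - py) <= max_jump_m:
                track.append((t, best[0], best[1]))
                unmatched.remove(best)
                still_active.append(track)
            else:
                # No plausible continuation: close this segment.
                finished.append(track)
        # Every detection left over starts a new segment.
        for x, y in unmatched:
            still_active.append([(t, x, y)])
        active = still_active
    return finished + active


def mean_speed(track, fps):
    """Average speed (m/s) along one track segment recorded at `fps` frames/s."""
    if len(track) < 2:
        return 0.0
    distance = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(track, track[1:])
    )
    duration_s = (track[-1][0] - track[0][0]) / fps
    return distance / duration_s
```

Greedy matching is order-dependent and can swap identities when animals cross paths; a globally optimal assignment or a motion model would be the usual refinement. Per-segment speeds computed this way could then be grouped by whether enrichment or people were detected in the same frames.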