Modeling movie-evoked human brain activity using motion-energy and space-time vision transformer features
Summary

This paper describes the process of building a model that predicts human brain activity during video viewing, developed as an entry in the Algonauts Project 2021 Challenge. The model predicts brain activity measured with functional MRI (fMRI) as weighted linear summations of the spatiotemporal visual features present in the video stimuli (video features). Two types of video features were used: (1) motion-energy features designed on the basis of neurophysiological findings, and (2) features derived from a space-time vision transformer (TimeSformer). To exploit features from a variety of video domains, the features of TimeSformer models pre-trained on several different movie sets were combined. Model building and validation revealed a clear correspondence between the hierarchical representations of the TimeSformer model and those of the visual system in the brain: motion-energy features are effective for predicting brain activity in early visual areas, TimeSformer-derived features are effective in higher-order visual areas, and a hybrid model combining motion-energy and TimeSformer features is effective for predicting whole-brain activity.
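The "weighted linear summation" encoding approach described above can be sketched as a regularized linear regression from concatenated video features to voxel responses. This is a minimal illustration with synthetic data, not the authors' implementation: the feature dimensions, the ridge penalty, and the random stand-ins for motion-energy and TimeSformer features are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_clips, n_voxels = 200, 10
# Synthetic stand-ins for the two feature types (dimensions are illustrative).
motion_energy = rng.standard_normal((n_clips, 64))   # motion-energy features
timesformer = rng.standard_normal((n_clips, 128))    # TimeSformer-derived features

# Hybrid model: concatenate both feature sets into one design matrix.
X = np.hstack([motion_energy, timesformer])

# Simulated fMRI responses: a linear combination of features plus noise.
true_w = rng.standard_normal((X.shape[1], n_voxels))
Y = X @ true_w + 0.1 * rng.standard_normal((n_clips, n_voxels))

# Ridge regression closed form: W = (X'X + alpha*I)^-1 X'Y
alpha = 1.0
W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
pred = X @ W

# Prediction accuracy per voxel: Pearson correlation between
# predicted and (simulated) measured responses.
corrs = np.array([np.corrcoef(pred[:, v], Y[:, v])[0, 1]
                  for v in range(n_voxels)])
```

In practice such encoding models are fit per voxel with the regularization strength chosen by cross-validation, and accuracy is reported as the correlation between predicted and held-out measured responses.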