Not so fast: Limited validity of deep convolutional neural networks as in silico models for human naturalistic face processing
Deep convolutional neural networks (DCNNs) trained for face identification can rival and even exceed human-level performance. The relationships between internal representations learned by DCNNs and those of the primate face processing system are not well understood, especially in naturalistic settings. We developed the largest naturalistic dynamic face stimulus set in human neuroimaging research (700+ naturalistic video clips of unfamiliar faces) and used representational similarity analysis to investigate how well the representations learned by high-performing DCNNs match human brain representations across the entire distributed face processing system. DCNN representational geometries were strikingly consistent across diverse architectures and captured meaningful variance among faces. Similarly, representational geometries throughout the human face network were highly consistent across subjects. Nonetheless, correlations between DCNN and neural representations were very weak overall: at best, DCNNs captured 3% of the variance in the neural representational geometries. Intermediate DCNN layers better matched visual and face-selective cortices than the final fully-connected layers. Behavioral ratings of face similarity were highly correlated with intermediate layers of DCNNs, but they also failed to capture representational geometry in the human brain. Our results suggest that the correspondence between intermediate DCNN layers and neural representations of naturalistic human face processing is weak at best, and that it diverges even further in the later fully-connected layers. This poor correspondence can be attributed, at least in part, to the dynamic and cognitive information that plays an essential role in human face processing but is not modeled by DCNNs. These mismatches indicate that current DCNNs have limited validity as in silico models of dynamic, naturalistic face processing in humans.
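
To make the comparison described above concrete, the following is a minimal sketch of a representational similarity analysis between one model layer and one brain region. It is not the authors' pipeline. The distance metric (correlation distance), the use of Pearson correlation between representational dissimilarity matrices (RDMs), the function names, and the random placeholder data are all assumptions introduced for illustration; the abstract does not specify these choices.

```python
import numpy as np


def rdm(patterns):
    """Return an RDM using correlation distance (1 - Pearson r).

    patterns: array of shape (n_stimuli, n_features), one row per face clip.
    Assumption: correlation distance; other metrics are equally possible.
    """
    return 1.0 - np.corrcoef(patterns)


def upper_triangle(matrix):
    """Return the off-diagonal upper triangle of a square matrix as a vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def rsa_correlation(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    return np.corrcoef(upper_triangle(rdm_a), upper_triangle(rdm_b))[0, 1]


# Placeholder data only: random numbers stand in for real measurements.
rng = np.random.default_rng(0)
n_stimuli = 40
latent = rng.standard_normal((n_stimuli, 10))

# Hypothetical DCNN layer activations (n_stimuli x n_units).
model_features = latent @ rng.standard_normal((10, 128))

# Hypothetical neural response patterns (n_stimuli x n_vertices):
# a weak shared signal buried in noise.
neural_patterns = (0.2 * latent @ rng.standard_normal((10, 200))
                   + rng.standard_normal((n_stimuli, 200)))

model_rdm = rdm(model_features)
neural_rdm = rdm(neural_patterns)

r = rsa_correlation(model_rdm, neural_rdm)
variance_explained = r ** 2  # share of RDM variance captured by the model

print(f"RDM correlation r = {r:.3f}, variance explained = {variance_explained:.3f}")
```

Under this reading, a figure such as "3% of the variance" corresponds to a squared RDM correlation of about 0.03, that is, a correlation of roughly 0.17 between the model and neural geometries.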