Scene description refers to the automatic generation of natural language descriptions from videos. In general, deep learning-based scene description networks utilize multimodalities, such as image, motion, audio, and label information, to improve the description quality. In particular, image information plays an important role in scene description. However, scene description has a potential issue, because it may handle images with severe compression artifacts. Hence, this paper analyzes the impact of video compression on scene description, and then proposes a simple network that is robust to compression artifacts. In addition, a network cascading more encoding layers for efficient multimodal embedding is also proposed. Experimental results show that the proposed network is more efficient than conventional networks.