Attention-Based Multimodal Neural Network for Automatic Evaluation of Press Conferences
In this study, a multimodal neural network is proposed to automatically predict the evaluation of a professional consultant team for press conferences using text and audio data. Seven publicly available press conference videos were collected, and all the Q&A pairs between speakers and journalists were annotated by the consultant team. The proposed multimodal neural network consists of a language model, an audio model, and a feature fusion network. The word representation combines a token embedding obtained from ELMo with a type embedding. The language model is an LSTM with an attention layer. The audio model uses a six-layer CNN to extract segmental features, together with an attention network that measures the importance of each segment. Two feature fusion approaches are proposed: a shared attention network and the product of the text and audio features. The former can explain the relative importance of speech content and speaking style; the latter achieved the best performance, with an average accuracy of 60.1% across all evaluation criteria.
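To make the described architecture concrete, the sketch below gives one plausible PyTorch reading of the pipeline: a text branch (token and type embeddings fed into an attention-pooled LSTM), an audio branch (a six-layer CNN with segment-level attention), and product-based fusion, the better-performing of the two fusion variants. All layer sizes, the binary type vocabulary, the two-class output, and the substitution of a trainable `nn.Embedding` for pretrained ELMo vectors are illustrative assumptions rather than details taken from the paper; the shared-attention fusion variant is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Additive attention pooling over a sequence of feature vectors."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, seq, dim)
        weights = F.softmax(self.score(x), dim=1)  # (batch, seq, 1)
        return (weights * x).sum(dim=1)            # (batch, dim)


class TextBranch(nn.Module):
    """Token + type embeddings -> LSTM -> attention pooling."""
    def __init__(self, vocab_size, n_types, emb_dim=128, hidden=256):
        super().__init__()
        # Trainable embedding used as a stand-in for pretrained ELMo vectors.
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        # Type embedding, e.g. distinguishing journalist questions from answers
        # (an assumed reading of the paper's "type embedding").
        self.type_emb = nn.Embedding(n_types, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.pool = AttentionPool(hidden)

    def forward(self, tokens, types):              # both: (batch, seq) int64
        x = self.tok_emb(tokens) + self.type_emb(types)
        out, _ = self.lstm(x)
        return self.pool(out)                      # (batch, hidden)


class AudioBranch(nn.Module):
    """Six 1-D conv layers over acoustic frames, then attention over segments."""
    def __init__(self, n_feats=40, channels=256):
        super().__init__()
        layers, in_ch = [], n_feats
        for _ in range(6):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU()]
            in_ch = channels
        self.cnn = nn.Sequential(*layers)
        self.pool = AttentionPool(channels)

    def forward(self, frames):                       # (batch, n_feats, time)
        segments = self.cnn(frames).transpose(1, 2)  # (batch, time, channels)
        return self.pool(segments)                   # (batch, channels)


class PressConferenceScorer(nn.Module):
    """Fuses the two branches by elementwise product and scores one criterion."""
    def __init__(self, vocab_size=10000, n_types=2, n_classes=2, dim=256):
        super().__init__()
        self.text = TextBranch(vocab_size, n_types, hidden=dim)
        self.audio = AudioBranch(channels=dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, tokens, types, frames):
        fused = self.text(tokens, types) * self.audio(frames)  # product fusion
        return self.classifier(fused)                # (batch, n_classes) logits


# Shape check with random inputs; real inputs would be token ids and
# acoustic features (e.g. log-mel frames) extracted from the Q&A pairs.
model = PressConferenceScorer()
tokens = torch.randint(0, 10000, (4, 50))  # 4 Q&A pairs, 50 tokens each
types = torch.randint(0, 2, (4, 50))       # 0 = question, 1 = answer (assumed)
frames = torch.randn(4, 40, 300)           # 40 acoustic features x 300 frames
logits = model(tokens, types, frames)      # -> torch.Size([4, 2])
```

Under this reading, one such classifier head would be trained per evaluation criterion, and the product fusion collapses both modalities into a single vector before classification; the shared-attention alternative would instead weight the pooled text and audio vectors with a common attention layer, which is what makes the relative contribution of content versus speaking style inspectable.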