Cross-Modal Semantic-Associative Labelling, Indexing and Retrieval of Multimodal Data
Digitalised multimedia information today is typically represented in different modalities and distributed through various channels. The use of such a huge amount of data is highly dependent on effective and efficient cross-modal labelling, indexing and retrieval of multimodal information. In this Chapter, we mainly focus on the combining of the primary and collateral modalities of the information resource in an intelligent and effective way in order to provide better multimodal information understanding, classification, labelling and retrieval. Image and text are the two modalities we mainly talk about here. A novel framework for semantic-based collaterally cued image labelling had been proposed and implemented, aiming to automatically assign linguistic keywords to regions of interest in an image. A visual vocabulary was constructed based on manually labelled image segments. We use Euclidean distance and Gaussian distribution to map the low-level region-based image features to the high-level visual concepts defined in the visual vocabulary. Both the collateral content and context knowledge were extracted from the collateral textual modality to bias the mapping process. A semantic-based high-level image feature vector model was constructed based on the labelling results, and the performance of image retrieval using this feature vector model appears to outperform both content-based and text-based approaches in terms of its capability for combining both perceptual and conceptual similarity of the image content.