Gabriel Barbosa Fonseca
Zenilton K. G. Patrocínio Jr
Guillaume Gravier
Silvio Jamil F. Guimarães
The indexing of large datasets is a task of great importance, since it directly impacts the quality of information that can be retrieved from them. Unfortunately, some datasets grow so quickly that manual indexing becomes unfeasible. Automatic indexing techniques can be applied to overcome this issue, and in this study an unsupervised technique for multimodal person discovery is proposed, which consists of detecting persons who appear and speak simultaneously in a video and associating names with them. To achieve this, the data are modeled as a graph of speaking faces, and names are extracted via OCR and propagated through the graph based on audiovisual relations between speaking faces. To propagate labels, two graph-based methods are proposed: one based on random walks and the other based on a hierarchical approach. In order to assess the proposed approach, we use two graph-clustering baselines and different modality-fusion approaches. On the MediaEval MPD 2017 dataset, the proposed label-propagation methods outperform all literature methods except one, which uses a different approach in the pre-processing step. Even though the Kappa coefficient indicates that the random-walk and hierarchical label propagation produce highly equivalent results, the hierarchical propagation is more than six times faster than the random walk under the same configurations.
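To illustrate the general idea of propagating OCR-extracted names through a graph of speaking faces via random walks, the following is a minimal sketch. It is not the paper's implementation: the toy graph, its similarity weights, the seed names, and the restart parameter `alpha` are all hypothetical, and the update rule is a standard random-walk-with-restart label propagation.

```python
import numpy as np

# Hypothetical toy graph: 4 speaking-face nodes with symmetric
# audiovisual similarity weights (illustrative values only).
W = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
])

# Seed labels as if extracted by OCR: node 0 -> "alice",
# node 3 -> "bob"; nodes 1 and 2 start unlabeled.
names = ["alice", "bob"]
Y = np.zeros((4, 2))
Y[0, 0] = 1.0
Y[3, 1] = 1.0

# Row-normalized transition matrix for the random walk.
P = W / W.sum(axis=1, keepdims=True)

# Iterative propagation with restart: each step mixes the
# neighbors' label distributions with the original seeds.
alpha = 0.85  # assumed restart trade-off, not from the paper
F = Y.copy()
for _ in range(100):
    F = alpha * (P @ F) + (1 - alpha) * Y

# Assign each speaking face the name with the highest score.
labels = [names[i] for i in F.argmax(axis=1)]
```

In this toy instance, node 1 is pulled toward the "alice" seed through its strong edge to node 0, and node 2 toward "bob" through node 3, showing how audiovisual edge weights drive the name assignment.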