Computer Vision for Multimedia Applications

Published by IGI Global (ISBN 9781609600242, 9781609600266)
Total documents: 16 (last five years: 0) | H-index: 2 (last five years: 0)

Author(s): Guoliang Fan, Yi Ding

Semantic event detection is an active research topic in the field of video mining. The major challenge is the semantic gap between low-level features and high-level semantics. In this chapter, we advance a new sports video mining framework in which a hybrid generative-discriminative approach is used for event detection. Specifically, we propose a three-layer semantic space that converts event detection into two inter-related statistical inference procedures involving semantic analysis at different levels. The first infers mid-level semantic structures from low-level visual features via generative models; these structures serve as building blocks for high-level semantic analysis. The second detects high-level semantics, which are of direct interest to users, from the mid-level semantic structures using discriminative models. This framework lets us explicitly represent and detect semantics at different levels. Using generative and discriminative approaches in two distinct stages proves effective and appropriate for event detection in sports video. Experimental results on a set of American football video data demonstrate that the proposed framework offers promising results compared with traditional approaches.
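
A minimal sketch of the two-stage idea, assuming an off-the-shelf HMM (hmmlearn) as the generative layer and an SVM (scikit-learn) as the discriminative layer; the feature dimensions, number of mid-level structures, and event labels are placeholders, not the chapter's actual models.

```python
# Sketch of a two-stage generative-discriminative pipeline (illustrative only).
# Stage 1: a generative HMM infers mid-level semantic structures from
# low-level per-frame features; Stage 2: a discriminative SVM detects
# high-level events from histograms of the inferred mid-level labels.
import numpy as np
from hmmlearn.hmm import GaussianHMM       # generative layer
from sklearn.svm import SVC                # discriminative layer

def infer_midlevel(features, n_structures=4):
    """Fit an HMM on per-frame features and decode a mid-level label sequence."""
    hmm = GaussianHMM(n_components=n_structures, covariance_type="diag", n_iter=50)
    hmm.fit(features)                      # features: (n_frames, n_dims)
    return hmm.predict(features)           # one mid-level label per frame

def segment_histogram(labels, n_structures=4):
    """Summarize a video segment by the distribution of mid-level structures."""
    hist = np.bincount(labels, minlength=n_structures).astype(float)
    return hist / hist.sum()

def train_event_detector(segment_features, event_labels):
    """Stage 2: train on segments with known high-level event labels."""
    clf = SVC(kernel="rbf")
    clf.fit(segment_features, event_labels)
    return clf
```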


Author(s): Noureddine Abbadeni

This chapter describes an approach to content-based image representation and retrieval grounded in human perception. We consider textured images and propose to model their textural content by a set of features with perceptual meaning, and we apply these features to content-based image retrieval. We present a new method to estimate a set of perceptual textural features, namely coarseness, directionality, contrast, and busyness. The proposed computational measures are based on two representations: the original image representation and the autocovariance function (associated with images) representation. The correspondence of the proposed computational measures to human judgments is established using a psychometric method based on the Spearman rank-correlation coefficient. The set of computational measures is applied to content-based image retrieval on a large image data set, the well-known Brodatz database. Experimental results show a strong correlation between the proposed computational textural measures and human perceptual judgments. Benchmarking of retrieval performance, done using the recall measure, shows promising results. Furthermore, merging the results returned by the two representations is shown to yield a significant improvement in retrieval effectiveness.
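
The psychometric validation step is straightforward to illustrate. Below is a sketch assuming grayscale image arrays: the autocovariance is computed via the FFT, a toy coarseness proxy is derived from it, and agreement with human rankings is measured with SciPy's Spearman coefficient. The coarseness proxy is illustrative only and is not the chapter's estimator.

```python
# Illustrative sketch: autocovariance-based texture representation and a
# Spearman rank-correlation check against human judgments.
import numpy as np
from scipy.stats import spearmanr

def autocovariance(image):
    """2-D (circular) autocovariance of a grayscale image, via the FFT."""
    x = image - image.mean()
    spectrum = np.abs(np.fft.fft2(x)) ** 2
    acov = np.real(np.fft.ifft2(spectrum)) / x.size
    return np.fft.fftshift(acov)           # peak moved to the center

def coarseness_proxy(image):
    """Toy coarseness measure: half-width of the central autocovariance peak
    (coarser textures decay more slowly). Not the chapter's estimator."""
    acov = autocovariance(image)
    cy, cx = np.array(acov.shape) // 2
    row = acov[cy, cx:]                    # profile from the peak outward
    half = row[0] / 2.0
    return int(np.argmax(row < half)) if np.any(row < half) else len(row)

def perceptual_agreement(images, human_scores):
    """Spearman rank correlation between computed and human rankings."""
    rho, _ = spearmanr([coarseness_proxy(img) for img in images], human_scores)
    return rho
```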


Author(s): Yongmian Zhang, Jixu Chen, Yan Tong, Qiang Ji

This chapter describes a probabilistic framework for faithful reproduction of spontaneous facial expressions on a synthetic face model in a real-time interactive application. The framework consists of a coupled Bayesian network (BN) that unifies facial expression analysis and synthesis into one coherent structure. At the analysis end, we cast the facial action coding system (FACS) into a dynamic Bayesian network (DBN) to capture the relationships between facial expressions and facial motions, as well as their uncertainties and dynamics. The observations fed into the DBN facial expression model are measurements of facial action units (AUs) generated by an AU model. Also implemented as a DBN, the AU model captures the rigid head movements and nonrigid facial muscular movements of a spontaneous facial expression. At the synthesis end, a static BN reconstructs the facial animation parameters (FAPs) and their intensities through top-down inference, according to the current facial expression state and pose information output by the analysis end. The two BNs are connected statically through a data stream link. The coupled BN brings several benefits. First, a facial expression is inferred through both spatial and temporal inference, so the perceptual quality of the animation is less affected by misdetection of facial features. Second, more realistic-looking facial expressions can be reproduced by modeling the dynamics of human expressions during facial expression analysis. Third, a very low transmission bitrate (9 bytes per frame) can be achieved.
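
The temporal-inference half of this pipeline can be sketched as a simple forward filter: predict the hidden expression state with a transition model, then correct with the AU evidence. The states, transition table, and AU likelihood below are toy placeholders for the chapter's far richer coupled DBN.

```python
# Minimal forward-filtering sketch for temporal expression inference
# (a stand-in for the chapter's coupled DBN; all tables are toy values).
import numpy as np

STATES = ["neutral", "happy", "surprise"]       # hidden expression states
A = np.array([[0.90, 0.05, 0.05],               # P(state_t | state_{t-1})
              [0.10, 0.85, 0.05],
              [0.10, 0.05, 0.85]])

def observation_likelihood(au_measurements, state_idx):
    """Placeholder P(AU measurements | expression state); in the chapter this
    comes from a dedicated AU model, itself a DBN."""
    prototypes = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.9]])  # toy AU means
    d = np.linalg.norm(au_measurements - prototypes[state_idx])
    return np.exp(-d ** 2)

def forward_step(belief, au_measurements):
    """One recursive Bayesian update: predict with A, correct with AU evidence."""
    predicted = A.T @ belief
    likelihood = np.array([observation_likelihood(au_measurements, s)
                           for s in range(len(STATES))])
    posterior = likelihood * predicted
    return posterior / posterior.sum()

belief = np.ones(len(STATES)) / len(STATES)      # uniform prior
for au in np.array([[0.85, 0.15], [0.9, 0.1]]):  # two frames of toy AU evidence
    belief = forward_step(belief, au)
print(dict(zip(STATES, belief.round(3))))        # posterior over expressions
```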


Author(s): Ma Bin, Li Chun-lei, Wang Yun-hong, Bai Xiao

Visual saliency, namely the perceptual significance of a stimulus to the human visual system (HVS), is a quality that differentiates an object from its neighbors. Detection of salient regions, which contain prominent features and represent the main content of a visual scene, is widely used in computer vision applications such as object tracking and classification, region-of-interest (ROI) based image compression, and others. In particular, for biometric authentication systems, whose objective is to establish a person's identity from biometric data (e.g., fingerprint, iris, or face), the most important metric is distinguishability. Consequently, the biometric watermarking field has a great need for good metrics of feature prominence. In this chapter, we present two salient-region-detection based biometric watermarking scenarios, in which a robust annotation watermark and a fragile authentication watermark, respectively, are applied to biometric systems. The saliency map plays the role of a perceptual mask that adaptively selects the watermarking strength and position; it therefore controls the distortion introduced by the watermark and preserves the identification accuracy of the biometric images.
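
A sketch of the saliency-as-mask idea, assuming spectral-residual saliency (Hou and Zhang, 2007) as the detector and a simple additive watermark; the chapter's actual detectors and embedding rules may differ. Here the strength is attenuated in salient regions so the embedding disturbs the discriminative biometric features as little as possible.

```python
# Sketch: saliency as a perceptual mask for watermark embedding.
import numpy as np
import cv2

def spectral_residual_saliency(gray):
    """Spectral-residual saliency map, normalized to [0, 1]."""
    f = np.fft.fft2(gray.astype(np.float64))
    log_amp = np.log1p(np.abs(f))                    # log amplitude spectrum
    phase = np.angle(f)
    residual = log_amp - cv2.blur(log_amp, (3, 3))   # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = cv2.GaussianBlur(sal, (11, 11), 2.5)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

def embed(gray, watermark_plane, base_strength=4.0):
    """Additive embedding masked by (1 - saliency): weaker where salient,
    so salient (identity-bearing) regions are distorted least."""
    sal = spectral_residual_saliency(gray)
    strength = base_strength * (1.0 - sal)
    wm = np.where(watermark_plane > 0, 1.0, -1.0)    # +/-1 watermark pattern
    return np.clip(gray + strength * wm, 0, 255).astype(np.uint8)
```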


Author(s): Hong Lu, Xiangyang Xue

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video has a hierarchical representation structure of frames, shots, and scenes, from the low level to the high level. Because of the huge number of video frames, it is impractical to represent video content frame by frame. Within this structure, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a single shot is limited and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can serve as a semantic unit for further processing such as story extraction, video summarization, and so on. In this chapter, we survey methods for video scene segmentation. Specifically, there are two notions of scene. The first considers only the visual similarity of video shots, and clustering methods are used for scene grouping. The second considers both the visual similarity and the temporal constraints of video shots, i.e., it groups shots with similar content that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation, which use the Gaussian mixture model, graph theory, sequential change detection, and spectral methods.
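
Both notions of scene can be sketched in a few lines, assuming each shot is summarized by a color histogram: a Gaussian mixture model (scikit-learn) gives the visual-only clusters, and a simple pass over the shot timeline enforces the temporal constraint. The max_gap threshold is a placeholder.

```python
# Sketch of the two scene notions (simplified):
# (a) visual-only clustering of shots with a Gaussian mixture model, and
# (b) temporally constrained grouping, where a new scene starts when the
#     cluster label changes or consecutive shots lie too far apart in time.
from sklearn.mixture import GaussianMixture

def cluster_shots(shot_histograms, n_scenes):
    """(a) Visual-only clustering of per-shot color histograms."""
    gmm = GaussianMixture(n_components=n_scenes, covariance_type="diag")
    return gmm.fit_predict(shot_histograms)   # (n_shots, n_bins) -> labels

def temporal_scenes(labels, shot_times, max_gap=60.0):
    """(b) Enforce temporal coherence on the visual clusters."""
    scenes, scene_id = [0], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1] or shot_times[i] - shot_times[i - 1] > max_gap:
            scene_id += 1
        scenes.append(scene_id)
    return scenes
```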


Author(s): Lin Wu, Yang Wang

This chapter presents a framework for detecting fake regions using various methods, including watermarking techniques and blind approaches. In particular, we describe the current taxonomy of blind approaches, which can be divided into five categories: pixel-based, format-based, camera-based, physically-based, and geometric-based techniques. We then take a closer look at the geometric-based techniques and categorize them in further detail. In the following section, the state-of-the-art methods within the geometric category are elaborated.
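
As a concrete instance of the pixel-based category (rather than the geometric one surveyed here), a copy-move forgery can be exposed by finding identical blocks far apart in the same image. The sketch below matches raw blocks exactly; practical detectors match quantized DCT or PCA block features so recompression does not break the match.

```python
# Toy pixel-based forgery check: copy-move detection by exact block matching.
import numpy as np
from collections import defaultdict

def copy_move_candidates(gray, block=8, min_shift=16):
    """Return pairs of block positions with identical content that are far
    enough apart to suggest a copy-move operation."""
    index = defaultdict(list)                   # block bytes -> positions
    h, w = gray.shape
    for y in range(0, h - block + 1, 2):        # stride 2 keeps this cheap
        for x in range(0, w - block + 1, 2):
            key = gray[y:y + block, x:x + block].tobytes()
            index[key].append((y, x))
    pairs = []
    for positions in index.values():
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                dy = positions[i][0] - positions[j][0]
                dx = positions[i][1] - positions[j][1]
                if dy * dy + dx * dx >= min_shift * min_shift:
                    pairs.append((positions[i], positions[j]))
    return pairs
```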


Author(s): Kongqiao Wang, Yikai Fang, Xiujuan Chai

Vision-based gesture recognition has been an active research topic in recent years. Many researchers focus on how to differentiate various hand shapes, e.g., static hand gesture recognition or hand posture recognition, which is one of the fundamental problems in vision-based gesture analysis. In general, the visual cues most frequently used to describe the hand are appearance and structure information, but recognition with such information is difficult because of varying hand shapes and differences between subjects. To represent the hand area well, methods based on local features and texture histograms have been explored, and a learning-based classification strategy is designed around the different descriptors or features. In this chapter, we mainly focus on 2D geometric and appearance models, the design of a local texture descriptor, and a semi-supervised learning strategy with different features for hand posture recognition.
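
A sketch of these two ingredients, assuming scikit-image's uniform LBP as the local texture descriptor and scikit-learn's self-training wrapper as the semi-supervised strategy; the chapter's own descriptor and learning scheme differ in the details.

```python
# Sketch: local texture descriptor + semi-supervised posture classification.
# Unlabeled training samples carry the label -1, per scikit-learn convention.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

def lbp_histogram(gray, P=8, R=1.0):
    """Uniform LBP histogram of a hand-region patch (P + 2 possible codes)."""
    lbp = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def train_posture_classifier(features, labels):
    """labels: posture class ids for labeled samples, -1 for unlabeled ones."""
    base = SVC(probability=True)                 # self-training needs probabilities
    clf = SelfTrainingClassifier(base, threshold=0.8)
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf
```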


Author(s): Wen Wu, Jie Yang, Xilin Chen

Human drivers often use landmarks for navigation. For example, we tell people to turn left after the second traffic light or to make a right at Starbucks. In daily life, a landmark can be anything that is easily recognizable and can be used for giving directions, such as a sign or a building. It has been proposed that current navigation systems can be made more effective and safer by incorporating landmarks as key navigation cues; in particular, landmarks support navigation in unfamiliar environments. In this chapter, we describe technologies for two intelligent vision systems for landmark-based car navigation: (1) labeling street landmarks in images with minimal human effort, for which we have proposed a semi-supervised learning framework; and (2) automatically detecting text on road signs from video, where the proposed framework takes advantage of spatio-temporal information in the video and fuses partial detections from frame to frame.
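
The frame-to-frame fusion in system (2) can be sketched as overlap voting: a candidate text box is accepted only if boxes with high intersection-over-union recur in enough frames. The detector producing the per-frame boxes is left abstract here, and the thresholds are placeholders.

```python
# Sketch: temporal fusion of per-frame text detections by IoU voting.
from collections import Counter

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(per_frame_boxes, min_votes=3, iou_thresh=0.5):
    """Keep a box only if overlapping boxes occur in >= min_votes frames."""
    votes = Counter()
    anchors = []                                # representative boxes so far
    for boxes in per_frame_boxes:
        for box in boxes:
            for i, anchor in enumerate(anchors):
                if iou(box, anchor) >= iou_thresh:
                    votes[i] += 1
                    break
            else:                               # no anchor matched: new candidate
                anchors.append(box)
                votes[len(anchors) - 1] += 1
    return [anchors[i] for i, v in votes.items() if v >= min_votes]
```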


Author(s): Zahid Riaz, Suat Gedikli, Michael Beetz, Bernd Radig

In this chapter, we focus on human-robot interaction applications in which robots extract multiple useful features from human faces. The idea follows daily-life scenarios in which humans rely mostly on face-to-face interaction and interpret the gender, identity, facial behavior, and age of other persons at first glance. We term this the face-at-a-glance problem. The proposed solution is the development of a photorealistic 3D face model, in real time, for human facial analysis. We also briefly discuss some outstanding challenges for image synthesis, such as head pose, facial expressions, and illumination. Given the diversity of the application domain and the need to optimize the extraction of relevant information for computer vision applications, we propose to solve this problem with an interdisciplinary 3D face model built using computer vision and computer graphics tools together with image processing techniques. To trade off accuracy against efficiency, we choose a wireframe model, which supports automatic face generation in real time. The goal of this chapter is to provide a standalone and comprehensive framework for extracting useful multiple features from a 3D model. Because of the wide range of information they carry and their low computational cost, such features find application in several advanced camera-mounted technical systems. Although this chapter focuses on a multi-feature extraction approach for human faces in interactive applications with intelligent systems, its scope is equally useful for researchers and industrial practitioners working on the modeling of 3D deformable objects. The chapter mainly addresses human faces, but the approach can also be applied to other domains such as medical imaging, industrial robot manipulation, and action recognition.
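
The real-time wireframe idea reduces to rotating model vertices by the estimated head pose and projecting them with a pinhole camera, as in the minimal sketch below; the pose parameterization and focal length are placeholders, and the vertex set would come from an actual face model (e.g., a Candide-style mesh), which is not reproduced here.

```python
# Minimal sketch of wireframe rendering: pose rotation + pinhole projection.
# Drawing the mesh edges between projected vertices gives the overlay.
import numpy as np

def rotation(yaw, pitch, roll):
    """Rotation matrix from head-pose angles (radians), R = Rz @ Ry @ Rx."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(vertices, pose, t, focal=800.0):
    """Pinhole projection of (N, 3) model vertices under rotation + translation."""
    cam = vertices @ rotation(*pose).T + t          # to camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]         # perspective divide -> (N, 2)
```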


Author(s): Giancarlo Iannizzotto, Francesco La Rosa

This chapter introduces the VirtualBoard framework for building vision-based perceptual user interfaces (PUIs). While most vision-based human-computer interaction applications developed over the last decade focus on the technological aspects of image processing and computer vision, our main effort is directed toward ease and naturalness of use, integration and compatibility with existing systems and software, portability, and efficiency. VirtualBoard is based on a modular architecture that allows the implementation of several classes of gestural and vision-based human-computer interaction approaches: it is extensible and portable and requires relatively few computational resources, which also helps reduce energy consumption and hardware costs. Particular attention is devoted to robustness to environmental conditions (such as illumination and noise level). We believe that current technologies can easily support vision-based PUIs and that PUIs are strongly needed by modern applications. With the exception of the gaming industry, where vision-based PUIs are already being intensively studied and in some cases exploited, more effort is needed to merge the knowledge of the HCI and computer vision communities into realistic and industrially appealing products. This work is intended as a stimulus in that direction.
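
The modular architecture can be sketched as a plugin pipeline in which independent vision modules consume each frame and emit interface events, so new gesture classes plug in without touching the rest of the system. The names below are illustrative, not the actual VirtualBoard API.

```python
# Sketch of a modular PUI pipeline in the spirit of VirtualBoard.
from typing import Iterable, List, Protocol

class VisionModule(Protocol):
    """A pluggable vision component: consumes a frame, emits UI events."""
    def process(self, frame) -> List[str]: ...

class Pipeline:
    def __init__(self, modules: Iterable[VisionModule]):
        self.modules = list(modules)          # e.g. pointer tracker, gestures

    def run(self, frames):
        """Feed each frame to every module and yield the events they raise."""
        for frame in frames:
            for module in self.modules:
                for event in module.process(frame):
                    yield event               # e.g. dispatch to the windowing system
```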

