Computer Vision for Multimedia Applications

Published by IGI Global (ISBN 9781609600242, 9781609600266)
Total documents: 16 (last five years: 0) | H-index: 2 (last five years: 0)

Author(s): Guoliang Fan, Yi Ding

Semantic event detection is an active research topic in the field of video mining. The major challenge is the semantic gap between low-level features and high-level semantics. In this chapter, we advance a new sports video mining framework in which a hybrid generative-discriminative approach is used for event detection. Specifically, we propose a three-layer semantic space that converts event detection into two inter-related statistical inference procedures involving semantic analysis at different levels. The first infers mid-level semantic structures from low-level visual features via generative models; these structures serve as building blocks for high-level semantic analysis. The second detects high-level semantics, which are of direct interest to users, from the mid-level semantic structures using discriminative models. This framework lets us explicitly represent and detect semantics at different levels. Using generative and discriminative approaches in two distinct stages proves effective and appropriate for event detection in sports video. Experimental results on a set of American football video data demonstrate that the proposed framework offers promising results compared with traditional approaches.
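
A minimal sketch of the two-stage idea, assuming an off-the-shelf HMM (hmmlearn) as the generative layer and an SVM (scikit-learn) as the discriminative layer; the feature dimensions, number of mid-level structures, and event labels are placeholders, not the chapter's actual models.

```python
# Sketch of a two-stage generative-discriminative pipeline (illustrative only).
# Stage 1: a generative HMM infers mid-level semantic structures from
# low-level per-frame features; Stage 2: a discriminative SVM detects
# high-level events from histograms of the inferred mid-level labels.
import numpy as np
from hmmlearn.hmm import GaussianHMM       # generative layer
from sklearn.svm import SVC                # discriminative layer

def infer_midlevel(features, n_structures=4):
    """Fit an HMM on per-frame features and decode a mid-level label sequence."""
    hmm = GaussianHMM(n_components=n_structures, covariance_type="diag", n_iter=50)
    hmm.fit(features)                      # features: (n_frames, n_dims)
    return hmm.predict(features)           # one mid-level label per frame

def segment_histogram(labels, n_structures=4):
    """Summarize a video segment by the distribution of mid-level structures."""
    hist = np.bincount(labels, minlength=n_structures).astype(float)
    return hist / hist.sum()

def train_event_detector(segment_features, event_labels):
    """Stage 2: train on segments with known high-level event labels."""
    clf = SVC(kernel="rbf")
    clf.fit(segment_features, event_labels)
    return clf
```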


Author(s): Noureddine Abbadeni

This chapter describes an approach to content-based image representation and retrieval grounded in human perception. We consider textured images and propose to model their textural content by a set of features with perceptual meaning, and we apply these features to content-based image retrieval. We present a new method to estimate a set of perceptual textural features, namely coarseness, directionality, contrast, and busyness. The proposed computational measures are based on two representations: the original image representation and the autocovariance function (associated with images) representation. The correspondence of the proposed computational measures to human judgments is established using a psychometric method based on the Spearman rank-correlation coefficient. The set of computational measures is applied to content-based image retrieval on a large image data set, the well-known Brodatz database. Experimental results show a strong correlation between the proposed computational textural measures and human perceptual judgments. Benchmarking of retrieval performance, done using the recall measure, shows promising results. Furthermore, merging the results returned by the two representations is shown to yield a significant improvement in retrieval effectiveness.
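
The psychometric validation step is straightforward to illustrate. Below is a sketch assuming grayscale image arrays: the autocovariance is computed via the FFT, a toy coarseness proxy is derived from it, and agreement with human rankings is measured with SciPy's Spearman coefficient. The coarseness proxy is illustrative only and is not the chapter's estimator.

```python
# Illustrative sketch: autocovariance-based texture representation and a
# Spearman rank-correlation check against human judgments.
import numpy as np
from scipy.stats import spearmanr

def autocovariance(image):
    """2-D (circular) autocovariance of a grayscale image, via the FFT."""
    x = image - image.mean()
    spectrum = np.abs(np.fft.fft2(x)) ** 2
    acov = np.real(np.fft.ifft2(spectrum)) / x.size
    return np.fft.fftshift(acov)           # peak moved to the center

def coarseness_proxy(image):
    """Toy coarseness measure: half-width of the central autocovariance peak
    (coarser textures decay more slowly). Not the chapter's estimator."""
    acov = autocovariance(image)
    cy, cx = np.array(acov.shape) // 2
    row = acov[cy, cx:]                    # profile from the peak outward
    half = row[0] / 2.0
    return int(np.argmax(row < half)) if np.any(row < half) else len(row)

def perceptual_agreement(images, human_scores):
    """Spearman rank correlation between computed and human rankings."""
    rho, _ = spearmanr([coarseness_proxy(img) for img in images], human_scores)
    return rho
```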


Author(s): Yongmian Zhang, Jixu Chen, Yan Tong, Qiang Ji

This chapter describes a probabilistic framework for faithful reproduction of spontaneous facial expressions on a synthetic face model in a real-time interactive application. The framework consists of a coupled Bayesian network (BN) that unifies facial expression analysis and synthesis into one coherent structure. At the analysis end, we cast the facial action coding system (FACS) into a dynamic Bayesian network (DBN) to capture the relationships between facial expressions and facial motions, as well as their uncertainties and dynamics. The observations fed into the DBN facial expression model are measurements of facial action units (AUs) generated by an AU model. Also implemented as a DBN, the AU model captures the rigid head movements and nonrigid facial muscular movements of a spontaneous facial expression. At the synthesis end, a static BN reconstructs the facial animation parameters (FAPs) and their intensities through top-down inference, according to the current facial expression state and pose information output by the analysis end. The two BNs are connected statically through a data stream link. The coupled BN brings several benefits. First, a facial expression is inferred through both spatial and temporal inference, so the perceptual quality of the animation is less affected by misdetection of facial features. Second, more realistic-looking facial expressions can be reproduced by modeling the dynamics of human expressions during facial expression analysis. Third, a very low transmission bitrate (9 bytes per frame) can be achieved.
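
The temporal-inference half of this pipeline can be sketched as a simple forward filter: predict the hidden expression state with a transition model, then correct with the AU evidence. The states, transition table, and AU likelihood below are toy placeholders for the chapter's far richer coupled DBN.

```python
# Minimal forward-filtering sketch for temporal expression inference
# (a stand-in for the chapter's coupled DBN; all tables are toy values).
import numpy as np

STATES = ["neutral", "happy", "surprise"]       # hidden expression states
A = np.array([[0.90, 0.05, 0.05],               # P(state_t | state_{t-1})
              [0.10, 0.85, 0.05],
              [0.10, 0.05, 0.85]])

def observation_likelihood(au_measurements, state_idx):
    """Placeholder P(AU measurements | expression state); in the chapter this
    comes from a dedicated AU model, itself a DBN."""
    prototypes = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.9]])  # toy AU means
    d = np.linalg.norm(au_measurements - prototypes[state_idx])
    return np.exp(-d ** 2)

def forward_step(belief, au_measurements):
    """One recursive Bayesian update: predict with A, correct with AU evidence."""
    predicted = A.T @ belief
    likelihood = np.array([observation_likelihood(au_measurements, s)
                           for s in range(len(STATES))])
    posterior = likelihood * predicted
    return posterior / posterior.sum()

belief = np.ones(len(STATES)) / len(STATES)      # uniform prior
for au in np.array([[0.85, 0.15], [0.9, 0.1]]):  # two frames of toy AU evidence
    belief = forward_step(belief, au)
print(dict(zip(STATES, belief.round(3))))        # posterior over expressions
```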


Author(s): Ma Bin, Li Chun-lei, Wang Yun-hong, Bai Xiao

Visual saliency, namely the perceptual significance of a stimulus to the human visual system (HVS), is a quality that differentiates an object from its neighbors. Detection of salient regions, which contain prominent features and represent the main content of a visual scene, is widely used in computer vision applications such as object tracking and classification, region-of-interest (ROI) based image compression, and others. In particular, for biometric authentication systems, whose objective is to establish a person's identity from biometric data (e.g., fingerprint, iris, or face), the most important metric is distinguishability. Consequently, the biometric watermarking field has a great need for good metrics of feature prominence. In this chapter, we present two salient-region-detection based biometric watermarking scenarios, in which a robust annotation watermark and a fragile authentication watermark, respectively, are applied to biometric systems. The saliency map plays the role of a perceptual mask that adaptively selects the watermarking strength and position; it therefore controls the distortion introduced by the watermark and preserves the identification accuracy of the biometric images.
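
A sketch of the saliency-as-mask idea, assuming spectral-residual saliency (Hou and Zhang, 2007) as the detector and a simple additive watermark; the chapter's actual detectors and embedding rules may differ. Here the strength is attenuated in salient regions so the embedding disturbs the discriminative biometric features as little as possible.

```python
# Sketch: saliency as a perceptual mask for watermark embedding.
import numpy as np
import cv2

def spectral_residual_saliency(gray):
    """Spectral-residual saliency map, normalized to [0, 1]."""
    f = np.fft.fft2(gray.astype(np.float64))
    log_amp = np.log1p(np.abs(f))                    # log amplitude spectrum
    phase = np.angle(f)
    residual = log_amp - cv2.blur(log_amp, (3, 3))   # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = cv2.GaussianBlur(sal, (11, 11), 2.5)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

def embed(gray, watermark_plane, base_strength=4.0):
    """Additive embedding masked by (1 - saliency): weaker where salient,
    so salient (identity-bearing) regions are distorted least."""
    sal = spectral_residual_saliency(gray)
    strength = base_strength * (1.0 - sal)
    wm = np.where(watermark_plane > 0, 1.0, -1.0)    # +/-1 watermark pattern
    return np.clip(gray + strength * wm, 0, 255).astype(np.uint8)
```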


Author(s): Hong Lu, Xiangyang Xue

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video has a hierarchical representation structure of frames, shots, and scenes, from the low level to the high level. Because of the huge number of video frames, it is impractical to represent video content frame by frame. Within this structure, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a single shot is limited and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can serve as a semantic unit for further processing such as story extraction, video summarization, and so on. In this chapter, we survey methods for video scene segmentation. Specifically, there are two notions of scene. The first considers only the visual similarity of video shots, and clustering methods are used for scene grouping. The second considers both the visual similarity and the temporal constraints of video shots, i.e., it groups shots with similar content that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation, which use the Gaussian mixture model, graph theory, sequential change detection, and spectral methods.
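
Both notions of scene can be sketched in a few lines, assuming each shot is summarized by a color histogram: a Gaussian mixture model (scikit-learn) gives the visual-only clusters, and a simple pass over the shot timeline enforces the temporal constraint. The max_gap threshold is a placeholder.

```python
# Sketch of the two scene notions (simplified):
# (a) visual-only clustering of shots with a Gaussian mixture model, and
# (b) temporally constrained grouping, where a new scene starts when the
#     cluster label changes or consecutive shots lie too far apart in time.
from sklearn.mixture import GaussianMixture

def cluster_shots(shot_histograms, n_scenes):
    """(a) Visual-only clustering of per-shot color histograms."""
    gmm = GaussianMixture(n_components=n_scenes, covariance_type="diag")
    return gmm.fit_predict(shot_histograms)   # (n_shots, n_bins) -> labels

def temporal_scenes(labels, shot_times, max_gap=60.0):
    """(b) Enforce temporal coherence on the visual clusters."""
    scenes, scene_id = [0], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1] or shot_times[i] - shot_times[i - 1] > max_gap:
            scene_id += 1
        scenes.append(scene_id)
    return scenes
```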


Author(s): Lin Wu, Yang Wang

This chapter presents a framework for detecting fake regions using various methods, including watermarking techniques and blind approaches. In particular, we describe the current taxonomy of blind approaches, which can be divided into five categories: pixel-based, format-based, camera-based, physically-based, and geometric-based techniques. We then take a closer look at the geometric-based techniques and categorize them in further detail. In the following section, the state-of-the-art methods within the geometric category are elaborated.
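
As a concrete instance of the pixel-based category (rather than the geometric one surveyed here), a copy-move forgery can be exposed by finding identical blocks far apart in the same image. The sketch below matches raw blocks exactly; practical detectors match quantized DCT or PCA block features so recompression does not break the match.

```python
# Toy pixel-based forgery check: copy-move detection by exact block matching.
import numpy as np
from collections import defaultdict

def copy_move_candidates(gray, block=8, min_shift=16):
    """Return pairs of block positions with identical content that are far
    enough apart to suggest a copy-move operation."""
    index = defaultdict(list)                   # block bytes -> positions
    h, w = gray.shape
    for y in range(0, h - block + 1, 2):        # stride 2 keeps this cheap
        for x in range(0, w - block + 1, 2):
            key = gray[y:y + block, x:x + block].tobytes()
            index[key].append((y, x))
    pairs = []
    for positions in index.values():
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                dy = positions[i][0] - positions[j][0]
                dx = positions[i][1] - positions[j][1]
                if dy * dy + dx * dx >= min_shift * min_shift:
                    pairs.append((positions[i], positions[j]))
    return pairs
```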


Author(s): Kongqiao Wang, Yikai Fang, Xiujuan Chai

Vision-based gesture recognition has been an active research topic in recent years. Many researchers focus on how to differentiate various hand shapes, e.g., static hand gesture recognition or hand posture recognition, which is one of the fundamental problems in vision-based gesture analysis. In general, the visual cues most frequently used to describe the hand are appearance and structure information, but recognition with such information is difficult because of varying hand shapes and differences between subjects. To represent the hand area well, methods based on local features and texture histograms have been explored, and a learning-based classification strategy is designed around the different descriptors or features. In this chapter, we mainly focus on 2D geometric and appearance models, the design of a local texture descriptor, and a semi-supervised learning strategy with different features for hand posture recognition.
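
A sketch of these two ingredients, assuming scikit-image's uniform LBP as the local texture descriptor and scikit-learn's self-training wrapper as the semi-supervised strategy; the chapter's own descriptor and learning scheme differ in the details.

```python
# Sketch: local texture descriptor + semi-supervised posture classification.
# Unlabeled training samples carry the label -1, per scikit-learn convention.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

def lbp_histogram(gray, P=8, R=1.0):
    """Uniform LBP histogram of a hand-region patch (P + 2 possible codes)."""
    lbp = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def train_posture_classifier(features, labels):
    """labels: posture class ids for labeled samples, -1 for unlabeled ones."""
    base = SVC(probability=True)                 # self-training needs probabilities
    clf = SelfTrainingClassifier(base, threshold=0.8)
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf
```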


Author(s): Wen Wu, Jie Yang, Xilin Chen

Human drivers often use landmarks for navigation. For example, we tell people to turn left after the second traffic light or to make a right at Starbucks. In daily life, a landmark can be anything that is easily recognizable and can be used for giving directions, such as a sign or a building. It has been proposed that current navigation systems can be made more effective and safer by incorporating landmarks as key navigation cues; in particular, landmarks support navigation in unfamiliar environments. In this chapter, we describe technologies for two intelligent vision systems for landmark-based car navigation: (1) labeling street landmarks in images with minimal human effort, for which we have proposed a semi-supervised learning framework; and (2) automatically detecting text on road signs from video, where the proposed framework takes advantage of spatio-temporal information in the video and fuses partial detections from frame to frame.
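
The frame-to-frame fusion in system (2) can be sketched as overlap voting: a candidate text box is accepted only if boxes with high intersection-over-union recur in enough frames. The detector producing the per-frame boxes is left abstract here, and the thresholds are placeholders.

```python
# Sketch: temporal fusion of per-frame text detections by IoU voting.
from collections import Counter

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(per_frame_boxes, min_votes=3, iou_thresh=0.5):
    """Keep a box only if overlapping boxes occur in >= min_votes frames."""
    votes = Counter()
    anchors = []                                # representative boxes so far
    for boxes in per_frame_boxes:
        for box in boxes:
            for i, anchor in enumerate(anchors):
                if iou(box, anchor) >= iou_thresh:
                    votes[i] += 1
                    break
            else:                               # no anchor matched: new candidate
                anchors.append(box)
                votes[len(anchors) - 1] += 1
    return [anchors[i] for i, v in votes.items() if v >= min_votes]
```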


Author(s): Zahid Riaz, Suat Gedikli, Michael Beetz, Bernd Radig

In this chapter, we focus on human-robot interaction applications in which robots extract multiple useful features from human faces. The idea follows daily-life scenarios in which humans rely mostly on face-to-face interaction and interpret the gender, identity, facial behavior, and age of other persons at first glance. We term this the face-at-a-glance problem. The proposed solution is the development of a photorealistic 3D face model, in real time, for human facial analysis. We also briefly discuss some outstanding challenges for image synthesis, such as head pose, facial expressions, and illumination. Given the diversity of the application domain and the need to optimize the extraction of relevant information for computer vision applications, we propose to solve this problem with an interdisciplinary 3D face model built using computer vision and computer graphics tools together with image processing techniques. To trade off accuracy against efficiency, we choose a wireframe model, which supports automatic face generation in real time. The goal of this chapter is to provide a standalone and comprehensive framework for extracting useful multiple features from a 3D model. Because of the wide range of information they carry and their low computational cost, such features find application in several advanced camera-mounted technical systems. Although this chapter focuses on a multi-feature extraction approach for human faces in interactive applications with intelligent systems, its scope is equally useful for researchers and industrial practitioners working on the modeling of 3D deformable objects. The chapter mainly addresses human faces, but the approach can also be applied to other domains such as medical imaging, industrial robot manipulation, and action recognition.
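
The real-time wireframe idea reduces to rotating model vertices by the estimated head pose and projecting them with a pinhole camera, as in the minimal sketch below; the pose parameterization and focal length are placeholders, and the vertex set would come from an actual face model (e.g., a Candide-style mesh), which is not reproduced here.

```python
# Minimal sketch of wireframe rendering: pose rotation + pinhole projection.
# Drawing the mesh edges between projected vertices gives the overlay.
import numpy as np

def rotation(yaw, pitch, roll):
    """Rotation matrix from head-pose angles (radians), R = Rz @ Ry @ Rx."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(vertices, pose, t, focal=800.0):
    """Pinhole projection of (N, 3) model vertices under rotation + translation."""
    cam = vertices @ rotation(*pose).T + t          # to camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]         # perspective divide -> (N, 2)
```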


Author(s): Giancarlo Iannizzotto, Francesco La Rosa

This chapter introduces the VirtualBoard framework for building vision-based perceptual user interfaces (PUIs). While most vision-based human-computer interaction applications developed over the last decade focus on the technological aspects of image processing and computer vision, our main effort is directed toward ease and naturalness of use, integration and compatibility with existing systems and software, portability, and efficiency. VirtualBoard is based on a modular architecture that allows the implementation of several classes of gestural and vision-based human-computer interaction approaches: it is extensible and portable and requires relatively few computational resources, which also helps reduce energy consumption and hardware costs. Particular attention is devoted to robustness to environmental conditions (such as illumination and noise level). We believe that current technologies can easily support vision-based PUIs and that PUIs are strongly needed by modern applications. With the exception of the gaming industry, where vision-based PUIs are already being intensively studied and in some cases exploited, more effort is needed to merge the knowledge of the HCI and computer vision communities into realistic and industrially appealing products. This work is intended as a stimulus in that direction.
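
The modular architecture can be sketched as a plugin pipeline in which independent vision modules consume each frame and emit interface events, so new gesture classes plug in without touching the rest of the system. The names below are illustrative, not the actual VirtualBoard API.

```python
# Sketch of a modular PUI pipeline in the spirit of VirtualBoard.
from typing import Iterable, List, Protocol

class VisionModule(Protocol):
    """A pluggable vision component: consumes a frame, emits UI events."""
    def process(self, frame) -> List[str]: ...

class Pipeline:
    def __init__(self, modules: Iterable[VisionModule]):
        self.modules = list(modules)          # e.g. pointer tracker, gestures

    def run(self, frames):
        """Feed each frame to every module and yield the events they raise."""
        for frame in frames:
            for module in self.modules:
                for event in module.process(frame):
                    yield event               # e.g. dispatch to the windowing system
```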

