Learned versus Handcrafted Features for Person Re-identification

Author(s):  
C. Chahla ◽  
H. Snoussi ◽  
F. Abdallah ◽  
F. Dornaika

Person re-identification is one of the indispensable elements of visual surveillance. It consists of assigning consistent labels to the same person within the field of view of a single camera or across multiple cameras. While handcrafted feature extraction is certainly one way of approaching this problem, such features are becoming increasingly complex. Moreover, training a deep convolutional neural network (CNN) from scratch is difficult because it requires a large amount of labeled training data and considerable expertise to ensure proper convergence. This paper explores three main strategies for solving the person re-identification problem: (i) using handcrafted features, (ii) using transfer learning based on a deep CNN pre-trained for object categorization, and (iii) training a deep CNN from scratch. Our experiments consistently demonstrated that: (1) handcrafted features may still offer favorable characteristics and benefits, especially when the available database is too small to train a deep network; (2) a fully trained Siamese CNN outperforms both the handcrafted approaches and the combinations of a pre-trained CNN with different re-identification processes; and (3) pre-trained features and handcrafted features perform equally well. These experiments also revealed the most discriminative parts of the human body.
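
As a hedged illustration of strategy (iii), the sketch below shows a minimal Siamese CNN trained with a contrastive loss; the architecture, input size, and margin are assumptions for the example, not the network described in the paper.

```python
# Minimal sketch of a Siamese CNN for person re-identification, trained
# with a contrastive loss. Illustrative architecture only, not the
# paper's exact network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Small CNN that maps a pedestrian crop to an embedding vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

def contrastive_loss(z1, z2, same_id, margin=1.0):
    """same_id = 1 for matching pairs, 0 otherwise."""
    d = F.pairwise_distance(z1, z2)
    return (same_id * d.pow(2) +
            (1 - same_id) * F.relu(margin - d).pow(2)).mean()

# Both branches share the same weights (the "Siamese" property):
net = EmbeddingNet()
a = torch.randn(8, 3, 128, 64)   # batch of probe crops
b = torch.randn(8, 3, 128, 64)   # batch of gallery crops
y = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(net(a), net(b), y)
loss.backward()
```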

Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2639
Author(s):  
Quan T. Ngo ◽  
Seokhoon Yoon

Facial expression recognition (FER) is a challenging problem in the fields of pattern recognition and computer vision. The recent success of convolutional neural networks (CNNs) in object detection and object segmentation tasks has shown promise in building an automatic deep CNN-based FER model. However, in real-world scenarios, performance degrades dramatically owing to the great diversity of factors unrelated to facial expressions, a lack of training data, and an intrinsic imbalance in the existing facial emotion datasets. To tackle these problems, this paper not only applies deep transfer learning techniques, but also proposes a novel loss function, called weighted-cluster loss, which is used during the fine-tuning phase. Specifically, the weighted-cluster loss function simultaneously improves intra-class compactness and inter-class separability by learning a class center for each emotion class. It also takes the imbalance of a facial expression dataset into account by giving each emotion class a weight based on its proportion of the total number of images. In addition, a recent, successful deep CNN architecture, pre-trained on the task of face identification with the VGGFace2 database from the Visual Geometry Group at Oxford University, is employed and fine-tuned with the proposed loss function to recognize eight basic facial emotions from AffectNet, a database of facial expression, valence, and arousal computing in the wild. Experiments on the AffectNet real-world facial dataset demonstrate that our method outperforms baseline CNN models that use either weighted-softmax loss or center loss.
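
As a hedged sketch of the idea described above, the snippet below implements a class-weighted center loss: each feature is pulled toward its learned class center, with rare classes up-weighted by inverse frequency. The authors' exact weighted-cluster formulation (in particular its inter-class separability term) may differ, and all names and hyperparameters here are illustrative.

```python
# Hedged sketch of a class-weighted center loss in the spirit of the
# weighted-cluster loss described above; the paper's exact formulation
# may differ.
import torch
import torch.nn as nn

class WeightedCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, class_counts):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # Rarer classes receive larger weights (inverse-frequency weighting).
        w = counts.sum() / (len(counts) * counts)
        self.register_buffer("class_weights", w)

    def forward(self, feats, labels):
        # Squared distance of each feature to its own class center,
        # scaled by that class's imbalance weight.
        diffs = feats - self.centers[labels]
        per_sample = diffs.pow(2).sum(dim=1) * self.class_weights[labels]
        return per_sample.mean()

# Typically combined with a (weighted) softmax cross-entropy term, e.g.
# total_loss = ce_loss + lambda_c * center_loss(features, labels)
```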


2019 ◽  
Author(s):  
Raphael Prates ◽  
William Robson Schwartz

This work addresses the person re-identification problem, which consists of matching images of individuals captured by multiple, non-overlapping surveillance cameras. Works from the literature tackle this problem by proposing robust feature descriptors and matching functions, where the latter are responsible for assigning the correct identity to individuals and are the focus of this work. Specifically, we propose two matching methods: the Kernel MBPLS and the Kernel X-CRC. The Kernel MBPLS is a nonlinear regression model that is scalable with respect to the number of cameras and allows the inclusion of additional labelled information (e.g., attributes). By contrast, the Kernel X-CRC is a nonlinear, multitask matching function that can be used jointly with subspace learning approaches to boost matching rates. We present an extensive experimental evaluation of both approaches on four datasets (VIPeR, PRID450S, WARD and Market-1501). Experimental results demonstrate that the Kernel MBPLS and the Kernel X-CRC outperform approaches from the literature. Furthermore, we show that the Kernel X-CRC can be successfully applied to large-scale, multi-camera datasets.
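
As a rough illustration of the kernelized coding that Kernel X-CRC builds on, the sketch below shows a basic single-view kernel collaborative representation step: a probe is coded over the gallery via kernel ridge regression and the coefficients are used as affinity scores. The cross-view/multitask coupling that distinguishes X-CRC is omitted, and the RBF kernel, regularizer, and scoring rule are assumptions of this example.

```python
# Minimal sketch of kernelized collaborative representation matching.
# The paper's Kernel X-CRC couples probe and gallery codings across
# views; shown here is only the basic kernel-ridge coding it builds on.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_crc_scores(probe, gallery, lam=0.1, gamma=1.0):
    """Rank gallery entries for one probe feature vector."""
    K = rbf_kernel(gallery, gallery, gamma)          # n x n Gram matrix
    k = rbf_kernel(gallery, probe[None, :], gamma)   # n x 1
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), k).ravel()
    # Larger coding coefficients indicate stronger affinity to that entry
    # (each gallery identity contributes a single sample here).
    return alpha

gallery = np.random.randn(50, 64)   # 50 gallery descriptors (placeholder)
probe = np.random.randn(64)
ranking = np.argsort(-kernel_crc_scores(probe, gallery))
```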


2020 ◽  
Vol 39 (3) ◽  
pp. 4405-4418
Author(s):  
Yao-Liang Chung ◽  
Hung-Yuan Chung ◽  
Wei-Feng Tsai

In the present study, we sought to enable instant tracking of the hand region as a region of interest (ROI) within the image range of a webcam, while also identifying specific hand gestures, in order to facilitate the control of home appliances in smart homes or the issuing of commands in human-computer interaction fields. To accomplish this objective, we first applied skin color detection and noise processing to remove unnecessary background information from the captured image, before applying background subtraction to detect the ROI. Then, to prevent background objects or noise from influencing the ROI, we utilized the kernelized correlation filters (KCF) algorithm to track the detected ROI. Next, the ROI image was resized to 100×120 and input into a deep convolutional neural network (CNN) to identify various hand gestures. Two deep CNN architectures, modified from the AlexNet and VGGNet CNNs, respectively, were developed by substantially reducing the number of network parameters and appropriately adjusting the internal network configuration. The tracking and recognition process described above was then continuously repeated to achieve real-time operation, with the system running until the hand leaves the camera range. The results indicated excellent performance by both of the proposed deep CNN architectures. In particular, the modified VGGNet achieved better performance, with a recognition rate of 99.90% on the training set and 95.61% on the test set, indicating the feasibility of the system for practical applications.
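
A hedged sketch of the tracking half of this pipeline is given below: a rough skin-color mask yields an initial hand ROI, which is then tracked with OpenCV's KCF implementation and resized to 100×120 for the (omitted) gesture CNN. The color thresholds and `classify` placeholder are assumptions, and the paper's background-subtraction step is left out; depending on the OpenCV build, the tracker factory may be `cv2.TrackerKCF_create` or `cv2.legacy.TrackerKCF_create` (opencv-contrib-python is required).

```python
# Hedged sketch: skin-color segmentation for an initial hand ROI,
# then KCF tracking of that ROI. The gesture CNN itself is omitted.
import cv2
import numpy as np

def skin_roi(frame):
    """Rough skin mask in YCrCb space; thresholds are illustrative."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not cnts:
        return None
    return cv2.boundingRect(max(cnts, key=cv2.contourArea))

cap = cv2.VideoCapture(0)
tracker = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None:
        box = skin_roi(frame)
        if box is not None:
            tracker = cv2.TrackerKCF_create()  # or cv2.legacy.TrackerKCF_create
            tracker.init(frame, box)
    else:
        ok, box = tracker.update(frame)
        if not ok:
            tracker = None       # lost the hand; re-detect next frame
            continue
        x, y, w, h = map(int, box)
        crop = cv2.resize(frame[y:y + h, x:x + w], (100, 120))
        # crop would now be fed to the gesture CNN, e.g. classify(crop)
```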


2003 ◽  
Vol 69 (680) ◽  
pp. 1011-1018 ◽  
Author(s):  
Toshio FUKUDA ◽  
Tatsuya SUZUKI ◽  
Yasuhisa HASEGAWA ◽  
Fumihito ARAI ◽  
Masaru NEGI

Author(s):  
WOJCIECH ZAJDEL ◽  
BEN J. A. KRÖSE

Visual surveillance in wide areas (e.g. airports) relies on sparsely distributed cameras, that is, cameras that observe non-overlapping scenes. In this setup, multi-object tracking requires re-identification of an object when it leaves one field of view and later appears in another. Although similar association problems are common in multi-object tracking scenarios, in the distributed case one has to cope with asynchronous observations and cannot assume smooth motion of the objects. In this paper, we propose a method for indoor tracking of humans. The method is based on a Dynamic Bayes Network (DBN) as a probabilistic model for the observations. The edges of the network define the correspondences between observations of the same object. Accordingly, we derive an approximate EM-like method for selecting the most likely structure of the DBN and learning the model parameters. The presented algorithm is tested on a collection of real-world observations gathered by a system of cameras in an office building.
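
The DBN and its EM-like structure search are beyond a short sketch; as a simpler stand-in for the core association step, the snippet below matches appearance features of people exiting one camera to those entering another, using a Gaussian appearance cost and the Hungarian algorithm. This is explicitly not the authors' method, only an illustration of the correspondence problem.

```python
# Illustration of the cross-camera association step only, with a
# Gaussian appearance likelihood and the Hungarian algorithm. Not the
# authors' DBN-based approach.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def associate(exits, entries, sigma=1.0):
    """Match appearance features of people leaving one camera (exits)
    to those appearing at another (entries)."""
    # Negative log-likelihood under an isotropic Gaussian appearance model.
    cost = cdist(exits, entries, "sqeuclidean") / (2 * sigma ** 2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

exits = np.random.randn(4, 16)    # e.g. mean color histograms (placeholder)
entries = np.random.randn(4, 16)
print(associate(exits, entries))
```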


2013 ◽  
Vol 467 ◽  
pp. 323-326
Author(s):  
Jong Eun Ha

A fish-eye lens is used in various applications due to its wide coverage of the scene. In particular, it can be effectively used in visual surveillance and in surround monitoring for automotive applications. It has large radial distortion compared to conventional lenses with a small field of view. In this paper, we present comparison results for the calibration of fish-eye lenses. We compare two algorithms: one available in OpenCV and the other by Devernay and Faugeras. We also present experimental results according to the number of calibration points and the initialization values. We evaluate the accuracy of calibration through 3D reconstruction by a stereo system, which gives a more reliable evaluation than using the reprojection error from a single camera.
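
A hedged sketch of one of the two compared approaches, the fish-eye calibration available in OpenCV (`cv2.fisheye`), is shown below; the checkerboard size and image folder are placeholder assumptions.

```python
# Hedged sketch of fish-eye calibration with OpenCV's cv2.fisheye module,
# one of the two approaches compared above.
import cv2
import glob
import numpy as np

CB = (9, 6)  # inner corners of the checkerboard (assumed)
objp = np.zeros((1, CB[0] * CB[1], 3), np.float32)
objp[0, :, :2] = np.mgrid[0:CB[0], 0:CB[1]].T.reshape(-1, 2)

obj_pts, img_pts, shape = [], [], None
for path in glob.glob("calib/*.jpg"):      # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, CB)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners.reshape(1, -1, 2))
        shape = gray.shape[::-1]           # (width, height)

K, D = np.zeros((3, 3)), np.zeros((4, 1))
rms, K, D, _, _ = cv2.fisheye.calibrate(
    obj_pts, img_pts, shape, K, D,
    flags=cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC,
    criteria=(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-6))
print("RMS reprojection error:", rms)  # single-camera error; the paper
# argues stereo 3D reconstruction gives a more reliable evaluation.
```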


2020 ◽  
Author(s):  
Yaoda Xu ◽  
Maryam Vaziri-Pashkam

Existing single-cell neural recording findings predict that, as information ascends the visual processing hierarchy in the primate brain, the relative similarity among objects would be increasingly preserved across identity-preserving image transformations. Here we confirm this prediction and show that object category representational structure becomes increasingly invariant across position and size changes as information ascends the human ventral visual processing pathway. Such a representation, however, is not found in 14 different convolutional neural networks (CNNs) trained for object categorization that varied in architecture, depth, and the presence or absence of recurrent processing. CNNs thus do not appear to form or maintain brain-like, transformation-tolerant object identity representations at higher levels of visual processing, despite the fact that CNNs may classify objects under various transformations. This limitation could potentially contribute to the large amount of training data required to train CNNs and to their limited ability to generalize to objects not included in training.
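
As a hedged sketch of the representational similarity logic behind this result, the snippet below builds a representational dissimilarity matrix (RDM) from a layer's responses to the same objects at two positions and correlates the two RDMs; the random feature matrices are stand-ins for real CNN activations, and the exact analysis in the paper may differ.

```python
# Sketch of an RDM-based invariance test: compare the representational
# structure of the same objects shown at two positions. Random features
# stand in for real CNN (or neural) responses.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """1 - Pearson correlation between response patterns, per object pair."""
    return pdist(features, metric="correlation")

n_objects, n_units = 20, 512
resp_pos1 = np.random.randn(n_objects, n_units)   # objects at position 1
resp_pos2 = np.random.randn(n_objects, n_units)   # same objects at position 2

# A high correlation would indicate transformation-tolerant
# representational structure; the study finds this in the brain
# but not in the tested CNNs.
rho, _ = spearmanr(rdm(resp_pos1), rdm(resp_pos2))
print(f"RDM similarity across position change: {rho:.2f}")
```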

