ACM Transactions on Multimedia Computing, Communications, and Applications
Latest Publications

TOTAL DOCUMENTS: 1027 (FIVE YEARS: 386)
H-INDEX: 39 (FIVE YEARS: 9)
Published By: Association for Computing Machinery
ISSN: 1551-6857

Author(s): Zhaoliang He, Hongshan Li, Zhi Wang, Shutao Xia, Wenwu Zhu

With the growth of computer vision-based applications, an explosive number of images has been uploaded to cloud servers that host such online computer vision algorithms, usually in the form of deep learning models. JPEG has been the de facto compression and encapsulation method for these images. However, the standard JPEG configuration does not always perform well for compressing images that are to be processed by a deep learning model. For example, the standard JPEG quality level incurs a 50% size overhead (compared with the best quality-level selection) on ImageNet at the same inference accuracy in popular computer vision models (e.g., InceptionNet and ResNet). Even knowing this, designing a better JPEG configuration for online computer vision-based services remains extremely challenging. First, cloud-based computer vision models are usually a black box to end users, so it is difficult to design a JPEG configuration without knowing their model structures. Second, the "optimal" JPEG configuration is not fixed; it is determined by confounding factors, including the characteristics of the input images and the model, the expected accuracy and image size, and so forth. In this article, we propose a reinforcement learning (RL)-based adaptive JPEG configuration framework, AdaCompress. In particular, we design an edge (i.e., user-side) RL agent that learns the optimal compression quality level for an expected inference accuracy and upload image size purely from the online inference results, without knowing details of the model structures. Furthermore, we design an explore-exploit mechanism that lets the framework quickly switch agents when it detects a performance degradation, mainly caused by input changes (e.g., images captured in daytime versus at night). Our evaluation experiments using real-world online computer vision APIs from Amazon Rekognition, Face++, and Baidu Vision show that our approach outperforms existing baselines by reducing the size of images by one-half to one-third while the overall classification accuracy decreases only slightly. Meanwhile, AdaCompress promptly re-trains or re-loads the RL agent to maintain performance.
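To make the idea concrete, the following is a minimal, hypothetical sketch of such a user-side control loop: an epsilon-greedy bandit picks a JPEG quality level, compresses with Pillow, and updates its reward estimate from the black-box API's answer. The quality set, the size penalty, and query_cloud_api are illustrative assumptions, not AdaCompress's actual design.

```python
# Hypothetical epsilon-greedy sketch of the user-side control loop: choose a
# JPEG quality level, observe black-box inference feedback, and trade accuracy
# against upload size. All constants and the API stub are assumptions.
import io
import random
from PIL import Image

QUALITIES = [95, 85, 75, 65, 55, 45]        # candidate JPEG quality levels
values = {q: 0.0 for q in QUALITIES}        # running reward estimate per level
counts = {q: 0 for q in QUALITIES}
EPSILON, LAMBDA = 0.1, 1e-5                 # exploration rate, size penalty

def query_cloud_api(jpeg_bytes):
    """Placeholder for a real black-box vision API; returns a predicted label."""
    return "cat"

def step(image, true_label):
    q = (random.choice(QUALITIES) if random.random() < EPSILON
         else max(QUALITIES, key=values.get))        # explore or exploit
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=q)
    reward = float(query_cloud_api(buf.getvalue()) == true_label) - LAMBDA * buf.tell()
    counts[q] += 1
    values[q] += (reward - values[q]) / counts[q]    # incremental mean update
    return q
```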


Author(s): Yizhen Chen, Haifeng Hu

Most existing segmentation networks are built upon a “U-shaped” encoder-decoder structure, where the multi-level features extracted by the encoder are gradually aggregated by the decoder. Although this structure has proven effective in improving segmentation performance, it has two main drawbacks. On the one hand, introducing low-level features brings a significant increase in computation without an obvious performance gain. On the other hand, general feature-aggregation strategies such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes useful information with massive noise. In this article, we abandon the traditional “U-shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it aggregates only the low-resolution high-level features and utilizes the global context guidance generated by the first branch to refine the second branch. The dual branches are effectively connected through a Semantic Enhancing Module, which can be regarded as a combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) to adaptively integrate features from different receptive fields by assigning specific channel-wise weights, where the weights are input-dependent. Our Y-Net is capable of breaking through the limits of a single-branch network and attains higher performance with less computational cost than the “U-shaped” structure. The proposed CSD better integrates useful information and suppresses interference noise. Comprehensive experiments are carried out on three public datasets to evaluate the effectiveness of our method. Y-Net achieves state-of-the-art performance on the PASCAL VOC 2012, PASCAL Person-Part, and ADE20K datasets without pre-training on extra datasets.
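As an illustration of the input-dependent channel weighting the CSD performs, here is a hedged PyTorch sketch of a channel-selective fusion block: global pooling summarizes two branch features, a small bottleneck predicts per-channel weights, and the weighted features are fused. The layer sizes and the squeeze-style design are assumptions, not the paper's exact module.

```python
# Hedged sketch of input-dependent channel weighting in the spirit of the CSD:
# a squeeze-style bottleneck predicts per-channel weights used to fuse two
# feature maps. Layer sizes and the exact design are assumptions.
import torch
import torch.nn as nn

class ChannelSelectiveFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global spatial context
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, 1),
        )

    def forward(self, a, b):                              # a, b: (N, C, H, W)
        w = torch.sigmoid(self.fc(torch.cat([a, b], dim=1)))
        wa, wb = w.chunk(2, dim=1)                        # input-dependent weights
        return wa * a + wb * b                            # weighted fusion, not plain add

fused = ChannelSelectiveFusion(64)(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```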


Author(s): Guangtao Zhai, Wei Sun, Xiongkuo Min, Jiantao Zhou

Low-light image enhancement algorithms (LIEA) can light up images captured in dark or back-lit conditions. However, LIEA may introduce various distortions, such as structure damage, color shift, and noise, into the enhanced images. Despite the various LIEAs proposed in the literature, few efforts have been made to study the quality evaluation of low-light enhancement. In this article, we make one of the first attempts to investigate the quality assessment problem of low-light image enhancement. To facilitate the study of objective image quality assessment (IQA), we first build a large-scale low-light image enhancement quality (LIEQ) database. The LIEQ database includes 1,000 light-enhanced images, generated from 100 low-light images using 10 LIEAs. Rather than evaluating the quality of light-enhanced images directly, which is more difficult, we propose to use the multi-exposure fused (MEF) image and the stack-based high dynamic range (HDR) image as references and evaluate the quality of low-light enhancement following a full-reference (FR) quality assessment routine. We observe that the distortions introduced by low-light enhancement differ significantly from those considered in well-studied traditional IQA databases, and that current state-of-the-art FR IQA models are not suitable for evaluating their quality. Therefore, we propose a new FR low-light image enhancement quality assessment (LIEQA) index that evaluates image quality from four aspects: luminance enhancement, color rendition, noise evaluation, and structure preservation, which capture the key aspects of low-light enhancement. Experimental results on the LIEQ database show that the proposed LIEQA index outperforms state-of-the-art FR IQA models. LIEQA can act as an evaluator for various low-light enhancement algorithms and systems. To the best of our knowledge, this article is the first comprehensive study of its kind on low-light image enhancement quality assessment.
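The abstract names the four aspects the LIEQA index evaluates but not how they are fused; below is a minimal NumPy sketch assuming a weighted linear combination of four crude sub-scores. Every sub-score function and weight here is an illustrative placeholder, not the paper's actual index.

```python
# Illustrative-only sketch: fuse four crude quality proxies (luminance, color,
# noise, structure) into one score. None of these formulas come from the paper.
import numpy as np

def luminance_score(ref, img):      # mean-brightness agreement proxy
    return 1.0 - abs(float(img.mean()) - float(ref.mean())) / 255.0

def color_score(ref, img):          # per-channel mean-shift proxy, (H, W, 3) input
    return 1.0 - float(np.abs(img.mean(axis=(0, 1)) - ref.mean(axis=(0, 1))).mean()) / 255.0

def noise_score(img):               # high-pass residual energy proxy
    return 1.0 / (1.0 + float(np.abs(img[1:] - img[:-1]).mean()))

def structure_score(ref, img):      # normalized cross-correlation proxy
    r, i = ref - ref.mean(), img - img.mean()
    return float((r * i).mean() / (r.std() * i.std() + 1e-8))

def lieqa_like_index(ref, img, weights=(0.3, 0.25, 0.2, 0.25)):
    subs = [luminance_score(ref, img), color_score(ref, img),
            noise_score(img), structure_score(ref, img)]
    return float(np.dot(weights, subs))                  # assumed linear fusion
```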


Author(s): Wei Gao, Linjie Zhou, Lvfang Tao

View synthesis (VS) for light field images is a very time-consuming task due to the great quantity of involved pixels and the intensive computations, which may prevent its use in practical real-time three-dimensional systems. In this article, we propose an acceleration approach for deep learning-based light field view synthesis that can significantly reduce computation by using compact-resolution (CR) representation and super-resolution (SR) techniques, as well as light-weight neural networks. The proposed architecture consists of three cascaded neural networks: a CR network that generates the compact representation of the original input views, a VS network that synthesizes new views from the down-scaled compact views, and an SR network that reconstructs high-quality views at full resolution. All of these networks are jointly trained with the integrated losses of the CR, VS, and SR networks. Moreover, exploiting the redundancy of deep neural networks, we use an efficient light-weighting strategy that prunes filters to simplify the networks and accelerate inference. Experimental results demonstrate that the proposed method greatly reduces processing time and is much more computationally efficient while providing competitive image quality.
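A hedged PyTorch sketch of the joint objective follows: one integrated loss supervises all three stages so that gradients flow through the whole CR -> VS -> SR cascade. The per-stage loss type (L1) and the stage weights are assumptions.

```python
# Hedged sketch of the integrated objective over the CR -> VS -> SR cascade.
# The L1 terms and the stage weights are assumptions for illustration.
import torch.nn.functional as F

def integrated_loss(cr_out, cr_target, vs_out, vs_target, sr_out, sr_target,
                    w_cr=1.0, w_vs=1.0, w_sr=1.0):
    # each term supervises one stage; joint training backpropagates through all
    return (w_cr * F.l1_loss(cr_out, cr_target)
            + w_vs * F.l1_loss(vs_out, vs_target)
            + w_sr * F.l1_loss(sr_out, sr_target))
```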


Author(s): Jinwei Wang, Wei Huang, Xiangyang Luo, Yun-Qing Shi, Sunil Kr. Jha

Due to the popularity of the JPEG format in recent years, JPEG images inevitably undergo editing operations. Thus, some tampered images will carry traces of non-aligned double JPEG (NA-DJPEG) compression. By detecting the presence of NA-DJPEG compression, one can verify whether a given JPEG image has been tampered with. However, only a few methods can identify NA-DJPEG compressed images when the primary quality factor is greater than the secondary quality factor. To address this challenging task, this article proposes a novel feature extraction scheme based on optimized pixel difference (OPD), a new measure of blocking artifacts. First, the three color channels (RGB) of the reconstructed image obtained by decompressing a given JPEG color image are mapped into spherical coordinates to compute an amplitude and two angles (azimuth and zenith). Then, 16 histograms of OPD along the horizontal and vertical directions are calculated for the amplitude and the two angles, respectively. Finally, a feature set formed by arranging the bin values of these histograms is used for binary classification. Experiments demonstrate the effectiveness of the proposed method, and the results show that it significantly outperforms existing typical methods on this task.
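The first step, mapping RGB pixels to spherical coordinates, can be sketched directly; the angle conventions below (azimuth in the R-G plane, zenith from the B axis) are assumptions, since the abstract does not fix them.

```python
# Hedged NumPy sketch of the spherical mapping step: each RGB pixel becomes an
# amplitude plus azimuth and zenith angles. Angle conventions are assumptions.
import numpy as np

def rgb_to_spherical(img):
    """img: float array of shape (H, W, 3) with R, G, B planes."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    amplitude = np.sqrt(r**2 + g**2 + b**2)
    azimuth = np.arctan2(g, r)                                # angle in the R-G plane
    zenith = np.arccos(np.clip(b / np.maximum(amplitude, 1e-8), -1.0, 1.0))
    return amplitude, azimuth, zenith
```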


Author(s): Jianhai Zhang, Zhiyong Feng, Yong Su, Meng Xing

Owing to the merits of high-order statistics and Riemannian geometry, the covariance matrix has become a generic feature representation for action recognition: an independent action can be represented by empirical statistics over all of its pose samples. Covariance has two major problems: (1) it is prone to be singular, so actions fail to be represented properly, and (2) it lacks global action/pose-aware information, so its expressive and discriminative power is limited. In this article, we propose a novel Bayesian covariance representation with a prior regularization method to solve these problems. Specifically, the covariance is viewed as a parametric maximum likelihood estimate of a Gaussian distribution over the local poses of an independent action. Then, a Global Informative Prior (GIP) with sufficient statistics is generated over the global poses to regularize the covariance. In this way, (1) singularity is greatly relieved thanks to the sufficient statistics, and (2) the global pose information in the GIP makes the Bayesian covariance theoretically equivalent to a saliency-weighted covariance over the global action poses, so the discriminative characteristics of actions are represented more clearly. Experimental results show that our Bayesian covariance with GIP efficiently improves action recognition performance. On some databases, it outperforms state-of-the-art variants based on kernels, temporal-order structures, and saliency-weighted attention, among others.
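As a sketch of how a global prior can regularize a per-action covariance, the snippet below blends the per-action sample covariance with a covariance computed over all poses, in a conjugate posterior-mean style. The blending rule and the strength kappa are assumptions, not the paper's exact GIP construction.

```python
# Hedged sketch: shrink a per-action covariance toward a global prior built
# from all poses. The conjugate-style blend and kappa are assumptions.
import numpy as np

def regularized_covariance(action_poses, global_poses, kappa=10.0):
    """action_poses: (n, d) poses of one action; global_poses: (N, d) all poses."""
    prior = np.cov(global_poses, rowvar=False)           # global informative prior
    sample = np.cov(action_poses, rowvar=False)          # per-action MLE covariance
    n = len(action_poses)
    return (n * sample + kappa * prior) / (n + kappa)    # posterior-mean style blend
```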


Author(s): Junyi Wu, Yan Huang, Qiang Wu, Zhipeng Gao, Jianqiang Zhao, ...

The task of person re-identification (re-ID) is to find the same pedestrian across non-overlapping camera views. In general, person re-ID performance can be degraded by background clutter. However, existing segmentation algorithms cannot produce perfect foreground masks that cleanly exclude background information. In addition, if the background is completely removed, some discriminative ID-related cues (e.g., a backpack or a companion) may be lost. In this article, we design a dual-stream network consisting of a Provider Stream (P-Stream) and a Receiver Stream (R-Stream). The R-Stream performs an a priori optimization operation on foreground information. The P-Stream acts as a pusher that guides the R-Stream to concentrate on foreground information as well as useful ID-related cues in the background. The proposed dual-stream network makes full use of the a priori optimization and guided-learning strategies to learn strong foreground representations along with useful ID-related information from the background. Our method achieves Rank-1 accuracy of 95.4% on Market-1501, 89.0% on DukeMTMC-reID, 78.9% on CUHK03 (labeled), and 75.4% on CUHK03 (detected), outperforming state-of-the-art methods.


Author(s): Mohannad Alahmadi, Peter Pocta, Hugh Melvin

Web Real-Time Communication (WebRTC) combines a set of standards and technologies to enable high-quality audio, video, and auxiliary data exchange in web browsers and mobile applications. It enables peer-to-peer multimedia sessions over IP networks without additional plugins. The Opus codec, deployed as the default audio codec for speech and music streaming in WebRTC, supports a wide range of bitrates covering narrowband, wideband, and super-wideband up to fullband bandwidths. Users of IP-based telephony always demand high-quality audio. Beyond users' expectations, their emotional state, the content type, and many other psychological factors, together with network quality of service and distortions introduced at the end terminals, determine their quality of experience. To measure the quality experienced by the end user of a voice transmission service, one can use the E-model standardized in ITU-T Rec. G.107 (the narrowband version), ITU-T Rec. G.107.1 (the wideband version), and the most recent ITU-T Rec. G.107.2 extension for super-wideband. In this work, we present a quality-of-experience model built on the E-model to quantify the impact of coding and packet loss on the quality perceived by the end user in WebRTC speech applications. Based on the computed Mean Opinion Score, a real-time adaptive codec-parameter switching mechanism switches to the optimal codec bitrate under the current network conditions. We present evaluation results showing the effectiveness of the proposed approach compared with the default codec configuration in WebRTC.
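To illustrate the switching idea, here is a hedged sketch using the narrowband G.107 relations: packet loss raises the effective equipment impairment, the resulting R-factor maps to MOS via the standard conversion, and the bitrate with the best predicted MOS is chosen. The per-bitrate (Ie, Bpl) table is a hypothetical stand-in for calibrated Opus values, and the paper also covers the wideband and super-wideband extensions.

```python
# Hedged sketch of E-model-driven codec switching (narrowband G.107 form).
# The per-bitrate impairment table is hypothetical, not calibrated Opus data.
R0 = 93.2                                   # default maximum R-factor in G.107

def r_factor(ie, bpl, ppl):
    # effective equipment impairment under random packet loss (G.107)
    ie_eff = ie + (95.0 - ie) * ppl / (ppl + bpl)
    return R0 - ie_eff

def mos(r):
    # standard G.107 R-to-MOS conversion, clamped to the valid range
    r = max(0.0, min(100.0, r))
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

CODEC_MODES = {8000: (25.0, 10.0), 16000: (15.0, 15.0), 32000: (5.0, 20.0)}

def pick_bitrate(packet_loss_pct):
    # switch to the bitrate whose predicted MOS is highest right now
    return max(CODEC_MODES,
               key=lambda br: mos(r_factor(*CODEC_MODES[br], packet_loss_pct)))
```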


Author(s): Bo Zhang, Rui Zhang, Niccolo Bisagno, Nicola Conci, Francesco G. B. De Natale, ...

In this article, we propose a framework for crowd behavior prediction in complicated scenarios. The fundamental framework follows the standard encoder-decoder scheme, built upon long short-term memory (LSTM) modules to capture the temporal evolution of crowd behaviors. To model interactions among humans and environments, we embed both social and physical attention mechanisms into the LSTM. The social attention component models the interactions among different pedestrians, whereas the physical attention component helps to understand the spatial configuration of the scene. Since pedestrians' behaviors exhibit multi-modal properties, we use a generative model to produce multiple acceptable future paths. The proposed framework not only predicts an individual's trajectory accurately but also forecasts ongoing group behaviors by leveraging the coherent filtering approach. Experiments carried out on standard crowd benchmarks (namely, the ETH, UCY, CUHK Crowd, and CrowdFlow datasets) demonstrate that the proposed framework is effective in forecasting crowd behaviors in complex scenarios.
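The backbone scheme can be sketched as a plain LSTM encoder-decoder over (x, y) positions; the social and physical attention modules and the generative sampling are omitted here, and all dimensions are assumptions.

```python
# Hedged PyTorch sketch of the encoder-decoder backbone only: an LSTM encodes
# the observed track and an LSTM cell rolls out future positions. Attention and
# the generative head are omitted; all dimensions are assumptions.
import torch
import torch.nn as nn

class TrajectoryEncoderDecoder(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(2, hidden)
        self.head = nn.Linear(hidden, 2)               # per-step (dx, dy)

    def forward(self, obs, pred_len=12):               # obs: (N, T_obs, 2)
        _, (h, c) = self.encoder(obs)
        h, c = h[0], c[0]
        pos, out = obs[:, -1], []
        for _ in range(pred_len):                      # autoregressive roll-out
            h, c = self.decoder(pos, (h, c))
            pos = pos + self.head(h)                   # displacement prediction
            out.append(pos)
        return torch.stack(out, dim=1)                 # (N, pred_len, 2)
```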


Author(s): Prerna Mishra, Santosh Kumar, Mithilesh Kumar Chaube

Chart images exhibit significant variability that makes each image different from the others, even when they belong to the same class or category. Chart classification is a major challenge because each chart class varies in features, structure, and noise. Moreover, because the dissimilar features are not tied to the structure of the chart, modeling these variations for automatic chart recognition is a challenging task. In this article, we present a novel dissimilarity-based learning model for classifying similarly structured but diverse charts. Our approach jointly learns the features of both dissimilar and similar regions. The model is trained with an improved loss function that fuses a structural variation-aware dissimilarity index with regularization parameters, making the model more attentive to dissimilar regions. The dissimilarity index enhances the discriminative power of features learned not only from dissimilar regions but also from similar regions. Extensive comparative evaluations demonstrate that our approach significantly outperforms benchmark methods, including both traditional and deep learning models, on publicly available datasets.
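One way to read "an improved loss fused with a dissimilarity index" is cross-entropy plus a weighted term that penalizes feature similarity across different chart classes; the sketch below implements that reading. The pairwise cosine term is a stand-in assumption for the paper's structural variation-aware index.

```python
# Hedged sketch: cross-entropy fused with a dissimilarity-aware term. The
# cosine-based penalty stands in for the paper's structural index (assumption).
import torch
import torch.nn.functional as F

def fused_loss(logits, features, labels, lam=0.1):
    ce = F.cross_entropy(logits, labels)
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                    # pairwise cosine similarity
    diff = (labels[:, None] != labels[None, :]).float()
    # penalize similarity between samples from different chart classes
    dissim = (sim * diff).sum() / diff.sum().clamp(min=1.0)
    return ce + lam * dissim
```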

