ACM Transactions on Multimedia Computing, Communications, and Applications
Latest Publications

TOTAL DOCUMENTS: 1027 (FIVE YEARS: 386)
H-INDEX: 39 (FIVE YEARS: 9)
Published By: Association for Computing Machinery
ISSN: 1551-6857

Author(s): Zhaoliang He, Hongshan Li, Zhi Wang, Shutao Xia, Wenwu Zhu

With the growth of computer vision-based applications, an explosive number of images has been uploaded to cloud servers that host such online computer vision algorithms, usually in the form of deep learning models. JPEG has been the de facto compression and encapsulation method for these images. However, the standard JPEG configuration does not always perform well for compressing images that are to be processed by a deep learning model. For example, the standard JPEG quality level incurs a 50% size overhead (compared with the best quality-level selection) on ImageNet at the same inference accuracy in popular computer vision models (e.g., InceptionNet and ResNet). Even knowing this, designing a better JPEG configuration for online computer vision-based services remains extremely challenging. First, cloud-based computer vision models are usually a black box to end users, so it is difficult to design a JPEG configuration without knowing their model structures. Second, the "optimal" JPEG configuration is not fixed; it is determined by confounding factors, including the characteristics of the input images and the model, the expected accuracy and image size, and so forth. In this article, we propose a reinforcement learning (RL)-based adaptive JPEG configuration framework, AdaCompress. In particular, we design an edge (i.e., user-side) RL agent that learns the optimal compression quality level for an expected inference accuracy and upload image size purely from the online inference results, without knowing details of the model structures. Furthermore, we design an explore-exploit mechanism that lets the framework quickly switch agents when it detects a performance degradation, mainly caused by input changes (e.g., images captured in daytime versus at night). Our evaluation experiments using real-world online computer vision APIs from Amazon Rekognition, Face++, and Baidu Vision show that our approach outperforms existing baselines by reducing the size of images by one-half to one-third while the overall classification accuracy decreases only slightly. Meanwhile, AdaCompress promptly re-trains or re-loads the RL agent to maintain performance.
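To make the idea concrete, the following is a minimal, hypothetical sketch of such a user-side control loop: an epsilon-greedy bandit picks a JPEG quality level, compresses with Pillow, and updates its reward estimate from the black-box API's answer. The quality set, the size penalty, and query_cloud_api are illustrative assumptions, not AdaCompress's actual design.

```python
# Hypothetical epsilon-greedy sketch of the user-side control loop: choose a
# JPEG quality level, observe black-box inference feedback, and trade accuracy
# against upload size. All constants and the API stub are assumptions.
import io
import random
from PIL import Image

QUALITIES = [95, 85, 75, 65, 55, 45]        # candidate JPEG quality levels
values = {q: 0.0 for q in QUALITIES}        # running reward estimate per level
counts = {q: 0 for q in QUALITIES}
EPSILON, LAMBDA = 0.1, 1e-5                 # exploration rate, size penalty

def query_cloud_api(jpeg_bytes):
    """Placeholder for a real black-box vision API; returns a predicted label."""
    return "cat"

def step(image, true_label):
    q = (random.choice(QUALITIES) if random.random() < EPSILON
         else max(QUALITIES, key=values.get))        # explore or exploit
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=q)
    reward = float(query_cloud_api(buf.getvalue()) == true_label) - LAMBDA * buf.tell()
    counts[q] += 1
    values[q] += (reward - values[q]) / counts[q]    # incremental mean update
    return q
```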


Author(s): Yizhen Chen, Haifeng Hu

Most existing segmentation networks are built upon a “U-shaped” encoder-decoder structure, where the multi-level features extracted by the encoder are gradually aggregated by the decoder. Although this structure has proven effective in improving segmentation performance, it has two main drawbacks. On the one hand, introducing low-level features brings a significant increase in computation without an obvious performance gain. On the other hand, general feature-aggregation strategies such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes useful information with massive noise. In this article, we abandon the traditional “U-shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it aggregates only the low-resolution high-level features and utilizes the global context guidance generated by the first branch to refine the second branch. The dual branches are effectively connected through a Semantic Enhancing Module, which can be regarded as a combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) to adaptively integrate features from different receptive fields by assigning specific channel-wise weights, where the weights are input-dependent. Our Y-Net is capable of breaking through the limits of a single-branch network and attains higher performance with less computational cost than the “U-shaped” structure. The proposed CSD better integrates useful information and suppresses interference noise. Comprehensive experiments are carried out on three public datasets to evaluate the effectiveness of our method. Y-Net achieves state-of-the-art performance on the PASCAL VOC 2012, PASCAL Person-Part, and ADE20K datasets without pre-training on extra datasets.
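As an illustration of the input-dependent channel weighting the CSD performs, here is a hedged PyTorch sketch of a channel-selective fusion block: global pooling summarizes two branch features, a small bottleneck predicts per-channel weights, and the weighted features are fused. The layer sizes and the squeeze-style design are assumptions, not the paper's exact module.

```python
# Hedged sketch of input-dependent channel weighting in the spirit of the CSD:
# a squeeze-style bottleneck predicts per-channel weights used to fuse two
# feature maps. Layer sizes and the exact design are assumptions.
import torch
import torch.nn as nn

class ChannelSelectiveFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global spatial context
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, 1),
        )

    def forward(self, a, b):                              # a, b: (N, C, H, W)
        w = torch.sigmoid(self.fc(torch.cat([a, b], dim=1)))
        wa, wb = w.chunk(2, dim=1)                        # input-dependent weights
        return wa * a + wb * b                            # weighted fusion, not plain add

fused = ChannelSelectiveFusion(64)(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```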


Author(s): Guangtao Zhai, Wei Sun, Xiongkuo Min, Jiantao Zhou

Low-light image enhancement algorithms (LIEA) can light up images captured in dark or back-lit conditions. However, LIEA may introduce various distortions, such as structure damage, color shift, and noise, into the enhanced images. Despite the various LIEAs proposed in the literature, few efforts have been made to study the quality evaluation of low-light enhancement. In this article, we make one of the first attempts to investigate the quality assessment problem of low-light image enhancement. To facilitate the study of objective image quality assessment (IQA), we first build a large-scale low-light image enhancement quality (LIEQ) database. The LIEQ database includes 1,000 light-enhanced images, generated from 100 low-light images using 10 LIEAs. Rather than evaluating the quality of light-enhanced images directly, which is more difficult, we propose to use the multi-exposure fused (MEF) image and the stack-based high dynamic range (HDR) image as references and evaluate the quality of low-light enhancement following a full-reference (FR) quality assessment routine. We observe that the distortions introduced by low-light enhancement differ significantly from those considered in well-studied traditional IQA databases, and that current state-of-the-art FR IQA models are not suitable for evaluating their quality. Therefore, we propose a new FR low-light image enhancement quality assessment (LIEQA) index that evaluates image quality from four aspects: luminance enhancement, color rendition, noise evaluation, and structure preservation, which capture the key aspects of low-light enhancement. Experimental results on the LIEQ database show that the proposed LIEQA index outperforms state-of-the-art FR IQA models. LIEQA can act as an evaluator for various low-light enhancement algorithms and systems. To the best of our knowledge, this article is the first comprehensive study of its kind on low-light image enhancement quality assessment.
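The abstract names the four aspects the LIEQA index evaluates but not how they are fused; below is a minimal NumPy sketch assuming a weighted linear combination of four crude sub-scores. Every sub-score function and weight here is an illustrative placeholder, not the paper's actual index.

```python
# Illustrative-only sketch: fuse four crude quality proxies (luminance, color,
# noise, structure) into one score. None of these formulas come from the paper.
import numpy as np

def luminance_score(ref, img):      # mean-brightness agreement proxy
    return 1.0 - abs(float(img.mean()) - float(ref.mean())) / 255.0

def color_score(ref, img):          # per-channel mean-shift proxy, (H, W, 3) input
    return 1.0 - float(np.abs(img.mean(axis=(0, 1)) - ref.mean(axis=(0, 1))).mean()) / 255.0

def noise_score(img):               # high-pass residual energy proxy
    return 1.0 / (1.0 + float(np.abs(img[1:] - img[:-1]).mean()))

def structure_score(ref, img):      # normalized cross-correlation proxy
    r, i = ref - ref.mean(), img - img.mean()
    return float((r * i).mean() / (r.std() * i.std() + 1e-8))

def lieqa_like_index(ref, img, weights=(0.3, 0.25, 0.2, 0.25)):
    subs = [luminance_score(ref, img), color_score(ref, img),
            noise_score(img), structure_score(ref, img)]
    return float(np.dot(weights, subs))                  # assumed linear fusion
```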


Author(s): Wei Gao, Linjie Zhou, Lvfang Tao

View synthesis (VS) for light field images is a very time-consuming task due to the great quantity of involved pixels and the intensive computations, which may prevent its use in practical real-time three-dimensional systems. In this article, we propose an acceleration approach for deep learning-based light field view synthesis that can significantly reduce computation by using compact-resolution (CR) representation and super-resolution (SR) techniques, as well as light-weight neural networks. The proposed architecture consists of three cascaded neural networks: a CR network that generates the compact representation of the original input views, a VS network that synthesizes new views from the down-scaled compact views, and an SR network that reconstructs high-quality views at full resolution. All of these networks are jointly trained with the integrated losses of the CR, VS, and SR networks. Moreover, exploiting the redundancy of deep neural networks, we use an efficient light-weighting strategy that prunes filters to simplify the networks and accelerate inference. Experimental results demonstrate that the proposed method greatly reduces processing time and is much more computationally efficient while providing competitive image quality.
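A hedged PyTorch sketch of the joint objective follows: one integrated loss supervises all three stages so that gradients flow through the whole CR -> VS -> SR cascade. The per-stage loss type (L1) and the stage weights are assumptions.

```python
# Hedged sketch of the integrated objective over the CR -> VS -> SR cascade.
# The L1 terms and the stage weights are assumptions for illustration.
import torch.nn.functional as F

def integrated_loss(cr_out, cr_target, vs_out, vs_target, sr_out, sr_target,
                    w_cr=1.0, w_vs=1.0, w_sr=1.0):
    # each term supervises one stage; joint training backpropagates through all
    return (w_cr * F.l1_loss(cr_out, cr_target)
            + w_vs * F.l1_loss(vs_out, vs_target)
            + w_sr * F.l1_loss(sr_out, sr_target))
```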


Author(s): Jinwei Wang, Wei Huang, Xiangyang Luo, Yun-Qing Shi, Sunil Kr. Jha

Due to the popularity of the JPEG format in recent years, JPEG images inevitably undergo editing operations. Thus, some tampered images will carry traces of non-aligned double JPEG (NA-DJPEG) compression. By detecting the presence of NA-DJPEG compression, one can verify whether a given JPEG image has been tampered with. However, only a few methods can identify NA-DJPEG compressed images when the primary quality factor is greater than the secondary quality factor. To address this challenging task, this article proposes a novel feature extraction scheme based on optimized pixel difference (OPD), a new measure of blocking artifacts. First, the three color channels (RGB) of the reconstructed image obtained by decompressing a given JPEG color image are mapped into spherical coordinates to compute an amplitude and two angles (azimuth and zenith). Then, 16 histograms of OPD along the horizontal and vertical directions are calculated for the amplitude and the two angles, respectively. Finally, a feature set formed by arranging the bin values of these histograms is used for binary classification. Experiments demonstrate the effectiveness of the proposed method, and the results show that it significantly outperforms existing typical methods on this task.
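The first step, mapping RGB pixels to spherical coordinates, can be sketched directly; the angle conventions below (azimuth in the R-G plane, zenith from the B axis) are assumptions, since the abstract does not fix them.

```python
# Hedged NumPy sketch of the spherical mapping step: each RGB pixel becomes an
# amplitude plus azimuth and zenith angles. Angle conventions are assumptions.
import numpy as np

def rgb_to_spherical(img):
    """img: float array of shape (H, W, 3) with R, G, B planes."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    amplitude = np.sqrt(r**2 + g**2 + b**2)
    azimuth = np.arctan2(g, r)                                # angle in the R-G plane
    zenith = np.arccos(np.clip(b / np.maximum(amplitude, 1e-8), -1.0, 1.0))
    return amplitude, azimuth, zenith
```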


Author(s): Jianhai Zhang, Zhiyong Feng, Yong Su, Meng Xing

Owing to the merits of high-order statistics and Riemannian geometry, the covariance matrix has become a generic feature representation for action recognition: an independent action can be represented by empirical statistics over all of its pose samples. Covariance has two major problems: (1) it is prone to be singular, so actions fail to be represented properly, and (2) it lacks global action/pose-aware information, so its expressive and discriminative power is limited. In this article, we propose a novel Bayesian covariance representation with a prior regularization method to solve these problems. Specifically, the covariance is viewed as a parametric maximum likelihood estimate of a Gaussian distribution over the local poses of an independent action. Then, a Global Informative Prior (GIP) with sufficient statistics is generated over the global poses to regularize the covariance. In this way, (1) singularity is greatly relieved thanks to the sufficient statistics, and (2) the global pose information in the GIP makes the Bayesian covariance theoretically equivalent to a saliency-weighted covariance over the global action poses, so the discriminative characteristics of actions are represented more clearly. Experimental results show that our Bayesian covariance with GIP efficiently improves action recognition performance. On some databases, it outperforms state-of-the-art variants based on kernels, temporal-order structures, and saliency-weighted attention, among others.
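As a sketch of how a global prior can regularize a per-action covariance, the snippet below blends the per-action sample covariance with a covariance computed over all poses, in a conjugate posterior-mean style. The blending rule and the strength kappa are assumptions, not the paper's exact GIP construction.

```python
# Hedged sketch: shrink a per-action covariance toward a global prior built
# from all poses. The conjugate-style blend and kappa are assumptions.
import numpy as np

def regularized_covariance(action_poses, global_poses, kappa=10.0):
    """action_poses: (n, d) poses of one action; global_poses: (N, d) all poses."""
    prior = np.cov(global_poses, rowvar=False)           # global informative prior
    sample = np.cov(action_poses, rowvar=False)          # per-action MLE covariance
    n = len(action_poses)
    return (n * sample + kappa * prior) / (n + kappa)    # posterior-mean style blend
```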


Author(s): Junyi Wu, Yan Huang, Qiang Wu, Zhipeng Gao, Jianqiang Zhao, ...

The task of person re-identification (re-ID) is to find the same pedestrian across non-overlapping camera views. In general, person re-ID performance can be degraded by background clutter. However, existing segmentation algorithms cannot produce perfect foreground masks that cleanly exclude background information. In addition, if the background is completely removed, some discriminative ID-related cues (e.g., a backpack or a companion) may be lost. In this article, we design a dual-stream network consisting of a Provider Stream (P-Stream) and a Receiver Stream (R-Stream). The R-Stream performs an a priori optimization operation on foreground information. The P-Stream acts as a pusher that guides the R-Stream to concentrate on foreground information as well as useful ID-related cues in the background. The proposed dual-stream network makes full use of the a priori optimization and guided-learning strategies to learn strong foreground representations along with useful ID-related information from the background. Our method achieves Rank-1 accuracy of 95.4% on Market-1501, 89.0% on DukeMTMC-reID, 78.9% on CUHK03 (labeled), and 75.4% on CUHK03 (detected), outperforming state-of-the-art methods.


Author(s): Mohannad Alahmadi, Peter Pocta, Hugh Melvin

Web Real-Time Communication (WebRTC) combines a set of standards and technologies to enable high-quality audio, video, and auxiliary data exchange in web browsers and mobile applications. It enables peer-to-peer multimedia sessions over IP networks without additional plugins. The Opus codec, deployed as the default audio codec for speech and music streaming in WebRTC, supports a wide range of bitrates covering narrowband, wideband, and super-wideband up to fullband bandwidths. Users of IP-based telephony always demand high-quality audio. Beyond users' expectations, their emotional state, the content type, and many other psychological factors, together with network quality of service and distortions introduced at the end terminals, determine their quality of experience. To measure the quality experienced by the end user of a voice transmission service, one can use the E-model standardized in ITU-T Rec. G.107 (the narrowband version), ITU-T Rec. G.107.1 (the wideband version), and the most recent ITU-T Rec. G.107.2 extension for super-wideband. In this work, we present a quality-of-experience model built on the E-model to quantify the impact of coding and packet loss on the quality perceived by the end user in WebRTC speech applications. Based on the computed Mean Opinion Score, a real-time adaptive codec-parameter switching mechanism switches to the optimal codec bitrate under the current network conditions. We present evaluation results showing the effectiveness of the proposed approach compared with the default codec configuration in WebRTC.
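To illustrate the switching idea, here is a hedged sketch using the narrowband G.107 relations: packet loss raises the effective equipment impairment, the resulting R-factor maps to MOS via the standard conversion, and the bitrate with the best predicted MOS is chosen. The per-bitrate (Ie, Bpl) table is a hypothetical stand-in for calibrated Opus values, and the paper also covers the wideband and super-wideband extensions.

```python
# Hedged sketch of E-model-driven codec switching (narrowband G.107 form).
# The per-bitrate impairment table is hypothetical, not calibrated Opus data.
R0 = 93.2                                   # default maximum R-factor in G.107

def r_factor(ie, bpl, ppl):
    # effective equipment impairment under random packet loss (G.107)
    ie_eff = ie + (95.0 - ie) * ppl / (ppl + bpl)
    return R0 - ie_eff

def mos(r):
    # standard G.107 R-to-MOS conversion, clamped to the valid range
    r = max(0.0, min(100.0, r))
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

CODEC_MODES = {8000: (25.0, 10.0), 16000: (15.0, 15.0), 32000: (5.0, 20.0)}

def pick_bitrate(packet_loss_pct):
    # switch to the bitrate whose predicted MOS is highest right now
    return max(CODEC_MODES,
               key=lambda br: mos(r_factor(*CODEC_MODES[br], packet_loss_pct)))
```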


Author(s): Bo Zhang, Rui Zhang, Niccolo Bisagno, Nicola Conci, Francesco G. B. De Natale, ...

In this article, we propose a framework for crowd behavior prediction in complicated scenarios. The fundamental framework follows the standard encoder-decoder scheme, built upon long short-term memory (LSTM) modules to capture the temporal evolution of crowd behaviors. To model interactions among humans and environments, we embed both social and physical attention mechanisms into the LSTM. The social attention component models the interactions among different pedestrians, whereas the physical attention component helps to understand the spatial configuration of the scene. Since pedestrians' behaviors exhibit multi-modal properties, we use a generative model to produce multiple acceptable future paths. The proposed framework not only predicts an individual's trajectory accurately but also forecasts ongoing group behaviors by leveraging the coherent filtering approach. Experiments carried out on standard crowd benchmarks (namely, the ETH, UCY, CUHK Crowd, and CrowdFlow datasets) demonstrate that the proposed framework is effective in forecasting crowd behaviors in complex scenarios.
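The backbone scheme can be sketched as a plain LSTM encoder-decoder over (x, y) positions; the social and physical attention modules and the generative sampling are omitted here, and all dimensions are assumptions.

```python
# Hedged PyTorch sketch of the encoder-decoder backbone only: an LSTM encodes
# the observed track and an LSTM cell rolls out future positions. Attention and
# the generative head are omitted; all dimensions are assumptions.
import torch
import torch.nn as nn

class TrajectoryEncoderDecoder(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(2, hidden)
        self.head = nn.Linear(hidden, 2)               # per-step (dx, dy)

    def forward(self, obs, pred_len=12):               # obs: (N, T_obs, 2)
        _, (h, c) = self.encoder(obs)
        h, c = h[0], c[0]
        pos, out = obs[:, -1], []
        for _ in range(pred_len):                      # autoregressive roll-out
            h, c = self.decoder(pos, (h, c))
            pos = pos + self.head(h)                   # displacement prediction
            out.append(pos)
        return torch.stack(out, dim=1)                 # (N, pred_len, 2)
```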


Author(s): Prerna Mishra, Santosh Kumar, Mithilesh Kumar Chaube

Chart images exhibit significant variability that makes each image different from the others, even when they belong to the same class or category. Chart classification is a major challenge because each chart class varies in features, structure, and noise. Moreover, because the dissimilar features are not tied to the structure of the chart, modeling these variations for automatic chart recognition is a challenging task. In this article, we present a novel dissimilarity-based learning model for classifying similarly structured but diverse charts. Our approach jointly learns the features of both dissimilar and similar regions. The model is trained with an improved loss function that fuses a structural variation-aware dissimilarity index with regularization parameters, making the model more attentive to dissimilar regions. The dissimilarity index enhances the discriminative power of features learned not only from dissimilar regions but also from similar regions. Extensive comparative evaluations demonstrate that our approach significantly outperforms benchmark methods, including both traditional and deep learning models, on publicly available datasets.
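One way to read "an improved loss fused with a dissimilarity index" is cross-entropy plus a weighted term that penalizes feature similarity across different chart classes; the sketch below implements that reading. The pairwise cosine term is a stand-in assumption for the paper's structural variation-aware index.

```python
# Hedged sketch: cross-entropy fused with a dissimilarity-aware term. The
# cosine-based penalty stands in for the paper's structural index (assumption).
import torch
import torch.nn.functional as F

def fused_loss(logits, features, labels, lam=0.1):
    ce = F.cross_entropy(logits, labels)
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                    # pairwise cosine similarity
    diff = (labels[:, None] != labels[None, :]).float()
    # penalize similarity between samples from different chart classes
    dissim = (sim * diff).sum() / diff.sum().clamp(min=1.0)
    return ce + lam * dissim
```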

