Intelligent Video Highlights Generation with Front-Camera Emotion Sensing

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1035
Author(s):  
Hugo Meyer ◽  
Peter Wei ◽  
Xiaofan Jiang

In this paper, we present HOMER, a cloud-based system for video highlight generation that enables automated, relevant, and flexible segmentation of videos. Our system outperforms state-of-the-art solutions by fusing internal video content-based features with the user’s emotion data. While current research mainly focuses on creating video summaries without the use of affective data, our solution achieves the subjective task of detecting highlights by leveraging human emotions. In two separate experiments, one with videos filmed with a dual-camera setup and one with home videos randomly picked from Microsoft’s Video Titles in the Wild (VTW) dataset, HOMER demonstrates an improvement of up to 38% in F1-score over the baseline, while not requiring any external hardware. We demonstrate both the portability and scalability of HOMER through the implementation of two smartphone applications.
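
The abstract gives no implementation details; purely as an illustration of fusing content-based and emotion-based evidence for highlight selection, the minimal Python sketch below combines hypothetical per-segment scores with a weighted sum and a threshold. The function name, weighting, and threshold are assumptions for illustration, not HOMER's actual design.

```python
import numpy as np

def fuse_highlight_scores(content_scores, emotion_scores, alpha=0.6, threshold=0.5):
    """Toy late fusion of per-segment scores (not HOMER's actual algorithm).

    content_scores : content-based scores in [0, 1], one per video segment
    emotion_scores : emotion-intensity scores in [0, 1], one per video segment
    alpha          : assumed weight given to the content stream
    Returns the indices of segments selected as highlights.
    """
    content = np.asarray(content_scores, dtype=float)
    emotion = np.asarray(emotion_scores, dtype=float)
    fused = alpha * content + (1.0 - alpha) * emotion
    return np.flatnonzero(fused >= threshold)

# Example with 6 video segments
print(fuse_highlight_scores([0.2, 0.8, 0.4, 0.9, 0.1, 0.6],
                            [0.1, 0.7, 0.9, 0.8, 0.2, 0.3]))
```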

2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for rain streak removal from images, with specific interest in evaluating the results of the processing with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for evaluating text detection and recognition in bad weather conditions. Experimental results on this dataset show that our model outperforms the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
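
The abstract does not name the two image quality metrics; assuming they are PSNR and SSIM (a common choice in deraining work, but an assumption here), a minimal evaluation snippet using scikit-image could look like the following. File names are placeholders.

```python
# Hypothetical evaluation snippet: compares a derained image against its clean
# reference with PSNR and SSIM. The metric choice and file names are assumptions.
from skimage import io, img_as_float
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

clean = img_as_float(io.imread("clean.png"))
derained = img_as_float(io.imread("derained.png"))

psnr = peak_signal_noise_ratio(clean, derained, data_range=1.0)
ssim = structural_similarity(clean, derained, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```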


2019 ◽  
Vol 9 (12) ◽  
pp. 2535
Author(s):  
Di Fan ◽  
Hyunwoo Kim ◽  
Jummo Kim ◽  
Yunhui Liu ◽  
Qiang Huang

Face attributes prediction has an increasing number of applications in human–computer interaction, face verification, and video surveillance. Various studies show that dependencies exist between face attributes. A multi-task learning architecture can build a synergy among correlated tasks by parameter sharing in the shared layers. However, the dependencies between the tasks have been ignored in the task-specific layers of most multi-task learning architectures. Thus, how to further boost the performance of individual tasks by using task dependencies among face attributes is quite challenging. In this paper, we propose a multi-task learning architecture that uses task dependencies for face attributes prediction, and we evaluate its performance on the tasks of smile and gender prediction. The attention modules designed in the task-specific layers of our proposed architecture are used for learning task-dependent disentangled representations. The experimental results demonstrate the effectiveness of the proposed network in comparison with a traditional multi-task learning architecture and state-of-the-art methods on the Faces of the World (FotW) and Labeled Faces in the Wild-a (LFWA) datasets.
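
As a rough illustration of the general pattern described above (shared layers plus task-specific attention before each prediction head), here is a minimal PyTorch sketch. The layer sizes and the channel-attention form are assumptions for illustration and do not reproduce the paper's proposed architecture.

```python
import torch
import torch.nn as nn

class TaskAttentionHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        # Channel attention that re-weights the shared features for this task.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid())
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feats):                       # feats: (B, C, H, W)
        weights = self.attn(feats)                  # (B, C)
        pooled = feats.mean(dim=(2, 3)) * weights   # task-attended descriptor
        return self.classifier(pooled)

class MultiTaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(                # shared layers
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.smile_head = TaskAttentionHead(64, 2)   # smile / no smile
        self.gender_head = TaskAttentionHead(64, 2)  # gender prediction

    def forward(self, x):
        feats = self.shared(x)
        return self.smile_head(feats), self.gender_head(feats)

smile_logits, gender_logits = MultiTaskNet()(torch.randn(4, 3, 128, 128))
```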


Author(s):  
Bingqian Lu ◽  
Jianyi Yang ◽  
Weiwen Jiang ◽  
Yiyu Shi ◽  
Shaolei Ren

Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device has been common practice in the state of the art, this is a very time-consuming process that lacks scalability in the presence of extremely diverse devices. In this work, we address the scalability challenge by exploiting latency monotonicity: the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique to significantly boost it. Finally, we validate our approach and conduct experiments with devices of different platforms on multiple mainstream search spaces, including MobileNet-V2, MobileNet-V3, NAS-Bench-201, ProxylessNAS and FBNet. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as existing per-device NAS, while avoiding the prohibitive cost of building a latency predictor for each device.
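
One natural way to quantify latency monotonicity between a proxy device and a target device is the Spearman rank correlation of per-architecture latencies, as in the minimal sketch below. The latency numbers are made up for illustration.

```python
# Sketch: measure how well architecture latency rankings transfer between devices.
from scipy.stats import spearmanr

proxy_latencies_ms  = [12.1, 25.4, 18.9, 31.0, 15.2]   # device A, 5 architectures
target_latencies_ms = [20.3, 41.8, 30.5, 55.7, 24.9]   # device B, same architectures

rho, _ = spearmanr(proxy_latencies_ms, target_latencies_ms)
print(f"Spearman rank correlation: {rho:.2f}")
# A rho close to 1 suggests architectures searched on the proxy device can be
# re-used on the target device without re-running hardware-aware NAS.
```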


Author(s):  
Yongyi Tang ◽  
Lin Ma ◽  
Lianqiang Zhou

Appearance and motion are two key components used to depict and characterize video content. Currently, two-stream models achieve state-of-the-art performance on video classification. However, extracting motion information, specifically in the form of optical flow features, is extremely computationally expensive, especially for large-scale video classification. In this paper, we propose a motion hallucination network, namely MoNet, to imagine the optical flow features from the appearance features, without relying on optical flow computation. Specifically, MoNet models the temporal relationships of the appearance features and exploits the contextual relationships of the optical flow features with concurrent connections. Extensive experimental results demonstrate that the proposed MoNet can effectively and efficiently hallucinate the optical flow features, which together with the appearance features consistently improve video classification performance. Moreover, MoNet cuts down almost half of the computational and data-storage burden of two-stream video classification. Our code is available at: https://github.com/YongyiTang92/MoNet-Features
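
For intuition only, the following PyTorch sketch shows the hallucination idea in its simplest form: a small temporal module maps appearance features to flow-like features that can then be fused with the appearance stream. It is not the MoNet architecture; dimensions and layers are assumptions.

```python
import torch
import torch.nn as nn

class FlowHallucinator(nn.Module):
    """Toy module: appearance-feature sequence -> pseudo flow-like features."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        # Temporal convolution over the appearance-feature sequence.
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.project = nn.Linear(feat_dim, feat_dim)

    def forward(self, appearance):          # appearance: (B, T, D)
        x = self.temporal(appearance.transpose(1, 2)).transpose(1, 2)
        return self.project(torch.relu(x))  # hallucinated features: (B, T, D)

appearance = torch.randn(2, 16, 1024)       # 2 clips, 16 frames each
flow_like = FlowHallucinator()(appearance)
fused = torch.cat([appearance, flow_like], dim=-1)  # two-stream style fusion
```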


2021 ◽  
Vol 11 (16) ◽  
pp. 7472
Author(s):  
Mario Montagud ◽  
Cristian Hurtado ◽  
Juan Antonio De Rus ◽  
Sergi Fernández

All multimedia services must be accessible. Accessibility for multimedia content is typically provided by means of access services, of which subtitling is likely the most widespread approach. To date, numerous recommendations and solutions for subtitling classical 2D audiovisual services have been proposed. Similarly, recent efforts have been devoted to devising adequate subtitling solutions for VR360 video content. This paper, for the first time, extends the existing approaches to address the challenges that remain for efficiently subtitling 3D Virtual Reality (VR) content by exploring two key requirements: presentation modes and guiding methods. By leveraging insights from earlier work on VR360 content, this paper proposes novel presentation modes and guiding methods that not only provide the freedom to explore omnidirectional scenes, but also address the additional specificities of 3D VR compared to VR360 content: depth, 6 Degrees of Freedom (6DoF), and viewing perspectives. The obtained results show that always-visible subtitles and the proposed novel comic-style presentation mode are significantly more appropriate than state-of-the-art fixed-positioned subtitles, particularly in terms of immersion, ease and comfort of reading, and identification of speakers, when applied to professional pieces of content with limited displacement of speakers and limited 6DoF (i.e., users are not expected to navigate around the virtual environment). Similarly, even in such limited-movement scenarios, the results show that the use of indicators (arrows) as a guiding method is well received. Overall, the paper provides relevant insights and paves the way for efficiently subtitling 3D VR content.


Author(s):  
Tapiwanashe Miranda Sanyanga ◽  
Munyaradzi Sydney Chinzvende ◽  
Tatenda Duncan Kavu ◽  
John Batani

Due to the increase in video content being generated by surveillance cameras and filming, video analysis becomes imperative. It can be tedious to watch hours of footage captured by a surveillance camera just to find the desired segment. Current state-of-the-art video analysis methods do not address the problem of searching for and localizing a particular object in a video using the name of the object as a query, and returning only the segment of the video clip showing instances of that object. In this research, the authors combine implementations from existing work and apply a frame-dropping algorithm to produce a shorter, trimmed video clip showing the target object specified by the search tag. The resulting video is short and specific to the object of interest.
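
A toy version of the frame-dropping step could look like the sketch below: frames in which a detector does not report the queried label are simply discarded when writing the trimmed clip. The detect_labels callable is a placeholder for whatever object detector is used; it is not taken from the paper.

```python
import cv2

def trim_video(in_path, out_path, query_label, detect_labels):
    """Keep only frames in which `detect_labels(frame)` contains `query_label`."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if query_label in detect_labels(frame):  # drop frames without the object
            writer.write(frame)
    cap.release()
    writer.release()
```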


Author(s):  
Hichem Karray ◽  
Monji Kherallah ◽  
Mohamed Ben Halima ◽  
Adel M. Alimi

The authors propose a framework for the multimodal analysis of Arabic news broadcasts, which helps users of pervasive devices browse quickly through a news archive; their solution integrates several aspects, such as summarization, textual content indexing, and on-line handwriting recognition. First, the summarization process, based on a genetic algorithm, accelerates video content browsing. Second, the indexing process operates on the video summaries and is based on text recognition. Finally, users communicate by writing keywords on a PDA screen and keep only the summaries related to that topic. The PDA contains an on-line Arabic handwriting recognition system based on visual coding and a genetic algorithm.
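
As a loose illustration of using a genetic algorithm for summary selection, the sketch below evolves binary masks over shots and keeps the fittest selection. The fitness function and all parameters are invented for illustration and are not taken from the framework described above.

```python
import random

def evolve_summary(shot_scores, summary_len=5, pop_size=30, generations=50):
    """Pick `summary_len` shot indices with a tiny genetic algorithm."""
    n = len(shot_scores)

    def fitness(mask):
        chosen = [s for s, m in zip(shot_scores, mask) if m]
        if len(chosen) != summary_len:
            return -1.0                       # enforce the summary length
        return sum(chosen)                    # reward informative shots

    def random_individual():
        mask = [0] * n
        for i in random.sample(range(n), summary_len):
            mask[i] = 1
        return mask

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # keep the fitter half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)
            child = a[:cut] + b[cut:]         # one-point crossover
            i = random.randrange(n)           # single-bit mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [i for i, m in enumerate(best) if m]

print(evolve_summary([random.random() for _ in range(20)]))
```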


2016 ◽  
Vol 27 (8) ◽  
pp. 1275-1288 ◽  
Author(s):  
Wolfgang Fuhl ◽  
Marc Tonsen ◽  
Andreas Bulling ◽  
Enkelejda Kasneci

2021 ◽  
Vol 6 (1) ◽  
pp. 1-5
Author(s):  
Zobeir Raisi ◽  
Mohamed A. Naiel ◽  
Paul Fieguth ◽  
Steven Wardell ◽  
John Zelek

The reported accuracy of recent state-of-the-art text detection methods, mostly deep learning approaches, is in the order of 80% to 90% on standard benchmark datasets. These methods have relaxed some of the restrictions on structured text and environment (i.e., "in the wild") that are usually required for classical OCR to function properly. Even with this relaxation, there are still circumstances where these state-of-the-art methods fail. Several remaining challenges in wild images, like in-plane rotation, illumination reflection, partial occlusion, complex font styles, and perspective distortion, cause existing methods to perform poorly. In order to evaluate current approaches in a formal way, we standardize the datasets and evaluation metrics, whose inconsistency had made comparison between these methods difficult in the past. We use three benchmark datasets for our evaluations: ICDAR13, ICDAR15, and COCO-Text V2.0. The objective of the paper is to quantify the current shortcomings and to identify the challenges for future text detection research.
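
For reference, the standard IoU-based matching used to score text detectors (and to compute precision, recall, and F1) can be sketched as follows; the 0.5 IoU threshold and the (x1, y1, x2, y2) box format are common conventions, assumed here rather than taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at IoU >= thr."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched_gt and iou(p, g) >= thr:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

print(detection_f1([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10)]))
```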

