video content analysis Latest Research Papers

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

Proceedings of the ACM on Measurement and Analysis of Computing Systems ◽

10.1145/3491046 ◽

2021 ◽

Vol 5 (3) ◽

pp. 1-34

Author(s):

Bingqian Lu ◽

Jianyi Yang ◽

Weiwen Jiang ◽

Yiyu Shi ◽

Shaolei Ren

Keyword(s):

State Of The Art ◽

Autonomous Driving ◽

Pareto Optimal ◽

Video Content ◽

Fast Evaluation ◽

Video Content Analysis ◽

Search Spaces ◽

Neural Architecture ◽

Real World Applications ◽

Prohibitive Cost

Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device has been commonly used in state of the art, this is a very time-consuming process, lacking scalability in the presence of extremely diverse devices. In this work, we address the scalability challenge by exploiting latency monotonicity --- the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices, without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique to significantly boost the latency monotonicity. Finally, we validate our approach and conduct experiments with devices of different platforms on multiple mainstream search spaces, including MobileNet-V2, MobileNet-V3, NAS-Bench-201, ProxylessNAS and FBNet. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as the existing per-device NAS, while avoiding the prohibitive cost of building a latency predictor for each device.

Attitudes towards the ‘Shisha No Thanks’ campaign video: Content Analysis of Facebook Comments (Preprint)

10.2196/preprints.33902 ◽

2021 ◽

Author(s):

Lilian Chan ◽

Ben Harris-Roxas ◽

Becky Freeman ◽

Ross MacKenzie ◽

Dalya Karezi ◽

...

Keyword(s):

Social Media ◽

Health Workers ◽

Research Question ◽

Application Programming Interface ◽

Target Audience ◽

Health Campaigns ◽

Video Content Analysis ◽

Stop Smoking ◽

Campaign Message ◽

Insight Into

BACKGROUND While social media is commonly used in public health campaigns, there is a gap in our understanding of what happens after the campaign is seen by the target audience. Frequently reported social media metrics, such as reach and engagement, do not reflect whether the audience accepts the campaign, or more importantly, whether they take up the campaign’s message. This study investigates whether analysing social media comments can provide insight into how a campaign is received, by examining Facebook comments about the Shisha No Thanks campaign. Shisha No Thanks aims to raise awareness of the harms of shisha (also known as waterpipe) smoking among young people from Arabic-speaking background in Sydney, Australia. A campaign video was produced and shared widely on social media, where it received over 10,000 Facebook comments. OBJECTIVE This study aims to understand how the Shisha No Thanks video was received by the target audience by analysing Facebook comments posted to it. Specifically, this study aims to determine whether the audience accepted or rejected the campaign’s message. METHODS A sample of the Facebook comments was extracted using Facebook’s Graph API (application programming interface). The study team developed content categories consistent with the research question. The categories were: ‘Accept’ the campaign message, ‘Reject’, and ‘Unclear’. Subcategories were developed based on common themes in each category. The categories were reviewed by Cultural Support Workers (health workers who support multicultural communities) to ensure the cultural meanings of the comments were captured. Each comment was then coded by three team members, and only assigned a category if there was agreement by at least two members. RESULTS The Shisha No Thanks video reached 435,811 people and received over 11,000 comments. Of the n=4,990 comments that were sampled, 9.1% (n=456) accepted the campaign message, 22.9% (n=1,144) rejected the message, 21.8% (n=1,089) were unclear, and 46.1% (n=2,301) contained only tagged names. Of our sample, 2.8% (n=138) indicated the commenter took on board the campaign message by expressing an intention to stop smoking shisha, or asking a friend to stop smoking shisha. Of the comments that showed rejection of the campaign, the majority were people dismissing the campaign by laughing at it or expressing pro-shisha sentiments. CONCLUSIONS This study demonstrates that conducting content analyses of social media comments can provide important insight into how a campaign message is received by the target audience. Analysing Facebook comments on the Shisha No Thanks video showed that almost one in 10 people who commented accepted the campaign message, and almost 3% took up the campaign message. On the other hand, analysing the Facebook comments also provides important insight into perspectives of people who did not accept the campaign message, which can be useful in developing future interventions on this issue.

A Deep Multimodal Model for Predicting Affective Responses Evoked by Movies Based on Shot Segmentation

Security and Communication Networks ◽

10.1155/2021/7650483 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Chunxiao Wang ◽

Jingjing Zhang ◽

Wei Jiang ◽

Shuang Wang

Keyword(s):

Short Term Memory ◽

Feature Fusion ◽

Pearson Correlation ◽

Feature Representation ◽

Visual Features ◽

Visual Feature ◽

Temporal Attention ◽

Video Content Analysis ◽

Wide Range ◽

Experienced Emotion

Predicting the emotions evoked in a viewer watching movies is an important research element in affective video content analysis over a wide range of applications. Generally, the emotion of the audience is evoked by the combined effect of the audio-visual messages of the movies. Current research has mainly used rough middle- and high-level audio and visual features to predict experienced emotions, but combining semantic information to refine features to improve emotion prediction results is still not well studied. Therefore, on the premise of considering the time structure and semantic units of a movie, this paper proposes a shot-based audio-visual feature representation method and a long short-term memory (LSTM) model incorporating a temporal attention mechanism for experienced emotion prediction. First, the shot-based audio-visual feature representation defines a method for extracting and combining audio and visual features of each shot clip, and the advanced pretraining models in the related audio-visual tasks are used to extract the audio and visual features with different semantic levels. Then, four components are included in the prediction model: a nonlinear multimodal feature fusion layer, a temporal feature capture layer, a temporal attention layer, and a sentiment prediction layer. This paper focuses on experienced emotion prediction and evaluates the proposed method on the extended COGNIMUSE dataset. The method performs significantly better than the state-of-the-art while significantly reducing the number of calculations, with increases in the Pearson correlation coefficient (PCC) from 0.46 to 0.62 for arousal and from 0.18 to 0.34 for valence in experienced emotion.

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/104 ◽

2021 ◽

Author(s):

Yuqi Huo ◽

Mingyu Ding ◽

Haoyu Lu ◽

Ziyuan Huang ◽

Mingqian Tang ◽

...

Keyword(s):

Content Analysis ◽

State Of The Art ◽

Representation Learning ◽

Video Content ◽

Video Content Analysis ◽

Video Representation ◽

Natural Choice ◽

Jigsaw Puzzles

This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature and a representation learned by detecting spatiotemporal continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in the experiments, this task turns out to be intractable. We thus propose Constrained Spatiotemporal Jigsaw (CSJ) whereby the 3D jigsaws are formed in a constrained manner to ensure that large continuous spatiotemporal cuboids exist. This provides sufficient cues for the model to reason about the continuity. Instead of solving them directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable. The four tasks aim to learn representations sensitive to spatiotemporal continuity at both the local and global levels. Extensive experiments show that our CSJ achieves state-of-the-art on various benchmarks.

Group Behavior Pattern Recognition Algorithm Based on Spatio-Temporal Graph Convolutional Networks

Scientific Programming ◽

10.1155/2021/2934943 ◽

2021 ◽

Vol 2021 ◽

pp. 1-8

Author(s):

Xinfang Chen ◽

Venkata Dinavahi

Keyword(s):

Pattern Recognition ◽

Behavior Pattern ◽

Group Behavior ◽

Recognition Algorithm ◽

Pattern Recognition Algorithm ◽

Complex Scene ◽

Video Content Analysis ◽

Density Analysis ◽

Temporal Graph ◽

Spatio Temporal

With the rapid growth of population, more diverse crowd activities, and the rapid development of socialization process, group scenes are becoming more common, so the demand for modeling, analyzing, and understanding group behavior data in video is increasing. Compared with the previous work on video content analysis, factors such as the increasing number of people in the group video and the more complex scene make the analysis of group behavior in video face great challenges. Therefore, a group behavior pattern recognition algorithm based on spatio-temporal graph convolutional network is proposed in this paper, aiming at group density analysis and group behavior recognition in the video. A crowd detection and location method based on density map regression-guided classification was designed. Finally, a crowd behavior analysis method based on density grade division was designed to complete crowd density analysis and video group behavior detection. In addition, this paper also proposes to extract spatio-temporal features of crowd posture and density by using the double-flow spatio-temporal map network model, so as to effectively capture the differentiated movement information among different groups. Experimental results on public datasets show that the proposed method has high accuracy and can effectively predict group behavior.

Towards Optimized IoT-based Context-aware Video Content Analysis Framework

10.1109/wf-iot51360.2021.9595891 ◽

2021 ◽

Author(s):

Gad Gad ◽

Eyad Gad ◽

Bassem Mokhtar

Keyword(s):

Content Analysis ◽

Context Aware ◽

Video Content ◽

Analysis Framework ◽

Video Content Analysis

Conditional random field based image and video content analysis

10.32920/ryerson.14660733 ◽

2021 ◽

Author(s):

Xiaofeng Wang

Keyword(s):

Content Analysis ◽

Random Field ◽

Graphical Model ◽

Research Effort ◽

Conditional Random Field ◽

Semantic Gap ◽

Video Content ◽

Video Content Analysis ◽

Image Labeling ◽

New Feature

Image and video content analysis is an interesting, meaningful and challenging topic. In recent years much of the research effort in the multimedia field focuses on indexing and retrieval. Semantic gap between low-level features and high-level content is a bottleneck in most systems. To bridge the semantic gap, new content analysis models need to be developed. In this thesis, algorithms based on a relatively new graphical model, called the conditional random field (CRF) model, are developed for two closely-related problems in content analysis: image labeling and video content analysis. The CRF model can represent spatial interactions in image labeling and temporal interactions in video content analysis. New feature functions are designed to better represent the feature distributions. The mixture feature functions are used in image labeling for databases with nature images, and the independent component analysis (ICA) mixture function is applied in sports video content analysis. The spatial dependence of image parts and the temporal dependence of video frames can be explored by the CRF model more effectively using new feature functions. For image labeling with large databases, the content-based image retrieval method is combined with the CRF image labeling model successfully.

Conditional random field based image and video content analysis

10.32920/ryerson.14660733.v1 ◽

2021 ◽

Author(s):

Xiaofeng Wang

Keyword(s):

Content Analysis ◽

Random Field ◽

Graphical Model ◽

Research Effort ◽

Conditional Random Field ◽

Semantic Gap ◽

Video Content ◽

Video Content Analysis ◽

Image Labeling ◽

New Feature

Image and video content analysis is an interesting, meaningful and challenging topic. In recent years much of the research effort in the multimedia field focuses on indexing and retrieval. Semantic gap between low-level features and high-level content is a bottleneck in most systems. To bridge the semantic gap, new content analysis models need to be developed. In this thesis, algorithms based on a relatively new graphical model, called the conditional random field (CRF) model, are developed for two closely-related problems in content analysis: image labeling and video content analysis. The CRF model can represent spatial interactions in image labeling and temporal interactions in video content analysis. New feature functions are designed to better represent the feature distributions. The mixture feature functions are used in image labeling for databases with nature images, and the independent component analysis (ICA) mixture function is applied in sports video content analysis. The spatial dependence of image parts and the temporal dependence of video frames can be explored by the CRF model more effectively using new feature functions. For image labeling with large databases, the content-based image retrieval method is combined with the CRF image labeling model successfully.

Video content analysis based on statistical modeling

10.32920/ryerson.14646498 ◽

2021 ◽

Author(s):

Jian Zhou

Keyword(s):

Content Analysis ◽

Statistical Modeling ◽

Sequential Data ◽

Video Content ◽

Video Content Analysis ◽

Statistical Measures ◽

Video Parsing ◽

Parsing Algorithm ◽

Semantic Event

This thesis is aimed at finding solutions and statistical modeling techniques to analyze the video content in a way such that intelligent and efficient interaction with video is possible. In our work, we investigate several fundamental tasks for content analysis of video. Specifically, we propose an outline video parsing algorithm using basic statistical measures and an off-line solution using Independent Component Analysis (ICA). A spatiotemporal video similarity model based on dynamic programming is developed. For video object segmentation and tracking, we develop a new method based on probabilistic fuzzy c-means and Gibbs random fields. Theoretically, we develop a generic framework for sequential data analysis. The new framework integrates both Hidden Markov Model and ICA mixture model. The re-estimation formulas for model parameter learning are also derived. As a case study, the new model is applied to golf video for semantic event detection and recognition.

Video content analysis based on statistical modeling

10.32920/ryerson.14646498.v1 ◽

2021 ◽

Author(s):

Jian Zhou

Keyword(s):

Content Analysis ◽

Statistical Modeling ◽

Sequential Data ◽

Video Content ◽

Video Content Analysis ◽

Statistical Measures ◽

Video Parsing ◽

Parsing Algorithm ◽

Semantic Event

This thesis is aimed at finding solutions and statistical modeling techniques to analyze the video content in a way such that intelligent and efficient interaction with video is possible. In our work, we investigate several fundamental tasks for content analysis of video. Specifically, we propose an outline video parsing algorithm using basic statistical measures and an off-line solution using Independent Component Analysis (ICA). A spatiotemporal video similarity model based on dynamic programming is developed. For video object segmentation and tracking, we develop a new method based on probabilistic fuzzy c-means and Gibbs random fields. Theoretically, we develop a generic framework for sequential data analysis. The new framework integrates both Hidden Markov Model and ICA mixture model. The re-estimation formulas for model parameter learning are also derived. As a case study, the new model is applied to golf video for semantic event detection and recognition.

video content analysis
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

Attitudes towards the ‘Shisha No Thanks’ campaign video: Content Analysis of Facebook Comments (Preprint)

A Deep Multimodal Model for Predicting Affective Responses Evoked by Movies Based on Shot Segmentation

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Group Behavior Pattern Recognition Algorithm Based on Spatio-Temporal Graph Convolutional Networks

Towards Optimized IoT-based Context-aware Video Content Analysis Framework

Conditional random field based image and video content analysis

Conditional random field based image and video content analysis

Video content analysis based on statistical modeling

Video content analysis based on statistical modeling

Export Citation Format

video content analysisRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

Attitudes towards the ‘Shisha No Thanks’ campaign video: Content Analysis of Facebook Comments (Preprint)

A Deep Multimodal Model for Predicting Affective Responses Evoked by Movies Based on Shot Segmentation

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Group Behavior Pattern Recognition Algorithm Based on Spatio-Temporal Graph Convolutional Networks

Towards Optimized IoT-based Context-aware Video Content Analysis Framework

Conditional random field based image and video content analysis

Conditional random field based image and video content analysis

Video content analysis based on statistical modeling

Video content analysis based on statistical modeling

video content analysis
Recently Published Documents