video captioning Latest Research Papers

With the advancement of the technological field, day by day, people from around the world are having easier access to internet abled devices, and as a result, video data is growing rapidly. The increase of portable devices such as various action cameras, mobile cameras, motion cameras, etc., can also be considered for the faster growth of video data. Data from these multiple sources need more maintenance to process for various usages according to the needs. By considering these enormous amounts of video data, it cannot be navigated fully by the end-users. Throughout recent times, many research works have been done to generate descriptions from the images or visual scene recordings to address the mentioned issue. This description generation, also known as video captioning, is more complex than single image captioning. Various advanced neural networks have been used in various studies to perform video captioning. In this paper, we propose an attention-based Bi-LSTM and sequential LSTM (Att-BiL-SL) encoder-decoder model for describing the video in textual format. The model consists of two-layer attention-based bi-LSTM and one-layer sequential LSTM for video captioning. The model also extracts the universal and native temporal features from the video frames for smooth sentence generation from optical frames. This paper includes the word embedding with a soft attention mechanism and a beam search optimization algorithm to generate qualitative results. It is found that the architecture proposed in this paper performs better than various existing state of the art models.

Download Full-text

Hierarchical Attention-Based Video Captioning Using Key Frames

Artificial Intelligence and Technologies - Lecture Notes in Electrical Engineering ◽

10.1007/978-981-16-6448-9_30 ◽

2021 ◽

pp. 295-302

Author(s):

Munusamy Hemalatha ◽

P. Karthik

Keyword(s):

Video Captioning ◽

Key Frames

Download Full-text

An efficient deep learning‐based video captioning framework using multi‐modal features

Expert Systems ◽

10.1111/exsy.12920 ◽

2021 ◽

Author(s):

Soumya Varma ◽

Dinesh Peter James

Keyword(s):

Deep Learning ◽

Video Captioning

Download Full-text

An attention based dual learning approach for video captioning

Applied Soft Computing ◽

10.1016/j.asoc.2021.108332 ◽

2021 ◽

pp. 108332

Author(s):

Wanting Ji ◽

Ruili Wang ◽

Yan Tian ◽

Xun Wang

Keyword(s):

Learning Approach ◽

Video Captioning

Download Full-text

Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction

International Journal of Social Robotics ◽

10.1007/s12369-021-00842-1 ◽

2021 ◽

Author(s):

Soo-Han Kang ◽

Ji-Hyeong Han

Keyword(s):

Deep Learning ◽

Natural Language ◽

Vision System ◽

Robot Vision ◽

Learning Model ◽

Human Robot Interaction ◽

Robot Interaction ◽

Video Captioning ◽

Global Action ◽

Deep Learning Model

AbstractRobot vision provides the most important information to robots so that they can read the context and interact with human partners successfully. Moreover, to allow humans recognize the robot’s visual understanding during human-robot interaction (HRI), the best way is for the robot to provide an explanation of its understanding in natural language. In this paper, we propose a new approach by which to interpret robot vision from an egocentric standpoint and generate descriptions to explain egocentric videos particularly for HRI. Because robot vision equals to egocentric video on the robot’s side, it contains as much egocentric view information as exocentric view information. Thus, we propose a new dataset, referred to as the global, action, and interaction (GAI) dataset, which consists of egocentric video clips and GAI descriptions in natural language to represent both egocentric and exocentric information. The encoder-decoder based deep learning model is trained based on the GAI dataset and its performance on description generation assessments is evaluated. We also conduct experiments in actual environments to verify whether the GAI dataset and the trained deep learning model can improve a robot vision system

Download Full-text