video description
Recently Published Documents


TOTAL DOCUMENTS

104
(FIVE YEARS 39)

H-INDEX

13
(FIVE YEARS 3)

Author(s):  
Benedetto Ielpo ◽  
Antonio Giuliani ◽  
Patricia Sanchez ◽  
Fernando Burdio ◽  
Mikel Gastaka ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yuhua Gao ◽  
Yong Mo ◽  
Heng Zhang ◽  
Ruiyin Huang ◽  
Zilong Chen

With the development of computer technology, video description, which combines key techniques from natural language processing and computer vision, has attracted growing attention from researchers. A central challenge in the field is how to describe high-speed, detail-rich sports videos objectively and efficiently. Existing video description methods often lack sufficient language-learning information, which leads to sentence errors and the loss of visual information in the generated description text. To address these problems, a multihead model combining a long short-term memory (LSTM) network with an attention mechanism is proposed for the intelligent description of volleyball videos. By introducing the attention mechanism, the model attends to the salient regions of the video when generating sentences. Comparative experiments with different models show that the attention mechanism effectively mitigates the loss of visual information. Compared with the LSTM and base models, the proposed multihead model combining the LSTM network and the attention mechanism scores higher on all evaluation metrics and significantly improves the quality of the intelligent text description of volleyball videos.
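The core of the attention mechanism described above, weighting per-frame visual features by their relevance to the current decoding step, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the function name `attention_context`, the additive scoring form, and all dimensions are assumptions.

```python
import numpy as np

def attention_context(frame_feats, query, W):
    """Score each frame feature against the decoder query and
    return a softmax-weighted context vector (additive-attention sketch)."""
    # frame_feats: (T, d) per-frame visual features; query: (d,) decoder state
    scores = np.tanh(frame_feats @ W) @ query        # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the T frames
    context = weights @ frame_feats                  # (d,) attended visual context
    return context, weights

rng = np.random.default_rng(0)
T, d = 8, 16
feats = rng.normal(size=(T, d))   # stand-in for CNN frame features
query = rng.normal(size=d)        # stand-in for the LSTM hidden state
W = rng.normal(size=(d, d))       # learned projection (random here)
ctx, w = attention_context(feats, query, W)
print(ctx.shape, round(float(w.sum()), 6))
```

At each word-generation step the decoder would consume `ctx` alongside its hidden state, so frames with higher alignment scores contribute more to the emitted word.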


2021 ◽  
Author(s):  
Jawad Khan

Several recent studies on action recognition have emphasised the importance of explicitly including motion characteristics in the video description. This work shows that properly partitioning visual motion into dominant and residual motions greatly enhances action recognition, both for extracting space-time trajectories and for computing descriptors. We then create a new motion descriptor, the DCS descriptor, from differential motion scalar quantities: divergence, curl, and shear. It improves results by capturing additional information on local motion patterns. Finally, adopting the VLAD coding technique recently proposed for image retrieval improves action recognition significantly. On three difficult datasets, namely Hollywood 2, HMDB51, and Olympic Sports, our three contributions are complementary and outperform all reported results by a large margin.
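The divergence, curl, and shear quantities underlying a DCS-style descriptor can be computed from a dense 2D flow field with finite differences, as in the sketch below. The function name `dcs_maps` and the particular shear-magnitude formulation are illustrative assumptions; the paper's exact definitions may differ.

```python
import numpy as np

def dcs_maps(u, v):
    """Differential kinematic quantities of a 2D flow field (u, v):
    divergence, curl, and shear magnitude, via finite differences."""
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (d/drow, d/dcol)
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy                              # local expansion/contraction
    curl = dv_dx - du_dy                             # local rotation
    shear = np.hypot(du_dx - dv_dy, du_dy + dv_dx)   # local deformation magnitude
    return div, curl, shear

# Sanity check on a pure rotation field u = -y, v = x:
# divergence and shear should vanish, curl should be the constant 2.
y, x = np.mgrid[-5:6, -5:6].astype(float)
div, curl, shear = dcs_maps(-y, x)
print(np.allclose(div, 0), np.allclose(curl, 2), np.allclose(shear, 0))
```

A descriptor would then aggregate these scalar maps (e.g. as histograms along trajectories) rather than use them pointwise.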


2021 ◽  
Vol 3 (7) ◽  
Author(s):  
Wael F. Youssef ◽  
Siba Haidar ◽  
Philippe Joly

Abstract
The purpose of our work is to automatically generate textual video description schemas from surveillance video scenes, compatible with police incident reports. Our proposed approach is based on a generic and flexible context-free ontology. The general schema has the form [actuator] [action] [over/with] [actuated object] [+ descriptors: distance, speed, etc.]. We focus on scenes containing exactly two objects. Through a series of elaborated steps, we generate a formatted textual description. We identify whether an interaction exists between the two objects, including remote interaction that does not involve physical contact, and we point out when aggression takes place in these cases. We use supervised deep learning to classify scenes into interaction and no-interaction classes, and then into subclasses. The descriptors chosen to represent the subclasses are key in surveillance systems, helping to generate live alerts and facilitating offline investigation.
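Rendering the [actuator] [action] [over/with] [actuated object] [+ descriptors] schema into report-ready text can be sketched as a simple template function. The function name `describe_scene` and the example values are hypothetical; the paper's actual formatting rules are not specified here.

```python
def describe_scene(actuator, action, preposition, actuated, descriptors=None):
    """Render the [actuator] [action] [over/with] [actuated object]
    [+ descriptors] schema as a single formatted sentence."""
    text = " ".join([actuator, action, preposition, actuated])
    if descriptors:
        # Append optional descriptors such as distance and speed.
        details = ", ".join(f"{k}: {v}" for k, v in descriptors.items())
        text += f" ({details})"
    return text

print(describe_scene("person A", "throws object", "toward", "person B",
                     {"distance": "3 m", "speed": "fast"}))
# person A throws object toward person B (distance: 3 m, speed: fast)
```

The classifier's predicted subclass and measured descriptors would fill the slots, so a live alert and an offline report share one canonical sentence form.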


2021 ◽  
Author(s):  
Carmen Branje

This thesis explores amateur video description facilitated through LiveDescribe, a video description software program. Twelve amateur describers created video descriptions, which were reviewed by 76 sighted, low-vision, and blind reviewers. The describers were not only able to produce description, but their descriptions were perceived as having an acceptable level of quality. Three describers were rated as "good", three as "weak", and the remaining six fell into a "medium" category. The common factors that characterized the good describers were a soft, non-obtrusive voice; a moderate amount of well-placed descriptions; moderate description lengths; and English as a first language spoken without an accent or regional dialect. LiveDescribe was found to be a useful and easy-to-use tool that facilitated a video description workflow for amateur describers.



Author(s):  
Aditya Bodi ◽  
Pooyan Fazli ◽  
Shasta Ihorn ◽  
Yue-Ting Siu ◽  
Andrew T Scott ◽  
...  
Keyword(s):  
