Action recognition based on element-level fine-grained multi-modal fusion

2021 · Vol. 2010 (1) · pp. 012114
Author(s): Guozheng Peng, Lixin Han, Jiaxue Yang
2021 · pp. 620-631
Author(s): Xiang Li, Shenglan Liu, Yunheng Li, Hao Liu, Jinjing Zhao, ...

Author(s): Yang Zhou, Bingbing Ni, Shuicheng Yan, Pierre Moulin, Qi Tian

Author(s): Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, ...

Abstract: This paper introduces the pipeline used to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, and 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, recorded with head-mounted cameras. Compared to its previous version (Damen et al., Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, ECCV 2018), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotation of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), and unsupervised domain adaptation for action recognition. For each challenge, we define the task and provide baselines and evaluation metrics.
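As a rough illustration of the density figure quoted above (actions per minute), the following Python sketch computes it from a table of action segments. The CSV schema (video_id, start_sec, stop_sec) is a hypothetical stand-in, not the dataset's official annotation format:

```python
import csv
from collections import defaultdict

def actions_per_minute(annotation_csv):
    """Mean number of annotated action segments per minute of video.

    Assumes a hypothetical CSV with columns video_id, start_sec, stop_sec;
    this is NOT the official EPIC-KITCHENS-100 annotation schema.
    """
    counts = defaultdict(int)      # video_id -> number of segments
    duration = defaultdict(float)  # video_id -> latest segment end, as a duration proxy

    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            vid = row["video_id"]
            counts[vid] += 1
            duration[vid] = max(duration[vid], float(row["stop_sec"]))

    per_video = [counts[v] / (duration[v] / 60.0) for v in counts if duration[v] > 0]
    return sum(per_video) / len(per_video) if per_video else 0.0

# The "54% denser" comparison then reduces to a ratio of two such numbers:
# actions_per_minute("epic100_segments.csv") / actions_per_minute("epic55_segments.csv")
```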


IEEE Access · 2019 · Vol. 7 · pp. 103629-103638
Author(s): Jian Xiong, Liguo Lu, Hengbing Wang, Jie Yang, Guan Gui

Author(s): Yaparla Ganesh, Allaparthi Sri Teja, Sai Krishna Munnangi, Garimella Rama Murthy

2020 · Vol. 10 (4) · pp. 1531
Author(s): Bhishan Bhandari, Geonu Lee, Jungchan Cho

Action recognition is an application that, ideally, requires real-time results. We focus on single-image-based action recognition instead of video-based recognition because of its higher speed and lower computational cost. However, a single image contains limited information, which makes single-image-based action recognition a difficult problem. To obtain an accurate representation of action classes, we propose three feature-stream-based shallow sub-networks (image-based, attention-image-based, and part-image-based feature networks) on top of a deep pose estimation network in a multitasking manner. Moreover, we design a multitask-aware loss function so that the proposed method can be trained adaptively with heterogeneous datasets in which only human pose annotations or only action labels are available (rather than both), making it easier to apply the approach to new behavioral-analysis data in intelligent systems. In extensive experiments, we show that these streams carry complementary information, and hence the fused representation is robust in distinguishing diverse fine-grained action classes. Although the human pose information is trained with heterogeneous datasets in a multitasking manner, the method achieves 91.91% mean average precision on the Stanford 40 Actions dataset. Moreover, we demonstrate that the proposed method can be applied flexibly to the multi-label action recognition problem on the V-COCO dataset.
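A minimal sketch of how such a multitask-aware loss can be realised, assuming a PyTorch model that outputs both action logits and keypoint coordinates; the masks, shapes, and loss weights below are illustrative assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def multitask_aware_loss(action_logits, pose_pred, action_labels, pose_targets,
                         has_action, has_pose, w_action=1.0, w_pose=1.0):
    """Masked multitask loss: every sample contributes only the terms
    for which its source dataset actually provides annotations.

    action_logits : (B, C) class scores       action_labels : (B,) long
    pose_pred     : (B, K, 2) keypoints       pose_targets  : (B, K, 2)
    has_action / has_pose : (B,) boolean masks per sample.
    Shapes and weights are illustrative, not the paper's exact setup.
    """
    loss = action_logits.new_zeros(())
    if has_action.any():  # samples from action-labelled datasets
        loss = loss + w_action * F.cross_entropy(
            action_logits[has_action], action_labels[has_action])
    if has_pose.any():    # samples from pose-annotated datasets
        loss = loss + w_pose * F.mse_loss(
            pose_pred[has_pose], pose_targets[has_pose])
    return loss
```

In a batch mixed from both dataset types, the two boolean masks record which annotations each sample actually carries, so missing labels simply contribute no gradient.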


2021 · pp. 108282
Author(s): Sravani Yenduri, Nazil Perveen, Vishnu Chalavadi, C. Krishna Mohan

2021
Author(s): Mei Chee Leong, Hui Li Tan, Haosong Zhang, Liyuan Li, Feng Lin, ...
