Learning 3-D Human Pose Estimation from Catadioptric Videos

3-D human pose estimation is a crucial step for understanding human actions. However, reliably capturing precise 3-D position of human joints is non-trivial and tedious. Current models often suffer from the scarcity of high-quality 3-D annotated training data. In this work, we explore a novel way of obtaining gigantic 3-D human pose data without manual annotations. In catedioptric videos (\emph{e.g.}, people dance before a mirror), the camera records both the original and mirrored human poses, which provides cues for estimating 3-D positions of human joints. Following this idea, we crawl a large-scale Dance-before-Mirror (DBM) video dataset, which is about 24 times larger than existing Human3.6M benchmark. Our technical insight is that, by jointly harnessing the epipolar geometry and human skeleton priors, 3-D joint estimation can boil down to an optimization problem over two sets of 2-D estimations. To our best knowledge, this represents the first work that collects high-quality 3-D human data via catadioptric systems. We have conducted comprehensive experiments on cross-scenario pose estimation and visualization analysis. The results strongly demonstrate the usefulness of our proposed DBM human poses.

Download Full-text

3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6689 ◽

2020 ◽

Vol 34 (07) ◽

pp. 10631-10638

Author(s):

Yu Cheng ◽

Bo Yang ◽

Bo Wang ◽

Robby T. Tan

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

Video Data ◽

Training Data ◽

Human Pose Estimation ◽

Ground Truth Data ◽

Public Data ◽

Spatio Temporal ◽

Human Pose ◽

3D Human Pose Estimation

Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress that has been made in the recent years. Generally, the performance of existing methods drops when the target person is too small/large, or the motion is too fast/slow relative to the scale and speed of the training data. Moreover, to our knowledge, many of these methods are not designed or trained under severe occlusion explicitly, making their performance on handling occlusion compromised. Addressing these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear in different scales and have various motion speeds, we apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and becomes robust to various degrees of occlusion. As there are limited 3D ground truth data, we further utilize 2D video data to inject a semi-supervised learning capability to our network. Experiments on public data sets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.

Download Full-text

Multi-Scale Collaborative Network for Human Pose Estimation

International Journal of Humanoid Robotics ◽

10.1142/s0219843619410032 ◽

2019 ◽

Vol 16 (04) ◽

pp. 1941003

Author(s):

Chunsheng Guo ◽

Jialuo Zhou ◽

Wenlong Du ◽

Xuguang Zhang

Keyword(s):

Loss Function ◽

Pose Estimation ◽

Large Scale ◽

Human Pose Estimation ◽

Small Scale ◽

Feature Maps ◽

Multi Scale ◽

Human Pose ◽

Processing Network ◽

Weight Coefficients

Human pose estimation is a fundamental but challenging task in computer vision. The estimation of human pose mainly depends on the global information of the keypoint type and the local information of the keypoint location. However, the consistency of the cascading process makes it difficult for each stacking network to form a differentiation and collaboration mechanism. In order to solve these problems, this paper introduces a new human pose estimation framework called Multi-Scale Collaborative (MSC) network. The pre-processing network forms feature maps of different sizes, and dispatches them to various locations of the stack network, with small-scale features reaching the front-end stacking network and large-scale features reaching the back-end stacking network. A new loss function is proposed for MSC network. Different keypoints have different weight coefficients of loss function at different scales, and the keypoint weight coefficients are dynamically adjusted from the top hourglass network to the bottom hourglass network. Experimental results show that the proposed method is competitive in MPII and LSP challenge leaderboard among the state-of-the-art methods.

Download Full-text

Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ◽

10.1109/cvpr42600.2020.00621 ◽

2020 ◽

Cited By ~ 1

Author(s):

Shichao Li ◽

Lei Ke ◽

Kevin Pratama ◽

Yu-Wing Tai ◽

Chi-Keung Tang ◽

...

Keyword(s):

Pose Estimation ◽

Training Data ◽

Human Pose Estimation ◽

Human Pose ◽

3D Human Pose Estimation

Download Full-text

DRPose3D: Depth Ranking in 3D Human Pose Estimation

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/136 ◽

2018 ◽

Cited By ~ 11

Author(s):

Min Wang ◽

Xipeng Chen ◽

Wentao Liu ◽

Chen Qian ◽

Liang Lin ◽

...

Keyword(s):

Neural Network ◽

Pose Estimation ◽

Human Pose Estimation ◽

Geometric Feature ◽

Classification Problems ◽

Two Stage ◽

Human Pose ◽

Human Joints ◽

3D Information ◽

3D Human Pose Estimation

In this paper, we propose a two-stage depth ranking based method (DRPose3D) to tackle the problem of 3D human pose estimation. Instead of accurate 3D positions, the depth ranking can be identified by human intuitively and learned using the deep neural network more easily by solving classification problems. Moreover, depth ranking contains rich 3D information. It prevents the 2D-to-3D pose regression in two-stage methods from being ill-posed. In our method, firstly, we design a Pairwise Ranking Convolutional Neural Network (PRCNN) to extract depth rankings of human joints from images. Secondly, a coarse-to-fine 3D Pose Network(DPNet) is proposed to estimate 3D poses from both depth rankings and 2D human joint locations. Additionally, to improve the generality of our model, we introduce a statistical method to augment depth rankings. Our approach outperforms the state-of-the-art methods in the Human3.6M benchmark for all three testing protocols, indicating that depth ranking is an essential geometric feature which can be learned to improve the 3D pose estimation.

Download Full-text

Self-Attention Network for Human Pose Estimation

Applied Sciences ◽

10.3390/app11041826 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1826

Author(s):

Hailun Xia ◽

Tianyang Zhang

Keyword(s):

Pose Estimation ◽

Human Pose Estimation ◽

Attention Network ◽

Learning Framework ◽

Benchmark Datasets ◽

Rgb Images ◽

Human Pose ◽

Human Joints ◽

Symmetric Relations ◽

2D And 3D

Estimating the positions of human joints from monocular single RGB images has been a challenging task in recent years. Despite great progress in human pose estimation with convolutional neural networks (CNNs), a central problem still exists: the relationships and constraints, such as symmetric relations of human structures, are not well exploited in previous CNN-based methods. Considering the effectiveness of combining local and nonlocal consistencies, we propose an end-to-end self-attention network (SAN) to alleviate this issue. In SANs, attention-driven and long-range dependency modeling are adopted between joints to compensate for local content and mine details from all feature locations. To enable an SAN for both 2D and 3D pose estimations, we also design a compatible, effective and general joint learning framework to mix up the usage of different dimension data. We evaluate the proposed network on challenging benchmark datasets. The experimental results show that our method has significantly achieved competitive results on Human3.6M, MPII and COCO datasets.

Download Full-text

Progressive Bi-C3D Pose Grammar for Human Pose Estimation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.7004 ◽

2020 ◽

Vol 34 (07) ◽

pp. 13033-13040 ◽

Cited By ~ 2

Author(s):

Lu Zhou ◽

Yingying Chen ◽

Jinqiao Wang ◽

Hanqing Lu

Keyword(s):

Pose Estimation ◽

Human Body ◽

Message Passing ◽

Contextual Information ◽

Human Pose Estimation ◽

Body Parts ◽

Multi Scale ◽

Human Pose ◽

Human Joints ◽

Body Joints

In this paper, we propose a progressive pose grammar network learned with Bi-C3D (Bidirectional Convolutional 3D) for human pose estimation. Exploiting the dependencies among the human body parts proves effective in solving the problems such as complex articulation, occlusion and so on. Therefore, we propose two articulated grammars learned with Bi-C3D to build the relationships of the human joints and exploit the contextual information of human body structure. Firstly, a local multi-scale Bi-C3D kinematics grammar is proposed to promote the message passing process among the locally related joints. The multi-scale kinematics grammar excavates different levels human context learned by the network. Moreover, a global sequential grammar is put forward to capture the long-range dependencies among the human body joints. The whole procedure can be regarded as a local-global progressive refinement process. Without bells and whistles, our method achieves competitive performance on both MPII and LSP benchmarks compared with previous methods, which confirms the feasibility and effectiveness of C3D in information interactions.

Download Full-text

MH Pose: 3D Human Pose Estimation based on High-quality Heatmap

10.1109/bigdata52589.2021.9671770 ◽

2021 ◽

Author(s):

Huifen Zhou ◽

Chaoqun Hong ◽

Yong Han ◽

Pengcheng Huang ◽

Yanhui Zhuang

Keyword(s):

Pose Estimation ◽

Human Pose Estimation ◽

High Quality ◽

Human Pose ◽

3D Human Pose Estimation

Download Full-text

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6792 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11312-11319 ◽

Cited By ~ 1

Author(s):

Jogendra Nath Kundu ◽

Siddharth Seth ◽

Rahul M V ◽

Mugalodi Rakesh ◽

Venkatesh Babu Radhakrishnan ◽

...

Keyword(s):

Pose Estimation ◽

Large Scale ◽

3D Structure ◽

Human Pose Estimation ◽

Kinematic Structure ◽

In The Wild ◽

Monocular Image ◽

Human Pose ◽

Weakly Supervised ◽

Image Pairs

Estimation of 3D human pose from monocular image has gained considerable attention, as a key step to several human-centric applications. However, generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable, as these models often perform unsatisfactorily on unseen in-the-wild environments. Though weakly-supervised models have been proposed to address this shortcoming, performance of such models relies on availability of paired supervision on some related task, such as 2D pose or multi-view image pairs. In contrast, we propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions. Our pose estimation framework relies on a minimal set of prior knowledge that defines the underlying kinematic 3D structure, such as skeletal joint connectivity information with bone-length ratios in a fixed canonical scale. The proposed model employs three consecutive differentiable transformations namely forward-kinematics, camera-projection and spatial-map transformation. This design not only acts as a suitable bottleneck stimulating effective pose disentanglement, but also yields interpretable latent pose representations avoiding training of an explicit latent embedding to pose mapper. Furthermore, devoid of unstable adversarial setup, we re-utilize the decoder to formalize an energy-based loss, which enables us to learn from in-the-wild videos, beyond laboratory settings. Comprehensive experiments demonstrate our state-of-the-art unsupervised and weakly-supervised pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets. Qualitative results on unseen environments further establish our superior generalization ability.

Download Full-text