Human Pose estimation is an important problem and has enjoyed the attention of the Computer Vision community for the past few decades. It is an important step towards understanding people in images and videos. In this post, I write about the basics of 3D Human Pose Estimation and review the literature on this topic. I've covered 2D Pose Estimation here.

## What is 3D Human Pose Estimation?

Given an image of a person, 3D pose estimation is the task of producing a 3D pose that matches the spatial position of the depicted person.

It is a significantly more difficult problem than 2D pose estimation, and there has been a lot of exciting development in the field in the past few years. In this post I will be covering the basics and the major milestones in the 3D Human Pose Estimation literature.

## Why is it hard? And harder than 2D Pose Estimation?

In general, recovering 3D pose from 2D RGB images is considered more difficult than 2D pose estimation, due to the larger 3D pose space and more ambiguities. An algorithm has to be invariant to a number of factors, including background scenes, lighting, clothing shape and texture, skin color and image imperfections, among others.

Also, it is quite difficult to build an in-the-wild dataset. 3D pose datasets are typically built using MOCAP systems, which are suited to indoor environments. MOCAP systems require an elaborate setup with multiple sensors and bodysuits, which is impractical to use outside. So the lack of in-the-wild 3D ground-truth data is a major bottleneck.

## Applications

3D HPE has immediate applications in various tasks such as action understanding, surveillance, human-robot interaction, motion capture and CGI. Motion capture and CGI are very lucrative since the default method is to use expensive MOCAP setups to capture motion and movement. Imagine getting all this just from a monocular image/video.

## Different approaches to 3D Human Pose Estimation

Human pose estimation approaches can be classified into two types - model-based generative methods and discriminative methods.

• The pictorial structure model (PSM) is one of the most popular generative models for 2D human pose estimation. The conventional PSM treats the human body as an articulated structure. The model usually consists of two terms, which model the appearance of each body part and the spatial relationship between adjacent parts. Since the length of a limb in 2D can vary, a mixture of models was proposed for modeling each body part. The spatial relationships between articulated parts are simpler for 3D pose, since the limb length in 3D is constant for one specific subject. Burenius et al. propose to apply PSM to 3D pose estimation by discretizing the space. However, the pose space grows cubically with the resolution of the discretization, which makes inference expensive.
• Discriminative methods view pose estimation as a regression problem. After extracting features from the image, a mapping is learned from the feature space to the pose space. Because of the articulated structure of the human skeleton, the joint locations are highly correlated. To consider the dependencies between output variables, Ionescu et al. propose to use a structured SVM to learn the mapping from segmentation features to joint locations.
• Deep Learning approaches - Instead of dealing with the structural dependencies manually, a more direct way is to “embed” the structure into the mapping function and learn a representation that disentangles the dependencies between output variables. In this case models need to discover the patterns of human pose from data, which usually requires a large dataset for learning.

In the next section, I’ll summarize a few papers in chronological order that represent the evolution of 3D Human Pose Estimation. (This is not an exhaustive list, but a list of the papers that I feel are the most significant or best show the progression, per conference.)

Papers

## 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network (ACCV 14) [arXiv]

The simplest way to predict 3D joint locations is to train a network to directly regress them from an image. This paper does exactly that. It is the first to show that deep neural networks can be applied to 3D human pose estimation from single images.
The framework consists of two types of tasks: 1) a joint point regression task; and 2) joint point detection tasks. The input for both tasks is a bounding-box image containing a human subject. The goal of the regression task is to estimate the positions of joint points relative to the root joint position. The aim of each detection task is to classify whether a local window contains a specific joint or not.
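
As a rough illustration of the regression target described above, the sketch below converts absolute 3D joint positions into root-relative ones. The joint layout, the choice of index 0 as the root and the function name are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def to_root_relative(joints_3d, root_index=0):
    """Express every joint relative to the root joint.

    joints_3d: array of shape (N, 3) with absolute joint positions.
    root_index: index of the root joint (assumed to be 0 here).
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    # Broadcasting subtracts the root position from every joint.
    return joints_3d - joints_3d[root_index]

# Toy pose with three joints (hypothetical values).
pose = np.array([[1.0, 2.0, 3.0],   # root
                 [1.5, 2.0, 3.0],
                 [1.0, 3.0, 2.0]])
relative = to_root_relative(pose)
print(relative)
```

After this step the root sits at the origin, so the network only has to predict offsets from it rather than absolute positions.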

### Network

The whole network consists of 9 trainable layers – 3 convolutional layers that are shared by both regression and detection networks, 3 fully connected layers for the regression network, and 3 fully connected layers for the detection network.

The network is trained within a multi-task learning framework: the features in the lower layers are shared between the regression and detection tasks during joint training. During training, the gradients from both networks are back-propagated to the same shared feature network, i.e., the layers from conv1 to pool3. As a result, the shared network tends to learn features that benefit both tasks.

### Results

Have a look at the appendix to find the definition of MPJPE.

## 3D Human Pose Estimation = 2D Pose Estimation + Matching (CVPR'17) [arXiv][code]

Instead of directly predicting 3D pose from an image, the paper explores a simple architecture that reasons through intermediate 2D pose predictions. It is based on two key observations:

• Deep neural nets have revolutionized 2D pose estimation, producing very good results.
• “Big-data” sets of 3D mocap data are readily available, making it easier to “lift” predicted 2D poses to 3D through simple memorization.

Given a 3D pose library (essentially a collection of 3D poses), they generate a large number of 2D projections. Given this training set of paired (2D,3D) data and predictions from a 2D pose estimation algorithm, the depths from the 3D pose associated with the closest matching 2D example from the library are returned.

Due to the difficulty of annotation in 3D, training datasets with 3D labels are typically collected in a lab environment, while 2D datasets tend to be more diverse. The two-stage pipeline makes use of different training sets for different stages, resulting in a system that can predict 3D poses from “in-the-wild” images and hence generalize better.

### Approach

The paper makes use of a probabilistic formulation over variables including the image $I$, the 3D pose $X \in R^{N \times 3}$, and the 2D pose $x \in R^{N \times 2}$, where $N$ is the number of articulated joints. The joint probability is:

$$p(X, x, I)=p(X | x, I) \cdot p(x | I) \cdot p(I)$$

The first term $p(X | x, I)$ is modeled as a non-parametric nearest-neighbor (NN) model, and the second term $p(x | I)$ as a CNN, i.e.

$$p(x | I)=\mathrm{CNN}(I)$$

where the CNN returns $N$ 2D heatmaps, one per joint. The CNN used is a Convolutional Pose Machine.
Assuming the 3D pose depends on the image only through the 2D pose, the first term reduces to $p(X | x)$. It is modeled using nearest-neighbor matching, which returns the depths of the 1-nearest-neighbor (1NN) 3D pose.
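
The matching step can be pictured with a small NumPy sketch. It is only an illustration under simplifying assumptions: the library is random toy data, the 3D poses are projected orthographically by dropping the depth axis (a stand-in for whatever camera model is actually used to generate projections), and the names `project_orthographic` and `lift_with_nearest_neighbor` are hypothetical.

```python
import numpy as np

def project_orthographic(poses_3d):
    """Drop the depth axis: (M, N, 3) -> (M, N, 2). Assumed camera model."""
    return poses_3d[..., :2]

def lift_with_nearest_neighbor(pose_2d, library_3d):
    """Keep the predicted (x, y) and borrow depths from the closest library pose.

    pose_2d: (N, 2) predicted 2D joints.
    library_3d: (M, N, 3) collection of 3D poses.
    """
    library_2d = project_orthographic(library_3d)               # (M, N, 2)
    diffs = library_2d - pose_2d[None]                          # (M, N, 2)
    # Mean per-joint 2D distance between the query and each library entry.
    dists = np.sqrt((diffs ** 2).sum(axis=-1)).mean(axis=-1)    # (M,)
    best = int(np.argmin(dists))
    depths = library_3d[best, :, 2]
    lifted = np.concatenate([pose_2d, depths[:, None]], axis=1)  # (N, 3)
    return lifted, best

rng = np.random.default_rng(0)
library = rng.normal(size=(50, 14, 3))       # toy 3D pose library
# Query: a slightly noisy 2D view of library entry 3.
query = library[3, :, :2] + rng.normal(scale=1e-3, size=(14, 2))
lifted, best = lift_with_nearest_neighbor(query, library)
print(best, lifted.shape)
```

Because the lookup is pure memorization, its quality depends entirely on how densely the library covers the space of poses and viewpoints.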

## Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach (ICCV'17) [arXiv][code]

This paper argues that sequential pipelines (like the previous paper) are sub-optimal because the original in-the-wild 2D image information, which contains rich cues for 3D pose recovery, is discarded in the second step. This approach leverages both 2D joint locations and intermediate feature representations from the original image. They propose a weakly-supervised, end-to-end method that trains a deep neural network with a two-stage cascaded structure on mixed 2D and 3D labels. That’s a mouthful, so let’s break it down.

2D annotations act as weak labels for 3D pose estimation. 2D datasets do not have 3D ground truth but contain diverse in-the-wild images. To predict 3D pose on these 2D images, a geometric loss is introduced.

Now the transfer learning part. Because of the shared representations, deep features are better learned. In doing so, the 3D pose labels in controlled lab environments are transferred to in the wild images.

### Network

2D Pose Estimation module - A stacked hourglass module is used and the output is a set of low-res heat maps. (Each map represents a 2D probability distribution of one joint. You can read more about it here.) This module is trained with an L2 loss between the target and predicted heatmaps.

### Depth Regression Module

• The depth regression module contains 4 sequential residual & pooling modules
• The strategy of estimating 3D points from 2D key points is inherently ambiguous, as there typically exist multiple 3D interpretations of a single 2D skeleton. They propose to combine the 2D joint heatmaps and the intermediate feature representations in the 2D module as input to the depth regression module. These features, which extract semantic information at multiple levels for 2D pose estimation, provide additional cues for 3D pose recovery.
• 3D geometric constraint induced loss - For the 3D dataset, a Euclidean loss is used. For the 2D dataset, however, 3D annotations do not exist, so a novel loss induced from a geometric constraint is used instead. In the absence of a ground-truth depth label, this geometric constraint serves as an effective regularizer for depth prediction. You can read more about it in the original paper.
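
To make the idea of a geometry-induced loss more concrete, here is a minimal sketch. It assumes the constraint is built on the observation that ratios between bone lengths stay roughly fixed in a human skeleton, so it penalizes the variance of length ratios within groups of related bones. The toy skeleton, the bone grouping and the canonical lengths are all made up for illustration; consult the paper for the exact formulation.

```python
import numpy as np

def bone_lengths(pose_3d, bones):
    """Length of each bone, given (parent, child) joint index pairs."""
    parents = pose_3d[[p for p, _ in bones]]
    children = pose_3d[[c for _, c in bones]]
    return np.linalg.norm(children - parents, axis=1)

def geometric_loss(pose_3d, bones, canonical_lengths, groups):
    """Penalize variation of predicted/canonical length ratios within each group.

    groups: list of lists of bone indices expected to share one ratio.
    """
    ratios = bone_lengths(pose_3d, bones) / np.asarray(canonical_lengths, float)
    loss = 0.0
    for group in groups:
        r = ratios[group]
        loss += np.mean((r - r.mean()) ** 2)
    return loss

# Toy skeleton: two two-bone "limbs" attached to joint 0 (hypothetical layout).
pose = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [0, 1, 0], [0, 2, 0]], float)
bones = [(0, 1), (1, 2), (0, 3), (3, 4)]
canonical = [1.0, 1.0, 1.0, 1.0]
groups = [[0, 2], [1, 3]]          # matching bones of the two limbs

distorted = pose.copy()
distorted[2] = [3, 0, 0]           # stretch one bone to twice its length

print(geometric_loss(pose, bones, canonical, groups))
print(geometric_loss(1.3 * pose, bones, canonical, groups))
print(geometric_loss(distorted, bones, canonical, groups))
```

A uniformly scaled skeleton incurs no penalty, which matters because absolute scale cannot be recovered from a single 2D image, while a pose with inconsistent limb proportions is penalized.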

## A Simple Yet Effective Baseline for 3D Human Pose Estimation (ICCV’17) [arXiv][code]

This is an interesting paper that explores the sources of error in 3D pose estimation. 3D pose estimation can be broken down into two stages: Image → 2D pose → 3D pose. Through a simple and intuitive architecture, the paper investigates whether error stems from a limited 2D pose (visual) understanding, or from a failure to map 2D poses into 3D positions.

They conclude that “lifting” ground truth 2D joint locations to 3D space is a task that can be solved with a low error rate. Their results indicate that a large portion of the error of modern deep 3D pose estimation systems stems from their visual analysis.

### Model

The architecture is built from simple blocks. It is based on a simple, deep, multilayer neural network with batchnorm, dropout and ReLUs, as well as residual connections. Most of the experiments use 2 residual blocks.
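
The forward pass of such a network can be sketched in plain NumPy. This is an inference-time illustration only, with random untrained weights: dropout is left out because it is inactive at inference, batchnorm uses fixed statistics, and the hidden width of 64 and all function names are assumptions chosen to keep the example small.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batchnorm_eval(x, mean, var, eps=1e-5):
    """Batch normalization with fixed (running) statistics."""
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, layers):
    """Stacked (linear -> batchnorm -> ReLU) layers plus a skip connection."""
    h = x
    for weight, bias, mean, var in layers:
        h = relu(batchnorm_eval(h @ weight + bias, mean, var))
    return x + h

rng = np.random.default_rng(0)
num_joints, hidden = 16, 64          # hidden size is an arbitrary small value

def make_layer(n_in, n_out):
    return (rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out),
            np.zeros(n_out), np.ones(n_out))

w_in = rng.normal(0, 0.1, (2 * num_joints, hidden))
blocks = [[make_layer(hidden, hidden), make_layer(hidden, hidden)]
          for _ in range(2)]         # 2 residual blocks
w_out = rng.normal(0, 0.1, (hidden, 3 * num_joints))

def lift(pose_2d_flat):
    """Map flattened 2D joints (batch, 2N) to flattened 3D joints (batch, 3N)."""
    h = relu(pose_2d_flat @ w_in)
    for block in blocks:
        h = residual_block(h, block)
    return h @ w_out

batch = rng.normal(size=(4, 2 * num_joints))
pose_3d_flat = lift(batch)
print(pose_3d_flat.shape)
```

The input is just a vector of 2D coordinates, so the whole lifting network operates on a few dozen numbers per pose and never sees the image.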

## Integral Human Pose Regression (ECCV'18) [arXiv] [code]

Recent years have seen significant progress on the problem, using deep convolutional neural networks (CNNs). Best performing methods on 2D pose estimation are all detection based and generate a likelihood heat map for each joint and locate the joint as the point with the maximum likelihood in the map. The heat maps are also extended for 3D pose estimation and shown to be promising.

Despite its good performance, a heat map representation has a few inherent drawbacks.

• The “taking-maximum” operation is not differentiable and prevents training from being end-to-end.
• A heat map has a lower resolution than that of input image due to the downsampling steps in a deep neural network. This causes inevitable quantization errors.
• Using images and heat maps with higher resolution helps to increase accuracy but is demanding in computation and storage, especially for 3D heat maps.

Existing works are either heat-map based or regression based. This paper shows that a simple operation would relate and unify the heat map representation and joint regression. It modifies the “taking-maximum” operation to “taking-expectation”. The joint is estimated as the integration of all locations in the heat map, weighted by their probabilities (normalized from likelihoods). This approach is called integral regression (soft-argmax). It shares the merits of both heat map representation and regression approaches while avoiding their drawbacks. The integral function is differentiable and allows end-to-end training. It is simple and brings little overhead in computation and storage. Moreover, it can be easily combined with any heat map based methods.

Because the integral regression is parameter free and only transforms the pose representation from a heat map to a joint, it does not affect other algorithm design choices and can be combined with any of them, including different tasks, heat map and joint losses, network architectures, and image and heat map resolutions.
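
The difference between “taking-maximum” and “taking-expectation” is easy to see on a single 2D heat map. The sketch below is a minimal NumPy version, assuming the likelihoods are normalized with a softmax; the 64 × 64 map, the peak location and the function name are made up for the example.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Expected (x, y) location under the softmax-normalized heat map."""
    height, width = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())   # subtract max for stability
    probs /= probs.sum()
    ys, xs = np.mgrid[0:height, 0:width]
    return np.array([(probs * xs).sum(), (probs * ys).sum()])

# Toy heat map whose true peak lies between pixels, at x = 20.3, y = 41.7.
ys, xs = np.mgrid[0:64, 0:64]
sigma = 2.0
logits = -((xs - 20.3) ** 2 + (ys - 41.7) ** 2) / (2 * sigma ** 2)

hard = np.unravel_index(np.argmax(logits), logits.shape)   # (row, col)
soft = soft_argmax_2d(logits)                              # (x, y)
print(hard, soft)
```

The hard maximum can only land on an integer pixel, whereas the expectation recovers the sub-pixel location and is a smooth function of the heat map values, which is what makes end-to-end training possible.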

### Joint 2D and 3D training

Network Architecture - A simple network architecture is used which consists of a deep convolutional backbone network to extract convolutional features from the input image, and a shallow head network to estimate the target output (heat maps or joints) from the features.

The head network for heat maps is fully convolutional. It first uses deconvolution layers to upsample the feature map to the required resolution (64 × 64 by default). The number of output channels is fixed to 256. Then, a 1 × 1 conv layer is used to produce K heat maps. For regression, an average pooling layer first reduces the spatial dimensionality of the convolutional features. Then, a fully connected layer outputs 3K joint coordinates.

They use a simple multi-stage implementation based on ResNet-50; the features from the conv3 block are shared as input to all stages. Each stage then concatenates this feature with the heat maps from the previous stage, and passes the result through the conv4 and conv5 blocks to generate its own deep feature. The heat map head is then appended to output heat maps, supervised with the ground truth and losses.

## Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation (ECCV'18) [arXiv][code]

This is an interesting unsupervised learning approach. Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. Weakly-supervised methods that reduce the amount of annotation required to achieve a desired level of performance are therefore valuable. Instead of trying to improve upon the state of the art, the paper shows that very good performance can be reached with less than 10% (or even 1%) of the amount of data.

### Approach

Images of the same person taken from multiple views are used to learn a latent representation that captures the 3D geometry of the human body. Learning this representation does not require any 2D or 3D pose annotation (left side of Fig. (a)). Instead, an encoder-decoder is trained to predict an image seen from one viewpoint from an image captured from a different one. It is then possible to learn to predict a 3D pose from this latent representation in a supervised manner (right side of Fig. (b)). Because the latent representation already captures 3D geometry, the mapping to 3D pose is much simpler and can be learned using far fewer examples than existing methods that rely on multi-view supervision. Hence a shallow network can be used.

As can be seen in Fig. (a), the latent representation resembles a volumetric 3D shape and encodes both 3D pose and appearance.

At run time, the test image is fed to the trained encoder, which outputs the 3D latent variables. These are passed to the shallow trained network, which predicts the 3D pose, as seen in Fig. (b). By contrast, most state-of-the-art approaches train a network to regress directly from the images to the 3D poses, which requires a much deeper network and therefore more training data (Fig. (c)).

### Results

Other papers I feel are important:

• Structured Prediction of 3D Human Pose with Deep Neural Networks (BMVC'16)[arXiv]
• Recurrent 3D Pose Sequence Machines (CVPR'17) [paper] [code]
• Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose (CVPR'17) [arXiv] [code]
• VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera (SIGGRAPH'17) [paper] [code]
• DensePose : Dense Human Pose Estimation In The Wild (CVPR'18) [arXiv] [code]

## Appendix

MPJPE or Mean Per Joint Position Error is the most common evaluation metric in 3D Human Pose Estimation.

• Per joint position error is the Euclidean distance between ground truth and prediction for a joint.
• Mean per joint position error is the Mean of per joint position error for all N joints (Typically, N = 16)
• Calculated after aligning the root joints (typically the pelvis) of the estimated and groundtruth 3D pose.
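• The joints are also normalized with respect to the root joint.
• The formula is given below, where $T$ is the number of samples, $N$ is the number of joints, and $J_{i}^{(t)}$ and $\hat{J}_{i}^{(t)}$ are the ground-truth and predicted positions of joint $i$ in sample $t$:

$$\text{MPJPE}=\frac{1}{T} \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N}\left\|\left(J_{i}^{(t)}-J_{\text{root}}^{(t)}\right)-\left(\hat{J}_{i}^{(t)}-\hat{J}_{\text{root}}^{(t)}\right)\right\|_{2}$$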
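
The steps listed above translate into a few lines of NumPy. This is a sketch rather than a reference implementation: the root joint is assumed to be at index 0, and the array layout `(T, N, 3)` and the function name are choices made for this example.

```python
import numpy as np

def mpjpe(predicted, ground_truth, root_index=0):
    """Mean per joint position error after root alignment.

    predicted, ground_truth: arrays of shape (T, N, 3).
    root_index: index of the root joint (assumed to be 0 here).
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # Align the root joints by expressing every joint relative to its root.
    pred_rel = predicted - predicted[:, root_index:root_index + 1]
    gt_rel = ground_truth - ground_truth[:, root_index:root_index + 1]
    # Euclidean distance per joint, then the mean over joints and samples.
    per_joint = np.linalg.norm(pred_rel - gt_rel, axis=-1)   # (T, N)
    return per_joint.mean()

rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 16, 3))             # T = 2 samples, N = 16 joints

shifted = gt + np.array([10.0, 0.0, 0.0])    # pure translation of the whole pose
one_off = gt.copy()
one_off[:, 1] += np.array([3.0, 4.0, 0.0])   # one joint is off by 5 units

print(mpjpe(shifted, gt))
print(mpjpe(one_off, gt))
```

A global translation of the prediction is removed by the root alignment and does not count as error, while a single joint that is off by 5 units contributes 5/16 to the mean.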
