PoseGPT - Chatting about 3D Human Pose

Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black

Published

07 March 2024

Motivated by human’s ability to intuitively understand postures from either images or textual descriptions, this work aims to develop a multi-modal LLM for pose, PostGPT. Traditional approaches on pose estimations are proven to lack holistic scene comprehension and nuanced reasoning, leading to a disconncect between visual data and its real world implications. PostGPT addresses issue by embedding SMPL poses as a distinct signal token in a multi-model LLM, empowering LLM to apply their world knowledge in reasoning about human poses. The results show that PoseGPT outperforms existing multi-modal LLMs, opening new directions on pose analysis.

Introduction

Traditional 3D human pose estimation can be categorized into two methods based on the input modality:

Image: this method usually detects individuals from the picture and segments them from the image, then uses a NN to predict the 3D pose and shape. However, it often lacks contextual awareness, missing out on information on a holistic understanding of the scene and the interactions between human with themselves and the environment.
text: the instructions for this method are typically very explicit, with the text describing desired actions. However, this method is limited as the training data is scarce.

Therefore, the existing specialized systems are very narrow tasked, contrary to what the authors perception on the ability of the LLMs. The authors hypothesize that, due to LLMs ability to perceive and interpret information based on their world knowledge, if LLMs can relate their knowledge to 3D human pose, then they would be very powerful, performing beyond existing solutions.

To evaluate what LLMs already understand about 3D pose generation and how we can teach them, the authors propose PoseGPT, fine-tuning multi-modal LLMs for predicting human pose represented as SMPL. A special <POSE> token is generated in LLMs’ output, representing the SMPL poses. The language embedding is extracted from the token, later fed into a MLP to predict SMPL pose parameters.

The contributions of the paper can be summarized as follows:

Propose PoseGPT, a multi-modal LLM directly generating SMLP poses
Introduce two innovative tasks
- Speculative Pose Generation (SPG): in this task, LLMs are asked to speculate, which requires it to have the understanding of
  - what being adjective means
  - how does this translate into 3D pose
- Reasoning-based Pose Estimation (RPE): contrary to conventional approach where LLMs are provided with a bounding box, PoseGPT is exposed to the entire scene, gaining information regrading the context.
Demonstrates superior performance compared to other LLMs based results

Method

Architecture

The model consists of an multi-modal LLM, $f_\phi$ , an embedding projection layer, $g_\Theta$ , and a parametric human body modal SMPL, represented by $\beta ~{} \& ~{}\theta$ .

Given a text string $X_q$ and an image $X_v$ as an input, the model will produces a textual respone $Y_t = f_\phi (X_q, X_v)$ or $Y_t = f_\phi (X_q)$ . If <POSE> is present in $Y_t$ , then the corresponding embedding $H_{pose}$ will be retrieved from $Y_t$ by utilizing the projection layer $g_\Theta$ .

Training

The authors optimze the model using the following objective function

$\mathcal{L} = \beta_t \text{CE} (\hat{Y}_t, Y_t) + \lambda_\theta |\hat{\theta} - \theta|$

with the first term is the cross-entropy loss while the second term is the L1 loss computed between pose parameters.

Text to Post Generation

The data pairs for this task is SMPL post parameters and textual descriptions, $\{X_q, \hat{\theta}\}$ , fit into a question-answer format.

Human Pose Estimation

Since the focus is estimatnig human poses from 2D images, the data pairs for this task is $\{X_v, \hat{\theta}\}$ . Similar to text pose generation, a question-answer template is employed for training.

To maintain LLM’s inherent ability for multi-turn conversations, the authors also incorporate a multi-modal instruction following dataset during traning.

Reasoning about Human Pose

Speculative Pose Generation (SPG)

This task requires the model to understand the global concepts. This is because instead of explicit pose descriptions, the model is given implicit information regarding the pose, requiring deduction to generate appropriate pose.

Reasoning-based Pose Estimation (RPE)

The standard approach for pose estimation typically by first deploying detections method to isolate the person of interest and then apply pose estimation methods on the cropped image. This approach ignores the scene context, which could be helpful in reasoning poses. In contrast, this paper structured the prompt in a way such that the model is required to interpret the scene context and generate the SMPL parameters for the fitting individual.

Results

Pose Generation

PoseGPT is evaluated on both the classical and the novel SPG tasks. The results show that PoseGPT performs comparably to PoseScript for classical tasks and outperforms on the SPG.

Pose Estimation

Both the classical and the novel reasoning-based pose estimation are evaluated for PoseGPT. It is found that for classical estimation tasks, PostGPT outperforms other multi-modal LLMs while not matching the performance of the task-specific models. For reasoning-based pose estimation, PostGPT outperforms both task-specific and multi-modal LLMs, despite having trouble estimating global orientation of the person. In addition, the authors find that PostGPT generalizes well to strong occulsions, even withtout any data augmentation in training.

Conclusion & Future Work

PoseGPT makes a first step in integrating human pose estimation with general reasoning abilities of LLMs. It demonstrates that multi-model LLMs can not only be fine-tuned to infer 3D poses but also connect 3D human pose with language. However, the model does not match the performance of the special-purpose 3D human regressors.

There is much to do in the near term. For example, this study only focused on the body pose. It should be straightforward to extend the work to 3D body shape. Extension to generating representaitons for a sequence of human motions is also a possible direction, enabling both generation and analysis of human motion. Moreover, integrating SMPL poses or videos as an input modality could open up new applications and opportunities for training.