Acta Automatica Sinica  2018, Vol. 44 Issue (4): 646-655


A Medium Granularity Model for Human Pose Estimation in Video
SHI Qing-Xuan1,2,3, DI Hui-Jun1,3, LU Yao1,3, TIAN Xue-Dong2
1. School of Computer Science, Beijing Institute of Technology, Beijing 100081;
2. School of Cyber Security and Computer, Hebei University, Baoding 071000;
3. Beijing Laboratory of Intelligent Information Technology, Beijing 100081
Manuscript received: December 27, 2016; accepted: July 12, 2017.
Foundation Item: Supported by the National Natural Science Foundation of China (61375075, 9142020013, 61273273) and the Key Project of the Science and Technology Research Program in University of Hebei Province of China (ZD2017208)
Corresponding author: LU Yao, Professor at the School of Computer Science, Beijing Institute of Technology. His research interests cover neural networks, image and signal processing, and pattern recognition.
Recommended by Associate Editor WANG Liang
Abstract: Human pose estimation has attracted much attention in the computer vision community due to its potential applications in action recognition, human-computer interaction, etc. Focusing on pose estimation in videos, this paper presents a medium granularity spatio-temporal probabilistic graphical model that uses body part tracklets as entities. The optimal tracklet for each body part is obtained by approximate spatio-temporal inference through iterative spatial and temporal parsing, and the final human pose estimate is obtained by merging these optimal tracklets. To generate reliable tracklet proposals, a global motion cue is adopted to propagate pose detections from individual frames to the whole video, and the trajectories resulting from this propagation are segmented into fixed-length overlapping tracklets. To deal with the double counting problem, symmetric parts are coupled into one virtual node, so that the loops in the spatial model are removed while the constraints between symmetric parts are maintained. Experiments on three datasets show that the proposed method achieves higher accuracy than the other pose estimation methods compared.
Key words: Human pose estimation, medium granularity model, Markov random field, hidden Markov model

Figure 1 The models used in video pose estimation

Figure 2 The medium granularity model

Figure 3 Short-term performance of different motion estimation approaches

Figure 4 Long-term performance of different motion estimation approaches

1 Problem formulation

Figure 5 Overview of the video pose estimation method based on the medium granularity model

1.1 Single-frame pose detection

 $$$\label{equ_fmp} S(I, X)=\sum\limits_{i\in \mathcal{V}}\phi (x_i, I)+\sum\limits_{(i, j)\in \mathcal{E}}\psi({x_i, x_j})$$$ (1)
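As a reading aid for Eq. (1), the following minimal Python sketch evaluates the score of one pose configuration as the sum of per-part appearance terms over $\mathcal{V}$ and pairwise terms over the edges $\mathcal{E}$. The callables `unary` and `pairwise`, the part names, and all numbers are hypothetical placeholders; the image $I$ is assumed to be captured inside `unary`.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Position = Tuple[float, float]


def pose_score(
    pose: Dict[Hashable, Position],
    edges: Iterable[Tuple[Hashable, Hashable]],
    unary: Callable[[Hashable, Position], float],
    pairwise: Callable[[Hashable, Hashable, Position, Position], float],
) -> float:
    """Evaluate S(I, X) of Eq. (1) for a single pose X (a part -> position map)."""
    # First sum: appearance score phi(x_i, I) of every part.
    appearance = sum(unary(i, x_i) for i, x_i in pose.items())
    # Second sum: compatibility psi(x_i, x_j) of every connected part pair.
    deformation = sum(pairwise(i, j, pose[i], pose[j]) for i, j in edges)
    return appearance + deformation


# Toy example with made-up scores (not taken from the paper).
toy_pose = {"head": (0.0, 0.0), "neck": (0.0, 1.0)}
toy_edges = [("head", "neck")]


def toy_unary(part, x):
    return 1.0  # constant appearance score for every part


def toy_pairwise(i, j, x_i, x_j):
    # Negative squared distance as a stand-in deformation score.
    return -((x_i[0] - x_j[0]) ** 2 + (x_i[1] - x_j[1]) ** 2)


toy_score = pose_score(toy_pose, toy_edges, toy_unary, toy_pairwise)
```

In the toy example the two appearance terms contribute 2.0 and the single edge contributes -1.0.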

 $$$\label{equ_Phis} \Phi(T_i^t, V^t)=\Phi_\mathit{s}(S_i, Q) = \sum\limits_{f=1}^F \phi_d(s_i^f, q^f) + \lambda_1\phi_g(S_i)$$$ (3)

 $$$\label{equ_phig} \phi_g(S_i) = -\frac{\mathrm{var}(\Lambda(s_i^1), \Lambda(s_i^2), \cdots, \Lambda(s_i^F))}{\max\limits_{f_1, f_2}\|s_i^{f_1}-s_i^{f_2}\|_2^2}$$$ (4)

 $$$\label{equ_Phic} \begin{split} \Phi(T_i^t, &V^t)=\Phi_\mathit{c}(C_i, Q) = \Phi_\mathit{s}(C_i.l, Q) +\\& \Phi_\mathit{s}(C_i.r, Q)+ \lambda_2\sum\limits_{f=1}^F(-\psi_{\text{color}}(c_i^f\!\!.l, c_i^f\!\!.r)) +\\& \lambda_3\sum\limits_{f=1}^F\psi_{\text{dist}}(c_i^f\!\!.l, c_i^f\!\!.r) \end{split}$$$ (5)

 $$$\label{equ_PsiSS} \Psi({T_i^t, T_j^t})=\Psi(S_i, S_j) = \sum\limits_{f=1}^F \psi_p(s_i^f, s_j^f)$$$ (6)

 \begin{align} \label{equ_PsiSC} \Psi({T_i^t, T_j^t})= &\Psi(S_i, C_j)= \\ &\sum\limits_{f=1}^F( \psi_p(s_i^f, c_j^f\!.l)+\psi_p(s_i^f, c_j^f\!.r)) \end{align} (7)

 \begin{align} \label{equ_PsiCC} \Psi({T_i^t, T_j^t})= &\Psi(C_i, C_j)= \\ &\sum\limits_{f=1}^F( \psi_p(c_i^f\!.l, c_j^f\!.l)+\psi_p(c_i^f\!.r, c_j^f\!.r)) \end{align} (8)
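Equations (6) to (8) all sum a per-frame pairwise score $\psi_p$ over the $F$ frames of two tracklets and differ only in whether each tracklet is a single part or a coupled (left/right) node. The sketch below illustrates this dispatch under simplifying assumptions: `psi_p` is a hypothetical callable that sees only two positions (the real term presumably also depends on the part types), the `Coupled` class is introduced here for illustration, and the mirrored case $\Psi(C_i, S_j)$ is not given in the text and is added by analogy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Point = Tuple[float, float]
Single = List[Point]  # one position per frame for a single part


@dataclass
class Coupled:
    """Tracklet of a symmetric part pair: left and right trajectories."""
    l: List[Point]
    r: List[Point]


Tracklet = Union[Single, Coupled]
PsiP = Callable[[Point, Point], float]


def spatial_compatibility(t_i: Tracklet, t_j: Tracklet, psi_p: PsiP) -> float:
    """Sum psi_p frame by frame over the trajectory pairs of two tracklets."""
    i_coupled = isinstance(t_i, Coupled)
    j_coupled = isinstance(t_j, Coupled)
    if i_coupled and j_coupled:      # Eq. (8): left with left, right with right
        pairs = [(t_i.l, t_j.l), (t_i.r, t_j.r)]
    elif j_coupled:                  # Eq. (7): single part with both sides
        pairs = [(t_i, t_j.l), (t_i, t_j.r)]
    elif i_coupled:                  # mirrored case, an assumption of this sketch
        pairs = [(t_i.l, t_j), (t_i.r, t_j)]
    else:                            # Eq. (6): two single parts
        pairs = [(t_i, t_j)]
    return sum(psi_p(a, b) for u, v in pairs for a, b in zip(u, v))


def neg_sq_dist(a: Point, b: Point) -> float:
    """Stand-in for psi_p: negative squared distance between two positions."""
    return -((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


# Toy tracklets with F = 2 frames (made-up coordinates).
s1 = [(0.0, 0.0), (0.0, 0.0)]
s2 = [(3.0, 4.0), (0.0, 0.0)]
c1 = Coupled(l=[(1.0, 0.0), (1.0, 0.0)], r=[(-2.0, 0.0), (-2.0, 0.0)])
c2 = Coupled(l=[(1.0, 1.0), (1.0, 1.0)], r=[(-2.0, 0.0), (-2.0, 0.0)])

score_ss = spatial_compatibility(s1, s2, neg_sq_dist)
score_sc = spatial_compatibility(s1, c1, neg_sq_dist)
score_cc = spatial_compatibility(c1, c2, neg_sq_dist)
```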

1.3.2 Hidden Markov model

 $$$\label{equ_hmm} {S}'_T(T_i, V)=\sum\limits_{t=1}^N \Phi'(T_i^t, V^t)+\sum\limits_{t=1}^{N\!-\!1}\Psi'(T_i^t, T_i^{t+1})$$$ (9)

 $$$\label{equ_PHI_HMM} \Phi'(T_i^t, V^t) = \Phi(T_i^t, V^t)+ \Psi({T_i^t, T_{pa(i)}^t})$$$ (10)

 $$$\label{equ_PsiHMM} \Psi'(A, B) = -\lambda_4\|A - B\|_2^2$$$ (11)

 $$$\Psi'(A, B) = -\lambda_5\left(\frac{\|A.l - B.l\|_2 + \|A.r - B.r\|_2}{2}\right)^2$$$ (12)
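The two temporal consistency terms in Eqs. (11) and (12) can be illustrated with a short Python sketch. It assumes that $A$ and $B$ are equal-length sequences of 2D positions (for instance, the frames shared by two consecutive overlapping tracklets, which is an assumption of this example) and that the $L_2$ norm is taken over all stacked coordinates. The function names and the weights passed as `lam4` and `lam5` are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def l2_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """L2 norm of the stacked coordinate differences between two trajectories."""
    if len(a) != len(b):
        raise ValueError("trajectories must have the same number of frames")
    return math.sqrt(
        sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))
    )


def temporal_single(a: Sequence[Point], b: Sequence[Point], lam4: float) -> float:
    """Eq. (11): negative weighted squared distance for a single-part node."""
    return -lam4 * l2_distance(a, b) ** 2


def temporal_coupled(
    a_l: Sequence[Point],
    a_r: Sequence[Point],
    b_l: Sequence[Point],
    b_r: Sequence[Point],
    lam5: float,
) -> float:
    """Eq. (12): average the left and right distances, then square and weight."""
    mean_dist = 0.5 * (l2_distance(a_l, b_l) + l2_distance(a_r, b_r))
    return -lam5 * mean_dist ** 2


# Toy trajectories (made-up coordinates): the left side moves by 5, the right by 0.
left_a = [(0.0, 0.0), (1.0, 0.0)]
left_b = [(3.0, 4.0), (1.0, 0.0)]
right_a = [(2.0, 2.0), (2.0, 2.0)]
right_b = [(2.0, 2.0), (2.0, 2.0)]

single_score = temporal_single(left_a, left_b, lam4=0.5)
coupled_score = temporal_coupled(left_a, right_a, left_b, right_b, lam5=2.0)
```

Note that Eq. (12) squares the mean of the two distances rather than averaging the squared distances, so a large jump on one side is damped when the other side stays consistent.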

2 Model inference

 $$$\mathcal{M}_{\mathcal{T}}(T_i^t, a) = \max\limits_{T^t\in\mathcal{T}:T_i^t=a}{S}_T(T^t, V^t)$$$ (13)

 $$$\label{equ_msg_space} m_{i\rightarrow j}( T_j^t) \propto \max\limits_ {T_i^t}(m_i(T_i^t)+ \Psi({T_i^t, T_j^t}))$$$ (14)
 $$$\label{equ_belief_space} m_i(T_i^t) \propto \Phi(T_i^t, V^t) +\sum\limits_{k \in N\!b\!d(i)\backslash j} m_{k\rightarrow i}( T_i^t)$$$ (15)

 $$$b(T_i^t) = \Phi(T_i^t, V^t) + \sum\limits_{k \in N\!b\!d(i)} m_{k\rightarrow i}( T_i^t)$$$ (16)
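The message and belief updates of Eqs. (14) to (16) amount to max-sum message passing on a tree. The following Python sketch shows one way to compute the beliefs for a single time window, under stated assumptions: candidates are addressed by integer index, `unary[i][a]` stands for $\Phi$ of candidate `a` of node `i`, `pairwise(i, a, j, b)` stands for $\Psi$ and is assumed symmetric in its two nodes, and messages are left unnormalized (the equations only fix them up to proportionality). All names and toy numbers are illustrative.

```python
from typing import Callable, Dict, List, Tuple


def tree_max_sum(
    neighbors: Dict[str, List[str]],
    unary: Dict[str, List[float]],
    pairwise: Callable[[str, int, str, int], float],
) -> Dict[str, List[float]]:
    """Return the belief b(T_i) of Eq. (16) for every node and candidate."""
    messages: Dict[Tuple[str, str], List[float]] = {}

    def message(i: str, j: str) -> List[float]:
        """Message from node i to its neighbour j, memoized per directed edge."""
        if (i, j) not in messages:
            # Eq. (15): local score plus messages from all neighbours except j.
            m_i = list(unary[i])
            for k in neighbors[i]:
                if k != j:
                    incoming = message(k, i)
                    m_i = [s + incoming[a] for a, s in enumerate(m_i)]
            # Eq. (14): maximize over the candidates of the sender i.
            messages[(i, j)] = [
                max(m_i[a] + pairwise(i, a, j, b) for a in range(len(m_i)))
                for b in range(len(unary[j]))
            ]
        return messages[(i, j)]

    beliefs: Dict[str, List[float]] = {}
    for i in neighbors:
        # Eq. (16): local score plus messages from all neighbours.
        b_i = list(unary[i])
        for k in neighbors[i]:
            incoming = message(k, i)
            b_i = [s + incoming[a] for a, s in enumerate(b_i)]
        beliefs[i] = b_i
    return beliefs


# Toy star-shaped tree with two candidates per node (made-up scores).
toy_neighbors = {"torso": ["head", "arm"], "head": ["torso"], "arm": ["torso"]}
toy_unary = {"torso": [0.0, 1.0], "head": [2.0, 0.0], "arm": [0.0, 0.5]}


def toy_pairwise(i: str, a: int, j: str, b: int) -> float:
    return -float(abs(a - b))  # prefer neighbours that pick the same index


toy_beliefs = tree_max_sum(toy_neighbors, toy_unary, toy_pairwise)
```

On a tree each belief equals the max-marginal of Eq. (13), that is, the best total score achievable when the node is fixed to that candidate; the recursion terminates because a message never depends on the neighbour it is sent to.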

 $$$\label{equ_msg_time} m_{t\rightarrow {t\!+\!1}}( T_i^{t+1}) \propto \max\limits_ {T_i^t}(m_i(T_i^t)+ \Psi'(T_i^t, T_i^{t+1}))$$$ (17)
 $$$m_i(T_i^t) \propto \Phi'(T_i^t, V^t) + m_{{t\!-\!1}\rightarrow t}( T_i^t)$$$ (18)

 $$$\label{equ_belief_time} b(T_i^t) = \Phi'(T_i^t, V^t\!)\!+m_{{t\!-\!1}\rightarrow t}( T_i^t)+ m_{{t\!+\!1}\rightarrow t}( T_i^t)$$$ (19)
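For the temporal chain of one body part, Eqs. (17) to (19) reduce to a forward and a backward max-sum pass. A minimal Python sketch is given below, assuming that `unary[t][a]` stands for $\Phi'$ of candidate `a` in window `t` and that the hypothetical callable `transition(t, a, b)` stands for $\Psi'$ between candidate `a` in window `t` and candidate `b` in window `t+1`. Messages are again left unnormalized.

```python
from typing import Callable, List


def chain_max_sum(
    unary: List[List[float]],
    transition: Callable[[int, int, int], float],
) -> List[List[float]]:
    """Return the belief of Eq. (19) for every window t and candidate a."""
    n = len(unary)
    fwd = [[0.0] * len(u) for u in unary]  # messages from t-1 to t
    bwd = [[0.0] * len(u) for u in unary]  # messages from t+1 to t

    # Forward pass, Eqs. (17) and (18).
    for t in range(1, n):
        prev = [unary[t - 1][a] + fwd[t - 1][a] for a in range(len(unary[t - 1]))]
        fwd[t] = [
            max(prev[a] + transition(t - 1, a, b) for a in range(len(prev)))
            for b in range(len(unary[t]))
        ]

    # Backward pass: the same update applied in the reverse direction.
    for t in range(n - 2, -1, -1):
        nxt = [unary[t + 1][b] + bwd[t + 1][b] for b in range(len(unary[t + 1]))]
        bwd[t] = [
            max(nxt[b] + transition(t, a, b) for b in range(len(nxt)))
            for a in range(len(unary[t]))
        ]

    # Eq. (19): local score plus both incoming messages.
    return [
        [unary[t][a] + fwd[t][a] + bwd[t][a] for a in range(len(unary[t]))]
        for t in range(n)
    ]


# Toy chain with three windows and two candidates each (made-up scores).
toy_chain_unary = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]


def toy_transition(t: int, a: int, b: int) -> float:
    return -float(abs(a - b))  # penalize switching candidate index


toy_chain_beliefs = chain_max_sum(toy_chain_unary, toy_transition)
```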

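WHILE iteration count $<$ maximum number of iterations

$//$ Spatial parsing

FOR $t = 1$ TO $N$ DO

FOR $i$ = leaves TO root DO

Compute the message $m_{i\rightarrow j}( T_j^t)$ according to Eq. (14);

END FOR

FOR $i$ = root TO leaves DO

Compute the message $m_{i\rightarrow j}( T_j^t)$ according to Eq. (14);

END FOR

FOR $i$ = root TO leaves DO

Compute the tracklet score $b(T_i^t)$ according to Eq. (16);

Sort the tracklet candidates by $b(T_i^t)$ in descending order and keep the top proportion $P$;

END FOR

END FOR

$//$ Temporal parsing

FOR $i$ = root TO leaves DO

FOR $t = 1$ TO $N-1$ DO

Compute the message $m_{t\rightarrow {t\!+\!1}}( T_i^{t+1})$ according to Eq. (17);

END FOR

FOR $t = N$ TO 2 DO

Compute the message $m_{ t\rightarrow{t\!-\!1} }( T_i^{t-1})$ according to Eq. (17);

END FOR

FOR $t = 1$ TO $N$ DO

Compute the tracklet score $b(T_i^t)$ according to Eq. (19);

Sort the tracklet candidates by $b(T_i^t)$ in descending order and keep the top proportion $P$;

END FOR

END FOR

END WHILE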

$\hat{T}_i^t = \arg\max\limits_{{T}_i^t}(b(T_i^t))$.
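The candidate screening step of the iteration (sort by belief, keep a proportion $P$) and the final $\arg\max$ selection can be sketched as follows. This is an illustration only: rounding the kept count up and always keeping at least one candidate are assumptions of the example, not details given in the text, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def prune_candidates(beliefs: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the top `keep_ratio` fraction of candidates,
    ordered by decreasing belief b(T_i^t)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in (0, 1]")
    # Assumption: round up and never drop below one surviving candidate.
    keep = max(1, math.ceil(keep_ratio * len(beliefs)))
    order = sorted(range(len(beliefs)), key=lambda a: beliefs[a], reverse=True)
    return order[:keep]


def best_candidate(beliefs: Sequence[float]) -> int:
    """Final selection: index of the candidate with the largest belief."""
    return max(range(len(beliefs)), key=lambda a: beliefs[a])


# Toy beliefs for four tracklet candidates of one part in one window.
toy_b = [0.2, 1.5, -0.3, 0.9]
kept = prune_candidates(toy_b, keep_ratio=0.5)
winner = best_candidate(toy_b)
```

Pruning after each spatial and temporal pass shrinks the candidate sets that the next pass has to search, which is what makes alternating the two parsing steps tractable.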

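3 Experiments

3.1 Datasets

UnusualPose video dataset[12]: This dataset contains 4 video clips with a large number of unusual human poses and fast motions.

FYDP video dataset[29]: This dataset consists of 20 dance videos; apart from a few videos, most of the motion is relatively smooth.

Sub_Nbest video dataset[22]: To facilitate comparison with other methods, we follow the selection used by the compared algorithms and use only the two videos, Walkstraight and Baseball, provided in [22].

3.2 Evaluation metrics and experimental settings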

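PCK (Percentage of correct keypoints)[7]: PCK gives the percentage of correctly estimated keypoints (the coordinate positions of joint parts). The keypoints here usually refer to the joints of the human body (e.g., head, neck, shoulders, elbows, wrists, hips, knees, and ankles). A keypoint estimate is considered correct when its estimated position falls within $\alpha \cdot \max(h, w)$ pixels of the ground truth, where $h$ and $w$ are the height and width of the bounding box of the human target, and $\alpha$ controls the threshold of the correctness judgment. The bounding box is the tightest rectangle enclosing the ground-truth joints. Depending on whether the pose estimation target is the full body or the upper body, $\alpha$ is set to 0.1 or 0.2.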
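The PCK criterion can be made concrete with a small Python sketch for a single frame. It assumes that keypoints are given as `(x, y)` pixel coordinates in matching order and that a distance exactly equal to the threshold counts as correct; the function name and the toy coordinates are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(pred: Sequence[Point], gt: Sequence[Point], alpha: float) -> float:
    """Fraction of predicted keypoints within alpha * max(h, w) of the ground truth."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Tightest axis-aligned box around the ground-truth joints.
    xs = [x for x, _ in gt]
    ys = [y for _, y in gt]
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)
    threshold = alpha * max(h, w)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Toy frame: the box is 10 wide and 20 high, so alpha = 0.1 gives a 2-pixel threshold.
toy_gt = [(0.0, 0.0), (10.0, 0.0), (0.0, 20.0), (10.0, 20.0)]
toy_pred = [(1.0, 0.0), (10.0, 3.0), (0.0, 20.0), (11.0, 21.0)]
toy_pck = pck(toy_pred, toy_gt, alpha=0.1)
```

In the toy frame three of the four keypoints lie within 2 pixels of their ground truth, and the second one is off by 3 pixels.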

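PCP (Percentage of correct limb parts)[11]: PCP is a very widely used evaluation metric for pose estimation. It computes the percentage of correctly estimated body parts. Unlike a joint, a body part here is the body segment connecting two adjacent joints (e.g., upper arm, forearm, thigh, lower leg, torso, or head). A part estimate is considered correct when both of its endpoint joints fall within 50 % of the length of the line connecting the endpoints.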
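A corresponding sketch for PCP is given below. It assumes the common reading of the criterion in which each predicted endpoint is compared with its ground-truth endpoint and the tolerance is 50 % of the ground-truth part length; limbs are given as pairs of `(x, y)` endpoints in matching order, and all names and numbers are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Limb = Tuple[Point, Point]


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def pcp(pred: Sequence[Limb], gt: Sequence[Limb], ratio: float = 0.5) -> float:
    """Fraction of limbs whose two endpoints both lie within ratio * limb length."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    correct = 0
    for (p_a, p_b), (g_a, g_b) in zip(pred, gt):
        tolerance = ratio * dist(g_a, g_b)  # assumption: ground-truth length
        if dist(p_a, g_a) <= tolerance and dist(p_b, g_b) <= tolerance:
            correct += 1
    return correct / len(gt)


# Toy example: both ground-truth limbs have length 10, so the tolerance is 5.
toy_gt_limbs = [((0.0, 0.0), (0.0, 10.0)), ((0.0, 10.0), (6.0, 18.0))]
toy_pred_limbs = [((3.0, 0.0), (0.0, 14.0)), ((0.0, 16.0), (6.0, 18.0))]
toy_pcp = pcp(toy_pred_limbs, toy_gt_limbs)
```

The first limb is accepted because its endpoints are off by 3 and 4 pixels; the second is rejected because one endpoint is off by 6 pixels even though the other is exact.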

3.3 Analysis of algorithm effectiveness

Figure 7 Examination of key modules

3.4 Comparison with other algorithms

Figure 8 Qualitative comparison on the UnusualPose dataset
Figure 9 Sample results on the FYDP dataset
Figure 10 Sample results on the Sub_Nbest dataset

4 Conclusion

