Acta Automatica Sinica, 2018, Vol. 44, Issue (4): 646-655


A Medium Granularity Model for Human Pose Estimation in Video
SHI Qing-Xuan1,2,3, DI Hui-Jun1,3, LU Yao1,3, TIAN Xue-Dong2
1. School of Computer Science, Beijing Institute of Technology, Beijing 100081;
2. School of Cyber Security and Computer, Hebei University, Baoding 071000;
3. Beijing Laboratory of Intelligent Information Technology, Beijing 100081
Manuscript received: December 27, 2016; accepted: July 12, 2017.
Foundation Item: Supported by National Natural Science Foundation of China (61375075, 9142020013, 61273273) and the Key Project of the Science and Technology Research Program in University of Hebei Province of China (ZD2017208)
Corresponding author. LU Yao  Professor at the School of Computer Science, Beijing Institute of Technology. His research interests cover neural networks, image and signal processing, and pattern recognition. Corresponding author of this paper.
Recommended by Associate Editor WANG Liang
Abstract: Human pose estimation has attracted much attention in the computer vision community due to its potential applications in action recognition, human-computer interaction, etc. Focusing on pose estimation in videos, this paper presents a medium granularity spatio-temporal probabilistic graphical model that uses body part tracklets as entities. The optimal tracklet for each body part is obtained by approximate spatio-temporal reasoning through iterative spatial and temporal parsing, and the final pose estimate is produced by merging these optimal tracklets. To generate reliable tracklet proposals, a global motion cue is adopted to propagate pose detections from individual frames to the whole video, and the resulting trajectories are segmented into fixed-length overlapping tracklets. To deal with the double-counting problem, symmetric parts are coupled into one virtual node, which removes the loops in the spatial model while maintaining the constraints between symmetric parts. Experiments on three datasets show that the proposed method achieves higher accuracy than other pose estimation methods.
Key words: human pose estimation, medium granularity model, Markov random field, hidden Markov model

Figure 1  The models used by existing video human pose estimation methods

Figure 2  The medium granularity spatio-temporal model

Figure 3  Short-term performance of different motion estimation approaches

Figure 4  Long-term performance of different motion estimation approaches

1 Problem Description

Figure 5  Overview of the video human pose estimation method based on the medium granularity model

1.1 Single-frame Pose Detection

$$S(I, X)=\sum\limits_{i\in \mathcal{V}}\phi(x_i, I)+\sum\limits_{(i, j)\in \mathcal{E}}\psi(x_i, x_j)$$ (1)

$$\Phi(T_i^t, V^t)=\Phi_{s}(S_i, Q) = \sum\limits_{f=1}^F \phi_d(s_i^f, q^f) + \lambda_1\phi_g(S_i)$$ (3)

$$\phi_g(S_i) = -\frac{\mathrm{var}(\Lambda(s_i^1), \Lambda(s_i^2), \cdots, \Lambda(s_i^F))}{\max\limits_{f_1, f_2}\|s_i^{f_1}-s_i^{f_2}\|_2^2}$$ (4)
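As an illustration of the smoothness term in Eq. (4), the following sketch computes $\phi_g$ for one tracklet, assuming $\Lambda(\cdot)$ yields a scalar feature per frame state (its exact form is not restated here); a small epsilon is an added safeguard against a zero denominator when the part does not move:

```python
import numpy as np

def phi_g(positions, features, eps=1e-12):
    """Global smoothness term of Eq. (4) (sketch): negative variance of the
    per-frame features Lambda(s_i^f), normalized by the largest squared
    displacement between any two frame positions of the part."""
    positions = np.asarray(positions, float)   # (F, 2) part positions
    features = np.asarray(features, float)     # (F,) Lambda values (assumed scalar)
    diffs = positions[:, None, :] - positions[None, :, :]
    max_sq_disp = (diffs ** 2).sum(-1).max()   # max_{f1,f2} ||s^{f1}-s^{f2}||^2
    return -np.var(features) / (max_sq_disp + eps)
```

A smoothly moving tracklet with stable features scores near zero; erratic features over a short spatial extent are penalized.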

$$\Phi(T_i^t, V^t)=\Phi_{c}(C_i, Q) = \Phi_{s}(C_i.l, Q) + \Phi_{s}(C_i.r, Q) - \lambda_2\sum\limits_{f=1}^F\psi_{\text{color}}(c_i^f.l, c_i^f.r) + \lambda_3\sum\limits_{f=1}^F\psi_{\text{dist}}(c_i^f.l, c_i^f.r)$$ (5)

$$\Psi(T_i^t, T_j^t)=\Psi(S_i, S_j) = \sum\limits_{f=1}^F \psi_p(s_i^f, s_j^f)$$ (6)

$$\Psi(T_i^t, T_j^t)=\Psi(S_i, C_j)= \sum\limits_{f=1}^F(\psi_p(s_i^f, c_j^f.l)+\psi_p(s_i^f, c_j^f.r))$$ (7)

$$\Psi(T_i^t, T_j^t)=\Psi(C_i, C_j)= \sum\limits_{f=1}^F(\psi_p(c_i^f.l, c_j^f.l)+\psi_p(c_i^f.r, c_j^f.r))$$ (8)

1.3.2 Hidden Markov Model

$$S'_T(T_i, V)=\sum\limits_{t=1}^N \Phi'(T_i^t, V^t)+\sum\limits_{t=1}^{N-1}\Psi'(T_i^t, T_i^{t+1})$$ (9)

$$\Phi'(T_i^t, V^t) = \Phi(T_i^t, V^t)+ \Psi(T_i^t, T_{pa(i)}^t)$$ (10)

$$\Psi'(A, B) = -\lambda_4\|A - B\|_2^2$$ (11)

$$\Psi'(A, B) = -\lambda_5\left(\frac{\|A.l - B.l\|_2 + \|A.r - B.r\|_2}{2}\right)^2$$ (12)
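The two temporal continuity potentials of Eqs. (11) and (12) can be sketched as follows; the weight values and the dict-based representation of a coupled state (with 'l'/'r' position entries) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def psi_single(a, b, lam4=1.0):
    """Eq. (11): negative squared L2 distance between the states of
    consecutive tracklets of a single part (lam4 is a placeholder weight)."""
    return -lam4 * float(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def psi_coupled(a, b, lam5=1.0):
    """Eq. (12): for a coupled symmetric part, average the left/right
    distances between consecutive states, then penalize the square."""
    dl = np.linalg.norm(np.asarray(a['l'], float) - np.asarray(b['l'], float))
    dr = np.linalg.norm(np.asarray(a['r'], float) - np.asarray(b['r'], float))
    return -lam5 * ((dl + dr) / 2.0) ** 2
```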

2 Model Inference

$$\mathcal{M}_{\mathcal{T}}(T_i^t, a) = \max\limits_{T^t\in\mathcal{T}:T_i^t=a} S_T(T^t, V^t)$$ (13)

$$m_{i\rightarrow j}(T_j^t) \propto \max\limits_{T_i^t}(m_i(T_i^t)+ \Psi(T_i^t, T_j^t))$$ (14)
$$m_i(T_i^t) \propto \Phi(T_i^t, V^t) +\sum\limits_{k \in Nbd(i)\backslash j} m_{k\rightarrow i}(T_i^t)$$ (15)

$$b(T_i^t) = \Phi(T_i^t, V^t) + \sum\limits_{k \in Nbd(i)} m_{k\rightarrow i}(T_i^t)$$ (16)

$$m_{t\rightarrow t+1}(T_i^{t+1}) \propto \max\limits_{T_i^t}(m_i(T_i^t)+ \Psi'(T_i^t, T_i^{t+1}))$$ (17)
$$m_i(T_i^t) \propto \Phi'(T_i^t, V^t) + m_{t-1\rightarrow t}(T_i^t)$$ (18)

$$b(T_i^t) = \Phi'(T_i^t, V^t)+m_{t-1\rightarrow t}(T_i^t)+ m_{t+1\rightarrow t}(T_i^t)$$ (19)

WHILE iteration count $<$ maximum number of iterations

    // Spatial parsing
    FOR $t = 1$ TO $N$ DO
        FOR $i$ = leaves TO root DO
            Compute message $m_{i\rightarrow j}(T_j^t)$ according to Eq. (14);
        END FOR
        FOR $i$ = root TO leaves DO
            Compute message $m_{i\rightarrow j}(T_j^t)$ according to Eq. (14);
        END FOR
        FOR $i$ = root TO leaves DO
            Compute the tracklet score $b(T_i^t)$ according to Eq. (16);
            Sort by $b(T_i^t)$ in descending order and keep a proportion $P$ of the tracklet candidates;
        END FOR
    END FOR

    // Temporal parsing
    FOR $i$ = root TO leaves DO
        FOR $t = 1$ TO $N-1$ DO
            Compute message $m_{t\rightarrow t+1}(T_i^{t+1})$ according to Eq. (17);
        END FOR
        FOR $t = N$ TO $2$ DO
            Compute message $m_{t\rightarrow t-1}(T_i^{t-1})$ according to Eq. (17);
        END FOR
        FOR $t = 1$ TO $N$ DO
            Compute the tracklet score $b(T_i^t)$ according to Eq. (19);
            Sort by $b(T_i^t)$ in descending order and keep a proportion $P$ of the tracklet candidates;
        END FOR
    END FOR

END WHILE

$\hat{T}_i^t = \arg\max\limits_{T_i^t}b(T_i^t)$.
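The temporal parsing stage (Eqs. (17)$\sim$(19)) amounts to max-sum (Viterbi-style) decoding along a chain of tracklet candidates. Below is a simplified, self-contained sketch; the candidate scores and the pairwise term are supplied as plain arrays and a callable, standing in for the paper's actual potentials:

```python
import numpy as np

def chain_decode(unary, pairwise):
    """Max-sum decoding over a temporal chain of tracklet candidates.
    unary[t][a]     : score Phi' of candidate a in segment t
    pairwise(t,a,b) : compatibility Psi' between candidate a at t and b at t+1
    Returns the index of the best candidate per segment."""
    n = len(unary)
    back = [None] * n
    score = np.asarray(unary[0], dtype=float).copy()
    for t in range(1, n):
        cur = np.asarray(unary[t], dtype=float)
        trans = np.array([[pairwise(t - 1, a, b) for b in range(len(cur))]
                          for a in range(len(score))])
        total = score[:, None] + trans      # forward message m_{t-1 -> t}
        back[t] = total.argmax(axis=0)      # best predecessor per candidate
        score = cur + total.max(axis=0)
    # backtrack the highest-scoring candidate sequence
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return best[::-1]
```

With a strong consistency penalty the decoder prefers a coherent sequence even when one segment's unary score favors a different candidate.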

3 Experiments

3.1 Experimental Data

UnusualPose video dataset [12]: contains 4 videos featuring many unusual human poses and fast motion.

FYDP video dataset [29]: consists of 20 dance videos; except for a few of them, most of the motion is fairly smooth.

Sub_Nbest video dataset [22]: for convenient comparison with other methods, only the two videos Walkstraight and Baseball given in [22] are used, following the selection procedure of the compared algorithms.

3.2 Evaluation Metrics and Experimental Settings

PCK (Percentage of correct keypoints) [7]: PCK measures the percentage of correctly estimated keypoints (the coordinate positions of joint parts). Here keypoints usually refer to the human joints (e.g., head, neck, shoulders, elbows, wrists, hips, knees, and ankles). A keypoint estimate is considered correct when it falls within $\alpha \cdot \max(h, w)$ pixels of the ground truth, where $h$ and $w$ are the height and width of the person's bounding box and $\alpha$ controls the correctness threshold. The bounding box is the tightest rectangle enclosing the ground-truth joints; $\alpha$ is set to 0.1 or 0.2 according to whether the estimation target is the whole body or the upper body.

PCP (Percentage of correct limb parts) [11]: PCP is currently a very widely used evaluation metric for pose estimation; it measures the percentage of correctly estimated body parts. Unlike joints, a body part here refers to the limb segment connecting two adjacent joints (e.g., upper arm, forearm, thigh, lower leg, torso, head). A part estimate is considered correct when both of its endpoint joints fall within 50 % of the ground-truth segment length of the corresponding endpoints.
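A minimal sketch of the PCK computation described above, for a single person in one frame (joint coordinates as $(x, y)$ pairs; function and argument names are illustrative):

```python
import numpy as np

def pck(pred, gt, alpha=0.2):
    """PCK: a keypoint is correct if its estimate lies within
    alpha * max(h, w) of the ground truth, where (h, w) are taken from the
    tightest box around the ground-truth joints."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    h = gt[:, 1].max() - gt[:, 1].min()
    w = gt[:, 0].max() - gt[:, 0].min()
    thresh = alpha * max(h, w)
    dists = np.linalg.norm(pred - gt, axis=1)  # per-joint error
    return float((dists <= thresh).mean())
```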

3.3 Effectiveness Analysis of the Algorithm

Figure 7  Examination of the effectiveness of the key modules of the algorithm
3.4 Comparison with Other Algorithms

Figure 8  Qualitative comparison on the UnusualPose dataset
Figure 9  Sample results on the FYDP dataset
Figure 10  Sample results on the Sub_Nbest dataset
4 Conclusion

References

1 Li Yi, Sun Zheng-Xing, Chen Song-Le, Li Qian. 3D human pose analysis from monocular video by simulated annealed particle swarm optimization. Acta Automatica Sinica, 2012, 38(5): 732-741 (in Chinese)
2 Zhu Yu, Zhao Jiang-Kun, Wang Yi-Ning, Zheng Bing-Bing. A review of human action recognition based on deep learning. Acta Automatica Sinica, 2016, 42(6): 848-857 (in Chinese)
3 Shotton J, Girshick R, Fitzgibbon A, Sharp T, Cook M, Finocchio M, Moore R, Kohli P, Criminisi A, Kipman A, Blake A. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2821-2840. DOI:10.1109/TPAMI.2012.241
4 Cristani M, Raghavendra R, del Bue A, Murino V. Human behavior analysis in video surveillance: a social signal processing perspective. Neurocomputing, 2013, 100: 86-97. DOI:10.1016/j.neucom.2011.12.038
5 Wang L M, Qiao Y, Tang X O. Video action detection with relational dynamic-poselets. In: Proceedings of the European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 565-580
6 Felzenszwalb P F, Huttenlocher D P. Pictorial structures for object recognition. International Journal of Computer Vision, 2005, 61(1): 55-79. DOI:10.1023/B:VISI.0000042934.15159.49
7 Yang Y, Ramanan D. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2878-2890. DOI:10.1109/TPAMI.2012.261
8 Sapp B, Jordan C, Taskar B. Adaptive pose priors for pictorial structures. In: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010. 422-429
9 Andriluka M, Roth S, Schiele B. Pictorial structures revisited: people detection and articulated pose estimation. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009. 1014-1021
10 Eichner M, Marin-Jimenez M, Zisserman A, Ferrari V. 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision, 2012, 99(2): 190-214. DOI:10.1007/s11263-012-0524-9
11 Ferrari V, Marin-Jimenez M, Zisserman A. Progressive search space reduction for human pose estimation. In: Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008. 1-8
12 Shi Q X, Di H J, Lu Y, Lü F. Human pose estimation with global motion cues. In: Proceedings of the 2015 IEEE International Conference on Image Processing. Quebec, Canada: IEEE, 2015. 442-446
13 Sapp B, Toshev A, Taskar B. Cascaded models for articulated pose estimation. In: Proceedings of the European Conference on Computer Vision. Heraklion, Greece: Springer, 2010. 406-420
14 Zhao L, Gao X B, Tao D C, Li X L. Tracking human pose using max-margin Markov models. IEEE Transactions on Image Processing, 2015, 24(12): 5274-5287. DOI:10.1109/TIP.2015.2473662
15 Ramakrishna V, Kanade T, Sheikh Y. Tracking human pose by tracking symmetric parts. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013. 3728-3735
16 Cherian A, Mairal J, Alahari K, Schmid C. Mixing body-part sequences for human pose estimation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014. 2361-2368
17 Tokola R, Choi W, Savarese S. Breaking the chain: liberation from the temporal Markov assumption for tracking human poses. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 2424-2431
18 Zhang D, Shah M. Human pose estimation in videos. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 2012-2020
19 Sigal L, Bhatia S, Roth S, Black M J, Isard M. Tracking loose-limbed people. In: Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition. Washington, D. C., USA: IEEE, 2004. 421-428
20 Sminchisescu C, Triggs B. Estimating articulated human motion with covariance scaled sampling. The International Journal of Robotics Research, 2003, 22(6): 371-391. DOI:10.1177/0278364903022006003
21 Weiss D, Sapp B, Taskar B. Sidestepping intractable inference with structured ensemble cascades. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2010. 2415-2423
22 Park D, Ramanan D. N-best maximal decoders for part models. In: Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011. 2627-2634
23 Wang C Y, Wang Y Z, Yuille A L. An approach to pose-based action recognition. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013. 915-922
24 Zuffi S, Romero J, Schmid C, Black M J. Estimating human pose with flowing puppets. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 3312-3319
25 Sapp B, Weiss D, Taskar B. Parsing human motion with stretchable models. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011. 1281-1288
26 Fragkiadaki K, Hu H, Shi J B. Pose from flow and flow from pose. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013. 2059-2066
27 Brox T, Malik J. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(3): 500-513. DOI:10.1109/TPAMI.2010.143
28 Wang H, Klaser A, Schmid C, Liu C L. Action recognition by dense trajectories. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Washington, D. C., USA: IEEE, 2011. 3169-3176
29 Shen H Q, Yu S I, Yang Y, Meng D Y, Hauptmann A. Unsupervised video adaptation for parsing human motion. In: Proceedings of the European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 347-360
30 Di H J, Tao L M, Xu G Y. A mixture of transformed hidden Markov models for elastic motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1817-1830. DOI:10.1109/TPAMI.2009.111
31 Lü Feng, Di Hui-Jun, Lu Yao, Xu Guang-You. Non-rigid tracking method based on layered elastic motion analysis. Acta Automatica Sinica, 2015, 41(2): 295-303 (in Chinese)