
CAAI Transactions on Intelligent Systems, 2020, Vol. 15, Issue (5): 900-909. DOI: 10.11992/tis.201906054

### Cite this article

LIU Dongjingdian, MENG Xuechun, ZHANG Zixin, et al. A behavioral recognition algorithm based on 2D spatiotemporal information extraction[J]. CAAI Transactions on Intelligent Systems, 2020, 15(5): 900-909. DOI: 10.11992/tis.201906054.


A behavioral recognition algorithm based on 2D spatiotemporal information extraction
LIU Dongjingdian, MENG Xuechun, ZHANG Zixin, YANG Xu, NIU Qiang
College of Computer Science & Technology, China University of Mining and Technology, Xuzhou 221008, China
Abstract: Human behavior recognition based on computer vision is currently a research hotspot and is widely applied in many fields of social life, such as behavior detection and video surveillance. Traditional behavior recognition methods are computationally cumbersome and have poor real-time performance. The development of deep learning has therefore greatly improved the accuracy of behavior recognition algorithms; nevertheless, compared with the field of image processing, a certain performance gap remains. We introduce a novel behavior recognition algorithm based on DenseNet, which uses DenseNet as the network architecture and learns spatiotemporal information through 2D convolution: frames that characterize the behavior in a video are selected, organized into RGB space in spatiotemporal order, and fed into the network for training. We have carried out a large number of experiments on the UCF101 dataset, and our method reaches an accuracy of 94.46%.
Key words: behavior recognition; video analysis; neural networks; deep learning; convolutional neural networks; classification; spatiotemporal feature; DenseNet

1 Related work

1.1 Convolutional networks

ResNet introduces residual blocks: a shortcut connection passes the current output directly to later layers, bypassing the nonlinear transformation, so gradients can flow straight back to earlier layers, which helps alleviate the vanishing- and exploding-gradient problems. A drawback of this design, however, is that combining a layer's output with its convolutional transform by element-wise addition may impede the information flow through the network[5-6].
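The shortcut can be sketched in a few lines; the toy transform below is a hypothetical stand-in for the conv-BN-ReLU stack, chosen only to keep the example self-contained:

```python
import numpy as np

def residual_block(x, transform):
    """ResNet-style block: the input bypasses the nonlinear
    transform through an identity shortcut and is added back,
    so gradients reach earlier layers unchanged."""
    return x + transform(x)   # element-wise addition; shapes must match

# Hypothetical stand-in for a conv-BN-ReLU transform:
def toy_transform(x):
    return np.maximum(0.0, 0.5 * x - 1.0)

x = np.array([1.0, 2.0, 4.0])
y = residual_block(x, toy_transform)
# With a zero transform, the block reduces to the identity mapping:
assert np.allclose(residual_block(x, lambda t: np.zeros_like(t)), x)
```

Because the shortcut is an identity map, the Jacobian of the block is the identity plus the Jacobian of the transform, which is why gradients do not vanish across the skip path.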

DenseNet builds on ResNet with a different connection pattern. Within a dense block, it establishes dense connections from every layer to all subsequent layers; that is, the input of each layer is the set of feature maps of all preceding layers. Unlike ResNet, which accumulates values by addition, DenseNet accumulates along the channel dimension, thereby overcoming ResNet's drawback and improving the information flow. The network is composed of dense blocks with a transition layer between every two of them. The structure inside a dense block follows ResNet's bottleneck design, while each transition layer contains a $1\times 1$ convolution layer and a $2\times 2$ average-pooling layer. DenseNet reduces the number of parameters, makes the network narrower, mitigates the vanishing-gradient problem, strengthens feature propagation, and encourages feature reuse[6].
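A minimal numpy sketch of the transition layer, assuming the $1\times 1$ convolution is represented as a channel-mixing matrix `w` (an assumption for illustration; a real implementation would use a convolution layer with batch normalization):

```python
import numpy as np

def transition_layer(x, w):
    """DenseNet-style transition between dense blocks:
    a 1x1 convolution (channel mixing via matrix w of shape
    [C_out, C_in]) followed by 2x2 average pooling.
    x has shape [C_in, H, W] with H and W even."""
    # 1x1 conv: each output channel is a linear mix of input channels
    y = np.tensordot(w, x, axes=([1], [0]))           # [C_out, H, W]
    # 2x2 average pooling with stride 2 halves the spatial dims
    c, h, wd = y.shape
    return y.reshape(c, h // 2, 2, wd // 2, 2).mean(axis=(2, 4))

x = np.random.rand(8, 4, 4)    # 8 channels, 4x4 spatial
w = np.random.rand(4, 8)       # 1x1 conv halving the channel count
out = transition_layer(x, w)
assert out.shape == (4, 2, 2)  # channels reduced, spatial dims halved
```

The transition layer is what keeps channel counts manageable: without it, the channel-wise concatenation inside dense blocks would grow without bound across the network.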

1.2 Behavior recognition algorithms

2 Design of the 2D spatiotemporal convolution and the organization of spatiotemporal features

2.1 Understanding 2D convolution and feasibility analysis of spatiotemporal feature extraction

 $d=\frac{2\times w}{k}$ (1)

 $d=\frac{2\times w}{k\times {f}^{n}}$ (2)

 ${A_{n,i,j}} = R\left[ {\begin{array}{*{20}{c}} {{A_{n - 1,i - \frac{{k - 1}}{2},j - \frac{{k - 1}}{2}}}}& \cdots &{{A_{n - 1,i - \frac{{k - 1}}{2},j + \frac{{k - 1}}{2}}}}\\ \vdots & & \vdots \\ {{A_{n - 1,i + \frac{{k - 1}}{2},j - \frac{{k - 1}}{2}}}}& \cdots &{{A_{n - 1,i + \frac{{k - 1}}{2},j + \frac{{k - 1}}{2}}}} \end{array}} \right]$ (3)

 ${r}_{n}=\left({r}_{n-1}+k-1\right)\times {f}_{n-1}$ (4)
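The recurrence in Eq. (4) can be iterated layer by layer. The sketch below takes the formula at face value, assuming for illustration that the kernel size $k$ and the factor $f$ are constant across layers (the paper allows them to vary per layer):

```python
def iterate_recurrence(r0, k, f, n_layers):
    """Iterate Eq. (4), r_n = (r_{n-1} + k - 1) * f_{n-1},
    with k and f held constant across layers for simplicity.
    Returns the sequence [r_0, r_1, ..., r_n]."""
    history = [r0]
    for _ in range(n_layers):
        history.append((history[-1] + k - 1) * f)
    return history

# With r_0 = 1, k = 3, f = 2:
# r_1 = (1 + 2) * 2 = 6, then r_2 = (6 + 2) * 2 = 16
assert iterate_recurrence(1, 3, 2, 2) == [1, 6, 16]
```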

2.2 Organization of frame selection and splicing

2.3 The flipping operation and its rationale

2.4 The choice of DenseNet

DenseNet was the best paper of CVPR 2017. Unlike earlier networks that improved along the width (the Inception structure) or the depth (the resblock structure), it improves along the feature dimension of the model: features extracted at different convolution stages are densely connected along the channel dimension, which preserves richer information. DenseNet establishes dense connections within a denseblock from every layer to all subsequent layers, i.e., the input of each layer is the feature maps of all preceding layers, so the output ${x}_{l}$ of the $l$-th layer can be written as:

 ${x}_{l}={H}_{l}\left(\left\{{x}_{0},{x}_{1},\cdots, {x}_{l-1}\right\}\right)$ (5)
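Eq. (5) can be sketched directly: each layer consumes the channel-wise concatenation of all earlier feature maps. The composite function `toy_h` below is a hypothetical stand-in for the BN-ReLU-conv composite $H_l$, used only to make the channel bookkeeping concrete:

```python
import numpy as np

def dense_block(x0, n_layers, growth, h):
    """Dense connectivity of Eq. (5): the l-th layer receives
    x_l = H_l([x_0, x_1, ..., x_{l-1}]), the concatenation of all
    earlier feature maps along the channel axis. h(inp, growth)
    stands in for H_l and returns `growth` new channels."""
    features = [x0]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=0)   # channel-wise concat
        features.append(h(inp, growth))
    return np.concatenate(features, axis=0)

# Hypothetical H_l: mean over input channels, replicated `growth` times
def toy_h(inp, growth):
    return np.repeat(inp.mean(axis=0, keepdims=True), growth, axis=0)

x0 = np.random.rand(16, 8, 8)                    # 16 input channels
out = dense_block(x0, n_layers=4, growth=12, h=toy_h)
assert out.shape == (16 + 4 * 12, 8, 8)          # channels accumulate
```

Note how the channel count grows linearly with depth (the growth rate), which is why DenseNet can stay narrow per layer while still passing rich features forward.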

2.5 Introducing a spatiotemporal convolution layer to extract spatiotemporal information

3 Experiments

3.1 Validation of the flipping operation

Fig. 10 Accuracy comparison between no flipping operation and flipping operation

3.2 Improvement from the spatiotemporal convolution layer and feature visualization

Fig. 11 Accuracy comparison between 2DSDCN_R and 2DSDCN_D
Fig. 12 Feature visualization of the denseblock and resblock designs
3.3 Validation of model robustness under different frame-selection schemes

Fig. 13 Accuracy comparison between 2DSDCN_R and 2DSDCN_D with sampling every 5 frames
3.4 Experimental analysis

4 Conclusion

[1] WANG H, SCHMID C. Action recognition with improved trajectories[C]//2013 IEEE International Conference on Computer Vision. Sydney, Australia, 2013: 3551−3558.
[2] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA, 2012: 1097−1105.
[3] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211−252.
[4] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA, 2015: 1−9.
[5] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA, 2016: 770−778.
[6] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 2261−2269.
[7] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv: 1212.0402, 2012.
[8] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278−2324. DOI: 10.1109/5.726791.
[9] CHEN P H, LIN C J, SCHÖLKOPF B. A tutorial on ν-support vector machines[J]. Applied Stochastic Models in Business and Industry, 2005, 21(2): 111−136. DOI: 10.1002/asmb.537.
[10] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA, 2005, 1: 886−893.
[11] CHAUDHRY R, RAVICHANDRAN A, HAGER G, et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA, 2009: 1932−1939.
[12] DALAL N, TRIGGS B, SCHMID C. Human detection using oriented histograms of flow and appearance[C]//European Conference on Computer Vision. Graz, Austria, 2006: 428−441.
[13] WANG H, KLÄSER A, SCHMID C, et al. Action recognition by dense trajectories[C]//2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA, 2011: 3169−3176.
[14] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 6299−6308.
[15] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA, 2016: 1933−1941.
[16] NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA, 2015: 4694−4702.
[17] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Amsterdam, The Netherlands, 2016: 20−36.
[18] LAN Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA, 2017: 1−7.
[19] ZHANG Peihao. Research on action recognition based on pose estimation[D]. Nanjing: Nanjing University of Aeronautics and Astronautics, 2015.
[20] MA Miao. Study on human pose estimation, tracking and human action recognition in videos[D]. Jinan: Shandong University, 2017.
[21] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 4489−4497.
[22] HARA K, KATAOKA H, SATOH Y. Learning spatio-temporal features with 3D residual networks for action recognition[C]//2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy, 2017: 3154−3160.
[23] QIU Z, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 5533−5541.
[24] DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[J]. arXiv preprint arXiv: 1711.08200, 2017.
[25] SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 4597−4605.
[26] TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA, 2018: 6450−6459.
[27] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Advances in Neural Information Processing Systems. Montreal, Canada, 2014: 568−576.