Wood Density Prediction of Cunninghamia lanceolata Based on Gray Wolf Algorithm SVM and NIR

DOI: 10.11707/j.1001-7488.20181215

Article information

Tan Nian, Wang Xueshun, Huang Anmin, Wang Chen.

Scientia Silvae Sinicae, 2018, 54(12): 137-141.

Article history

Received: 2017-07-10

Revised: 2018-01-08

Cite this article

Tan Nian, Wang Xueshun, Huang Anmin, Wang Chen. 2018. Wood Density Prediction of Cunninghamia lanceolata Based on Gray Wolf Algorithm SVM and NIR. Scientia Silvae Sinicae, 54(12): 137-141. DOI: 10.11707/j.1001-7488.20181215

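Wood Density Prediction of Cunninghamia lanceolata Based on Gray Wolf Algorithm SVM and NIR

Tan Nian¹, Wang Xueshun¹, Huang Anmin², Wang Chen²

1. School of Science, Beijing Forestry University, Beijing 100083;
2. Research Institute of Wood Industry, CAF, Beijing 100091

Funding: Fundamental Research Funds for the Central Universities (2015ZCQ-LY-01); National Natural Science Foundation of China (31670564)

Corresponding author: Wang Xueshun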

Abstract: 【Objective】A wood density prediction model based on a support vector machine optimized by the grey wolf optimizer (GWO-SVM) is proposed, and near infrared spectroscopy (NIR) is used to predict the wood density of Cunninghamia lanceolata, providing a theoretical basis for quantitative analysis of its wood properties.【Method】The near infrared spectral data and the density data of 109 C. lanceolata samples were normalized; 88 samples formed the training set and 21 samples formed the test set. Principal components were extracted from the 2 151-dimensional spectral data. With the principal components as input variables and the sample density as the output variable, a multiple linear regression (MLR) model, a support vector machine (SVM) model and a GWO-SVM model were established. The coefficient of determination (R²), the mean square error (MSE) and the mean absolute percentage error (MAPE) were used to compare the predictions of the three models.【Result】Five principal components were selected from the spectral data, with a cumulative contribution rate of 98.7%. The P values of the MLR model and of its partial regression coefficients were all less than 0.05, indicating that the MLR model was valid and could be used for density prediction. For the MLR model, R² was 0.771 4, MSE was 0.000 282 1 and MAPE was 3.009 23%. For the SVM model, R² was 0.923 8, MSE was 0.000 233 1 and MAPE was 2.794 50%. The SVM parameters were optimized by the grey wolf optimizer, giving the optimal parameters C=18.366 6 and σ=0.043 3; for the resulting GWO-SVM model, R² was 0.919 2, MSE was 0.000 183 4 and MAPE was 2.496 37%. The MAPE of all three models was within the acceptable range, and the GWO-SVM model had the smallest MAPE and the best prediction performance.【Conclusion】In terms of prediction accuracy, the GWO-SVM model is clearly superior to the MLR model and the SVM model. In terms of the coefficient of determination, both the GWO-SVM model and the SVM model are superior to the MLR model. Combining the GWO-optimized SVM with near infrared spectroscopy is a reasonable and efficient way to predict the density of C. lanceolata.

Key words: near infrared spectroscopy; grey wolf optimizer (GWO); support vector machine (SVM); wood density of Cunninghamia lanceolata

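Wood density is an important index of wood quality and the external expression of the combined internal factors of wood. From density one can estimate wood quality and judge the processing properties of wood as well as physical and mechanical properties such as hardness, strength, shrinkage and swelling (Xu et al., 2016). Near infrared spectroscopy (NIR) is an analytical technique that has developed rapidly in analytical chemistry in recent years. It is simple to operate, fast, accurate and non-destructive to the sample (Lu et al., 2015), and it has been widely used in China and abroad to measure physical properties of wood such as density, strength and moisture content, and to predict chemical properties such as lignin, extractive and carbohydrate content (Chu and Lu, 2014; Hein et al., 2010). At present, NIR prediction for wood relies mainly on multiple linear regression (Lao et al., 2015), principal component regression (Li and Zhang, 2010) and partial least squares (Li and Zhang, 2012). These are all classical linear modeling methods, and they have difficulty modeling highly nonlinear spectral data accurately.

The support vector machine (SVM) is a representative modern intelligent algorithm. Compared with traditional methods, it can build a more accurate nonlinear model from the data, giving the model good generalization ability and improving prediction accuracy (Djemai et al., 2016). Yu et al. (2013) proposed a particle swarm optimization (PSO)-SVM regression model for NIR analysis of wood and found that it predicted the lignin content of Eucalyptus from NIR spectra with high accuracy and good stability. Liang et al. (2016) applied an SVM-based NIR feature variable selection algorithm (SVM-SCARS) to the rapid identification of tree species; their results showed that SVM-SCARS effectively optimizes the spectral feature variables and improves the robustness and applicability of online NIR models in wood property analysis. In practice, however, the randomness and subjectivity of parameter selection in the classical SVM strongly affect the prediction results. Optimization algorithms such as particle swarm optimization and genetic algorithms have been used to select SVM parameters, but they still suffer from slow convergence and a tendency to fall into local extrema. The grey wolf optimizer (GWO) has been reported to offer a more reasonable global search mechanism than these algorithms, with more stable runs and faster convergence (Mirjalili et al., 2014). An SVM optimized by the grey wolf optimizer (GWO-SVM) can therefore be used to predict wood density.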

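Cunninghamia lanceolata is a fast-growing commercial timber species native to China. It grows quickly and has good wood quality, straight grain and uniform structure, and it is widely used in construction, shipbuilding and furniture (Qi et al., 2014). In this study, principal component analysis was used to reduce the dimensionality of the spectral data, and a multiple linear regression model, an SVM model and a GWO-SVM model of C. lanceolata density were established. The predictions of the three models were compared in order to provide a theoretical basis for quantitative analysis of C. lanceolata properties.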

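1 Materials and methods

1.1 Sample source and preparation

The C. lanceolata spectral data were provided by the Research Institute of Wood Industry, Chinese Academy of Forestry. Spectra were acquired with a LabSpec spectrometer (ASD, USA) covering the wavelength range 350 to 2 500 nm. A bifurcated fiber-optic probe was used to collect near infrared spectra from the surface of each sample at a laboratory temperature of (22±1.5) ℃ and a humidity of 50%±3%. Each of the 109 samples was scanned 10 times over the full spectrum (350 to 2 500 nm), and the computer displayed the average spectrum of each sample. The spectra were then converted to Unscrambler® files and saved. The sample spectra are shown in Fig. 1.

Fig. 1 The near-infrared spectra of 109 Cunninghamia lanceolata samples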

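1.2 Data preprocessing

Principal component analysis (PCA) is essentially a dimensionality reduction of high-dimensional variables. Its basic idea is to represent multidimensional data by a small number of mutually uncorrelated principal components that still reflect most of the information in the original variables. Given a set of correlated variables $X(x_1, x_2, x_3, \ldots, x_n)$, a linear transformation converts them into another set of uncorrelated variables $Z(z_1, z_2, \ldots, z_n)$. These new variables are called principal components, and the $i$-th principal component is $Z_i = l_{i1}x_1 + l_{i2}x_2 + l_{i3}x_3 + \cdots + l_{in}x_n$. The principal components are arranged in order of decreasing variance; the larger the variance, the more information about the original variables a component carries. If the variance contribution rate is large enough, the components can be used to represent the information in the original variables.

The pca function in Matlab was used to reduce the data of the 109 samples from 2 151 dimensions to 5 dimensions. The cumulative contribution rate exceeded 98%, so the 5 components can be used to explain the original variables. The processed data were saved in Excel.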
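The selection rule described above (keep the leading components until their cumulative contribution rate is large enough) can be sketched as follows. This is a minimal illustration in Python rather than the Matlab workflow used in the study; the function name `components_needed` and the example variances are hypothetical and are not values from this work.

```python
def components_needed(variances, threshold):
    """Return how many leading components reach the cumulative contribution threshold."""
    ordered = sorted(variances, reverse=True)  # components are ranked by decreasing variance
    total = sum(ordered)
    running = 0.0
    for k, var in enumerate(ordered, start=1):
        running += var
        # Cumulative contribution rate of the first k components.
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical component variances, not the values from this study.
example_variances = [5.0, 3.0, 1.0, 1.0]
print(components_needed(example_variances, 0.75))
```

With the study's own variances, the same rule with a threshold of 0.98 would correspond to the 5 components reported above.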

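1.3 Multiple linear regression

The near infrared spectral data of C. lanceolata were processed by principal component analysis. With the resulting principal components Z₁, Z₂, Z₃, Z₄ and Z₅ as independent variables and the density y as the dependent variable, the multiple regression model is:

$y = b_0 + b_1 Z_1 + b_2 Z_2 + b_3 Z_3 + b_4 Z_4 + b_5 Z_5 + e.$

(1)

where b₀ is the constant term, b₁, …, b₅ are the regression coefficients, and e~N(0, σ²) is the residual.

The reliability of the fitted regression model is tested to judge whether the independent variables affect y. In general, P < 0.05 indicates that an independent variable has a significant effect on the dependent variable.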

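1.4 Support vector machine optimized by the grey wolf algorithm

In the grey wolf algorithm, the wolf pack is divided into 4 ranks, as shown in Fig. 2. The highest-ranked leader is labeled the α wolf and is responsible for decision making during hunting (optimization). The top 3 groups are, in order, the 3 groups with the best fitness.

Fig. 2 The wolf classification of GWO

During optimization, the wolves of each rank search for the prey by continually updating their positions. Once the pack has located the prey, the leader α directs the β and δ wolves to encircle it, and the ω wolves update their positions according to the positions of the first 3 groups and gradually approach the prey. The position update process of the grey wolf algorithm is shown in Fig. 3.

Fig. 3 The location update process of GWO

The basic idea of optimizing the support vector machine with the grey wolf algorithm is to let the position of an artificial wolf represent the penalty parameter C and the radial basis kernel parameter σ of the SVM. The positions of the artificial wolves are initialized randomly, and the mean square error (MSE) is used as the fitness: the smaller the fitness, the closer the wolf is to the target. The wolf positions are updated, the SVM is trained, and the trained SVM model is used for prediction. The main steps are as follows:

1) Initialize the parameters, including the number of wolves, the positions of the individual wolves, the maximum number of iterations, and the lower and upper bounds of C and σ.

2) Use the training set to compute the fitness of each wolf, and select the 3 best wolves as the α, β and δ wolves.

3) Iteratively update the positions of the ω wolves until the maximum number of iterations is reached.

4) Output the position of the α wolf, that is, the optimal parameters C and σ.

5) Build the model with the optimal C and σ, predict the test set, and analyze the prediction results.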
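Steps 1) to 4) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Matlab implementation used in the study: it uses the commonly published GWO position update (Mirjalili et al., 2014), in which every wolf moves towards the average of three positions guided by α, β and δ, and it replaces the SVM training error with a hypothetical stand-in function `stand_in_mse`, because training an SVM is outside the scope of a short example. The names `gwo_minimize`, `a_coef` and `c_coef` are likewise illustrative.

```python
import math
import random


def gwo_minimize(fitness, lower, upper, dim=2, n_wolves=100, max_iter=30, seed=0):
    """Minimize `fitness` over [lower, upper]^dim with a basic grey wolf optimizer."""
    rng = random.Random(seed)
    # Step 1: random initial positions inside the bounds.
    wolves = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_wolves)]
    leaders = []  # up to three (score, position) pairs: alpha, beta, delta
    history = []  # best (alpha) fitness after each iteration
    for it in range(max_iter):
        # Step 2: score every wolf and keep the three best positions seen so far.
        for pos in wolves:
            leaders.append((fitness(pos), list(pos)))
        leaders = sorted(leaders, key=lambda pair: pair[0])[:3]
        history.append(leaders[0][0])
        # Step 3: move each wolf towards the average of the leader-guided positions.
        a = 2.0 * (1.0 - it / max_iter)  # decreases linearly from 2 towards 0
        for pos in wolves:
            for d in range(dim):
                guided = 0.0
                for _, leader in leaders:
                    a_coef = 2.0 * a * rng.random() - a
                    c_coef = 2.0 * rng.random()
                    guided += leader[d] - a_coef * abs(c_coef * leader[d] - pos[d])
                # Keep the new coordinate inside the search bounds.
                pos[d] = min(upper, max(lower, guided / len(leaders)))
    # Step 4: the alpha position is the best parameter pair found.
    alpha_score, alpha_pos = leaders[0]
    return alpha_pos, alpha_score, history


def stand_in_mse(params):
    """Hypothetical stand-in for the SVM training MSE as a function of (C, sigma)."""
    c, sigma = params
    return (math.log10(c) - 1.0) ** 2 + (math.log10(sigma) + 1.0) ** 2


best_pos, best_score, history = gwo_minimize(stand_in_mse, 0.001, 1000.0)
print("best C, sigma:", best_pos, "stand-in fitness:", best_score)
```

In a real GWO-SVM run, `stand_in_mse` would be replaced by a function that trains an SVM with the given C and σ on the training set and returns its MSE, which is step 5) above.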

1.5 Model comparison

Three indices, the coefficient of determination (R²), the mean square error (MSE) and the mean absolute percentage error (MAPE), were used to compare the models:

$R^2 = \frac{\sum\limits_{i=1}^{n}\left(\hat y_i - \bar y\right)^2}{\sum\limits_{i=1}^{n}\left(y_i - \bar y\right)^2};$

(2)

$\mathrm{MSE} = \frac{1}{n}\sum\limits_{i=1}^{n}\left(\hat y_i - y_i\right)^2;$

(3)

$\mathrm{MAPE} = \frac{1}{n}\sum\limits_{i=1}^{n}\left|\frac{y_i - \hat y_i}{y_i}\right|.$

(4)

where $\bar y$ is the mean of the measured densities; $y_i$ is the measured density of the $i$-th sample; $\hat y_i$ is the predicted density of the $i$-th sample; and $n$ is the number of samples.

The closer the coefficient of determination (R²) is to 1, the better the model; the smaller the mean square error (MSE) and the mean absolute percentage error (MAPE), the better the prediction performance of the model.
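The three indices can be written directly from Eqs. (2) to (4). The sketch below is a plain Python illustration with hypothetical function names and toy data; it follows Eq. (2) as written (explained sum of squares over total sum of squares), which is not the same as the 1 − SSE/SST form used by some software, and it returns MAPE as a fraction, so multiply by 100 to express it in percent.

```python
def r_squared(y_true, y_pred):
    """Eq. (2): explained sum of squares divided by total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    explained = sum((p - mean_y) ** 2 for p in y_pred)
    total = sum((t - mean_y) ** 2 for t in y_true)
    return explained / total


def mse(y_true, y_pred):
    """Eq. (3): mean of the squared prediction errors."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Eq. (4): mean absolute error relative to the measured value (as a fraction)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only, not data from this study.
measured = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
print(r_squared(measured, predicted), mse(measured, predicted), mape(measured, predicted))
```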

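2 Results and analysis

Principal component analysis of the near infrared spectral data of the 109 C. lanceolata samples was carried out in Matlab, and 5 principal components were selected. The contribution rate of each principal component is shown in Tab. 1; the cumulative contribution rate reached 98.7%.

Tab. 1 The contribution rate of 5 principal components

88 samples were selected as the training set, and R was used to fit a multiple linear regression model of density on the near infrared principal components. Backward elimination was applied and the optimal model was chosen by minimum AIC. At AIC=-574.21, the optimal model was:

$y = 0.398\;59 - 0.021\;47Z_1 - 0.015\;77Z_2 - 0.039\;6Z_3 - 0.057\;5Z_4.$

The P values of the partial regression coefficients of Z₁, Z₂, Z₃ and Z₄ were $2.3\times10^{-10}$, 0.006 85, 0.028 34 and 0.002 97, respectively, all satisfying P < 0.05, and the P value of the whole model was also less than 0.05, indicating that the regression model is statistically meaningful. The fitted model was then used to predict the density of the 21 samples in the test set.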
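Applying the fitted model to a new sample is a direct evaluation of the equation above. The short Python sketch below uses the reported coefficients; the function name `predict_density` and the input scores in the usage line are hypothetical, and the prediction is in whatever units the density had when the model was fitted.

```python
# Coefficients of the fitted MLR model reported above: intercept, then Z1..Z4.
COEFFS = (0.39859, -0.02147, -0.01577, -0.0396, -0.0575)


def predict_density(z1, z2, z3, z4):
    """Evaluate y = b0 + b1*Z1 + b2*Z2 + b3*Z3 + b4*Z4 for one sample."""
    b0, b1, b2, b3, b4 = COEFFS
    return b0 + b1 * z1 + b2 * z2 + b3 * z3 + b4 * z4


# Hypothetical principal component scores, for illustration only.
print(predict_density(0.5, -0.2, 0.1, 0.0))
```

Z₅ does not appear because backward elimination removed it from the optimal model.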

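The SVM model and the GWO-SVM model were run in Matlab 2014b. The SVM model used default parameter settings with a radial basis function kernel. The search settings of the GWO-SVM model were: 100 wolves; a maximum of 30 iterations; 2 dimensions, because the parameters to be optimized are C and σ; and a lower bound of 0.001 and an upper bound of 1 000 for the parameter values. The data were normalized before training, and the resulting fitness curve is shown in Fig. 4.

Fig. 4 The fitness curve of GWO

As Fig. 4 shows, the mean square error had dropped sharply by the 3rd iteration, indicating that the grey wolf algorithm converges quickly and then stabilizes. The optimal parameters obtained by the search were C=18.366 6 and σ=0.043 3. The trained model was used to predict the density of the test set, and the predictions were then de-normalized.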
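The normalization and de-normalization steps can be sketched as below. The source does not state which scaling was used, so this Python sketch assumes min-max scaling to [0, 1] with bounds taken from the training data; the function names and the example values are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) bounds used for scaling, taken from the training data."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert `normalize`, returning predictions to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical densities in g·cm^-3, for illustration only.
train_density = [0.35, 0.40, 0.50]
lo, hi = fit_min_max(train_density)
scaled = normalize(train_density, lo, hi)
print(scaled, denormalize(scaled, lo, hi))
```

Reusing the training bounds for the test set keeps the two sets on the same scale, so a model trained on scaled densities produces predictions that `denormalize` can map back to g·cm^-3.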

The predicted and actual values for the 3 models are compared in Tab. 2. The predictions of the GWO-SVM model are closer to the actual values, indicating better prediction performance.

Tab. 2 Comparison of the actual and predicted values of the Cunninghamia lanceolata density (g·cm^-3)

Serial No.	Density actual value	Predicted value of MLR model	Predicted value of SVM model	Predicted value of GWO-SVM model
1	0.364 854 8	0.358 942 7	0.362 753 2	0.361 706 2
2	0.360 716 3	0.355 624 6	0.358 003 8	0.355 622 2
3	0.373 117 2	0.372 865 7	0.372 076 9	0.371 589 3
4	0.377 552 4	0.377 445 0	0.374 479 0	0.374 678 2
5	0.386 746 3	0.391 334 3	0.387 707 1	0.385 833 8
6	0.498 155 1	0.481 869 7	0.478 325 1	0.497 003 1
7	0.511 636 9	0.504 292 9	0.487 652 3	0.510 579 1
8	0.379 574 6	0.373 024 0	0.378 316 5	0.378 932 2
9	0.401 943 7	0.402 607 4	0.392 098 6	0.392 107 7
10	0.386 988 2	0.432 333 3	0.402 146 1	0.408 055 5
11	0.365 938 6	0.359 579 8	0.363 625 9	0.361 902 1
12	0.434 392 1	0.406 537 8	0.403 277 9	0.404 186 5
13	0.414 781 9	0.400 609 9	0.400 937 1	0.400 739 8
14	0.378 959 1	0.379 757 9	0.374 834 2	0.374 475 8
15	0.373 779 2	0.367 608 3	0.367 542 8	0.366 722 6
16	0.344 980 6	0.376 017 3	0.370 346 5	0.372 989 5
17	0.393 130 4	0.385 516 7	0.390 332 7	0.389 665 0
18	0.379 306 3	0.404 050 7	0.403 991 9	0.404 613 6
19	0.502 691 8	0.480 560 2	0.475 941 0	0.487 248 0
20	0.380 978 6	0.395 709 2	0.398 512 9	0.398 674 9
21	0.372 273 1	0.377 530 8	0.380 697 3	0.380 987 9

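The 3 density prediction models are compared in Tab. 3. Judged by the coefficient of determination (R²), the MLR, SVM and GWO-SVM models can all achieve effective prediction. Looking at the specific values, the R² of the SVM model (0.923 8) is slightly better than that of the GWO-SVM model (0.919 2), and the R² of both is clearly better than that of the MLR model (0.771 4). Comparing the mean square error (MSE) and the mean absolute percentage error (MAPE), the GWO-SVM model has the smallest MSE and MAPE, followed by the SVM model, and the MAPE of both is below 3%. This indicates that, among the 3 models, the GWO-SVM model combined with near infrared spectroscopy predicts C. lanceolata density most accurately and effectively, followed by the SVM model, and that both outperform the MLR model, which is currently the most widely used.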

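Tab. 3 The comparison of three density models

Model	R²	MSE	MAPE/%
MLR	0.771 4	0.000 282 1	3.009 23
SVM	0.923 8	0.000 233 1	2.794 50
GWO-SVM	0.919 2	0.000 183 4	2.496 37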

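3 Conclusion

This study proposes a C. lanceolata density prediction model that combines a support vector machine optimized by the grey wolf algorithm with near infrared spectroscopy. The results show that the GWO-SVM model performs well in predicting C. lanceolata density, and its prediction accuracy is better than that of traditional multiple linear regression and of the support vector machine. The GWO-SVM prediction model combines the structural risk minimization of the support vector machine with the advantages of the grey wolf global optimization algorithm, giving higher accuracy, and it has good application and research value for quantitative near infrared analysis of C. lanceolata density. Exploring denoising and dimensionality reduction methods for near infrared spectral data, and improving the grey wolf algorithm to further increase its global convergence speed and accuracy, will be the main directions of future research.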

References

Chu X L, Lu W Z. 2014. Research and application progress of near infrared spectroscopy analytical technology in China in the past five years. Spectroscopy and Spectral Analysis, 34(10): 2595-2605. DOI:10.3964/j.issn.1000-0593(2014)10-2595-11 [in Chinese]

Lao W L, Li G Y, Qin T F, et al. 2015. Rapid determination of wood flour content in wood plastic composites by FT-IR combined with multiple linear regression. Chemistry & Industry of Forest Products, 35(3): 20-26. DOI:10.3969/j.issn.0253-2417.2015.03.004 [in Chinese]

Li Y X, Zhang H F. 2010. Study on modeling larch density by NIR and principle component. Forestry Science & Technology, 35(2): 46-48. [in Chinese]

Li Y X, Zhang H F. 2012. Application of nonlinear algorithm in predicting wood density using near infrared spectroscopy. Forest Engineering, 28(5): 38-41. DOI:10.3969/j.issn.1001-005X.2012.05.011 [in Chinese]

Liang L, Fang G G, Wu T, et al. 2016. Fast identification of wood species using near infrared spectroscopy coupled with variables selection methods based on support vector machine. Journal of Instrumental Analysis, 35(1): 101-106. DOI:10.3969/j.issn.1004-4957.2016.01.017 [in Chinese]

Lu W H, Wang C B, Lin Y, et al. 2015. NIRS calibration for predicting wood properties of Eucalyptus. Eucalypt Science & Technology, 32(2): 10-16. DOI:10.3969/j.issn.1674-3172.2015.02.002 [in Chinese]

Qi J W, Zhang B, Liu J S. 2014. Simulation and prediction on biomass density of Chinese fir in Hunan Province. Journal of Central South University of Forestry & Technology, 34(8): 15-18. [in Chinese]

Xu M F, Ke X D, Zhang Y, et al. 2016. Wood densities of six hardwood tree species in eastern Guangdong and influencing factors. Journal of South China Agricultural University, 37(3): 100-106. [in Chinese]

Yu S X, Li X C, Huang A M, et al. 2013. PSO support vector machine combined with NIR determination for the lignin content of Eucalyptus. Journal of Northeast Forestry University, 41(2): 123-126. DOI:10.3969/j.issn.1000-5382.2013.02.029 [in Chinese]

Djemai S, Brahmi B, Bibi M O. 2016. A primal-dual method for SVM training. Neurocomputing, 211: 34-40. DOI:10.1016/j.neucom.2016.01.103

Hein P R G, Lima J T, Chaix G. 2010. Effects of sample preparation on NIR spectroscopic estimation of chemical properties of Eucalyptus urophylla S.T. Blake wood. Holzforschung, 64(1): 45-54.

Mirjalili S, Mirjalili S M, Lewis A. 2014. Grey wolf optimizer. Advances in Engineering Software, 69: 46-61. DOI:10.1016/j.advengsoft.2013.12.007