Journal of Harbin Engineering University  2019, Vol. 40 Issue (9): 1662-1666  DOI: 10.11990/jheu.201812073

XU Feng, ZHANG Xuefen, XIN Zhanhong. A Chinese word segmentation scheme based on a deep neural network model[J]. Journal of Harbin Engineering University, 2019, 40(9), 1662-1666. DOI: 10.11990/jheu.201812073.



A Chinese word segmentation scheme based on a deep neural network model
XU Feng 1, ZHANG Xuefen 2, XIN Zhanhong 1
1. School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Smart City College, Beijing Union University, Beijing 100101, China
Abstract: To address the degraded performance of existing segmentation algorithms and tools on massive volumes of web text, this paper proposes a Chinese word segmentation scheme based on a deep neural network model. An encoder-decoder model (EDM) built on the long short-term memory (LSTM) network is trained on the data to obtain a model that performs word segmentation. To further improve segmentation performance, a correction method based on word vectors is also provided. Experimental results on a typical Weibo dataset show that the proposed scheme performs considerably better than traditional word segmentation software. In addition, the precision and F value obtained after word-vector correction are slightly higher than those obtained without correction. These results indicate the effectiveness of the proposed segmentation scheme.
Keywords: Chinese word segmentation; long short-term memory network; encoder-decoder model; word vector; accuracy rate; F value

1 LSTM-based EDM

1.1 LSTM model

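LSTM is a deep neural network model proposed to solve the vanishing-gradient problem of the traditional RNN model [13]. Compared with the traditional RNN structure, the LSTM network introduces a memory cell C and three gates, the input gate (i), the forget gate (f), and the output gate (o), which control what information is passed on and what is discarded, so that historical information is handled better.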
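The gate mechanism described above can be sketched in a few lines of plain Python. The update rules below are the standard LSTM formulation, which this excerpt does not spell out; the names `lstm_step`, `affine`, and `params`, as well as the candidate value `g`, are illustrative assumptions and not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W v + b, where W is a list of rows and b, v are lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step (standard formulation, assumed here).

    params maps 'i', 'f', 'o', 'g' to a (W, b) pair that acts on the
    concatenation [x, h_prev]. Returns the new hidden state and memory cell.
    """
    v = list(x) + list(h_prev)
    i = [sigmoid(z) for z in affine(*params["i"], v)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], v)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], v)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], v)]  # candidate content
    # The forget gate scales the old memory; the input gate admits new content.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # The output gate decides how much of the memory cell is exposed.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Toy usage: 2 inputs, 2 hidden units, all weights zero (hypothetical values).
zero_W = [[0.0] * 4 for _ in range(2)]
zero_b = [0.0, 0.0]
toy_params = {name: (zero_W, zero_b) for name in "ifog"}
h, c = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, 1.0], toy_params)
```

With all weights at zero every gate evaluates to 0.5, so half of the previous memory cell is kept, which makes the role of the forget gate easy to check by hand.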

1.2 LSTM-based EDM

$\boldsymbol{C} = f(X_1, X_2, \cdots, X_m)$ (1)

$Y_t = g(Y_1, Y_2, \cdots, Y_{t-1})$ (2)

$e_{ij} = \operatorname{score}(s_{i-1}, h_j)$ (3)
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum\limits_k \exp(e_{ik})}$ (4)
$C_i = \sum\limits_j \alpha_{ij} h_j$ (5)
$s_i = f(s_{i-1}, y_1, \cdots, y_{i-1}, C_i)$ (6)
$y_i = g(y_1, \cdots, y_{i-1}, s_i, C_i)$ (7)
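Eqs. (3) to (5) can be traced with a small numeric sketch: score each encoder state against the previous decoder state, normalize the scores with a softmax, and take the weighted sum as the context vector. The excerpt does not say which score function is used, so a plain dot product is assumed here; `attention_context` and `dot` are hypothetical names.

```python
import math


def dot(u, v):
    """Dot product, used as an ASSUMED score function for Eq. (3)."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(s_prev, encoder_states, score=dot):
    """Return (alpha, context) for one decoder step.

    s_prev:         previous decoder state s_{i-1}
    encoder_states: list of encoder hidden states h_j
    """
    e = [score(s_prev, h_j) for h_j in encoder_states]   # Eq. (3)
    m = max(e)                                           # shift for numerical stability
    exps = [math.exp(e_j - m) for e_j in e]
    total = sum(exps)
    alpha = [x / total for x in exps]                    # Eq. (4)
    dim = len(encoder_states[0])
    context = [
        sum(a_j * h_j[k] for a_j, h_j in zip(alpha, encoder_states))
        for k in range(dim)
    ]                                                    # Eq. (5)
    return alpha, context


# Toy usage: a zero decoder state scores every encoder state equally.
alpha, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Subtracting the maximum score before exponentiation leaves the weights of Eq. (4) unchanged and only guards against overflow.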

2 Word vector model

$\boldsymbol{X}_i = \sum\limits_j X_{ij}$ (8)

$J(\boldsymbol{W}) = \sum\limits_{i,j} f(X_{ij}) \left( \boldsymbol{W}_{ij}^{\mathrm{T}} - \log_2 X_{ij} \right)^2$ (9)

$f(X_{ij}) = \begin{cases} (X_{ij}/X_{\max})^{\alpha}, & X_{ij} < X_{\max} \\ 1, & X_{ij} \ge X_{\max} \end{cases}$ (10)
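The piecewise weighting function of Eq. (10) is easy to check numerically. The sketch below implements it directly; the function name `weight` is hypothetical, and the default values `x_max=100` and `alpha=0.75` are the ones commonly used with GloVe, whose weighting function has the same form. This excerpt does not state which values the authors chose.

```python
def weight(x_ij, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij) of Eq. (10).

    x_ij:  co-occurrence count of word i and word j
    x_max: saturation count (ASSUMED default, not given in the excerpt)
    alpha: exponent (ASSUMED default, not given in the excerpt)
    """
    if x_ij < x_max:
        # Rare pairs are down-weighted smoothly.
        return (x_ij / x_max) ** alpha
    # Frequent pairs are capped at 1 so they cannot dominate the loss.
    return 1.0


# Toy usage: weights for a few co-occurrence counts.
weights = [weight(x) for x in (0, 25, 100, 500)]
```

The cap at 1 means that a pair seen 500 times contributes to Eq. (9) with the same weight as a pair seen `x_max` times.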

3 Design and implementation of the word segmentation scheme

3.1 Chinese word segmentation scheme

Fig. 3 Flowchart of the proposed Chinese word segmentation scheme

$\cos\theta = \frac{\langle \boldsymbol{W}_i, \boldsymbol{W}_j \rangle}{|\boldsymbol{W}_i|\,|\boldsymbol{W}_j|}$ (11)

$\boldsymbol{W}_{\mathrm{out}} = \frac{\boldsymbol{W}_i + \boldsymbol{W}_{i+1}}{|\boldsymbol{W}_i + \boldsymbol{W}_{i+1}|}$ (12)
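Eqs. (11) and (12) amount to a cosine similarity between two word vectors and a unit-length sum of two adjacent word vectors. A minimal sketch follows; `cosine` and `merged_vector` are hypothetical names, and because this excerpt does not state how the threshold λ of Fig. 4 enters the correction rule, the sketch stops at the two formulas.

```python
import math


def norm(v):
    """Euclidean length |v|."""
    return math.sqrt(sum(x * x for x in v))


def cosine(w_i, w_j):
    """Cosine similarity of Eq. (11). Raises ZeroDivisionError for a zero vector."""
    inner = sum(a * b for a, b in zip(w_i, w_j))
    return inner / (norm(w_i) * norm(w_j))


def merged_vector(w_i, w_next):
    """Normalized sum of two adjacent word vectors, Eq. (12)."""
    summed = [a + b for a, b in zip(w_i, w_next)]
    length = norm(summed)  # zero only if the two vectors cancel exactly
    return [x / length for x in summed]


# Toy usage with hypothetical 2-dimensional word vectors.
similarity = cosine([1.0, 2.0], [2.0, 4.0])
w_out = merged_vector([3.0, 0.0], [0.0, 4.0])
```

The result of Eq. (12) always has unit length, so it can be compared with other vectors by Eq. (11) without the vector magnitudes affecting the outcome.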

3.2 Tests and analysis

$P = \frac{D_1}{D_0} \times 100\%$ (13)
$R = \frac{D_1}{D} \times 100\%$ (14)
$F = \frac{P \times R}{(1 - a) \times P + a \times R}$ (15)
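The evaluation measures of Eqs. (13) to (15) can be worked through with a short sketch that takes the second ratio, D_1/D, as the recall R used in Eq. (15). The meanings of the counts are not defined in this excerpt; the code assumes the usual convention that D_1 is the number of correctly segmented words, D_0 the number of words output by the segmenter, and D the number of words in the reference segmentation. The function name `prf` and the default `a=0.5` are likewise assumptions.

```python
def prf(d1, d0, d, a=0.5):
    """Return (P, R, F) as fractions in [0, 1].

    d1: correctly segmented words            (ASSUMED meaning of D_1)
    d0: words produced by the segmenter      (ASSUMED meaning of D_0)
    d:  words in the reference segmentation  (ASSUMED meaning of D)
    a:  weighting factor of Eq. (15); 0.5 weights P and R equally
    """
    p = d1 / d0                                      # Eq. (13), without the 100% factor
    r = d1 / d                                       # Eq. (14), without the 100% factor
    denominator = (1 - a) * p + a * r
    f = p * r / denominator if denominator else 0.0  # Eq. (15)
    return p, r, f


# Toy usage with made-up counts, not results from the paper.
p, r, f = prf(d1=80, d0=100, d=90)
```

With `a = 0.5`, Eq. (15) reduces to the harmonic mean 2PR/(P + R), which is the familiar F1 score.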

Fig. 4 Correction performance under different thresholds λ

4 Conclusions

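1) The proposed LSTM-based EDM segmentation model performs considerably better than the traditional jieba word segmentation software.

2) After correction with the proposed word-vector method, the segmentation precision and F value are slightly better than those obtained without correction. This verifies the effectiveness of the proposed LSTM-based Chinese word segmentation scheme and of the word-vector correction method.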

