 CAAI Transactions on Intelligent Systems  2017, Vol. 12 Issue (5): 661-667  DOI: 10.11992/tis.201706012

CHEN Pei, JING Liping. Word representation learning model using matrix factorization to incorporate semantic information[J]. CAAI Transactions on Intelligent Systems, 2017, 12(5): 661-667. DOI: 10.11992/tis.201706012.

Word representation learning model using matrix factorization to incorporate semantic information
CHEN Pei, JING Liping
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
Abstract: Word representation plays an important role in natural language processing and has attracted considerable research attention because of its simplicity and effectiveness. However, traditional methods for learning word representations generally rely on large amounts of unlabeled training data and neglect semantic information about words, such as the semantic relationships between them. To make full use of existing domain knowledge bases, which contain rich semantic information about words, we propose KbEMF, a word representation learning method that incorporates semantic information. The method uses matrix factorization to add a domain-knowledge constraint term to the word representation learning model, so that words with strong semantic relationships obtain representations that are close to each other. Results on word analogy reasoning and word similarity measurement tasks using real data show that KbEMF outperforms existing models.
Key words: natural language processing; word representation; matrix factorization; semantic information; knowledge base

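1 Background on matrix factorization models for word representation learning

The KbEMF model is built by extending a matrix factorization model for learning word vectors. This section introduces the background needed to learn word vectors by matrix factorization.

EMF model   Word vectors learned by the skip-gram model perform well on many natural language processing tasks, yet the model lacks a clear theoretical explanation. EMF therefore starts from a representation learning perspective and reformulates the skip-gram objective function, interpreting it exactly as a matrix factorization model: a word vector is the hidden representation, under a softmax loss, of the explicit word vector d_w with respect to the representation dictionary C. EMF thus shows directly and explicitly that skip-gram learns word vectors by factorizing the word co-occurrence matrix. This proof provides a solid theoretical basis for generalizing and extending skip-gram. The EMF objective function is given by Eq. (1):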

 $\min_{\boldsymbol{W},\boldsymbol{C}} \zeta(\boldsymbol{X}, \boldsymbol{WC}) = -\mathrm{tr}(\boldsymbol{X}^{\mathrm{T}}\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}) + \sum_{w \in V} \ln\left(\sum_{\boldsymbol{X}'_w \in S_w} \mathrm{e}^{\boldsymbol{X}_w^{\prime\mathrm{T}}\boldsymbol{C}^{\mathrm{T}}\boldsymbol{w}}\right)$ (1)

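2 Matrix factorization word representation learning model incorporating semantic information

2.1 Extracting semantic information and building the semantic matrix

2.2 Building the semantic constraint model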

 $\begin{aligned} R &= \sum_{w_i, w_j \in V} S_{ij}\left\| \boldsymbol{w}_i - \boldsymbol{w}_j \right\|^2 \\ &= \sum_{i,j=1}^{|V|} S_{ij}\left(\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_i + \boldsymbol{w}_j^{\mathrm{T}}\boldsymbol{w}_j - 2\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_j\right) \\ &= \sum_{i=1}^{|V|}\Big(\sum_{j=1}^{|V|} S_{ij}\Big)\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_i + \sum_{j=1}^{|V|}\Big(\sum_{i=1}^{|V|} S_{ij}\Big)\boldsymbol{w}_j^{\mathrm{T}}\boldsymbol{w}_j - 2\sum_{i,j=1}^{|V|} S_{ij}\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_j \\ &= \sum_{i=1}^{|V|} S_i\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_i + \sum_{j=1}^{|V|} S_j\boldsymbol{w}_j^{\mathrm{T}}\boldsymbol{w}_j - 2\sum_{i,j=1}^{|V|} S_{ij}\boldsymbol{w}_i^{\mathrm{T}}\boldsymbol{w}_j \\ &= \mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}\boldsymbol{S}_{\mathrm{row}}\boldsymbol{W}) + \mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}\boldsymbol{S}_{\mathrm{col}}\boldsymbol{W}) - 2\,\mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}\boldsymbol{S}\boldsymbol{W}) \\ &= \mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}(\boldsymbol{S}_{\mathrm{row}} + \boldsymbol{S}_{\mathrm{col}} - 2\boldsymbol{S})\boldsymbol{W}) \end{aligned}$

 $R = \mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}(\boldsymbol{S}_{\mathrm{row}} + \boldsymbol{S}_{\mathrm{col}} - 2\boldsymbol{S})\boldsymbol{W})$ (2)
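
As a numerical sanity check on the derivation of Eq. (2), the following minimal sketch compares the pairwise form of R with its trace form on random data. It assumes, as the derivation suggests, that the rows of W are the word vectors w_i and that S_row and S_col are diagonal matrices holding the row sums and column sums of S. The function names are illustrative, and S here is a toy random matrix rather than one extracted from a knowledge base.

```python
import numpy as np

def semantic_penalty_pairwise(W, S):
    """R as the S-weighted sum of squared distances between word vectors."""
    diff = W[:, None, :] - W[None, :, :]       # (n, n, d) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)       # (n, n) squared distances
    return float(np.sum(S * sq_dist))

def semantic_penalty_trace(W, S):
    """R in the trace form of Eq. (2)."""
    S_row = np.diag(S.sum(axis=1))             # assumed: diagonal of row sums
    S_col = np.diag(S.sum(axis=0))             # assumed: diagonal of column sums
    L = S_row + S_col - 2.0 * S
    return float(np.trace(W.T @ L @ W))

rng = np.random.default_rng(0)
n_words, dim = 6, 3
W = rng.normal(size=(n_words, dim))            # one word vector per row
S = rng.random((n_words, n_words))             # toy semantic matrix

r_pairwise = semantic_penalty_pairwise(W, S)
r_trace = semantic_penalty_trace(W, S)
print(r_pairwise, r_trace)                     # the two values should agree
```

The identity does not require S to be symmetric, which is why the row-sum and column-sum matrices appear separately.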

2.3 Model fusion

 $\begin{array}{c} \boldsymbol{O} = -\mathrm{tr}(\boldsymbol{X}^{\mathrm{T}}\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}) + \sum\limits_{w \in V} \ln\left(\sum\limits_{\boldsymbol{X}'_w \in S_w} \mathrm{e}^{\boldsymbol{X}_w^{\prime\mathrm{T}}\boldsymbol{C}^{\mathrm{T}}\boldsymbol{w}}\right) + \\ \gamma\,\mathrm{tr}(\boldsymbol{W}^{\mathrm{T}}(\boldsymbol{S}_{\mathrm{row}} + \boldsymbol{S}_{\mathrm{col}} - 2\boldsymbol{S})\boldsymbol{W}) \end{array}$ (3)

2.4 Solving the model

 $\frac{\partial \boldsymbol{O}}{\partial \boldsymbol{C}} = \left(E_{\boldsymbol{X}|\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}}\boldsymbol{X} - \boldsymbol{X}\right)\boldsymbol{W}^{\mathrm{T}}$ (4)
 $\frac{\partial \boldsymbol{O}}{\partial \boldsymbol{W}} = \boldsymbol{C}\left(E_{\boldsymbol{X}|\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}}\boldsymbol{X} - \boldsymbol{X}\right) + \gamma(\boldsymbol{L} + \boldsymbol{L}^{\mathrm{T}})\boldsymbol{W}$ (5)
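
The regularization part of Eq. (5) can be checked against finite differences. The sketch below assumes that L denotes S_row + S_col - 2S, which is what comparing Eq. (3) with Eq. (5) implies, and that word vectors are stored as rows of W so that the product (L + L^T)W is well defined. The helper names and the toy data are hypothetical.

```python
import numpy as np

def penalty(W, L, gamma):
    """Semantic constraint term gamma * tr(W^T L W)."""
    return gamma * np.trace(W.T @ L @ W)

def penalty_grad(W, L, gamma):
    """Analytic gradient of the constraint term: gamma * (L + L^T) W."""
    return gamma * (L + L.T) @ W

def numerical_grad(f, W, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
n_words, dim, gamma = 5, 3, 0.1
W = rng.normal(size=(n_words, dim))            # one word vector per row
S = rng.random((n_words, n_words))             # toy semantic matrix
L = np.diag(S.sum(axis=1)) + np.diag(S.sum(axis=0)) - 2.0 * S

analytic = penalty_grad(W, L, gamma)
numeric = numerical_grad(lambda M: penalty(M, L, gamma), W)
max_abs_error = float(np.max(np.abs(analytic - numeric)))
print(max_abs_error)                           # should be close to zero
```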

 $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta\left[\boldsymbol{C}\left(E_{\boldsymbol{X}|\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}}\boldsymbol{X} - \boldsymbol{X}\right) + \gamma(\boldsymbol{L} + \boldsymbol{L}^{\mathrm{T}})\boldsymbol{W}\right]$ (6)
 $\boldsymbol{C} \leftarrow \boldsymbol{C} - \eta\left[\left(E_{\boldsymbol{X}|\boldsymbol{C}^{\mathrm{T}}\boldsymbol{W}}\boldsymbol{X} - \boldsymbol{X}\right)\boldsymbol{W}^{\mathrm{T}}\right]$ (7)

1) Random initialization: $\boldsymbol{W}_0$, $\boldsymbol{C}_0$

2) for i = 1 to K, do

3) $\boldsymbol{W}_i = \boldsymbol{W}_{i-1}$

4) for j = 1 to k, do

5) $\boldsymbol{W}_i = \boldsymbol{W}_i - \eta\left[\boldsymbol{C}_{i-1}\left(E_{\boldsymbol{X}|\boldsymbol{C}_{i-1}^{\mathrm{T}}\boldsymbol{W}_i}\boldsymbol{X} - \boldsymbol{X}\right) + \gamma(\boldsymbol{L} + \boldsymbol{L}^{\mathrm{T}})\boldsymbol{W}_i\right]$

6) j = j + 1

7) $\boldsymbol{C}_i = \boldsymbol{C}_{i-1}$

8) for j = 1 to k, do

9) $\boldsymbol{C}_i = \boldsymbol{C}_i - \eta\left(E_{\boldsymbol{X}|\boldsymbol{C}_i^{\mathrm{T}}\boldsymbol{W}_i}\boldsymbol{X} - \boldsymbol{X}\right)\boldsymbol{W}_i^{\mathrm{T}}$

10) j = j + 1

11) i = i + 1
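
The alternating procedure listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation. The text here does not define the expectation E_{X|C^T W} X, so the sketch uses a stand-in, Q * sigmoid(score), borrowed from the negative-sampling view of skip-gram, with a hypothetical weight matrix Q. It also stores W and C with one vector per row, so the gradients of Eqs. (4) and (5) appear in transposed form, and each update steps in the direction that decreases O. The function names, hyperparameter defaults, and toy data are all invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_cooccurrence(W, C, Q):
    """Stand-in for E_{X|C^T W} X (assumption: Q * sigmoid of the scores)."""
    return Q * sigmoid(W @ C.T)

def kbemf_sketch(X, Q, L, dim, gamma=0.01, eta=1e-3,
                 outer_iters=20, inner_iters=5, seed=0):
    """Alternating gradient steps on W and C, with rows as vectors."""
    rng = np.random.default_rng(seed)
    n_words, n_contexts = X.shape
    W = 0.1 * rng.normal(size=(n_words, dim))      # random initialization
    C = 0.1 * rng.normal(size=(n_contexts, dim))
    for _ in range(outer_iters):
        for _ in range(inner_iters):               # update W with C fixed
            residual = expected_cooccurrence(W, C, Q) - X
            W = W - eta * (residual @ C + gamma * (L + L.T) @ W)
        for _ in range(inner_iters):               # update C with W fixed
            residual = expected_cooccurrence(W, C, Q) - X
            C = C - eta * (residual.T @ W)
    return W, C

# Toy data: random counts and a single related word pair (0, 1).
rng = np.random.default_rng(0)
n = 8
X = rng.poisson(2.0, size=(n, n)).astype(float)    # toy co-occurrence counts
Q = X + 1.0                                        # hypothetical weights
S = np.zeros((n, n))
S[0, 1] = S[1, 0] = 1.0                            # toy semantic relation
L = np.diag(S.sum(axis=1)) + np.diag(S.sum(axis=0)) - 2.0 * S

W, C = kbemf_sketch(X, Q, L, dim=4)
print(W.shape, C.shape)
```

Storing vectors as rows keeps the constraint gradient in the form (L + L^T)W used in Eq. (5); with vectors stored as columns, the same term would be written W(L + L^T).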

3 Experiments and results

3.1 Datasets

3.2 Experimental settings

 Fig.1 Accuracy of KbEMF when incorporating semantic knowledge for word analogical reasoning, for different vector sizes and semantic combination weights

3.3 Word analogical reasoning

 Fig.2 Accuracy of KbEMF compared with EMF for different word-frequency filtering thresholds

3.4 Word similarity measurement

4 Conclusion