Acta Automatica Sinica (自动化学报), 2018, Vol. 44, Issue (5): 891-900


Data Augmentation for Language Models via Adversarial Training
ZHANG Yi-Ke1,2, ZHANG Peng-Yuan1,2, YAN Yong-Hong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011
Manuscript received: August 28, 2017; accepted: January 14, 2018.
Foundation Item: Supported by National Natural Science Foundation of China (11590770-4, U1536117, 11504406, 11461141004), National Key Research and Development Plan (2016YFB0801203, 2016YFB0801200), and Key Science and Technology Project of Xinjiang Uygur Autonomous Region (2016A03007-1)
Corresponding author: ZHANG Peng-Yuan, Professor at the Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences. He received his Ph.D. degree from the Institute of Acoustics, Chinese Academy of Sciences, in 2007. His research interests cover spontaneous speech recognition.
Recommended by Associate Editor ZUO Wang-Meng
Abstract: The conventional maximum likelihood estimation (MLE) approach to data augmentation for language models suffers from the exposure bias problem, which leads to generated text lacking long-term semantics. We propose a novel data augmentation approach via adversarial training, which uses a convolutional neural network as a discriminator to guide the training of a recurrent neural network based generative model. Data augmentation for language models can be regarded as discrete sequential data generation. When the outputs of the generative model are discrete, the backpropagation algorithm cannot update the generative model with the gradient of the discriminator's error. To deal with this problem, we treat the generative model as a stochastic policy in reinforcement learning and optimize it with rewards from the discriminator. Since the discriminator can only judge complete sequences, we evaluate intermediate states by Monte Carlo search. Experiments on rescoring the n-best lists of speech recognition outputs show that as the training corpus grows, the proposed approach achieves a lower character error rate (CER) and consistently outperforms the MLE-based approach. When the training corpus reaches 6 million tokens, the proposed approach yields a relative 5.0% CER reduction on the THCHS30 dataset and a relative 7.1% CER reduction on the AISHELL dataset compared with the baseline.
Key words: data augmentation; language modeling; generative adversarial nets (GAN); reinforcement learning; speech recognition

The $N$-gram language model ($N$-gram LM) is a widely used statistical language model [1]. Because word combinations in real natural language are highly diverse, an $N$-gram LM trained on limited data inevitably suffers from data sparsity [2]. Data augmentation is an effective way to alleviate this problem [3-5]. For language modeling, common augmentation methods fall into two groups: methods based on external data [4-5] and methods based on random sampling from recurrent neural network language models (RNN LMs) [6-7]. The former selects data from other sources (e.g., the Internet) according to certain rules to enlarge the training set; the latter uses a trained RNN LM to randomly generate sampled data, enriching the linguistic phenomena covered by the training set.

1 Data Augmentation Based on RNN LMs

An RNN LM predicts the conditional probability of each word in a given word sequence. Given a training sentence $w_1, w_2, \cdots, w_T$ $(w_t$ $\in V$, $t=1, 2, \cdots, T)$, where $V$ denotes the vocabulary, the RNN LM encodes the input word sequence into a sequence of hidden states $\pmb{s}_1$, $\pmb{s}_2$, $\cdots$, $\pmb{s}_T$ $(\pmb{s}_t\in {\bf{R}}^h$, $t=1, 2, \cdots, T)$ according to

 \begin{align} \pmb{s}_t = \sigma(W_r \pmb{s}_{t-1} + W_x \pmb{x}_t + \pmb{b}_h) \end{align} (1)
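The recurrent update in Eq. (1) can be sketched numerically as follows; the toy dimensions and random weights are illustrative only (the paper's actual model uses LSTM units, not this plain sigmoid recurrence):

```python
import numpy as np

def rnn_step(s_prev, x_t, W_r, W_x, b_h):
    """One recurrent update, Eq. (1): s_t = sigma(W_r s_{t-1} + W_x x_t + b_h)."""
    return 1.0 / (1.0 + np.exp(-(W_r @ s_prev + W_x @ x_t + b_h)))

# toy dimensions: hidden size h = 4, input embedding size d = 3 (assumed values)
rng = np.random.default_rng(0)
h, d = 4, 3
W_r, W_x, b_h = rng.normal(size=(h, h)), rng.normal(size=(h, d)), np.zeros(h)

s = np.zeros(h)                      # initial state s_0
for x_t in rng.normal(size=(5, d)):  # a 5-step sequence of word embeddings
    s = rnn_step(s, x_t, W_r, W_x, b_h)
print(s.shape)  # (4,)
```

Each step folds the previous hidden state and the current word embedding into a new state, so $\pmb{s}_t$ summarizes the whole prefix $w_1, \cdots, w_t$.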

2.3.2 Discriminative Model

 \begin{align} \varepsilon_{1:T} = \pmb{x}_1 \circ \pmb{x}_2 \circ \cdots \circ \pmb{x}_T \end{align} (18)

 \begin{align} c_i = \rho(r \ast \varepsilon_{i:i+l-1} + b) \end{align} (19)

 \begin{align} c = \max \{ c_1, \cdots, c_{T-l+1} \} \end{align} (20)

 \begin{align} h = H(c, W_H) \times T(c, W_T) + c \times C(c, W_c) \end{align} (21)
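Eqs. (18)-(21) can be traced with a minimal NumPy sketch: stack the embeddings (Eq. (18)), convolve each window-length filter and max-pool over time (Eqs. (19)-(20)), then pass the pooled features through a highway layer (Eq. (21)). The ReLU/sigmoid choices for $\rho$, $H$, $T$, and $C$, and the single filter per window length, are simplifying assumptions rather than the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def conv_max_pool(E, r, b):
    """Eqs. (19)-(20): slide a window of l embeddings, apply the filter, max over time.
    E: (T, d) stacked embeddings from Eq. (18); r: (l, d) convolution filter."""
    T, l = E.shape[0], r.shape[0]
    c = np.array([relu(np.sum(r * E[i:i + l]) + b) for i in range(T - l + 1)])
    return c.max()

def highway(c, W_H, W_T, W_C):
    """Eq. (21): h = H(c, W_H) * T(c, W_T) + c * C(c, W_C)."""
    return relu(W_H @ c) * sigmoid(W_T @ c) + c * sigmoid(W_C @ c)

rng = np.random.default_rng(1)
T, d = 12, 8                         # sequence length, embedding size (toy values)
E = rng.normal(size=(T, d))          # Eq. (18): x_1 ∘ x_2 ∘ ... ∘ x_T
widths = [1, 2, 3]                   # window lengths (paper uses 1-5 and 10)
feats = np.array([conv_max_pool(E, rng.normal(size=(l, d)), 0.0) for l in widths])
W_H, W_T, W_C = (rng.normal(size=(3, 3)) for _ in range(3))
h = highway(feats, W_H, W_T, W_C)
print(h.shape)  # (3,)
```

Max-over-time pooling makes the feature vector independent of sequence length, and the carry gate $C$ in the highway layer lets the pooled features bypass the transform when that helps training.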

3 Experiments

3.1 Experimental Setup

The RNN generative model contains two hidden layers, each consisting of 150 LSTM units. The number of output-layer nodes equals the vocabulary size; the vocabulary contains 55,590 Chinese words. To prevent overfitting to the training data, dropout regularization is applied, with an initial dropout rate of 0.3 in both pre-training and adversarial training.

The CNN discriminative model performs convolutions with window lengths of 1, 2, 3, 4, 5, and 10, using 50 different filters for each window length. In addition, the discriminator contains two highway layers with 150 nodes each. The output layer has a single node indicating how similar the input sequence is to real data. Dropout regularization with a rate of 0.3 is likewise applied during training to prevent overfitting, and L2 regularization with coefficient 0.1 is applied to the output layer.

3.2 Hyper-parameter Selection for Adversarial Training

Figure 2 Training errors of sequential generative adversarial networks with different hyper-parameters

3.3 Data Augmentation Based on Relative Cross-Entropy

Figure 3 Performance of sequential generative adversarial networks on different datasets

3.4 Rescoring Experiments on Recognition N-best Lists

 \begin{align} s_{\rm lm} = \omega\times s_{\rm new} + (1 - \omega) \times s_{\rm base} \end{align} (22)

 \begin{align} s = s_{\rm am} + \gamma \times s_{\rm lm} \end{align} (23)
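The two-stage rescoring in Eqs. (22)-(23) can be sketched directly: interpolate the new and baseline LM scores, combine with the acoustic-model score, and keep the hypothesis with the highest total. The values of $\omega$, $\gamma$, and the toy n-best scores below are illustrative, not the paper's tuned settings:

```python
def rescore(s_am, s_base, s_new, omega=0.5, gamma=10.0):
    """Eqs. (22)-(23): interpolate LM scores, then combine with the AM score."""
    s_lm = omega * s_new + (1.0 - omega) * s_base  # Eq. (22)
    return s_am + gamma * s_lm                     # Eq. (23)

# pick the best hypothesis from a toy 3-best list of (s_am, s_base, s_new) log scores
nbest = [(-120.0, -35.0, -33.0), (-118.0, -40.0, -41.0), (-125.0, -30.0, -31.0)]
best = max(range(len(nbest)), key=lambda i: rescore(*nbest[i]))
print(best)  # 2
```

Here $\omega$ controls how much the augmented-data LM displaces the baseline LM, while $\gamma$ balances the language-model score against the acoustic-model score, as in standard n-best rescoring.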

Figure 4 The effect of training data size on the two augmentation approaches

3.5 Performance Analysis of the Generative Model

Figure 5 Distribution of sentences sampled from different sources

4 Conclusion