自动化学报 (Acta Automatica Sinica), 2017, Vol. 43, Issue (2): 248-258


A Single-channel Speech Enhancement Approach Based on Perceptual Masking Deep Neural Network
HAN Wei1, ZHANG Xiong-Wei1, MIN Gang1,2, ZHANG Qi-Ye3
1. PLA University of Science and Technology, Nanjing 210007;
2. Xi'an Communications Institute, Xi'an 710106;
3. Unit 96637 of PLA, Beijing 102101
Foundation Item: Supported by National Natural Science Foundation of China (61471394, 61402519), Natural Science Foundation of Jiangsu Province (BK20140071, BK20140074)
Corresponding author. ZHANG Xiong-Wei, Professor at the College of Command Information System, PLA University of Science and Technology. He received his Ph.D. degree from Nanjing Institute of Communication Engineering in 1992. His research interests cover intelligent information processing, speech and image signal processing, and telecommunication systems.
Recommended by Associate Editor KE Deng-Feng
Abstract: A new deep neural network (DNN) is proposed for single-channel speech enhancement, which incorporates the perceptual masking properties of psychoacoustic models. Firstly, the proposed DNN is trained to learn both the clean speech magnitude spectrum and the noise magnitude spectrum from the noisy magnitude spectrum. Secondly, the estimated clean speech magnitude spectrum is used to calculate the noise masking threshold. Then, the noise masking threshold and the estimated noise magnitude spectrum are combined to calculate a perceptual gain function. Finally, the enhanced speech magnitude spectrum is obtained by applying the perceptual gain function to the noisy speech magnitude spectrum. Experimental results on TIMIT with 20 noise types at various signal-to-noise ratio (SNR) levels demonstrate that the proposed perceptual masking DNN effectively removes noise while keeping speech distortion small, and thus achieves better performance than conventional DNN methods and the nonnegative matrix factorization (NMF) method, whether or not the noise conditions are included in the training set.
Key words: speech enhancement, deep neural network, perceptual gain function, masking threshold

1 DNN-based Speech Enhancement and the Noise Masking Threshold

1.1 DNN-based Speech Enhancement
1.1.1 DNN Architecture

Figure 1 Speech enhancement based on DNN

A DNN typically consists of three parts: an input layer, hidden layers, and an output layer. The input layer receives the feature parameters of the noisy speech. The hidden layers are usually formed by stacking multiple layers; nodes in adjacent layers are connected, while there are no connections within a layer or across non-adjacent layers. Data is passed between the input layer and each hidden layer through activation functions: the output computed by one layer serves as the input of the next layer, as shown in Eq. (1):

 $\begin{eqnarray} \pmb{h}^l = \sigma ({W^l}{\pmb{h}^{l - 1}} + {\pmb{b}^l}) \end{eqnarray}$ (1)
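The layer-wise propagation of Eq. (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the sigmoid activation, layer widths, and the names `forward`, `weights`, `biases` are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    # a common choice for the activation sigma in Eq. (1)
    return 1.0 / (1.0 + np.exp(-x))

def forward(h0, weights, biases):
    """Propagate the input h0 through the layers: h^l = sigma(W^l h^{l-1} + b^l)."""
    h = h0
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)   # Eq. (1), one layer at a time
    return h

# toy example: 4-dim input, two hidden layers of width 3
rng = np.random.default_rng(0)
ws = [rng.standard_normal((3, 4)), rng.standard_normal((3, 3))]
bs = [np.zeros(3), np.zeros(3)]
out = forward(rng.standard_normal(4), ws, bs)
```

Each iteration consumes the previous layer's output as its input, exactly as the text describes.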

1.1.2 DNN Training

 $\begin{eqnarray} {J_{{\rm{MSE}}}}(W, \pmb{b}) = \dfrac{1}{N}\sum\limits_{n = 1}^N {\dfrac{1}{2}{{\left\| {{\hat{\pmb{S}}_n}(W, \pmb{b}) - {\pmb{S}_n}} \right\|}^2}} \end{eqnarray}$ (2)

 $\begin{eqnarray} {W^l} = {W^l} - \varepsilon \dfrac{{\partial J(W, \pmb{b})}}{{\partial {W^l}}}, \quad 1 \le l \le L + 1 \end{eqnarray}$ (3)
 $\begin{eqnarray} {\pmb{b}^l} = {\pmb{b}^l} - \varepsilon \dfrac{{\partial J(W, \pmb{b})}}{{\partial {\pmb{b}^l}}}, \quad 1 \le l \le L + 1 \end{eqnarray}$ (4)
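The updates of Eqs. (3) and (4) can be illustrated on a case where the gradients of the MSE loss in Eq. (2) are analytic: a single linear layer. This is a hedged sketch, not the paper's training procedure (which uses backpropagation through all layers); the names `sgd_step`, `X`, `S` and the step size are assumptions.

```python
import numpy as np

def sgd_step(W, b, X, S, eps=0.1):
    """One gradient-descent step on J = (1/N) sum_n (1/2)||W x_n + b - s_n||^2,
    i.e. Eqs. (3) and (4) for a single linear layer."""
    N = X.shape[0]
    pred = X @ W.T + b           # current estimate S_hat_n
    err = pred - S               # residual S_hat_n - S_n
    grad_W = err.T @ X / N       # dJ/dW
    grad_b = err.mean(axis=0)    # dJ/db
    return W - eps * grad_W, b - eps * grad_b

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 5))
true_W = rng.standard_normal((2, 5))
S = X @ true_W.T                 # synthetic targets
W, b = np.zeros((2, 5)), np.zeros(2)
for _ in range(200):
    W, b = sgd_step(W, b, X, S)
```

After a few hundred steps the MSE loss shrinks toward zero, since the targets here are exactly linear in the inputs.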

1.2 The Psychoacoustic Model and Computation of the Noise Masking Threshold

Johnston proposed a general method for estimating the masking threshold of background noise within each speech frame [19]. The method is built on critical-band analysis and can be described in the following four steps:

 $\begin{eqnarray} \pmb{P}(\omega) = {{\mathop{\rm Re}\nolimits} ^2}(\pmb{S}(\omega )) + {{\mathop{\rm Im}\nolimits} ^2}(\pmb{S}(\omega )) \end{eqnarray}$ (5)

 $\begin{eqnarray} {B_i} = \sum\limits_{\omega = b{l_i}}^{b{h_i}} {\pmb{P}(\omega )} \end{eqnarray}$ (6)
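Eq. (6) sums the power spectrum over the frequency bins of each critical band. A minimal sketch, assuming the band edges $bl_i$, $bh_i$ are given as bin indices; the names `band_energies` and `band_edges` are hypothetical:

```python
import numpy as np

def band_energies(power_spec, band_edges):
    """Eq. (6): sum the power spectrum P(w) over each critical band
    [bl_i, bh_i); band_edges holds the bin boundaries of consecutive bands."""
    return np.array([power_spec[lo:hi].sum()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

P = np.arange(8, dtype=float)        # toy power spectrum, 8 bins
B = band_energies(P, [0, 2, 5, 8])   # three critical bands
```

In practice the band edges come from mapping the FFT bin frequencies to the Bark scale.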

 $\begin{eqnarray} {C_i} = {S_{ij}} * {B_i} \end{eqnarray}$ (8)

 $\begin{eqnarray} {T_{\rm{N}}} = {C_i} - 14.5 - i \end{eqnarray}$ (9)

 $\begin{eqnarray} {T_{\rm{T}}} = {C_i} - 5.5 \end{eqnarray}$ (10)

 $\begin{eqnarray} {\rm{SFM}}_{\rm{dB}}= 10{\rm{lg}}\left( \dfrac {G_{\rm m}} {A_{\rm m} } \right) \end{eqnarray}$ (11)
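Eqs. (9)-(11) can be combined into one computation: the spectral flatness measure (SFM) of Eq. (11) decides how tone-like a frame is, and the offsets of Eqs. (9) and (10) are blended accordingly. The sketch below follows Johnston's commonly cited rule of interpolating the two offsets with a tonality coefficient $\alpha = \min({\rm SFM}_{\rm dB}/-60, 1)$; that interpolation, and the names `masking_offset`, `spread_bark_db`, are assumptions not shown explicitly in the excerpt.

```python
import numpy as np

def masking_offset(spread_bark_db, band_power):
    """Blend the tone-masking offset (14.5 + i dB, Eq. (9)) and the
    noise-masking offset (5.5 dB, Eq. (10)) using the tonality
    coefficient derived from the SFM of Eq. (11)."""
    geo = np.exp(np.mean(np.log(band_power + 1e-12)))  # geometric mean G_m
    ari = np.mean(band_power)                           # arithmetic mean A_m
    sfm_db = 10 * np.log10(geo / ari)                   # Eq. (11), <= 0 dB
    alpha = min(sfm_db / -60.0, 1.0)                    # 1 = tone-like, 0 = noise-like
    i = np.arange(len(spread_bark_db))                  # critical-band index
    offset = alpha * (14.5 + i) + (1 - alpha) * 5.5     # blended offset in dB
    return spread_bark_db - offset                      # raw threshold in dB

# flat band powers -> SFM near 0 dB -> noise-like -> offset near 5.5 dB
t = masking_offset(np.full(5, 60.0), np.ones(5))
```

A perfectly flat spectrum gives ${\rm SFM}_{\rm dB} \approx 0$, so the noise-like offset of Eq. (10) dominates.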

 $\begin{eqnarray} \pmb{Y}(\omega ) = {F^{\rm H}}\pmb{y} = {F^{\rm H}}\pmb{s} + {F^{\rm H}}\pmb{n} = \pmb{S}(\omega ) + \pmb{N}(\omega ) \end{eqnarray}$ (16)

2.3 PM-DNN Speech Enhancement Procedure

Figure 3 The framework of speech enhancement based on PM-DNN

3 Performance Evaluation of the PM-DNN Speech Enhancement Method

3.1 Experimental Data and Setup

3.2 Baseline Methods and Evaluation Metrics

 $\begin{eqnarray} {\rm{IRM}} = \dfrac{{|\pmb{S}(\omega ){|^2}}}{{|\pmb{S}(\omega ){|^2} + |\pmb{N}(\omega ){|^2}}} \end{eqnarray}$ (26)
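The ideal ratio mask of Eq. (26) is straightforward to compute given the clean and noise spectra. A minimal sketch; the function name and the small regularizer added to the denominator are assumptions for the example:

```python
import numpy as np

def ideal_ratio_mask(S, N):
    """Eq. (26): IRM = |S|^2 / (|S|^2 + |N|^2), a per-bin value in [0, 1]."""
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return ps / (ps + pn + 1e-12)   # tiny epsilon avoids division by zero

S = np.array([2.0, 1.0, 0.0])   # toy clean magnitudes
N = np.array([0.0, 1.0, 2.0])   # toy noise magnitudes
m = ideal_ratio_mask(S, N)
```

Bins dominated by clean speech get a mask near 1 and are kept; noise-dominated bins get a mask near 0 and are suppressed.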

3.3 Experimental Results and Analysis

Figure 4 The PESQ scores of the PM-DNN objective function with different weights $\alpha$ and $\beta$ (for each condition, the numbers are mean values over all 20 noise types)

Figure 5 The PESQ scores of the four enhancement methods for the 20 noise types (for each noise type, the numbers are mean values over four input SNRs: -5 dB, 0 dB, 5 dB, and 10 dB)

Figure 6 The LSD values of the four enhancement methods for the 20 noise types (for each noise type, the numbers are mean values over four input SNRs: -5 dB, 0 dB, 5 dB, and 10 dB)
Figure 7 The fwSNRseg values of the four enhancement methods for the 20 noise types (for each noise type, the numbers are mean values over four input SNRs: -5 dB, 0 dB, 5 dB, and 10 dB)

Figure 8 Spectrograms

4 Conclusions and Future Work

References
1 Boll S F. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2): 113-120. DOI:10.1109/TASSP.1979.1163209
2 Chen J D, Benesty J, Huang Y T, Doclo S. New insights into the noise reduction Wiener filter. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1218-1234. DOI:10.1109/TSA.2005.860851
3 Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109-1121. DOI:10.1109/TASSP.1984.1164453
4 Gerkmann T, Hendriks R C. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1383-1393. DOI:10.1109/TASL.2011.2180896
5 Jensen J R, Benesty J, Christensen M G, Jensen S H. Enhancement of single-channel periodic signals in the time-domain. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(7): 1948-1963. DOI:10.1109/TASL.2012.2191957
6 Wilson K W, Raj B, Smaragdis P, Divakaran A. Speech denoising using nonnegative matrix factorization with priors. In: Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas, USA: IEEE, 2008. 4029-4032
7 Sun C L, Zhu Q, Wan M H. A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition. Speech Communication, 2014, 60: 44-55. DOI:10.1016/j.specom.2014.03.002
8 Sun M, Li Y N, Gemmeke J, Zhang X W. Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(7): 1233-1242. DOI:10.1109/TASLP.2015.2427520
9 Xu Y, Du J, Dai L R, Lee C H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7-19. DOI:10.1109/TASLP.2014.2364452
10 Huang P S, Kim M, Hasegawa-Johnson M, Smaragdis P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(12): 2136-2147. DOI:10.1109/TASLP.2015.2468583
11 Wang Y X, Narayanan A, Wang D L. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858. DOI:10.1109/TASLP.2014.2352935
12 Sun M, Zhang X W, Van hamme H, Zheng T F. Unseen noise estimation using separable deep auto encoder for speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(1): 93-104. DOI:10.1109/TASLP.2015.2498101
13 Williamson D S, Wang Y X, Wang D L. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 483-492. DOI:10.1109/TASLP.2015.2512042
14 Narayanan A, Wang D L. Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 92-101.
15 Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural Computation, 2006, 18(7): 1527-1554. DOI:10.1162/neco.2006.18.7.1527
16 Bengio Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009, 2(1): 1-127. DOI:10.1561/2200000006
17 Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, USA: JMLR, 2011. 315-323
18 Zhang Yong, Liu Yi, Liu Hong. A two-stage speech enhancement algorithm combined with human auditory perception. Journal of Signal Processing, 2014, 30(4): 363-373. (张勇, 刘轶, 刘宏. 结合人耳听觉感知的两级语音增强算法. 信号处理, 2014, 30(4): 363-373.)
19 Johnston J D. Transform coding of audio signals using perceptual noise criteria. IEEE Journal on Selected Areas in Communications, 1988, 6(2): 314-323. DOI:10.1109/49.608
20 Udrea R M, Vizireanu N D, Ciochina S. An improved spectral subtraction method for speech enhancement using a perceptual weighting filter. Digital Signal Processing, 2008, 18(4): 581-587. DOI:10.1016/j.dsp.2007.08.002
21 Hu Y, Loizou P C. Incorporating a psychoacoustical model in frequency domain speech enhancement. IEEE Signal Processing Letters, 2004, 11(2): 270-273. DOI:10.1109/LSP.2003.821714
22 Rix A W, Beerends J G, Hollier M P, Hekstra A P. Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs. In: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Salt Lake City, USA: IEEE, 2001. 749-752
23 Zou Xia, Chen Liang, Zhang Xiong-Wei. Speech enhancement with Gamma speech modeling. Journal on Communications, 2006, 27(10): 118-123. (邹霞, 陈亮, 张雄伟. 基于Gamma语音模型的语音增强算法. 通信学报, 2006, 27(10): 118-123.)
24 Hu Y, Loizou P C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(1): 229-238. DOI:10.1109/TASL.2007.911054
25 Huang P S, Kim M, Hasegawa-Johnson M, Smaragdis P. Deep learning for monaural speech separation. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy: IEEE, 2014. 1562-1566