Acta Automatica Sinica, 2018, Vol. 44, Issue 10: 1876-1887


Supervised Speech Separation Using Optimal Ratio Mask
XIA Sha-Sha¹, ZHANG Xue-Liang¹, LIANG Shan²
1. College of Computer Science, Inner Mongolia University, Hohhot 010021;
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
Manuscript received: November 1, 2016; accepted: March 21, 2017.
Foundation Item: Supported by National Natural Science Foundation of China (61365006)
Corresponding author: ZHANG Xue-Liang, associate professor at the College of Computer Science, Inner Mongolia University. He received his bachelor's degree from Inner Mongolia University in 2003, his master's degree from Harbin Institute of Technology in 2005, and his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2010. His research interests cover speech separation, computational auditory scene analysis, and speech signal processing.
Abstract: Supervised speech separation uses a supervised learning algorithm to learn a mapping from an input noisy signal to an output target signal. In recent years, driven by the development of deep learning, supervised separation has become the most important research direction in speech separation, and the choice of training target has a significant impact on separation performance. The ideal ratio mask is a commonly used training target that can improve the intelligibility and quality of the separated speech. However, it does not take into account the correlation between noise and clean speech. In this paper, we use the optimal ratio mask as the training target and a deep neural network (DNN) as the separation model. Experiments are carried out under various noise environments and signal-to-noise ratio conditions, and the results show that the optimal ratio mask generally outperforms the other training targets.
Key words: deep neural network (DNN); speech separation; supervised learning; training targets

1 Speech separation based on deep neural networks

 $\hat{C}(t)=\frac{\hat{C}(t-m)+\cdots+C(t)+\cdots+C(t+m)}{2m+1}$ (1)
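
As an illustration of Eq. (1), the following is a minimal sketch of the ARMA smoothing of a per-frame sequence $C(t)$. It assumes that the terms before frame $t$ are the already smoothed values $\hat{C}$, that the terms from frame $t$ onward are raw values, and that frames lacking $m$ neighbours on both sides are copied unchanged. The function name `arma_smooth`, the scalar input, and the boundary handling are assumptions of this sketch and are not taken from the paper.

```python
def arma_smooth(c, m):
    """Smooth sequence c with an order-m ARMA filter in the form of Eq. (1).

    Past terms use already-smoothed values, current and future terms use
    raw values. Frames without m neighbours on both sides are left
    unchanged (an assumption of this sketch).
    """
    n = len(c)
    out = [float(x) for x in c]
    for t in range(m, n - m):
        past = sum(out[t - m:t])        # \hat{C}(t-m) ... \hat{C}(t-1)
        future = sum(c[t:t + m + 1])    # C(t) ... C(t+m)
        out[t] = (past + future) / (2 * m + 1)
    return out


# A single spike is spread over neighbouring frames.
smoothed = arma_smooth([0.0, 0.0, 3.0, 0.0, 0.0], m=1)
print(smoothed)
```

Because each output frame reuses earlier smoothed frames, the filter is recursive: the spike in the example leaks forward in time with decaying weight, which a plain moving average would not do.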

Figure 1 ARMA-based DNN architecture
2 Optimal ratio mask

3.2 Ideal ratio mask in the Fourier transform domain (FFT Ideal Ratio Mask, IRM_FFT)

IRM_FFT is the IRM in the Fourier domain. It is defined as

 $IRM_{\rm FFT}(t, f)=\left(\frac{S^{2}(t, f)}{S^{2}(t, f)+N^{2}(t, f)}\right)=\left(\frac{P_{s}(t, f)}{P_{s}(t, f)+P_{n}(t, f)}\right)$ (10)
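
A minimal sketch of Eq. (10) follows, assuming that $S$ and $N$ are the magnitude spectra of the clean speech and the noise, so that $P_s=S^2$ and $P_n=N^2$ are the corresponding powers. The function names and the small constant `eps`, added to avoid division by zero in silent units, are assumptions of this sketch.

```python
def irm_fft(speech_mag, noise_mag, eps=1e-12):
    """Eq. (10) for one time-frequency unit, given magnitudes |S| and |N|."""
    p_s = speech_mag ** 2    # speech power P_s(t, f)
    p_n = noise_mag ** 2     # noise power P_n(t, f)
    return p_s / (p_s + p_n + eps)   # eps is a guard, not part of Eq. (10)


def irm_fft_mask(speech_mags, noise_mags):
    """Apply irm_fft to every unit of two equally shaped 2-D lists."""
    return [[irm_fft(s, n) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_mags, noise_mags)]


# Rows are frames t, columns are frequency bins f.
speech = [[1.0, 3.0], [0.0, 2.0]]
noise = [[1.0, 0.0], [2.0, 2.0]]
mask = irm_fft_mask(speech, noise)
print(mask)
```

The mask lies between 0 and 1: it approaches 1 where speech dominates a unit, approaches 0 where noise dominates, and is close to 0.5 where the two powers are equal.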

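3.3 Ideal ratio mask in the complex domain (Complex Ideal Ratio Mask, cIRM)

The cIRM is defined such that applying it to the STFT coefficients of the mixture yields the STFT coefficients of the clean speech. That is, given the complex spectrum $Y$ of the mixture, the complex spectrum $S$ of the clean speech can be obtained, so that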

 $S_{t, f}=M_{t, f}\ast Y_{t, f}$ (11)

 $M=\frac{Y_{r}S_{r}+Y_{i}S_{i}}{Y_{r}^{2}+Y_{i}^{2}}+{\rm i}\frac{Y_{r}S_{i}-Y_{i}S_{r}}{Y_{r}^{2}+Y_{i}^{2}}$ (12)
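
Equations (11) and (12) can be checked numerically for a single time-frequency unit. The sketch below assumes that the subscripts $r$ and $i$ denote real and imaginary parts and that $\ast$ in Eq. (11) is complex multiplication; the function name `cirm` and the sample values are illustrative only.

```python
def cirm(y, s):
    """Eq. (12): complex mask M for one unit, so that s == M * y."""
    denom = y.real ** 2 + y.imag ** 2            # Y_r^2 + Y_i^2
    real = (y.real * s.real + y.imag * s.imag) / denom
    imag = (y.real * s.imag - y.imag * s.real) / denom
    return complex(real, imag)


y = complex(3.0, 4.0)     # mixture STFT coefficient (illustrative value)
s = complex(1.0, -2.0)    # clean speech STFT coefficient (illustrative value)
m = cirm(y, s)
recovered = m * y         # Eq. (11)
print(m, recovered)
```

Eq. (12) is the complex quotient $S/Y$ written out in real and imaginary parts, so the mask is undefined where the mixture coefficient is exactly zero; this sketch raises `ZeroDivisionError` in that case.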

 $PSM(t, f)=\frac{|S(t, f)|}{|Y(t, f)|}\cos(\theta)$ (13)
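
A minimal sketch of Eq. (13) for one time-frequency unit follows, assuming that $\theta$ denotes the difference between the clean speech phase and the mixture phase. Under that assumption the PSM equals the real part of the quotient $S/Y$, which the example checks numerically. The function name `psm` and the sample values are illustrative.

```python
import cmath
import math


def psm(y, s):
    """Eq. (13) for one unit, with theta assumed to be phase(S) - phase(Y)."""
    theta = cmath.phase(s) - cmath.phase(y)
    return abs(s) / abs(y) * math.cos(theta)


y = complex(3.0, 4.0)     # mixture STFT coefficient (illustrative value)
s = complex(1.0, -2.0)    # clean speech STFT coefficient (illustrative value)
value = psm(y, s)
real_part_of_ratio = (s / y).real
print(value, real_part_of_ratio)
```

Unlike a magnitude-only ratio, this quantity can be negative or exceed 1, since the cosine term penalises units where the mixture phase differs from the clean speech phase.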

Figure 5 STFT magnitudes of separated speech obtained with different training targets. The mixture is an IEEE male utterance mixed with babble noise at 3 dB
4.2 Separation of different speakers
4.2.1 Experimental setup

4.2.2 Experimental results and analysis

Figure 6 STFT magnitudes of separated speech obtained with different training targets. The mixture is an IEEE male utterance mixed with an IEEE female utterance at 0 dB
5 Conclusion