Ship Science and Technology, 2022, Vol. 44, Issue (15): 139-144. DOI: 10.3404/j.issn.1672-7649.2022.15.029

Speech enhancement based on time domain fully convolutional network
LI Wen-zhi, QU Xiao-xu
Naval University of Engineering, College of Electronic Engineering, Wuhan 430000, China
Abstract: At present, speech enhancement methods based on deep learning generally process the amplitude spectrum of the speech signal in the frequency domain, so the phase information is lost to some extent. To solve this problem, a speech enhancement method based on a time-domain fully convolutional network is proposed. The method processes the speech signal in the time domain through a purpose-designed fully convolutional neural network, preserving the original phase information of the signal. Noisy speech and clean speech are used as the input and output of the network, and a nonlinear mapping between them is established in the time domain to realize end-to-end speech enhancement. Simulation results show that the proposed method can effectively improve speech quality under low signal-to-noise ratio conditions.
Key words: speech enhancement     time-domain signal     deep learning     convolutional neural network     fully convolutional network
0 Introduction

1 Basic principles
1.1 Convolutional neural networks

 $x_j^m = f\left( {\sum\limits_{i \in {M_j}} {x_i^{m - 1}} *k_{i,j}^m + b_j^m} \right) .$ (1)
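As an illustration of Eq. (1), the following NumPy sketch computes one convolutional layer's output feature maps. The 1-D feature maps match the time-domain setting of this paper; the function names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def relu(x):
    """ReLU nonlinearity, one common choice for f in Eq. (1)."""
    return np.maximum(0.0, x)

def conv_layer(prev_maps, kernels, biases, f=relu):
    """Compute layer m's feature maps per Eq. (1):
    x_j^m = f( sum_{i in M_j} x_i^{m-1} * k_{i,j}^m + b_j^m ).

    prev_maps : list of 1-D feature maps from layer m-1
    kernels   : kernels[i][j] is the kernel linking input map i to output map j
    biases    : biases[j] is the scalar bias of output map j
    """
    outputs = []
    for j in range(len(biases)):
        # Sum the convolutions of every input map with its kernel for output map j
        acc = sum(np.convolve(prev_maps[i], kernels[i][j], mode="valid")
                  for i in range(len(prev_maps)))
        outputs.append(f(acc + biases[j]))
    return outputs
```

Here every input map feeds every output map, i.e. $M_j$ is taken as the full set of input maps; in general $M_j$ may select a subset.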

1.2 Activation functions

 $f(x) = 1/(1 + {e^{ - x}}) ,$ (2)
 $f(x) = ({e^x} - {e^{ - x}})/({e^x} + {e^{ - x}}) ,$ (3)
 $f(x) = \max (0,x) .$ (4)
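The three activation functions of Eqs. (2) to (4), sigmoid, tanh, and ReLU, translate directly into NumPy:

```python
import numpy as np

def sigmoid(x):
    """Eq. (2): squashes input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Eq. (3): squashes input to (-1, 1); equivalent to np.tanh."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    """Eq. (4): passes positive values, zeroes out negative ones."""
    return np.maximum(0.0, x)
```

ReLU is the usual default in deep convolutional networks because it avoids the saturation of sigmoid and tanh for large inputs.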
1.3 Batch normalization

The BN operation proceeds as in Eqs. (5) to (8) [22]. Consider a mini-batch $B = [{x_1},{x_2}, \cdots ,{x_m}]$ containing $m$ training samples. The mean over the batch is computed as:

 ${\mu }_{B}=\frac{1}{m}\sum\limits_{i=1}^{m}{x}_{i} ,$ (5)

the variance of the batch is:

 ${\sigma }_{B}^{2}=\frac{1}{m}\sum\limits_{i=1}^{m}{\left({x}_{i}-{\mu }_{B}\right)}^{2} ,$ (6)

and each sample is normalized, with a small constant $\epsilon$ added for numerical stability:

 $\widehat{B}=\frac{B-{\mu }_{B}}{\sqrt{{\sigma }_{B}^{2}+\epsilon }} .$ (7)

Finally, a learnable scale $\gamma$ and shift $\beta$ restore the representational capacity of the layer:

 $y = \gamma \hat B + \beta .$ (8)
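A minimal NumPy sketch of Eqs. (5) to (8), assuming the batch is stacked as rows of an array and that $\gamma$ and $\beta$ are scalars for simplicity (in practice they are learned per feature):

```python
import numpy as np

def batch_norm(B, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization following Eqs. (5)-(8).

    B : array of shape (m, features), one training sample per row
    """
    mu = B.mean(axis=0)                      # Eq. (5): batch mean
    var = ((B - mu) ** 2).mean(axis=0)       # Eq. (6): batch variance
    B_hat = (B - mu) / np.sqrt(var + eps)    # Eq. (7): normalize
    return gamma * B_hat + beta              # Eq. (8): scale and shift
```

With the default $\gamma = 1$, $\beta = 0$, the output of each feature column has approximately zero mean and unit variance.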
2 Speech enhancement based on a time-domain fully convolutional network
2.1 Network structure design

 Fig. 1 Structure of the fully convolutional network in the time domain

2.2 Algorithm workflow

 $m(n) = s(n) + d(n) .$ (9)
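Eq. (9) models the noisy signal $m(n)$ as clean speech $s(n)$ plus additive noise $d(n)$. To build noisy-clean training pairs at a prescribed SNR (for example the 5 dB condition used in the experiments), the noise is typically rescaled before addition; the scaling rule below is a standard choice and an assumption here, since the paper does not spell out its mixing procedure:

```python
import numpy as np

def mix_at_snr(s, d, snr_db):
    """Add scaled noise d to clean speech s, per m(n) = s(n) + d(n),
    so that the mixture has the requested SNR in dB."""
    p_s = np.mean(s ** 2)                                  # clean speech power
    p_d = np.mean(d ** 2)                                  # raw noise power
    alpha = np.sqrt(p_s / (p_d * 10 ** (snr_db / 10.0)))  # noise gain
    return s + alpha * d
```

The pair (mixture, clean signal) then serves as the network's input and training target for end-to-end enhancement.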

 Fig. 2 Flowchart of speech enhancement based on the fully convolutional network in the time domain
3 Simulation and result analysis

3.1 Data processing

3.2 Result analysis

SNR and LSD evaluate the degree of speech distortion in the time domain and the frequency domain, respectively. Their formulas are:

 ${\text{SNR}} = 10{\rm{lg}}\frac{{\displaystyle\sum\limits_{n = 1}^N {{s^2}(n)} }}{{\displaystyle\sum\limits_{n = 1}^N {{{\left| {m(n) - s(n)} \right|}^2}} }} ,$ (10)
 $\text{LSD}=\frac{1}{T}\sum\limits_{t=0}^{T-1}\sqrt{\frac{1}{L/2+1}\sum\limits_{k=0}^{L/2}{\left[10\,{\rm{lg}}\frac{P(k)}{\widehat{P}(k)}\right]}^{2}} .$ (11)
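Both metrics can be sketched in NumPy as follows. For LSD the framing details, frame length $L$ and the definition of $P(k)$ and $\hat P(k)$ as per-frame power spectra of the clean and enhanced signals, are assumptions, since the paper does not define them explicitly:

```python
import numpy as np

def snr_db(s, m):
    """Eq. (10): time-domain SNR in dB; s is clean speech, m the test signal."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum(np.abs(m - s) ** 2))

def lsd(s, m, frame_len=256):
    """Eq. (11): log-spectral distance averaged over T frames of length L.

    P and P_hat are taken as per-frame power spectra (an assumption);
    rfft of a length-L frame yields exactly L/2 + 1 bins, matching Eq. (11).
    """
    eps = 1e-12                                   # avoid log of zero
    n_frames = min(len(s), len(m)) // frame_len
    total = 0.0
    for t in range(n_frames):
        seg_s = s[t * frame_len:(t + 1) * frame_len]
        seg_m = m[t * frame_len:(t + 1) * frame_len]
        P = np.abs(np.fft.rfft(seg_s)) ** 2 + eps
        P_hat = np.abs(np.fft.rfft(seg_m)) ** 2 + eps
        diff = 10 * np.log10(P / P_hat)
        total += np.sqrt(np.mean(diff ** 2))      # inner root-mean of Eq. (11)
    return total / n_frames
```

A higher SNR and a lower LSD both indicate less distortion relative to the clean reference.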

 Fig. 3 Time-domain waveforms under white noise (5 dB)

 Fig. 4 Enhanced spectrograms under white noise (5 dB)

4 Conclusion

References
[1] PALIWAL K, SCHWERIN B, WÓJCICKI K. Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator[J]. Speech Communication, 2012, 54(2): 282-305. DOI: 10.1016/j.specom.2011.09.003
[2] KUMAR M A, CHARI K M. Noise reduction using modified wiener filter in digital hearing aid for speech signal enhancement[J]. Journal of Intelligent Systems, 2020, 29(1): 1360-1378.
[3] TACHIOKA Y. DNN-based voice activity detection using auxiliary speech models in noisy environments[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 5529-5533.
[4] FAYEK H M, LECH M, CAVEDON L. Evaluating deep learning architectures for speech emotion recognition[J]. Neural Networks, 2017, 92: 60-68. DOI: 10.1016/j.neunet.2017.02.013
[5] ZHANG S, CHEN A, GUO W, et al. Learning deep binaural representations with deep convolutional neural networks for spontaneous speech emotion recognition[J]. IEEE Access, 2020, 8: 23496-23505. DOI: 10.1109/ACCESS.2020.2969032
[6] ZHANG S, ZHANG S, HUANG T, et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching[J]. IEEE Transactions on Multimedia, 2018, 20: 1576-1590. DOI: 10.1109/TMM.2017.2766843
[7] FAYEK H M, LECH M, CAVEDON L. Evaluating deep learning architectures for speech emotion recognition[J]. Neural Networks, 2017, 92: 60-68. DOI: 10.1016/j.neunet.2017.02.013
[8] WANG D L. Deep learning reinvents the hearing aid[J]. IEEE Spectrum, 2017, 54(3): 32-37. DOI: 10.1109/MSPEC.2017.7864754
[9] XU Y, DU J, DAI L R, et al. A regression approach to speech enhancement based on deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23: 7-19. DOI: 10.1109/TASLP.2014.2364452
[10] XU Y, DU J, DAI L R, et al. An experimental study on speech enhancement based on deep neural networks[J]. IEEE Signal Processing Letters, 2013, 21(1): 65-68.
[11] ZHANG Ming-liang, CHEN Yu. Speech enhancement algorithm based on fully convolutional neural network[J]. Application Research of Computers, 2020, 37(S1): 135-137.
[12] KOUNOVSKY T, MALEK J. Single channel speech enhancement using convolutional neural network[C]// IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), 2017: 1-5.
[13] PARK S R, LEE J. A fully convolutional neural network for speech enhancement[C]// Interspeech, 2017: 1993-1997.
[14] ZHAO H, ZARAR S, TASHEV I, et al. Convolutional-recurrent neural networks for speech enhancement[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 2401-2405.
[15] ALA B, MY C, CZA B, et al. Speech enhancement using progressive learning-based convolutional recurrent neural network[J]. Applied Acoustics, 2020, 166: 107347. DOI: 10.1016/j.apacoust.2020.107347
[16] FU S, TSAO Y, LU X. SNR-aware convolutional neural network modeling for speech enhancement[C]// Interspeech, 2016: 3768-3772.
[17] PALIWAL K K, WÓJCICKI K K, SHANNON B J. The importance of phase in speech enhancement[J]. Speech Communication, 2011, 53(4): 465-494. DOI: 10.1016/j.specom.2010.12.003
[18] YIN D, LUO C, XIONG Z, et al. PHASEN: A phase-and-harmonics-aware speech enhancement network[J]. arXiv: 1911.04679, 2019.
[19] WILLIAMSON D S, WANG Y, WANG D. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 483-492. DOI: 10.1109/TASLP.2015.2512042
[20] OORD A, DIELEMAN S, ZEN H, et al. WaveNet: A generative model for raw audio[J]. arXiv: 1609.03499, 2016.
[21] MIAO Yu-qing, ZOU Wei, LIU Tong-lai, et al. Speech emotion recognition based on parameter transfer and convolutional recurrent neural network[J]. Computer Engineering and Applications, 2019, 55(10): 135-140. DOI: 10.3778/j.issn.1002-8331.1802-0089
[22] LUO Ren-ze, WANG Rui-jie, ZHANG Ke, et al. Image denoising method based on residual convolutional autoencoder network[J]. Computer Simulation, 2021, 38(5): 455-461. DOI: 10.3969/j.issn.1006-9348.2021.05.093
[23] FU S W, YU T, LU X, et al. Raw waveform-based speech enhancement by fully convolutional networks[C]// Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017: 6-12.