Speech enhancement based on a time-domain fully convolutional network
LI Wen-zhi, QU Xiao-xu
Naval University of Engineering, College of Electronic Engineering, Wuhan 430000, China
Abstract: Current deep-learning-based speech enhancement methods generally process the amplitude spectrum of the speech signal in the frequency domain, so phase information is partly lost. To address this problem, a speech enhancement method based on a time-domain fully convolutional network is proposed. The method processes the speech signal directly in the time domain with a fully convolutional neural network and thereby preserves the original phase information of the signal. Noisy speech and clean speech serve as the input and the target output of the network, which learns the nonlinear time-domain relationship between them to realize end-to-end speech enhancement. Simulation results show that the proposed method effectively improves speech quality at low signal-to-noise ratios.
Key words: speech enhancement; time-domain signal; deep learning; convolutional neural network; fully convolutional network
0 Introduction

1 Basic principles
1.1 Convolutional neural network

$x_j^m = f\left( \sum\limits_{i \in M_j} x_i^{m-1} * k_{i,j}^m + b_j^m \right).$ (1)
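As an illustration of Eq. (1) for one-dimensional signals, the following minimal sketch computes the output feature maps of one convolutional layer. It assumes that $M_j$ contains all input maps and that the convolution is unpadded ("valid"); the function names `conv1d_valid` and `conv_layer` are hypothetical and not taken from the paper.

```python
def conv1d_valid(x, k):
    """Unpadded 1-D convolution of sequence x with kernel k (kernel flipped)."""
    n, w = len(x), len(k)
    return [sum(x[t + w - 1 - r] * k[r] for r in range(w))
            for t in range(n - w + 1)]


def conv_layer(prev_maps, kernels, biases, f):
    """Eq. (1): x_j^m = f(sum_i x_i^{m-1} * k_{i,j}^m + b_j^m).

    prev_maps: input feature maps x_i^{m-1} (lists of equal length)
    kernels:   kernels[i][j] is the kernel k_{i,j}^m from input i to output j
    biases:    biases[j] is the bias b_j^m
    f:         element-wise activation function
    Assumption: M_j is the set of all input maps.
    """
    out_maps = []
    for j, b in enumerate(biases):
        # Convolve every input map with its kernel for output map j.
        per_input = [conv1d_valid(x, kernels[i][j])
                     for i, x in enumerate(prev_maps)]
        # Sum over the input maps, add the bias, then apply the activation.
        summed = [sum(vals) + b for vals in zip(*per_input)]
        out_maps.append([f(v) for v in summed])
    return out_maps


# Two input maps, one output map, ReLU activation (illustrative values).
relu = lambda v: max(0.0, v)
out = conv_layer([[1, 2, 3, 4], [0, 1, 0, 1]],
                 [[[1, 0, -1]], [[1, 1, 1]]],
                 [-3.0], relu)
print(out)  # [[0.0, 1.0]]
```

Each output map is shorter than the input by the kernel length minus one, because no padding is applied in this sketch.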

1.2 Activation functions

$f(x) = 1/(1 + e^{-x}),$ (2)
$f(x) = (e^x - e^{-x})/(e^x + e^{-x}),$ (3)
$f(x) = \max(0, x).$ (4)
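Eqs. (2) to (4) are the sigmoid, hyperbolic tangent and ReLU functions. A minimal sketch of the three, written directly from the formulas, is given below; the function names are chosen here for illustration.

```python
import math


def sigmoid(x):
    """Eq. (2): f(x) = 1 / (1 + e^{-x}); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Eq. (3): f(x) = (e^x - e^{-x}) / (e^x + e^{-x}); output lies in (-1, 1).

    Written literally from the formula; math.exp overflows for very large |x|,
    so math.tanh is the safer choice in practice.
    """
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Eq. (4): f(x) = max(0, x)."""
    return max(0.0, x)


for v in (-2.0, 0.0, 2.0):
    print(v, sigmoid(v), tanh(v), relu(v))
```

Unlike the sigmoid and tanh functions, ReLU does not saturate for positive inputs, since its output grows linearly with $x$ there.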
1.3 Batch normalization

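The BN operation is described by Eqs. (5) to (8) [22]. Consider a training batch containing $N$ training samples, where the $n$-th training sample is $B = [x_1, x_2, \cdots, x_m]$; the mean of the training sample is then: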

$\mu_B = \frac{1}{m}\sum\limits_{i=1}^{m} x_i.$ (5)

$\sigma_B^2 = \frac{1}{m}\sum\limits_{i=1}^{m} \left( x_i - \mu_B \right)^2,$ (6)

$\hat{B} = \frac{B - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}.$ (7)

$y = \gamma \hat{B} + \beta.$ (8)
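The four steps of Eqs. (5) to (8) can be sketched as follows. This is a training-mode illustration only: the default `eps=1e-5` and the fixed scalar values of `gamma` and `beta` are assumptions made here, whereas in a trained network $\gamma$ and $\beta$ are learned parameters.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize B = [x_1, ..., x_m] following Eqs. (5)-(8).

    gamma, beta and eps are illustrative defaults, not values from the paper.
    """
    m = len(batch)
    mu = sum(batch) / m                                        # Eq. (5): mean
    var = sum((x - mu) ** 2 for x in batch) / m                # Eq. (6): variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # Eq. (7): normalize
    return [gamma * v + beta for v in x_hat]                   # Eq. (8): scale, shift


y = batch_norm([1.0, 2.0, 3.0, 4.0])
print(y)
```

With `gamma=1` and `beta=0` the output has zero mean and a variance slightly below one, the small shortfall being caused by the stabilizing term $\epsilon$ in Eq. (7).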
2 Speech enhancement based on a time-domain fully convolutional network
2.1 Network structure design

Fig. 1 Structure of the time-domain fully convolutional network

2.2 Algorithm flow

$m(n) = s(n) + d(n).$ (9)
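The additive model of Eq. (9) is commonly used to synthesize noisy training data at a prescribed signal-to-noise ratio. The sketch below is one possible way to do this and is not taken from the paper: it scales the noise by a gain $g$ before adding it, so that $d(n)$ in Eq. (9) corresponds to the scaled noise. The helper name `mix_at_snr`, the 440 Hz test tone and the 16 kHz sampling rate are all hypothetical.

```python
import math
import random


def mix_at_snr(s, d, snr_db):
    """Return m(n) = s(n) + g * d(n), with g set so the mixture has snr_db."""
    p_s = sum(v * v for v in s) / len(s)   # average power of the clean speech
    p_d = sum(v * v for v in d) / len(d)   # average power of the raw noise
    # Solve 10 * log10(p_s / (g^2 * p_d)) = snr_db for the noise gain g.
    g = math.sqrt(p_s / (p_d * 10 ** (snr_db / 10)))
    return [sv + g * dv for sv, dv in zip(s, d)]


# Illustrative signals: a sine tone standing in for s(n), white Gaussian noise for d(n).
rng = random.Random(0)
s = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
d = [rng.gauss(0.0, 1.0) for _ in range(1600)]
m = mix_at_snr(s, d, snr_db=5.0)
```

Pairs of `m` and `s` produced this way could serve as the network input and training target, respectively.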

Fig. 2 Flowchart of speech enhancement based on the time-domain fully convolutional network
3 Simulation and result analysis

3.1 Data processing

3.2 Result analysis

SNR and LSD evaluate the degree of speech distortion in the time domain and the frequency domain, respectively, and are defined as:

$\mathrm{SNR} = 10\lg \frac{\sum\limits_{n=1}^{N} s^2(n)}{\sum\limits_{n=1}^{N} \left| m(n) - s(n) \right|^2},$ (10)
$\mathrm{LSD} = \frac{1}{T}\sum\limits_{t=0}^{T-1} \sqrt{\frac{1}{L/2+1}\sum\limits_{k=0}^{L/2} \left[ 10\lg \frac{P(k)}{\hat{P}(k)} \right]^2}.$ (11)
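Both measures can be computed directly from Eqs. (10) and (11), as in the minimal sketch below. Several choices are assumptions made here because the surviving text does not specify them: $P(k)$ is taken as the power spectrum of a clean frame and $\hat{P}(k)$ as that of the corresponding evaluated frame, frames are non-overlapping and rectangular with $L = 64$, and a small `floor` avoids taking the logarithm of zero. The naive DFT keeps the example free of external dependencies; an FFT routine would normally be used.

```python
import cmath
import math
import random


def snr_db(s, m):
    """Eq. (10): SNR in dB of signal m(n) against the clean reference s(n)."""
    num = sum(v * v for v in s)
    den = sum((mv - sv) ** 2 for sv, mv in zip(s, m))
    return 10.0 * math.log10(num / den)


def power_spectrum(frame):
    """|DFT|^2 at bins k = 0..L/2 of one length-L frame (naive O(L^2) DFT)."""
    L = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / L)
                    for n, x in enumerate(frame))) ** 2
            for k in range(L // 2 + 1)]


def lsd(clean, estimate, L=64, floor=1e-12):
    """Eq. (11): mean over T frames of the RMS log-spectral difference in dB.

    L and floor are illustrative choices, not values from the paper.
    """
    T = len(clean) // L
    total = 0.0
    for t in range(T):
        P = power_spectrum(clean[t * L:(t + 1) * L])
        P_hat = power_spectrum(estimate[t * L:(t + 1) * L])
        sq = [(10.0 * math.log10((p + floor) / (q + floor))) ** 2
              for p, q in zip(P, P_hat)]
        total += math.sqrt(sum(sq) / (L // 2 + 1))
    return total / T


# Illustrative check: a signal compared with a copy scaled by a constant.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
print(snr_db(x, [1.1 * v for v in x]))   # about 20 dB
print(lsd(x, [2.0 * v for v in x]))      # about 10 * lg(4) dB, i.e. about 6.02
```

A higher SNR and a lower LSD both indicate less distortion; identical signals give an LSD of zero.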

Fig. 3 Time-domain waveforms under white noise (5 dB)

Fig. 4 Spectrograms of the enhanced speech under white noise (5 dB)

4 Conclusion

