Volume 27, Issue 6


Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

Xiang Hao, Chenglin Xu, Lei Xie, Haizhou Li
School of Computer Science, Northwestern Polytechnical University, Xi’an 710000, China
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 710129, Singapore

Abstract

In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). We propose a novel reinforcement learning algorithm and network architecture that incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation, which directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves perceptual quality over supervised learning methods trained with the MSE objective function.
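
The sampling-based idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' network: the filter generation agent is reduced here to a free mean vector `mu` of a diagonal Gaussian with a fixed standard deviation, the reward `black_box_score` is a hypothetical stand-in (negative MSE) for a non-differentiable metric such as perceptual evaluation of speech quality, and the update is the plain REINFORCE estimator with a batch-mean baseline. All function names, the toy signals, and the hyperparameters are assumptions made for illustration only.

```python
import numpy as np


def black_box_score(enhanced, clean):
    """Hypothetical stand-in for a non-differentiable metric such as PESQ.

    Higher is better. Training uses only the returned scalar, never a gradient.
    """
    return -float(np.mean((enhanced - clean) ** 2))


def make_toy_signals(n=800, noise_std=0.3, seed=0):
    """Toy data (an assumption): a slow sinusoid plus white Gaussian noise."""
    rng = np.random.default_rng(seed)
    clean = np.sin(2.0 * np.pi * np.arange(n) / 50.0)
    noisy = clean + noise_std * rng.standard_normal(n)
    return noisy, clean


def train_kernel_policy(noisy, clean, taps=5, steps=1000, batch=8,
                        lr=0.02, sigma=0.1, seed=1):
    """REINFORCE on the mean of a diagonal Gaussian over kernel taps."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(taps)
    mu[taps // 2] = 1.0  # start from the identity (pass-through) filter
    for _ in range(steps):
        # Sample a batch of convolution kernels from N(mu, sigma^2 I).
        eps = rng.standard_normal((batch, taps))
        kernels = mu + sigma * eps
        # Score each sampled kernel with the black-box metric.
        rewards = np.array([
            black_box_score(np.convolve(noisy, k, mode="same"), clean)
            for k in kernels
        ])
        # A batch-mean baseline reduces the variance of the gradient estimate.
        advantages = rewards - rewards.mean()
        # grad_mu log N(k; mu, sigma^2 I) = (k - mu) / sigma^2 = eps / sigma
        grad = (advantages[:, None] * eps / sigma).mean(axis=0)
        mu += lr * grad  # gradient ascent on the expected reward
    return mu


if __name__ == "__main__":
    noisy, clean = make_toy_signals()
    mu = train_kernel_policy(noisy, clean)
    enhanced = np.convolve(noisy, mu, mode="same")
    print("score before:", black_box_score(noisy, clean))
    print("score after: ", black_box_score(enhanced, clean))
```

Because the update only needs the scalar reward and the gradient of the Gaussian log-density, the metric itself never has to be differentiable. In the setting the abstract describes, the distribution would be predicted by a network from the input rather than held as a free parameter as in this sketch.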

Keywords: neural networks, reinforcement learning, speech enhancement, dynamic filter


Publication history

Received: 02 March 2021
Revised: 08 June 2021
Accepted: 12 July 2021
Published: 21 June 2022
Issue date: December 2022

Copyright

© The author(s) 2022.

Acknowledgements

This work was supported by the National Research Foundation of Singapore (No. AISG-100E-2018-006) and by Human-Robot Interaction Phase 1 (No. 1922500054) under the National Robotics Programme, Singapore.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
