Open Access

Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

School of Computer Science, Northwestern Polytechnical University, Xi’an 710000, China
Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Abstract

In neural speech enhancement, there is a mismatch between the training objective, i.e., the Mean-Square Error (MSE), and the perceptual quality evaluation metrics, such as the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). We propose a novel reinforcement learning algorithm and network architecture that incorporate a non-differentiable perceptual quality metric into the objective function through a dynamic filter module. Unlike the traditional dynamic filter implementation, which directly generates a convolution kernel, we use a filter generation agent to predict the parameters of a multivariate Gaussian distribution, from which the convolution kernel is sampled. Experimental results show that the proposed reinforcement learning method clearly improves perceptual quality over supervised learning methods trained with the MSE objective.
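
To make the sampling idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: an agent predicts the mean and standard deviation of a diagonal multivariate Gaussian over convolution-kernel taps, a kernel is sampled from that distribution and applied to the noisy waveform, and a REINFORCE-style policy gradient propagates a non-differentiable reward (e.g., a PESQ score) back to the agent. This is not the authors' implementation; the names FilterGenerationAgent, reinforce_step, and reward_fn, and the single-utterance, single-channel setup, are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterGenerationAgent(nn.Module):
    """Predicts mean and std of a diagonal Gaussian over kernel taps."""
    def __init__(self, feat_dim: int, kernel_size: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, kernel_size)         # kernel mean
        self.log_sigma = nn.Linear(feat_dim, kernel_size)  # kernel log-std

    def forward(self, feats: torch.Tensor) -> torch.distributions.Normal:
        # exp() keeps the standard deviation positive.
        return torch.distributions.Normal(self.mu(feats),
                                          self.log_sigma(feats).exp())

def reinforce_step(agent, feats, noisy_wave, reward_fn, optimizer):
    """One policy-gradient (REINFORCE) update from a black-box reward.

    feats:      (feat_dim,) summary features of one noisy utterance
    noisy_wave: (time,) the noisy waveform
    reward_fn:  assumed wrapper around a perceptual metric such as PESQ
    """
    dist = agent(feats)
    kernel = dist.sample()                  # sampled kernel, no gradient
    log_prob = dist.log_prob(kernel).sum()  # log-likelihood of the sample

    # Dynamic filtering: convolve the waveform with the sampled kernel.
    enhanced = F.conv1d(
        noisy_wave.view(1, 1, -1),          # (batch=1, channels=1, time)
        kernel.view(1, 1, -1),              # (out=1, in=1, kernel_size)
        padding=kernel.numel() // 2,        # "same" length for odd sizes
    ).view(-1)

    # The metric is non-differentiable, so it is evaluated as a black box
    # and used only as a scalar reward.
    with torch.no_grad():
        reward = reward_fn(enhanced)

    loss = -reward * log_prob               # maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

Because the reward is evaluated as a black box, no gradient flows through the perceptual metric; the score-function estimator above only needs the reward's scalar value. In practice, subtracting a baseline from the reward (a standard variance-reduction step in policy-gradient methods) would stabilize training.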

Tsinghua Science and Technology
Pages 939–947
Cite this article:
Hao X, Xu C, Xie L, et al. Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning. Tsinghua Science and Technology, 2022, 27(6): 939-947. https://doi.org/10.26599/TST.2021.9010048


Received: 02 March 2021
Revised: 08 June 2021
Accepted: 12 July 2021
Published: 21 June 2022
© The author(s) 2022.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
