A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Xueyang Wu1, Rongzhong Lian2, Di Jiang2, Yuanfeng Song2, Weiwei Zhao2, Qian Xu1,2, Qiang Yang1,2
1 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
2 WeBank Co. Ltd., Shenzhen 518057, China

Abstract

Robustness is a long-standing challenge for automatic speech recognition (ASR), because any deployed ASR system encounters far noisier speech than the clean corpora it is trained on, and it is impractical to annotate data for every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that effectively improves ASR performance in practical noisy environments by seamlessly integrating pre-training, self-supervised learning, and fine-tuning. PSP consists of three stages. First, we pre-train a phone-to-word transducer (PWT) to map generated phone sequences to the target text using only unpaired text data. Second, we continue training the PWT on more challenging data produced by an empirical phone-perturbation heuristic, in addition to self-supervised signals obtained by recovering the tainted phones. Third, we fine-tune the resulting PWT on real-world speech data. Experiments on two real-life datasets collected from industrial scenarios show that PSP improves the traditional ASR pipeline with relative character error rate (CER) reductions of 28.63% and 26.38%, respectively, and it also demonstrates robustness on highly noisy synthetic speech datasets.
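To make the three-stage schedule concrete, the Python sketch below illustrates one possible realization of PSP training. It is not the authors' implementation: the phone inventory, the perturbation probabilities, and the helpers (text_to_phones, train_step, and the pwt model object) are hypothetical placeholders standing in for a real lexicon, acoustic front end, and transducer training loop.

import random

# Hypothetical phone inventory used only for this illustration.
PHONE_INVENTORY = ["a", "o", "e", "i", "u", "b", "p", "m", "f", "d", "t", "n", "l"]

def perturb_phones(phones, sub_p=0.1, del_p=0.05, ins_p=0.05):
    """Corrupt a phone sequence to simulate noisy acoustic-model output
    (an assumed substitution/deletion/insertion heuristic, not the paper's exact one)."""
    noisy = []
    for ph in phones:
        r = random.random()
        if r < del_p:                        # drop the phone
            continue
        if r < del_p + sub_p:                # substitute a random phone
            noisy.append(random.choice(PHONE_INVENTORY))
        else:
            noisy.append(ph)
        if random.random() < ins_p:          # occasionally insert a spurious phone
            noisy.append(random.choice(PHONE_INVENTORY))
    return noisy

def psp_train(pwt, text_corpus, speech_corpus, text_to_phones, train_step):
    """Run the three PSP stages on a phone-to-word transducer (PWT)."""
    # Stage 1: pre-train the PWT on clean phone -> text pairs derived from unpaired text.
    for text in text_corpus:
        train_step(pwt, text_to_phones(text), text, objective="transduce")

    # Stage 2: continue training on perturbed phones, adding a self-supervised
    # objective that recovers the original (tainted) phones.
    for text in text_corpus:
        clean = text_to_phones(text)
        noisy = perturb_phones(clean)
        train_step(pwt, noisy, text, objective="transduce")
        train_step(pwt, noisy, clean, objective="recover_phones")

    # Stage 3: fine-tune on real speech; speech_corpus yields (decoded_phones, transcript) pairs.
    for decoded_phones, transcript in speech_corpus:
        train_step(pwt, decoded_phones, transcript, objective="transduce")

The point the sketch captures is that the first two stages need only unpaired text (phone sequences are derived from the text itself and then perturbed), while paired speech data enters only in the final fine-tuning stage.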

Keywords: self-supervised learning, pre-training, automatic speech recognition


Publication history

Received: 05 November 2021
Revised: 28 March 2022
Accepted: 02 April 2022
Published: 28 August 2022
Issue date: September 2022

Copyright

© The author(s) 2022

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
