Open Access

A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Xueyang Wu [1], Rongzhong Lian [2], Di Jiang [2], Yuanfeng Song [2], Weiwei Zhao [2], Qian Xu [1,2], Qiang Yang [1,2]
[1] Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
[2] WeBank Co. Ltd., Shenzhen 518057, China.


Robustness is a long-standing challenge for automatic speech recognition (ASR), because deployed ASR systems encounter speech that is much noisier than the clean corpora they were trained on, and it is impractical to annotate data for every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that improves ASR performance in practical noisy environments by integrating pre-training, self-supervised learning, and fine-tuning. PSP has three stages. First, the phone-to-word transducer (PWT) is pre-trained to map a generated phone sequence to the target text using only unpaired text data. Second, the PWT is further trained on more complex data generated by an empirical phone-perturbation heuristic, with an additional self-supervised signal from recovering the tainted phones. Third, the resulting PWT is fine-tuned on real-world speech data. Experiments on two real-life datasets collected from industrial scenarios and on synthetic noisy datasets show that PSP improves the traditional ASR pipeline, with relative character error rate (CER) reductions of 28.63% and 26.38% on the two real-life datasets, respectively. PSP also remains robust on highly noisy synthetic speech datasets.
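For readers unfamiliar with the metric, the sketch below shows how a character error rate and a relative CER reduction are conventionally computed: CER is the character-level edit distance between reference and hypothesis divided by the reference length, and the relative reduction is the drop in CER divided by the baseline CER. This is a minimal illustration, not the authors' evaluation code; the function names and the example strings and values are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(cer("speech recognition", "speech recogniton"))
    # Hypothetical baseline and improved CER values, not results from the paper.
    print(relative_reduction(0.10, 0.07))
```

A relative reduction is a ratio of error rates, so a figure such as 28.63% describes the share of the baseline CER that was eliminated, not an absolute drop in percentage points.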


CAAI Artificial Intelligence Research
Pages 1-7
Cite this article:
Wu X, Lian R, Jiang D, et al. A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition. CAAI Artificial Intelligence Research, 2022, 1(1): 1-7.







Received: 05 November 2021
Revised: 28 March 2022
Accepted: 02 April 2022
Published: 28 August 2022
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (