Open Access

A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Xueyang Wu¹, Rongzhong Lian², Di Jiang², Yuanfeng Song², Weiwei Zhao², Qian Xu¹,², Qiang Yang¹,² (corresponding author)
¹ Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
² WeBank Co. Ltd., Shenzhen 518057, China.

Abstract

Robustness is a long-standing challenge for automatic speech recognition (ASR), because deployed ASR systems encounter speech that is much noisier than the clean corpora they are trained on, and it is impractical to annotate every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that improves ASR performance in practical noisy environments by seamlessly integrating pre-training, self-supervised learning, and fine-tuning. PSP has three fundamental stages. First, a phone-to-word transducer (PWT) is pre-trained to map a generated phone sequence to the target text using only unpaired text data. Second, the PWT is trained further on more complex data generated by an empirical phone-perturbation heuristic, in addition to self-supervised signals obtained by recovering the tainted phones. Third, the resultant PWT is fine-tuned with real-world speech data. We perform experiments on two real-life datasets collected from industrial scenarios and on synthetic noisy datasets. The results show that PSP improves the traditional ASR pipeline, with relative character error rate (CER) reductions of 28.63% and 26.38% on the two real-life datasets, respectively. PSP also demonstrates robustness on highly noisy synthetic speech datasets.
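The relative CER reductions quoted above follow the usual convention of dividing the drop in CER by the baseline CER. The sketch below is a minimal illustration of that arithmetic, assuming CER is the character-level Levenshtein distance divided by the reference length; the function names are hypothetical and this is not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline CER removed by the new system."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Illustrative numbers only (not taken from the paper):
# a baseline CER of 0.10 and a new CER of 0.07.
print(f"{relative_reduction(0.10, 0.07):.2%}")
```

Under this convention, a relative reduction of 28.63% means the new CER is about 71% of the baseline CER, whatever the absolute values are.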

CAAI Artificial Intelligence Research
Pages 1-7
Cite this article:
Wu X, Lian R, Jiang D, et al. A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition. CAAI Artificial Intelligence Research, 2022, 1(1): 1-7. https://doi.org/10.26599/AIR.2022.9150001


Received: 05 November 2021
Revised: 28 March 2022
Accepted: 02 April 2022
Published: 28 August 2022
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
