Robustness is a long-standing challenge for automatic speech recognition (ASR) because the deployment environment of any ASR system contains much noisier speech than the clean training corpora. However, it is impractical to annotate every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that improves ASR performance in practical noisy environments by integrating pre-training, self-supervised learning, and fine-tuning. PSP has three stages. First, pre-train the phone-to-word transducer (PWT) to map a generated phone sequence to the target text using only unpaired text data; second, continue training the PWT on more complex data generated by an empirical phone-perturbation heuristic, in addition to self-supervised signals from recovering the tainted phones; and third, fine-tune the resulting PWT with real-world speech data. Experiments on two real-life datasets collected from industrial scenarios and on synthetic noisy datasets show that PSP improves the traditional ASR pipeline, with relative character error rate (CER) reductions of 28.63% and 26.38% on the two real-life datasets, respectively. PSP also proves robust on highly noisy synthetic speech datasets.
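To make two ideas from the abstract concrete, the following Python sketch illustrates (a) a phone-perturbation step that also records which phones were tainted, and (b) how a relative CER reduction is computed. The abstract does not specify the perturbation heuristic, so `perturb_phones`, its `confusable` table, and the `rate` parameter are hypothetical stand-ins, not the authors' method. The CER and relative-reduction functions follow the standard definitions (edit distance divided by reference length, and baseline minus improved divided by baseline).

```python
import random
from typing import Dict, List, Sequence, Tuple


def perturb_phones(
    phones: Sequence[str],
    confusable: Dict[str, List[str]],
    rate: float,
    rng: random.Random,
) -> Tuple[List[str], List[bool]]:
    """Hypothetical perturbation: swap a phone for a confusable one.

    Each phone that has an entry in `confusable` is replaced with
    probability `rate`. The returned mask marks the tainted positions,
    which a recovery objective could use as a self-supervised target.
    """
    noisy: List[str] = []
    tainted: List[bool] = []
    for phone in phones:
        candidates = confusable.get(phone, [])
        if candidates and rng.random() < rate:
            noisy.append(rng.choice(candidates))
            tainted.append(True)
        else:
            noisy.append(phone)
            tainted.append(False)
    return noisy, tainted


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline
```

As an illustration with made-up numbers, a baseline CER of 10% that drops to roughly 7.14% corresponds to a relative reduction of about 28.6%; the absolute CER values behind the reported 28.63% and 26.38% are not given in the abstract.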
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).