Open Access

A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Xueyang Wu [1], Rongzhong Lian [2], Di Jiang [2], Yuanfeng Song [2], Weiwei Zhao [2], Qian Xu [1,2], Qiang Yang [1,2]
[1] Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
[2] WeBank Co. Ltd., Shenzhen 518057, China.

Abstract

Robustness is a long-standing challenge for automatic speech recognition (ASR) because any deployed ASR system faces much noisier speech than the clean corpora it was trained on. However, it is impractical to annotate every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that improves ASR performance in practical noisy environments by seamlessly integrating pre-training, self-supervised learning, and fine-tuning. PSP has three fundamental stages. First, pre-train the phone-to-word transducer (PWT) to map a generated phone sequence to the target text using only unpaired text data; second, continue training the PWT on more complex data generated from an empirical phone-perturbation heuristic, in addition to self-supervised signals obtained by recovering the tainted phones; and third, fine-tune the resultant PWT with real-world speech data. Experiments on two real-life datasets collected from industrial scenarios and on synthetic noisy datasets show that PSP improves the traditional ASR pipeline, with relative character error rate (CER) reductions of 28.63% and 26.38% on the two real-life datasets, respectively. PSP also remains robust on highly noisy synthetic speech datasets.
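
As a rough illustration of two ideas in the abstract, the sketch below shows (a) one possible phone-perturbation step that taints a clean phone sequence and records which positions were tainted, so they could serve as recovery targets, and (b) how a relative CER reduction is computed from a baseline and an improved CER. The abstract does not specify the actual perturbation heuristic, so the confusion table `CONFUSABLE`, the `<mask>` token, and the function names are all assumptions made for illustration and are not the authors' implementation.

```python
import random

# Hypothetical confusion sets (assumption): phones that a noisy acoustic
# model might plausibly mix up. The paper's real heuristic is not given here.
CONFUSABLE = {"s": ["sh"], "sh": ["s"], "n": ["l", "ng"], "l": ["n"], "ng": ["n"]}
MASK = "<mask>"  # hypothetical placeholder for phones with no confusion set


def perturb_phones(phones, rate, rng):
    """Taint each phone with probability `rate`.

    Returns (noisy_phones, tainted_flags); the flags mark positions a model
    could be asked to recover as a self-supervised signal.
    """
    noisy, tainted = [], []
    for phone in phones:
        if rng.random() < rate:
            choices = CONFUSABLE.get(phone)
            noisy.append(rng.choice(choices) if choices else MASK)
            tainted.append(True)
        else:
            noisy.append(phone)
            tainted.append(False)
    return noisy, tainted


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, improved_cer):
    """Relative CER reduction, e.g. 0.10 -> 0.07 is a 30% relative reduction."""
    return (baseline_cer - improved_cer) / baseline_cer


if __name__ == "__main__":
    rng = random.Random(0)
    print(perturb_phones(["n", "i", "h", "ao"], rate=0.5, rng=rng))
    # Illustrative CER values only; these are not figures from the paper.
    print(f"{relative_reduction(0.10, 0.07):.2%}")
```

Note that a relative reduction is measured against the baseline CER, not in absolute percentage points: a reported 28.63% relative reduction means the new CER is about 71.37% of the baseline CER, whatever the baseline was.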

CAAI Artificial Intelligence Research
Pages 1-7

Cite this article:
Wu X, Lian R, Jiang D, et al. A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition. CAAI Artificial Intelligence Research, 2022, 1(1): 1-7. https://doi.org/10.26599/AIR.2022.9150001

Views: 6136
Downloads: 741
Crossref: 3

Received: 05 November 2021
Revised: 28 March 2022
Accepted: 02 April 2022
Published: 28 August 2022
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).