D. Jurafsky and J. H. Martin, Speech and Language Processing. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2000.
H. Hu, T. Tan, and Y. Qian, Generative adversarial networks based data augmentation for noise robust speech recognition, in Proc. 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 5044–5048.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, arXiv preprint arXiv: 1706.03762, 2017.
M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv: 1910.13461, 2019.
Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, et al., Transformer-based acoustic modeling for hybrid speech recognition, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6874–6878.
D. Povey, L. Burget, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, et al., The subspace Gaussian mixture model—A structured model for speech recognition, Comput. Speech Lang., vol. 25, no. 2, pp. 404–439, 2011.
M. Gales and S. Young, The application of hidden Markov models in speech recognition, Found. Trends Signal Process., vol. 1, no. 3, pp. 195–304, 2007.
G. Hinton, L. Deng, D. Yu, G. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
S. Latif, M. Usman, R. Rana, and J. Qadir, Phonocardiographic sensing using deep learning for abnormal heartbeat detection, IEEE Sensors J., vol. 18, no. 22, pp. 9393–9400, 2018.
A. Qayyum, S. Latif, and J. Qadir, Quran reciter identification: A deep learning approach, in Proc. 7th Int. Conf. Computer and Communication Engineering (ICCCE), Kuala Lumpur, Malaysia, 2018, pp. 492–497.
Z. Chen, Q. Liu, H. Li, and K. Yu, On modular training of neural acoustics-to-word model for LVCSR, in Proc. 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 4754–4758.
M. N. Sundararaman, A. Kumar, and J. Vepa, Phoneme-BERT: Joint language modelling of phoneme sequence and ASR transcript, arXiv preprint arXiv: 2102.00804, 2021.
S. Ghorbani, S. Khorram, and J. H. L. Hansen, Domain expansion in DNN-based acoustic models for robust speech recognition, in Proc. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 107–113.
A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, Recurrent neural networks for noise reduction in robust ASR, in Proc. Interspeech 2012, Portland, OR, USA, 2012, pp. 22–25.
J. Liao, Y. Shi, M. Gong, L. Shou, S. Eskimez, L. Lu, H. Qu, and M. Zeng, Generating human readable transcript for automatic speech recognition with pre-trained language model, arXiv preprint arXiv: 2102.11114, 2021.
X. Qiu, T. Sun, Y. G. Xu, Y. Shao, N. Dai, and X. Huang, Pre-trained models for natural language processing: A survey, Sci. China Technol. Sci., vol. 63, no. 10, pp. 1872–1897, 2020.
T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv: 1301.3781, 2013.
O. Melamud, J. Goldberger, and I. Dagan, context2vec: Learning generic context embedding with bidirectional LSTM, in Proc. 20th SIGNLL Conf. Computational Natural Language Learning, Berlin, Germany, 2016, pp. 51–61.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, in Proc. 2018 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 2018, pp. 2227–2237.
J. Howard and S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv: 1801.06146, 2018.
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv: 1810.04805, 2019.
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The Kaldi speech recognition toolkit, in Proc. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA, 2011.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv: 1912.01703, 2019.
G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in Proc. ACL 2017, System Demonstrations, Vancouver, Canada, 2017, pp. 67–72.
D. Snyder, G. Chen, and D. Povey, MUSAN: A music, speech, and noise corpus, arXiv preprint arXiv: 1510.08484, 2015.