[1]
B. Kim, S. Chang, J. Lee, and D. Sung, Broadcasted residual learning for efficient keyword spotting, in Proc. 22nd Annu. Conf. Int. Speech Communication Association, Brno, Czechia, 2021, pp. 4538–4542.
[2]
T. Mo, Y. Yu, M. Salameh, D. Niu, and S. Jui, Neural architecture search for keyword spotting, in Proc. 21st Annu. Conf. Int. Speech Communication Association, Shanghai, China, 2020, pp. 1982–1986.
[3]
M. Xu and X. L. Zhang, Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting, in Proc. 21st Annu. Conf. Int. Speech Communication Association, Shanghai, China, 2020, pp. 2547–2551.
[4]
X. Li, X. Wei, and X. Qin, Small-footprint keyword spotting with multi-scale temporal convolution, in Proc. 21st Annu. Conf. Int. Speech Communication Association, Shanghai, China, 2020, pp. 1987–1991.
[5]
D. C. D. Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, A neural attention model for speech command recognition, arXiv preprint arXiv:1808.08929, 2018.
[6]
O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, Streaming keyword spotting on mobile devices, in Proc. 21st Annu. Conf. Int. Speech Communication Association, Shanghai, China, 2020, pp. 2277–2281.
[7]
X. Chen, S. Yin, D. Song, P. Ouyang, L. Liu, and S. Wei, Small-footprint keyword spotting with graph convolutional network, in Proc. 2019 IEEE Automatic Speech Recognition and Understanding Workshop, Singapore, 2019, pp. 539–546.
[8]
Y. Bai, J. Yi, J. Tao, Z. Wen, Z. Tian, C. Zhao, and C. Fan, A time delay neural network with shared weight self-attention for small-footprint keyword spotting, in Proc. 20th Annu. Conf. Int. Speech Communication Association, Graz, Austria, 2019, pp. 2190–2194.
[9]
A. Berg, M. O’Connor, and M. T. Cruz, Keyword transformer: A self-attention model for keyword spotting, in Proc. 22nd Annu. Conf. Int. Speech Communication Association, Brno, Czechia, 2021, pp. 4249–4253.
[10]
L. Wang, R. Gu, N. Chen, and Y. Zou, Text anchor based metric learning for small-footprint keyword spotting, in Proc. 22nd Annu. Conf. Int. Speech Communication Association, Brno, Czechia, 2021, pp. 4219–4223.
[12]
L. Lugosch, S. Myer, and V. S. Tomar, DONUT: CTC-based query-by-example keyword spotting, arXiv preprint arXiv:1811.10736, 2018.
[13]
S. Settle, K. Levin, H. Kamper, and K. Livescu, Query-by-example search with discriminative neural acoustic word embeddings, in Proc. 18th Annu. Conf. Int. Speech Communication Association, Stockholm, Sweden, 2017, pp. 2874–2878.
[14]
G. Chen, C. Parada, and T. N. Sainath, Query-by-example keyword spotting using long short-term memory networks, in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, South Brisbane, Australia, 2015, pp. 5236–5240.
[15]
J. Huang, W. Gharbieh, H. S. Shim, and E. Kim, Query-by-example keyword spotting system using multi-head attention and soft-triple loss, in Proc. 2021 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Toronto, Canada, 2021, pp. 6858–6862.
[16]
D. Ng, R. Zhang, J. Q. Yip, C. Zhang, Y. Ma, T. H. Nguyen, C. Ni, E. S. Chng, and B. Ma, Contrastive speech mixup for low-resource keyword spotting, in Proc. 2023 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5.
[17]
Y. Xi, B. Yang, H. Li, J. Guo, and K. Yu, Contrastive learning with audio discrimination for customizable keyword spotting in continuous speech, arXiv preprint arXiv:2401.06485, 2024.
[19]
P. M. Reuter, C. Rollwage, and B. T. Meyer, Multilingual query-by-example keyword spotting with metric learning and phoneme-to-embedding mapping, in Proc. 2023 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5.
[20]
U. Michieli, P. P. Parada, and M. Ozay, Online continual learning in keyword spotting for low-resource devices via pooling high-order temporal statistics, in Proc. 24th Annu. Conf. Int. Speech Communication Association, Dublin, Ireland, 2023, pp. 1628–1632.
[21]
Y. H. Lee and N. Cho, PhonMatchNet: Phoneme-guided zero-shot keyword spotting for user-defined keywords, in Proc. 24th Annu. Conf. Int. Speech Communication Association, Dublin, Ireland, 2023, pp. 3964–3968.
[22]
M. Rusci and T. Tuytelaars, Few-shot open-set learning for on-device customization of keyword spotting systems, in Proc. 24th Annu. Conf. Int. Speech Communication Association, Dublin, Ireland, 2023, pp. 2768–2772.
[23]
B. Labrador, P. Zhu, G. Zhao, A. S. Scarpati, Q. Wang, A. Lozano-Diez, A. Park, and I. L. Moreno, Personalizing keyword spotting with speaker information, arXiv preprint arXiv:2311.03419, 2023.
[24]
T. Higuchi, A. Gupta, and C. Dhir, Multi-task learning with cross attention for keyword spotting, in Proc. 2021 IEEE Automatic Speech Recognition and Understanding Workshop, Cartagena, Colombia, 2021, pp. 571–578.
[25]
D. Ng, Y. Chen, B. Tian, Q. Fu, and E. S. Chng, ConvMixer: Feature interactive convolution with curriculum learning for small footprint and noisy far-field keyword spotting, in Proc. 2022 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Singapore, 2022, pp. 3603–3607.
[26]
D. Ng, Y. Xiao, J. Q. Yip, Z. Yang, B. Tian, Q. Fu, E. S. Chng, and B. Ma, Small footprint multi-channel network for keyword spotting with centroid based awareness, in Proc. 24th Annu. Conf. Int. Speech Communication Association, Dublin, Ireland, 2023, pp. 296–300.
[27]
Y. Shi, D. Wang, L. Li, J. Han, and S. Yin, Spot keywords from very noisy and mixed speech, in Proc. 24th Annu. Conf. Int. Speech Communication Association, Dublin, Ireland, 2023, pp. 1488–1492.
[29]
A. Trockman and J. Z. Kolter, Patches are all you need? arXiv preprint arXiv:2201.09792, 2022.
[30]
S. Sigtia, P. Clark, R. Haynes, H. Richards, and J. Bridle, Multi-task learning for voice trigger detection, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020, pp. 7449–7453.
[31]
Y. Tian, H. Yao, M. Cai, Y. Liu, and Z. Ma, Improving RNN transducer modeling for small-footprint keyword spotting, in Proc. 2021 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Toronto, Canada, 2021, pp. 5624–5628.
[32]
M. Jung, Y. Jung, J. Goo, and H. Kim, Multi-task network for noise-robust keyword spotting and speaker verification using CTC-based soft VAD and global query attention, in Proc. 21st Annu. Conf. Int. Speech Communication Association, Shanghai, China, 2020, pp. 931–935.
[33]
S. Yang, B. Kim, I. Chung, and S. Chang, Personalized keyword spotting through multi-task learning, in Proc. 23rd Annu. Conf. Int. Speech Communication Association, Incheon, Republic of Korea, 2022, pp. 1881–1885.
[35]
B. Alwadei, M. Zuair, M. Al Rahhal, and Y. Bazi, ConvMixer with selective kernel attention for hyperspectral image classification, in Proc. 2022 IEEE Int. Geoscience and Remote Sensing Symp., Kuala Lumpur, Malaysia, 2022, pp. 3203–3206.
[36]
X. Li, W. Wang, X. Hu, and J. Yang, Selective kernel networks, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 510–519.
[39]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000–6010.
[42]
K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–778.
[43]
K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. 3rd Int. Conf. Learning Representations, San Diego, CA, USA, 2015.
[44]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, in Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1–9.
[45]
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017.
[46]
X. Zhang, X. Zhou, M. Lin, and J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6848–6856.
[48]
L. Zhang, K. Zhang, and H. Pan, SUNet++: A deep network with channel attention for small-scale object segmentation on 3D medical images, Tsinghua Science and Technology, vol. 28, no. 4, pp. 628–638, 2023.
[49]
J. Park, S. Woo, J. Y. Lee, and I. S. Kweon, BAM: Bottleneck attention module, in Proc. British Machine Vision Conf. 2018, Newcastle, UK, 2018, p. 147.
[50]
S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, CBAM: Convolutional block attention module, in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 3–19.
[51]
Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, ECA-Net: Efficient channel attention for deep convolutional neural networks, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 11531–11539.
[53]
D. Misra, T. Nalamada, A. U. Arasanipalai, and Q. Hou, Rotate to attend: Convolutional triplet attention module, in Proc. 2021 IEEE Winter Conf. Applications of Computer Vision, Waikoloa, HI, USA, 2021, pp. 3138–3147.
[55]
X. Liu, H. Peng, N. Zheng, Y. Yang, H. Hu, and Y. Yuan, EfficientViT: Memory efficient vision transformer with cascaded group attention, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 14420–14430.
[56]
A. Shaker, M. Maaz, H. Rasheed, S. Khan, M. H. Yang, and F. S. Khan, SwiftFormer: Efficient additive attention for transformer-based real-time mobile vision applications, in Proc. 2023 IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 17379–17390.
[58]
M. Munir, W. Avery, and R. Marculescu, MobileViG: Graph-based sparse attention for mobile vision applications, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023, pp. 2211–2219.
[59]
I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al., MLP-Mixer: An all-MLP architecture for vision, in Proc. 35th Int. Conf. Neural Information Processing Systems, virtual, 2021.
[60]
S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, CutMix: Regularization strategy to train strong classifiers with localizable features, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision, Seoul, Republic of Korea, 2019, pp. 6022–6031.
[61]
J. H. Kim, W. Choo, H. Jeong, and H. O. Song, Co-Mixup: Saliency guided joint mixup with supermodular diversity, arXiv preprint arXiv:2102.03065, 2021.
[62]
P. Warden, Speech commands: A dataset for limited-vocabulary speech recognition, arXiv preprint arXiv:1804.03209, 2018.
[63]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing, South Brisbane, Australia, 2015, pp. 5206–5210.
[65]
N. Dalsaniya, S. H. Mankad, S. Garg, and D. Shrivastava, Development of a novel database in Gujarati language for spoken digits classification, in Proc. 5th Int. Symp. Advances in Signal Processing and Intelligent Recognition Systems, Trivandrum, India, 2020, pp. 208–219.
[67]
S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and S. Ha, Temporal convolution for real-time keyword spotting on mobile devices, in Proc. 20th Annu. Conf. Int. Speech Communication Association, Graz, Austria, 2019, pp. 3372–3376.
[68]
H. Zhang, K. Zu, J. Lu, Y. Zou, and D. Meng, EPSANet: An efficient pyramid squeeze attention block on convolutional neural network, in Proc. 16th Asian Conf. Computer Vision, Macao, China, 2022, pp. 541–557.