Open Access

Multi-Task ConvMixer Networks with Triplet Attention for Low-Resource Keyword Spotting

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract

Customized keyword spotting must adapt quickly to small sets of user-provided samples. Existing methods address the problem mainly under moderate noise conditions, and recent work raises the difficulty of keyword detection by introducing keyword interference. However, that solution has so far been explored only on large models with many parameters, making it unsuitable for deployment on small devices, and when it is applied to lightweight models with minimal training data, performance degrades relative to the baseline model. We therefore propose a lightweight multi-task architecture (< 9.0×10⁴ parameters) that integrates a triplet attention module into ConvMixer networks together with a new auxiliary mixed-label encoding to address this challenge. Experimental results show that the proposed model outperforms comparable lightweight keyword spotting models, with accuracy gains of 0.73% to 2.95% on a clean set and 2.01% to 3.37% on a mixed set across different training-set sizes. Furthermore, the model remains robust on several low-resource language datasets while converging faster.
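
To make the architectural idea concrete, the following is a minimal PyTorch-style sketch of one ConvMixer block augmented with triplet attention. It is an illustration under assumptions, not the authors' implementation: the class names (ZPool, AttentionGate, TripletAttention, ConvMixerBlock), the kernel sizes, and the example input shape (32 channels over a 40-mel, 101-frame spectrogram) are hypothetical choices based on the standard ConvMixer and triplet attention formulations cited in the paper.

import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Concatenate max- and mean-pooled maps along the channel axis.
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    # Z-pool -> 7x7 conv -> BN -> sigmoid, producing an attention map.
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(1)
    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    # Three branches capture (C,H), (C,W), and (H,W) interactions by
    # rotating the tensor before the gate; the outputs are averaged.
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()
        self.ch = AttentionGate()
        self.hw = AttentionGate()
    def forward(self, x):
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # swap C and H
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # swap C and W
        x_hw = self.hw(x)                                          # plain spatial branch
        return (x_cw + x_ch + x_hw) / 3.0

class ConvMixerBlock(nn.Module):
    # Depthwise "mixing" convolution with a residual connection, then a
    # pointwise convolution; triplet attention refines the block output.
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(), nn.BatchNorm2d(dim))
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim))
        self.attn = TripletAttention()
    def forward(self, x):
        x = x + self.depthwise(x)   # residual over the depthwise mixer
        return self.attn(self.pointwise(x))

# Usage: a batch of 8 spectrogram feature maps keeps its shape through one block.
x = torch.randn(8, 32, 40, 101)
print(ConvMixerBlock(32)(x).shape)  # torch.Size([8, 32, 40, 101])

In the multi-task setting described in the abstract, a stack of such blocks would presumably feed both a keyword-classification head and an auxiliary head trained on the mixed-label encoding; the exact head design is not specified here and is left out of the sketch.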

Tsinghua Science and Technology
Pages 875-893
Cite this article:
Kivaisi AR, Zhao Q, Zou Y. Multi-Task ConvMixer Networks with Triplet Attention for Low-Resource Keyword Spotting. Tsinghua Science and Technology, 2025, 30(2): 875-893. https://doi.org/10.26599/TST.2024.9010088


Received: 28 December 2023
Revised: 04 April 2024
Accepted: 09 May 2024
Published: 24 September 2024
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
