Pretrained Models and Evaluation Data for the Khmer Language

Shengyi Jiang; Sihui Fu; Nankai Lin; Yingwen Fu

doi:10.26599/TST.2021.9010060

Tsinghua Science and Technology 2022, 27(4): 709-718 https://doi.org/10.26599/TST.2021.9010060

Open Access | Issue | Published: 09 December 2021


School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510000, China

Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510000, China

Keywords: pretrained models, Khmer language, word segmentation, part-of-speech (POS) tagging, news categorization



Abstract

Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have hindered the progress of PTMs for low-resource languages. This work presents the first Transformer-based PTMs for the Khmer language. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not improve performance. We aim to release our models and datasets to the community in the hope of facilitating the future development of Khmer NLP applications.



About this article


Publication history

Received: 15 May 2021

Revised: 01 July 2021

Accepted: 30 July 2021

Published: 09 December 2021

Issue date: August 2022


Acknowledgements

This work was supported by the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031) and the Guangzhou Science and Technology Plan Project (No. 202009010021). We would like to extend our sincere gratitude to the anonymous reviewers for their insightful feedback.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).