Pretrained Models and Evaluation Data for the Khmer Language

Shengyi Jiang; Sihui Fu; Nankai Lin; Yingwen Fu

doi:10.26599/TST.2021.9010060

Open Access

School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510000, China

Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510000, China

Abstract

Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have hindered the progress of PTMs for low-resource languages. This work presents Transformer-based PTMs for the Khmer language for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not improve performance. We aim to release our models and datasets to the community in the hope of facilitating the future development of Khmer NLP applications.

Keywords

pretrained models; Khmer language; word segmentation; part-of-speech (POS) tagging; news categorization

Tsinghua Science and Technology

Volume 27, Issue 4, August 2022

Pages 709-718

DOI: 10.26599/TST.2021.9010060

Cite this article:

Jiang S, Fu S, Lin N, et al. Pretrained Models and Evaluation Data for the Khmer Language. Tsinghua Science and Technology, 2022, 27(4): 709-718. https://doi.org/10.26599/TST.2021.9010060

Part of a topical collection:

Special Issue on Social Computing

Received: 15 May 2021

Revised: 01 July 2021

Accepted: 30 July 2021

Published: 09 December 2021

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).