X. P. Qiu, T. X. Sun, Y. G. Xu, Y. F. Shao, N. Dai, and X. J. Huang, Pre-trained models for natural language processing: A survey, Sci. China Technol. Sci., vol. 63, no. 10, pp. 1872-1897, 2020.
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2019, pp. 4171-4186.
Y. H. Liu, M. Ott, N. Goyal, J. F. Du, M. Joshi, D. Q. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692, 2019.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog, vol. 1, no. 8, pp. 9-32, 2019.
Z. L. Yang, Z. H. Dai, Y. M. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, presented at the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019, pp. 5754-5764.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res., vol. 21, no. 140, pp. 1-67, 2020.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165, 2020.
T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, in Proc. 1st Int. Conf. Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2013, pp. 1-9.
J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, in Proc. 2014 Conf. Empirical Methods in Natural Language Processing, Doha, Qatar, 2014, pp. 1532-1543.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, in Proc. 2018 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 2018, pp. 2227-2237.
J. Howard and S. Ruder, Universal language model fine-tuning for text classification, in Proc. 56th Annu. Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 328-339.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 1-15.
K. Clark, M. T. Luong, Q. V. Le, and C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in Proc. 8th Int. Conf. Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 2020, pp. 1-18.
B. B. Yu, Y. Nuo, X. Yan, Q. L. Lei, G. Y. Xu, and F. Zhou, Segmentation and alignment of Chinese and Khmer bilingual names based on hierarchical Dirichlet process, presented at the Int. Conf. Mechatronics and Intelligent Robotics (ICMIR2018), Kunming, China, 2018, pp. 441-450.
U. Phon and C. Pluempitiwiriyawej, Khmer WordNet construction, presented at the 5th Int. Conf. Information Technology (InCIT), Chonburi, Thailand, 2020, pp. 122-127.
H. Y. Chi, X. Yan, S. Y. Li, F. Zhou, G. Y. Xu, and L. Zhang, The acquisition of Khmer-Chinese parallel sentence pairs from comparable corpus based on Manhattan-BiGRU model, presented at the 2020 Chinese Control and Decision Conf., Hefei, China, 2020, pp. 4801-4805.
S. Ning, X. Yan, Y. Nuo, F. Zhou, Q. Xie, and J. P. Zhang, Chinese-Khmer parallel fragments extraction from comparable corpus based on Dirichlet process, Procedia Comput. Sci., vol. 166, pp. 213-221, 2020.
H. S. Pan, X. Yan, Z. T. Yu, and J. Y. Guo, A Khmer named entity recognition method by fusing language characteristics, presented at the 26th Chinese Control and Decision Conf., Changsha, China, 2014, pp. 4003-4007.
X. H. Liu, X. Yan, G. Y. Xu, Z. T. Yu, and G. S. Qin, Khmer-Chinese bilingual LDA topic model based on dictionary, Int. J. Comput. Sci. Math., vol. 10, no. 6, pp. 557-565, 2019.
C. Nou and W. Kameyama, Khmer POS Tagger: A transformation-based approach with hybrid unknown word handling, presented at the Int. Conf. Semantic Computing (ICSC 2007), Irvine, CA, USA, 2007, pp. 482-492.
C. Nou and W. Kameyama, Hybrid approach for Khmer unknown word POS guessing, presented at the 2007 IEEE Int. Conf. Information Reuse and Integration, Las Vegas, NV, USA, 2007, pp. 215-220.
C. C. Ding, M. Utiyama, and E. Sumita, NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging, ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 18, no. 2, p. 17, 2019.
Y. K. Thu, V. Chea, and Y. Sagisaka, Comparison of six POS tagging methods on 12K sentences Khmer language POS tagged corpus, in Proc. 1st Regional Conf. Optical Character Recognition and Natural Language Processing Technologies for ASEAN Languages (ONA 2017), Phnom Penh, Cambodia, 2017, pp. 1-12.
S. Khoeurn and Y. S. Kim, Sentiment analysis engine for Cambodian music industry re-building, J. Korea Soc. Simul., vol. 26, no. 4, pp. 23-34, 2017.
T. Ratanak, A study on the sentiment classification for Khmer comments on news (in Chinese), Master's dissertation, Kunming Univ. Sci. Technol., Kunming, China, 2017.
P. J. O. Suárez, B. Sagot, and L. Romary, Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures, in Proc. 22nd Workshop on Challenges in the Management of Large Corpora (CMLC-7), Mannheim, Germany, 2019, pp. 9-16.
V. Chea, Y. K. Thu, C. C. Ding, M. Utiyama, A. Finch, and E. Sumita, Khmer word segmentation using conditional random fields, in Khmer Natural Language Processing, Phnom Penh, Cambodia, 2015, pp. 62-69.
X. Y. Liu, J. X. Wu, and Z. H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst., Man, Cybern., Part B (Cybern.), vol. 39, no. 2, pp. 539-550, 2009.