Open Access

Pretrained Models and Evaluation Data for the Khmer Language

Shengyi Jiang, Sihui Fu, Nankai Lin, Yingwen Fu
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510000, China
Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510000, China

Abstract

Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have hindered the progress of PTMs for low-resource languages. This work presents Transformer-based PTMs for the Khmer language for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not improve performance. We aim to release our models and datasets to the community in the hope of facilitating the future development of Khmer NLP applications.

Tsinghua Science and Technology
Pages 709-718

Cite this article:
Jiang S, Fu S, Lin N, et al. Pretrained Models and Evaluation Data for the Khmer Language. Tsinghua Science and Technology, 2022, 27(4): 709-718. https://doi.org/10.26599/TST.2021.9010060

Views: 3445
Downloads: 366
Citations: Crossref 13, Web of Science 11, Scopus 15, CSCD 0

Received: 15 May 2021
Revised: 01 July 2021
Accepted: 30 July 2021
Published: 09 December 2021
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).