A Tibetan Sentence Boundary Disambiguation Model Considering the Components on Information on Both Sides of Shad

Fenfang Li; Hui Lv; Yiming Gao; Dolha; Yan Li; Qingguo Zhou

doi:10.26599/TST.2022.9010055


Open Access


Fenfang Li¹, Hui Lv¹, Yiming Gao¹, Dolha², Yan Li¹, Qingguo Zhou¹

¹ School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China

² Key Laboratory of China’s National Linguistic Information Technology, Northwest Minzu University, Lanzhou 730030, China


Abstract

Sentence Boundary Disambiguation (SBD) is a preprocessing step in natural language processing. Segmenting text into sentences is essential for Deep Learning (DL) and for pretraining language models. Tibetan punctuation marks can be ambiguous as to where sentences begin and end. Hence, the ambiguous punctuation marks must be disambiguated so that sentence structure is correctly encoded in language models. This study proposes a component-level Tibetan SBD approach based on DL models. The models can reduce the error amplification caused by word segmentation and part-of-speech tagging. Whereas most SBD methods consider only the text on the left side of a punctuation mark, this study considers the text on both sides. A corpus of 465 669 Tibetan sentences is adopted, and a Bidirectional Long Short-Term Memory (Bi-LSTM) model is used to perform SBD. The experimental results show that the Bi-LSTM model reaches an F1-score of 96%, the best among the six models compared. Experiments are also performed on low-resource languages such as Turkish and Romanian, and on high-resource languages such as English and German, to verify the models’ generalization.
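As a rough illustration of what "considering the text on both sides" of a shad can mean in practice, the following minimal sketch collects a fixed-size character window to the left and right of every shad (U+0F0D) in a string. It is not the authors' implementation: the function name `shad_contexts`, the `window` parameter, and the placeholder input are assumptions made for this example, and the sketch covers only context extraction, not the encoding or the Bi-LSTM classifier that would consume these windows.

```python
SHAD = "\u0f0d"  # TIBETAN MARK SHAD, the candidate sentence boundary


def shad_contexts(text, window=5):
    """Return (index, left, right) for every shad in `text`.

    `left` and `right` are the raw characters on each side of the shad,
    truncated at the string edges. Working on characters avoids any
    dependence on word segmentation or part-of-speech tagging.
    The window size is a hypothetical choice for this sketch.
    """
    samples = []
    for i, ch in enumerate(text):
        if ch == SHAD:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            samples.append((i, left, right))
    return samples


# Placeholder input: Latin letters stand in for Tibetan components.
demo = "abc" + SHAD + "de" + SHAD
for idx, left, right in shad_contexts(demo, window=2):
    print(idx, repr(left), repr(right))
```

Each `(left, right)` pair would then be labeled as boundary or non-boundary and passed to a classifier; an empty `right` window, as for the final shad here, simply marks the end of the text.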

Keywords

Sentence Boundary Disambiguation (SBD); punctuation marks; ambiguity; Bidirectional Long Short-Term Memory (Bi-LSTM) model


Tsinghua Science and Technology

Volume 28, Issue 6, December 2023

Pages 1085-1100

DOI: 10.26599/TST.2022.9010055


Cite this article:

Li F, Lv H, Gao Y, et al. A Tibetan Sentence Boundary Disambiguation Model Considering the Components on Information on Both Sides of Shad. Tsinghua Science and Technology, 2023, 28(6): 1085-1100. https://doi.org/10.26599/TST.2022.9010055


Received: 29 April 2022

Revised: 14 August 2022

Accepted: 16 November 2022

Published: 28 July 2023

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).