Volume 28, Issue 6


A Tibetan Sentence Boundary Disambiguation Model Considering the Components on Information on Both Sides of Shad

Fenfang Li¹, Hui Lv¹, Yiming Gao¹, Dolha², Yan Li¹, Qingguo Zhou¹
¹ School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
² Key Laboratory of China’s National Linguistic Information Technology, Northwest Minzu University, Lanzhou 730030, China

Abstract

Sentence Boundary Disambiguation (SBD) is a preprocessing step in natural language processing. Segmenting text into sentences is essential for Deep Learning (DL) and for pretraining language models. Tibetan punctuation marks can be ambiguous as to where sentences begin and end. Hence, the ambiguous punctuation marks must be distinguished so that sentence structure is correctly encoded in language models. This study proposes a component-level Tibetan SBD approach based on DL models. Working at the component level, the models can reduce the error amplification caused by word segmentation and part-of-speech tagging. Whereas most SBD methods consider only the text to the left of a punctuation mark, this study considers the text on both sides. A corpus of 465 669 Tibetan sentences is used, and a Bidirectional Long Short-Term Memory (Bi-LSTM) model performs SBD. The experimental results show that the Bi-LSTM model reaches an F1-score of 96%, the best among the six models compared. Experiments are also performed on low-resource languages such as Turkish and Romanian and on high-resource languages such as English and German to verify the models’ generalization.

Keywords: Sentence Boundary Disambiguation (SBD), punctuation marks, ambiguity, Bidirectional Long Short-Term Memory (Bi-LSTM) model
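
The idea of using text on both sides of a punctuation mark can be illustrated with a minimal sketch. The abstract does not describe the authors' feature extraction, so everything below is an assumption: each Unicode code point is treated as one "component", the ambiguous mark is taken to be the Tibetan shad (U+0F0D) named in the title, and the function name `shad_contexts`, the `window` size, and the padding symbol are hypothetical. The sketch only builds the left and right context windows that a classifier such as a Bi-LSTM could consume; it does not train or run any model.

```python
SHAD = "\u0f0d"  # TIBETAN MARK SHAD; assumed to be the ambiguous mark


def shad_contexts(text, window=5, pad="_"):
    """Collect (left, right, position) for every shad in `text`.

    `left` and `right` each hold `window` code points taken from the
    two sides of the shad. Contexts that run past the ends of the text
    are filled with `pad`. The window size is an illustrative choice,
    not a value taken from the paper.
    """
    # Pad both ends so every shad has a full-width context.
    padded = pad * window + text + pad * window
    samples = []
    for i, ch in enumerate(text):
        if ch != SHAD:
            continue
        j = i + window  # index of this shad inside `padded`
        left = padded[j - window:j]
        right = padded[j + 1:j + 1 + window]
        samples.append((left, right, i))
    return samples


# Latin letters stand in for Tibetan components to keep the example readable.
demo = "abc" + SHAD + "de" + SHAD
contexts = shad_contexts(demo, window=2)
```

Each resulting pair could then be encoded and labelled as "sentence boundary" or "not a boundary"; a method that looks only at the left side would use `left` alone.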

Publication history

Received: 29 April 2022
Revised: 14 August 2022
Accepted: 16 November 2022
Published: 28 July 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2020YFC0832500), the Ministry of Education-China Mobile Research Foundation (No. MCM20170206), the Fundamental Research Funds for the Central Universities (Nos. lzujbky-2022-kb12, lzujbky-2021-sp43, lzujbky-2020-sp02, lzujbky-2019-kb51, and lzujbky-2018-k12), the National Natural Science Foundation of China (No. 61402210), the Science and Technology Plan of Qinghai Province (No. 2020-GX-164), the Google Research Awards and Google Faculty Award, the Provincial Science and Technology Plan (Major Science and Technology Projects-Open Solicitation) (No. 22ZD6GA048), the Gansu Provincial Science and Technology Major Special Innovation Consortium Project (No. 21ZD3GA002), and the Gansu Province Green and Smart Highway Key Technology Research and Demonstration. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Jetson TX1 used for this research.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
