Library Journal

Library Journal ›› 2025, Vol. 44 ›› Issue (413): 104-115.


Research on Automatic Word Segmentation of Cross-Language Classics Based on Large Language Model

Wang Xiyu 1,2, Wang Dongbo 1,2 (1. College of Information Management, Nanjing Agricultural University; 2. Research Center for Humanities and Social Computing, Nanjing Agricultural University)

  • Online: 2025-09-15  Published: 2025-09-29

Abstract:

This study explores the application and effectiveness of large language models (LLMs) in the automatic word segmentation of cross-lingual classical texts, with a focus on the segmentation differences between ancient and modern Chinese, and examines how LLMs can improve segmentation accuracy and efficiency. It offers a new approach to digitizing ancient literature and enriching language resources, and provides technical support for comparative literature and cross-cultural research. Xunzi-Qwen1.5-7B, Xunzi-Baichuan2-7B, Xunzi-GLM3-6B, and their corresponding base models Qwen1.5-7B-Base, Baichuan2-7B-Base, and ChatGLM3-6B-Base were selected for cross-lingual word segmentation experiments on classical texts. Based on the Zuo Zhuan, a cross-lingual word segmentation dataset containing ancient Chinese and modern Chinese was constructed, and the data were cleaned, labeled, and integrated. Training sets of 500, 1,000, 2,000, and 5,000 entries were drawn from the dataset, and each model was instruction fine-tuned on these subsets to test and compare performance in cross-lingual word segmentation. The experimental results show that LLMs have a significant performance advantage in cross-lingual word segmentation of classical texts: even with small-scale training data, the models achieve high segmentation accuracy. The results validate the effectiveness and potential of LLMs for cross-era and cross-lingual word segmentation and provide a reference for subsequent research on the digitization of ancient books and on language technology.
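
The abstract does not say how the fine-tuning samples are formatted or which metric measures segmentation accuracy. As a rough illustration only, the sketch below assumes an instruction/input/output record layout and the span-based precision, recall, and F1 commonly used for Chinese word segmentation. The function names, the prompt wording, the "/" separator, and the example segmentation of the Zuo Zhuan sentence are all hypothetical and are not taken from the paper.

```python
def make_record(words, sep="/"):
    """Build one instruction-tuning sample from a gold-segmented sentence.

    The field names and the prompt text are assumptions for illustration.
    """
    return {
        "instruction": "Segment the following sentence into words.",
        "input": "".join(words),
        "output": sep.join(words),
    }


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, pred):
    """Span-based precision, recall, and F1 over paired sentences.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans, pred_spans = to_spans(gold_words), to_spans(pred_words)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative segmentation (an assumption, not from the paper's dataset).
gold = [["鄭伯", "克", "段", "于", "鄢"]]
pred = [["鄭", "伯", "克", "段", "于", "鄢"]]

record = make_record(gold[0])
precision, recall, f1 = segmentation_prf(gold, pred)
```

In this toy case the prediction splits one gold word in two, so four of its six predicted words and four of the five gold words match, which lowers both precision and recall.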

Key words:

Digital humanities, Cross-language, Word segmentation of ancient classics, Large language models (LLMs)