Research on Automatic Word Segmentation of Cross-Language Classics Based on Large Language Model

Library Journal, 2025, Vol. 44, Issue (413): 104-115.

Wang Xiyu 1,2, Wang Dongbo 1,2 (1 College of Information Management, Nanjing Agricultural University; 2 Research Center for Humanities and Social Computing, Nanjing Agricultural University)

Online: 2025-09-15 Published: 2025-09-29
Abstract

This study explores the application and effectiveness of large language models (LLMs) in automatic word segmentation of cross-language classical texts, with a focus on the segmentation differences between ancient and modern Chinese, and examines how LLMs can improve the accuracy and efficiency of word segmentation. The work offers a new approach to digitizing ancient literature and enriching language resources, and provides technical support for comparative literature and cross-cultural research. Three models, Xunzi-Qwen1.5-7B, Xunzi-Baichuan2-7B, and Xunzi-GLM3-6B, together with their corresponding base models, Qwen1.5-7B-Base, Baichuan2-7B-Base, and ChatGLM3-6B-Base, were selected for cross-language word segmentation experiments on classical texts. Based on the Zuo Zhuan, a cross-language word segmentation dataset containing ancient Chinese and modern Chinese was constructed, and the data were cleaned, labeled, and integrated. The dataset was partitioned into training sets of 500, 1,000, 2,000, and 5,000 entries, and each model was instruction fine-tuned on these subsets so that its performance on cross-language word segmentation could be tested and compared. The experimental results showed that LLMs have a significant performance advantage in cross-language word segmentation of classical texts: even with small-scale training data, the models achieved high segmentation accuracy. The results validate the effectiveness and potential of LLMs for word segmentation of texts that span eras and language varieties, and provide a reference for subsequent research on the digitization of ancient books and on language technology.
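The data preparation described above (instruction-style samples, training subsets of 500 to 5,000 entries, and a segmentation score) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording in `INSTRUCTION`, the `/` delimiter, the record field names, the nested-prefix sampling in `make_subsets`, and the word-level precision/recall/F1 in `prf` are all assumptions, since the abstract does not specify them.

```python
import random

# Hypothetical prompt; the abstract does not give the actual instruction text.
INSTRUCTION = "Segment the following sentence into words, separated by '/'."


def to_sample(words):
    """Turn one gold-segmented sentence (a list of words) into an
    instruction-tuning record. Field names are an assumed convention."""
    return {
        "instruction": INSTRUCTION,
        "input": "".join(words),
        "output": "/".join(words),
    }


def make_subsets(samples, sizes=(500, 1000, 2000, 5000), seed=0):
    """Shuffle once and take prefixes, so each smaller training set is
    contained in the larger ones (an assumed sampling scheme)."""
    if max(sizes) > len(samples):
        raise ValueError("not enough samples for the largest subset")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}


def word_spans(words):
    """Map a segmentation to the set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold, pred):
    """Word-level precision, recall and F1 for one sentence; a predicted
    word counts as correct only if both of its boundaries match the gold."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative segmentations only, not taken from the paper's dataset.
    gold = ["初", "鄭武公", "娶", "于", "申"]
    pred = ["初", "鄭", "武公", "娶", "于", "申"]
    print(to_sample(gold))
    print(prf(gold, pred))
```

Taking prefixes of a single shuffle keeps the four training sets comparable, because each larger set only adds data to the smaller one; independent random draws would be an equally plausible reading of the abstract.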

Key words:

Digital humanities, Cross-language, Word segmentation of ancient classics, Large language models (LLMs)

Wang Xiyu, Wang Dongbo. Research on Automatic Word Segmentation of Cross-Language Classics Based on Large Language Model [J]. Library Journal, 2025, 44(413): 104-115.
