Library Journal

Library Journal ›› 2025, Vol. 44 ›› Issue (413): 104-115.

• Digital Humanities •

Research on Automatic Word Segmentation of Cross-Language Classics Based on Large Language Models

Wang Xiyu1,2, Wang Dongbo1,2 (1 College of Information Management, Nanjing Agricultural University; 2 Research Center for Humanities and Social Computing, Nanjing Agricultural University)

  • Online: 2025-09-15; Published: 2025-09-29
  • About the authors:

    Wang Xiyu, master's student, College of Information Management, Nanjing Agricultural University. Research interests: digital humanities, natural language processing. Author contributions: drafted the paper and completed its revision. E-mail: wangxiyu@stu.njau.edu.cn. Nanjing, Jiangsu 210095

    Wang Dongbo, professor and doctoral supervisor, College of Information Management, Nanjing Agricultural University. Research interests: digital humanities, natural language processing. Author contributions: guided the research design and provided revision suggestions. Nanjing, Jiangsu 210095


Abstract:

This study explores the application and effectiveness of large language models (LLMs) in automatic word segmentation of cross-language classics, focusing on the segmentation differences between ancient and modern Chinese and on how the language-processing capabilities of LLMs can improve segmentation accuracy and efficiency. The study not only offers a new approach to the digitization of ancient texts and the enrichment of language resources, but also provides technical support for comparative literature and cross-cultural research. Xunzi-Qwen1.5-7B, Xunzi-Baichuan2-7B, and Xunzi-GLM3-6B, together with their corresponding base models Qwen1.5-7B-Base, Baichuan2-7B-Base, and ChatGLM3-6B-Base, were selected for the cross-language classics segmentation experiments. A cross-language segmentation dataset covering both ancient and modern Chinese was constructed from the Zuo Zhuan, and the data were cleaned, annotated, and integrated. The dataset was then divided into training sets of 500, 1,000, 2,000, and 5,000 entries, and the models were instruction-tuned on these subsets to test and compare their performance on the cross-language segmentation task. The experimental results show that LLMs hold a significant performance advantage in cross-language classics segmentation: even with small-scale training data, the models achieve high segmentation accuracy. The findings validate the effectiveness and potential of LLMs for segmenting cross-era, cross-language texts, and provide valuable references and insights for subsequent research on ancient-book digitization and language technology.

Keywords: digital humanities; cross-language; word segmentation of classics; large language models (LLMs)

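The abstract describes two concrete steps: turning gold-standard segmentations into instruction-tuning records, and scoring model output for segmentation accuracy. The paper itself publishes no code, so the following Python is a minimal illustrative sketch only; the record format, prompt wording, and example sentence are assumptions, not the authors' actual pipeline. It shows one plausible shape for an instruction-tuning sample and the standard span-based precision/recall/F1 used to evaluate word segmentation.

    # A minimal sketch (not the authors' code) of the setup the abstract
    # describes: one instruction-tuning record for segmentation, plus
    # span-based precision/recall/F1 scoring of predicted segmentations.
    # Prompt wording and data layout here are illustrative assumptions.

    def make_sample(source: str, segmented: str) -> dict:
        """Build one instruction-tuning record (hypothetical format)."""
        return {
            "instruction": "Segment the following Classical Chinese "
                           "sentence into words, separated by spaces.",
            "input": source,
            "output": segmented,
        }

    def to_spans(tokens: list[str]) -> set[tuple[int, int]]:
        """Convert a token sequence to character (start, end) spans."""
        spans, start = set(), 0
        for tok in tokens:
            spans.add((start, start + len(tok)))
            start += len(tok)
        return spans

    def prf(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
        """Span-based precision/recall/F1 for word segmentation."""
        g, p = to_spans(gold), to_spans(pred)
        correct = len(g & p)
        precision = correct / len(p) if p else 0.0
        recall = correct / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    if __name__ == "__main__":
        # Illustrative ancient-Chinese sentence in the Zuo Zhuan register.
        sample = make_sample("郑伯克段于鄢", "郑伯 克 段 于 鄢")
        gold = sample["output"].split()
        pred = "郑 伯克 段 于 鄢".split()   # a hypothetical model output
        print(prf(gold, pred))              # -> (0.6, 0.6, 0.6)

Span-based scoring counts a predicted word as correct only when both of its character boundaries match the gold segmentation, which is the conventional way segmentation accuracy is reported for Chinese word segmentation tasks.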