Library Journal (图书馆杂志)

Library Journal ›› 2024, Vol. 43 ›› Issue (393): 96-108.

• Digital Humanities •

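Research on the Construction Method and Distribution Law of Object-Image Database for Ancient Poetry

Liu Maolin1,2, Zhao Meng1,2, Wang Hao1,2 (1 School of Information Management, Nanjing University; 2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service)

  • Online: 2024-01-15 Published: 2024-01-31
  • About the authors:
    Liu Maolin, School of Information Management, Nanjing University; Jiangsu Key Laboratory of Data Engineering and Knowledge Service; master's student. Research interest: digital humanities. Contribution: proposed the research idea, designed the model and experiments, and wrote the paper. E-mail: liuml@smail.nju.edu.cn. Nanjing, Jiangsu 210023.
    Zhao Meng, School of Information Management, Nanjing University; Jiangsu Key Laboratory of Data Engineering and Knowledge Service; master's student. Research interest: text analysis and mining. Contribution: data collection and processing, and revision of the paper. Nanjing, Jiangsu 210023.
    Wang Hao, School of Information Management, Nanjing University; Jiangsu Key Laboratory of Data Engineering and Knowledge Service; PhD, professor, doctoral supervisor. Research interests: construction and application of knowledge ontologies, applications of data mining techniques, and related topics. Contribution: guided the research direction, and revised and finalized the paper. Nanjing, Jiangsu 210023.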

Abstract:

From the perspective of digital humanities, ancient poetry resources hold great value but are difficult to analyze at scale. Studying methods for automatically constructing knowledge bases of ancient poetry supports macro-level analysis of the poems and helps uncover that value. First, based on the concept of the "object image" (物象), the study attempts to extract every objective thing in a poem that may carry emotion, which reduces analytical complexity enough to build an automated pipeline. Second, a RoBERTa-BiLSTM-CRF model is built with deep learning methods to extract object images from the poetry corpus. Third, Quan Tang Shi (《全唐诗》, the Complete Tang Poems) and a selection of Song dynasty poetry are used to verify the feasibility and generality of the model. Finally, an object-image database of Quan Tang Shi is constructed, and the distribution of its object images is given a preliminary analysis. After training on an automatically annotated Quan Tang Shi corpus, the model reaches F1 scores of 89.6%, 93.3%, and 93.6% for common nouns, time nouns, and place names, respectively. When transferred to a Song dynasty poetry corpus that was not used for training, the model extracts 4.5 object images per poem and can discover out-of-vocabulary words, which indicates good generality and extensibility.
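The two figures reported in the abstract, per-type F1 and extraction density, can be illustrated with a small sketch. This is not the authors' code and does not reproduce the RoBERTa-BiLSTM-CRF model; it only assumes that such a tagger emits one BIO tag per character. The tag names `n`, `t`, and `ns` (common noun, time noun, place name), the function names, and the toy annotation of the example line are all hypothetical.

```python
def decode_bio(chars, tags):
    """Turn per-character BIO tags into (start, end, type, text) spans."""
    spans, start, kind = [], None, None
    # A trailing "O" sentinel flushes a span that runs to the end of the line.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, kind, "".join(chars[start:i])))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f1_by_type(gold_spans, pred_spans):
    """Span-level F1 per type; a prediction counts only on an exact match."""
    scores = {}
    kinds = {s[2] for s in gold_spans} | {s[2] for s in pred_spans}
    for k in kinds:
        gold = {s for s in gold_spans if s[2] == k}
        pred = {s for s in pred_spans if s[2] == k}
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        scores[k] = 2 * p * r / (p + r) if p + r else 0.0
    return scores


def extraction_density(spans_per_poem):
    """Average number of extracted object images per poem."""
    return sum(len(s) for s in spans_per_poem) / len(spans_per_poem)


# Toy example with a hypothetical annotation (not taken from the paper).
line = "床前明月光"
gold_tags = ["B-n", "O", "B-n", "I-n", "O"]   # 床, 明月
pred_tags = ["B-n", "O", "O", "B-n", "O"]     # 床, 月

gold = decode_bio(line, gold_tags)
pred = decode_bio(line, pred_tags)
print(gold)
print(f1_by_type(gold, pred))
print(extraction_density([pred]))
```

Exact span matching is the stricter of the common scoring choices; the abstract does not state which convention the paper uses, so the scores here are only comparable to the reported ones under that assumption.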