Library Journal (图书馆杂志), 2026, Vol. 45, Issue 3: 53-65.

• Data Science •

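Constructing the Dataset and Implementing the Retrieval System for the Synopsis Bibliography (提要式目录数据集构建与检索系统实现)

Yan Xinjie (颜欣杰), Xiao Zhuo (肖卓), Lu Ziyan (卢子言), Xu Jian (徐健)

  • Online: 2026-03-15 Published: 2026-03-24
  • About the authors: Yan Xinjie, master's student, School of Information Management, Sun Yat-sen University, Guangzhou, Guangdong 510006. Research interests: digital humanities and interdisciplinary knowledge discovery. Contributions: data processing, technical development, and writing and revising the paper. E-mail: yanxj9@mail2.sysu.edu.cn
    Xiao Zhuo, librarian, Sun Yat-sen University Library, Guangzhou, Guangdong 510275. Research interests: classical Chinese philology, bibliographical studies, and digital humanities. Contributions: data collection, research design, and supervising and revising the paper.
    Lu Ziyan, master's student, School of Information Management, Sun Yat-sen University, Guangzhou, Guangdong 510006. Research interests: digital humanities. Contributions: data processing.
    Xu Jian, professor, School of Information Management, Sun Yat-sen University, Guangzhou, Guangdong 510006. Research interests: information analysis and intelligence studies, scientometrics and science and technology management, and interdisciplinary knowledge discovery. Contributions: research design, supervising the experiments, and adjusting and revising the paper's line of argument.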

Abstract: Synopsis bibliographies (书目提要) embody the guiding principle of traditional Chinese bibliography, "distinguishing the branches of learning and tracing their origins and development" (辨章学术,考镜源流), and hold considerable research value. Existing synopsis resources, however, are scattered and uneven in quality, which limits research on and applications of synopses. Following four principles (comprehensiveness, authority, copyright protection, and edition selection), this study digitized 47 synopsis bibliographies and extracted their field information, building a synopsis bibliography dataset of 59,624 records with 24 fields, including title, volume count, category, edition, and contributor. A fine-tuned GujiBERT model was used to automatically extract 1,669,117 entity records from the synopses. The project also developed and launched an online retrieval platform that supports visualized full-text search of the synopses and dataset download. This digitization method makes synopsis resources more efficient to manage and use, and the dataset provides new data support for studies of the characteristics of persons, geographical relations, and evaluative features in ancient books, thereby promoting the deep mining and sharing of ancient-book information.

Keywords: synopsis bibliography, methodology, data integration, retrieval system
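
The record structure the abstract describes can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' schema or platform code: only the five fields the abstract names (title, volume count, category, edition, contributor) are modeled, the `synopsis` field and the `search` helper are hypothetical, and the two records are made-up placeholders rather than entries from the dataset.

```python
from dataclasses import dataclass


@dataclass
class SynopsisRecord:
    """One bibliographic entry; only the five fields named in the abstract are modeled."""
    title: str
    volume_count: str   # kept as text, since volume statements may not be plain numbers
    category: str
    edition: str
    contributor: str
    synopsis: str = ""  # hypothetical field holding the full text of the synopsis


def search(records, keyword, fields=("title", "contributor", "synopsis")):
    """Return the records in which `keyword` occurs in any of the chosen fields."""
    return [r for r in records if any(keyword in getattr(r, f) for f in fields)]


# Made-up placeholder records for illustration only, not taken from the dataset.
records = [
    SynopsisRecord("Example Title A", "10", "Classics", "Example edition",
                   "Author One", "A synopsis mentioning rivers."),
    SynopsisRecord("Example Title B", "2", "History", "Example edition",
                   "Author Two", "A synopsis mentioning mountains."),
]

hits = search(records, "rivers")
print([r.title for r in hits])
```

A plain substring match stands in here for the platform's full-text search, whose actual implementation the abstract does not describe.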