Library Journal (图书馆杂志)

Library Journal ›› 2025, Vol. 44 ›› Issue (415): 64-74.

• Digital Humanities •

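Named Entity Recognition in the Ancient Science and Technology Domain Based on Machine Reading Comprehension and a Lexicon-Enhanced Model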

Pan Jun, Xiao Mingcheng, Tao Xiangxing (School of Science, Zhejiang University of Science and Technology)

  • Online: 2025-11-15 Published: 2025-11-26
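  • About the authors:

    Pan Jun: School of Science, Zhejiang University of Science and Technology; PhD, associate professor, master's supervisor. Research interests: digital humanities, data science. Author contribution: proposed the research idea, constructed the dataset, wrote the paper. E-mail: panjun@zust.edu.cn. Hangzhou, Zhejiang 310023
    Xiao Mingcheng: School of Science, Zhejiang University of Science and Technology; master's student. Research interest: natural language processing. Author contribution: experimental analysis, first draft of the paper. Hangzhou, Zhejiang 310023
    Tao Xiangxing: School of Science, Zhejiang University of Science and Technology; professor, doctoral supervisor. Research interests: statistical analysis and applications, history of science and technology. Author contribution: revision and finalization. Hangzhou, Zhejiang 310023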

Machine Reading Comprehension and Lexicon-Enhanced BERT Based Named Entity Recognition for Ancient Chinese Science and Technology Texts

Pan Jun, Xiao Mingcheng, Tao Xiangxing (School of Science, Zhejiang University of Science and Technology)

  • Online: 2025-11-15 Published: 2025-11-26
  • About the authors: Pan Jun, Xiao Mingcheng, Tao Xiangxing (School of Science, Zhejiang University of Science and Technology)

Abstract (translated from the Chinese original):

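To address named entity recognition in the domain of ancient science and technology, this paper proposes DLEBERT-MRC, a method based on machine reading comprehension and a lexicon-enhanced model. The method uses the BERT pre-trained model to extract contextual information from the question and the target text, introduces a domain lexicon between Transformer layers through a bilinear attention mechanism for feature enhancement, and uses SoftMax to predict the start and end positions of entities. Based on the constructed domain lexicon, an ancient science and technology corpus was built and annotated with Baidu Baike as the data source, yielding a named entity recognition dataset that conforms to the machine reading comprehension standard. Experimental evaluation demonstrates the effectiveness of the method, and ablation experiments further verify the importance of each component of the model.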

Keywords: named entity recognition, machine reading comprehension, domain lexicon enhancement, ancient Chinese science and technology, digital humanities
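
The lexicon-enhancement step named in the abstract (a domain lexicon injected between Transformer layers through bilinear attention) can be illustrated with a small, dependency-free sketch. It follows a common formulation of lexicon fusion, in which each character's hidden vector attends over the embeddings of the lexicon words that match it and the weighted sum is added back; whether DLEBERT-MRC uses exactly this form is an assumption. The names `bilinear_score` and `fuse_lexicon`, the matrix `W`, and the premise that word vectors are already projected to the hidden size are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(h, W, v):
    """Bilinear form h^T W v for a hidden vector h and a word vector v."""
    return sum(h[i] * W[i][j] * v[j]
               for i in range(len(h)) for j in range(len(v)))


def fuse_lexicon(h, word_vecs, W):
    """Add an attention-weighted sum of matched word vectors to h.

    h         : hidden vector of one character (length d)
    word_vecs : vectors of lexicon words matching this character,
                assumed already projected to length d
    W         : d x d bilinear attention matrix (hypothetical parameter)
    """
    if not word_vecs:
        # No lexicon word covers this character: leave it unchanged.
        return list(h)
    weights = softmax([bilinear_score(h, W, v) for v in word_vecs])
    z = [sum(w * v[k] for w, v in zip(weights, word_vecs))
         for k in range(len(h))]
    return [hk + zk for hk, zk in zip(h, z)]


# Toy usage: a 2-dimensional hidden state and two candidate word vectors.
h = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only
fused = fuse_lexicon(h, [[1.0, 0.0], [0.0, 1.0]], W)
```

In a real model `W` would be learned and the fused vector would typically pass through dropout and layer normalization before the next Transformer layer; those steps are omitted here.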

Abstract:

Recognizing named entities in the texts of ancient Chinese science and technology has presented a unique challenge in recent years. In response, we introduce a novel named entity recognition (NER) method, DLEBERT-MRC, which is grounded in a machine reading comprehension (MRC) framework and utilizes a domain lexicon-enhanced BERT model to extract contextual information from both questions and target texts. The introduction of domain-specific lexicons through a bilinear attention mechanism between Transformer layers significantly enriches contextual information. SoftMax is employed in the decoding layer to accurately predict the start and end positions of entities in the input text. Furthermore, we constructed an ancient Chinese science and technology NER dataset employing the BIOES scheme. The dataset, derived from Baidu Encyclopedia, is designed to align with the specifications of MRC tasks. Experimental evaluation verifies the effectiveness of the proposed method, while ablation experiments demonstrate the importance of each component of the model.

Key words:

Named entity recognition, Machine reading comprehension, Domain lexicon enhanced BERT, Ancient Chinese science and technology, Digital humanities
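
The abstract describes a BIOES-annotated dataset aligned with MRC task specifications, where the model predicts entity start and end positions. A minimal sketch of how BIOES tags might be turned into MRC-style examples is given below. The entity labels (`PER`, `BOOK`), the query wording, the example sentence, and the output field names are illustrative assumptions, not the paper's actual schema.

```python
def bioes_to_spans(tags):
    """Convert a BIOES tag sequence into (label, start, end) spans.

    Indices are inclusive. Malformed sequences (e.g. an E- tag with no
    matching B- tag) are skipped rather than repaired.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, label = None, None
            continue
        prefix, _, cur = tag.partition("-")
        if prefix == "S":            # single-character entity
            spans.append((cur, i, i))
            start, label = None, None
        elif prefix == "B":          # entity begins
            start, label = i, cur
        elif prefix == "I":          # inside: must continue the same label
            if label != cur:
                start, label = None, None
        elif prefix == "E":          # entity ends
            if start is not None and label == cur:
                spans.append((cur, start, i))
            start, label = None, None
    return spans


def to_mrc_examples(chars, tags, queries):
    """Build one MRC example per entity type: a query, the context,
    and the start/end positions of every entity of that type."""
    spans = bioes_to_spans(tags)
    context = "".join(chars)
    examples = []
    for label, query in queries.items():
        hits = [(s, e) for lab, s, e in spans if lab == label]
        examples.append({
            "query": query,
            "context": context,
            "start_positions": [s for s, _ in hits],
            "end_positions": [e for _, e in hits],
        })
    return examples


# Hypothetical sentence, labels, and queries for illustration.
chars = list("沈括著梦溪笔谈")
tags = ["B-PER", "E-PER", "O", "B-BOOK", "I-BOOK", "I-BOOK", "E-BOOK"]
queries = {
    "PER": "Find the persons mentioned in the text.",
    "BOOK": "Find the book titles mentioned in the text.",
}
examples = to_mrc_examples(chars, tags, queries)
```

Emitting one example per entity type, including types with no hits, gives the model negative examples as well; that choice is a common MRC-NER convention and is assumed here rather than taken from the paper.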