图书馆杂志

图书馆杂志, 2022, Vol. 41, Issue 2: 93-102.

• Digital Humanities •

基于Bootstrapping的家谱文本信息抽取方法研究

鲍宸洋 任 明(中国人民大学信息资源管理学院)

  • 出版日期:2022-02-15 发布日期:2022-02-23
  • 作者简介:鲍宸洋 中国人民大学信息资源管理学院,硕士研究生。研究方向:信息抽取、自然语言处理。作者贡献:数据处理、数据实验、方案改进、初稿撰写。E-mail:chunqiu829@126.com 北京 100872 任 明 女,中国人民大学信息资源管理学院,博士,副教授。研究方向:信息抽取、知识图谱、数字人文。作者贡献:提出模型构建方案、指导实验设计、论文修改。北京 100872

A Bootstrapping-based Information Extraction Method for Genealogy Text

Bao Chenyang, Ren Ming (School of Information Resource Management, Renmin University of China)   

  • Online: 2022-02-15 Published: 2022-02-23
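  • About author: Bao Chenyang, master's student, School of Information Resource Management, Renmin University of China. Research interests: information extraction, natural language processing. Contributions: data processing, experiments, method improvement, first draft. E-mail: chunqiu829@126.com. Beijing 100872. Ren Ming (female), PhD, associate professor, School of Information Resource Management, Renmin University of China. Research interests: information extraction, knowledge graphs, digital humanities. Contributions: proposed the model construction scheme, guided the experimental design, revised the paper. Beijing 100872.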

摘要: 实现家谱文本信息的自动抽取是家谱资源深度开发利用的关键。目前深度学习在家谱文本信息抽取方面取得了良好的效果,但是对标注数据的依赖始终是其发展瓶颈之一。本文面向家谱的世系小传,研究基于小规模标注数据进行家谱人物和关系的抽取方法。具体来说:基于Bootstrapping的思想,以少量的标注数据作为初始种子集,使用深度学习BiLSTM-CRF模型为待标注样本自动预测标签序列,并筛选高置信分数的样本加入标注集中,从而迭代地扩展标注集,最后训练得到的模型用于命名实体识别和关系抽取。基于真实数据集的实验表明,使用Bootstrapping改进的BiLSTM-CRF模型能够基于小规模标注数据实现家谱信息抽取,使基于深度学习的家谱信息抽取更加高效。在种子集规模为250条时取得的预测效果与训练集规模为1800条的BiLSTM-CRF模型的预测效果接近。

Abstract: Automatic information extraction from genealogy text is key to the in-depth development and use of genealogy resources. Deep learning has achieved good results on this task, but its dependence on labeled data remains a bottleneck. Focusing on the lineage biographies in genealogies, this paper studies how to extract persons and relationships from small-scale labeled data. Specifically, following the idea of bootstrapping, a small amount of labeled data serves as the initial seed set; a BiLSTM-CRF model predicts label sequences for the unlabeled samples, and the samples with high confidence scores are added to the labeled set, which is thus expanded iteratively. The model obtained at the end of training is used for named entity recognition and relation extraction. Experiments on a real dataset show that the BiLSTM-CRF model improved with bootstrapping can extract genealogy information from small-scale labeled data, making deep-learning-based genealogy information extraction more efficient. With a seed set of 250 samples, the method performs close to a BiLSTM-CRF model trained on 1,800 labeled samples.
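The iterative expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the BiLSTM-CRF model is replaced by a toy per-character lookup tagger so that the example runs without any deep learning library, and the confidence score (the share of characters already seen in training), the threshold, and all function names are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sentence = List[str]            # one biography, split into characters
Tags = List[str]                # one tag per character
Labeled = Tuple[Sentence, Tags]


def train_toy_tagger(labeled: List[Labeled]) -> Dict[str, str]:
    """Hypothetical stand-in for training BiLSTM-CRF:
    map each character to the tag it carries most often."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens, tags in labeled:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict_toy_tagger(model: Dict[str, str], tokens: Sentence) -> Tuple[Tags, float]:
    """Predict a tag sequence and an assumed confidence score:
    the fraction of characters the model has seen before."""
    tags = [model.get(token, "O") for token in tokens]
    known = sum(1 for token in tokens if token in model)
    confidence = known / len(tokens) if tokens else 0.0
    return tags, confidence


def bootstrap(seed: List[Labeled], unlabeled: List[Sentence],
              threshold: float = 0.8, max_rounds: int = 5):
    """Iteratively expand the labeled set with high-confidence predictions."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_toy_tagger(labeled)
        accepted, remaining = [], []
        for tokens in pool:
            tags, confidence = predict_toy_tagger(model, tokens)
            if confidence >= threshold:
                accepted.append((tokens, tags))   # trust the predicted labels
            else:
                remaining.append(tokens)          # keep for a later round
        if not accepted:
            break                                 # nothing new was added; stop
        labeled.extend(accepted)
        pool = remaining
    # Final model trained on the seed set plus all accepted samples.
    return train_toy_tagger(labeled), labeled
```

Calling `bootstrap(seed, unlabeled)` returns the final model together with the expanded labeled set. In a realistic setting the two toy functions would be swapped for training and decoding a sequence labeling model, and the confidence score would come from that model's own output.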