A Methodological and Empirical Study of Extracting Event Information in Textual Historical Materials Based on Conditional Random Fields: Taking the Digital Humanities Study of the Rabe’s Diary as an Example

Abstract

Textual historical materials have been widely digitized. How to extract geographic named entities and their related information from these texts, and how to mine the geographic information effectively, has become an important research topic. This paper proposes an approach that takes geographic named entities as the core: it extracts the event elements associated with each entity (time, place, persons, things, events, and phenomena), links this semantic information to geographic locations, and converts the event information described in the text into attribute data of each geographic named entity. As an empirical case, the study uses the document Japanese Soldiers’ Atrocities in the Nanking Safety Zone, included in Rabe’s Diary, and applies the conditional random field method to extract events. Combined with historical maps and other related data, the extracted geographic information is then plotted on a map. The method broadens the ways in which textual information can be exploited in the digital era and offers new ideas for text mining and knowledge discovery.
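The conversion of extracted event elements into attributes of each geographic named entity can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a sequence labeller such as a conditional random field has already assigned BIO tags to the tokens. The tag names (TIME, PER, ACT, LOC), the function names, and the sample sentence are all hypothetical and do not come from the diary. Tokens are joined with spaces here; character-level Chinese text would be joined without them.

```python
from collections import defaultdict


def spans_from_bio(tokens, tags):
    """Collect (label, text) spans from parallel token and BIO tag lists."""
    spans, label, buf = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if buf:
                spans.append((label, " ".join(buf)))
            label, buf = tag[2:], [tok]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the currently open span.
            buf.append(tok)
        else:
            # "O" or an inconsistent "I-" tag: close the open span, skip token.
            if buf:
                spans.append((label, " ".join(buf)))
            label, buf = None, []
    if buf:
        spans.append((label, " ".join(buf)))
    return spans


def events_by_place(sentences):
    """Attach each sentence's event elements to the places it mentions."""
    table = defaultdict(list)
    for tokens, tags in sentences:
        record = defaultdict(list)
        for label, text in spans_from_bio(tokens, tags):
            record[label].append(text)
        # Every place in the sentence receives the remaining elements
        # (time, persons, actions) as one attribute record.
        for place in record.pop("LOC", []):
            table[place].append(dict(record))
    return dict(table)


# Hypothetical tagged sentence, invented purely for illustration.
tokens = ["On", "December", "15", "soldiers", "searched",
          "a", "house", "on", "Example", "Road"]
tags = ["O", "B-TIME", "I-TIME", "B-PER", "B-ACT",
        "O", "O", "O", "B-LOC", "I-LOC"]

attributes = events_by_place([(tokens, tags)])
print(attributes)
```

Each key of the resulting dictionary is a place name, and each value is a list of event records that could be joined to the coordinates of that place once it has been located on a historical map.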

Zhao Xiaoxuan, Chen Gang, Huang Zijing (School of Geography and Ocean Science, Nanjing University; Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology; Key Laboratory for Land Satellite Remote Sensing Applications of Ministry of Natural Resources). A Methodological and Empirical Study of Extracting Event Information in Textual Historical Materials Based on Conditional Random Fields: Taking the Digital Humanities Study of the Rabe’s Diary as an Example[J]. Library Journal, 2024, 43(395): 101-108.
