Research on Data Extraction from Annual Reports of Public Cultural Service Institutions in the Multi-source Data Environment

Abstract

Public cultural service institutions hold rich data resources but have difficulty integrating them, while the public culture sector as a whole lacks macro-level management data. Annual reports contain rich data such as venue information, event data, and business data, and their data quality is relatively high, which makes them an important data source for public cultural services. How to extract data from annual reports and integrate it effectively has become an important research task in the multi-source data environment. The authors write a crawler program to download annual reports, identify the PDF file format, summarize the text structure and the contextual characteristics of specific data, and use regular expressions to build templates that match and extract various annual report data. Three sets of templates are designed for different types of data items: data located in paragraph headings, data with obvious numerical characteristics, and chronicles of major events ("memorabilia") with a fixed and unified format. These templates achieved good matching and extraction results.
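The template idea described in the abstract can be illustrated with a minimal sketch. It assumes the report text has already been extracted from the PDF, and it uses an invented English sample. The three patterns, the field names, and the `extract` function are hypothetical illustrations of the three template types; they are not the templates used in the paper.

```python
import re

# Hypothetical template 1: data located in paragraph headings,
# e.g. "1 Venue Overview" or "2.1 Reader Services".
HEADING_RE = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)[ \t]+(?P<title>\S.*)$", re.MULTILINE
)

# Hypothetical template 2: data with obvious numerical characteristics,
# e.g. "Floor area: 12,500 square meters".
NUMERIC_RE = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]*(?P<value>\d[\d,]*(?:\.\d+)?)[ \t]*(?P<unit>[^\n]*)$",
    re.MULTILINE,
)

# Hypothetical template 3: chronicle entries with a fixed, unified format,
# e.g. "2019-03-15 Opened the renovated reading room".
EVENT_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})[ \t]+(?P<event>\S.*)$", re.MULTILINE
)

# Invented sample text standing in for one extracted annual report.
SAMPLE = """\
1 Venue Overview
Floor area: 12,500 square meters
Annual visitors: 1,234,567
2 Chronicle of Events
2019-03-15 Opened the renovated reading room
2019-10-01 Hosted the National Day exhibition
"""


def extract(text):
    """Apply the three templates and return the matched items by type."""
    headings = [(m["number"], m["title"]) for m in HEADING_RE.finditer(text)]
    figures = [
        # Strip thousands separators before converting the value to a number.
        (m["label"].strip(), float(m["value"].replace(",", "")), m["unit"].strip())
        for m in NUMERIC_RE.finditer(text)
    ]
    events = [(m["date"], m["event"]) for m in EVENT_RE.finditer(text)]
    return {"headings": headings, "figures": figures, "events": events}


if __name__ == "__main__":
    for kind, items in extract(SAMPLE).items():
        print(kind, items)
```

Keeping one pattern per data type means each template can be tuned against the context of its own data items without affecting the others, which matches the abstract's description of summarizing context characteristics before matching.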

Liu Shiyang, Wang Weiwei, Hua Bolin. Research on Data Extraction from Annual Reports of Public Cultural Service Institutions in the Multi-source Data Environment[J]. Library Journal, 2020, 39(12): 52-60.

