Library Journal

Library Journal ›› 2020, Vol. 39 ›› Issue (12): 52-60.

• PRACTICE RESEARCH •

Research on Data Extraction from Annual Reports of Public Cultural Service Institutions in the Multi-source Data Environment

Liu Shiyang, Wang Weiwei, Hua Bolin   

  • Online: 2020-12-30 Published: 2020-12-30

Abstract: Public cultural service institutions hold rich data resources but have difficulty integrating them, while the public culture sector as a whole lacks macro-level management data. Annual reports contain rich data such as venue information, event data, and business data, and their data quality is relatively high, which makes them an important data source for public cultural services. How to extract data from annual reports and integrate it effectively has therefore become an important research task in the multi-source data environment. The authors write a crawler program to download annual reports, identify the PDF file format, summarize the text structure and the contextual characteristics of specific data items, and use regular expressions to build templates that match and extract the various annual report data. Three sets of templates are designed for different types of data items: data located in paragraph headings, data with obvious numerical characteristics, and memorabilia with a fixed and unified format. These templates achieved good matching and extraction results.
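
The three template types described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sample text, the pattern names (HEADING, NUMERIC, EVENT), and the function extract are hypothetical, and the layouts assumed here (numbered headings, "label: value" figures, and ISO-dated event lines) are stand-ins for whatever structure the real annual reports use. The crawler and the PDF-to-text step are omitted.

```python
import re

# Hypothetical plain text, standing in for text already extracted from a PDF report.
SAMPLE = """\
1. Venue Information
The library is open to the public all year round.
2. Business Data
Total visits: 1,203,400
Books loaned: 856,000
3. Memorabilia
2019-03-15: Opened the new children's reading room.
2019-10-01: Hosted the national day exhibition.
"""

# Template 1: data located in paragraph headings (assumed form "N. Title").
HEADING = re.compile(r"^(?P<num>\d+)\.[ \t]+(?P<title>.+)$", re.MULTILINE)

# Template 2: data with obvious numerical characteristics
# (assumed form "label: integer", with optional thousands separators).
NUMERIC = re.compile(
    r"^(?P<label>[A-Za-z ]+):[ \t]*(?P<value>\d[\d,]*)[ \t]*$", re.MULTILINE
)

# Template 3: memorabilia with a fixed, unified format (assumed "YYYY-MM-DD: text").
EVENT = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}):[ \t]*(?P<event>.+)$", re.MULTILINE
)


def extract(text):
    """Apply the three templates and return the matched items by type."""
    return {
        "headings": [m.group("title") for m in HEADING.finditer(text)],
        "figures": {
            m.group("label").strip(): int(m.group("value").replace(",", ""))
            for m in NUMERIC.finditer(text)
        },
        "events": [(m.group("date"), m.group("event")) for m in EVENT.finditer(text)],
    }


if __name__ == "__main__":
    for kind, items in extract(SAMPLE).items():
        print(kind, items)
```

Keeping one compiled pattern per data type mirrors the idea of separate template sets: each pattern can be tuned to the context of its own data items without affecting the others.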