Library Journal (图书馆杂志)

Library Journal, 2022, Vol. 41, Issue 5: 109-118.

• Information Management •

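A Study on the Automatic Classification of Chinese Literature in Periodicals Based on BERT Model
(基于 BERT 模型的中文期刊文献自动分类实践研究)

Shen Lili, Jiang Peng, Wang Jing (Shanghai Library)

  • Publication date: 2022-05-15; release date: 2022-05-24
  • About the authors:
    Shen Lili (沈立力): female, master's degree, librarian at Shanghai Library (Institute of Scientific and Technical Information of Shanghai). Research interests: digital humanities, knowledge organization, and knowledge discovery. Contribution: paper framework design and writing. E-mail: llshen@libnet.sh.cn. Shanghai 200031.
    Jiang Peng (姜鹏): master's degree, engineer at Shanghai Library (Institute of Scientific and Technical Information of Shanghai). Research interests: digitization and datafication, knowledge organization, and knowledge discovery. Contribution: paper revision, data collection and analysis. Shanghai 200031.
    Wang Jing (王静): female, master's degree, librarian at Shanghai Library (Institute of Scientific and Technical Information of Shanghai). Research interests: digitization and datafication. Contribution: data collection and analysis. Shanghai 200031.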

Abstract: The BERT model released by the Google AI team has achieved results in a number of natural language processing tasks, but its use for the automatic classification of Chinese literature remains to be explored. This paper examines how well the BERTbase Chinese base model classifies Chinese social science and sci-tech journal literature in practice, identifies the problems that arise when the model is applied, and proposes solutions. The training corpus consists of 1,745,000 records in total, drawn from four main classes: R (medicine and health), TG (metallurgy and metalworking), F (economics), and J (art). A further 9,610 records serve as test samples. The BERT model is used to classify social science and sci-tech journal literature separately. The test results show a four-level accuracy of 76.95% for social science literature and 68.55% for sci-tech literature. A penalty strategy is then introduced to provide a reference for setting the threshold for inspection-exempt data in practical work. The BERTbase model shows a degree of feasibility for the actual classification and indexing work of the Quan Guo Bao Kan Suo Yin (CNBKSY, 《全国报刊索引》) and basically meets the current need for automatic classification of Chinese literature in a networked environment.
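The abstract does not spell out how the penalty strategy works or how the inspection-exempt threshold is chosen. The following Python sketch shows one plausible reading, offered only as an illustration: each prediction above a confidence threshold is treated as exempt from inspection, a correct exempt prediction earns one point, and a wrong one costs a penalty. The names `Prediction`, `penalized_score`, `best_threshold`, and the default penalty weight are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Prediction:
    confidence: float  # assumed: the classifier's top-class probability
    correct: bool      # whether the predicted class matched the gold label


def penalized_score(preds: Iterable[Prediction], threshold: float,
                    penalty: float = 5.0) -> float:
    """Score a threshold: +1 per correct exempt item, -penalty per wrong one.

    Items below the threshold are assumed to go to manual inspection and
    contribute nothing to the score.
    """
    score = 0.0
    for p in preds:
        if p.confidence >= threshold:
            score += 1.0 if p.correct else -penalty
    return score


def best_threshold(preds: Sequence[Prediction], candidates: Sequence[float],
                   penalty: float = 5.0) -> float:
    """Return the candidate threshold with the highest penalized score."""
    return max(candidates, key=lambda t: penalized_score(preds, t, penalty))


if __name__ == "__main__":
    # Toy data for illustration only; not results from the paper.
    demo = [
        Prediction(0.99, True),
        Prediction(0.95, True),
        Prediction(0.90, False),
        Prediction(0.80, True),
        Prediction(0.60, False),
    ]
    print(best_threshold(demo, [0.5, 0.7, 0.85, 0.93]))
```

A larger penalty pushes the chosen threshold upward, trading a smaller exempt set for fewer uninspected errors. The actual weighting used in practice would have to come from the paper itself.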
