Library Journal

Library Journal ›› 2022, Vol. 41 ›› Issue (5): 109-118.

A Study on the Automatic Classification of Chinese
Literature in Periodicals Based on BERT Model

Shen Lili, Jiang Peng, Wang Jing (Shanghai Library)
  

  • Online: 2022-05-15  Published: 2022-05-24
  • About author: Shen Lili, Jiang Peng, Wang Jing (Shanghai Library)

Abstract: The BERT model released by the Google AI team has achieved strong results on a number of natural language processing tasks, but its use for the automatic classification of Chinese literature remains underexplored. This paper examines how well BERT’s Chinese base model classifies Chinese social science and sci-tech periodical literature in practice, points out the problems that arise when the model is applied, and proposes solutions. More than 1,745,000 Chinese documents from category R (medicine, health), category TG (metallurgy and metalworking), category F (economics), and category J (art) serve as the training corpus, and a further 9,610 records serve as test samples. The BERT model is then used to classify social science and sci-tech periodical literature. The results show that the model's four-level accuracy is 76.95% for social science literature and 68.55% for sci-tech literature. A penalty strategy is then introduced to provide a reference for setting the threshold for exemption data in practice. The BERT model can be used in the actual classification and indexing work of the Quan Guo Bao Kan Suo Yin (CNBKSY), meeting the need for automatic classification of Chinese documents in the current network environment.

Key words: