Library Journal

Library Journal ›› 2024, Vol. 43 ›› Issue (395): 61-74.


A Study of Automated Deep Classification of Literature Based on Chinese Library Classification

Zhang Yuhui (Shanghai Library)   

  • Online: 2024-03-15 Published: 2024-04-01
  • About author: Zhang Yuhui (Shanghai Library)

Abstract:

Deep classification of literature based on the Chinese Library Classification (CLC) combines two classical natural language processing problems: Extreme Multi-label Text Classification (XMC) and Hierarchical Text Classification (HTC). However, current research on CLC-based literature classification generally treats it as an ordinary text classification problem. Because the core features of the problem are not fully explored, these studies generally perform unsatisfactorily in deep classification, or are even infeasible. Through an in-depth analysis of the characteristics and difficulties of CLC-based literature classification, this paper examines the problem and related solutions from the perspectives of XMC and HTC, then adapts and extends those solutions to the characteristics of this scenario, which improves classification accuracy and extends both the depth and breadth of classification. The proposed model first extracts the semantic features of the text as the basis for classification, using a lightweight deep learning model suited to the XMC problem. Then, to address the HTC problem in CLC classification, it uses a Learning to Rank (LTR) framework to incorporate multiple features, including hierarchical structure information, as an auxiliary basis for classification, making fuller use of the information and knowledge embedded in the text semantics and in the classification system. The model combines the semantic understanding ability of deep learning models with the interpretability of machine learning models, and it scales well: it can be improved at a later stage by incorporating new features customized by experts. Moreover, the model is relatively lightweight, can handle tens of thousands of class labels with limited computational resources, and lays a good foundation for full-depth classification based on the CLC.
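
The two-stage idea described in the abstract (semantic candidate scoring followed by LTR-style re-ranking with hierarchical features) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the three features, and the hand-set weights are all hypothetical, a fixed linear scorer stands in for a trained LTR model, and the sketch assumes that a class number's ancestors are simply its string prefixes, which real CLC notation does not always satisfy.

```python
from typing import Dict, List, Tuple


def ancestors(code: str) -> List[str]:
    """Proper prefixes of a CLC-style class number.

    Assumption for illustration only: the hierarchy follows string
    prefixes (dots ignored). Real CLC notation has exceptions.
    """
    plain = code.replace(".", "")
    return [plain[:i] for i in range(1, len(plain))]


def features(code: str, stage1: Dict[str, float]) -> List[float]:
    """Hypothetical feature vector for one candidate label."""
    scores = {c.replace(".", ""): s for c, s in stage1.items()}
    # Hierarchical feature: strongest stage-1 score among the ancestors
    # that were also retrieved as candidates (0.0 if none were).
    parent_support = max((scores.get(a, 0.0) for a in ancestors(code)), default=0.0)
    depth = float(len(code.replace(".", "")))
    return [stage1[code], parent_support, depth]


# Hand-set weights standing in for a trained ranking model (assumption).
WEIGHTS = [1.0, 0.5, 0.02]


def rerank(stage1: Dict[str, float]) -> List[Tuple[str, float]]:
    """Re-rank stage-1 candidates by a linear score over the features."""
    scored = [
        (code, sum(w * f for w, f in zip(WEIGHTS, features(code, stage1))))
        for code in stage1
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up stage-1 semantic scores for four candidate class numbers.
stage1_scores = {"TP391": 0.40, "TP39": 0.35, "G254": 0.42, "TP": 0.30}
ranking = [code for code, _ in rerank(stage1_scores)]
```

In this toy input, the semantic score alone would rank G254 first, while the hierarchical feature moves TP391 ahead because its ancestors TP39 and TP were also retrieved. Because each feature is explicit, an expert-defined feature could be appended to the vector without changing the first stage, which is the kind of extensibility the abstract attributes to the LTR stage.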