A Study of Automated Deep Classification of Literature Based on Chinese Library Classification

Abstract

Deep classification of literature based on the Chinese Library Classification (CLC) combines two classical natural language processing problems: Extreme Multi-label Text Classification (XMC) and Hierarchical Text Classification (HTC). However, existing research on CLC-based literature classification generally treats it as an ordinary text classification problem. Because the core features of the problem are not fully explored, these studies generally perform unsatisfactorily, or are even infeasible, for deep classification. Through an in-depth analysis of the characteristics and difficulties of CLC-based literature classification, this paper examines the task and related solutions from the perspectives of XMC and HTC, and adapts and extends them to the characteristics of this scenario, which both improves classification accuracy and extends the depth and breadth of classification. The proposed model first extracts semantic features of the text as the basis of classification, using a lightweight deep learning model suited to the XMC problem. Then, to address the HTC problem in CLC classification, it uses the Learning to Rank (LTR) framework to incorporate multiple features, including hierarchical structure information, as an auxiliary basis for classification, making fuller use of the information and knowledge embedded in the text semantics and in the classification system. The model combines the semantic understanding ability of deep learning models with the interpretability of machine learning models, and it is scalable: it can be improved at a later stage by incorporating new features designed by experts. Moreover, the model is relatively lightweight, so it can handle tens of thousands of classification labels with limited computational resources, laying a foundation for full-depth classification based on the CLC.
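
The two-stage idea described above (semantic scores first, then re-ranking with hierarchy-aware features) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a first-stage model has already produced a semantic score per candidate class number. As a simplification, it treats every proper prefix of a class number as an ancestor, ignoring the dot separator. A fixed linear scorer with hand-picked weights stands in for a trained LTR model. The function names `ancestors`, `features`, and `rerank` are hypothetical.

```python
from typing import Dict, List, Tuple


def ancestors(code: str) -> List[str]:
    """Proper prefixes of a class number, shortest first (dot ignored).

    Simplifying assumption: every proper prefix is an ancestor class.
    """
    flat = code.replace(".", "")
    return [flat[:i] for i in range(1, len(flat))]


def features(code: str, semantic: Dict[str, float]) -> Tuple[float, float, float]:
    """Build a small feature vector for one candidate label.

    semantic maps dot-free class numbers to first-stage scores
    (hypothetical output of a deep XMC model).
    """
    flat = code.replace(".", "")
    own_score = semantic.get(flat, 0.0)
    ancestor_scores = [semantic.get(a, 0.0) for a in ancestors(code)]
    # Hierarchical feature: strongest first-stage support among ancestors.
    ancestor_support = max(ancestor_scores) if ancestor_scores else 0.0
    # Depth feature: longer class numbers sit deeper in the hierarchy.
    depth = float(len(flat))
    return (own_score, ancestor_support, depth)


def rerank(
    semantic: Dict[str, float],
    weights: Tuple[float, float, float] = (1.0, 0.5, 0.01),
) -> List[str]:
    """Order candidates by a weighted sum of their features.

    The hand-picked weights are placeholders for a learned ranker.
    """
    scored = []
    for code in semantic:
        feats = features(code, semantic)
        score = sum(w * x for w, x in zip(weights, feats))
        scored.append((score, code))
    scored.sort(reverse=True)
    return [code for _, code in scored]


if __name__ == "__main__":
    # Made-up first-stage scores for three candidate class numbers.
    candidates = {"TP391": 0.55, "G254": 0.60, "TP": 0.80}
    print(rerank(candidates))
```

In this toy input, the deep candidate `TP391` has the lowest semantic score but is supported by its high-scoring ancestor `TP`, so the hierarchical feature moves it to the top of the ranking. In a trained LTR setup, the weights would be learned from labeled data rather than fixed by hand.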

Zhang Yuhui (Shanghai Library). A Study of Automated Deep Classification of Literature Based on Chinese Library Classification[J]. Library Journal, 2024, 43(395): 61-74.
