Library Journal

Library Journal ›› 2018, Vol. 37 ›› Issue (6): 11-20.


A Comparative Study of Relevant Classification Techniques in Automatic Classification for Two Categories with Similar Contents: Taking E271 and E712.51 in the Chinese Library Classification as Example

Li Xiangdong, Ruan Tao   

  • Online: 2018-06-15 Published: 2018-06-11

Abstract: This paper studies machine-learning-based automatic classification (binary classification) for two categories with very similar contents in the Chinese Library Classification. Using the bibliographic records of categories E271 and E712.51 as the two classes, we compare the performance of several representative techniques: three feature selection methods (CHI, IG and MI), two feature weighting methods (TF and TF*IDF), and three classification algorithms (KNN, NB and SVM). The comparison provides baseline data for targeted research on automatic classification. The experimental results show that CHI and IG perform better than MI, although the performance of MI improves greatly once the number of features exceeds 4000. Among the classification algorithms, NB performs best when MI is used for feature selection. With CHI or IG feature selection, SVM performs better than NB and KNN, and KNN performs worse than the other two. For feature weighting, TF is better than TF*IDF in most cases, but the performance of feature weighting is easily influenced by the classification algorithm, the number of features, and the feature selection method. The techniques at each stage can be combined to handle automatic classification of similar categories, but each has its own advantages and disadvantages, so both the individual techniques and the overall performance of automatic classification for similar categories need further improvement.

Key words: Classification for two categories, Chinese Library Classification, Feature selection, Feature weighting, Text classification
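
The feature selection and feature weighting steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the placeholder tokens, and the function names are hypothetical, and the IDF variant log(N/df) is an assumption, since the abstract does not state which formula was used. CHI is computed from the standard 2x2 term/class contingency table.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs; NOT data from the paper.
DOCS = [
    (["alpha", "beta", "gamma"], "E271"),
    (["alpha", "delta", "beta"], "E271"),
    (["alpha", "delta", "omega"], "E712.51"),
    (["alpha", "omega", "sigma"], "E712.51"),
]

def chi_square(term, label, docs):
    """CHI score of `term` for class `label` from a 2x2 contingency table."""
    a = sum(1 for toks, y in docs if term in toks and y == label)
    b = sum(1 for toks, y in docs if term in toks and y != label)
    c = sum(1 for toks, y in docs if term not in toks and y == label)
    d = sum(1 for toks, y in docs if term not in toks and y != label)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A term that occurs in every document (or none) carries no class signal.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, label, k):
    """Keep the k terms with the highest CHI score (ties broken alphabetically)."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, label, docs), t))
    return ranked[:k]

def tf_weights(tokens, features):
    """TF weighting: raw count of each selected feature in one document."""
    counts = Counter(tokens)
    return {f: counts[f] for f in features}

def tfidf_weights(tokens, features, docs):
    """TF*IDF weighting, assuming idf = log(N / df)."""
    counts = Counter(tokens)
    n = len(docs)
    weights = {}
    for f in features:
        df = sum(1 for toks, _ in docs if f in toks)
        idf = math.log(n / df) if df else 0.0
        weights[f] = counts[f] * idf
    return weights

features = select_features(DOCS, "E271", k=2)
print(features)
print(tf_weights(DOCS[0][0], features))
print(tfidf_weights(DOCS[0][0], features, DOCS))
```

In this toy setup the token shared by every record scores zero under CHI, which mirrors why feature selection matters when two categories have largely overlapping vocabulary. The resulting weight vectors would then be passed to a classifier such as KNN, NB or SVM.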