A Comparative Study of Relevant Classification Techniques in Automatic Classification for Two Categories with Similar Contents: Taking E271 and E712.51 in the Chinese Library Classification as Example

Abstract

This paper studies machine-learning-based automatic classification (binary classification) of two categories with very similar contents in the Chinese Library Classification. Using the bibliographic records of E271 and E712.51 as the two classes, we compare the performance of several representative techniques: three feature selection methods (CHI, IG and MI), two feature weighting methods (TF and TF*IDF), and three classification algorithms (KNN, NB and SVM). The comparison provides baseline data for targeted research on automatic classification. The experimental results show that CHI and IG perform better than MI; however, when the number of features exceeds 4,000, the performance of MI improves markedly. Among the classification algorithms, NB combined with MI feature selection performs best. With CHI or IG feature selection, SVM performs better than NB and KNN, and KNN performs worst. For feature weighting, TF is better than TF*IDF in most cases, but the performance of feature weighting is easily influenced by the classification algorithm, the number of features and the feature selection method. The techniques at each stage can be combined to handle automatic classification of similar categories, but each method has its own advantages and disadvantages, so both the techniques themselves and their application to similar categories need further improvement.

Key words: Classification for two categories, Chinese Library Classification, Feature selection, Feature weighting, Text classification
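
As an illustration of the quantities compared in the abstract, the following is a minimal Python sketch of CHI and MI term scoring and of TF and TF*IDF weighting for a two-category corpus. It is not the authors' implementation: the toy corpus, the category labels "A" and "B", and all function names are hypothetical, and the formulas are the commonly used document-count forms of chi-square and pointwise mutual information.

```python
import math

# Toy two-category corpus (hypothetical data, not taken from the paper).
# Each document is a list of tokens plus a category label.
docs = [
    (["army", "training", "tactics"], "A"),
    (["army", "tactics", "history"], "A"),
    (["army", "equipment", "training"], "B"),
    (["army", "equipment", "budget"], "B"),
]


def contingency(term, category):
    """Count documents by (contains term?, belongs to category?)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, in category
        elif has_term:
            b += 1  # term present, other category
        elif in_cat:
            c += 1  # term absent, in category
        else:
            d += 1  # term absent, other category
    return a, b, c, d


def chi_square(term, category):
    """CHI score: N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    a, b, c, d = contingency(term, category)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(term, category):
    """Pointwise MI estimate: log(A * N / ((A+C)(A+B)))."""
    a, b, c, d = contingency(term, category)
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))


def tf(term, tokens):
    """Raw term frequency within one document."""
    return tokens.count(term)


def tf_idf(term, tokens):
    """TF multiplied by log(N / document frequency)."""
    df = sum(1 for doc_tokens, _ in docs if term in doc_tokens)
    if df == 0:
        return 0.0
    return tf(term, tokens) * math.log(len(docs) / df)


if __name__ == "__main__":
    for term in ["army", "tactics", "history", "training"]:
        print(term, chi_square(term, "A"), mutual_information(term, "A"))
```

In this toy corpus, "army" occurs in every document, so both its CHI score and its TF*IDF weight are zero, while "tactics" occurs only in category A and receives a high CHI score. A full pipeline would rank terms by such scores, keep the top-scoring features, weight them, and pass the resulting vectors to a classifier such as KNN, NB or SVM.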

Li Xiangdong, Ruan Tao. A Comparative Study of Relevant Classification Techniques in Automatic Classification for Two Categories with Similar Contents: Taking E271 and E712.51 in the Chinese Library Classification as Example[J]. Library Journal, 2018, 37(6): 11-20.
