Title Information Classification Based on Hownet Semantics Feature Extension

Abstract

This paper exploits the internal semantic relevance of the text to obtain a core semantic word set for the training texts from high-frequency words and hidden topics. It then uses Hownet as an external resource to calculate the similarity between the core semantic word set and each testing text. Keywords from the training texts whose similarity exceeds a given threshold are added to the testing text, and the extended texts are classified with SVM. The results show that when both the training set and the test set consist only of titles, with 200 titles per category in the training set, classification performance improves by up to 3.1%; however, the improvement declines as the number of training texts per category grows beyond 200. When the training set consists of titles and abstracts while the testing set consists of titles only, the proposed algorithm achieves gains of 1.5% and 3.1% in Macro_F1 on the Fudan corpus and the self-built journal corpus, respectively, and 2.3% and 5.3% in Micro_F1. The aim is to extend the sparse features of journal titles and thereby improve title classification.
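The feature-extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity table is a toy stand-in for a Hownet-based word similarity, the core word set is built from word frequency only (the hidden-topic component is omitted), and the names `word_similarity`, `core_word_set`, and `extend_title` are hypothetical.

```python
from collections import Counter

# Toy stand-in for a Hownet-based similarity; the values are invented
# for illustration and do not come from Hownet.
TOY_SIMILARITY = {
    frozenset(("library", "archive")): 0.8,
    frozenset(("retrieval", "search")): 0.9,
    frozenset(("library", "search")): 0.3,
}


def word_similarity(a, b):
    """Return a similarity in [0, 1] for two words (toy lookup table)."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def core_word_set(training_docs, top_n):
    """Take the top_n most frequent training words as a simple core set.

    Assumption: frequency only; the paper also uses hidden topics.
    """
    counts = Counter(word for doc in training_docs for word in doc)
    return [word for word, _ in counts.most_common(top_n)]


def extend_title(title_tokens, core_words, threshold):
    """Append each core word that is similar enough to some title token."""
    extended = list(title_tokens)
    for core in core_words:
        if core in extended:
            continue  # already present, nothing to add
        if any(word_similarity(core, t) > threshold for t in title_tokens):
            extended.append(core)
    return extended


training_docs = [
    ["library", "retrieval", "service"],
    ["library", "retrieval"],
    ["library", "catalog"],
]
core = core_word_set(training_docs, top_n=2)
title = ["digital", "archive", "search"]
extended_title = extend_title(title, core, threshold=0.5)
print(extended_title)
```

In a full pipeline the extended token lists would be vectorized and passed to a classifier such as an SVM. Raising the threshold adds fewer words, which trades coverage for less noise in the extended title.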

Key words: Journal title information, Short-text classification, Hownet, LDA

Li Xiangdong, Liu Kang, Ding Cong, Liao Xiangpeng. Title Information Classification Based on Hownet Semantics Feature Extension[J]. LIBRARY JOURNAL.
