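Exploration and Research on the Innovation of Literature Database Service Based on LLM: Using Quan Guo Bao Kan Suo Yin (CNBKSY) Intelligent Search Service as a Case Study

Library Journal, 2026, Vol. 45, Issue (4): 71-81.

Dai Qingyi, Han Chunlei, Gao Zhichen

Online: 2026-04-15 Published: 2026-04-29
About the authors:
Dai Qingyi, engineer, Digital Resources Center, Shanghai Library (Institute of Scientific and Technical Information of Shanghai). Research interests: digital resource platform development, digital humanities, large model applications. Author contribution: literature collection and organization, manuscript writing and revision. E-mail: qydai@libnet.sh.cn. Shanghai 200031.
Han Chunlei, director of the Digital Resources Center and research librarian, Shanghai Library (Institute of Scientific and Technical Information of Shanghai). Research interests: digitization of literature resources, digital humanities, digital resource platform development. Author contribution: topic selection, manuscript revision. Shanghai 200031.
Gao Zhichen, head of R&D, 上海双地信息系统有限公司. Research interests: applications of large language models in vertical domains, algorithm optimization for RAG. Author contribution: application development, implementation of algorithm logic, interaction design. Shanghai 200092.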

Abstract

Abstract: Drawing on large language model (LLM) technology, this paper addresses the need to upgrade the Quan Guo Bao Kan Suo Yin (CNBKSY) platform with intelligent services and proposes an intelligent retrieval system that integrates natural language processing (NLP), semantic retrieval, and generative question answering. The system aims to overcome the limitations of traditional retrieval, namely low efficiency and insufficient recall and precision. It makes three core innovations. First, it fuses and integrates multi-source heterogeneous data into a unified knowledge representation model, which overcomes format differences among literature resources and moves retrieval from keyword matching to semantic understanding. Second, it uses vectorization models such as BERT and BGE together with a multi-strategy recall mechanism that includes BM25 and Solr retrieval to achieve precise and efficient document retrieval. Third, it incorporates an intelligent Q&A module that supports multi-turn natural-language retrieval dialogue and high-precision question answering. Test results show that the system significantly improves on traditional retrieval methods in efficiency, recall, and precision, providing a viable technical path for the intelligent development of the CNBKSY platform.

Key words: Natural language processing, Semantic retrieval, Intelligent Q&A, Multi-source heterogeneous data fusion, Vectorization models, Vector database, Generative large language models (LLM), Multi-route retrieval

Dai Qingyi, Han Chunlei, Gao Zhichen. Exploration and Research on the Innovation of Literature Database Service Based on LLM: Using Quan Guo Bao Kan Suo Yin (CNBKSY) Intelligent Search Service as a Case Study[J]. Library Journal, 2026, 45(4): 71-81.
