Empowering Computational Research in the History of Ideas with Generative Artificial Intelligence: Model Construction and Applications

Abstract

Large language models have transformed natural language processing and are enhancing the computational analysis of historical texts. Taking the Baichuan large language model as the base model and the text of the book series Biographies of Chinese Thinkers as the data source, the Thinkers Model was constructed through domain-specific pre-training, supervised fine-tuning, and direct preference optimization, and its performance was then evaluated. The evaluation results show that the Thinkers Model outperforms general-purpose models in this specialized domain, demonstrating its potential for computational humanities research. The Thinkers Model lowers the professional barriers to knowledge exchange and can address challenges in natural language interpretation within computational humanities research.

Keywords: Computational historiography, AIGC, Thinkers, Large language model, Biographies of Chinese Thinkers, Computational humanities

Liu Jiangfeng, Zhang Ran, Zhang Jundong, Pei Lei (1 Data Intelligence and Cross Innovation Laboratory, Nanjing University, 2 School of Information Management, Nanjing University). Empowering Computational Research in the History of Ideas with Generative Artificial Intelligence: Model Construction and Applications [J]. Library Journal, 2025, 44(407): 113-127.
