Library Journal

Library Journal ›› 2022, Vol. 41 ›› Issue (8): 76-83.


Chinese Word Segmentation and Application of Intangible Cultural Heritage Texts from the Perspective of Digital Humanities

Hu Haotian¹,², Deng Sanhong¹,², Zhang Yiqin¹,², Zhang Qi¹,², Kong Jia¹,², Wang Dongbo²,³ (¹ School of Information Management, Nanjing University; ² Jiangsu Key Laboratory of Data Engineering and Knowledge Service; ³ School of Information Management, Nanjing Agricultural University)

  • Online: 2022-08-15 Published: 2022-08-18

Abstract: Automatic word segmentation is a foundational step in digital humanities research on intangible cultural heritage, and a prerequisite for in-depth exploration of the information contained in intangible cultural heritage texts. We constructed automatic word segmentation models for the application texts of national intangible cultural heritage projects and compared the segmentation performance of CRF, Bi-LSTM-CRF, BERT, RoBERTa and ALBERT on these texts. We also compared the results with those of the general-purpose Chinese word segmentation (CWS) tools HanLP, Jieba, and NLPIR. Among all 14 models, RoBERTa performed best, with an F-score of 97.28%, and ALBERT had the fastest training speed among the pre-trained models (PTMs) under the same conditions. The word segmentation model was used to construct a domain vocabulary and a segmented corpus of intangible cultural heritage texts, and the vocabulary distribution of these texts was analyzed and mined. We developed the Chinese Intangible Cultural Heritage Text Automatic Segmentation System (CITS), which provides a tool for the automatic segmentation of intangible cultural heritage texts and the multi-dimensional visual analysis of the segmentation results.
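
The abstract reports segmentation quality as an F-score but does not say how it was computed. The following is a minimal sketch of one common convention, which scores a predicted word as correct only when its character span exactly matches a word in the gold segmentation. The function names (`to_spans`, `segmentation_prf`) and the example sentence are illustrative assumptions and are not taken from the paper.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, pred):
    """Return (precision, recall, F1) using exact span matching.

    Assumption: gold and pred segment the same character sequence.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction merges two gold words into one.
gold = ["昆曲", "是", "非物质", "文化", "遗产"]
pred = ["昆曲", "是", "非物质文化", "遗产"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this example three of the four predicted words match gold spans, so precision exceeds recall; the merged word counts as an error against both gold words it covers.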