Research on Identification of Health Misinformation on Social Media Based on RoBERTa-MHA-BiGRU

Abstract

Abstract: Health misinformation on social media is varied and complex, spreads quickly, and poses a serious risk to public health, so identifying it rapidly and effectively is of great importance. In this study, we first collected health information from multiple social media platforms to build Chinese and English health datasets. We then constructed a RoBERTa-MHA-BiGRU model for identifying health misinformation on social media. In this model, the pre-trained language model RoBERTa produces vector representations of the health data, a multi-head attention (MHA) mechanism combined with a bidirectional gated recurrent unit (BiGRU) extracts the semantic features of the health texts, and a fully connected layer with the Softmax function identifies health misinformation. To validate the effectiveness of the RoBERTa-MHA-BiGRU model, three sets of experiments were designed for each of the Chinese and English datasets. Experiment 1 showed that deep learning models outperformed machine learning models and that RoBERTa's text representation was superior to BERT's. Experiment 2 showed that introducing an attention mechanism improved the model's learning ability, and that the RoBERTa-MHA-BiGRU model with multi-head attention outperformed its single-head attention counterpart. Experiment 3 showed that data augmentation further improved the model's performance. Theoretically, this paper expands the depth and breadth of research on health misinformation. Practically, it provides technical guidance for identifying health misinformation on social media, helping social media users avoid such misinformation promptly and improving the efficiency and effectiveness of health misinformation governance.

Keywords: Health misinformation, Social media, Multi-head attention mechanism, BiGRU, Data augmentation
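The data flow described in the abstract (token vectors, then multi-head attention and a BiGRU, then a fully connected layer with Softmax) can be illustrated with a minimal, untrained sketch in plain Python. Several things here are assumptions rather than details from the paper: random vectors stand in for RoBERTa embeddings, all weights are random, attention is applied before the BiGRU (the abstract does not state the order), and the attention output projection and all bias terms are omitted for brevity. The function names and sizes are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def multi_head_attention(seq, heads, rng):
    """Scaled dot-product self-attention; head outputs are concatenated."""
    d = len(seq[0])
    dh = d // heads  # dimension handled by each head
    out = [[] for _ in seq]
    for _ in range(heads):
        wq, wk, wv = (rand_matrix(dh, d, rng) for _ in range(3))
        q = [matvec(wq, x) for x in seq]
        k = [matvec(wk, x) for x in seq]
        v = [matvec(wv, x) for x in seq]
        for i, qi in enumerate(q):
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dh) for kj in k]
            weights = softmax(scores)
            out[i].extend(sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(dh))
    return out


def gru_last_state(seq, hidden, rng):
    """Run a single GRU layer over seq and return its final hidden state."""
    d = len(seq[0])
    # Each gate has one weight matrix acting on the concatenation [x; h].
    wz, wr, wh = (rand_matrix(hidden, d + hidden, rng) for _ in range(3))
    h = [0.0] * hidden
    for x in seq:
        z = [sigmoid(a) for a in matvec(wz, x + h)]  # update gate
        r = [sigmoid(a) for a in matvec(wr, x + h)]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(wh, x + gated)]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
    return h


def classify(token_vectors, heads=2, hidden=4, classes=2, seed=0):
    """Attention -> BiGRU -> fully connected layer -> Softmax."""
    rng = random.Random(seed)
    attended = multi_head_attention(token_vectors, heads, rng)
    forward = gru_last_state(attended, hidden, rng)
    backward = gru_last_state(attended[::-1], hidden, rng)
    w_out = rand_matrix(classes, 2 * hidden, rng)
    return softmax(matvec(w_out, forward + backward))


# Five tokens with 8-dimensional vectors, standing in for RoBERTa output.
rng = random.Random(42)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
probs = classify(tokens)
print(probs)  # two class probabilities that sum to 1
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how the shapes flow from a token sequence to a two-class distribution (for example, misinformation versus reliable information).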

Chen Minghong, He Jianing (School of Information Management, Sun Yat-sen University). Research on Identification of Health Misinformation on Social Media Based on RoBERTa-MHA-BiGRU[J]. Library Journal, 2025, 44(416): 81-92.

