Automatic Identification of Key Elements in Unstructured Work Order Information Based on Data Mining
Luo Runxiang, Chen Chunxue, Li Xiong
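Abstract:
Unstructured work order information from different channels is not bound by a unified format, and the resulting format heterogeneity makes its key elements harder to identify. This paper therefore proposes a data-mining-based method for automatically identifying key elements in unstructured work order information. The work order information is first generalized: forward shortest edit-distance paths are built and clustered to eliminate format heterogeneity. A pre-trained language model (BERT) then extracts features from the generalized work order information, and these features are structurally mapped to the work order metadata to obtain a hybrid feature vector. The hybrid feature vector is fed into a least squares support vector machine, a data mining algorithm, whose parameters are optimized to identify the key elements automatically. Experiments on work orders from the State Grid 95598 service hotline show that, after generalization, the domain adaptation gap is far below the threshold and format heterogeneity is low. The Spearman rank correlation coefficient for feature extraction is close to 1, the structured mapping achieves good coverage of long-tail entities, and key elements such as service location, core fault elements, and special impacts are identified accurately for repair work orders reported from the same location. The method can therefore effectively identify key elements in unstructured power operation and maintenance information.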
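The generalization step described in the abstract (edit-distance computation followed by clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder rule in `generalize`, the greedy single-pass clustering in `cluster_orders`, the `max_ratio` threshold, and the sample work orders are all assumptions made for illustration, and the edit distance used is the standard Levenshtein distance.

```python
import re


def generalize(text: str) -> str:
    """Replace variable fields with a placeholder (hypothetical rule: digit runs only)."""
    return re.sub(r"\d+", "<NUM>", text)


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_orders(texts, max_ratio=0.2):
    """Greedy clustering: a text joins the first cluster whose generalized
    representative lies within max_ratio normalized edit distance."""
    clusters = []  # list of (representative, members)
    for text in texts:
        g = generalize(text)
        for rep, members in clusters:
            ratio = edit_distance(g, rep) / max(len(g), len(rep), 1)
            if ratio <= max_ratio:
                members.append(text)
                break
        else:
            clusters.append((g, [text]))
    return clusters


# Hypothetical work orders, invented for this sketch.
orders = [
    "Power outage at 12 Elm Street since 08:30",
    "Power outage at 7 Oak Street since 21:15",
    "Meter reading dispute, account 556677",
]
clusters = cluster_orders(orders)
for rep, members in clusters:
    print(rep, "<-", len(members), "order(s)")
```

In this sketch the two outage reports collapse into one template once their numbers are generalized, while the billing dispute forms its own cluster. A real pipeline would need domain-specific placeholder rules and a clustering method suited to the data.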
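Keywords: data mining; unstructured work orders; key elements; automatic identification; structured mapping
Foundation: Research and Application of Overall Technology for 95598 Service Risk Management and Control Based on Big Data and AI Technology (GZKJXM20232616)
Authors: Luo Runxiang, Chen Chunxue, Li Xiong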
DOI: 10.27024/j.wlygc.2025.09.03.02