﻿ 一种面向船舶制造的数据关联规则挖掘算法
 舰船科学技术  2022, Vol. 44 Issue (20): 143-148    DOI: 10.3404/j.issn.1672-7649.2022.20.029 PDF

1. 中国船舶集团有限公司第七一六研究所，江苏 连云港 222006;
2. 江苏杰瑞科技集团有限责任公司，江苏 连云港 222006;
3. 哈尔滨工程大学 计算机科学与技术学院，黑龙江 哈尔滨 150001

A data association rule mining algorithm for shipbuilding
XU Peng1,2, MENG Yu-long3, YANG Zhe1,2, DONG Nai-bo3, DENG Bo-wei1
1. The 716 Research Institute of CSSC, Lianyungang 222006, China;
2. Jiangsu JARI Technology Group Co.Ltd., Lianyungang 222006, China;
3. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
Abstract: In order to solve the problem of low mining efficiency caused by too much space occupied by transaction sets in the process of mining association rules of shipbuilding massive data, the Eclat algorithm is used to convert the merging of transactions into set operations using vertical databases. A LBM-Eclat algorithm based on locally sensitive hash bitmap (LBM) is proposed. LBM Eclat combines two data structures, local sensitive hash and bitmap, and can dynamically adjust the internal data storage structure according to the changes of the amount of stored data. Through comparative experiments, it is proved that LBM-Eclat algorithm based on LBM can effectively improve the mining efficiency of dense data sets and reduce the space consumption in the mining process.
Key words: shipbuilding data     association rule mining     LBM-Eclat algorithm     local sensitive hash     bitmap
0 引　言

1 数据压缩存储 1.1 局部敏感哈希

 图 1 LSH数据映射过程 Fig. 1 LSH data mapping process

 $\left| {A \cap B} \right| = \sum\limits_{i = 1}^k {{A_i} \cap {B_i}} ，$ (1)
 $\left| {A \cap B} \right| \leqslant \sum\limits_{i = 1}^k {\min (\left| {{A_i}} \right|,\left| {{B_i}} \right|)} 。$ (2)

 $Jaccard(A,B) = \frac{{\left| {A \cap B} \right|}}{{\left| {A \cup B} \right|}} ，$ (3)

 $has{h_{\min }}(A) = \mathop {\min }\limits_{o \in A} hash(o) ，$ (4)

 $P[has{h_{\min }}(A) = has{h_{\min }}(B)] = Jaccard(A,B) 。$ (5)

Jaccard相似度计算如下式：

 $Jaccard(A,B) = \frac{{\displaystyle\sum\limits_{j = 1}^k {{I_{X_A^j = X_B^j}}} }}{K} 。$ (6)

1.2 位图与压缩位图算法

WAH（word-aligned hybrid algorithm）是一种位图压缩算法，WAH压缩算法采用Word-Aligned和Hybrid这2种计算机数据处理思想。WAH算法采用Word-Aligned的方式将Word进行对齐，然后根据对齐的Word将位图进行分组压缩；Hybrid则是针对位图中存在的大量连续的“1”或“0”数据段与混合存在的“1”和“0”数据段分别进行压缩编码操作。通过这种方式对数据进行区分，在面对存在大量的连续的数据段时，可以有效地对数据进行压缩，减少不必要的比较。

2 LBM-Eclat算法实现

LBM-Eclat算法的主要思想是将事务集中32位int型数据Tid通过哈希索引进行拆分，尽可能地均匀分散到各个容器中进行存储。LBM-Eclat算法中的LBM在创建时，主要根据不同数据集中最大数据量确定其哈希映射函数的基数，映射基数的计算公式如下：

 ${\text{base = Math}}{\text{.floor(lo}}{{\text{g}}_2}(DB.length)/2) 。$ (7)

 图 2 LBM数据结构 Fig. 2 LBM data structure

 $|A \cap B| = \frac{{Jaccard(A,B)}}{{Jaccard(A,B) + 1}}(|A| + |B|) 。$ (8)

 $\frac{{\left| {\min Sup - S} \right|}}{{\min Sup}} \lt B 。$ (9)

LBM-Eclat算法的具体实现流程为，在对项集数据进行交集操作时，每次会生成一个新的容器存储2个交集生成的结果，根据交集的2个 LBM内容器类型以及预计生成的数据量大小，能够确定新生成容器的类型。同时根据得到的容器内数据量与最小支持度进行比较，对项集进行过滤，最终得到新的频繁项集，用于后续的挖掘，并通过前缀树Trie进行存储。

3 实验分析 3.1 实验目的

3.2 实验数据集

3.3 实验方案与结果分析

 $P = \frac{{TP}}{{TP + FN}} ，$ (10)
 $R = \frac{{TP}}{{TP + FN}} ，$ (11)
 $F_{1} = \frac{{2P \times R}}{{P + R}} 。$ (12)

 图 3 K值变化引起的挖掘时间和精确度变化图 Fig. 3 Variation diagram of excavation time and accuracy caused by change of K value

 图 4 B值变化引起的挖掘时间和精确度变化图 Fig. 4 Variation diagram of excavation time and accuracy caused by change of B value

 图 5 密集型数据集挖掘时间消耗对比图 Fig. 5 Comparison of time consumption in dense dataset mining

 图 6 密集型数据集挖掘空间消耗对比图 Fig. 6 Comparison of memory consumption in dense dataset mining

4 结　语

 [1] 卞德志, 胡昌平, 杨哲, 等. 面向船舶制造的统一数据库集成平台应用研究[J]. 舰船科学技术, 2020, 42(13): 134-138. BIAN De-zhi, HU Chang-ping, YANG Zhe, et al. Research on application of unified database integration platform to shipbuilding enterprises[J]. Ship Science and Technology, 2020, 42(13): 134-138. [2] AGRAWAL R. Mining association rules between sets of items in large databases[J]. ACM SIGMOD Record, 1993, 22(2). [3] ZAKI M J. Scalable algorithms for association mining[J]. IEEE Transattions on Knowledge and Data Engineenig, 2000, 12(3): 372-390. DOI:10.1109/69.846291 [4] 崔馨月, 孙静宇. 改进的Eclat算法研究与应用[J]. 计算机工程与设计, 2018, 39(4): 1059-1063+1147. DOI:10.16208/j.issn1000-7024.2018.04.028 [5] KAUR M, GARG U, KAUR S. Advanced eclat algorithm for frequent itemsets generation[J]. International Journal of Applied Engineering Research, 2015, 10(9): 23263-23279. [6] BAKAR W A , MAN M, et al. I-Eclat: Performance enhancement of Eclat via incremental approach in frequent itemset mining[J]. Telkomnika, 2020, 18(1): 562-570. DOI:10.12928/telkomnika.v18i1.13497 [7] MAN M, JULAILY A J, SAANY S I A, et al. Analysis study on R-Eclat algorithm in infrequent itemsets mining[J]. International Journal of Electrical and Computer Engineering, 2019, 9(6): 5446. [8] ABBASI S, MOIENI A. BloomEclat: efficient eclat algorithm based on bloom filter[J]. Journal of Algorithms and Computation, 2021, 53(1): 197-208. [9] MURALIDHARAN C, ANITHA R. Risk analysis of cloud service providers by analyzing the frequency of occurrence of problems using E-Eclat algorithm[J]. Wireless Networks, 2021, 27(8): 5587-5595. DOI:10.1007/s11276-019-02191-4 [10] 高强, 张凤荔, 陈学勤, 等. 基于改进Eclat算法的资源池节点异常模式挖掘[J]. 计算机应用研究, 2018, 35(2): 6. DOI:10.3969/j.issn.1001-3695.2018.02.003 [11] YU X, WANG H. Improvement of eclat algorithm based on support in frequent itemset mining[C]// The 6th International Conference on Computer Research and Development, 2014, 9(9): 2116–2123. [12] MA Z, YANG J, ZHANG T, et al. An improved eclat algorithm for mining association rules based on increased search strategy[J]. International Journal of Database Theory and Application, 2016, 9(5): 251-266. DOI:10.14257/ijdta.2016.9.5.26 [13] 田攀博. 基于等价类变换的快速关联规则挖掘方法研究[D]. 哈尔滨: 哈尔滨工业大学, 2019. [14] LV S, HUANG J. An improved eclat algorithm based on pruning optimization and indexing intersection[J]. 2018. [15] 李成严, 辛雪, 赵帅, 等. Sp-IEclat: 一种大数据并行关联规则挖掘算法[J]. 哈尔滨理工大学学报, 2021, 26(4): 109-118. [16] ONDOV B D, TREANGEN T J, MELSTED P, et al. Mash: fast genome and metagenome distance estimation using MinHash[J]. Genome biology, 2016, 17(1): 1-14. DOI:10.1186/s13059-015-0866-z [17] WU K, OTOO E J, SHOSHANI A. Optimizing bitmap indices with efficient compression[J]. ACM Transactions on Database Systems (TODS), 2006, 31(1): 1-38. DOI:10.1145/1132863.1132864