﻿ 基于快速模拟退火的组合聚类算法<sup>*</sup>
 文章快速检索 高级检索

Ensemble clustering algorithm based on rapid simulated annealing
LI Hong, ZHANG Zhibin
School of Economics and Management, Beihang University, Beijing 100083, China
Received: 2018-11-08; Accepted: 2019-03-29; Published online: 2019-04-18 11:15
Foundation item: National Natural Science Foundation of China (71471009)
Corresponding author. LI Hong, E-mail: hong_lee@buaa.edu.cn
Abstract: There are two key issues in applying simulated annealing algorithm to solve the problem of ensemble clustering. One is how to use basic partition information in annealing process to obtain better result, and the other is how to accelerate the algorithm convergence. In this paper, the rapid simulated annealing based on voting (BV-RSA) model is presented, in which the complete and partial consensuses of basic partitions are used to recognize super-objects and construct voting box for each super-object. In the process of simulated annealing, some data samples represented by a super-object are controlled to move in a group, and the motion direction of a super-object is selected randomly in the scope of its voting box, thus reducing moving randomness and speeding up the clustering of super-objects. Experiments on multiple data sets demonstrate that the BV-RSA model performs well in both clustering accuracy and robustness.
Keywords: ensemble clustering     simulated annealing     super-object     voting     combinatorial optimization

1 BV-RSA模型框架

 (1)

BV-RSA模型解决式(1)组合优化问题的流程框架如图 1所示。

 图 1 BV-RSA模型框架 Fig. 1 Framework of BV-RSA model

 (2)
 (3)
 (4)
 (5)

2 超点和投票箱机制 2.1 超点和超点运动

 数据样本 基础聚类的簇标签 超点 π1 π2 π3 x1 C1 C2 C3 S1 x2 C1 C2 C3 x3 C1 C2 C3 x4 C1 C1 C1 S2 x5 C2 C1 C1 S3 x6 C2 C1 C1 x7 C2 C1 C1 x8 C2 C1 C1 x9 C3 C3 C1 S4 x10 C3 C3 C2 S5 x11 C3 C3 C2 x12 C3 C3 C2

2.2 投票箱与标签选择

1) SuSv的基础划分一致性概率，零范数‖yuyv0刻画了SuSv在基础划分中的差异性。

2) Bu={Sv|Puvαvu}称为Su的投票箱，α为基础划分一致性概率阈值。

3 BV-RSA模型的算法实现 3.1 算法描述

BV-RSA模型的具体实现算法描述如下，包括初始化和迭代退火2个子过程。

1) 初始化过程

2) 迭代退火过程

3.2 退火检验的增量计算

 数据集 样本数 属性数 类别数 来源 iris 150 3 3 UCI glass 214 9 6 UCI segment 2 310 19 7 UCI hitech 2 301 126 321 6 CLUTO k1b 2 340 21 839 6 CLUTO la12 6 279 31 472 6 CLUTO re1 1 657 3 758 25 CLUTO reviews 4 069 126 373 5 CLUTO sports 8 580 126 373 7 CLUTO tr11 414 6 429 9 CLUTO tr12 313 5 804 8 CLUTO tr41 878 7 454 10 CLUTO tr45 690 8 261 10 CLUTO letter 20 000 16 26 LIBSVM mnist 70 000 786 10 LIBSVM

 (6)
 (7)

 (8)
 (9)

 (10)
4 实验分析及模型应用 4.1 数据集和实验设置

4.2 实验结果分析

 数据集 BV-RSA/C BV-RSA/R CSPA HGPA MCLA k-means iris 0.91 0.91 0.79 0.73 0.91 0.89 glass 0.53 0.54 0.60 0.56 0.53 0.54 segment 0.67 0.65 0.59 0.52 0.67 0.66 hitech 0.56 0.56 0.56 0.53 0.56 0.51 k1b 0.89 0.87 0.86 0.86 0.91 0.64 la12 0.74 0.75 0.71 0.60 0.76 0.69 re1 0.71 0.70 0.70 0.62 0.71 0.43 reviews 0.67 0.67 0.68 0.62 0.67 0.64 sports 0.93 0.93 0.65 0.49 0.93 0.64 tr11 0.85 0.86 0.79 0.65 0.85 0.68 tr12 0.78 0.74 0.79 0.73 0.78 0.64 tr41 0.82 0.81 0.72 0.61 0.81 0.65 tr45 0.80 0.78 0.79 0.65 0.74 0.68 letter 0.34 0.33 0.24 0.24 0.30 0.26 mnist 0.57 0.55 — 0.35 0.56 0.54

 图 2 基础聚类数量对模型精度的影响 Fig. 2 Impact of number of basic clusters on model precision
4.3 网约车司机分群应用

 字段名称 字段含义 uid 司机id morning_cnt 司机在早高峰时间段总共接单数 night_cnt 司机在晚高峰时间段总共接单数 usual_cnt 司机在平峰时间段总共接单数 weekday_cnt 司机在工作日总共接单数 weekend_cnt 司机在周末总共接单数 income 司机的收入 complain_rate 司机被投诉率 route_num 司机设置有效的常用路线数 gender 性别(1-男；2-女) Age 年龄

 图 3 k-means和BV-RSA模型的司机分群结果 Fig. 3 Driver grouping result obtained from k-means and BV-RSA model

 图 4 各类司机群体的工作日接单量分布 Fig. 4 Distribution of order quantity received by different driver groups in workdays
 图 5 各类司机群体的周末接单量分布 Fig. 5 Distribution of order quantity received by different driver groups on weekend
5 结论

1) 引入了超点运动机制，使满足完全一致性划分条件的若干数据样本以成组方式参与退火过程，在压缩样本空间的同时，加快了节点聚类速度。

2) 对于每个超点，根据基础划分的部分一致性构造其投票箱，超点运动的随机性受投票箱约束。该机制保留了解空间搜索的部分随机性，同时引入启发信息加快模拟退火过程的收敛。

3) BV-RSA模型采用与文献[15]同样的效用函数，推导了超点运动引起效用函数变化的增量计算方法，降低了模拟退火检验的计算开销。

15个标准数据集上的实验表明，BV-RSA模型在10个数据集上获得了最优结果，且模型精度对基础聚类数量的变化不敏感，表现出良好的鲁棒性。

 [1] NGUYEN N, CARUANA R.Consensus clusterings[C]//IEEE International Conference on Data Mining.Piscataway, NJ: IEEE Press, 2007: 607-612. [2] STREHL A, GHOSH J. Cluster ensembles:A knowledge reuse framework for combining partitionings[J]. Journal of Machine Learning Research, 2002, 3(3): 583-617. [3] AYAD H, KAMEL M.Refined shared nearest neighbors graph for combining multiple data clusterings[C]//Advances in Intelligent Data Analysis.Berlin: Springer, 2003: 307-318. https://www.researchgate.net/publication/221460998_Refined_Shared_Nearest_Neighbors_Graph_for_Combining_Multiple_Data_Clusterings [4] YANG Y, KAMEL M S. An aggregated clustering approach using multi-ant colonies algorithms[J]. Pattern Recognition, 2006, 39(7): 1278-1289. DOI:10.1016/j.patcog.2006.02.012 [5] FERN X Z, BRODLEY C E.Solving cluster ensemble problems by bipartite graph partitioning[C]//Proceedings of Twenty-First International Conference on Machine Learning.New York: ACM, 2004: 281-288. http://web.engr.oregonstate.edu/~xfern/graph_icml04.pdf [6] YU Z, HAN G, LI L, et al.Adaptive noise immune cluster ensemble using affinity propagation[C]//IEEE International Conference on Data Engineering.Piscataway, NJ: IEEE Press, 2016: 1454-1455. Adaptive noise immune cluster ensemble using affinity propagation [7] FRED A L N, JAIN A K. Combining multiple clusterings using evidence accumulation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(6): 835-850. DOI:10.1109/TPAMI.2005.113 [8] WANG X, YANG C, ZHOU J. Clustering aggregation by probability accumulation[J]. Pattern Recognition, 2009, 42(5): 668-675. DOI:10.1016/j.patcog.2008.09.013 [9] HU M, DENG X, YAO Y.A sequential three-way approach to constructing a co-association matrix in consensus clustering[C]//International Joint Conference on Rough Sets.Berlin: Springer, 2018, 11103: 599-613. https://link.springer.com/chapter/10.1007%2F978-3-319-99368-3_47 [10] LIU H, LIU T, WU J, et al.Spectral ensemble clustering[C]//Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining.New York: ACM, 2015: 715-724. https://wenku.baidu.com/view/b5f0cb917fd5360cbb1adb37.html [11] HUANG D, WANG C D, LAI J H. Locally weighted ensemble clustering[J]. IEEE Transactions on Cybernetics, 2016, 48(5): 1460-1473. [12] ZHOU Z H, TANG W. Clusterer ensemble[J]. Knowledge-Based Systems, 2006, 19(1): 77-83. [13] FU Y, YANG Y, LIU Y.A decision model for fuzzy clustering ensemble[C]//Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering.Paris: Atlantis Press, 2007. https://www.researchgate.net/publication/266650562_A_Decision_Model_for_Fuzzy_Clustering_Ensemble [14] 陈晓云, 陈刚. 基于最大内聚度基准的加权投票聚类集成[J]. 控制与决策, 2014(2): 236-240. CHEN X Y, CHEN G. Weighted voting clustering ensemble based on maximum cohesion[J]. Control and Decision, 2014(2): 236-240. (in Chinese) [15] LU Z, PENG Y, XIAO J.From comparing clusterings to combining clusterings[C]//23rd AAAI Conference on Artificial Intelligence and the 20th Innovative Applications of Artificial Intelligence Conference.Palo Alto: AAAI, 2008, 2: 665-670. https://www.aaai.org/Papers/AAAI/2008/AAAI08-106.pdf [16] HUANG D, LAI J, WANG C D. Ensemble clustering using factor graph[J]. Pattern Recognition, 2016, 50(C): 131-142. [17] TOPCHY A P, JAIN A K, PUNCH W F.A mixture model for clustering ensembles[C]//Proceedings of the Fourth SIAM International Conference on Data Mining.Philadelphia: SIAM Publications, 2004: 379-390. http://dataclustering.cse.msu.edu/papers/topchy_mixture_siam_accepted.pdf [18] 蒋君, 徐蔚鸿, 潘楚. 基于粒计算和模拟退火的K-medoids聚类算法[J]. 计算机仿真, 2015, 32(12): 214-217. JIANG J, XU W H, PAN C. Improved K-medoids clustering algorithm based on many factors[J]. Computer Simulation, 2015, 32(12): 214-217. DOI:10.3969/j.issn.1006-9348.2015.12.045 (in Chinese) [19] SARTAKHTI J S, AFRABANDPEY H, SARAEE M. Simulated annealing least squares twin support vector machine (SA-LSTSVM) for pattern classification[J]. Soft Computing, 2017, 21(15): 4361-4373. DOI:10.1007/s00500-016-2067-4

#### 文章信息

LI Hong, ZHANG Zhibin

Ensemble clustering algorithm based on rapid simulated annealing

Journal of Beijing University of Aeronautics and Astronsutics, 2019, 45(8): 1646-1652
http://dx.doi.org/10.13700/j.bh.1001-5965.2018.0647