

Clustering ensemble by decision weighting
HUANG Dong1, WANG Changdong2,3, LAI Jianhuang2,3, LIANG Yun1, BIAN Shan1, CHEN Yu1
1. College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510640, China;
2. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;
3. Guangdong Key Laboratory of Information Security Technology, Guangzhou 510006, China
Abstract: The clustering ensemble technique aims to combine multiple base clusterings to achieve better and more robust clustering results. To evaluate the reliability of the base clusterings and weight them accordingly, this paper proposes a new clustering ensemble approach based on a bipartite graph formulation and a decision weighting strategy. Each base clustering is treated as a bag of decisions and is assigned one unit of credit, which is shared (divided) among all the decisions in that clustering. Using this credit sharing concept, we weight the decisions in the base clusterings according to the credit they carry. The clustering ensemble problem is then formulated as a bipartite graph model that incorporates the decision weights, and the final clustering is obtained by rapidly partitioning the bipartite graph. Experimental results demonstrate the superiority of the proposed algorithm in terms of both effectiveness and efficiency.
Key words: clustering; clustering ensemble; decision weighting; bipartite graph formulation; graph partitioning; base clustering; credit sharing; weighted clustering ensemble

1 Related Work

2 Clustering Ensemble Algorithm Based on Decision Weighting

2.1 Problem Formulation

 $\Pi=\left\{ \pi^{1},\pi^{2},\ldots,\pi^{M} \right\}$ denotes the ensemble of $M$ base clusterings, where

 $\pi^{m}=\left\{ C_{1}^{m},C_{2}^{m},\ldots,C_{n^{m}}^{m} \right\}$ is the $m$-th base clustering with $n^{m}$ clusters, and

 $C=\left\{ C_{1},C_{2},\ldots,C_{N_{c}} \right\}$ denotes the set of all $N_{c}$ clusters in the ensemble.
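To make the notation concrete, the ensemble can be represented directly in code. The following is a minimal sketch, not taken from the paper: `labels_to_clusters` is a hypothetical helper, and the two base clusterings are toy data.

```python
# Hypothetical sketch of the problem formulation: an ensemble Pi of M base
# clusterings, each stored as a list of clusters (sets of point indices).

def labels_to_clusters(labels):
    """Convert a flat label vector into a list of clusters (sets of indices)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, set()).add(idx)
    return list(clusters.values())

# Pi = {pi^1, ..., pi^M}: here M = 2 toy base clusterings of 6 data points.
Pi = [
    labels_to_clusters([0, 0, 1, 1, 2, 2]),  # pi^1 has n^1 = 3 clusters
    labels_to_clusters([0, 0, 0, 1, 1, 1]),  # pi^2 has n^2 = 2 clusters
]

# C = {C_1, ..., C_{N_c}}: all clusters from all base clusterings (N_c = 5).
C = [cluster for pi in Pi for cluster in pi]
```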

2.2 Decision Weighting

 $\#Decisions\left( C_{k}^{m} \right)=\frac{\left( \left| C_{k}^{m} \right|-1 \right)\left| C_{k}^{m} \right|}{2}$

 $\#Decisions\left( \pi^{m} \right)=\sum\limits_{k=1}^{n^{m}}{\#Decisions\left( C_{k}^{m} \right)}$ (1)

 $RatioCD\left( \pi^{m} \right)=\frac{\#CorrectDecisions\left( \pi ^{m} \right)}{\#Decisions\left( \pi^{m} \right)}$ (2)
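Equations (1) and (2) can be sketched as follows; the function names and the toy clustering are illustrative assumptions, and the ground-truth labels are only needed to evaluate RatioCD, as in Fig. 1.

```python
from itertools import combinations

def n_decisions_cluster(cluster):
    # #Decisions(C_k^m) = |C_k^m|(|C_k^m| - 1)/2: each within-cluster pair
    # of points is one "these two belong together" decision.
    return len(cluster) * (len(cluster) - 1) // 2

def n_decisions(pi):
    # Eq. (1): sum the decision counts over the clusters of pi^m.
    return sum(n_decisions_cluster(c) for c in pi)

def ratio_cd(pi, true_labels):
    # Eq. (2): fraction of within-cluster pairs whose two points share the
    # same ground-truth label (correct decisions / all decisions).
    total = n_decisions(pi)
    correct = sum(
        1
        for cluster in pi
        for i, j in combinations(sorted(cluster), 2)
        if true_labels[i] == true_labels[j]
    )
    return correct / total if total else 0.0

pi = [{0, 1, 2}, {3, 4}, {5}]   # 3 + 1 + 0 = 4 decisions
truth = [0, 0, 1, 1, 1, 2]      # only pairs (0,1) and (3,4) are correct
print(n_decisions(pi), ratio_cd(pi, truth))  # prints: 4 0.5
```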

 Fig. 1 The relation between #Decisions and RatioCD of the base clusterings for the MNIST dataset

 $w\left( \pi^{m} \right)=\frac{\frac{1}{\#Decisions\left( \pi^{m} \right)}}{\sum\limits_{i=1}^{M}{\frac{1}{\#Decisions\left( \pi^{i} \right)}}}$

which can equivalently be written as

 $w\left( \pi^{m} \right)=\frac{1}{\#Decisions\left( \pi^{m} \right)\sum\limits_{i=1}^{M}{\frac{1}{\#Decisions\left( \pi^{i} \right)}}}$ (3)

 $\sum\limits_{m=1}^{M}{w\left( {{\pi }^{m}} \right)}=1$
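The weighting scheme can be sketched in a few lines; `n_decisions` re-implements Eq. (1), and the two toy base clusterings are illustrative, not from the paper.

```python
def n_decisions(pi):
    # #Decisions(pi^m): total number of within-cluster pairs, as in Eq. (1).
    return sum(len(c) * (len(c) - 1) // 2 for c in pi)

def decision_weights(Pi):
    # Eq. (3): w(pi^m) is proportional to 1 / #Decisions(pi^m), normalised
    # so that the weights over the whole ensemble sum to one.
    inv = [1.0 / n_decisions(pi) for pi in Pi]
    total = sum(inv)
    return [v / total for v in inv]

Pi = [
    [{0, 1, 2}, {3, 4, 5}],    # coarser clustering: 3 + 3 = 6 decisions
    [{0, 1}, {2, 3}, {4, 5}],  # finer clustering:   1 + 1 + 1 = 3 decisions
]
w = decision_weights(Pi)  # [1/3, 2/3]: fewer decisions -> larger weight
```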

2.3 Bipartite Graph Construction and Clustering Ensemble

 $G=(U,V,E)$, where $U=X$ is the set of data points, $V=C$ is the set of clusters, and the edge weights are defined as

 ${{e}_{ij}}=\begin{cases} w\left( \pi \left( {{v}_{j}} \right) \right), & {{u}_{i}}\in X,\ {{v}_{j}}\in C,\ {{u}_{i}}\in {{v}_{j}} \\ 0, & \text{otherwise} \end{cases}$
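The edge definition above can be sketched as a dense point-by-cluster matrix. This is an assumed illustration: a real implementation would use a sparse matrix and then obtain the final clustering by partitioning the bipartite graph (e.g. with an efficient bipartite partitioning in the spirit of [19]), which is omitted here.

```python
import numpy as np

def bipartite_adjacency(Pi, weights, n_points):
    # Rows index the data points (U = X); columns index the clusters (V = C).
    # e_ij = w(pi(v_j)) when point u_i belongs to cluster v_j, else 0.
    clusters, cluster_w = [], []
    for pi, w in zip(Pi, weights):
        for c in pi:
            clusters.append(c)
            cluster_w.append(w)  # every cluster of pi inherits w(pi)
    E = np.zeros((n_points, len(clusters)))
    for j, (c, w) in enumerate(zip(clusters, cluster_w)):
        for i in c:
            E[i, j] = w
    return E

# Toy ensemble of 6 points; the weights play the role of Eq. (3) output.
Pi = [[{0, 1, 2}, {3, 4, 5}], [{0, 1}, {2, 3}, {4, 5}]]
E = bipartite_adjacency(Pi, [1/3, 2/3], n_points=6)
# Each point lies in exactly one cluster per base clustering, so every
# row of E sums to w(pi^1) + w(pi^2) = 1.
```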

2.4 Time Complexity

3 Experimental Results and Analysis

3.1 Datasets

 Dataset    #Points   #Dimensions   #Classes
 Glass        214          9            7
 Ecoli        336          7            8
 IS         2 310         19            7
 MNIST      5 000        784           10
 ISOLET     7 797        617           26
 PD        10 992         16           10
 USPS      11 000        256           10
 LR        20 000         16           26
3.2 Experimental Settings and Evaluation Metrics

3.3 Comparison with the Base Clusterings

 Fig. 2 Performance comparison between our method and the base clusterings

3.4 Comparison with Other Clustering Ensemble Methods

 Method        Glass   Ecoli   IS      MNIST   ISOLET  PD      USPS    LR
 Proposed      0.463   0.682   0.641   0.653   0.756   0.787   0.632   0.454
 EAC [4]       0.418   0.640   0.618   0.592   0.746   0.747   0.580   0.435
 HBGF [3]      0.397   0.635   0.624   0.609   0.747   0.757   0.588   0.441
 SRS [21]      0.423   0.632   0.623   0.594   0.747   0.755   0.593   0.436
 WCT [22]      0.434   0.678   0.623   0.627   0.752   0.764   0.598   0.439
 WEAC [8]      0.409   0.637   0.616   0.607   0.746   0.752   0.581   0.439
 GP-MGLA [8]   0.399   0.640   0.634   0.624   0.747   0.758   0.602   0.441
3.5 Comparison under Different Ensemble Sizes

 Fig. 3 The performances of different clustering ensemble methods with varying ensemble sizes
3.6 Running Time

 Fig. 4 Execution time of different clustering ensemble methods with varying data sizes
4 Conclusion

[1] STREHL A, GHOSH J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions[J]. The journal of machine learning research, 2003, 3(3): 583-617.
[2] CRISTOFOR D, SIMOVICI D. Finding median partitions using information-theoretical-based genetic algorithms[J]. Journal of universal computer science, 2002, 8(2): 153-172.
[3] FERN X Z, BRODLEY C E. Solving cluster ensemble problems by bipartite graph partitioning[C]//Proceedings of the 21st International Conference on Machine Learning. New York, NY, USA, 2004.
[4] FRED A L N, JAIN A K. Combining multiple clusterings using evidence accumulation[J]. IEEE transactions on pattern analysis and machine intelligence, 2005, 27(6): 835-850.
[5] WANG Xi, YANG Chunyu, ZHOU Jie. Clustering aggregation by probability accumulation[J]. Pattern recognition, 2009, 42(5): 668-675.
[6] SINGH V, MUKHERJEE L, PENG Jiming, et al. Ensemble clustering using semidefinite programming with applications[J]. Machine learning, 2010, 79(1/2): 177-200.
[7] HUANG Dong, LAI Jianhuang, WANG Changdong. Exploiting the wisdom of crowd: a multi-granularity approach to clustering ensemble[C]//Proceedings of the 4th International Conference on Intelligence Science and Big Data Engineering. Beijing, China, 2013: 112-119.
[8] HUANG Dong, LAI Jianhuang, WANG Changdong. Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis[J]. Neurocomputing, 2015, 170: 240-250.
[9] HUANG Dong, LAI Jianhuang, WANG Changdong. Ensemble clustering using factor graph[J]. Pattern recognition, 2016, 50: 131-142.
[10] HUANG Dong, LAI Jianhuang, WANG Changdong. Robust ensemble clustering using probability trajectories[J]. IEEE transactions on knowledge and data engineering, 2016, 28(5): 1312-1326.
[11] LI Tao, DING C. Weighted consensus clustering[C]//Proceedings of the 2008 SIAM International Conference on Data Mining. Auckland, New Zealand, 2008: 798-809.
[12] KARYPIS G, KUMAR V. Multilevel k-way partitioning scheme for irregular graphs[J]. Journal of parallel and distributed computing, 1998, 48(1): 96-129.
[13] NG A Y, JORDAN M I, WEISS Y. On spectral clustering: analysis and an algorithm[C]//Advances in Neural Information Processing Systems. Vancouver, Canada, 2001.
[14] TOPCHY A, JAIN A K, PUNCH W. Clustering ensembles: models of consensus and weak partitions[J]. IEEE transactions on pattern analysis and machine intelligence, 2005, 27(12): 1866-1881.
[15] VEGA-PONS S, CORREA-MORRIS J, RUIZ-SHULCLOPER J. Weighted partition consensus via kernels[J]. Pattern recognition, 2010, 43(8): 2712-2724.
[16] VEGA-PONS S, RUIZ-SHULCLOPER J, GUERRA-GANDÓN A. Weighted association based methods for the combination of heterogeneous partitions[J]. Pattern recognition letters, 2011, 32(16): 2163-2170.
[17] XU Sen, ZHOU Tian, YU Hualong, et al. Matrix low rank approximation-based cluster ensemble algorithm[J]. Acta electronica sinica, 2013, 41(6): 1219-1224.
[18] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[19] LI Zhenguo, WU Xiaoming, CHANG S F. Segmentation using superpixels: a bipartite graph partitioning approach[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA, 2012: 789-796.
[20] BACHE K, LICHMAN M. UCI machine learning repository[EB/OL]. (2013-04-04). http://archive.ics.uci.edu/ml.
[21] IAM-ON N, BOONGOEN T, GARRETT S. Refining pairwise similarity matrix for cluster ensemble problem with cluster relations[C]//Proceedings of the 11th International Conference on Discovery Science. Budapest, Hungary, 2008: 222-233.
[22] IAM-ON N, BOONGOEN T, GARRETT S, et al. A link-based approach to the cluster ensemble problem[J]. IEEE transactions on pattern analysis and machine intelligence, 2011, 33(12): 2396-2409.
DOI: 10.11992/tis.2016030

CAAI Transactions on Intelligent Systems, 2016, 11(3): 418-425.