文章快速检索 高级检索

An improved clustering algorithm that searches and finds density peaks
GAN Wenyan, LIU Chong
College of Command Information System, PLA University of Science and Technology, Nanjing 210007, China
Abstract: Clustering is a fundamental issue for big data analysis and data mining. In July 2014, a paper in the Journal of Science proposed a simple yet effective clustering algorithm based on the idea that cluster centers are characterized by a higher density than their neighbors and having a relatively large distance from points with higher densities. The proposed algorithm can detect clusters of arbitrary shapes and differing densities but is very sensitive to tunable parameter dc. In this paper, we propose an improved clustering algorithm that adaptively optimizes parameter dc. The time complexity of our algorithm was super-linear with respect to the size of the dataset. Further, our theoretical analysis and experimental results show the effectiveness and efficiency of our improved algorithm.
Key words: data mining     clustering algorithms     kernel density estimation     entropy

2014年《Science》杂志上刊登了一篇题为《Clustering by fast search and find of density peaks》的论文[1]，论文提出一种快速搜索和发现密度峰值的聚类算法。算法将具有局部极大密度估计值的样本点视为聚类中心，通过快速搜索聚类中心，将每一个非中心样本点沿着密度递增的最近邻方向迭代划分给相应的聚类中心，实现数据划分。算法思路新颖，简单实用，具有良好的聚类质量，能够发现任意形状、大小和密度的聚类，能够有效处理噪声和离群数据，对人脸等高维非结构化数据具有良好的适用性。虽然论文的局限性遭到众多读者的质疑，如聚类结果严重依赖于密度参数dc的仔细选择，但整体上可以为聚类算法设计提供一种新思路。

1 快速搜索密度峰值的聚类算法

1.1 局部密度估计与高密度最近邻距离

xiD, 1≤in, 局部密度估计值dc定义为

 (1)

1) 截断核估计

 (2)

2) 高斯核估计

 (3)

3) 指数核估计

 (4)

 (5)

1.2 基于决策图的聚类划分

 图 1 包含4 000个样本点的测试数据集的聚类结果 Fig. 1 The clustering results of the test dataset containing 4 000 sample points

1.3 核函数及其参数选择对聚类结果的影响

 图 2 两个标准测试数据集 Fig. 2 Two standard test datasets: aggregation and spiral

 图 3 aggregation数据集的聚类结果 Fig. 3 The clustering results for aggregation datasets
 图 4 spiral数据集的聚类结果 (采用指数核估计) Fig. 4 The clustering results for aggregation datasets (using exponential kernel estimation)
 图 5 spiral数据集的聚类结果 (采用高斯核估计) Fig. 5 The clustering results for aggregation datasets (using gaussian kernel estimation)

 图 6 指数核与高斯核的截断距离比较 Fig. 6 Comparison of truncative distance between exponential kernel and Gaussian kernel

2 改进的搜索密度峰值的聚类算法 2.1 基于密度估计熵最小化的自适应参数优选

 (6)

 (7)

2.2 局部密度估计值的近似计算

 (8)
 (9)

2.3 改进算法描述

1) 随机抽取nsample个样本点组成抽样数据集SampleSet；

2) dc = Optimal_Parameter (SampleSet)；//用抽样数据集优化估计密度参数dc

3) Map= CreateMap (D, )；//以为尺度对空间进行网格划分并构建索引树；

4) ρ = Density_Estimation (D, Map, dc)；//计算所有样本点的局部密度估计值ρ1, ρ2, …, ρn

5) δ = NN_Distance (D, Map, ρ )；//按照局部密度估从大到小的顺序，计算所有样本点的高密度最所邻距离δ1, …, δn

6) C=Decision_Graph (D, ρ, δ)；//形成决策图，根据用户交互，确定代表聚类中心的样本子集；

7) Π=Partition (D, C)；//将所有非中心样本点沿着密度估计值递增的最近邻方向，迭代划分给相应的聚类中心，实现数据划分。

3 实验结果与比较

 图 7 4 000个随机样本点的聚类结果比较 Fig. 7 Comparison of clustering results of 4 000 random sample points
 图 8 aggregation数据集的聚类结果比较 Fig. 8 Comparison of clustering results for aggregation datasets
 图 9 spiral数据集的聚类结果比较 Fig. 9 Comparison of clustering results for spiral datasets

4 结束语

 [1] RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496. DOI:10.1126/science.1242072. [2] MANYIKA J, CHUI M, BROWN B, et al. Big data: the next frontier for innovation, competition, and productivity[M]. McKinsey Global Institute, 2011. [3] HAN Jiawei, KAMBER M, PEI Jian. Data mining: concepts and techniques[M]. 3rd ed. Burlington: Morgan Kaufmann, 2011. [4] JAIN A K. Data clustering: 50 years beyond k-means[Z]. Pattern Recognition Letters, 2009. [5] 唐杰, 东昱晓, 蒋朦, 等. SIGKDD二十周年庆典[J]. 中国计算机学会通讯, 2014, 10(10): 58-64. . [6] http://comments.sicencemag.org/content/10.1126/science.1242072 (请核对网址及补充文献内容) [7] 淦文燕, 李德毅. 基于核密度估计的层次聚类算法[J]. 系统仿真学报, 2004, 16(2): 302-305. GAN Wenyan, LI Deyi. Hierarchical clustering based on kernel density estimation[J]. Journal of System Simulation, 2004, 16(2): 302-305. [8] ESTER M, KRIEGEL H, SANDER J, et al. A density based algorithm for discovering clusters in large spatial databases with noise[C]//Proceedings of the 2nd international conference on knowledge discovery and data mining. Portland, 1996: 226-231. [9] GIONIS A, MANNILA H, TSAPARAS P. Clustering aggregation[J]. ACM transactions on knowledge discovery from data, 2007, 1(1): Article No.4.
DOI: 10.11992/tis.201512036

0

#### 文章信息

GAN Wenyan, LIU Chong

An improved clustering algorithm that searches and finds density peaks

CAAI Transactions on Intelligent Systems, 2017, 12(2): 229-235
http://dx.doi.org/10.11992/tis.201512036