
CAAI Transactions on Intelligent Systems (智能系统学报), 2017, Vol. 12, Issue (6): 873-882  DOI: 10.11992/tis.201706029

### Cite this article

XIE Juanying, ZHOU Ying, WANG Mingzhao, et al. New criteria for evaluating the validity of clustering[J]. CAAI Transactions on Intelligent Systems, 2017, 12(6): 873-882. DOI: 10.11992/tis.201706029.


New criteria for evaluating the validity of clustering
XIE Juanying, ZHOU Ying, WANG Mingzhao, JIANG Weiliang
School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
Abstract: Criteria for evaluating the clustering ability of a clustering algorithm fall into two kinds: internal and external. Current external evaluation indices fail to consider skewed clustering results, and it is difficult to determine the optimal number of clusters from the validity tests of internal evaluation indices. To address these defects, we propose two external evaluation indices, based respectively on the contingency table and on sample pairs, which consider both positive and negative information and can evaluate clustering results on a dataset with an arbitrary distribution. We further use variance to measure the tightness of a cluster and the separability between clusters, and take the ratio of these two quantities as a new internal evaluation index. Experiments on real datasets from the UCI (University of California, Irvine) machine learning repository and on artificially simulated datasets show that the proposed internal index can effectively find the true number of clusters in a dataset, and that the proposed external indices based on the contingency table and sample pairs are effective external evaluation indices that can evaluate clustering results on skewed and noisy data.
Key words: clustering    validity of clustering    evaluation index    external criteria    internal criteria    F-measure    Adjusted Rand Index    STDI    S2    PS2

Experiments on real datasets from the UCI machine learning repository and on artificially simulated datasets, including deliberately challenging ones and ones with noise and class skew, show that the proposed new internal criterion STDI finds a more reasonable number of clusters in a dataset, and that the proposed external criteria S2 and PS2, based respectively on the contingency table and on sample pairs, can effectively evaluate clustering results that exhibit class skew.

1 External criteria

 $\left\{\begin{aligned}{\rm TP}_c &= \left|\left\{ i \mid l(\boldsymbol{x}_i) = L(\boldsymbol{x}_i) = c,\ 1 \leqslant i \leqslant n \right\}\right| = n_{cc}\\{\rm FN}_c &= \left|\left\{ i \mid l(\boldsymbol{x}_i) = c \wedge L(\boldsymbol{x}_i) \ne c,\ 1 \leqslant i \leqslant n \right\}\right| = n_{c \cdot} - n_{cc}\\{\rm FP}_c &= \left|\left\{ i \mid l(\boldsymbol{x}_i) \ne c \wedge L(\boldsymbol{x}_i) = c,\ 1 \leqslant i \leqslant n \right\}\right| = n_{\cdot c} - n_{cc}\\{\rm TN}_c &= \left|\left\{ i \mid l(\boldsymbol{x}_i) \ne c,\ L(\boldsymbol{x}_i) \ne c,\ 1 \leqslant i \leqslant n \right\}\right| = n - n_{c \cdot} - n_{\cdot c} + n_{cc}\end{aligned}\right.$ (1)
 $\begin{aligned}{\rm sensitivity}_c &= \frac{{\rm TP}_c}{{\rm TP}_c + {\rm FN}_c} = \frac{n_{cc}}{n_{c \cdot}}\\{\rm specificity}_c &= \frac{{\rm TN}_c}{{\rm TN}_c + {\rm FP}_c} = \frac{n - n_{c \cdot} - n_{\cdot c} + n_{cc}}{n - n_{c \cdot}}\end{aligned}$ (2)
 $S2 = \frac{1}{\min\{C, K\}} \sum\limits_{c = 1}^{\min\{C, K\}} \frac{2 \times {\rm sensitivity}_c \times {\rm specificity}_c}{{\rm sensitivity}_c + {\rm specificity}_c}$ (3)
 $S2 = \frac{{2 \times {\rm{sensitivity}} \times {\rm{specificity}}}}{{{\rm{sensitivity}} + {\rm{specificity}}}}$ (4)
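As a concrete illustration, Eqs. (1)–(3) can be evaluated directly from the class-label and cluster-label vectors. The following Python sketch is ours, not the authors' code; it assumes cluster labels have already been matched to class labels (the matching step is not shown), and the function name `s2_index` is hypothetical.

```python
from collections import Counter

def s2_index(true_labels, pred_labels):
    """Contingency-table-based S2 (Eqs. (1)-(3)); labels assumed matched."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # n_cc when the keys coincide
    row = Counter(true_labels)                      # n_c.  (class sizes)
    col = Counter(pred_labels)                      # n_.c  (cluster sizes)
    m = min(len(row), len(col))                     # min(C, K)
    scores = []
    for c in sorted(row)[:m]:
        n_cc = cells[(c, c)]
        sens = n_cc / row[c]                                # Eq. (2), sensitivity_c
        spec = (n - row[c] - col[c] + n_cc) / (n - row[c])  # Eq. (2), specificity_c
        scores.append(2 * sens * spec / (sens + spec) if sens + spec else 0.0)
    return sum(scores) / m                          # Eq. (3)
```

A perfect partition yields S2 = 1; for the two-class example `[0, 0, 1, 1]` clustered as `[0, 1, 0, 1]`, each class has sensitivity = specificity = 0.5, so S2 = 0.5.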

 $\left\{\begin{aligned}{\rm TP} &= \left|\left\{ (\boldsymbol{x}_i, \boldsymbol{x}_j) \mid l(\boldsymbol{x}_i) = l(\boldsymbol{x}_j),\ L(\boldsymbol{x}_i) = L(\boldsymbol{x}_j) \right\}\right|\\{\rm FN} &= \left|\left\{ (\boldsymbol{x}_i, \boldsymbol{x}_j) \mid l(\boldsymbol{x}_i) = l(\boldsymbol{x}_j),\ L(\boldsymbol{x}_i) \ne L(\boldsymbol{x}_j) \right\}\right|\\{\rm FP} &= \left|\left\{ (\boldsymbol{x}_i, \boldsymbol{x}_j) \mid l(\boldsymbol{x}_i) \ne l(\boldsymbol{x}_j),\ L(\boldsymbol{x}_i) = L(\boldsymbol{x}_j) \right\}\right|\\{\rm TN} &= \left|\left\{ (\boldsymbol{x}_i, \boldsymbol{x}_j) \mid l(\boldsymbol{x}_i) \ne l(\boldsymbol{x}_j),\ L(\boldsymbol{x}_i) \ne L(\boldsymbol{x}_j) \right\}\right|\end{aligned}\right.$ (5)
 $\left\{\begin{aligned}{\rm TP} &= \sum\limits_{i = 1}^{C} \sum\limits_{j = 1}^{K} \binom{n_{ij}}{2}\\{\rm FN} &= \sum\limits_{i = 1}^{C} \binom{n_{i \cdot}}{2} - {\rm TP}\\{\rm FP} &= \sum\limits_{j = 1}^{K} \binom{n_{\cdot j}}{2} - {\rm TP}\\{\rm TN} &= N - \left({\rm TP} + {\rm FN} + {\rm FP}\right)\end{aligned}\right.$ (6)
 $\left\{\begin{aligned}{\rm sensitivity} &= \frac{{\rm TP}}{{\rm TP} + {\rm FN}}\\{\rm specificity} &= \frac{{\rm TN}}{{\rm TN} + {\rm FP}}\end{aligned}\right.$ (7)
 ${\rm PS2} = \frac{2 \times {\rm sensitivity} \times {\rm specificity}}{{\rm sensitivity} + {\rm specificity}} = \frac{2 \times {\rm TP} \times {\rm TN}}{{\rm TP}\left({\rm FP} + {\rm TN}\right) + {\rm TN}\left({\rm TP} + {\rm FN}\right)}$ (8)
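Eq. (6) lets the pair counts be obtained from the contingency table without enumerating all N = n(n−1)/2 sample pairs. A minimal Python sketch (ours; the function name `ps2_index` is an assumption):

```python
from collections import Counter
from math import comb

def ps2_index(true_labels, pred_labels):
    """Pair-based PS2 (Eqs. (5)-(8)), computed via the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))   # contingency table n_ij
    row = Counter(true_labels)                       # n_i. (class sizes)
    col = Counter(pred_labels)                       # n_.j (cluster sizes)
    tp = sum(comb(v, 2) for v in cells.values())     # Eq. (6)
    fn = sum(comb(v, 2) for v in row.values()) - tp
    fp = sum(comb(v, 2) for v in col.values()) - tp
    tn = comb(n, 2) - (tp + fn + fp)                 # N = n(n-1)/2 total pairs
    sens = tp / (tp + fn)                            # Eq. (7)
    spec = tn / (tn + fp)
    return 2 * sens * spec / (sens + spec)           # Eq. (8)
```

Because it counts co-membership of pairs rather than label values, PS2 needs no matching between cluster ids and class ids; relabeling the clusters leaves the score unchanged.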
2 Internal criterion

 ${\rm STDI} = \frac{\dfrac{1}{K} \sum\limits_{k = 1}^{K} \left\| \boldsymbol{c}_k - \bar{\boldsymbol{x}} \right\|^2}{\sum\limits_{k = 1}^{K} \dfrac{1}{n_k} \sum\limits_{i = 1}^{n_k} \left\| \boldsymbol{x}_i - \boldsymbol{c}_k \right\|^2}$ (9)
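Eq. (9) is the ratio of between-cluster separation (the mean squared distance of the K centroids c_k from the global mean x̄) to within-cluster scatter (the sum over clusters of the mean squared distance of each point from its centroid), so more separated and tighter clusters give larger values. A dependency-free Python sketch, ours, with hypothetical names:

```python
def stdi(points, labels):
    """STDI internal index (Eq. (9)): separation / compactness ratio."""
    def mean(vecs):
        return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    clusters = sorted(set(labels))
    global_mean = mean(points)                       # x-bar
    between = 0.0
    within = 0.0
    for k in clusters:
        members = [p for p, l in zip(points, labels) if l == k]
        c_k = mean(members)                          # centroid of cluster k
        between += sqdist(c_k, global_mean)          # ||c_k - x-bar||^2
        within += sum(sqdist(p, c_k) for p in members) / len(members)
    return (between / len(clusters)) / within        # Eq. (9)
```

On the toy data `[[0, 0], [0, 1], [10, 0], [10, 1]]` with labels `[0, 0, 1, 1]`, the centroids are (0, 0.5) and (10, 0.5), giving between = 25, within = 0.5, and STDI = 50.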

3 Experimental analysis

 Fig. 1 Original distribution of the synthetic dataset used to test the new internal criterion STDI
 Fig. 2 Original distributions of the synthetic datasets used to test the new external criteria S2 and PS2

3.1 Experiments testing the validity of the internal criterion

 Fig. 3 Results of the internal criteria on the synthetic dataset

3.2 Experiments testing the validity of the external criteria

 Fig. 4 Comparison of S2 with other criteria

 Fig. 5 Comparison of PS2 with other criteria

 Fig. 6 Comparison of S2 and PS2 with clustering accuracy

4 Conclusion
