
An improved KNN algorithm based on Chi-square distance measure
XIE Hong, ZHAO Hongye
College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
Abstract: The K-nearest neighbor (KNN) algorithm is a classification algorithm that is effective, simple, and easy to implement. One of the most significant factors determining the performance of KNN classification is the distance measure. The Euclidean distance is usually used as the distance function of the KNN algorithm; it assigns the same weight to every feature of a sample, although different features influence the accuracy of the classification result to different degrees. This paper adopts the Chi-square distance, which better reflects the relative relationships between features, as the distance function of the KNN algorithm, and uses a sensitivity method to compute the feature weights, thereby overcoming this shortcoming of the Euclidean distance. Classification tests show that the evaluation indexes of the improved algorithm based on the Chi-square distance are better than those of the traditional KNN algorithm.
Key words: K-nearest neighbor algorithm; Chi-square distance; distance measure; quadratic-form distance; Euclidean distance; sensitivity method

1 The KNN algorithm

As an instance-based classifier, KNN is considered one of the best classification algorithms under the vector space model [9]. Starting from a test sample point y, the KNN algorithm grows a region until it contains k training sample points, and assigns y to the class that occurs most frequently among these k nearest training samples. The main steps of the KNN algorithm are as follows:
1) Build the test sample set and the training sample set.

2) Set the value of k. The value of k can only be determined experimentally: it is adjusted repeatedly according to the classification results until the optimal k is found.

3) Compute the distance between the test sample point and every training sample point. The distance (the Euclidean distance, as is usual for KNN) is

$$d(x_i, y)=\sqrt{\sum_{q=1}^{m}(x_{iq}-y_q)^{2}} \tag{1}$$

where $x_i$ is the $i$-th training sample and $m$ is the number of features.

4) Select the k nearest training samples. For the test sample point y, use Eq. (1) to find the k training sample points in the training set that are closest to y.

5) Decide the class of the test sample y. Count the classes of the k nearest training sample points obtained in step 4), and assign y to the class that accounts for the largest number of these k training samples.
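The steps above can be sketched in Python. This is a minimal sketch; the variable names and the use of NumPy are our own choices, not from the paper:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, point, k):
    """Classify one test point by majority vote among its k nearest
    training samples under the Euclidean distance of Eq. (1)."""
    dists = np.sqrt(((X_train - point) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(dists)[:k]                        # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)           # count class occurrences
    return votes.most_common(1)[0][0]                      # most frequent class wins

# Toy data: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([4.8, 5.0]), k=3))
```

The value of k is left as a parameter, matching step 2): it is tuned by re-running the classification and comparing results.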

2 The Chi-square distance measure

2.1 The Chi-square distance model

2.2 Relationship between the Chi-square distance and the quadratic-Chi distance

If the two histogram statistics X and Y are non-negative and bounded, the quadratic-form distance between the two histograms is

$$d_{\mathrm{QF}}(X, Y)=\sqrt{(X-Y)^{\mathrm T}A\,(X-Y)}$$

where A is the bin-similarity matrix.

When m = 0, the quadratic-Chi distance reduces to the quadratic-form distance. It follows that the Chi-square distance, like the Euclidean and Mahalanobis distances, can measure feature vectors effectively.
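As a concrete illustration, the plain (unweighted) Chi-square distance can be computed bin by bin. This sketch uses the common bin-wise definition, sum of (x_q − y_q)² / (x_q + y_q); the paper's exact normalization may differ:

```python
import numpy as np

def chi_square_distance(x, y, eps=1e-12):
    """Bin-wise Chi-square distance between two non-negative histograms.
    eps guards against division by zero on empty bins."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(((x - y) ** 2 / (x + y + eps)).sum())

# The same absolute difference of 1 counts more in a small bin than in a
# large one, which is how the Chi-square distance captures the *relative*
# relationship between features.
print(chi_square_distance([1, 10], [2, 10]))    # difference of 1 in a small bin
print(chi_square_distance([10, 10], [11, 10]))  # same difference in a large bin
```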

2.3 Computing the weight adjustment coefficients

3) Compute u_q; when n_q = 0 or n = 0, set u_q = 1. A larger n_q indicates a larger classification error, meaning that the q-th feature contributes more to the classification; a smaller n_q indicates a smaller classification error, meaning that the q-th feature plays a smaller role.

4) The weight coefficient of the q-th feature is defined as
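The sensitivity idea behind steps 3) and 4) can be sketched as follows. This is a hypothetical reading, not the paper's exact procedure: n_q is taken as the number of leave-one-out errors made when feature q is suppressed, and the counts are simply normalized into weights; the paper's actual u_q and w_q formulas are not reproduced here:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, point, k=3):
    """Plain Euclidean KNN for a single point (helper for the sketch)."""
    d = np.sqrt(((X_train - point) ** 2).sum(axis=1))
    idx = np.argsort(d)[:k]
    return Counter(y_train[i] for i in idx).most_common(1)[0][0]

def sensitivity_weights(X, y, k=3):
    """Hypothetical sensitivity sketch: n_q counts leave-one-out errors
    with feature q zeroed out; features whose removal causes more errors
    get larger weights."""
    n_feat = X.shape[1]
    errors = np.zeros(n_feat)
    for q in range(n_feat):
        X_drop = X.copy()
        X_drop[:, q] = 0.0                       # suppress feature q
        for i in range(len(X)):                  # leave-one-out evaluation
            mask = np.arange(len(X)) != i
            pred = knn_predict(X_drop[mask], y[mask], X_drop[i], k)
            errors[q] += pred != y[i]
    u = np.where(errors.sum() > 0, errors, 1.0)  # guard the all-zero case
    return u / u.sum()                           # normalize to weights

# Feature 0 separates the classes; feature 1 is constant noise.
X = np.array([[0.0, 5.0], [0.1, 5.0], [0.2, 5.0],
              [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = sensitivity_weights(X, y)
print(w)  # feature 0 receives the larger weight
```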

3 Improved KNN algorithm based on the Chi-square distance

3.1 Basic flow of the CSKNN algorithm

1) Create the test sample set and the training sample set; denote the training set by X and the test set by Y.

2) Set an initial value for k.

3) Compute the weighted Chi-square distance between the test sample point and each training sample point:

$$d_w(x_i, y)=\sum_{q=1}^{m} w_q\,\frac{(x_{iq}-y_q)^{2}}{x_{iq}+y_q}$$

where $w_q$ is the weight coefficient of the q-th feature obtained in section 2.3.

4) Sort the distances obtained in step 3) in ascending order and select the k training samples nearest to the test sample point.

5) Count the classes of the k nearest training sample points obtained in step 4), and assign the test sample to the class that occurs most often among the k training samples.

6) According to the classification results, evaluate the recall R, precision P, macro-F1, and classification accuracy, and judge whether the best classification performance has been reached. If so, stop; otherwise, return to step 2) and adjust k.
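Steps 3)–5) can be sketched as a weighted Chi-square KNN classifier. This is a minimal sketch: the weighted distance used is the common bin-wise Chi-square with per-feature weights w_q, which may differ in detail from the paper's formula, and the demo weights are arbitrary:

```python
import numpy as np
from collections import Counter

def weighted_chi2(x, y, w, eps=1e-12):
    """Weighted Chi-square distance: sum_q w_q * (x_q - y_q)^2 / (x_q + y_q)."""
    return float((w * (x - y) ** 2 / (x + y + eps)).sum())

def csknn_classify(X_train, y_train, point, w, k):
    """Steps 3)-5): weighted Chi-square distances, ascending sort,
    majority vote among the k nearest training samples."""
    dists = np.array([weighted_chi2(x, point, w) for x in X_train])
    nearest = np.argsort(dists)[:k]              # k smallest distances
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy data with equal weights for the demo.
X_train = np.array([[1.0, 2.0], [1.2, 2.1], [6.0, 7.0], [6.2, 6.8]])
y_train = np.array([0, 0, 1, 1])
w = np.array([0.5, 0.5])
print(csknn_classify(X_train, y_train, np.array([6.1, 7.1]), w, k=3))
```

Step 6)'s outer loop, re-running this classification over candidate values of k and keeping the best-scoring one, is omitted for brevity.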

Figure 1 Flowchart of the algorithm
3.2 Experimental results and analysis

| Dataset | Attribute type | Features | Samples | Classes |
|---|---|---|---|---|
| Iris | Real | 4 | 150 | 3 |
| Haberman | Integer | 3 | 306 | 2 |
| Pima-India diabetes | Integer, Real | 8 | 760 | 2 |

The F1 measure is defined as

$$F1=\frac{2PR}{P+R}$$

where P is the precision and R is the recall.
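A quick worked example of the F1 formula (with arbitrary demo values for P and R):

```python
def f1_score(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.6), 4))  # → 0.6857
```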

| Dataset | R (KNN) | R (CSKNN) | P (KNN) | P (CSKNN) | F1 (KNN) | F1 (CSKNN) |
|---|---|---|---|---|---|---|
| Iris | 0.940 | 0.959 | 0.939 | 0.959 | 0.939 | 0.959 |
| Haberman | 0.595 | 0.618 | 0.617 | 0.556 | 0.592 | 0.551 |
| Pima-India diabetes | 0.712 | 0.765 | 0.658 | 0.736 | 0.666 | 0.746 |

| Dataset | R (KNN) | R (CSKNN) | P (KNN) | P (CSKNN) | F1 (KNN) | F1 (CSKNN) |
|---|---|---|---|---|---|---|
| Iris | 0.956 | 0.981 | 0.949 | 0.980 | 0.949 | 0.980 |
| Haberman | 0.548 | 0.640 | 0.537 | 0.589 | 0.536 | 0.595 |
| Pima-India diabetes | 0.699 | 0.737 | 0.680 | 0.727 | 0.686 | 0.731 |

Figure 2 Classification accuracy on the Iris dataset as a function of k

Figure 3 Classification accuracy on the Haberman dataset as a function of k

Figure 4 Classification accuracy on the Pima-India diabetes dataset as a function of k

4 Conclusion

[1] FENG Guohe. Research on automatic text classification technology[J]. Journal of Intelligence, 2004, 2(4): 108-111.
[2] COVER T M, HART P E. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-27.
[3] LI J. TKNN: an improved KNN algorithm based on tree structure[C]// Seventh International IEEE Conference on Computational Intelligence and Security. Sanya, China, 2011: 1390-1394.
[4] PENG Kai, WANG Wei, YANG Yupu. Pseudo K-nearest neighbor text classification algorithm based on cosine distance metric learning[J]. Computer Engineering and Design, 2013, 34(6): 2200-2203.
[5] LIM H. An improved KNN learning based Korean text classifier with heuristic information[C]// The 9th International Conference on Neural Information Processing. Singapore, 2002: 731-735.
[6] SANG Yingbin. A K-Nearest-Neighbor algorithm based on feature weighting[J]. Journal of Hainan University, 2008, 26(4): 352-355.
[7] WEINBERGER K Q, SAUL L K. Distance metric learning for large margin nearest neighbor classification[J]. The Journal of Machine Learning Research, 2009(10): 207-244.
[8] KEDEM D, TYREE S, WEINBERGER K Q, et al. Non-linear metric learning[C]// Neural Information Processing Systems Foundation. Lake Tahoe, USA, 2012: 2582-2590.
[9] SU Jinshu, ZHANG Bofeng, XU Xin. Advances in machine learning based text categorization[J]. Journal of Software, 2006, 17(9): 1848-1859.
[10] PELE O, WERMAN M. The quadratic-Chi histogram distance family[C]// The 11th European Conference on Computer Vision. Crete, Greece, 2010: 749-762.

#### Article information

XIE Hong, ZHAO Hongye

An improved KNN algorithm based on Chi-square distance measure

Applied Science and Technology, 2015, (01):10-14.
DOI: 10.3969/j.issn.1009-671X.201401002