Acta Automatica Sinica, 2018, Vol. 44, Issue (10): 1781-1789


Concept Drift and Attribute Reduction From the Viewpoint of F-rough Sets
DENG Da-Yong1,2, LI Ya-Nan1, HUANG Hou-Kuan3
1. College of Mathematics, Physics and Information Engineering, Zhejiang Normal University, Jinhua 321004;
2. Xingzhi College, Zhejiang Normal University, Jinhua 321004;
3. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Manuscript received April 21, 2017; accepted August 2, 2017.
Foundation Item: Supported by National Natural Science Foundation of China (61473030), Zhejiang Provincial Natural Science Foundation of China (LY15F020012), and Zhejiang Provincial Top Discipline of Cyber Security at Zhejiang Normal University
Corresponding author: DENG Da-Yong, associate professor at Xingzhi College, Zhejiang Normal University. He received his Ph.D. degree from Beijing Jiaotong University in 2007. His research interest covers rough set theory and its applications.
Abstract: Concept drift detection is a challenging problem in data stream mining, while attribute reduction is a focus of rough set theory. In this paper, the phenomenon of concept drift is analyzed from the viewpoint of attribute reduction and, conversely, attribute reduction is analyzed in depth from the viewpoint of concept drift. The differences and relations between concept drift and attribute reduction are investigated. Two criteria for concept drift detection, based on the dependency of condition attributes and on conditional entropy, are presented, and they are analyzed and compared with two classical criteria for concept drift detection and with two criteria for attribute reduction. The dependency of condition attributes and conditional entropy combine the experimental testability of classification accuracy with the theoretical analyzability of joint probability. Experimental results show that classification accuracy, the dependency of condition attributes, and conditional entropy are all valid and efficient for detecting concept drift. Moreover, compared with classification accuracy, intermediate results can be reused and the workload reduced when the dependency of condition attributes and conditional entropy are employed to detect concept drift. The study of the relation between concept drift and attribute reduction provides a new method for attribute reduction and concept drift detection, as well as a new way for rough set theory and granular computing to blend into the era of big data.
Key words: F-rough sets; data streams; concept drift; attribute reduction; conditional information entropy

1 Preliminaries

1.1 Rough Sets

Let $IS=(U, A)$ be an information system, where $U$ is the universe of discourse and $A$ is the set of condition attributes on $U$. Each attribute $a\in A$ corresponds to a function $a:U\rightarrow V_{a}$, where $V_{a}$ is called the value domain of attribute $a$; each element of $U$ is called an individual, an object, or a row.

For $B\subseteq A$, the $B$-information function of an object $x\in U$ is

 $Inf_{B}(x)=\{(a, a(x)):a\in B\}$

The $B$-indiscernibility relation (also called the indiscernibility relation with respect to $B$) is defined as

 $IND(B)=\{(x, y):Inf_{B}(x)=Inf_{B}(y)\}$

The $B$-lower and $B$-upper approximations of a concept $X\subseteq U$ are defined as

 \begin{align*} &\underline{B}(X)=\underline{B}(IS, X)=\{x\in U: [x]_{B}\subseteq X\}\\ &\overline{B}(X)=\overline{B}(IS, X)=\{x\in U: [x]_{B}\cap X\neq \emptyset\}\end{align*}
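In code, the two approximations follow directly from their set-theoretic definitions. Below is a minimal Python sketch; the dictionary encoding of a toy information system and all identifier names are illustrative assumptions, not the paper's notation:

```python
def partition(U, B, value):
    """Group the universe U into B-equivalence classes: x ~ y iff Inf_B(x) = Inf_B(y)."""
    classes = {}
    for x in U:
        key = tuple(value(x, a) for a in B)   # Inf_B(x) as a hashable tuple
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def lower_upper(U, B, value, X):
    """B-lower and B-upper approximations of a concept X ⊆ U."""
    lower, upper = set(), set()
    for block in partition(U, B, value):
        if block <= X:            # [x]_B ⊆ X
            lower |= block
        if block & X:             # [x]_B ∩ X ≠ ∅
            upper |= block
    return lower, upper

# Toy information system with two attributes (values are hypothetical)
table = {1: (0, 0), 2: (0, 1), 3: (1, 1), 4: (0, 1)}
U = set(table)
value = lambda x, a: table[x][a]
low, up = lower_upper(U, [0, 1], value, {1, 2})   # low = {1}, up = {1, 2, 4}
```

Here objects 2 and 4 share the same attribute values, so their block straddles the boundary of $X=\{1, 2\}$: it is excluded from the lower approximation but included in the upper one.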

A subset $B\subseteq A$ is a parallel reduct of the decision subtable family $F$ when it satisfies the following two conditions:

1) $POS(F, B, d)=POS(F, A, d)$;

2) for any $S\subset B$, $POS(F, S, d)\neq POS(F, A, d)$.
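The two conditions above can be checked mechanically by brute force over attribute subsets. The sketch below only illustrates the parallel reduct conditions on small toy tables; the table encoding and all function names are assumptions:

```python
from itertools import combinations

def pos_region(table, B):
    """Positive region of one decision table w.r.t. attribute subset B.
    table: dict mapping object -> (dict of attribute values, decision value)."""
    blocks = {}
    for x, (av, d) in table.items():
        blocks.setdefault(tuple(av[a] for a in B), []).append((x, d))
    pos = set()
    for members in blocks.values():
        if len({d for _, d in members}) == 1:   # block is consistent w.r.t. d
            pos.update(x for x, _ in members)
    return frozenset(pos)

def POS_F(F, B):
    """POS(F, B, d): the tuple of positive regions over the subtable family F."""
    return tuple(pos_region(t, B) for t in F)

def is_parallel_reduct(F, A, B):
    """Condition 1): B preserves POS(F, A, d); condition 2): B is minimal."""
    if POS_F(F, B) != POS_F(F, A):
        return False
    return all(POS_F(F, list(S)) != POS_F(F, A)
               for r in range(len(B))
               for S in combinations(B, r))

# Family of two hypothetical subtables over attributes A = ['a', 'b']
DT1 = {1: ({'a': 0, 'b': 0}, 0), 2: ({'a': 1, 'b': 0}, 1)}
DT2 = {3: ({'a': 0, 'b': 1}, 0), 4: ({'a': 1, 'b': 1}, 1)}
F = [DT1, DT2]
```

In this toy family, attribute `a` alone discerns the decision in every subtable, so `['a']` passes both conditions, while `['b']` already fails condition 1).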

2 Analysis and Comparison of Concept Drift Criteria

1) Classification accuracy (or classification error rate). If the classification accuracy of a classifier on the dataset changes considerably (in most cases this means the accuracy drops) and the change exceeds a given threshold, concept drift is said to have occurred. Classification accuracy is the most commonly used criterion for concept drift detection and is frequently employed in experimental analysis.
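A minimal sketch of this criterion follows, assuming labeled sliding windows and a fixed classifier; the threshold value and all names are illustrative assumptions:

```python
def detect_drift_by_accuracy(classify, windows, threshold=0.1):
    """Report window indices where accuracy drops beyond `threshold`
    relative to the previous window. Each window is a list of (x, y) pairs."""
    def accuracy(window):
        return sum(classify(x) == y for x, y in window) / len(window)
    drifts = []
    prev = accuracy(windows[0])
    for i, window in enumerate(windows[1:], start=1):
        acc = accuracy(window)
        if prev - acc > threshold:   # a large enough accuracy drop signals drift
            drifts.append(i)
        prev = acc
    return drifts

# Hypothetical stream: the label concept flips in the second window
classify = lambda x: x > 0
w0 = [(1, True), (-1, False)]    # accuracy 1.0
w1 = [(1, False), (-1, True)]    # accuracy 0.0, so window 1 is flagged
```

Comparing only consecutive windows keeps the test cheap, but it also shows the criterion's dependence on the concept's extension: the classifier must be evaluated anew on every window.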

2) Joint probability. Under this criterion, concept drift occurs when the joint probability $p(X, Y)=p(X)p(Y|X)$ (where $X$ denotes a concept induced by the condition attributes and $Y$ a concept induced by the decision attribute) changes by more than a given threshold. Drift detected by joint probability is further divided into virtual drift and real drift: virtual drift [12, 14] means that $p(Y|X)$ stays unchanged while only $p(X)$ changes; real drift [13-14] means that $p(X)$ stays unchanged while $p(Y|X)$ changes. The joint probability criterion is also widely used. Because it rests on a solid mathematical foundation (statistics and probability theory), it is the criterion of choice for theoretical studies of concept drift. However, it is restricted to particular concepts and lacks flexibility, which makes it impractical for a whole dataset: the joint probability $p(X, Y)$ can only track the changes of specific pairs $\langle X, Y\rangle$, whereas a whole dataset may involve a series of ordered pairs whose probability distributions differ and can hardly be expressed by a single formula, so the joint probability $p(X, Y)$ cannot capture concept drift globally.

3) Attribute significance. Detecting concept drift through changes of attribute significance was newly proposed in [22]. Attribute significance is an internal property of the data and belongs to the intension of a concept. Detecting drift with the attribute significance criterion analyzes how significance varies with the data, without depending directly on the extension of the concept; moreover, performing feature selection by parallel reducts unifies the comparison baseline and avoids the negative influence that differing feature selections would have on drift detection. The attribute significance criterion is therefore both deterministic and flexible; it can serve both experimental detection and theoretical analysis, combining the advantages of the classification accuracy criterion and the joint probability criterion. The significance-based drift criterion defined in [22] detects the concept drift caused by changes of each individual attribute's significance after parallel reduction, but the change of a single attribute's significance cannot fully reflect concept drift in a data stream. For a better comparison with classification accuracy and the joint probability distribution, this paper defines and adopts attribute dependency and conditional entropy as concept drift detection criteria.

3 Comparison and Analysis of Attribute Reduction and Concept Drift

Rough set theory offers many kinds of attribute reduction, among which positive-region-based reduction and mutual-information-based reduction are the most common. Viewed from the angle of concept drift, each reduction criterion seeks a minimal attribute subset under which, within the same dataset, no concept drift occurs with respect to that criterion. The criteria, however, are not necessarily compatible; one criterion may violate another, that is, a reduct that is drift-free under one criterion may exhibit concept drift under another. In fact, mutual-information-based reduction is equivalent to conditional-entropy-based reduction, but conditional entropy is closer to joint probability than mutual information is, so this paper takes positive-region-based reduction and conditional-entropy-based reduction as examples.

Definition 5 [15-17]. Given a decision system $DS=(U, A, d)$, $B\subseteq A$ is called a reduct of the decision system $DS$ if and only if $B$ satisfies the following two conditions:

1) $POS(DS, B, d)=POS(DS, A, d)$;

2) for any $S\subset B$, $POS(DS, S, d)\neq POS(DS, A, d)$.

Definition 6 [15-17]. Given a decision system $DS=(U, A, d)$, $B\subseteq A$ is called a reduct of the decision system $DS$ if and only if $B$ satisfies the following two conditions:

1) $\gamma(DS, B, d)=\gamma(DS, A, d)$;

2) for any $S\subset B$, $\gamma(DS, S, d)\neq \gamma(DS, A, d)$.

Definition 5 defines attribute reduction from the structure of the positive region, while Definition 6 defines it through attribute dependency; Definition 5 is easier to understand, Definition 6 easier to compute. From the viewpoint of concept drift, reduction under Definition 5 keeps the positive region of the decision system unchanged and guarantees the conditional probability $p(Y|X)=1$ on the positive region, i.e., no real concept drift happens there; reduction under Definition 6 keeps the attribute dependency of the decision system free of concept drift. Definition 5 fits only a single decision table and does not generalize, whereas Definition 6 can be generalized to a family of decision subtables and is suited to detecting concept drift there. Accordingly, the dependency-based concept drift criterion on a decision subtable family is defined as follows:

Definition 7. In a decision subtable family $F$, the concept drift between $DT_{i}, DT_{j}\in F$ with respect to attribute dependency is defined as

 $DACD(DT_{i}, DT_{j})=|\gamma(DT_{i}, A, d)-\gamma(DT_{j}, A, d)|$

Definition 8 [24-25]. Given a decision system $DS=(U, A, d)$, let the partitions induced on the universe $U$ by $A$ and $\{d\}$ be $X$ and $Y$ respectively, where $X=U/A=\{X_{1}, X_{2}, \cdots, X_{N}\}$, $Y=U/\{d\}=\{Y_{1}, Y_{2}, \cdots, Y_{M}\}$, $p(X_{i})={|X_{i}|}/{|U|}$, $p(Y_{j})={|Y_{j}|}/{|U|}$, $p(X_{i}, Y_{j})={|X_{i}\cap Y_{j}|}/{|U|}$, $i=1, 2, \cdots, N$, $j=1, 2, \cdots, M$. The information entropies of $A$ and $\{d\}$ are defined respectively as

 \begin{align*} &H(DS, A)=-\sum\limits_{i=1}^{N}p(X_{i})lbp(X_{i})\\ &H(DS, \{d\})=-\sum\limits_{j=1}^{M}p(Y_{j})lbp(Y_{j})\end{align*}

The conditional entropy of $\{d\}$ with respect to $A$ is defined as

 \begin{align*}&H(DS, \{d\}|A)=\\&\qquad-\sum\limits_{i=1}^{N}p(X_{i})\sum\limits_{j=1}^{M}p(Y_{j}|X_{i})lbp(Y_{j}|X_{i}) \end{align*}

A subset $B\subseteq A$ is a conditional-entropy-based reduct of the decision system $DS$ when it satisfies the following two conditions:

1) $H(DS, \{d\}|A)=H(DS, \{d\}|B)$;

2) for any $b\in B$, $H(DS, \{d\}|B) < H(DS$, $\{d\}|B-\{b\}).$

In a decision subtable family $F$, the concept drift between $DT_{i}, DT_{j}\in F$ with respect to conditional entropy is defined as

 \begin{align*}&IECD(DT_{i}, DT_{j})=\\ &\qquad |H(DT_{i}, \{d\}|B)-H(DT_{j}, \{d\}|B)|\end{align*}

If $p(X, Y)=p(X', Y')$ while $p(X)=kp(X')$, then $p(Y|X)=p(Y'|X')/k$, and hence

 \begin{align*}&p(X)p(Y|X)lbp(Y|X)=\\&\qquad kp(X')\frac{1}{k}p(Y'|X')(lbp(Y'|X')-lbk)=\\&\qquad p(X')p(Y'|X')(lbp(Y'|X')-lbk)\end{align*}

1) When $p(X)$ is unchanged, i.e., $p(X)=p(X')$:

a) if $p(X, Y)=p(X', Y')$, then necessarily

 \begin{align*}&p(X)p(Y|X)lbp(Y|X)=\\&\qquad p(X')p(Y'|X')lbp(Y'|X') \end{align*}

b) if $p(X, Y)\neq p(X', Y')$, then necessarily

 $p(X)p(Y|X)lbp(Y|X) \ne p(X')p(Y'|X')lbp(Y'|X')$

2) When $p(Y|X)$ is unchanged, i.e., $p(Y|X)=p(Y'|X')$:

a) if $p(X, Y)=p(X', Y')$, then necessarily

 \begin{align*}&p(X)p(Y|X)lbp(Y|X)=\\&\qquad p(X')p(Y'|X')lbp(Y'|X') \end{align*}

b) if $p(X, Y)\neq p(X', Y')$, then necessarily

 \begin{align*}&p(X)p(Y|X)lbp(Y|X)\neq\\&\qquad p(X')p(Y'|X')lbp(Y'|X') \end{align*}

As an example, consider two decision subtables $DT_{1}$ and $DT_{2}$ with

 \begin{align*} &B=\{a, b\}\\&DT_{1}/\{d\}=\{\{x_{1}, x_{2}, x_{4}\}, \{x_{3}\}\} \\&DT_{1}/B=\{\{x_{1}\}, \{x_{2}, x_{4}\}, \{x_{3}\}\}\\&DT_{2}/\{d\}=\{\{y_{1}, y_{4}, y_{5}\}, \{y_{2}, y_{3}, y_{6}\}\}\\&DT_{2}/B=\{\{y_{1}, y_{4}\}, \{y_{2}, y_{3}\}, \{y_{5}, y_{6}\}\} \end{align*}

 \begin{align*} &\gamma(DT_{1}, B, d)=1\\&\gamma(DT_{2}, B, d)=\frac{2}{3} \\& DACD(DT_{1}, DT_{2})=|\gamma(DT_{1}, B, d)\, -\\&\qquad\gamma(DT_{2}, B, d)|=\frac{1}{3} \\& H(DT_{1}, \{d\}|B)=\\&\qquad-\sum\limits_{i=1}^{3}p(X_{i}) \sum\limits_{j=1}^{2}p(Y_{j}|X_{i})lbp (Y_{j}|X_{i})=0\\& H(DT_{2}, \{d\}|B)=\\&\qquad-\sum\limits_{i=1}^{3}p(X_{i}) \sum\limits_{j=1}^{2}p(Y_{j}|X_{i})lbp(Y_{j}|X_{i})=\frac{1}{3} \\&IECD(DT_{1}, DT_{2}) =|H(DT_{1}, \{d\}|B)\, - \\&\qquad H(DT_{2}, \{d\}|B)|=\frac{1}{3}\end{align*}
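The numbers above can be verified mechanically. The following Python sketch recomputes $\gamma$, the conditional entropy, DACD, and IECD from the partitions of the example; the list encoding of the two subtables, with one (B-block label, decision) pair per object, is an assumption for illustration:

```python
from collections import defaultdict
from math import log2

def gamma(table):
    """Dependency degree gamma(DT, B, d) = |POS(B, d)| / |U|."""
    blocks = defaultdict(list)
    for sig, d in table:
        blocks[sig].append(d)
    pos = sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)
    return pos / len(table)

def cond_entropy(table):
    """H(DT, {d}|B) = -sum_i p(X_i) sum_j p(Y_j|X_i) lb p(Y_j|X_i)."""
    blocks = defaultdict(list)
    for sig, d in table:
        blocks[sig].append(d)
    n, h = len(table), 0.0
    for ds in blocks.values():
        px = len(ds) / n
        for y in set(ds):
            pyx = ds.count(y) / len(ds)
            h -= px * pyx * log2(pyx)
    return h

# DT1: B-blocks {x1}, {x2, x4}, {x3}; decision classes {x1, x2, x4} and {x3}
DT1 = [("x1", 0), ("x24", 0), ("x24", 0), ("x3", 1)]
# DT2: B-blocks {y1, y4}, {y2, y3}, {y5, y6}; classes {y1, y4, y5} and {y2, y3, y6}
DT2 = [("y14", 0), ("y14", 0), ("y23", 1), ("y23", 1), ("y56", 0), ("y56", 1)]

DACD = abs(gamma(DT1) - gamma(DT2))                 # |1 - 2/3| = 1/3
IECD = abs(cond_entropy(DT1) - cond_entropy(DT2))   # |0 - 1/3| = 1/3
```

Only the inconsistent block $\{y_{5}, y_{6}\}$ of $DT_{2}$ contributes: it leaves the positive region (lowering $\gamma$ to $2/3$) and adds $\frac{1}{3}\cdot 1$ bit of conditional entropy, so both drift measures equal $1/3$.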

4 Experimental Results and Analysis

Figure 1 The size of the sliding windows is 10 000, without overlap between adjacent windows
Figure 2 The size of the sliding windows is 5 000, without overlap between adjacent windows
Figure 3 The size of the sliding windows is 5 000, with 10% overlap between adjacent windows
Figure 4 The size of the sliding windows is 10 000, with 10% overlap between adjacent windows

5 Conclusions and Prospects

References
1 Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Madison, Wisconsin, USA: ACM, 2002. 1-16
2 Wang Tao, Li Zhou-Jun, Yan Yue-Jin, Chen Huo-Wang. A survey of classification of data streams. Journal of Computer Research and Development, 2007, 44(11): 1809-1815
3 Xu Wen-Hua, Qin Zheng, Chang Yang. Semi-supervised learning based ensemble classifier for stream data. Pattern Recognition and Artificial Intelligence, 2012, 25(2): 292-299. DOI:10.3969/j.issn.1003-6059.2012.02.016
4 Lu N, Zhang G Q, Lu J. Concept drift detection via competence models. Artificial Intelligence, 2014, 209: 11-28. DOI:10.1016/j.artint.2014.01.001
5 Domingos P, Hulten G. Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, Massachusetts, USA: ACM, 2000. 71-80
6 Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM, 2001. 97-106
7 Sun Yue, Mao Guo-Jun, Liu Xu, Liu Chun-Nian. Mining concept drifts from data streams based on multi-classifiers. Acta Automatica Sinica, 2008, 34(1): 93-97
8 Wang H X, Fan W, Yu P S, Han J W. Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, D.C., USA: ACM, 2003. 226-235
9 Li P P, Wu X D, Hu X G, Wang H. Learning concept-drifting data streams with random ensemble decision trees. Neurocomputing, 2015, 166: 68-83. DOI:10.1016/j.neucom.2015.04.024
10 Zhang Ming-Wei, Zhu Zhi-Liang, Liu Ying, Zhang Bin. Distributed assistant associative classification algorithm in big data environment. Journal of Software, 2015, 26(11): 2795-2810
11 Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R. New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM, 2009. 139-148
12 Gao J, Fan W, Han J W, Yu P S. A general framework for mining concept-drifting data streams with skewed distributions. In: Proceedings of the 2007 SIAM International Conference on Data Mining. Minnesota, USA: SIAM, 2007
13 Widmer G, Kubat M. Effective learning in dynamic environments by explicit context tracking. In: Proceedings of the 1993 European Conference on Machine Learning. Vienna, Austria: Springer, 1993. 227-243
14 Chai Yu-Mei, Zhou Chi, Wang Li-Ming. Detecting concept drift and classifying data streams. Journal of Chinese Computer Systems, 2011, 32(3): 421-425
15 Pawlak Z. Rough sets. International Journal of Computer and Information Sciences, 1982, 11(5): 341-356. DOI:10.1007/BF01001956
16 Zhang Qing-Hua, Xue Yu-Bin, Wang Guo-Yin. Optimal approximation sets of rough sets. Journal of Software, 2016, 27(2): 295-308
17 Wang Guo-Yin. Rough Set Theory and Knowledge Acquisition. Xi'an: Xi'an Jiaotong University Press, 2001
18 Deng Da-Yong, Chen Lin. Parallel reducts and F-rough sets. In: Cloud Model and Granular Computing. Beijing: Science Press, 2012. 210-228
19 Chen Lin. Parallel Reducts and Decision in Various Levels of Granularity [Master thesis], Zhejiang Normal University, China, 2013
20 Cao F Y, Huang J Z. A concept-drifting detection algorithm for categorical evolving data. In: Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Gold Coast, Australia: Springer, 2013. 485-496
21 Deng Da-Yong, Pei Ming-Hua, Huang Hou-Kuan. The F-rough sets approaches to the measures of concept drift. Journal of Zhejiang Normal University (Natural Sciences), 2013, 36(3): 303-308. DOI:10.3969/j.issn.1001-5051.2013.03.012
22 Deng Da-Yong, Xu Xiao-Yu, Huang Hou-Kuan. Concept drifting detection for categorical evolving data based on parallel reducts. Journal of Computer Research and Development, 2015, 52(5): 1071-1079
23 Miao Duo-Qian, Hu Gui-Rong. A heuristic algorithm for reduction of knowledge. Journal of Computer Research and Development, 1999, 36(6): 681-684
24 Wang Guo-Yin, Yu Hong, Yang Da-Chun. Decision table reduction based on conditional information entropy. Chinese Journal of Computers, 2002, 25(7): 759-766. DOI:10.3321/j.issn:0254-4164.2002.07.013
25 Yang Ming. Approximate reduction based on conditional information entropy in decision tables. Acta Electronica Sinica, 2007, 35(11): 2156-2160. DOI:10.3321/j.issn:0372-2112.2007.11.023
26 Deng Da-Yong, Xue Huan-Huan, Miao Duo-Qian, Lu Ke-Wen. Study on criteria of attribute reduction and information loss of attribute reduction. Acta Electronica Sinica, 2017, 45(2): 401-407. DOI:10.3969/j.issn.0372-2112.2017.02.019