

Gene expression data feature selection with neighborhood relation
CHEN Yuming¹, WU Keshou¹, LI Xiangjun²
1. Department of Computer Science and Technology, Xiamen University of Technology, Xiamen 361024, China;
2. Department of Computer Science and Technology, Nanchang University, Nanchang 330031, China
Abstract: Selecting informative gene features is a key step in the analysis of gene expression data. Rough set theory is an effective classification tool for dealing with uncertain, inconsistent and inaccurate gene data. One limitation of rough set theory is that it lacks effective methods for processing real-valued data, whereas gene expression data are continuous, and discretization can cause information loss. This paper investigates an approach to gene feature selection based on neighborhood rough set theory. Starting from the full feature set, the approach gradually removes redundant features according to their significance and finally obtains the key features for classification. To evaluate the proposed approach, we applied it to two benchmark gene expression data sets and compared the results on several aspects of feature selection. The experimental results indicate that our algorithm is more effective at selecting highly discriminative genes for cancer classification tasks.
Key words: rough sets; neighborhood relation; gene expression data; feature selection; classification

1 Neighborhood relation

1) $D_B(x,y) \geq 0$ (non-negativity);

2) $D_B(x,y) = 0$ if and only if $x = y$;

3) $D_B(x,y) = D_B(y,x)$ (symmetry);

4) $D_B(x,y) + D_B(y,z) \geq D_B(x,z)$ (triangle inequality).

When $p=1$, the distance is called the Manhattan distance; when $p=2$, it is called the Euclidean distance.
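The two special cases named above can be illustrated with a minimal sketch. It assumes that the distance $D_B$ has the standard Minkowski form, $\left(\sum_{a \in B} |x_a - y_a|^p\right)^{1/p}$, since the formula itself does not appear in the text here. The function name `minkowski_distance` and the sample values are illustrative only.

```python
from typing import Sequence


def minkowski_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance between two samples described by the same features.

    Assumption: this is the standard Minkowski form, not a formula quoted
    from the paper.
    """
    if len(x) != len(y):
        raise ValueError("samples must have the same number of features")
    if p < 1:
        raise ValueError("p must be >= 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = [0.0, 0.0]
y = [3.0, 4.0]
manhattan = minkowski_distance(x, y, p=1)  # |3| + |4|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For this pair of points the Manhattan distance is 7 and the Euclidean distance is 5, which also shows that the choice of $p$ changes how large a neighborhood of a given radius is.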

1) $n_B^{\delta}(x) \neq \emptyset$;

2) $x \in n_B^{\delta}(x)$;

3) $y \in n_B^{\delta}(x) \Leftrightarrow x \in n_B^{\delta}(y)$;

4)
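The neighborhood properties listed above can be checked on a small example. The sketch below assumes the usual definition of a $\delta$-neighborhood, $n_B^{\delta}(x) = \{y \in U \mid D_B(x,y) \leq \delta\}$, with a Minkowski distance; that definition is not shown in the text here, and the names `distance`, `neighborhood` and the sample values are hypothetical.

```python
from typing import List, Sequence, Set


def distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance (assumed form of D_B)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def neighborhood(data: List[Sequence[float]], i: int, delta: float, p: float = 2.0) -> Set[int]:
    """Indices of all samples within distance delta of sample i (assumed definition)."""
    return {j for j, y in enumerate(data) if distance(data[i], y, p) <= delta}


# Five samples described by a single real-valued feature (made-up values).
samples = [[0.10], [0.15], [0.40], [0.42], [0.90]]
delta = 0.1
hoods = [neighborhood(samples, i, delta) for i in range(len(samples))]

# Properties 1 and 2: every neighborhood is non-empty and contains its own center.
centers_included = all(i in hoods[i] for i in range(len(samples)))
# Property 3: y is in the neighborhood of x exactly when x is in that of y.
symmetric = all((j in hoods[i]) == (i in hoods[j])
                for i in range(len(samples)) for j in range(len(samples)))
print(hoods, centers_included, symmetric)
```

With these values the samples fall into the neighborhoods {0, 1}, {0, 1}, {2, 3}, {2, 3} and {4}. A larger $\delta$ merges neighborhoods, so $\delta$ controls the granularity without any discretization of the feature values.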

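2 Gene selection method based on neighborhood relation

2.1 Neighborhood feature selection

2.2 Gene selection algorithm based on neighborhood relation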

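1) Compute the neighborhood dependency of the decision feature $D$ on the whole condition feature set $C$, denoted $\gamma_C(D)^{\delta}$.

2) $R := C$.

3) While $\gamma_R(D)^{\delta} = \gamma_C(D)^{\delta}$, repeat:

① for every $a \in R$, compute the feature significance $Sign(a,R,D)$;

② select the feature $a$ in $R$ whose significance is smallest;

③ $R := R - \{a\}$.

4) Output $R$.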
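The backward elimination described by these steps can be sketched as follows. Several details are assumptions, because their definitions do not appear in the text here: the dependency is taken as the share of samples whose $\delta$-neighborhood contains only samples of the same class, and the significance is taken as $Sign(a,R,D) = \gamma_R(D)^{\delta} - \gamma_{R-\{a\}}(D)^{\delta}$. The sketch also checks the dependency before removing a feature rather than after, so that the returned subset keeps the dependency of the full set; this is a variation on the loop as listed. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence

Sample = Sequence[float]


def dist(x: Sample, y: Sample, feats: List[int], p: float = 2.0) -> float:
    """Minkowski distance restricted to the feature indices in `feats`."""
    return sum(abs(x[a] - y[a]) ** p for a in feats) ** (1.0 / p)


def dependency(X: List[Sample], labels: List[int], feats: List[int],
               delta: float, p: float = 2.0) -> float:
    """Assumed neighborhood dependency: share of samples whose
    delta-neighborhood contains only samples with the same decision label."""
    consistent = 0
    for i, x in enumerate(X):
        if all(labels[j] == labels[i]
               for j, y in enumerate(X) if dist(x, y, feats, p) <= delta):
            consistent += 1
    return consistent / len(X)


def select_genes(X: List[Sample], labels: List[int],
                 delta: float, p: float = 2.0) -> List[int]:
    R = list(range(len(X[0])))                     # step 2: R := C
    gamma_C = dependency(X, labels, R, delta, p)   # step 1
    while len(R) > 1:                              # keep at least one feature
        gamma_R = dependency(X, labels, R, delta, p)
        gamma_without: Dict[int, float] = {
            a: dependency(X, labels, [b for b in R if b != a], delta, p) for a in R
        }
        # Assumed significance: Sign(a, R, D) = gamma_R - gamma_{R - {a}}.
        a_min = min(R, key=lambda a: gamma_R - gamma_without[a])
        # Variation: stop before a removal that would lower the dependency.
        if gamma_without[a_min] != gamma_C:
            break
        R.remove(a_min)
    return R


# Toy data: feature 0 separates the classes, feature 1 is constant,
# feature 2 is noise (all values are made up for illustration).
X = [[0.1, 0.5, 0.3], [0.2, 0.5, 0.9], [0.8, 0.5, 0.4], [0.9, 0.5, 0.8]]
labels = [0, 0, 1, 1]
selected = select_genes(X, labels, delta=0.3)
print(selected)
```

On this toy data the constant feature and the noisy feature are removed in turn and only feature 0 remains. Each pass recomputes the dependency for every candidate removal, so the cost grows quickly with the number of genes; a practical implementation would cache distances.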

3 Experimental results and analysis

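| Data set | Number of genes | Class | Number of samples |
| --- | --- | --- | --- |
| Lymphoma | 4,026 | B-cell | 42 |
| Lymphoma | 4,026 | Other type | 54 |
| Liver cancer | 1,648 | HCCs | 82 |
| Liver cancer | 1,648 | Nontumor livers | 74 |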

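| Data set | Number of genes | Number of samples | TRS algorithm | GSNRS algorithm |
| --- | --- | --- | --- | --- |
| Lymphoma | 4,026 | 96 | 7 | 6 |
| Liver cancer | 1,648 | 156 | 6 | 5 |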

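| Classifier | Lymphoma, TRS (%) | Lymphoma, GSNRS (%) | Liver cancer, TRS (%) | Liver cancer, GSNRS (%) |
| --- | --- | --- | --- | --- |
| KNN classifier | 93.6 | 94.9 | 89.1 | 91.4 |
| C5.0 classifier | 95.1 | 96.5 | 91.4 | 93.2 |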

4 Conclusion

DOI: 10.3969/j.issn.1673-4785.201307014


#### Article information

CHEN Yuming, WU Keshou, LI Xiangjun

Gene expression data feature selection with neighborhood relation

CAAI Transactions on Intelligent Systems, 2014, 9(2): 210-213
http://dx.doi.org/10.3969/j.issn.1673-4785.201307014