文章快速检索 高级检索

1. 南开大学 计算机与控制工程学院，天津 300071;
2. 北京大学 数学科学学院，北京 100871;
3. 武警指挥学院 军事教育训练系, 天津 300250

The parallel implementation of MapReduce for the Bayesian algorithm to detect botnets
SHAO Xiuli1 , LIU Yiwei2 , GENG Meijie1 , HAN Jianbin3
1. College of Computer and Control Engineering, Nankai University, Tianjin 300071, China ;
2. School of Mathematical Sciences, Peking University, Beijing 100871, China ;
3. Department of Education and Training, Armed Police Command College, Tianjin 300250, China
Abstract: The botnet network poses a serious threat to the Internet security, and the accuracy of the botnet detection method is low, while the Bayesian algorithm has high accuracy. This paper puts forward a Bayesian algorithm with the mechanism of MapReduce based on the Hadoop platform to achieve botnet detection. Taking the host-pairs as analysis objects, this method extracts the traffic features of communications between two hosts, takes these features as input and trains the Bayesian classifier through parallel calculations of the prior probability and condition probability on the stage of the Bayesian algorithm training to learn to recognize botnet traffic. By using the Bayesian classifier trained on the stage of the Bayesian algorithm training and parallel calculations of the posterior probability on the stage of detecting, the detection of botnets can be achieved. Experiments show that the method for detecting botnets is effective and the correct detection rate is more than 90%. The efficiency of this method is greatly improved as compared with detecting the single Bayesian algorithm of the botnets.
Key words: botnets     botnet detection     Bayesian algorithm     Hadoop     MapReduce     flow

1 检测僵尸网络的计算架构

 图 1 基于MapReduce的检测僵尸网络 Fig. 1 Botnet detection based on MapReduce

2 基于MapReduce的贝叶斯算法

 图 2 基于MapReduce检测僵尸网络的贝叶斯算法 Fig. 2 The process of Bayesian algorithm to detect botnets based on MapReduce
2.1 MapReduce1的设计

map(Key、Value)

{StringTokenizer itr=new StringTokenizer(value.toString());

String[]temp=new String[9];

While(itr.hasMoreTokens())

{temp[i]=itr.nextToken(); //属性字符串数组

i++;

}

Sum++; //网络数据总行数

If(temp[8].equals(0))//类标签0为正常网络

Context.write(“正常网络”，1);

Else//表示为僵尸网络

Context.write(“僵尸网络”，1);

}

Reduce1接收到的信息为〈Key2, Value2〉。Reduce1的任务是对Key2相同的中间结果计数，若Key2值为“正常网络”，统计的Value2的行数为正常网络个数，并以成员变量sum_yes存储; 若Key2值为“僵尸网络”，统计的Value2的行数为僵尸网络个数，并以成员变量sum_no存储。并分别用sum_yes/sum、sum_no/sum计算得到先验概率P(n)和P(b)，并以成员变量sum_yes_p和sum_no_p存储。Reduce1过程伪代码如下。

reduce(Key，Value)

{for(IntWritable val:Value)

Sum1+=val.get(); //若Key为“正常网络”，统计的是正常网络数据行数；否则为僵尸网络数据行数

If(Key.equals(“正常网络”))

{sum_yes=sum1;//存储正常网络数据行数

sum_yes_p=sum_yes/sum; //正常网络先验概率

}

Else

{sum_no=sum1;//存储僵尸网络数据行数

sum_no_p=sum_no/sum; //僵尸网络先验概率

}

Context.write(key, (float)(sum1/sum)); //输出先验概率

}

2.2 MapReduce2的设计

Map2接收到的信息与Map1相同，是训练数据被Hadoop处理形成的〈Key, Value〉对, 形式为〈该行起始位置相对于文件起始位置的偏移量, 文本文件中的一行信息〉的信息。MapReduce2计算贝叶斯的条件概率，需用到Value的6个属性列及类标签值。因此Map2将每行Value数据按空格分割成字符串数组，取出数组的第3~9项w1, w2, …, w6, 以及类标签值。首先判断类标签值是否为“0”，然后判断各属性是否在各自阈值内。若标签值为“0”且属性值在阈值内，输出中间结果〈Key3, Value3〉对的形式为〈“wi|n”，1〉；若标签值为“0”且属性值在阈值外，输出中间结果〈Key3, Value3〉对的形式为〈“wi|n”，1〉；若标签值为“1”且属性值在阈值内，输出中间结果〈Key3, Value3〉对的形式为〈“wi|b”，1〉；若标签值为“1”且属性值在阈值外，输出中间结果〈Key3, Value3〉对的形式为〈“wi|b”，1〉。Map2过程伪代码如下。

map(Key，Value)

{StringTokenizer itr=new StringTokenizer(value.toString());

String[] temp=new String[9];

While(itr.hasMoreTokens())

{temp[i]=itr.nextToken(); //属性字符串数组

i++;

}

If(Float.parsefloat(temp[2])>140 && Float.parsefloat(temp[2]) < 150)

If(temp[8].equals(“0”))

Context.write(“w1|n”, 1);

Else

Context.write(“w1|b”, 1);

Else

If(temp[8].equals(“0”))

Context.write(“w1|n”, 1);

Else

Context.write(“w1|b”, 1);

//其他5个属性列训练过程同上。

}

Reduce2接收到的信息为〈Key4, Value4〉。Reduce2的任务是对Key4相同的中间结果计数，若Key4值为“wi|n”，统计的Value4的行数sum1为wi在阈值内且属于正常网络个数，利用sum1/sum_yes求得条件概率P(wi|n), 并以成员变量wi_in_nomal存储；若Key4值为“wi|n”，统计的Value4的行数sum1为wi在阈值外且属于正常网络个数，利用sum1/sum_yes求得条件概率P(wi|n), 并以成员变量wi_out_nomal存储；若Key4值为“wi|b”，统计的Value4的行数sum1为wi在阈值内且属于僵尸网络个数，利用sum1/sum_no求得条件概率P(wi|b), 并以成员变量wi_in_unnomal存储；若Key4值为“wi|僵尸”，统计的Value4的行数sum1为wi在阈值外且属于僵尸网络个数，利用sum1/sum_no求得条件概率P(wi|b), 并以成员变量wi_out_unnomal存储。Reduce2过程伪代码如下。

Reduce(Key、Value)

{for(IntWritable val:value)

{

Sum1+=val.get(); //key为不同的条件字符串，则统计的是满足不同条件的网络数据行数

}

//计算Key字符串长度length

//截取条件字符串Key最后2个字符msg_temp

If(msg_temp.equals(“n”))//若属于正常网络

{Float result=sum1/sum_yes; //计算条件概率

If(Key.equals(“w1|n”))

w1_in_nomal=result; //成员变量存储属性列1在阈值内且属于正常网络条件概率

Else if(Key.equals(“wi|n”))

w1_out_normal=result; //成员变量存储属性列1在阈值外且属于正常网络条件概率

//其他5个属性列计算在阈值内、外属于正常网络的条件概率同上

}

Else

{Float result=sum1/sum_no; //计算条件概率

If(Key.equals(“w1|b”))

w1_in_unnormal=result; //成员变量存储属性列1在阈值内且属于僵尸网络条件概率

Else if(Key.equals(“wi|b”))

w1_out_unnormal=result; //成员变量存储属性列1在阈值内且属于僵尸网络条件概率

//其他5个属性列计算在阈值内、外属于僵尸网络的条件概率同上

}

Context.write(Key, result); //输出条件概率

}

2.3 MapReduce3设计

Map3接收到的检测数据是被Hadoop处理形成的〈Key, Value〉对, 形式为〈该行起始位置相对于文件起始位置的偏移量, 文本文件中的一行信息〉的信息。MapReduce3要对6列属性全部检测，需用到Value的6个属性列。所以Map3将每行Value数据按空格分割成字符串数组，取出数组的第3~8项，分别为TCP数据流、时间间隔平均值、时间间隔变化、数据包字节数、数据包个数平均值、持续时间平均值。判断6个属性列的值是否在各自阈值内，若在阈值内，分别利用存储条件概率的成员变量wi_in_nomal、wi_in_unnormal计算后验概率；若在阈值外，分别利用存储条件概率的成员变量wi_out_nomal、wi_out_unnormal计算后验概率。并将每行网络数据的正常网络后验概率P(n|d)和僵尸网络后验概率P(b|d)一起输出。输出结果〈Key5, Value5〉对的形式为〈数据所在行数，P(n|d) P(b|d)〉。Map3过程伪代码如下。

map(Key，Value)

{StringTokenizer itr=new StringTokenizer(value.toString());

String[] temp=new String[9];

While(itr.hasMoreTokens())

{temp[i]=itr.nextToken(); //属性字符串数组

i++;

}

P1=sum_yes_p; P2=sum_no_p;

If(Float.parsefloat(temp[2])>140 && Float.parsefloat(temp[2]) < 150)

{

P1=P1*w1_in_nomal;

P2=P2*w1_in_unnormal;

}

Else if(Float.parsefloat(temp[2]) < 140‖Float.parsefloat(temp[2])>150)

{

P1=P1*w1_out_normal;

P2=P2*w1_out_unnormal;

}

//其他5个属性列检测过程同上。

Line++; //统计所在行数

Context.write(Line, P1 P2);

}

Reduce3接收到的信息为〈Key5, Value5〉。Reduce5的任务是逐行比较网络数据的P(b|d)和P(n|d)的大小。若P(n|d)>P(b|d), 判断该行网络数据为正常网络数据；否则为僵尸网络数据。Reduce3伪代码如下所示。

reduce(Key，Value)

{StringTokenizer itr=new StringTokenizer(value.toString());

String[] temp=new String[2];

While(itr.hasMoreTokens())

{temp[i]=itr.nextToken(); //正常网络后验概率与僵尸网络后验概率

i++;

}

If(Float.parseFloat(temp[0])>Float.parseFloat(temp[1]))//比较

Context.write(Key, “正常网络”); //判断

Else

Context.write(Key, “僵尸网络”); //判断

}

3 实验结果与分析

 图 3 检测不同属性列个数的正确率比较 Fig. 3 Comparison of correct rates among detecting different numbers of attribute columns

 阈值范围 图 4 改变阈值对正确率影响 Fig. 4 Influence of threshold changes on accuracy

 图 5 串行与MapReduce化贝叶斯算法检测僵尸网络效率比较 Fig. 5 Efficiency Comparison of Serial and MapReduce Bayesian algorithms to detect botnets
4 结束语

 [1] JIANG Hongli, SHAO Xiuli. Detecting P2P botnets by discovering flow dependency in C & C traffic[J]. Peer-to-Peer Networking and Applications , 2012, 5 (2) : 1-12 [2] 王威, 方滨兴, 崔翔. 基于终端行为特征的IRC僵尸网络检测[J]. 计算机学报 , 2009, 32 (10) : 1980-1988 WANG Wei, FANG Binxing, CUI Xiang. IRC botnet detection based on host behavior[J]. Chinese Journal of Computers , 2009, 32 (10) : 1980-1988 [3] 蒋鸿玲, 邵秀丽. 基于神经网络的僵尸网络检测方法[J]. 智能系统学报 , 2013, 8 (2) : 113-118 JIANG Honglin, SHAO Xiuli. Botnet detection algorithm based on neural network[J]. CAAI Transactions on Intelligent Systems , 2013, 8 (2) : 113-118 [4] DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large cluster[J]. Communications of the ACM , 2005, 51 (1) : 107-113 [5] 陶永才, 薛正元, 石磊. 基于MapReduce的贝叶斯垃圾邮件过滤机制[J]. 计算机应用 , 2011, 31 (9) : 2412-2416 TAO Yongcai, XUE Zhengyuan, SHI Lei. MapReduce-based Bayesian anti-spam filtering mechanism[J]. Journal of Computer Applications , 2011, 31 (9) : 2412-2416 [6] 杜跃进, 崔翔. 僵尸网络及其启发[J]. 中国数据通信 , 2005, 7 (5) : 9-13 DU Yuejin, CUI Xiang. Botnets and its enlightment[J]. China Data Communication , 2005, 7 (5) : 9-13 [7] VALIANT L G. A bridging model for parallel computation[J]. Communications of the ACM , 1990, 33 (8) : 103-111 [8] 李晓桢, 程佳, 胡军. 基于聚类分析的僵尸网络识别系统[J]. 计算机系统应用 , 2009 (8) : 130-135 LI Xiaozhen, CHENG Jia, HU Jun. Botnet recognition system based on the clustering technology[J]. Computer System and Application , 2009 (8) : 130-135 [9] STONEBRAKER M, ABADI D J, DEWITT D J, et al. MapReduce and parallel DBMSs: friends or foes?[J]. Communication of the ACM , 2010, 53 (1) : 64-71 [10] 张鹏, 唐世渭. 朴素贝叶斯分类中的隐私保护方法研究[J]. 计算机学报 , 2007, 30 (8) : 1267-1276 ZHANG Peng, TANG Shiwei. Privacy preserving naive Bayesian classification[J]. Chinese Journal of Computers , 2007, 30 (8) : 1267-1276
DOI: 10.3969/j.issn.1673-4785.201305011

0

#### 文章信息

SHAO Xiuli, LIU Yiwei, GENG Meijie, HAN Jianbin

The parallel implementation of MapReduce for the Bayesian algorithm to detect botnets

CAAI Transactions on Intelligent Systems, 2014, 9(1): 26-33
http://dx.doi.org/10.3969/j.issn.1673-4785.201305011