﻿ 多元线性回归模型的增量算法
 文章快速检索 高级检索

Incremental algorithm of multiple linear regression model
Wang Huiwen, Wei Yuan, Huang Lele
School of Economics and Management, Beijing University of Aeronautics and Astronautics, Beijing 100191, China
Abstract:With the development of computer-related technology, people can continuously obtain data faster and faster. Facing with the massive and continuously updated data sets, incremental algorithm was introduced to the popular multiple linear regression analysis. The incremental algorithm of least squares estimation model was derived based on incremental expression of cross product matrix. And further this algorithm was extended to other estimation models and test statistics. The incremental algorithm uses the information of all dataset, which can get the same results with non-incremental methods. This algorithm can save the time in reading and writing data, release the impression on transportation, and thus speed up the computation. Simulation results show that, this algorithm can improve computational efficiency and is very useful in many conditions.
Key words: linear regression model     incremental algorithm     cross product matrix     estimation     test

Syed等[4]1999提出了衡量增量算法的一个重要标准是能否正确处理概念漂移(concept drift)的问题.有效的增量算法必须保证具有一定的稳健性、计算结果的改进性,以及可恢复性.Gaber等[5]将增量算法分为了基于数据的算法和基于任务的算法两种类型.基于数据的算法主要是采用原数据的某个较优的子集,或者原数据的某种汇总形式进行下一步的增量计算,主要包括样本抽样(sampling)[6]、流数据抽样(load shedding)[7]、草图抽样(sketching)[8, 9]、概要数据结构(synopsis data structures)、聚合表示(aggregation)[10]等.这种方法必须对原数据集的结构进行检验,无法保证选取样本的均衡性.基于任务的算法主要包括近似算法(approximation algorithms)[11]、滑动窗口(sliding window)[8]、算法输出粒度(algorithm output granularity)[12]等,实现相对复杂.

 样本量 增量 1% 5% 10% 15% 20% 均值 方差 均值 方差 均值 方差 均值 方差 均值 方差 100 10.568 9 0.240 0 10.671 2 0.238 7 10.755 6 0.697 9 10.783 0 0.292 2 10.813 0 0.392 6 1 000 10.642 9 0.304 1 10.898 1 0.235 8 11.477 1 1.034 5 11.477 2 0.315 0 11.693 8 0.289 2 10 000 11.224 5 0.283 8 13.121 3 1.015 2 15.195 2 0.227 3 17.990 0 1.326 6 23.547 3 1.119 5 100 000 18.116 7 4.902 4 38.030 8 2.296 0 71.776 3 4.987 6 106.388 2 3.111 1 133.911 8 1.919 3 200 000 25.214 2 5.025 9 74.652 4 1.709 5 134.711 0 2.185 3 194.796 8 2.982 1 257.517 8 7.332 4 500 000 37.928 1 3.405 5 165.379 1 3.492 1 321.290 5 5.225 8 496.232 6 9.952 4 689.754 5 12.310 9 1 000 000 76.153 9 2.126 9 320.243 3 5.386 3 322.993 0 12.732 1 689.663 9 14.807 0 1 137.985 5 15.297 0

 图 1 回归系数的非增量算法与增量算法计算时间对比图Fig. 1 Comparison chart for the time in calculating coefficients between traditional method and incremental method

 图 2 检验的非增量算法与增量算法计算时间对比图Fig. 2 Comparison chart for the time of F-test between traditional method and incremental method

 图 3 检验的非增量算法与增量算法计算时间对比图Fig. 3 Comparison chart for the time of t-test between traditional method and incremental method
3 结 论

1) 该增量算法可以用在基于叉积阵求解的各种统计检验量的计算中,如最小二乘回归估计、t检验和F检验统计量、岭回归估计量和加权最小二乘估计等.

2) 该增量算法只需要存储原数据的叉积矩阵,可以节省大量的存储空间及数据载入时间,加快运算效率.

3) 在线性回归建模过程中,本文的增量方法运用了全部的数据信息,与使用全部数据建模具有完全相同的结果.最后,通过数据仿真实验,证明了算法的有效性.

 [1] Tomczak J M,Gonczarek A.Decision rules extraction from data stream in the presence of changing context for diabetes treatment[J].Knowledge and Information Systems,2013,34(3):521-546 Click to display the text [2] Yang L,Cao J N,Tang S J,et al.A framework for partitioning and execution of data stream applications in mobile cloud computing[C]//IEEE Fifth International Conference on Cloud Computing.Washington,DC:IEEE Computer Society,2012:794-802 Click to display the text [3] Coppock H W,Freund J E.All-or-none versus incremental learning of errorless shock escapes by the rat[J].Science,1962,135(3500):318-319 Click to display the text [4] Syed N A,Liu H,Sung K K.Handling concept drifts in incremental learning with support vector machines[C]//Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.San Diego,CA:ACM,1999:371-321 Click to display the text [5] Gaber M M,Zaslavsky A,Krishnaswamy S.Mining data streams:a review[J].ACM Sigmod Record,2005,34(2):18-26 Click to display the text [6] Domingos P,Hulten G.A general method for scaling up machinelearning algorithms and its application to clustering[C]//Proceedings of the Eighteenth International Conference on Machine Learning(ICML 2001).Williams College,Williamstown,MA:Morgan Kaufmann,2001:106-113 Click to display the text [7] Babcock B,Datar M,Motwani R.Load shedding techniques for data stream systems[C]//The 2003 Workshop on Management and Processing of Data Streams.San Diego,CA:ACM,2003 Click to display the text [8] Papapetrou O,Garofalakis M,Deligiannakis A.Sketch-based querying of distributed sliding-window data streams[J].Proceedings of the VLDB Endowment,2012,5(10):992-1003 Click to display the text [9] GAMA J.Data stream mining:the bounded rationality[J].Informatica,2013,37(1):21-25 [10] Nath S,Venkatesan R.Publicly verifiable grouped aggregation queries on outsourced data streams[C]//Data Engineering(ICDE),2013 IEEE 29th International Conference on.Washington,DC:IEEE Computer Society,2013:517-528 Click to display the text [11] Muthukrishnan S.Data streams:algorithms and applications[M].Hanover,MA:Now Publishers Inc,2005 [12] Krishnaswamy S,Gama J,Gaber M M.Mobile data stream mining:from algorithms to applications[C]//Mobile Data Management(MDM),2012 IEEE 13th International Conference on.Washington,DC:IEEE Computer Society,2012:360-363 Click to display the text [13] 肖智,王明恺,谢林林.基于支持向量机的大学生助学贷款个人信用评价[J].清华大学学报:自然科学版,2006,46(S1):1120-1124 Xiao Zhi,Wang Mingkai,Xie Linlin.Personal credit evaluation of college student loans with support vector machines[J].Journal of Tsinghua University:Science and Technology,2006,46(S1):1120-1124(in Chinese) Click to display the text [14] Babcock B,Babu S,Datar M,et al.Models and issues in data stream systems[C]//Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.Madison,WI:Association for Computing Machinery,2002:1-16 Click to display the text [15] Golab L,Ozsu M T.Data stream management [J].Synthesis Lectures on Data Management,2010,2(1):1-73 [16] 姚远.海量动态数据流分类方法研究[D].大连:大连理工大学,2013 Yao Yuan.The research on massive and dynamic data stream classification method[D].Dalian:Dalian University of Technology,2013(in Chinese) Cited By in Cnki

文章信息

Wang Huiwen, Wei Yuan, Huang Lele

Incremental algorithm of multiple linear regression model

Journal of Beijing University of Aeronautics and Astronsutics, 2014, 40(11): 1487-1491.
http://dx.doi.org/10.13700/j.bh.1001-5965.2013.0680