Interval data analysis method based on empirical distribution

Interval data analysis based on empirical distribution function
WANG Huiwen, WANG Shengshuai, HUANG Lele, WANG Cheng
School of Economics and Management, Beijing University of Aeronautics and Astronautics, Beijing 100191, China
Abstract: Most of the literature on interval data analysis assumes that the data are uniformly distributed on a closed or compact interval, an assumption that real data rarely satisfy. To address this problem, the empirical cumulative distribution function (ECDF) and the kernel estimate of the cumulative distribution function were studied, under the assumption that the data come from some continuous distribution. Based on the ECDF and the kernel estimate, a transformation was designed that yields new data which are, in theory, uniformly distributed. A hypothesis test was then applied to check whether the transformed data follow a uniform distribution. If the null hypothesis is not rejected, traditional interval data analysis methods can be applied to the transformed data. The transformation and the test are both intended to ensure that the transformed data come from a uniform distribution. Both simulations and a real data example show that results based on data transformed by the ECDF and by kernel estimation are more reasonable and have stronger explanatory power.
Key words: interval data     uniform distribution     kernel estimation     empirical distribution     hypothesis test

1 Transformation based on the empirical distribution function

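Let X be a random variable that follows some continuous distribution, and let (x1, x2, …, xn) be an observed sample. The usual way to convert such a sample into interval data is to take its maximum and minimum as the two endpoints of the interval and to assume that the remaining sample points are uniformly distributed on this interval [5]. This assumption is clearly too strict: if the sample follows any other distribution, the assumption fails and the results of the subsequent analysis become invalid.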

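Let F(t) denote the distribution function of X. The empirical distribution function Fn(t) is defined as

$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le t),$$

where I(·) is the indicator function.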
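As a minimal sketch of the empirical distribution function and of the transformation named in the section title, the Python snippet below evaluates Fn and maps each observation x_i to u_i = Fn(x_i). The function names `ecdf` and `ecdf_transform` are illustrative assumptions and are not taken from the paper, whose exact transformation is not reproduced on this page.

```python
from bisect import bisect_right
import random


def ecdf(sample):
    """Return F_n, the empirical distribution function of `sample`."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(t):
        # bisect_right counts the observations x_i with x_i <= t
        return bisect_right(xs, t) / n

    return F_n


def ecdf_transform(sample):
    """Map each x_i to u_i = F_n(x_i), a value in (0, 1] (hypothetical helper)."""
    F_n = ecdf(sample)
    return [F_n(x) for x in sample]


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(10)]  # N(0,1) draws
u = ecdf_transform(sample)
print(sorted(u))
```

For a sample without ties, the transformed values fall on the grid 1/n, 2/n, …, 1 regardless of the underlying continuous distribution, so the transformed data are spread evenly over (0, 1].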

2 Hypothesis test after transformation
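The abstract states that the transformed data are tested for uniformity. The sketch below shows one such test, the one-sample Kolmogorov-Smirnov test [15] of H0: u ~ U(0,1), with a p-value from the asymptotic Kolmogorov distribution. The choice of this particular test, the function name `ks_uniform`, and the example data are assumptions made for illustration; the test statistic actually used by the authors is not shown on this page.

```python
import math


def ks_uniform(u):
    """Kolmogorov-Smirnov test of H0: u ~ U(0, 1) (hypothetical helper).

    Returns (D, p), where D = sup_t |F_n(t) - t| and p is the asymptotic
    p-value, which is only approximate for small samples.
    """
    xs = sorted(min(max(x, 0.0), 1.0) for x in u)  # clamp to [0, 1]
    n = len(xs)
    # The supremum is attained at, or just before, a sample point.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    d = max(d_plus, d_minus)
    z = math.sqrt(n) * d
    if z < 0.2:
        p = 1.0  # the series below converges slowly here; the limit is ~1
    else:
        # Q(z) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2)
        p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                      for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


u_even = [(i + 0.5) / 100 for i in range(100)]  # evenly spread on (0, 1)
u_skew = [x ** 3 for x in u_even]               # mass piled up near 0
d_even, p_even = ks_uniform(u_even)
d_skew, p_skew = ks_uniform(u_skew)
print(d_even, p_even)
print(d_skew, p_skew)
```

In this illustration the evenly spread values give a large p-value, so uniformity is not rejected, while the skewed values give a p-value far below 0.05.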

3 Interval data analysis based on transformed data

4 Simulation studies

4.1 Simulation 1

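| Sample size | N(0,1) | Exp(2) | Cauchy | U(0,1) | U(5,10) |
|---|---|---|---|---|---|
| 5 | 0.115 | 0.086 | 0.052 | 0.865 | 0.878 |
| 10 | 0.035 | 0.012 | 0.004 | 0.934 | 0.925 |
| 20 | 0.006 | 0.002 | 0 | 0.965 | 0.951 |
| 40 | 0.002 | 0 | 0 | 0.949 | 0.948 |
| 50 | 0 | 0 | 0 | 0.957 | 0.955 |
| 100 | 0 | 0 | 0 | 0.972 | 0.952 |
| 200 | 0 | 0 | 0 | 0.956 | 0.947 |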

4.2 Simulation 2

Fig. 1 Simulation results for estimating the cumulative distribution function by empirical distribution and kernel method
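Fig. 1 compares the empirical distribution function with a kernel estimate of the distribution function. A minimal sketch of a kernel estimate is given below, assuming a Gaussian kernel and the normal-reference bandwidth h = 1.06 s n^(-1/5) discussed in [12]. Both choices, and the name `kernel_cdf`, are assumptions for illustration; the kernel and bandwidth used by the authors are not stated on this page.

```python
import math
import statistics


def kernel_cdf(sample, bandwidth=None):
    """Return a smooth estimate of F built from a Gaussian kernel.

    F_hat(t) = (1/n) * sum_i Phi((t - x_i) / h), with Phi the standard
    normal distribution function. Hypothetical helper, not from the paper.
    """
    xs = list(sample)
    n = len(xs)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default)
        bandwidth = 1.06 * statistics.stdev(xs) * n ** (-1 / 5)
    h = bandwidth

    def F_hat(t):
        # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
        return sum(0.5 * (1.0 + math.erf((t - x) / (h * math.sqrt(2.0))))
                   for x in xs) / n

    return F_hat


F_hat = kernel_cdf([-1.0, 0.0, 1.0], bandwidth=0.5)
print(F_hat(-10.0), F_hat(0.0), F_hat(10.0))
```

Unlike the step-shaped empirical distribution function, this estimate is continuous and strictly increasing, so the transformed values u_i = F_hat(x_i) do not fall on a fixed grid.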

| Distribution | n = 20, ECDF | n = 20, kernel | n = 50, ECDF | n = 50, kernel | n = 100, ECDF | n = 100, kernel | n = 200, ECDF | n = 200, kernel |
|---|---|---|---|---|---|---|---|---|
| N(0,1) | 0.2476 | 0.1978 | 0.1806 | 0.1649 | 0.1714 | 0.1264 | 0.0964 | 0.0819 |
| Exp(2) | 0.1803 | 0.1680 | 0.1525 | 0.1368 | 0.0921 | 0.1094 | 0.0264 | 0.0295 |
| Cauchy | 0.6261 | 0.5520 | 0.5261 | 0.5465 | 0.4726 | 0.5859 | 0.2267 | 0.4070 |
| U(2,3) | 0.0860 | 0.0793 | 0.0576 | 0.0401 | 0.0286 | 0.0205 | 0.0119 | 0.0161 |
| U(5,10) | 0.2745 | 0.2220 | 0.2044 | 0.1958 | 0.1907 | 0.1409 | 0.1598 | 0.1538 |

5 Conclusions

References

[1] Sankararaman S, Mahadevan S. Likelihood-based representation of epistemic uncertainty due to sparse point data and/or interval data[J]. Reliability Engineering & System Safety, 2011, 96(7): 814-824.
[2] Diday E, Noirhomme-Fraiture M. Symbolic data analysis and the SODAS software[M]. London: Wiley Online Library, 2008: 81-92.
[3] Billard L. Symbolic data analysis: what is it?[M]. New York: Springer, 2006: 261-268.
[4] Diday E, Esposito F. An introduction to symbolic data analysis and the SODAS software[J]. Intelligent Data Analysis, 2003, 7(6): 583-601.
[5] Wang H W, Guan R, Wu J J. CIPCA: complete-information-based principal component analysis for interval-valued data[J]. Neurocomputing, 2012, 86: 158-169.
[6] Wang H W, Guan R, Wu J J. Linear regression of interval-valued data based on complete information in hypercubes[J]. Journal of Systems Science and Systems Engineering, 2012, 21(4): 422-442.
[7] Yue Z L. A group decision making approach based on aggregating interval data into interval-valued intuitionistic fuzzy information[J]. Applied Mathematical Modelling, 2014, 38(2): 683-698.
[8] Černý M, Hladík M. The complexity of computation and approximation of the t-ratio over one-dimensional interval data[J]. Computational Statistics and Data Analysis, 2014, 80: 26-43.
[9] Yang X J, Yan L L, Peng H, et al. Encoding words into cloud models from interval-valued data via fuzzy statistics and membership function fitting[J]. Knowledge-Based Systems, 2014, 55: 114-124.
[10] Guo J P, Chen Y, Li W H. K-means clustering of generally distributed interval symbolic data[J]. Journal of Management Sciences in China, 2013, 16(3): 21-28 (in Chinese).
[11] Gao S. The clustering analysis of generally distributed interval symbolic data[D]. Tianjin: Tianjin University, 2009 (in Chinese).
[12] Silverman B W. Density estimation for statistics and data analysis[M]. London: Chapman and Hall, 1986: 34-48.
[13] Fan J Q, Yao Q W. Nonlinear time series: nonparametric and parametric methods[M]. New York: Springer Verlag, 2003: 193-212.
[14] Marhuenda Y, Morales D, Pardo M C. Power results of tests for the uniform distribution, I-2005-09[R]. Spain: Miguel Hernandez University of Elche, 2005.
[15] Kolmogorov A N. Sulla determinazione empirica di una legge di distribuzione[J]. G Inst Ital Att, 1933, 4: 83-91.
[16] Sinclair C D, Spurr B D. Approximations to the distribution function of the Anderson-Darling test statistic[J]. Journal of the American Statistical Association, 1988, 83(404): 1190-1191.
[17] Conover W J. Practical nonparametric statistics[M]. New York: Wiley, 1999: 63-70.
[18] Zhang J. Powerful goodness-of-fit tests based on the likelihood ratio[J]. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 2002, 64(2): 281-294.

#### Article information

WANG Huiwen, WANG Shengshuai, HUANG Lele, WANG Cheng

Interval data analysis based on empirical distribution function

Journal of Beijing University of Aeronautics and Astronautics, 2015, 41(2): 193-197.
http://dx.doi.org/10.13700/j.bh.1001-5965.2014.0435