2. Institute of Big Data Strategic Research, Guangdong University of Technology, Guangzhou 510006, Guangdong
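In sample surveys, measurement error is the random or systematic error that arises when the observed value of a sample unit differs from its true value [1-2]. Random measurement error stems mainly from the subjectivity of interviewers and respondents during the survey; when the sample is large, such errors tend to cancel out and have little effect on the survey estimates. Systematic error, by contrast, typically comes from poor survey design, interviewers leading respondents, respondents misunderstanding the questions, or respondents being unwilling to answer truthfully. These factors affect the results systematically and bias the collected data, and the resulting error does not vanish even in large samples [3-4].
Suppose that, for the population
To assess how measurement error affects the precision of survey estimates, this paper builds a simple measurement error model in which the measurement error is a random variable: for a given sample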
${y_k} = {\theta _k} + {\varepsilon _k},$ | (1) |
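Then the joint expectation $E_{pm}(\cdot)$ and the joint variance $V_{pm}(\cdot)$ with respect to the sampling design and the measurement model can be written, respectively, as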
${E_{pm}}\left( \cdot \right) = {E_p}\left[ {{E_m}\left( { \cdot \left| s \right.} \right)} \right],$ | (2) |
where the sampling design
${V_{pm}}\left( \cdot \right) = {E_p}\left[ {{V_m}\left( { \cdot \left| s \right.} \right)} \right] + {V_p}\left[ {{E_m}\left( { \cdot \left| s \right.} \right)} \right],$ | (3) |
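where
The model implies that, for any given sample
To use the measurement error model effectively for identifying measurement error effects, these effects need to be defined precisely. For a given probability sample
To obtain the population total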
${\rm{MS}}{{\rm{E}}_{pm}}\left( {\hat t} \right) = {E_{pm}}\left[ {{{\left( {\hat t - {t_\theta }} \right)}^2}} \right].$ | (4) |
To better understand how measurement error affects
${\rm{MS}}{{\rm{E}}_{pm}}\left( {{{\hat t}_\pi }} \right) = {V_{pm}}\left( {{{\hat t}_\pi }} \right) + {\left[ {{B_{pm}}\left( {{{\hat t}_\pi }} \right)} \right]^2}.$ | (5) |
The variance, the first term on the right-hand side of Eq. (5), can be decomposed further as
$\begin{split}{V_{pm}}&({\hat t_\pi }) = {E_{pm}}\{ {[{\hat t_\pi } - {E_{pm}}({\hat t_\pi })]^2}\} =\\ &{E_p}[{V_m}({\hat t_\pi }|s)] + {V_p}[{E_m}({\hat t_\pi }|s)] = {V_1} + {V_2}.\end{split}$ | (6) |
$\begin{split}{V_1} = & {E_p}[{V_m}({\hat t_\pi }|s)] = {E_p}\left( {\sum {\sum\nolimits_s {{\sigma _{kl}}/({\pi _k}{\pi _l})} } } \right) = \\ & \sum {\sum\nolimits_U {[{\pi _{kl}}/({\pi _k}{\pi _l})]{\sigma _{kl}}} } = \\ & \sum\nolimits_U {\sigma _k^2/{\pi _k}} + \sum\limits_{k \ne l} {\sum\nolimits_U {[{\pi _{kl}}/({\pi _k}{\pi _l})]{\sigma _{kl}}} } = \\ & {V_{11}} + {V_{12}}.\end{split}$ | (7) |
where the sample
$\begin{aligned}&{V_{11}} = {N^2}{\sigma ^2}/n,\\&{V_{12}} = {N^2}\left( {n - 1} \right)\rho {\sigma ^2}/n,\\&{V_1} = {V_{11}} + {V_{12}} = {N^2}\left[ {1 + \left( {n - 1} \right)\rho } \right]{\sigma ^2}/n.\end{aligned}$ |
where the average variance
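As a quick numerical check of the closed forms for $V_{11}$ and $V_{12}$ under simple random sampling without replacement, the sketch below evaluates the general double sum in Eq. (7) and compares it with $N^2\sigma^2/n$ and $N^2(n-1)\rho\sigma^2/n$. It assumes $\sigma^2 = \sum\nolimits_U \sigma_k^2/N$ and $\rho = \sum\sum_{k \ne l}\sigma_{kl}/[(N-1)N\sigma^2]$, which are the definitions consistent with Eqs. (19)-(20) and (30)-(31). The function names and the toy covariance matrix are illustrative only.

```python
def v1_double_sum(cov, pi, pi_joint):
    """Evaluate V11 and V12 from Eq. (7) for arbitrary inclusion probabilities."""
    N = len(cov)
    v11 = sum(cov[k][k] / pi[k] for k in range(N))
    v12 = sum(
        pi_joint[k][l] / (pi[k] * pi[l]) * cov[k][l]
        for k in range(N) for l in range(N) if k != l
    )
    return v11, v12


def v1_si_closed_form(cov, n):
    """Closed forms under SI, using the ASSUMED definitions of sigma^2 and rho."""
    N = len(cov)
    sigma2 = sum(cov[k][k] for k in range(N)) / N
    off_diag = sum(cov[k][l] for k in range(N) for l in range(N) if k != l)
    rho = off_diag / ((N - 1) * N * sigma2)
    v11 = N ** 2 * sigma2 / n
    v12 = N ** 2 * (n - 1) * rho * sigma2 / n
    return v11, v12


# Toy population of N = 4 units with two correlated pairs (made-up values).
cov = [
    [2.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]
N, n = 4, 2
pi = [n / N] * N                              # pi_k = n/N under SI
pi_kl = n * (n - 1) / (N * (N - 1))           # joint inclusion probability, k != l
pi_joint = [[pi_kl] * N for _ in range(N)]    # diagonal entries are never read

general = v1_double_sum(cov, pi, pi_joint)
closed = v1_si_closed_form(cov, n)
```

The two tuples agree for this example, which is what the closed forms predict whenever $\pi_k = n/N$ and $\pi_{kl} = n(n-1)/[N(N-1)]$.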
$\begin{split}{V_2} = & {V_p}[{E_m}({\hat t_\pi }|s)] = {V_p}\left( {\sum\nolimits_s {{\mu _k}/{\pi _k}} } \right) = \\ & \sum {\sum\nolimits_U {{\Delta _{kl}}{{\breve \mu }_k}{{\breve \mu }_l}} } .\end{split}$ | (8) |
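Since
The measurement bias, the second term on the right-hand side of Eq. (5), captures the systematic bias in the parameter estimate caused by the measurement biases of the population units: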
${B_{pm}}\left( {{{\hat t}_\pi }} \right) = {E_{pm}}\left( {{{\hat t}_\pi }} \right) - {t_\theta }.$ | (9) |
Since
${B_{pm}}\left( {{{\hat t}_\pi }} \right) = \sum\nolimits_U {\left( {{\mu _k} - {\theta _k}} \right)} = B.$ | (10) |
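The decomposition $\mathrm{MSE}_{pm}(\hat t_\pi) = V_1 + V_2 + B^2$ implied by Eqs. (5)-(10) can be illustrated by simulation. The sketch below is a minimal check under assumptions that are not part of the model above: simple random sampling without replacement, and independent normal measurement errors with unit-specific bias and standard deviation, so that $V_{12} = 0$. The function names are hypothetical.

```python
import random


def simulate_mse(theta, bias, sd, n, reps=20000, seed=0):
    """Monte Carlo estimate of E_pm[(t_hat - t_theta)^2] under SI sampling."""
    rng = random.Random(seed)
    N = len(theta)
    t_theta = sum(theta)
    sq_err = 0.0
    for _ in range(reps):
        s = rng.sample(range(N), n)                      # SI sample of size n
        # y_k = theta_k + eps_k with E(eps_k) = bias_k, Var(eps_k) = sd_k^2
        y = [theta[k] + bias[k] + rng.gauss(0.0, sd[k]) for k in s]
        t_hat = N / n * sum(y)                           # pi-estimator, pi_k = n/N
        sq_err += (t_hat - t_theta) ** 2
    return sq_err / reps


def theoretical_mse(theta, bias, sd, n):
    """V1 + V2 + B^2 for the same setting (independent errors, SI design)."""
    N = len(theta)
    mu = [t + b for t, b in zip(theta, bias)]            # mu_k = E_m(y_k)
    v1 = N / n * sum(s * s for s in sd)                  # V11; V12 = 0 here
    mu_bar = sum(mu) / N
    s2_mu = sum((m - mu_bar) ** 2 for m in mu) / (N - 1)
    v2 = N ** 2 * (1 - n / N) / n * s2_mu                # sampling variance
    b = sum(bias)                                        # B = sum_U (mu_k - theta_k)
    return v1 + v2 + b ** 2
```

With enough replications the simulated value settles close to the theoretical sum. Note that the $B^2$ term does not shrink as $n$ grows, which matches the earlier remark that systematic error persists in large samples.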
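Special survey designs are sometimes used to estimate the variance due to measurement error, for example repeated measurement methods, interviewer variance studies, randomized experiments, and record check studies. The most widely used of these, the repeated measurement method, rests on the following idea: starting from the initial sampling design, a subsample that is representative of the initial design is drawn, which yields unbiased estimates of the variance components [8-10]. The procedure is as follows: according to the sampling design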
${\hat V_{11}} = \frac{{{n_s}}}{{2{n_r}}}\sum\nolimits_r {{{({z_k}/{\pi _k})}^2}} $ | (11) |
and
${\hat V_{12}} = \frac{{{n_s}({n_s} - 1)}}{{2{n_r}({n_r} - 1)}}\left\{ {{{\left( {\sum\nolimits_r {{z_k}/{\pi _k}} } \right)}^2} - \sum\nolimits_r {{{({z_k}/{\pi _k})}^2}} } \right\}.$ | (12) |
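Estimators (11) and (12) translate directly into code. The sketch below assumes that $z_k$ is the difference between the original and the repeated measurement of unit $k$ in the subsample $r$, that $n_s$ is the size of the initial sample, and that $n_r$ is the size of the subsample. These readings are inferred from the form of the estimators rather than stated explicitly here, and the function name is hypothetical.

```python
def repeated_measurement_estimates(z, pi, n_s):
    """Return (V11_hat, V12_hat) following Eqs. (11) and (12).

    z   : measurement differences z_k for the units in subsample r (assumed meaning)
    pi  : first-order inclusion probabilities pi_k of those same units
    n_s : size of the initial sample s
    """
    n_r = len(z)
    if n_r < 2 or len(pi) != n_r:
        raise ValueError("need at least two subsample units and matching pi")
    scaled = [zk / pk for zk, pk in zip(z, pi)]          # z_k / pi_k
    sum_sq = sum(x * x for x in scaled)                  # sum_r (z_k/pi_k)^2
    total = sum(scaled)                                  # sum_r z_k/pi_k
    v11_hat = n_s / (2 * n_r) * sum_sq
    v12_hat = n_s * (n_s - 1) / (2 * n_r * (n_r - 1)) * (total ** 2 - sum_sq)
    return v11_hat, v12_hat
```

Because (12) is built from cross products $z_k z_l$, the estimate of $V_{12}$ can be negative in a particular sample even though it is unbiased on average.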
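The measurement error model above is a general one that does not specify how the data are observed, and it can be adapted to specific practical settings. Interviewers may introduce bias, variance, and correlation into the measurement errors while collecting data, which makes interviewer error a particularly important type of measurement error [11-12]; this section therefore brings interviewer effects into the measurement error model. In studying the relationship between interviewer variance and total survey variance, Hanson and Marks [13] introduced four sources of randomness: random selection of survey areas, random selection of the sample within the sampled areas, random selection of interviewers, and random assignment of respondents to interviewers. Within the simple measurement error framework, the following examines how the correlation in the survey data induced by different interviewer assignment schemes affects measurement error.
4.1 Fixed assignment of interviewers
In this case each interviewer surveys a fixed set of population units; for example, an interviewer is responsible for a particular area. For a fixed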
${y_k} = {\theta _k} + {\varepsilon _k}.$ | (13) |
Clearly, when measured by the same interviewer,
${\mu _k} = {\theta _k} + {b_i},\quad k \in {U_i};$ | (14) |
$\sigma _k^2 = {v_i},\quad k \in {U_i};$ | (15) |
${\sigma _{kl}} = \left\{ \begin{array}{ll}{\rho _i}{v_i}, & k \in {U_i},\ l \in {U_i},\ k \ne l,\\[4pt]0, & k \in {U_i},\ l \in {U_j},\ i \ne j.\end{array} \right.$ | (16) |
Substituting the moment expressions (14)-(16) into Eqs. (6)-(10) of the previous section gives
${V_{11}} = \sum\limits_{i = 1}^a {\left(\sum\nolimits_{{U_i}} {1/{\pi _k}} \right){v_i}} ,$ | (17) |
the correlated measurement variance is
${V_{12}} = \sum\limits_{i = 1}^a {\left( {\sum\limits_{k \ne l} {\sum\nolimits_{{U_i}} {{\pi _{kl}}/({\pi _k}{\pi _l})} } } \right){\rho _i}{v_i}} ,$ | (18) |
${\sigma ^2} = \sum\limits_{i = 1}^a {{N_i}{v_i}} /N,$ | (19) |
$\rho = \frac{{\sum\limits_{i = 1}^a {{N_i}({N_i} - 1){\rho _i}{v_i}} }}{{(N - 1)\sum\limits_{i = 1}^a {{N_i}{v_i}} }}.$ | (20) |
Eq. (20) shows that, if the correlated measurement variance is to be made as small as possible, the most extreme case is
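Eqs. (19) and (20) are easy to evaluate for a given workload layout. The sketch below is a minimal illustration in which `N_i`, `v_i` and `rho_i` are lists holding, for each of the $a$ interviewers, the number of assigned units $N_i$, the variance $v_i$ and the within-interviewer correlation $\rho_i$. The function name and the example values are hypothetical.

```python
def fixed_assignment_sigma2_rho(N_i, v_i, rho_i):
    """Average variance (Eq. 19) and average correlation (Eq. 20)."""
    if not (len(N_i) == len(v_i) == len(rho_i)):
        raise ValueError("one entry per interviewer is required in each list")
    N = sum(N_i)
    weighted_var = sum(Ni * vi for Ni, vi in zip(N_i, v_i))       # sum N_i v_i
    sigma2 = weighted_var / N
    numer = sum(Ni * (Ni - 1) * ri * vi
                for Ni, vi, ri in zip(N_i, v_i, rho_i))
    rho = numer / ((N - 1) * weighted_var)
    return sigma2, rho


# Two interviewers with unequal workloads (made-up values).
sigma2, rho = fixed_assignment_sigma2_rho([2, 3], [1.0, 2.0], [0.5, 0.25])

# One unit per interviewer: every N_i - 1 is zero, so rho is zero.
_, rho_single = fixed_assignment_sigma2_rho([1, 1, 1], [1.0, 2.0, 3.0], [0.9, 0.9, 0.9])
```

Each term of the numerator in (20) carries the factor $N_i(N_i - 1)$, so large workloads inflate $\rho$ roughly quadratically, while the factor vanishes for any interviewer who handles a single unit.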
The sampling variance is
${V_2} = \sum\limits_{i = 1}^a {\sum\limits_{j = 1}^a {\sum\limits_{k \in {U_i}} {\sum\limits_{l \in {U_j}} {{\Delta _{kl}}\frac{{({\theta _k} + {b_i})({\theta _l} + {b_j})}}{{{\pi _k}{\pi _l}}}} } } } ,$ | (21) |
and the measurement bias is
$B = \sum\limits_{i = 1}^a {{N_i}{b_i}} .$ | (22) |
Under simple random sampling without replacement, Eq. (21) simplifies to
${V_2} = {V_{2,SI}} = {N^2}\frac{{1 - f}}{n}(S_{\theta U}^2 + S_{bU}^2 + 2{S_{\theta bU}}).$ | (23) |
where
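A small numerical sketch of Eqs. (22) and (23) follows. It assumes that $f = n/N$, that $S_{\theta U}^2$ and $S_{bU}^2$ are the population variances (divisor $N-1$) of the true values and of the unit-level interviewer biases, and that $S_{\theta bU}$ is their covariance. Here `b_unit[k]` holds the bias $b_i$ of the interviewer assigned to unit $k$. All names are illustrative.

```python
def pop_cov(x, y):
    """Population covariance with divisor N - 1 (variance when x is y)."""
    N = len(x)
    mx, my = sum(x) / N, sum(y) / N
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (N - 1)


def v2_si_fixed_assignment(theta, b_unit, n):
    """Sampling variance V2 under SI, following Eq. (23)."""
    N = len(theta)
    f = n / N                                   # sampling fraction (assumed meaning of f)
    s2_theta = pop_cov(theta, theta)
    s2_b = pop_cov(b_unit, b_unit)
    s_theta_b = pop_cov(theta, b_unit)
    return N ** 2 * (1 - f) / n * (s2_theta + s2_b + 2 * s_theta_b)


def measurement_bias(b_unit):
    """B = sum_i N_i b_i, written as a sum over units (Eq. 22)."""
    return sum(b_unit)


# Four units, two interviewers; the second interviewer has bias 1 (made-up values).
theta = [1.0, 2.0, 3.0, 4.0]
b_unit = [0.0, 0.0, 1.0, 1.0]
v2 = v2_si_fixed_assignment(theta, b_unit, n=2)
B = measurement_bias(b_unit)
```

The bracket in (23) is simply the variance of $\theta_k + b_i$ over the population, so a positive covariance between true values and interviewer biases increases the sampling variance and a negative one reduces it.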
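In practice, interviewers are usually assigned to groups at random; one reason for random assignment is to avoid interaction between interviewer bias and the true values. In this case each interviewer surveys a randomly chosen group of sample units; for example, an interviewer is sent to a randomly chosen area. Divide the population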
${\varepsilon _k} = {B_i} + {e_k},\;\;\;\; k \in {s_i}.$ | (24) |
${\mu _k} = {\theta _k} + {\mu _B},\quad k \in U;$ | (25) |
$\sigma _k^2 = {v_B} + {v_e},\quad k \in U;$ | (26) |
${\sigma _{kl}} = \left\{ \begin{array}{ll}{v_B}, & k,l \in {U_i},\ k \ne l,\\[4pt]0, & k \in {U_i},\ l \in {U_j},\ i \ne j.\end{array} \right.$ | (27) |
To compute
${V_{11}} = ({v_B} + {v_e})\sum\nolimits_U {1/{\pi _k}} ;$ | (28) |
the correlated measurement variance is
${V_{12}} = {v_B}\sum\limits_{i = 1}^a {\sum\limits_{k \ne l} {\sum\nolimits_{{U_i}} {{\pi _{kl}}/({\pi _k}{\pi _l})} } } ,$ | (29) |
${\sigma ^2} = {v_B} + {v_e},$ | (30) |
$\rho = \frac{{{v_B}}}{{{v_B} + {v_e}}} \cdot \frac{{\sum\nolimits_{i = 1}^a {N_i^2} - N}}{{N(N - 1)}};$ | (31) |
the sampling variance is
${V_2} = \sum {\sum\nolimits_U {{\Delta _{kl}}\frac{{({\theta _k} + {\mu _B})({\theta _l} + {\mu _B})}}{{{\pi _k}{\pi _l}}}} } ;$ | (32) |
and the measurement bias is
$B = N{\mu _B}.$ | (33) |
Since
${V_1} = {v_e}\sum\nolimits_U {1/{\pi _k}} .$ | (34) |
If the number of interviewers is fixed at
$\rho = \frac{{{N_0} - 1}}{{N - 1}}\frac{{{v_B}}}{{{v_B} + {v_e}}}.$ | (35) |
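Eq. (35) is the special case of Eq. (31) in which every interviewer has the same workload. The sketch below checks this numerically, assuming $N_0 = N/a$ denotes the common number of units per interviewer; the function names and parameter values are illustrative.

```python
def rho_random_assignment(N_i, v_B, v_e):
    """Average correlation under random assignment, Eq. (31)."""
    N = sum(N_i)
    share = v_B / (v_B + v_e)                    # interviewer share of total error variance
    return share * (sum(Ni ** 2 for Ni in N_i) - N) / (N * (N - 1))


def rho_equal_workloads(N_0, a, v_B, v_e):
    """Eq. (35): all a interviewers handle N_0 units each, so N = a * N_0."""
    N = a * N_0
    return (N_0 - 1) / (N - 1) * v_B / (v_B + v_e)


rho_general = rho_random_assignment([5, 5, 5, 5], v_B=1.0, v_e=3.0)
rho_equal = rho_equal_workloads(N_0=5, a=4, v_B=1.0, v_e=3.0)
rho_one_unit = rho_equal_workloads(N_0=1, a=20, v_B=1.0, v_e=3.0)
```

The two expressions coincide because $\sum_i N_i^2 = N N_0$ when all $N_i = N_0$. With $N_0 = 1$ the factor $N_0 - 1$ is zero, so the correlation vanishes.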
In the most extreme case each interviewer visits only one survey unit, and then
${V_1} + {V_2} = {N^2}\left( {\frac{{{v_B}}}{a} + \frac{{{v_e}}}{n} + \frac{{S_{\theta U}^2}}{n}} \right).$ | (36) |
Clearly, the effect of increasing the sample size on the terms containing
${V_1} + {V_2} = {N^2}[1 + ({n_0} - 1){\rho _w}]\sigma _{\rm{tot}}^2/n.$ | (37) |
where
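The relation between Eqs. (36) and (37) can be checked numerically. The sketch below takes (36) as $N^2$ multiplied by the bracketed sum, which is the form that agrees with (37), and assumes $n_0 = n/a$, $\sigma_{\rm tot}^2 = v_B + v_e + S_{\theta U}^2$ and $\rho_w = v_B/\sigma_{\rm tot}^2$. These definitions are inferred from the algebra rather than quoted, and the function names are hypothetical.

```python
def total_variance_eq36(N, n, a, v_B, v_e, s2_theta):
    """V1 + V2 as N^2 * (v_B/a + v_e/n + S^2_thetaU/n)."""
    return N ** 2 * (v_B / a + v_e / n + s2_theta / n)


def total_variance_eq37(N, n, a, v_B, v_e, s2_theta):
    """V1 + V2 in design-effect form, N^2 [1 + (n_0 - 1) rho_w] sigma_tot^2 / n."""
    n_0 = n / a                                  # ASSUMED: average workload per interviewer
    sigma2_tot = v_B + v_e + s2_theta            # ASSUMED: total variance per unit
    rho_w = v_B / sigma2_tot                     # ASSUMED: within-interviewer correlation
    return N ** 2 * (1 + (n_0 - 1) * rho_w) * sigma2_tot / n


args = dict(N=100, a=4, v_B=1.0, v_e=3.0, s2_theta=6.0)
small_n = total_variance_eq36(n=20, **args)
large_n = total_variance_eq36(n=2000, **args)
floor = args["N"] ** 2 * args["v_B"] / args["a"]   # part that does not depend on n
```

Raising $n$ from 20 to 2000 with $a$ held fixed shrinks only the $v_e/n$ and $S_{\theta U}^2/n$ terms. The total stays above $N^2 v_B/a$, which can be lowered only by using more interviewers.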
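Within the framework of the simple measurement error model, and with respect to
(1) The mean squared error decomposes into two kinds of error effects: the variance of the parameter estimate and the measurement bias. The variance consists of the measurement variance and the sampling variance, and the measurement variance in turn splits into the simple measurement variance and the correlated measurement variance. These reflect, respectively, the random variation of measurements across repeated surveys and the correlation pattern among the measurement errors of different survey units. The measurement bias reflects the systematic pattern in the differences between observed and true values.
(2) Interviewer error is a very important correlated error in practice. Under fixed assignment of interviewers, controlling the correlated measurement variance requires that the correlation coefficient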
[1] | SÄRNDAL C E, SWENSSON B, WRETMAN J H. Model assisted survey sampling[M]. New York: Springer, 2003: 601-614. |
[2] | BUZAS J S, STEFANSKI L A, TOSTESON T D. Measurement error [M]. New York: Springer, 2014: 1241-1282. |
[3] | LOKEN E, GELMAN A. Measurement Error and the Replication Crisis[J]. Science, 2017, 355(6325): 584-585. DOI: 10.1126/science.aal3618. |
[4] | WANG H, JIN Y J. Error effects analysis approach for statistical data accuracy evaluation[J]. Statistics & Information Forum, 2009, 24(9): 10-16. (in Chinese) |
[5] | FERRO C A T, FRICKER T E. A bias-corrected decomposition of the brier score[J]. Quarterly Journal of the Royal Meteorological Society, 2012, 138(668): 1954-1960. DOI: 10.1002/qj.v138.668. |
[6] | SCHOOT R V D, SCHMIDT P, BEUCKELAER A D, et al. Editorial: measurement invariance[J]. Front Psychol, 2015, 6: 1064. |
[7] | ROOVER D, TIMMERMAN K, MARIEKE E, et al. What's hampering measurement invariance[J]. Front Psychol, 2015, 5: 604. |
[8] | YU C, ZHANG S, FRIEDENREICH C, et al. Using repeated measures to correct correlated measurement errors through orthogonal decomposition[J]. Communication in Statistics-Theory and Methods, 2017, 46(23): 11604-11611. DOI: 10.1080/03610926.2016.1275693. |
[9] | BLATTMAN C, JAMISON J, KOROKNAY-PALICZ T, et al. Measuring the measurement error: a method to qualitatively validate survey data[J]. Journal of Development Economics, 2016, 120: 99-112. DOI: 10.1016/j.jdeveco.2016.01.005. |
[10] | ABOWD J M, STINSON M H. Estimating measurement error in annual job earnings: a comparison of survey and administrative data[J]. Review of Economics & Statistics, 2013, 95(5): 1451-1467. |
[11] | BIEMER P P, GROVES R M, et al. Interviewer, respondent, and regional office effects on response variance: a statistical decomposition[J]. Applied Physics Letters, 2016, 86(7): 074104-074104. |
[12] | WANG K L. Measurement models and measurement methods for interviewer errors[J]. Statistics & Decision, 2009, 298(22): 11-12. (in Chinese) |
[13] | HANSON R H, MARKS E S. Influence of the interviewer on the accuracy of survey results[J]. Journal of the American Statistical Association, 1958, 53(283): 635-655. DOI: 10.1080/01621459.1958.10501465. |
[14] | ELLIOTT M R, WEST B T. "Clustering by interviewer": a source of variance that is unaccounted for in single-stage health surveys[J]. American Journal of Epidemiology, 2015, 182(2): 118-126. DOI: 10.1093/aje/kwv018. |
[15] | DIJKSTRA W. How interviewer variance can bias the results of research on interviewer effects[J]. Quality and Quantity, 1983, 17(3): 179-187. DOI: 10.1007/BF00167582. |