Chinese Journal of Sociology (社会), 2017, Vol. 37, Issue 2: 1-25

### Cite this article

CHEN Huashan. 2017. Penalized Gaussian Graphic Models and Their Applications in Social Network Measurement[J]. Chinese Journal of Sociology(in Chinese Version), 37(2): 1-25.

Penalized Gaussian Graphic Models and Their Applications in Social Network Measurement
CHEN Huashan
This study was supported by the Chinese National Social Science Foundation (16BSH013).
Author: CHEN Huashan, National Institute of Social Development, CASS. E-mail: chenhs@cass.org.cn
Abstract: With the spread of the Internet and new technologies, a growing volume of behavioral data recording human interactions has become available and has attracted the attention of sociologists. Most behavioral log data are of the event-action type and share the data structure of two-mode networks. Two-mode networks are common in social network analysis, and many methods exist for analyzing them. However, classical two-mode networks are usually small datasets suited to descriptive methods such as matrix decomposition and principal component analysis, whereas the networks underlying behavioral data are large in scale and carry information about time-ordered, heterogeneous events. Moreover, network membership changes dynamically: members may join or leave the network. Traditional analytic methods cannot deal effectively with such data, and analyzing large-scale behavioral data poses a major challenge for social scientists. Over the past decade, the high-dimensional Gaussian graphic model has received a great deal of attention in research on network structure detection, especially variants based on Tibshirani's (1996) lasso method. The success of the lasso-based penalized Gaussian graphic model is due not only to its efficiency in high-dimensional computation but also to its interpretability and ease of extension. As a result, it is a rapidly developing field with an extensive literature in biology, genetics, neurology, machine learning, and other areas, yet it has not attracted the attention of social scientists. This paper presents an overview of applications of the lasso-based penalized Gaussian graphic model to the measurement of network structures with observational behavioral data.
The author focuses not on specific solution algorithms and optimization procedures but on the potential substantive contributions of the Gaussian graphic model and its extensions to social science research. The paper derives different hypotheses from theoretical concerns and demonstrates them with real-data examples. Finally, it briefly summarizes the related models and their R packages, with the intent of expanding the application of Gaussian graphic models in social science research.
Key words: social network measurement; two-mode networks; penalized Gaussian graphic models; glasso

 $\begin{array}{l} \;\;\;\;\;\;\;\;\begin{array}{*{20}{c}} {{v_1}}&{{v_2}}&{{v_3}}&{{v_4}}&{{v_5}} \end{array}\\ {\rm{P}} = \begin{array}{*{20}{c}} {{e_1}}\\ {{e_2}}\\ {{e_3}}\\ {{e_4}}\\ {{e_5}}\\ {{e_6}} \end{array}\left[ {\begin{array}{*{20}{c}} 1&1&0&0&0\\ 3&0&1&0&0\\ 1&0&0&3&0\\ 0&4&0&1&0\\ 0&0&1&0&2\\ 0&0&0&0&3 \end{array}} \right] \end{array}$
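A minimal sketch of how a two-mode matrix such as P could be turned into a sample covariance matrix, the usual input to a Gaussian graphic model. It assumes that rows (events) are treated as observations and columns (actors) as variables, and that the covariance is normalized by n; both choices, and the helper name `sample_covariance`, are illustrative assumptions rather than the author's procedure.

```python
# Two-mode matrix P from the text: rows e1..e6 (events), columns v1..v5 (actors).
P = [
    [1, 1, 0, 0, 0],
    [3, 0, 1, 0, 0],
    [1, 0, 0, 3, 0],
    [0, 4, 0, 1, 0],
    [0, 0, 1, 0, 2],
    [0, 0, 0, 0, 3],
]


def sample_covariance(rows):
    """Covariance between columns, treating each row as one observation.

    Hypothetical helper: divides by n (not n - 1), which is an assumption.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(p)]
        for i in range(p)
    ]


S = sample_covariance(P)
for row in S:
    print(" ".join("%6.3f" % v for v in row))
```

The resulting 5 x 5 matrix plays the role of S in the likelihood expressions of this section.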

 $X = ({X_1}, \ldots ,{X_p})\sim N\left( {\mu ,\mathit{\Sigma} } \right)$

 $\begin{array}{l} {\mathit{\Sigma} ^{ - 1}} = \left[ {\begin{array}{*{20}{c}} 2&{ - 1}&0&0&0\\ { - 1}&2&{ - 1}&0&0\\ 0&{ - 1}&2&{ - 1}&0\\ 0&0&{ - 1}&2&{ - 1}\\ 0&0&0&{ - 1}&2 \end{array}} \right]\\ \mathit{\Sigma} = \left[ {\begin{array}{*{20}{c}} {0.83}&{0.67}&{0.50}&{0.33}&{0.17}\\ {0.67}&{1.33}&{1.00}&{0.67}&{0.33}\\ {0.50}&{1.00}&{1.50}&{1.00}&{0.50}\\ {0.33}&{0.67}&{1.00}&{1.33}&{0.67}\\ {0.17}&{0.33}&{0.50}&{0.67}&{0.83} \end{array}} \right] \end{array}$
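The pair of matrices above can be checked numerically. The sketch below inverts the tridiagonal matrix with a small Gauss-Jordan routine (the helper `invert` is written here for illustration and is not part of any package mentioned in the paper) and prints the result to two decimals for comparison with the Σ shown in the text. Note that the sparse inverse corresponds to a dense covariance matrix.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest absolute entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


precision = [
    [2, -1, 0, 0, 0],
    [-1, 2, -1, 0, 0],
    [0, -1, 2, -1, 0],
    [0, 0, -1, 2, -1],
    [0, 0, 0, -1, 2],
]

sigma = invert(precision)
for row in sigma:
    print(" ".join("%.2f" % v for v in row))
```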

The matrix Θ is related to the partial correlation coefficients as follows (ω_ij denotes the (i, j) element of Θ):

 ${\rho _{ij|(i,j)}} = \frac{{ - {\omega _{ij}}}}{{\sqrt {{\omega _{ii}}{\omega _{jj}}} }}$
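As an illustration of this formula, the sketch below applies it to the tridiagonal inverse covariance matrix shown earlier in this section. Setting the diagonal of the result to 1 is a display convention assumed here, and the function name `partial_correlations` is hypothetical.

```python
import math


def partial_correlations(theta):
    """Apply rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) to every pair i != j.

    The diagonal is set to 1.0 by convention (an assumption of this sketch).
    """
    p = len(theta)
    return [
        [
            1.0 if i == j
            else -theta[i][j] / math.sqrt(theta[i][i] * theta[j][j])
            for j in range(p)
        ]
        for i in range(p)
    ]


theta = [
    [2, -1, 0, 0, 0],
    [-1, 2, -1, 0, 0],
    [0, -1, 2, -1, 0],
    [0, 0, -1, 2, -1],
    [0, 0, 0, -1, 2],
]

rho = partial_correlations(theta)
print(rho[0][1])  # adjacent variables: nonzero partial correlation
print(rho[0][2])  # non-adjacent variables: zero partial correlation
```

Only pairs with a nonzero off-diagonal entry in Θ receive a nonzero partial correlation, which is why the zero pattern of Θ can be read as the absence of network ties.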

 $\log \det \mathit{\Theta} - {\rm{trace}}\left( {S\mathit{\Theta} } \right)$ (1)
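A small numerical check of expression (1), restricted to 2 x 2 matrices so that the determinant can be written out by hand. The sample covariance S below is a made-up example, not data from the paper. When S is invertible, the unpenalized expression is maximized at Θ = S^{-1}, so that candidate should score higher than an arbitrary alternative such as the identity matrix.

```python
import math


def log_likelihood(theta, s):
    """Evaluate log det(Theta) - trace(S Theta) for 2 x 2 matrices."""
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    trace = sum(s[i][k] * theta[k][i] for i in range(2) for k in range(2))
    return math.log(det) - trace


S = [[2.0, 1.0], [1.0, 2.0]]                # hypothetical sample covariance
S_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # exact inverse of S
identity = [[1.0, 0.0], [0.0, 1.0]]

ll_inverse = log_likelihood(S_inv, S)      # log(1/3) - 2
ll_identity = log_likelihood(identity, S)  # log(1) - 4
print(ll_inverse, ll_identity)
```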

(II) Penalized likelihood estimation

1. Penalized likelihood estimation

 $\mathop {{\rm{maximize}}}\limits_{\mathit{\Theta}} \left\{ {\log \det \mathit{\Theta} - {\rm{trace}}\left( {S\mathit{\Theta} } \right) - \lambda {{\left\| \mathit{\Theta} \right\|}_1}} \right\}$ (2)
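To illustrate how the penalty term in (2) favors sparse solutions, the sketch below evaluates the objective for two hand-picked 2 x 2 candidates: a dense one with an edge and a diagonal one without. The covariance S and both candidates are made up for illustration, and neither is claimed to be the actual optimizer at any λ; real fitting uses an iterative solver. The sketch also assumes that the l1 norm sums the absolute values of all entries including the diagonal, a convention that varies between implementations.

```python
import math


def penalized_objective(theta, s, lam):
    """Evaluate log det(Theta) - trace(S Theta) - lam * ||Theta||_1 (2 x 2 only)."""
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    trace = sum(s[i][k] * theta[k][i] for i in range(2) for k in range(2))
    # Assumption: the diagonal entries are penalized as well.
    l1_norm = sum(abs(v) for row in theta for v in row)
    return math.log(det) - trace - lam * l1_norm


S = [[2.0, 1.0], [1.0, 2.0]]                # hypothetical sample covariance
dense = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # inverse of S: edge present
sparse = [[0.5, 0.0], [0.0, 0.5]]           # diagonal candidate: no edge

preferred = {}
for lam in (0.1, 1.0):
    d = penalized_objective(dense, S, lam)
    z = penalized_objective(sparse, S, lam)
    preferred[lam] = "dense" if d > z else "sparse"
    print(lam, round(d, 4), round(z, 4), preferred[lam])
```

With the small λ the dense candidate scores higher; with the larger λ the edge-free candidate does, which is the mechanism by which the penalty removes weak ties.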

 ${\rm{RSS}} = \sum\limits_{i = 1}^n {{{\left( {{y_i} - {\beta _0} - \sum\limits_{j = 1}^p {{\beta _j}{x_{ij}}} } \right)}^2}}$

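1. Besides the l1 norm mentioned in Equation (2), the available penalty norms include the l0 norm, the l2 norm (ridge regression), the nuclear norm, and the elastic net (Zou and Hastie, 2005), which mixes the l1 and l2 norms, among others. More precisely, the penalized models referred to in this paper are norm-based penalized graphic models (lasso graphic models), including extended models that combine the l1 norm with other norms; some of the models introduced later in the paper use the elastic net or several penalty norms.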

 ${\rm{RSS}} + \lambda \sum\limits_{j = 1}^p {|{\beta _j}|}$ (3)
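For a single predictor, criterion (3) has a closed-form minimizer, which makes the shrinkage effect of λ easy to see. The sketch below assumes one centered predictor and a centered response, so that the intercept β0 is zero and can be dropped; under that assumption the minimizer is the soft-thresholded cross-product divided by the sum of squares of x. The data and function names are illustrative.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_single_predictor(x, y, lam):
    """Minimize sum((y_i - b * x_i)^2) + lam * |b| for one centered predictor.

    Setting the subgradient to zero gives b = soft_threshold(sum(x*y), lam/2) / sum(x*x).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


x = [-1.0, 0.0, 1.0]  # hypothetical centered predictor
y = [-2.0, 0.0, 2.0]  # hypothetical centered response
for lam in (0.0, 2.0, 8.0):
    print(lam, lasso_single_predictor(x, y, lam))
```

At λ = 0 the estimate equals the least-squares slope; as λ grows the coefficient shrinks and eventually becomes exactly zero, the property that produces sparse networks when the same penalty is applied to Θ.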

2. Optimal parameter selection and model evaluation

(III) Applications and examples

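2. The dataset was collected by Davis, Gardner and Gardner and is therefore abbreviated DGG. It ships with the social network analysis software UCINet and with the R package latentnet; a standalone download and a more detailed description are available at: https://networkdata.ics.uci.edu/netdata/html/davis.html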

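3. The partial correlation matrix may contain negative correlations, that is, values below 0. How negative correlations relate to network ties has to be decided according to the specific research question. In co-occurrence data, negative correlations tend to arise where two participants never co-occur. In this paper negative values are handled technically by setting them to 0, indicating that no network tie exists.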
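The treatment of negative values described in this note amounts to a one-line transformation. A minimal sketch, using made-up partial correlations rather than the DGG results and a hypothetical helper name:

```python
def clip_negative(rho):
    """Return a copy of a partial correlation matrix with negative entries set to 0."""
    return [[v if v > 0 else 0.0 for v in row] for row in rho]


# Hypothetical values for three participants, not taken from the paper.
rho = [
    [1.0, 0.4, -0.2],
    [0.4, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]

weights = clip_negative(rho)
for row in weights:
    print(row)
```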

 Figure 1 Network associations computed with the glasso method (DGG) (see note 4)

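4. The partial correlation matrix is used here as the instrument for measuring the social relation network. To bring out the sparsity of the network more clearly, it is also common practice to binarize the partial correlations with a threshold when constructing the network. It should be stressed that, unlike binarizing frequencies directly, binarizing partial correlations does not depend on how active the sampled participants are. Five models were fitted in this example; for reasons of space only three are shown. The example data, code, and detailed results can be downloaded from the official website of the journal 社会 (Chinese Journal of Sociology).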
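The thresholding step mentioned in this note can be sketched as follows. The threshold value, the example matrix, and the function name are all hypothetical; the paper does not state which cutoff was used.

```python
def binarize(rho, threshold):
    """Build a 0/1 adjacency matrix from off-diagonal partial correlations.

    A tie is recorded when the partial correlation exceeds the threshold;
    the diagonal is always 0 (no self-ties).
    """
    p = len(rho)
    return [
        [1 if i != j and rho[i][j] > threshold else 0 for j in range(p)]
        for i in range(p)
    ]


# Hypothetical partial correlations, not taken from the paper.
rho = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]

adjacency = binarize(rho, threshold=0.15)
for row in adjacency:
    print(row)
```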

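5. For estimation with binomially distributed data, see Banerjee et al., 2008; Ravikumar et al., 2008; van Borkulo et al., 2014. For Poisson-distributed data, see Allen and Liu, 2012, 2013. For multinomial distributions, see Dai et al., 2013. For estimation with mixed data types, see Chen et al., 2015; Haslbeck and Waldorp, 2015.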

(I) Penalized likelihood graphic models with covariates

 $\lambda \sum\limits_{j = 1}^p {|{b_j}|}$

 $\lambda \sum\limits_{j = 1}^m {|{b_j}|}$

 Figure 2 Network associations after controlling for gathering size (DGG)
(II) Multi-group penalized likelihood graphic models

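 Figure 3 Keyword association network of academic papers in the two journals (2006-2015)
 Figure 4 Keyword association network of academic papers in Sociological Studies (社会学研究) (2006-2015)
 Figure 5 Keyword association network of academic papers in Chinese Journal of Sociology (社会) (2006-2015)
(III) Latent class penalized likelihood graphic models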

 ${\pi _1}{N_1}\left( {{\mu _1},{\mathit{\Sigma} _1}} \right),{\pi _2}{N_2}\left( {{\mu _2},{\mathit{\Sigma} _2}} \right), \ldots ,{\pi _k}{N_k}\left( {{\mu _k},{\mathit{\Sigma} _k}} \right),$
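The mixture above assigns each observation to latent classes with weights π_k. As a simplified illustration, the sketch below computes posterior class membership probabilities for a univariate two-component mixture; the paper's components are multivariate normals with covariance matrices Σ_k, so the scalar variances, the parameter values, and the function names here are simplifying assumptions.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability that x was generated by each mixture component."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-class mixture: pi = (0.5, 0.5), means 0 and 4, unit variances.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

midpoint = responsibilities(2.0, weights, means, variances)
near_first = responsibilities(0.0, weights, means, variances)
print(midpoint)
print(near_first)
```

An observation halfway between the two means is split evenly between the classes, while one located at the first mean is assigned almost entirely to the first class.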

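 Figure 6 Two subnetworks of the DGG data

(IV) Other extensions of penalized likelihood graphic models

1. Grouped penalized likelihood graphic models

2. Latent variable penalized likelihood graphic models

 Figure 7 Latent structure model of the DGG data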