Bioinformatics analysis of acute kidney injury based on pathway-associated deep neural network

LIANG Shuifen, GANG Wei, CHEN Wei, ZHONG Caiming, HUANG Linxi, WANG Yuanjun, GUO Zhiyong

Citation: LIANG S, GANG W, CHEN W, et al. Bioinformatics analysis of acute kidney injury based on pathway-associated deep neural network[J]. Acad J Naval Med Univ, 2025, 46(9): 1148-1158. DOI: 10.16781/j.CN31-2187/R.20240159.

doi: 10.16781/j.CN31-2187/R.20240159
Funding: National Natural Science Foundation of China (82070692).

Author information
    About the authors:

    LIANG Shuifen, master's student. E-mail: ll12158899@163.com;

    GANG Wei, master's student. E-mail: 454738431@qq.com.

    Corresponding author:

    GUO Zhiyong, E-mail: chguozhiyong@126.com.

  • Co-first authors.


  • Abstract:  Objective To screen for key genes and important pathways shared by acute kidney injury (AKI) of different etiologies using a pathway-associated deep neural network and multiple machine learning algorithms. Methods The AKI microarray datasets GSE30718, GSE37838, GSE53769, GSE108113, GSE125779, GSE99325, and GSE174020 were downloaded from the Gene Expression Omnibus (GEO) database and merged, giving 60 kidney samples from AKI patients and 79 kidney samples from healthy controls. The merged data were divided (8∶2) into a training set and a test set and used to train and evaluate the pathway-associated deep neural network and 4 machine learning algorithms, namely least absolute shrinkage and selection operator (LASSO), random forest (RF), support vector machine-recursive feature elimination (SVM-RFE), and extreme gradient boosting (XgBoost), in order to screen for key genes and pathways common to AKI of different etiologies. The datasets GSE99340 and GSE1563 were merged, giving 43 kidney samples from AKI patients and 36 kidney samples from healthy controls, and served as the external validation set for testing the LASSO model and the nomogram built on the final screened genes. The pathway-associated deep neural network and the machine learning algorithms were evaluated with receiver operating characteristic (ROC) curves, precision, recall, accuracy, and F1-score. Immune cell infiltration in the AKI transcriptome data was assessed by cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT), and Pearson correlation analysis was performed between the final screened common key genes and immune cell infiltration levels. Results The pathway-associated deep neural network trained by 5-fold cross-validation achieved an area under curve (AUC) of 0.914 5±0.007 0, a precision of 0.750 0±0.044 0, a recall of 0.923 1±0.048 0, an accuracy of 0.838 7±0.016 0, and an F1-score of 0.827 6±0.020 0 in the test set, giving robust and accurate classification of AKI, and identified key pathways and a subset of candidate genes. The 4 machine learning algorithms all achieved high discriminative performance for AKI in the test set (AUC≥0.860, precision≥0.750, recall≥0.800, F1-score≥0.774) and yielded 7 key genes common to AKI of different etiologies: CD86, C-X-C motif chemokine ligand 10 (CXCL10), dynamin 2 (DNM2), proto-oncogene FOS, transcription factor 12 (TCF12), VGF nerve growth factor inducible (VGF), and A kinase anchoring protein 5 (AKAP5). The LASSO model based on the final screened genes had an AUC of 0.940 4 in the test set and 0.944 4 in external validation, showing very high discriminatory ability for AKI samples and supporting the overall regulatory role of the genes. The nomogram constructed from the 7 screened genes showed high classification performance with an AUC of 0.928 9, supporting both the individual contributions and the combined effect of the screened genes. Immune cell infiltration analysis showed significant differences between AKI samples and healthy control samples in naïve B cells, activated mast cells, monocytes, M1 macrophages, memory B cells, and activated dendritic cells (all P<0.05). M1 macrophages and monocytes were positively correlated with CD86 and CXCL10, activated mast cells were positively correlated with FOS, and naïve B cells were negatively correlated with CD86 and CXCL10 (all P<0.01). Activated mast cells were positively correlated with VGF and negatively correlated with CD86 and TCF12, while memory B cells were positively correlated with CD86 (all P<0.05). Conclusion A strategy combining a pathway-associated deep neural network with multiple machine learning classifiers can mine high-value key genes from high-dimensional, complex, and heterogeneous transcriptomic data; these genes may serve as potential targets for therapeutic intervention in AKI.

     

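  • Acute kidney injury (AKI) is a common critical illness with diverse etiologies and high heterogeneity; hypovolemia, urinary tract infection, and medication use can all cause AKI[1]. Its pathogenesis is complex, involving ischemia-reperfusion injury, inflammatory response, autophagy, and apoptosis[2-3], and its morbidity and mortality are very high[1]. Effective treatments that are targeted and individualized are still lacking in clinical practice[4]. Researchers have therefore sought the key regulatory mechanisms shared across AKI etiologies, so that targeted interventions can support precise management and effective treatment of AKI.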

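    Deep learning can extract high-value information from large, heterogeneous, and nonlinear data. In recent years it has been applied to explore the complex etiology of many diseases (such as autoimmune liver disease[5], coronavirus disease 2019[6], periodontitis[7], and breast cancer[8]) and to identify new therapeutic targets. In this study, a deep neural network combined with multiple machine learning algorithms was used to analyze human kidney transcriptome data, screen the key genes shared by heterogeneous AKI of different etiologies, and explore their potential functions, with the expectation that the identified genes could become new targets for AKI treatment and advance individualized therapy.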

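    The Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) was searched with the keyword "acute kidney injury", and data were downloaded according to the following criteria: (1) the sample tissue was kidney; (2) the species studied was human. Nine datasets were finally obtained: GSE30718, GSE37838, GSE53769, GSE108113, GSE125779, GSE99325, GSE174020, GSE99340, and GSE1563. Table 1 lists the AKI and healthy control groups in these datasets. Using the probe annotation file of each dataset, all probes were converted to the corresponding gene symbols with R scripts.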

    Table 1  Information of AKI and HC groups in datasets
    Dataset | Year | Description of AKI etiology | Description of HC | AKI, n | HC, n
    GSE30718 | 2011 | Renal ischemia reperfusion injury | Nephrectomy biopsy | 28 | 8
    GSE37838 | 2012 | Renal ischemia reperfusion injury | Nephrectomy biopsy | 17 | 8
    GSE53769 | 2014 | Acute tubular necrosis | Well-functioning transplant biopsy | 8 | 28
    GSE108113 | 2017 | - | Healthy normal donor kidneys | 0 | 11
    GSE125779 | 2019 | - | Healthy normal donor kidneys | 0 | 8
    GSE99325 | 2017 | - | Healthy normal donor kidneys | 0 | 4
    GSE174020 | 2021 | Kidney transplant with acute rejection | Well-functioning transplant biopsy | 7 | 12
    GSE99340 | 2017 | Immune-mediated kidney injury | Healthy normal donor kidneys | 43 | 17
    GSE1563 | 2004 | - | Well-functioning transplant biopsy | 0 | 19
    AKI: Acute kidney injury; HC: Healthy control.

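    The datasets GSE30718, GSE37838, GSE53769, GSE108113, GSE125779, GSE99325, and GSE174020 were merged to form the training and test sets of this study (60 kidney samples from AKI patients and 79 kidney samples from healthy controls) and were used to screen and identify the key pathways and candidate genes that most strongly distinguish the AKI group from the healthy control group. The datasets GSE99340 and GSE1563 were merged as the external validation cohort (43 kidney samples from AKI patients and 36 kidney samples from healthy controls) to validate the expression of the screened common key genes and their overall regulation. Batch effects in each merged dataset were removed with the ComBat method of the R package sva, and the data were then standardized[9].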

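    A modified pathway-associated sparse deep neural network (PasNet)[10] was used to learn representations of the merged heterogeneous, high-dimensional AKI transcriptome data. To this end, the biological pathways had to be redefined and integrated into the neural network. Pathways were taken from the biological pathway collections of the Molecular Signatures Database (MSigDB)[11]: the Reactome (a knowledge base of biomolecular pathways) pathways were downloaded and curated, and pathways with fewer than 30 genes were filtered out to avoid redundancy.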

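    To integrate pathways into the neural network and represent gene-pathway relationships, a mask matrix M between the gene layer and the pathway layer was created to annotate genes with pathways, and input feature genes that did not belong to any biological pathway were excluded. Based on the merged AKI dataset, the input feature matrix of the modified PasNet finally comprised 719 pathways and 4 943 associated genes.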
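    The filtering and annotation steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_mask`, the toy pathway and gene names, and the choice to count all member genes of a pathway when applying the size threshold are assumptions made here.

```python
def build_mask(pathways, measured_genes, min_size=30):
    """Return (pathway_names, gene_names, mask) for a gene-to-pathway layer.

    mask[i][j] is 1 if gene j belongs to pathway i, otherwise 0.
    """
    # Keep only pathways with at least `min_size` member genes.
    kept = {name: set(genes) for name, genes in pathways.items()
            if len(set(genes)) >= min_size}
    # Keep only measured genes that belong to at least one retained pathway.
    annotated = set().union(*kept.values()) if kept else set()
    gene_names = [g for g in measured_genes if g in annotated]
    pathway_names = sorted(kept)
    mask = [[1 if g in kept[p] else 0 for g in gene_names]
            for p in pathway_names]
    return pathway_names, gene_names, mask


# Hypothetical toy input; a threshold of 2 stands in for the paper's 30.
toy_pathways = {
    "pathway_A": ["g1", "g2", "g3"],
    "pathway_B": ["g3", "g4"],
    "pathway_C": ["g5"],  # below the threshold, so it is dropped
}
toy_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
names, genes, M = build_mask(toy_pathways, toy_genes, min_size=2)
```

    In this toy case `g5` and `g6` are excluded because no retained pathway contains them, which mirrors the exclusion of unannotated input genes described above.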

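    The PasNet architecture was modified to suit the learning and representation of AKI transcriptome data. The modified PasNet remained a sparse feed-forward neural network with 4 layers: an input layer (gene layer), a pathway layer, a hidden layer, and an output layer (Fig. 1). The input layer consists of genes; its structure was readjusted and the number of input neurons was changed to match the gene features of the AKI transcriptome data.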

    Fig.  1  Pathway-associated deep neural network architecture built by modified PasNet
    ω(1): The input and pathway layer connection weights, which were initialized by the binary mask matrix M and therefore here they behave as sparse connections; ω(2): Weight matrix between pathway layer and hidden layer; ω(3): Weight matrix between hidden layer and output layer; x11, x12, …, x1n: The expression value of the genes input into the PasNet (n=4 943); PasNet: Pathway-associated sparse deep neural network; AKI: Acute kidney injury.

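    The modified PasNet uses the hierarchical structure of the neural network to model the hierarchical effects between genes and pathways and to build abstract representations of the data. Its pathway layer represents biological pathways, with the Reactome pathways downloaded and curated according to the AKI expression profile data. The adjusted pathways, together with the specific AKI genes contained in each pathway, establish the correspondence and association between the gene layer and the pathway layer. The connections between the gene layer and the pathway layer were initialized with a binary mask matrix M built from the Reactome pathway database. The matrix is defined as $\boldsymbol{M} \in \{0,1\}^{n \times m}$, where n and m denote the number of pathways and the number of input feature genes, respectively. An element $m_{ij}$ of M is set to 1 if gene j belongs to pathway i, and to 0 otherwise. The modified PasNet is therefore a sparse coding model built on gene-pathway relationships and regulated by the matrix M.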

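    The hidden layer represents the interactions and nonlinear associations among biological pathways, and each node represents the combined effect of multiple pathways. To strengthen learning, the number of hidden-layer neurons was increased to better capture nonlinear effects among pathways. Finally, the output layer computes the posterior probability of the outcome. It uses 2 nodes to integrate the cumulative effects of the hidden layer and classify the clinical outcome as AKI or non-AKI. The pathway-associated deep neural network built by modifying PasNet thus distinguishes AKI from non-AKI by analyzing the hierarchical nonlinear relationships of biological processes and the associations between genes and pathways, and identifies important pathways and key gene subsets.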
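    The four-layer forward pass can be illustrated as follows. This is only a sketch under stated assumptions: sigmoid activations, no bias terms, and the toy weights are choices made here for brevity, and the function names are hypothetical. The point it shows is that multiplying the gene-to-pathway weights element-wise by the binary mask removes every connection that the pathway annotation does not allow.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    """Numerically stable softmax; the outputs sum to 1."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def forward(x, w1, mask, w2, w3):
    """Gene layer -> pathway layer -> hidden layer -> 2-node output."""
    # Element-wise masking: a weight survives only where mask is 1.
    w1_sparse = [[w * m for w, m in zip(w_row, m_row)]
                 for w_row, m_row in zip(w1, mask)]
    pathway = sigmoid(matvec(w1_sparse, x))
    hidden = sigmoid(matvec(w2, pathway))
    return softmax(matvec(w3, hidden))  # [P(class 0), P(class 1)]


# Hypothetical toy network: 3 genes, 2 pathways, 2 hidden nodes.
x = [1.0, 2.0, 3.0]
mask = [[1, 1, 0], [0, 0, 1]]
w1 = [[0.5, -0.5, 9.0], [9.0, 9.0, 0.1]]
w1_alt = [[0.5, -0.5, 100.0], [-7.0, 3.0, 0.1]]  # differs only where mask is 0
w2 = [[0.3, -0.2], [0.1, 0.4]]
w3 = [[1.0, -1.0], [-1.0, 1.0]]
probs = forward(x, w1, mask, w2, w3)
probs_alt = forward(x, w1_alt, mask, w2, w3)
```

    Because `w1` and `w1_alt` differ only at masked positions, both calls return the same probabilities, which is the sparsity property the mask is meant to enforce.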

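    The pathway-associated deep neural network was trained with 5-fold cross-validation to ensure reproducible performance and avoid overfitting. The cross-validation was repeated 5 times; the network parameters were saved at the end of each run and then evaluated on the test set. The best-performing model was then selected, the top 10 biological pathways that played important roles in identifying AKI of different etiologies were extracted from it, and the number of times each gene appeared in these 10 pathways was counted to obtain the key candidate gene subset used for subsequent feature selection by multiple machine learning classifiers.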
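    Counting how often each gene appears across the selected pathways is a simple tally. The sketch below assumes the top pathways are available as a mapping from pathway name to member genes; the names `gene_frequency`, `P1`, and `g1` are hypothetical placeholders.

```python
from collections import Counter


def gene_frequency(top_pathways):
    """Return (gene, count) pairs, most frequent first, ties by gene name."""
    counts = Counter()
    for genes in top_pathways.values():
        counts.update(set(genes))  # count each gene once per pathway
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy pathways, not taken from the paper.
toy_top_pathways = {
    "P1": ["g1", "g2"],
    "P2": ["g1", "g3"],
    "P3": ["g1", "g2"],
}
ranking = gene_frequency(toy_top_pathways)
```

    Genes at the head of `ranking` are shared by the most pathways and would form the candidate subset passed to the downstream classifiers.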

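    To further identify the most significant, critical, and valuable genes shared by AKI of different etiologies while avoiding the bias of any single algorithm, several machine learning classifiers were combined for feature selection: least absolute shrinkage and selection operator (LASSO) regression[12], support vector machine (SVM)-recursive feature elimination (RFE)[13], extreme gradient boosting (XgBoost)[14], and random forest (RF)[15].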

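    An RF model was built with the R package randomForest and trained on the training cohort. Features were scored and ranked by importance, and the top 20 genes by Gini coefficient were selected as candidate genes.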

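    LASSO is a dimensionality-reduction method for evaluating high-dimensional data. LASSO regression was performed with the R package glmnet using 10-fold cross-validation, and the LASSO model corresponding to the λ value that gave the highest cross-validated AUC was used for gene selection.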

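    The XgBoost classifier was implemented with the R package xgboost. The optimal model parameters were determined by 5-fold cross-validation, feature importance was ranked according to the weight of each feature in the model, and the top 20 genes were selected as candidate genes.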

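    SVM is one of the most popular and best-performing machine learning algorithms. The iterative SVM-RFE algorithm[16] was run with the R package e1071. Starting from the initial gene features of the training cohort, each round fitted a simple linear SVM to the gene expression values, ranked the features by their weights in the SVM solution, and removed the half of the features with the lower weights. This produced a ranking of all features, from which the top 20 genes were selected as candidate genes.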
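    The elimination schedule described above (rank, drop the lower-weighted half, repeat) can be sketched independently of the SVM itself. In the sketch below `weight_fn` is a hypothetical placeholder: in actual SVM-RFE it would refit a linear SVM on the remaining features each round and return the magnitude of each feature's weight, whereas here a fixed lookup table stands in for that step.

```python
def rfe_halving(features, weight_fn):
    """Rank features by repeatedly removing the lower-weighted half.

    weight_fn(remaining) must return a dict mapping each remaining
    feature to a non-negative importance weight.
    Returns all features ordered from most to least important.
    """
    remaining = list(features)
    eliminated = []  # batches removed later are placed in front
    while len(remaining) > 1:
        weights = weight_fn(remaining)
        ordered = sorted(remaining, key=lambda f: weights[f], reverse=True)
        n_drop = max(1, len(ordered) // 2)
        remaining = ordered[:-n_drop]
        eliminated = ordered[-n_drop:] + eliminated
    return remaining + eliminated


# Placeholder weights (hypothetical); a real run would refit the SVM here.
fixed_weights = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
ranking = rfe_halving(["d", "b", "a", "c"],
                      lambda feats: {f: fixed_weights[f] for f in feats})
top_two = ranking[:2]
```

    Taking `ranking[:20]` on a real feature set would correspond to the selection of the top 20 genes.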

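    Finally, the candidate genes that overlapped across the 4 machine learning algorithms were taken as the key genes shared by AKI of different etiologies and were visualized with a Venn diagram.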
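    The overlap step is a plain set intersection. The following sketch uses hypothetical placeholder gene lists (not the candidate lists from the paper) to show the operation.

```python
def common_genes(*gene_lists):
    """Return the genes present in every list, sorted alphabetically."""
    sets = [set(genes) for genes in gene_lists]
    return sorted(set.intersection(*sets))


# Hypothetical candidate lists from four selectors.
lasso_genes = ["g1", "g2", "g3", "g7"]
rf_genes = ["g1", "g3", "g4"]
svm_genes = ["g1", "g3", "g5", "g7"]
xgb_genes = ["g3", "g1", "g6"]
shared = common_genes(lasso_genes, rf_genes, svm_genes, xgb_genes)
```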

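    A LASSO model was developed from the final screened key genes using the training cohort and the R package glmnet. The key-gene expression data of the training cohort were used to fit the LASSO model with 10-fold cross-validation. To evaluate the robustness and discriminative performance of the LASSO model, its ROC curve in the AKI external validation set was plotted and the AUC calculated with the R package pROC[17].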
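    For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustration with hypothetical labels and scores and does not reproduce the pROC implementation.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (1 = positive)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

    Here three of the four positive-negative pairs are ordered correctly, so `value` is 0.75.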

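    CIBERSORT is a deconvolution algorithm widely used to characterize the gene expression signatures of different immune cell types in the microenvironment[18]. CIBERSORT analysis yielded the proportions of 22 immune cell types in each sample and a P value for the deconvolution result of each sample; samples with P<0.05 were considered reliable. Differences in immune cell proportions were analyzed with the Wilcoxon rank sum test, and P<0.01 was considered significant. Pearson correlation analysis was used to determine the correlation between the final screened key genes and infiltrating immune cells, and the results were visualized with the R packages pheatmap and ggpubr.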
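    The Pearson coefficient used for the gene-immune cell correlations is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with hypothetical values (a gene's expression against a cell-type fraction across samples):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values and immune cell fractions.
expression = [1.0, 2.0, 3.0, 4.0]
fraction_up = [2.0, 4.0, 6.0, 8.0]
fraction_down = [8.0, 6.0, 4.0, 2.0]
r_pos = pearson_r(expression, fraction_up)
r_neg = pearson_r(expression, fraction_down)
```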

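    To integrate the key genes and show directly how strongly each gene influences AKI, a gene-based nomogram model distinguishing AKI from non-AKI was constructed on the training cohort data with the R package rms, and its discrimination and reliability were verified with ROC and calibration curves. In the nomogram, the expression of each gene corresponds to a point score, and the total score is the sum over all of these elements. The ROC curve of the nomogram classification model was plotted using the external validation dataset.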

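    To evaluate the overall performance of the pathway-associated deep neural network model and the machine learning algorithms, the following metrics were calculated: AUC of the ROC curve, recall, precision, accuracy, and F1-score. AKI was designated as the positive class, and TP, TN, FP, and FN were defined as the numbers of true positive, true negative, false positive, and false negative samples, respectively. The metrics were calculated as follows: recall=TP/(TP+FN), precision=TP/(TP+FP), accuracy=(TP+TN)/(TP+TN+FP+FN), F1-score=2×recall×precision/(recall+precision).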
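    The four formulas translate directly into code. The confusion-matrix counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not reported in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, accuracy, and F1-score from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for a 34-sample test set.
metrics = classification_metrics(tp=13, tn=18, fp=1, fn=2)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

    With these counts, recall is 13/15, precision is 13/14, accuracy is 31/34, and the F1-score simplifies to 26/29.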

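    The training and test data comprised 139 samples (60 AKI samples and 79 healthy control samples), which were divided 8∶2 into a training set and a test set. The pathway-associated deep neural network was trained on the training data with 5-fold cross-validation. The final pathway-associated deep neural network had an AUC of 0.914 5±0.007 0, a precision of 0.750 0±0.044 0, a recall of 0.923 1±0.048 0, an accuracy of 0.838 7±0.016 0, and an F1-score of 0.827 6±0.020 0. The most robust and accurate of these models was selected to discriminate AKI from non-AKI samples, and it also yielded an informative candidate gene subset and the biological pathways used to infer plausible biological mechanisms of AKI.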
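    A split of this kind can be sketched as below. The paper does not state whether the split was stratified or which random seed was used, so stratification by class, the seed, and the function names are assumptions made here; the resulting set sizes are those of this sketch and are not claimed to match the study.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k=5, seed=0):
    """Shuffle indices and deal them into k nearly equal folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


labels = [1] * 60 + [0] * 79  # 60 AKI and 79 control samples
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx, k=5)
```

    Each fold in `folds` would serve once as the validation part while the other four are used for fitting.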

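    A heatmap was generated from the best-performing pathway-associated deep neural network model based on the ranking of its network nodes and weights. Fig. 2 shows the weights between the top 10 biological pathways and the hidden nodes; these pathways include IL-10 signaling and chemokine receptors binding chemokines, among others. These 10 pathways may play important roles in the biological processes of AKI. The frequency of each gene across these 10 pathways was then counted and ranked. The given gene set contained 4 943 genes in total, and the top-ranked genes included androgen receptor (AR), activity-regulated cytoskeleton-associated protein (ARC), coagulation factor Ⅱ (F2), glucokinase (GK), complement C2 (C2), and early growth response protein 1 (EGR1). Several pathways shared the same genes; AR had the highest frequency, appearing in 6 pathways. In contrast, genes such as complement C5 (C5) and long-chain acyl-coenzyme A synthetase 3 (ACSL3) appeared in only 1 pathway. The higher-value candidate gene subset was identified on this basis.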

    Fig.  2  A heatmap of the absolute weights between 10 pathway nodes selected by modified PasNet and hidden layer
    PasNet: Pathway-associated sparse deep neural network; GPI: Glycosylphosphatidylinositol; PI3K: Phosphatidylinositol 3-kinase; NGF: Nerve growth factor.

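    As shown in Fig. 3, the final LASSO model had a test-set AUC of 0.947, a precision of 0.929, a recall of 0.867, an accuracy of 0.912, and an F1-score of 0.897, and identified 19 genes. The RF model had a test-set AUC of 0.911, a precision of 0.800, a recall of 0.800, an accuracy of 0.824, and an F1-score of 0.800, and selected 20 genes. The SVM-RFE algorithm ranked the gene features recursively and the top 20 genes were selected; the SVM model based on the selected genes had a test-set AUC of 0.916, a precision of 0.857, a recall of 0.800, an accuracy of 0.853, and an F1-score of 0.828. The XgBoost model had a test-set AUC of 0.860, a precision of 0.750, a recall of 0.800, an accuracy of 0.794, and an F1-score of 0.774, and identified 20 genes. The intersection of the outputs of the 4 algorithms gave 7 genes as the final key genes shared by AKI of different etiologies: CD86, C-X-C motif chemokine ligand 10 (CXCL10), dynamin 2 (DNM2), proto-oncogene FOS, transcription factor 12 (TCF12), VGF nerve growth factor inducible (VGF), and A kinase anchoring protein 5 (AKAP5).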

    Fig.  3  Machine learning algorithms for screening common key genes of AKI with different etiologies
    A: The LASSO coefficient curve of the candidate gene set (trajectory of each coefficient of LASSO as lgλ changes); B: The curve of AUC for cross-validation as lgλ changes (in LASSO regression, 10-fold cross-validation is used to adjust parameter selection based on the evaluation metric AUC); C: The trend of out-of-bag (OOB) error as the number of decision trees varies in RF; D: The top 20 genes were obtained using the method of decreasing average Gini coefficient in RF; E: The importance of genes obtained according to XgBoost algorithm; F: Venn diagram of the intersection of 4 machine learning outputs; G: AUC of 4 machine learning algorithms based on the expression samples of candidate gene set. AKI: Acute kidney injury; LASSO: Least absolute shrinkage and selection operator; AUC: Area under curve; RF: Random forest; XgBoost: Extreme gradient boosting; SVM: Support vector machine; TPR: True positive rate; FPR: False positive rate.

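    A LASSO model was built on the final screened genes, and the trained model gave the following formula: Result=1.42×CD86+0.15×CXCL10+0.51×DNM2-0.45×FOS-2.90×TCF12-2.36×VGF-2.03×AKAP5+24.63. The LASSO model had an AUC of 0.940 4 in internal testing and 0.944 4 in the external validation cohort, showing very high discriminative ability in the external validation cohort. CD86, CXCL10, and DNM2 may promote the development of AKI, whereas expression of FOS, TCF12, VGF, and AKAP5 may inhibit its progression. The strong performance of the model supports the regulatory role and overall contribution of these genes.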
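    The formula above is a linear combination of the 7 expression values plus an intercept, which can be evaluated as in the following sketch. The coefficients and intercept are taken from the formula; the function name and the example expression values are hypothetical, and how the resulting score is mapped to a class label is not specified in the text, so no threshold is applied here.

```python
# Coefficients and intercept as given in the formula above.
COEFFICIENTS = {
    "CD86": 1.42, "CXCL10": 0.15, "DNM2": 0.51, "FOS": -0.45,
    "TCF12": -2.90, "VGF": -2.36, "AKAP5": -2.03,
}
INTERCEPT = 24.63


def lasso_score(expression):
    """Evaluate the linear score for one sample (dict: gene -> expression)."""
    return INTERCEPT + sum(coef * expression[gene]
                           for gene, coef in COEFFICIENTS.items())


# Hypothetical samples: all genes at 0, and all genes at 1.
baseline = lasso_score({gene: 0.0 for gene in COEFFICIENTS})
unit = lasso_score({gene: 1.0 for gene in COEFFICIENTS})
```

    The positive coefficients (CD86, CXCL10, DNM2) raise the score and the negative ones lower it, which is the basis for the promoting and inhibiting roles suggested above.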

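    Naïve B cells, activated mast cells, monocytes, M1 macrophages, memory B cells, and activated dendritic cells differed significantly between AKI samples and healthy control samples (all P<0.05). Pearson correlation analysis showed that M1 macrophages and monocytes were positively correlated with CD86 and CXCL10, activated mast cells were positively correlated with FOS, and naïve B cells were negatively correlated with CD86 and CXCL10 (all P<0.01). In addition, activated mast cells were positively correlated with VGF and negatively correlated with CD86 and TCF12, while memory B cells were positively correlated with CD86 (all P<0.05). See Fig. 4.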

    Fig.  4  Immune cell infiltration analysis of AKI expression profile
    A: The distinct immune cell proportion of non-AKI (healthy) and AKI samples. *P < 0.05, **P < 0.01. n=79 in non-AKI group, n=60 in AKI group. B: Fraction of 21 infiltrated immune cell subpopulations determined by CIBERSORT in non-AKI and AKI samples. C: The correlation between levels of immune cell infiltration and 7 screened genes. AKI: Acute kidney injury; CIBERSORT: Cell-type identification by estimating relative subsets of RNA transcripts; NK: Natural killer; AKAP5: A kinase anchoring protein 5; CXCL10: C-X-C motif chemokine ligand 10; DNM2: Dynamin 2; TCF12: Transcription factor 12; VGF: VGF nerve growth factor inducible.

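    As shown in Fig. 5A and 5B, FOS and CD86 contributed most to the gene-based logistic regression model and may play key roles in the development of AKI. In Fig. 5C the calibration curve deviates somewhat from the ideal curve but approximates it overall, indicating that the nomogram performed well and that the combined contribution and regulatory effect of the gene set were prominent. ROC curves were constructed to evaluate the specificity and sensitivity of each gene and of the nomogram; the AUC and 95%CI of each item were as follows: CD86 (AUC 0.828 3, 95%CI 0.761 5-0.895 0), CXCL10 (AUC 0.798 1, 95%CI 0.725 6-0.870 6), DNM2 (AUC 0.705 5, 95%CI 0.618 1-0.792 8), FOS (AUC 0.689 3, 95%CI 0.600 7-0.778 0), TCF12 (AUC 0.724 2, 95%CI 0.641 2-0.807 1), VGF (AUC 0.565 6, 95%CI 0.466 0-0.665 2), AKAP5 (AUC 0.755 7, 95%CI 0.672 8-0.838 6), and the nomogram (AUC 0.928 9, 95%CI 0.882 9-0.974 9). Overall, the genes showed relatively high discriminative performance for AKI, but the constructed nomogram had the highest classification value, suggesting that the screened genes act on AKI progression mainly through their combined regulation and may serve as potential intervention targets for AKI treatment.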

    Fig.  5  Construction and evaluation of nomogram based on final screened key genes
    A: Construction and evaluation of nomogram based on 7 screened genes; B: Nomogram classification instance and the corresponding probability based on screened genes (*P < 0.05, **P < 0.01); C: The calibration curve of the constructed nomogram. CXCL10: C-X-C motif chemokine ligand 10; DNM2: Dynamin 2; TCF12: Transcription factor 12; VGF: VGF nerve growth factor inducible; AKAP5: A kinase anchoring protein 5; AKI: Acute kidney injury; Pr: Predicted probability.

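    AKI is a heterogeneous group of disorders characterized by a sudden decrease in glomerular filtration rate, manifested as elevated serum creatinine or oliguria[19]. Common causes include ischemia-reperfusion (e.g., surgery, kidney transplantation), infection (e.g., sepsis, septicemia), autoimmune disease (e.g., rapidly progressive glomerulonephritis, systemic lupus erythematosus), and drugs (e.g., nonsteroidal anti-inflammatory drugs, contrast agents)[20]. Studies have found that some genes and their molecular phenotypes play key roles in injury and repair in AKI. The proto-oncogenes JUN and FOS, heme oxygenase 1 (HMOX1), and CXCL1 are involved in ischemia-reperfusion-induced AKI; hyaluronan synthase 2 (HAS2), myoferlin (MYOF), phospholipid phosphatase related protein 1 (PLPPR1), quinonoid dihydropteridine reductase (QDPR), and sideroflexin 1 (SFXN1) are associated with AKI early after kidney transplantation; and c-Myc and murine double minute 2 (MDM2) play important roles in cisplatin-induced AKI[21-23]. With the development of microarray and high-throughput technologies, the use of bioinformatics and systems biology tools to explore the pathophysiology, gene regulatory mechanisms, and potential therapeutic targets of AKI has become a research focus[24-25]. Although traditional bioinformatics methods are mature and offer diverse models and algorithms, most of them focus on single-factor analysis and have difficulty capturing the nonlinear associations among genes and between genes and biological pathways.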

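    PasNet is a pathway-based deep neural network. Through multiple network layers it captures the hierarchical effects of genes and pathways and models the nonlinear associations between genes and pathways and among pathways, thereby identifying the biological pathways and gene sets most relevant to the clinical outcome. It is suited to large-scale, high-dimensional, heterogeneous transcriptome data and has produced encouraging preliminary results in prognosis prediction from high-throughput data of glioblastoma multiforme[10]. This study proposed a pathway-associated deep neural network modified from PasNet. The AKI transcriptome datasets GSE30718, GSE37838, GSE53769, GSE108113, GSE125779, GSE99325, and GSE174020 were merged and used for 5-fold cross-validation of the network, which identified the top 10 biological pathways that play important roles in AKI of different etiologies. The SVM, RF, LASSO, and XgBoost algorithms were then used to select important genes from these pathways as candidate genes, and the 7 genes in the intersection of the 4 algorithms' outputs (CD86, CXCL10, DNM2, FOS, TCF12, VGF, and AKAP5) were taken as the key genes shared by AKI of different etiologies. Some of these genes, such as CD86, CXCL10, FOS, and VGF, have already been shown to participate in the occurrence and development of AKI. CD86 is a central costimulatory gene expressed mainly in antigen-presenting cells and related immune cells such as M1 macrophages, dendritic cells, and T lymphocytes[26-27]. Studies have shown that renal ischemia-reperfusion injury can induce CD86 immune expression: during ischemic injury, renal tubular epithelial cells produce various cytokines that induce macrophages to migrate and express proinflammatory phenotype markers such as CD80 and CD86[28]. In immune-mediated glomerular diseases such as rapidly progressive glomerulonephritis, inflammatory CD4+ T lymphocytes are in close contact with proximal renal tubular epithelial cells, which express the proinflammatory phenotype marker CD86 and drive renal inflammation and tissue injury[29].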

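    CXCL10 is a member of the C-X-C chemokine subfamily and is expressed mainly in T lymphocytes, neutrophils, monocytes, fibroblasts, and other cells. Its main functions are to inhibit endothelial cell proliferation, induce endothelial cell apoptosis, and inhibit angiogenesis[30]. In kidney transplant rejection, interstitial cells of the graft can secrete CXCL10, which recruits T lymphocytes to the transplanted kidney and amplifies the immune response, causing acute graft injury[31]. An early study showed that CXCL10 is markedly upregulated in anti-glomerular basement membrane glomerulonephritis[32]. Ho et al[33] found that CXCL10 was upregulated in patients with AKI after cardiopulmonary bypass (CPB); the possible mechanism is that chemokines and cytokines released during CPB cause ischemic kidney injury by recruiting inflammatory cells.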

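    FOS is a member of the immediate early gene family and is mainly involved in cellular activities such as proliferation, differentiation, survival, metabolism, hypoxia, angiogenesis, steroidogenesis, and prostaglandin production[34]. FOS can directly affect the transcription of inflammatory cytokines such as TNF-α, IL-6, and IL-1β and induce their high expression, promoting the occurrence of AKI by binding to the promoters of these cytokines[35].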

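    VGF is a neurotrophin-induced gene expressed mainly in the brain, especially the hypothalamus and hippocampus, in peripheral tissues such as the adrenal gland and pancreas, and in the myenteric plexus and endocrine cells of the gastrointestinal tract[36]. Kim et al[37] found that VGF, as a stress-responsive gene, is upregulated during ischemic, nephrotoxic, and rhabdomyolysis-associated kidney injury.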

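    DNM2, TCF12, and AKAP5 have been little studied in the gene regulatory mechanisms of AKI. High DNM2 expression is associated with proteinuric kidney disease; the possible mechanism is that DNM2 acts on the foot processes of glomerular podocytes, causing structural changes that lead to proteinuria[38]. TCF12 and AKAP5 have mostly been reported in relation to the occurrence and prognosis of malignant tumors such as acute lymphoblastic leukemia, prostate cancer, gastrointestinal stromal tumor, and ovarian cancer[39-40], and their relationship with kidney disease has not yet been reported. The specific molecular mechanisms of these genes need to be explored further in experimental studies.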

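    Although each screened gene was identified as a key gene with a high AUC that plays an important role in the development and regulation of AKI and may serve as a potential therapeutic target, the expression level of an individual gene in AKI transcriptome data can still vary and fluctuate randomly to a considerable extent, which makes estimates of individual gene contributions and effects less precise. We therefore preferred to develop a comprehensive AKI classification model that verifies the mutual regulation and combined effect of the screened common genes and compensates for the limitations of single genes. The LASSO model built on the 7 screened key genes had AUCs of 0.940 4 and 0.944 4 in the test set and the external validation set, respectively, showing good classification performance for AKI.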

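    This study has the following strengths. (1) A large AKI expression dataset was assembled from transcriptomic data, and a framework combining a deep neural network with multiple machine learning algorithms improved the accuracy of screening key genes shared by AKI of different etiologies; an accurate and reliable gene-based LASSO model was established, and the contribution of each screened gene and their overall performance were characterized. (2) The genes identified by the combined artificial intelligence algorithms may be key genes common to AKI of different etiologies and, as potential therapeutic targets, may help improve AKI treatment outcomes and guide individualized treatment strategies. (3) Immune cell infiltration analysis was used to study changes in the AKI immune microenvironment and the correlation between the identified genes and immune cells, and a nomogram showed directly how strongly the genes influence AKI. As a next step, we will verify the roles of the screened key genes in AKI progression through animal experiments.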

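    This study still has limitations. For example, sequencing data were not included when selecting data from GEO, and sequencing data on AKI are relatively scarce, so the selected data did not cover AKI of all etiologies.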

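    In summary, artificial intelligence algorithms can identify key information related to disease progression and prognosis from massive biological data and reveal the underlying gene regulatory pathways, providing a new strategy for studying disease pathogenesis and potential therapeutic targets. Although the artificial intelligence algorithms used in this study need further improvement, the study provides a reference for future AKI transcriptomic research and promotes the application of artificial intelligence in the exploration of AKI.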

  • [1] POUKKANEN M, VAARA S T, PETTILÄ V, et al. Acute kidney injury in patients with severe sepsis in Finnish intensive care units[J]. Acta Anaesthesiol Scand, 2013, 57(7): 863-872. DOI: 10.1111/aas.12133.
    [2] SHEN A W, TANG X J, SUN B, et al. Clinical characteristics of acute kidney injury by different causes and patients' prognosis[J]. Acad J Sec Mil Med Univ, 2017, 38(3): 306-311. DOI: 10.16781/j.0258-879x.2017.03.0306.
    [3] GONG L, PAN Q, YANG N. Autophagy and inflammation regulation in acute kidney injury[J]. Front Physiol, 2020, 11: 576463. DOI: 10.3389/fphys.2020.576463.
    [4] PORSCHEN C, STRAUSS C, MEERSCH M, et al. Personalized acute kidney injury treatment[J]. Curr Opin Crit Care, 2023, 29(6): 551-558. DOI: 10.1097/MCC.0000000000001089.
    [5] GERUSSI A, SCARAVAGLIO M, CRISTOFERI L, et al. Artificial intelligence for precision medicine in autoimmune liver disease[J]. Front Immunol, 2022, 13: 966329. DOI: 10.3389/fimmu.2022.966329.
    [6] REN J, GUO W, FENG K, et al. Identifying microRNA markers that predict COVID-19 severity using machine learning methods[J]. Life (Basel), 2022, 12(12): 1964. DOI: 10.3390/life12121964.
    [7] NING W, ACHARYA A, SUN Z, et al. Deep learning reveals key immunosuppression genes and distinct immunotypes in periodontitis[J]. Front Genet, 2021, 12: 648329. DOI: 10.3389/fgene.2021.648329.
    [8] JIA D, CHEN C, CHEN C, et al. Breast cancer case identification based on deep learning and bioinformatics analysis[J]. Front Genet, 2021, 12: 628136. DOI: 10.3389/fgene.2021.628136.
    [9] LEEK J T, JOHNSON W E, PARKER H S, et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments[J]. Bioinformatics, 2012, 28(6): 882-883. DOI: 10.1093/bioinformatics/bts034.
    [10] HAO J, KIM Y, KIM T K, et al. PASNet: pathway-associated sparse deep neural network for prognosis prediction from high-throughput data[J]. BMC Bioinformatics, 2018, 19(1): 510. DOI: 10.1186/s12859-018-2500-z.
    [11] LIBERZON A, BIRGER C, THORVALDSDÓTTIR H, et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection[J]. Cell Syst, 2015, 1(6): 417-425. DOI: 10.1016/j.cels.2015.12.004.
    [12] CHEN D L, CAI J H, WANG C C N. Identification of key prognostic genes of triple negative breast cancer by LASSO-based machine learning and bioinformatics analysis[J]. Genes, 2022, 13(5): 902. DOI: 10.3390/genes13050902.
    [13] SANZ H, VALIM C, VEGAS E, et al. SVM-RFE: selection and visualization of the most relevant features through non-linear kernels[J]. BMC Bioinformatics, 2018, 19(1): 432. DOI: 10.1186/s12859-018-2451-4.
    [14] LI W, YIN Y, QUAN X, et al. Gene expression value prediction based on XGBoost algorithm[J]. Front Genet, 2019, 10: 1077. DOI: 10.3389/fgene.2019.01077.
    [15] CHEN X, ISHWARAN H. Random forests for genomic data analysis[J]. Genomics, 2012, 99(6): 323-329. DOI: 10.1016/j.ygeno.2012.04.003.
    [16] DUAN K B, RAJAPAKSE J C, WANG H, et al. Multiple SVM-RFE for gene selection in cancer classification with expression data[J]. IEEE Trans Nanobioscience, 2005, 4(3): 228-234. DOI: 10.1109/tnb.2005.853657.
    [17] ROBIN X, TURCK N, HAINARD A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves[J]. BMC Bioinformatics, 2011, 12: 77. DOI: 10.1186/1471-2105-12-77.
    [18] NEWMAN A M, LIU C L, GREEN M R, et al. Robust enumeration of cell subsets from tissue expression profiles[J]. Nat Methods, 2015, 12(5): 453-457. DOI: 10.1038/nmeth.3337.
    [19] KELLUM J A, ROMAGNANI P, ASHUNTANTANG G, et al. Acute kidney injury[J]. Nat Rev Dis Primers, 2021, 7: 52. DOI: 10.1038/s41572-021-00284-z.
    [20] BELLOMO R, KELLUM J A, RONCO C. Acute kidney injury[J]. Lancet, 2012, 380(9843): 756-766. DOI: 10.1016/S0140-6736(11)61454-2.
    [21] YOU R, HEYANG Z, MA Y, et al. Identification of biomarkers, immune infiltration landscape, and treatment targets of ischemia-reperfusion acute kidney injury at an early stage by bioinformatics methods[J]. Hereditas, 2022, 159(1): 24. DOI: 10.1186/s41065-022-00236-x.
    [22] ZHAI X, LOU H, HU J. Five-gene signature predicts acute kidney injury in early kidney transplant patients[J]. Aging, 2022, 14(6): 2628-2644. DOI: 10.18632/aging.203962.
    [23] WANG S Y, GAO J, SONG Y H, et al. Identification of potential gene and microRNA biomarkers of acute kidney injury[J]. Biomed Res Int, 2021, 2021: 8834578. DOI: 10.1155/2021/8834578.
    [24] DENG W, WEI X, DONG Z, et al. Identification of fibroblast activation-related genes in two acute kidney injury models[J]. PeerJ, 2021, 9: e10926. DOI: 10.7717/peerj.10926.
    [25] FENG W, TANG R, YE X, et al. Identification of genes and pathways associated with kidney ischemia-reperfusion injury by bioinformatics analyses[J]. Kidney Blood Press Res, 2016, 41(1): 48-54. DOI: 10.1159/000368546.
    [26] PAINE A, KIRCHNER H, IMMENSCHUH S, et al. IL-2 upregulates CD86 expression on human CD4+ and CD8+ T cells[J]. J Immunol, 2012, 188(4): 1620-1629. DOI: 10.4049/jimmunol.1100181.
    [27] MENG X M, TANG P M, LI J, et al. Macrophage phenotype in kidney injury and repair[J]. Kidney Dis, 2015, 1(2): 138-146. DOI: 10.1159/000431214.
    [28] EL GAZZAR W B, ALLAM M M, SHALTOUT S A, et al. Pioglitazone modulates immune activation and ameliorates inflammation induced by injured renal tubular epithelial cells via PPARγ/miRNA-124/STAT3 signaling[J]. Biomed Rep, 2022, 18(1): 2. DOI: 10.3892/br.2022.1584.
    [29] BREDA P C, WIECH T, MEYER-SCHWESINGER C, et al. Renal proximal tubular epithelial cells exert immunomodulatory function by driving inflammatory CD4+ T cell responses[J]. Am J Physiol Renal Physiol, 2019, 317(1): F77-F89. DOI: 10.1152/ajprenal.00427.2018.
    [30] ANTONELLI A, FERRARI S M, GIUGGIOLI D, et al. Chemokine (C-X-C motif) ligand (CXCL)10 in autoimmune diseases[J]. Autoimmun Rev, 2014, 13(3): 272-280. DOI: 10.1016/j.autrev.2013.10.010.
    [31] LO D J, KAPLAN B, KIRK A D. Biomarkers for kidney transplant rejection[J]. Nat Rev Nephrol, 2014, 10(4): 215-225. DOI: 10.1038/nrneph.2013.281.
    [32] TANG W W, YIN S, WITTWER A J, et al. Chemokine gene expression in anti-glomerular basement membrane antibody glomerulonephritis[J]. Am J Physiol, 1995, 269(3 Pt 2): F323-F330. DOI: 10.1152/ajprenal.1995.269.3.F323.
    [33] HO J, LUCY M, KROKHIN O, et al. Mass spectrometry-based proteomic analysis of urine in acute kidney injury following cardiopulmonary bypass: a nested case-control study[J]. Am J Kidney Dis, 2009, 53(4): 584-595. DOI: 10.1053/j.ajkd.2008.10.037.
    [34] RODRÍGUEZ-BERDINI L, CAPUTTO B L. Lipid metabolism in neurons: a brief story of a novel c-Fos-dependent mechanism for the regulation of their synthesis[J]. Front Cell Neurosci, 2019, 13: 198. DOI: 10.3389/fncel.2019.00198.
    [35] ZHANG C, MA P, ZHAO Z, et al. miRNA-mRNA regulatory network analysis of mesenchymal stem cell treatment in cisplatin-induced acute kidney injury identifies roles for miR-210/Serpine1 and miR-378/Fos in regulating inflammation[J]. Mol Med Rep, 2019, 20(2): 1509-1522. DOI: 10.3892/mmr.2019.10383.
    [36] LEWIS J E, BRAMELD J M, JETHWA P H. Neuroendocrine role for VGF[J]. Front Endocrinol (Lausanne), 2015, 6: 3. DOI: 10.3389/fendo.2015.00003.
    [37] KIM J Y, BAI Y, JAYNE L A, et al. SOX9 promotes stress-responsive transcription of VGF nerve growth factor inducible gene in renal tubular epithelial cells[J]. J Biol Chem, 2020, 295(48): 16328-16341. DOI: 10.1074/jbc.RA120.015110.
    [38] KHALIL R, KOOP K, KREUTZ R, et al. Increased dynamin expression precedes proteinuria in glomerular disease[J]. J Pathol, 2019, 247(2): 177-185. DOI: 10.1002/path.5181.
    [39] WANG L, TANG Y, WU H, et al. TCF12 activates MAGT1 expression to regulate the malignant progression of pancreatic carcinoma cells[J]. Oncol Lett, 2022, 23(2): 62. DOI: 10.3892/ol.2021.13180.
    [40] ZHONG Z, YE Z, HE G, et al. Low expression of A-kinase anchor protein 5 predicts poor prognosis in non-mucin producing stomach adenocarcinoma based on TCGA data[J]. Ann Transl Med, 2020, 8(4): 115. DOI: 10.21037/atm.2019.12.98.
Publication history
  • Received:  2024-03-11
  • Accepted:  2024-04-15
