Reassembly and re-annotation of the genome of the important soil animal Folsomia candida

http://dx.doi.org/10.7685/jnau.201912054

Article information

JIN Jianfeng, ZHANG Feng

An improved genome of the soil important animal Folsomia candida

Journal of Nanjing Agricultural University, 2020, 43(6): 1042-1048.

Article history

Received: 2019-12-27

Cite this article

JIN Jianfeng, ZHANG Feng. An improved genome of the soil important animal Folsomia candida[J]. Journal of Nanjing Agricultural University, 2020, 43(6): 1042-1048. DOI: 10.7685/jnau.201912054

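JIN Jianfeng, ZHANG Feng

College of Plant Protection, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China

Funding: National Natural Science Foundation of China (31970434, 31772491)

About the author: JIN Jianfeng, PhD candidate

Corresponding author: ZHANG Feng, professor, whose research focuses on comparative insect genomics. E-mail: fengzhang@njau.edu.cn

Abstract: [Objectives] This study reassembled and re-annotated the genome of Folsomia candida using publicly available PacBio sequencing data, with the aim of improving assembly contiguity and the completeness of the annotated gene set. [Methods] The genome was assembled with Flye and Falcon, and the two assemblies were merged with quickmerge. Genes were annotated with the MAKER pipeline, which integrated ab initio predictions, transcript evidence, and protein homology evidence; the transcript evidence came from a StringTie assembly. [Results] The new assembly is 221.09 Mb in 113 scaffolds, with a longest scaffold of 30.07 Mb and a scaffold N50 of 13.5 Mb. It captures 96.5% of the arthropod Benchmarking Universal Single-Copy Orthologs (BUSCO, n=1 066). MAKER predicted 20 080 protein-coding genes, of which 80.56% were supported by transcriptome evidence and 96% by UniProt proteins; BUSCO completeness of the protein-coding gene set was 92.4%. Structural annotation also identified 253 665 repeats (22.3% of the genome) and 661 noncoding RNAs. Gene family analysis identified 8 876 gene families, of which 48 experienced rapid evolution (expansion or contraction). [Conclusions] Compared with the published version, the new assembly and annotation are substantially improved: scaffold N50 rose from 6.5 Mb to 13.5 Mb, and the completeness of the protein-coding gene set rose from 84% to 92.4%. The gene family analysis provides basic data and new perspectives for hexapod evolution and soil ecotoxicology.

Keywords: Folsomia candida; PacBio; genome assembly; gene annotation; gene family evolution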


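Soil animals are animals that spend their whole life, or a particular developmental stage, in the soil and that have some influence on it^[1]. Most invertebrate phyla have representatives in the soil, chiefly protozoans, flatworms, nematodes, rotifers, annelids, tardigrades, molluscs, and arthropods^[1]. Among them, springtails (Collembola), the most ancient hexapods^[2], are widely distributed across terrestrial ecosystems^[3]. They are species-rich, minute, and variously colored; together with nematodes and mites they form the three major groups of soil animals and play an important role in soil ecosystems^[4].

Folsomia candida is widely distributed, has a short life cycle, reproduces parthenogenetically at a high rate, and is easy to rear. For nearly half a century it has served as a standard laboratory soil animal for assessing the effects of insecticides and environmental pollutants on soil fauna^[5]. It is widely used internationally to evaluate the hazards that pollutants may pose once they enter terrestrial ecosystems. Toxicological tests examine the accumulation of toxic chemicals in F. candida and their effects on its life history and behavior^[6-9]. Such tests not only assess the toxicity of pollutants but also show that the population dynamics of F. candida can indicate whether remediation of contaminated soil has succeeded^[10].

Springtail genomics has advanced in recent years. To date, the genomes of four species have been published: Orchesella cincta^[11], Folsomia candida^[12], Holacanthella duospinosa^[13], and Sinella curviseta^[14]. The F. candida genome published by Faddeeva-Vakhrusheva et al.^[12] was assembled with Falcon; the assembly is 221.7 Mb in 162 scaffolds, with a scaffold N50 of 6.5 Mb, a longest scaffold of 28.5 Mb, and a GC content of 37.5%. Its BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness against the arthropod dataset (n=1 066) is 97.0%^[15].

As sequencing technology develops, new assembly tools keep emerging, so genome assemblies built from third-generation sequencing data still have considerable room for improvement. Falcon^[16] and Flye^[17] are commonly used assemblers for third-generation PacBio data, and quickmerge^[18] can merge the assemblies produced by different tools to increase contiguity.

The published annotation of the F. candida genome predicted 28 734 protein-coding genes, clearly more than in the other three springtails, Orchesella cincta (20 249), Holacanthella duospinosa (9 911), and Sinella curviseta (23 943), and its BUSCO completeness was 84.0%. The protein-coding gene set may therefore contain redundant or incomplete sequences, and optimizing the annotation procedure could improve annotation quality.

This study used publicly available PacBio (Pacific Biosciences) sequencing data to reassemble and re-annotate the F. candida genome, aiming to improve assembly contiguity and annotation completeness, and further analyzed gene family evolution to provide basic data and new perspectives for hexapod evolution and soil ecotoxicology.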

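1 Materials and methods

1.1 Sequencing data

Raw sequencing data for F. candida were downloaded from the NCBI database: PacBio (SRR2952806), Illumina (SRR2743547), and RNA-seq (SRR935329, SRR921597). All data from this study have been deposited in figshare (https://figshare.com/projects/Re-assemble_and_re-annotate_of_Folsomia_candida_using_public_sequencing_data/71717).

1.2 Genome survey

Raw Illumina reads were quality-controlled with the BBtools v38.22^[19] suite: duplicate reads were removed with clumpify and low-quality regions were trimmed with BBduk. Genome size, heterozygosity, and repeat content were estimated with GenomeScope v1.0.0^[20]. The required k-mer frequency distribution was generated with khist.sh (a BBtools component) using k=17, and the maximum k-mer depth was set to 1 000.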
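To illustrate what a k-mer frequency distribution is, the following minimal Python sketch counts k-mers in a few toy reads and tabulates how many distinct k-mers occur at each depth. It is not the khist.sh implementation: the function name `kmer_histogram` and the toy reads are assumptions made for illustration, k is set to 4 rather than 17 to keep the example small, and forward and reverse-complement k-mers are not merged as dedicated counters typically do.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers observed at that depth}."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of width k along the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    # Collapse per-k-mer counts into a depth histogram.
    return Counter(counts.values())


# Hypothetical toy reads; real input would be the quality-controlled Illumina reads.
toy_reads = ["ACGTACGT", "CGTACGTA"]
hist = kmer_histogram(toy_reads, k=4)
for depth in sorted(hist):
    print(depth, hist[depth])
```

A histogram of this shape (depth in one column, number of distinct k-mers in the other) is the kind of input that genome profiling tools use to estimate genome size, heterozygosity, and repeat content.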

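1.3 Genome and transcriptome assembly

The PacBio reads were first assembled with Flye v2.5. To improve contiguity, the Flye assembly was merged with the published Falcon assembly using quickmerge, and the merged assembly was then polished for two rounds with Pilon v1.22^[21] using the Illumina reads. To detect contamination, the assembly was aligned against UniVec and the NCBI nucleotide (nt) database with BLAST v2.9.0^[22] and HS-BLASTN v0.0.4^[23], and contaminant sequences were removed. Only sequences longer than 1 kb were retained, yielding the new genome assembly (hereafter "the new assembly"). Genome completeness was assessed with BUSCO against the arthropod dataset (n=1 066).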
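The final length filter (retaining only sequences longer than 1 kb) can be sketched in Python as follows. This is an illustrative sketch, not the authors' script: the helper names `read_fasta` and `filter_by_length` and the toy records are assumptions, and the threshold of 1 000 bp is taken from the text.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep records whose sequence is strictly longer than min_len bases."""
    return [(name, seq) for name, seq in records if len(seq) > min_len]


# Hypothetical toy input: one 1 200 bp record and one 800 bp record split over two lines.
toy_fasta = [">a", "ACGT" * 300, ">b", "ACGT" * 100, "ACGT" * 100]
kept = filter_by_length(read_fasta(toy_fasta))
print([name for name, _ in kept])
```

For a real assembly, `read_fasta` could be given an open file handle instead of a list of strings.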

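The transcriptome was assembled with a reference-guided approach: RNA-seq reads were aligned to the new assembly with HISAT2 v2.1.0^[24] and assembled with StringTie v1.3.4^[25], and the resulting transcripts were used for downstream genome annotation.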

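1.4 Genome annotation

Structural annotation began with repeat identification: a custom repeat library was built from RepeatModeler v1.0.11^[26] output and the RepBase database (http://www.girinst.org/repbase), and repeats were identified against this library with RepeatMasker v4.0.9^[27]. Genes were predicted with MAKER v2.31^[28], integrating ab initio predictions, transcript evidence, and protein homology evidence. Ab initio predictions were made with Augustus v3.3^[29] and GeneMark-ET v4.38^[30], whose training parameters were obtained with BRAKER v2.1.0^[31] using the transcriptome data. Transcript evidence was taken from the StringTie assembly. Protein homology evidence consisted of protein sequences downloaded from Ensembl (Daphnia pulex, Acyrthosiphon pisum, Drosophila melanogaster) and the protein-coding gene sets of two springtails (Sinella curviseta, Orchesella cincta).

For functional annotation, homologs were identified by searching the UniProtKB (SwissProt+TrEMBL) database with Diamond v0.9.18^[32], and InterProScan v5.34-73.0^[33] was used to search Pfam^[34], PANTHER^[35], Gene3D^[36], Superfamily^[37], and CDD^[38] for protein domains, gene ontology terms, and pathway annotations. Noncoding RNAs (ncRNAs) were annotated with Infernal v1.1.2^[39] and tRNAscan-SE v2.0^[40].

1.5 Gene family evolution

Gene family evolution was inferred across 11 arthropods: eight hexapods (Folsomia candida, Orchesella cincta, Sinella curviseta, Holacanthella duospinosa, Acyrthosiphon pisum, Drosophila melanogaster, Tribolium castaneum, Zootermopsis nevadensis) and three non-hexapod arthropods (Ixodes scapularis, Strigamia maritima, and Daphnia pulex), with I. scapularis as the outgroup. Orthologous genes were clustered with OrthoFinder v2.2.7^[41] to obtain single-copy orthologs. The single-copy genes were aligned with MAFFT v7.3^[42], and the alignments were trimmed with trimAl v1.4^[43]. A maximum-likelihood phylogenetic tree was built with IQ-TREE v2^[44], divergence times were estimated with r8s^[45], and gene family expansion and contraction were analyzed with CAFE v4.2^[46].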

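2 Results and analysis

2.1 Genome survey

The genome survey estimated a genome size of 255.72-258.35 Mb, heterozygosity of 0.028%-0.134%, and a repeat length of 89.30-90.22 Mb.

2.2 Genome assembly

The Flye assembly of the F. candida genome was 222.80 Mb in 144 scaffolds, with a longest scaffold of 22.06 Mb and a scaffold N50 of 7.03 Mb. After merging with quickmerge to improve contiguity, polishing with Pilon, and removal of contaminants, the final assembly was 221.09 Mb in 113 scaffolds, with a longest scaffold of 30.07 Mb and a scaffold N50 of 13.5 Mb (Table 1).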

Table 1 Comparison of assembly versions of Folsomia candida

Assembly version	Total length/Mb	No. of scaffolds	Scaffold N50 length/kb	Longest scaffold/kb	GC/%	BUSCO C/%	BUSCO D/%	BUSCO F/%	BUSCO M/%
New version
Flye	228.64	144	7 034	22 056	37.50	96.9	1.7	0.7	2.4
Quickmerge	222.15	124	13 530	30 110	37.48	96.5	1.4	0.7	2.8
Pilon	221.85	121	13 500	30 073	37.47	96.5	1.4	0.7	2.8
Final genome assembly	221.09	113	13 500	30 073	37.48	96.5	1.4	0.7	2.8
Transcript assembly	33.60	25 971	1.76	14.03	40.38	87.0	4.5	6.3	6.7
Faddeeva-Vakhrusheva version^[12]
Falcon	221.70	162	6 519	28 530	37.52	97.0	1.9	0.7	2.3
Transcript assembly	29.54	38 016	1.16	6.01	39.94	81.2	5.5	11.4	7.4
Note: BUSCO values were assessed against the arthropod dataset (n=1 066). C: complete and single-copy BUSCO; D: complete and duplicated BUSCO; F: fragmented BUSCO; M: missing BUSCO.
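Scaffold N50, the contiguity measure compared in Table 1, is the length of the scaffold at which the cumulative length of scaffolds sorted from longest to shortest first reaches half of the total assembly length. A minimal Python sketch of this standard definition follows; the function name `n50` and the toy lengths are assumptions for illustration, not the authors' code or data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one positive length")


# Hypothetical toy scaffold lengths (arbitrary units).
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, merging scaffolds raises it even when the total assembly size barely changes, which is the pattern seen between the Flye and quickmerge rows.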


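2.3 Genome annotation

2.3.1 Repeat annotation

RepeatMasker identified 253 665 repeats (49.33 Mb), accounting for 22.3% of the genome. The three most numerous repeat types were simple repeats (50 375; 0.95%), Helitron transposons (9 633; 1.1%), and low-complexity sequences (5 884; 0.12%) (Table 2).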

Table 2 Comparison of two annotation versions of Folsomia candida

Element	New version	Faddeeva-Vakhrusheva version^[12]
Repeats
Repeat segment	49 334 367/22.30	51 606 299/23.28
DNA	9 691 017/4.35	9 462 341/4.27
Long interspersed elements	2 252 705/1.02	1 939 306/0.87
Long terminal repeats	2 421 736/1.09	2 557 166/1.15
Unclassified	30 040 181/13.59	30 170 966/13.60
Simple repeats	2 103 613/0.95	5 080 872/2.29
Short interspersed elements	38 531/0.02	22 158/0.01
Others	2 786 584/1.28	2 373 490/1.07
Protein-coding genes
No. of genes	20 080(69.13 Mb)	28 734(132.62 Mb)
Gene mean length/bp	3 442.64	4 615.44
No. of exons	143 376(38.19 Mb)	197 859(70.64 Mb)
Exon mean length/bp	266.36	357.02
No. of introns	401 165(30.34 Mb)	524 921(61.98 Mb)
Intron mean length/bp	75.63	118.07
Note: For repeats, the number before the symbol "/" is the length in bp and the number after it is the proportion in %. Numbers in parentheses are total lengths in Mb.


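2.3.2 Protein-coding gene annotation

MAKER predicted 20 080 protein-coding genes, with an average of 7.14 exons and 21.03 introns per gene; mean exon and intron lengths were 266.36 and 75.63 bp, respectively. Of these genes, 80.56% were supported by transcriptome evidence and 96% by UniProt proteins, and BUSCO completeness of the predicted gene set against the arthropod dataset (n=1 066) was 92.4%. InterProScan annotated 16 125 genes (80.30%), of which 10 455 had GO (gene ontology) annotations; matches to the KEGG pathway, MetaCyc, and Reactome databases numbered 972, 767, and 3 797, respectively.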
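The per-gene and per-feature averages reported here follow from the counts and total lengths in Table 2. As a minimal sketch, assuming only the new-version values from that table (rounded to 0.01 Mb, so small rounding differences are expected), the arithmetic can be reproduced in Python as follows; the variable names are illustrative.

```python
# New-version values copied from Table 2 (total lengths are rounded to 0.01 Mb).
n_genes = 20080
n_exons = 143376
exon_total_bp = 38.19e6
n_introns = 401165
intron_total_bp = 30.34e6

# Mean number of exons per gene.
exons_per_gene = n_exons / n_genes

# Mean feature length = total length of the features / number of features.
mean_exon_len = exon_total_bp / n_exons
mean_intron_len = intron_total_bp / n_introns

print(f"exons per gene:     {exons_per_gene:.2f}")
print(f"mean exon length:   {mean_exon_len:.2f} bp")
print(f"mean intron length: {mean_intron_len:.2f} bp")
```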

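2.3.3 Noncoding RNA annotation

Among noncoding RNAs (ncRNAs), this study identified 327 transfer RNAs (tRNAs), 68 ribosomal RNAs (rRNAs), 55 small nuclear RNAs (snRNAs), 26 microRNAs (miRNAs), 5 ribozymes, 2 long noncoding RNAs (lncRNAs), and 119 other ncRNAs.

2.4 Gene family evolution

OrthoFinder assigned 67.90% (148 089) of the genes to 13 979 gene families, an average of 10.6 genes per family; 345 of these families experienced rapid evolution (expansion or contraction). A total of 2 670 gene families were present in all 11 species, of which 699 were single-copy families. In F. candida, 75.7% (15 027) of the genes were assigned to 8 876 gene families, of which 48 experienced rapid evolution (Fig. 1). The five largest expanded gene families were zinc finger (235), carboxylesterase (87), SEC14-like proteins (75), DDE superfamily endonuclease (73), and trypsin (64). Zinc finger proteins are transcription factors with broad biological functions that act by binding DNA, RNA, proteins, or small molecules^[47]. Carboxylesterases belong to the α/β hydrolase superfamily; they are important phase I drug-metabolizing enzymes and occur widely across animal species^[48]. SEC14-like proteins are a class of transfer proteins first discovered in Saccharomyces cerevisiae^[49]; they regulate lipid metabolism and provide a permissive membrane environment for Golgi secretion^[50-53]. DDE superfamily endonucleases contain three carboxylate residues that coordinate the metal ions required for catalysis^[54]. Trypsin has high proteolytic activity and cleavage specificity; it hydrolyzes proteins into peptides, which are further broken down into amino acids^[55]. The most strongly expanded gene families of the three collembolans (Sinella curviseta, Orchesella cincta, Folsomia candida) are highly similar and are involved in detoxification and drug metabolism, nucleic acid metabolism, and signal transduction. Gene family evolution plays an important role in the adaptation of springtails to complex soil environments, which helps explain why these three species are widely distributed across the Northern Hemisphere or within particular regions.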
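The distinction between gene families present in all species and single-copy gene families can be made concrete with a small Python sketch. It assumes a per-family table of gene counts per species, similar in spirit to the gene-count output of orthology clustering tools; the function name `classify_families`, the species codes, and the toy counts are all hypothetical and not taken from the study's data.

```python
def classify_families(counts, species):
    """Split gene families into those shared by all species and the single-copy subset.

    counts maps a family ID to a {species: gene count} dictionary.
    """
    # Shared: at least one gene in every species.
    shared = [fam for fam, c in counts.items()
              if all(c.get(sp, 0) >= 1 for sp in species)]
    # Single-copy: exactly one gene in every species.
    single_copy = [fam for fam in shared
                   if all(counts[fam][sp] == 1 for sp in species)]
    return shared, single_copy


# Hypothetical toy data for three species.
species = ["Fcan", "Ocin", "Dmel"]
toy_counts = {
    "OG1": {"Fcan": 1, "Ocin": 1, "Dmel": 1},  # single-copy in every species
    "OG2": {"Fcan": 3, "Ocin": 1, "Dmel": 2},  # shared, but multi-copy
    "OG3": {"Fcan": 2, "Ocin": 0, "Dmel": 1},  # missing from one species
}
shared, single_copy = classify_families(toy_counts, species)
print(shared, single_copy)
```

Single-copy families of this kind are the ones suitable for alignment and phylogenetic tree construction, whereas the count table as a whole is what expansion and contraction analyses operate on.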

	Fig. 1 Phylogenetic tree and gene family evolution. The three numbers at each node are the numbers of expanded (+), contracted (-), and rapidly evolving gene families. The number of gene families in each species is shown after the species name.

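3 Discussion

This study reassembled the F. candida genome. Compared with the assembly of Faddeeva-Vakhrusheva et al.^[12], the new assembly is more contiguous, for the following reason. The original version was assembled with Falcon alone, whereas the new version was first assembled with Flye, and the Flye and Falcon assemblies were then merged in two rounds with quickmerge. Combining the strengths of both assemblers raised scaffold N50 from 6.52 Mb to 13.53 Mb, a marked gain in contiguity^[18]. However, the initial Flye and Falcon assemblies inevitably contain some errors; quickmerge cannot detect or correct them, so they persist into the final assembly and may affect the results.

The genome annotation is also more reasonable. MAKER predicted 20 080 protein-coding genes in the new version, of which 80.56% and 96% were supported by transcriptome and UniProt protein evidence, respectively, and BUSCO completeness of the predicted protein-coding gene set against the arthropod dataset (n=1 066) was 92.4%, higher than the 84% of the original version. Two factors mainly account for the difference between the two annotations. 1) A better transcriptome assembly. The original version used de novo assembly, whereas the new version used reference-guided assembly. Reference-guided assembly performs better: weakly expressed transcripts can be assembled, and gap sequences and UTR regions of transcripts can be predicted^[56]. A de novo transcriptome assembly contains substantial redundancy, which inflates the number of protein-coding genes in the subsequent genome annotation (28 734), possibly because a single coding region is annotated as several isoforms. 2) Protein reference sequences from two closely related species were added. When predicting protein-coding genes, the new version added the proteins of two published springtails (Sinella curviseta and Orchesella cincta); compared with using only three distantly related model organisms (Daphnia pulex, Drosophila melanogaster, Acyrthosiphon pisum), this makes the predicted genes more accurate. The 20 080 genes of the new version are close to the gene counts of the other three published collembolans.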

References

[1]	Yin W Y. A brief review and prospect on soil zoology[J]. Bulletin of Biology, 2001, 36(8): 1-3 (in Chinese with English abstract).

[2]	Hirst S, Maulik S. On some arthropod remains from the Rhynie Chert(Old Red Sandstone)[J]. Geological Magazine, 1926, 63: 69-71. DOI:10.1017/S0016756800083692

[3]	Christiansen K A. Springtails[J]. The Kansas School Naturalist, 1992, 39: 1-16.

[4]	Chen J X, Ma Z C, Yan H J, et al. Roles of springtails in soil ecosystem[J]. Biodiversity Science, 2007, 15(2): 154-161 (in Chinese with English abstract).

[5]	Fountain M T, Hopkin S P. Folsomia candida(Collembola):a "standard" soil arthropod[J]. Annual Review of Entomology, 2005, 50: 201-222. DOI:10.1146/annurev.ento.50.071803.130331

[6]	Fava F, Bertin L. Use of exogenous specialised bacteria in the biological detoxification of a dump site-polychlorobiphenyl-contaminated soil in slurry phase conditions[J]. Biotechnology and Bioengineering, 1999, 64: 240-249. DOI:10.1002/(SICI)1097-0290(19990720)64:2<240::AID-BIT13>3.0.CO;2-F

[7]	Fava F, di Gioia D, Marchetti L. Role of the reactor configuration in the biological detoxification of a dump site polychlorobiphenyl-contaminated soil in lab-scale slurry phase conditions[J]. Applied Microbiology and Biotechnology, 2000, 53: 243-248. DOI:10.1007/s002530050015

[8]	Crouau Y, Gisclard C, Perotti P. The use of Folsomia candida(Collembola:Isotomidae)in bioassays of waste[J]. Applied Soil Ecology, 2002, 19: 65-70. DOI:10.1016/S0929-1393(01)00175-5

[9]	Fava F, Piccolo A. Effects of humic substances on the bioavailability and aerobic biodegradation of polychlorinated biphenyls in a model soil[J]. Biotechnology and Bioengineering, 2002, 77: 204-211. DOI:10.1002/bit.10140

[10]	Lock K, Janssen C R. Effect of new soil metal immobilizing agents on metal toxicity to terrestrial invertebrates[J]. Environmental Pollution, 2003, 121(1): 123-127. DOI:10.1016/S0269-7491(02)00202-6

[11]	Faddeeva-Vakhrusheva A, Derks M F L, Anvar S Y, et al. Gene family evolution reflects adaptation to soil environmental stressors in the genome of the collembolan Orchesella cincta[J]. Genome Biology and Evolution, 2016, 8(7): 2106-2117. DOI:10.1093/gbe/evw134

[12]	Faddeeva-Vakhrusheva A, Kraaijeveld K, Derks M F L, et al. Coping with living in the soil:the genome of the parthenogenetic springtail Folsomia candida[J]. BMC Genomics, 2017, 18: 493. DOI:10.1186/s12864-017-3852-x

[13]	Wu C, Jordan M D, Newcomb R D, et al. Analysis of the genome of the New Zealand giant collembolan(Holacanthella duospinosa)sheds light on hexapod evolution[J]. BMC Genomics, 2017, 18: 795. DOI:10.1186/s12864-017-4197-1

[14]	Zhang F, Ding Y H, Zhou Q S, et al. A high-quality draft genome assembly of Sinella curviseta:a soil model organism(Collembola)[J]. Genome Biology and Evolution, 2019, 11(2): 521-530. DOI:10.1093/gbe/evz013

[15]	Waterhouse R M, Seppey M, Simão F A, et al. BUSCO applications from quality assessments to gene prediction and phylogenomics[J]. Molecular Biology and Evolution, 2018, 35(3): 543-548. DOI:10.1093/molbev/msx319

[16]	Pacific Biosciences. FALCON[EB/OL].[2020-04-26]. https://github.com/PacificBiosciences/FALCON.

[17]	Kolmogorov M, Yuan J, Lin Y, et al. Assembly of long, error-prone reads using repeat graphs[J]. Nature Biotechnology, 2019, 37(5): 540-546. DOI:10.1038/s41587-019-0072-8

[18]	Chakraborty M, Baldwin-Brown J G, Long A D, et al. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage[J]. Nucleic Acids Research, 2016, 44(19): e147.

[19]	Bushnell B, Rood J, Singer E. BBMerge-Accurate paired shotgun read merging via overlap[J]. PLoS One, 2017, 12(10): e0185056. DOI:10.1371/journal.pone.0185056

[20]	Vurture G W, Sedlazeck F J, Nattestad M, et al. GenomeScope:fast reference-free genome profiling from short reads[J]. Bioinformatics, 2017, 33(14): 2202-2204. DOI:10.1093/bioinformatics/btx153

[21]	Walker B J, Abeel T, Shea T, et al. Pilon:an integrated tool for comprehensive microbial variant detection and genome assembly improvement[J]. PLoS One, 2014, 9(11): e112963. DOI:10.1371/journal.pone.0112963

[22]	Altschul S F, Madden T L, Schäffer A A, et al. Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J]. Nucleic Acids Research, 1997, 25: 3389-3402. DOI:10.1093/nar/25.17.3389

[23]	Chen Y, Ye W C, Zhang Y D, et al. High speed BLASTN:an accelerated MegaBLAST search tool[J]. Nucleic Acids Research, 2015, 43(16): 7762-7768. DOI:10.1093/nar/gkv784

[24]	Kim D, Langmead B, Salzberg S L. HISAT:a fast spliced aligner with low memory requirements[J]. Nature Methods, 2015, 12(4): 357-360. DOI:10.1038/nmeth.3317

[25]	Pertea M, Pertea G M, Antonescu C M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads[J]. Nature Biotechnology, 2015, 33(3): 290-295. DOI:10.1038/nbt.3122

[26]	Smit A F A, Hubley R. 2008-2015. RepeatModeler Open-1.0[EB/OL].[2020-01-10]. http://www.repeatmasker.org.

[27]	Smit A F A, Hubley R, Green P. 2013-2015. RepeatMasker Open-4.0[EB/OL].[2019-10-31]. http://www.repeatmasker.org.

[28]	Holt C, Yandell M. MAKER2:an annotation pipeline and genome database management tool for second-generation genome projects[J]. BMC Bioinformatics, 2011, 12: 491. DOI:10.1186/1471-2105-12-491

[29]	Stanke M, Steinkamp R, Waack S, et al. AUGUSTUS:a web server for gene finding in eukaryotes[J]. Nucleic Acids Research, 2004, 32: W309-W312. DOI:10.1093/nar/gkh379

[30]	Lomsadze A, Ter-Hovhannisyan V, Chernoff Y O, et al. Gene identification in novel eukaryotic genomes by self-training algorithm[J]. Nucleic Acids Research, 2005, 33(20): 6494-6506. DOI:10.1093/nar/gki937

[31]	Hoff K J, Lange S, Lomsadze A, et al. BRAKER1:unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS[J]. Bioinformatics, 2016, 32(5): 767-769. DOI:10.1093/bioinformatics/btv661

[32]	Buchfink B, Xie C, Huson D H. Fast and sensitive protein alignment using DIAMOND[J]. Nature Methods, 2015, 12(1): 59-60. DOI:10.1038/nmeth.3176

[33]	Finn R D, Attwood T K, Babbitt P C, et al. InterPro in 2017-beyond protein family and domain annotations[J]. Nucleic Acids Research, 2017, 45(D1): D190-D199. DOI:10.1093/nar/gkw1107

[34]	Finn R D, Bateman A, Clements J, et al. Pfam:the protein families database[J]. Nucleic Acids Research, 2014, 42(D1): D222-D230. DOI:10.1093/nar/gkt1223

[35]	Mi H Y, Huang X S, Muruganujan A, et al. PANTHER version 11:expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements[J]. Nucleic Acids Research, 2017, 45(D1): D183-D189. DOI:10.1093/nar/gkw1138

[36]	Lewis T E, Sillitoe I, Dawson N, et al. Gene3D:extensive prediction of globular domains in proteins[J]. Nucleic Acids Research, 2018, 46(D1): D435-D439. DOI:10.1093/nar/gkx1069

[37]	Wilson D, Pethica R, Zhou Y D, et al. SUPERFAMILY:sophisticated comparative genomics, data mining, visualization and phylogeny[J]. Nucleic Acids Research, 2009, 37(Suppl 1): D380-D386.

[38]	Marchler-Bauer A, Bo Y, Han L Y, et al. CDD/SPARCLE:functional classification of proteins via subfamily domain architectures[J]. Nucleic Acids Research, 2017, 45(D1): D200-D203. DOI:10.1093/nar/gkw1129

[39]	Nawrocki E P, Eddy S R. Infernal 1.1:100-fold faster RNA homology searches[J]. Bioinformatics, 2013, 29(22): 2933-2935. DOI:10.1093/bioinformatics/btt509

[40]	Lowe T M, Eddy S R. tRNAscan-SE:a program for improved detection of transfer RNA genes in genomic sequence[J]. Nucleic Acids Research, 1997, 25(5): 955-964. DOI:10.1093/nar/25.5.955

[41]	Emms D M, Kelly S. OrthoFinder:solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy[J]. Genome Biology, 2015, 16: 157. DOI:10.1186/s13059-015-0721-2

[42]	Katoh K, Standley D M. MAFFT multiple sequence alignment software version 7:improvements in performance and usability[J]. Molecular Biology and Evolution, 2013, 30(4): 772-780. DOI:10.1093/molbev/mst010

[43]	Capella-Gutiérrez S, Silla-Martínez J M, Gabaldón T. trimAl:a tool for automated alignment trimming in large-scale phylogenetic analyses[J]. Bioinformatics, 2009, 25(15): 1972-1973. DOI:10.1093/bioinformatics/btp348

[44]	Chernomor O, von Haeseler A, Minh B Q. Terrace aware data structure for phylogenomic inference from supermatrices[J]. Systematic Biology, 2016, 65(6): 997-1008. DOI:10.1093/sysbio/syw037

[45]	Sanderson M J. r8s:inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock[J]. Bioinformatics, 2003, 19(2): 301-302. DOI:10.1093/bioinformatics/19.2.301

[46]	Han M V, Thomas G W C, Lugo-Martinez J, et al. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3[J]. Molecular Biology and Evolution, 2013, 30(8): 1987-1997. DOI:10.1093/molbev/mst100

[47]	Wang N, Zhang C Y. The research progress of gene editing[J]. International Journal of Genetics, 2016, 39(4): 208-212 (in Chinese with English abstract).

[48]	Qi Q, Wang Y, Mo Y J, et al. Effect of Panax notoginseng saponins on activity of carboxylesterases in vitro[J]. Modern Chinese Medicine, 2019, 21(6): 777-781 (in Chinese with English abstract).

[49]	Bankaitis V A, Malehorn D E, Emr S D, et al. The Saccharomyces cerevisiae SEC14 gene encodes a cytosolic factor that is required for transport of secretory proteins from the yeast Golgi complex[J]. Journal of Cell Biology, 1989, 108: 1271-1281. DOI:10.1083/jcb.108.4.1271

[50]	McGee T P, Skinner H B, Whitters E A, et al. A phosphatidylinositol transfer protein controls the phosphatidylcholine content of yeast Golgi membranes[J]. Journal of Cell Biology, 1994, 124: 273-287. DOI:10.1083/jcb.124.3.273

[51]	Huijbregts R P H, Topalof L, Bankaitis V A. Lipid metabolism and regulation of membrane trafficking[J]. Traffic, 2000, 1(3): 195-202. DOI:10.1034/j.1600-0854.2000.010301.x

[52]	Li X M, Xie Z G, Bankaitis V A. Phosphatidylinositol/phosphatidylcholine transfer proteins in yeast[J]. Biochimica et Biophysica Acta, 2000, 1486(1): 55-71. DOI:10.1016/S1388-1981(00)00048-2

[53]	Xie Z, Fang M, Bankaitis V A. Evidence for an intrinsic toxicity of phosphatidylcholine to Sec14p-dependent protein transport from the yeast Golgi complex[J]. Molecular Biology of the Cell, 2001, 12(4): 1117-1129. DOI:10.1091/mbc.12.4.1117

[54]	Dou T H, Gu S H, Zhou Z X, et al. Note:isolation and characterization of a Jerky and JRK/JH8 like gene, tigger transposable element derived 7, TIGD7[J]. Biochemical Genetics, 2004, 42(7/8): 279-285. DOI:10.1023/B:BIGI.0000034428.95802.35

[55]	Saveliev S. Trypsin/Lys-C mix, a new member of trypsin product line, for enhanced protein mass spectrometry analysis and whole cell yeast and human proteome reference extracts for mass spectrometry method development and instrument performance monitoring[J]. EuPA Open Proteomics, 2014, 2: 63.

[56]	Lu X. A comparison of transcriptome assembly software for next-generation sequencing technologies[D]. Lanzhou: Lanzhou University, 2013 (in Chinese with English abstract).