ddRAD sequencing of 1076 Camellia accessions reveals the genetic diversity and population introgression of the tea plant in China
Xiao-Yan Tong (童小燕)a,1, Jian-Bing Hu (胡健兵)a,1, Hui-Lin Wen (温辉林)a,1, Chun-Lin Chen (陈春林)b,1, Ling-Ling Tao (陶玲玲)a, Jun Wu (吴军)a, Yi-Ping Tian (田易萍)b, Hui-Ping Jiang (蒋会兵)b, Lin-Bo Chen (陈林波)b, Da-He Qiao (乔大河)c, Ming-Tao Shu (舒明涛)a, En-Hua Xia (夏恩华)a, Kun Dong (董坤)a, Yue Fei (费月)a, Sheng-Rui Liu (刘升锐)a, Chao-Ling Wei (韦朝领)a,**, Jun-Yan Zhu (朱俊彦)a,*     
a. State Key Laboratory of Tea Plant Germplasm Innovation and Resource Utilization, Anhui Agricultural University, West 130 Changjiang Road, Hefei 230036, Anhui, China;
b. Yunnan Provincial Key Laboratory of Tea Science, Tea Research Institute, Yunnan Academy of Agricultural Sciences, Kunming 650201, China;
c. Guizhou Tea Research Institute, 1 Jin’nong Road, Guiyang 550006, Guizhou, China
Abstract: The tea plant (Camellia sinensis) is a widely cultivated and economically significant crop that originated from southwestern China. Owing to its high heterozygosity and complex genetic diversity, the evolutionary history of the tea plant and its domestication remain unclear. In this study, we used 1076 Camellia accessions to elucidate the phylogenetic relationships and population structure of Camellia accessions across 16 major tea-producing provinces in China. Six species-specific populations were classified based on these SNPs through phylogenetic trees and population structure. We found that C. taliensis populations have undergone divergent evolution compared to other wild relatives, such as C. kwangsiensis and C. crassicoluma. Population structure and introgression analyses further revealed frequent interspecific genetic exchanges between C. sinensis and its wild relatives. Thus, a transitional C. sinensis population that shared a genetic background with C. taliensis was characterized. Distinct genetic divergence was observed within the two main varieties of C. sinensis. Contrary to the relatively exclusive genetic background within the C. sinensis var. assamica population in southwestern China, extensive regional introgression and gene flow were observed in the C. sinensis var. sinensis population between southwestern and eastern China. Finally, we developed a core collection comprising 160 accessions based on our genetic diversity analyses. Our findings not only provide insights into the evolutionary history and genetic diversity of the tea plant but may facilitate the development of the conservation and utilization of tea plant resources.
Keywords: Tea plant    Population structure    ddRAD-seq    Genetic diversity    Core collection    Introgression    
1. Introduction

The tea plant, Camellia sinensis, is an economically vital crop in many developing countries across Asia, Africa, and Latin America (Xia et al., 2017; Wang et al., 2020). Camellia sect. Thea comprises 12 species and six varieties. C. sinensis, the main cultivated species, contains four varieties: C. sinensis var. sinensis, C. sinensis var. assamica, C. sinensis var. pubilimba, and C. sinensis var. dehungensis. C. taliensis represents the closest wild relative to the cultivated species (Ming, 1992; Zhao et al., 2014). The remarkable phenotypic variation of species within C. sect. Thea has led to considerable controversy regarding its taxonomic circumscription and species delimitation (Zhuang et al., 1981; Zhang et al., 1990; Ming, 1999; Chen et al., 2000; Yang, 2021). Furthermore, tea plant taxonomy is complicated by self-incompatibility and extensive interspecific hybridization, which have produced large genome sizes (~3 Gb) with high heterozygosity (Wei et al., 2018; Xia et al., 2020a; Jiang et al., 2024a; Shen et al., 2025). Although research has provided insights into the phylogeny and domestication of tea plants (Li et al., 2023), elucidating tea plant population introgression and dispersal patterns remains challenging.

In the tea plant, many well-known cultivars, such as ‘Biyun’ and ‘Zhenong 113’, were bred through hybridization between Camellia sinensis var. sinensis and C. sinensis var. assamica (Yang and Liang, 2014). A recent genome re-sequencing study of 190 tea accessions revealed extensive intra- and interspecific introgressions between these two major varieties (Zhang et al., 2021), whereas analysis of 1325 Camellia accessions further indicated a higher gene flow probability between C. taliensis and C. sinensis var. sinensis than with C. sinensis var. assamica. These interspecific introgressed genes may contribute to the biosynthesis of terpenoids and flavonoids, which may enhance tea flavor variation (Kong et al., 2025).

One efficient approach to elucidating both genetic diversity and phylogenetic relationships between tea plants is the use of SNPs (Wei et al., 2018; Xia et al., 2020b). Large-scale sequencing and phylogenetic analyses of Camellia accessions from China, Laos, Russia, Azerbaijan, and Iran have previously revealed three major types of tea plants: C. sinensis var. sinensis, C. sinensis var. assamica, and their wild relatives. Resequencing of additional global tea accessions confirmed these phylogenetic relationships (Wang et al., 2020; Xia et al., 2017, 2020a). Additionally, resequencing of an even greater number of cultivated tea accessions from the Guizhou Plateau revealed that the genetic diversity of cultivated tea germplasm in the Pearl River Basin is significantly higher than that in the Yangtze River Basin (He et al., 2023). However, the degree of genetic diversity in tea plant accessions across the entirety of China remains unknown.

Core collections represent a small subset of the entire germplasm collection, maximizing genetic diversity while minimizing redundancy. These collections enable efficient utilization of tea plant genetic resources and enhance breeding efficiency (Frankel, 1984; Brown, 1989). Core germplasm sets have been developed using SSR and SNP markers for tea plants in various regions, including India, Japan, Russia, and different provinces in China, such as Guizhou, Anhui, Yunnan, and Hunan (Chen et al., 2017; Niu et al., 2020; Samarina et al., 2021; Huang et al., 2022; Tao et al., 2023; Kottawa-Arachchi et al., 2024; Sun et al., 2024). However, a comprehensive core germplasm collection has not yet been established due to a limited understanding of the genetic diversity of tea plants in China.

In this study, we explored the phylogenetic relationships and population structure of 1076 Camellia accessions across China. Our findings uncovered complex patterns of intraspecific and interspecific introgression among the accessions. Furthermore, we developed a China-specific core collection comprising 160 representative accessions, which significantly represents the genetic diversity of tea plants in China. This study provides a vital foundation for elucidating the evolutionary dynamics and domestication history of the tea plant.

2. Materials and methods 2.1. Plant materials

The study dataset was obtained from 1076 Camellia accessions, including 1025 ddRAD-seq accessions from our collection and 51 genome re-sequencing accessions from published data (Table S1) (Zhang et al., 2021). The 1025 Camellia accessions were collected from 16 major tea-producing provinces in China, spanning tropical to temperate climate zones. These accessions were classified based on Ming's taxonomic system of C. sect. Thea, i.e., 746 C. sinensis, 178 C. taliensis, 53 C. gymnogyna, 41 C. tachangensis, 36 C. crassicolumna, 19 C. kwangsiensis, and three C. grandibracteata (Ming, 2000). The C. sinensis accessions were further separated into four varieties: 503 C. sinensis var. sinensis, 187 C. sinensis var. assamica, 41 C. sinensis var. pubilimba, and 15 C. sinensis var. dehungensis. Publicly available C. oleifera sequences were incorporated as an outgroup for phylogenetic analyses (Zhang et al., 2021).

Leaves were collected from tea plants and dried in silica gel, then stored at −20 ℃ (Fig. S1). Genomic DNA was extracted from the leaves using a modified cetyltrimethylammonium bromide method (Aboul-Maaty and Oraby, 2019).

2.2. ddRAD-seq library construction, sequencing, and variant calling

The ddRAD library was constructed following the manufacturer’s protocol. Briefly, genomic DNA (1 μg) was digested with MseI and EcoRI, then ligated to Solexa P1 adapters. Samples with different P1 adapters were pooled and physically sheared to 300–500 bp fragments. DNA fragments of 300–500 bp were recovered after 1% agarose gel electrophoresis. Fragments were end-repaired, A-tailed, ligated with Solexa P2 adapters, and purified. A 5 μL aliquot was used for PCR amplification. The final library of 350–550 bp DNA fragments was gel-purified, recovered, and sequenced on a HiSeq X Ten platform.

Variant calling was conducted as follows: Raw sequencing data, encompassing ddRAD-seq and genome re-sequencing data, underwent quality control with SOAPnuke (v.2.1.0) to remove adapter sequences and low-quality reads (Chen et al., 2018). The clean reads were mapped to the reference genome of the ‘Tieguanyin’ cultivar (C. sinensis var. sinensis) (Zhang et al., 2021) with the BWA-MEM algorithm (v.0.7.17) (Li and Durbin, 2009, 2010). SAMtools (v.1.6) converted the resulting SAM files to sorted BAM files, which were processed and analyzed using SAMtools and BamDeal (Li et al., 2009).

VCF files from ddRAD-seq and genome re-sequencing datasets were produced using GATK HaplotypeCaller (v.4.1.2) (McKenna et al., 2010; He et al., 2013). A high-quality SNP set was produced using BCFtools (v.1.17) with the isec parameter to identify common SNPs between these two datasets (Danecek et al., 2021). This high-quality SNP set underwent quality filtering using GATK parameters: QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0. Finally, VCFtools (v.0.1.16) was applied to retain SNPs with the following parameters: –max-missing 0.8 –maf 0.05 –min-alleles 2 –max-alleles 2 (Danecek et al., 2011).

2.3. Phylogenetic analysis and principal component analysis

Phylogenetic analyses were conducted using the SNPs (Leaché and Oaks, 2017). A Python script was used to convert the SNPs to an aligned FASTA format. Maximum likelihood (ML) tree inference was performed using IQ-TREE (v.1.6.12) (Nguyen et al., 2015). A best-fit substitution model was employed with the -m TESTONLY parameter. Next, an ML tree was constructed using the parameters -bb 1000 -m GTR + F + I + G4 + ASC. The resulting ML tree with bootstrap support values was visualized using the interactive Tree of Life web-based tool (https://itol.embl.de/). Principal component analysis (PCA) was then performed for each genotype using PLINK with default parameters to evaluate the population structure (Chang et al., 2015).

2.4. Population structure and introgression analysis

SNPs were further filtered in PLINK (v.1.90b4.6) using the parameters ‘–indep-pairwise 50 10 0.2 –maf 0.05’ (Chang et al., 2015; Liu et al., 2020). The loci that met these criteria were subsequently analyzed. Ancestral population structure was inferred using ADMIXTURE (v.1.3.0); a range of K values (1–20) was tested (Alexander et al., 2009). The results were visualized using the mapmixture R package to provide a clear representation of the population structure and admixture patterns (Jenkins, 2024).

Introgression analysis was performed using D statistics based on the ABBA-BABA test to infer phylogenetic relationships and detect gene flow among the tested triads (P1, P2, and P3), with Camellia oleifera as the outgroup. ABBA-BABA analysis was performed using Python scripts (https://github.com/simonhmartin/genomics_general/blob/master/ABBABABAwindows). Given the taxa with the relationship between P1, P2, and P3, a D statistic was calculated using a sliding window approach (1000 kb windows; 100 kb steps) and averaged across all genome-wide windows. Under strict bifurcating evolution with no gene flow, D statistics equal zero; positive D statistic values indicate gene flow between P3 and P2, with higher values signifying stronger introgression (Durand et al., 2011).

2.5. Core collection construction and evaluation

A core collection was constructed using the Core Hunter R package (De Beukelaer et al., 2018) based on a genetic distance matrix generated from 1076 Camellia accessions using IQ-TREE (v.1.6.12) (Nguyen et al., 2015). Genetic diversity parameters were calculated for both the core and whole collections using the snpReady R package (Granato et al., 2018) to evaluate the representativeness of the core collection. These parameters included minor allele frequency (MAF), polymorphic information content (PIC), expected heterozygosity (He), and observed heterozygosity (Ho). Genetic diversity between the core and whole collections was then compared using an independent samples t-test in SPSS (v.26.0).

3. Results 3.1. Evaluation and landscape of high-quality SNPs from 1076 Camellia accessions in China

The ddRAD-seq sequencing of all 1025 newly collected accessions yielded 2330.34 Gb of high-quality data. The sequencing quality metrics showed average Q20 and Q30 values of 96.8% and 92.6%, respectively, with an average data size of 2.27 Gb per accession (Fig. 1 and Table S1). Additionally, we incorporated 51 publicly available re-sequencing datasets from the NCBI database; this comprised 1402.2 Gb of sequence data (Zhang et al., 2021). In total, 24,023,180 raw SNPs were identified (Zhang et al., 2021). A total of 384,418 high-quality SNPs was retained for subsequent analyses after quality control. The distribution and density of these high-quality SNPs were then visualized across all 15 chromosomes. The distribution of SNPs varied significantly across chromosomes, with the highest density on chromosome 1 and the lowest on chromosome 15. The Ho values of SNPs ranged from 0.21 to 0.24, whereas no significant difference was observed in PIC values (Table S2). In addition, a low-density region was observed on chromosome 7 (Fig. S2).

Fig. 1 Taxonomic classification and geographical distribution of 1076 Camellia accessions. Circle size indicates population number. The inner circle shows germplasm types; the outer circle represents proportions of cultivated versus wild species.
3.2. Phylogenetic and principal component analyses of 1076 Camellia accessions

To investigate the phylogenetic relationships among 1076 Camellia accessions in China, we constructed an ML tree based on 384,418 high-quality SNPs, with C. oleifera as the outgroup. We classified these accessions into six main clades based on their phylogenetic relationships and characterized the dominant germplasms in each clade: wild relative 1 (WR1, 11.5%), wild relative 2 (WR2, 13.4%), transitional C. sinensis (TCS, 6.9%), C. sinensis var. assamica (CSA, 15.9%), C. sinensis var. pubilimba (CSP, 4.3%), and C. sinensis var. sinensis (CSS, 48%) clades (percentages represent proportions of total germplasms; Fig. 2a).

Fig. 2 Phylogenetic relationships of 1076 Camellia accessions in China. (a) Maximum likelihood (ML) tree showing phylogenetic relationships among 1076 Camellia accessions. Branch colors indicate different species, and letters represent abbreviations of dominant species in each clade. (b) Principal component analysis (PCA) of the 1076 Camellia accessions in China.

The WR1 clade formed the basal clade in the phylogenetic tree, containing accessions from Camellia gymnogyna (52), C. tachangensis (40), C. kwangsiensis (19), C. crassicolumna (8), and C. taliensis (5). The WR2 clade was distinct, predominantly consisting of 79.2% C. taliensis from Yunnan Province (Table S3). These findings suggest a divergent phylogenetic relationship between C. taliensis and other wild species.The TCS clade was identified as being between wild relatives and varieties of C. sinensis, with the closest relationship with the CSA clade. Additionally, the 74 accessions of the TCS clade were mainly from Yunnan province and included five species: 48 C. taliensis, 21 C. sinensis var. assamica, three C. grandibracteata, one each of C. crassicolumna and C. gymnogyna (Table S3).

Within different clades of varieties of C. sinensis, 83.6% of the C. sinensis var. assamica accessions from Yunnan province were classified into the CSA clade (Fig. 2a). Similarly, C. sinensis var. pubilimba accessions from Yunnan and Guizhou provinces formed the CSP clade, while those from Hunan province clustered in the CSS clade (Fig. 2a and Table S3). Conversely, the CSS species exhibited a broader geographical distribution and characteristics. According to phylogenetic relationships, the CSS clade formed three subclades. For instance, the accessions from Jiangxi Province were observed to be present in all three subclades of the CSS clade, while 97.1% of those from Fujian province were included in the CSS-Ⅲ subclade (Fig. S3).

The PCA results of 1076 Camellia accessions were consistent with their ML phylogenetic tree. The first two principal components accounted for 38.74% and 26.97% of the total genetic variation (Fig. 2b). PC1 effectively distinguished wild relatives and C. sinensis, with the TCS group clustered between wild relatives and C. sinensis. PC2 further indicated genetic divergence between the CSA and CSS species, whereas a closer phylogenetic relationship was observed between the CSP and CSS species.

3.3. Population structure and introgression analyses revealed the genetic interflow of Camellia accessions in China

To further investigate the population structure of the 1076 Camellia accessions, ADMIXTURE analysis was performed to calculate the K value based on a dataset comprising 384,418 high-quality SNPs. At K = 3, 1076 Camellia accessions were classified into three major groups: wild relatives (WR), CSA, and CSS groups (Fig. S4). K = 10 was considered the optimal parameter for the population structure analysis of 1076 Camellia accessions. The resulting analysis indicated that the WR group could be divided into G1 and G2 groups, with accessions primarily distributed in southwestern China. Additionally, C. sinensis var. assamica accessions were predominantly found in the G3 group. C. sinensis var. pubilimba accessions were classified into the G4 and G5 groups, which were distributed across the provinces of Yunnan and Guizhou. In contrast, five different groups (G6–G10) were collectively identified within the CSS accessions in various regions from southwestern to eastern China. Interspecific introgression was observed among the G6–G10 groups, with distinct distribution preferences. For instance, 74.2% of G6 accessions were identified from Hunan, whereas the G7 group consisted of 97.6% accessions from Guangdong and Jiangxi. In addition, the G8 and G9 groups exhibited a distinct distribution pattern, with the accessions derived from southwestern and central (Sichuan, Guizhou, Jiangxi, Hunan, and Hubei), as well as eastern China (Anhui and Fujian), respectively (Fig. 3a and Table S3).

Fig. 3 Genetic introgression analysis of 1076 Camellia accessions in China. (a) Population structure analysis of 1076 accessions at K = 6–10. Letters indicate abbreviations of the dominant species group. Pie charts indicate the geographical distribution and percentage of each group. (b) Population structure of 419 accessions from Yunnan Province at K = 10. Pie charts illustrate the relative proportions of genetic clusters at each sampling location. Bar plots display individual-level admixture structure across seven regions, with different colors representing G1–G10 genetic ancestral components (color codes are the same as Fig. 3a). (c) D statistics of triplet combinations among six populations using ABBA-BABA test. (d) Common and unique introgression areas of the CSS, CSA, and CSP populations. The horizontal axis indicates the common and unique introgression regions across three populations, and the vertical axis represents the region size (Mb). (e) Distribution of CSS, CSA, and CSP population introgression regions on chromosomes.

As genetic diversity analysis showed that Yunnan accessions were widely distributed across multiple groups (G1–G5, G8 and G9) (Fig. 3a and Table S3), we further examined the genetic composition of accessions from seven regions in Yunnan Province using ADMIXTURE analysis (K = 10) (each region with > 20 samples). The population structure of seven regional groups indicated that the main contributors to the genetic diversity of Camellia accessions in Wenshan and Honghe were the wild relatives and C. sinensis var. pubilimba (Fig. 3b). In contrast, in Pu'er and Xishuangbanna genetic introgression was observed between C. taliensis and C. sinensis var. assamica (Fig. 3b and Table S3). The accessions from Lincang exhibited diverse genetic backgrounds, of which 62.3% showed genetic backgrounds from C. taliensis, C. sinensis var. assamica, and C. sinensis var. sinensis (Fig. 3b and Table S3). Moreover, low genetic diversity was observed in accessions from Baoshan, with the exclusive genetic background of C. taliensis (Fig. 3b).

We detected introgression signals among different tea plant populations by applying the ABBA-BABA test to 1076 Camellia accessions to calculate their D statistics. TCS populations not only exhibited introgression with the WR2 population (D statistics 0.0975) but was also significantly introgressed with CSA, CSS, and CSP populations (D statistics from 0.4001 to 0.4820), with the highest introgression observed between TCS and CSA populations (Fig. 3c). Additionally, significant genetic exchange was detected among the CSA, CSS, and CSP populations (D statistics 0.4431–0.5107); this was particularly pronounced between CSA and CSP populations (Fig. 3c). We also evaluated the shared and unique introgressed regions among different C. sinensis. The TCS population shared 1638.6 Mb of introgressed regions with the CSA, CSS, and CSP populations, whereas the CSS population had the largest unique introgressed region (59.8 Mb). This suggests that the CSS population was divergent from CSA and CSP populations (Fig. 3d). Moreover, these shared introgressed regions between the TCS population and C. sinensis varieties were distributed across all chromosomes (Fig. 3e).

3.4. Genetic variation and population differentiation of CSS accessions in China

CSS accessions are widely distributed and cultivated in China (Ming, 2000; Wei et al., 2018), Thus, to assess the genetic diversity of CSS accessions across China, we analyzed 465 CSS accessions from nine provinces (each province with > 20 samples) (Table S3). CSS accessions from Guangdong (MAF = 0.2617, PIC = 0.2892) and Sichuan (MAF = 0.2613, PIC = 0.2897) provinces had significantly higher allelic polymorphism levels than those from other regions (Table 1). In contrast, those of the Hunan (MAF = 0.2453, PIC = 0.2798) and Fujian (MAF = 0.2446, PIC = 0.2799) provinces had the lowest allelic polymorphism. FST and GD values ranged from 0.018 to 0.172 and 0.014 to 0.072, respectively (Fig. S6). Additionally, CSS accessions from Guangdong exhibited considerable genetic differentiation compared to those from other regions, with FST and GD values from 0.059 to 0.172 and 0.03 to 0.072, respectively (Fig. S6).

Table 1 Genetic parameters of the CSS population across nine provinces of China.
Region Number MAF PIC He Ho
Guangdong 33 0.2617a 0.2892a 0.3594a 0.309e
Sichuan 47 0.2613a 0.2897a 0.3601a 0.3298b
Anhui 90 0.2572b 0.2868b 0.356b 0.3453a
Hubei 28 0.256b 0.2858c 0.3553b 0.3269c
Guangxi 20 0.2522c 0.283d 0.3518c 0.2874g
Guizhou 38 0.2492d 0.2824de 0.3484d 0.3202d
Jiangxi 71 0.248e 0.2821e 0.348d 0.2961f
Hunan 70 0.2453f 0.2798f 0.3447e 0.2878g
Fujian 68 0.2446f 0.2799f 0.3447e 0.3084e
Abbreviations: MAF, minor allele frequency; PIC, polymorphism information content; He, expected heterozygosity; Ho, observed heterozygosity. Different lowercase letters (a, b, c, etc.) indicate significant differences among provinces (P < 0.05), as determined through ANOVA with Duncan's multiple range test. Data are expressed as the mean values of each parameter within provinces.
3.5. Construction and evaluation of core collection for Camellia accessions in China

We further constructed a core collection using the Core Hunter R package to efficiently evaluate the genetic diversity of tea plant germplasms in China. Upon completion of the whole collection of 1076 Camellia accessions, we established a core collection comprising 160 Camellia accessions (14.87% of the whole collection) based on their GD comparisons (Table 2). A total of 27.5% of the core collection consists of germplasm resources derived from Yunnan Province (Table S4), confirming the high genetic diversity of tea plants in this province. The core collection contains diverse tea plant germplasms, including C. sinensis var. sinensis, C. sinensis var. assamica, and C. taliensis, which account for 70%, 15%, and 7.5%, respectively (Table S4). Statistical analysis of classical genetic parameters showed no significant differences between the whole and core collections (P > 0.05). Phylogenetic analysis revealed a similar topological structure to the whole collection, in which 160 tea accessions were classified into six taxa, consistent with the phylogenetic tree of the whole collection (Fig. 4a). Furthermore, PCA results demonstrated the high coverage of core collection accessions with those of the whole collection (Fig. 4b). These results confirm that the core collection established a set of representative and effective germplasms with considerable genetic diversity.

Table 2 Genetic parameters of the whole and core collection.
Collection Number Percentage MAF PIC He Ho
Whole 1076 100% 0.224a 0.266a 0.324a 0.323a
Core 160 14.87% 0.219a 0.263a 0.318a 0.328a
Note: No significant differences in genetic parameters were observed between the two collections, as determined based on an independent samples t-test using SPSS software (P > 0.05). Abbreviations: Same as in Table 1.

Fig. 4 Representative evaluation of the tea plant core collection in China. (a) Phylogenetic tree of 160 Camellia accessions from the core collection. (b) Comparison of PCA results of the core and whole collections. Lighter and darker dots indicate the whole and core collections, respectively.
4. Discussion

The tea plant (Camellia sinensis), among the earliest documented tree crops in China, has a domestication history exceeding 3000 years. Although China is globally recognized as the primary center of origin for the tea plant (Yamanishi, 1995; Meegahakumbura et al., 2017), previous studies have largely focused on specific regions or cultivars, lacking a systematic investigation of local tea germplasm resources across the country (Xia et al., 2020a; Zhang et al., 2021; Tao et al., 2023; Jiang et al., 2024b; Sun et al., 2024). Here, we conducted a systematic analysis of local germplasm resources from major tea-producing provinces in China, encompassing 1076 Camellia accessions. We obtained 384,418 high-quality SNPs for in-depth analysis of phylogenetic relationships, population structure, population introgression, and genetic differentiation. Additionally, we established a China-specific core collection comprising 160 representative accessions that capture the major genetic variations in our collected accessions. This study provides novel insights and crucial data for understanding the origin and domestication of the tea plant in China.

4.1. Regional population introgression contributed to the increased genetic diversity of the tea plant

Our study revealed complex interspecific hybridization and gene flow among species of Camellia sect. Thea. This finding is consistent with previous studies that showed that in Yunnan approximately 10% of C. sinensis var. assamica have hybridized with C. taliensis (Li et al., 2024). Similarly, studies have identified introgression between C. taliensis and C. sinensis var. assamica in seven Baiyingshan cultivated accessions (Mao et al., 2021). One explanation for extensive hybridization between tea plant accessions is that the self-incompatibility system in tea plants promotes gene flow among different populations (Zhang et al., 2021). In addition, hybridization between the tea plant and its wild relatives has been found to be associated with agronomic trait formation and environmental adaptation (Chen et al., 2023; Duan et al., 2024; Kong et al., 2025).

We observed hybridization between Camellia taliensis and other wild relatives (C. kwangsiensis and C. crassicolumna) in southeastern Yunnan (Honghe and Wenshan). In the southwestern Yunnan (Lincang), a transitional C. sinensis population was detected (Fig. 3b). Gene introgression analysis revealed genetic exchange between this population and the WR2, CSA, and CSS populations, highlighting the role of genetic intermediates in tea plant domestication (Fig. 3c). Recently, a cultivated ancient C. sinensis representative was determined to have a close phylogenetic relationship to and similar population structure as CSA and its wild relative (Kong et al., 2025). This phylogenetic pattern is similar to the introgressive genetic background of TCS accessions found in our study (Fig. 3a), suggesting that this population was derived from natural hybridization between C. sinensis var. assamica and its wild relatives, such as C. taliensis.

4.2. Germplasm exchange and breeding activities govern modern tea cultivar development

Human activities have profoundly shaped the evolutionary trajectory of tea plants. Population analysis has previously shown that the ‘Fudingdabaicha’ cultivar served as a critical parent that hybridized with numerous accessions to generate new tea cultivars, such as ‘Zhenong 117’ (Zhang et al., 2020). In our study, we also found remarkable artificial hybrid-mediated genetic introgression patterns among different accessions and populations. For example, ‘Fuyun 6’ (FJP2) was developed through artificial hybridization between CSS and CSA species; this hybrid exhibited a genetic background primarily comprising G3, G9, and G10 groups (Table S3). Moreover, phylogenetic analysis revealed that CSS accessions from Jiangxi Province were distributed across three CSS subclades (Fig. S3). Three distinct groups (G7, G8, and G9) were identified in the population structure analysis from Jiangxi Province (Fig. 3a). However, the population of Jiangxi accessions exhibited relatively low genetic diversity (MAF = 0.248; PIC = 0.2821) (Table 1). This inconsistency between phylogenetic relationship and genetic diversity may be attributed to the historical germplasm introductions and cultivation of Jiangxi in adjacent provinces (Jiang et al., 2015).

4.3. Construction and application of the core collection of the tea plant

The establishment of core collections is a crucial strategy in germplasm management, as it captures the genetic diversity of an entire germplasm repository through the selection of representative samples (Frankel and Brown, 1984; Brown, 1989). This approach not only facilitates the evaluation of genetic potential but also promotes efficient conservation and utilization of germplasm resources (Gu et al., 2023). In recent years, high-throughput sequencing technologies, such as ddRAD-seq and genome re-sequencing, have emerged as efficient and cost-effective methods for establishing core collections of Camellia accessions in China. These strategies have surpassed traditional approaches that are based on morphological and metabolite analyses (Baird et al., 2008; Elshire et al., 2011; Xia et al., 2020b).

In this study, we constructed a core collection using ddRAD-seq technology. Our core collection encompassed 160 Camellia accessions that represent 14.87% of the whole collection. Our retention rate was lower than that of other existing core collections in China. Furthermore, 27.5% of the accessions were collected from Yunnan in the core collection. The high proportion of core collection in Yunnan may be attributed to the numerous clone accessions collected in this study, as well as the extensive range of wild species and cultivated taxa within C. sect. Thea from Yunnan. The establishment of this core collection supports molecular marker-assisted breeding, genome-wide association studies (GWAS), and the conservation and utilization of germplasm resources (Hyun et al., 2020; Niu et al., 2020). For instance, GWAS based on the core germplasm resources identified nine SNPs associated with leaf size traits of the tea plant (Niu et al., 2020). However, due to variations in sample sources and sequencing approaches, phylogenetic divergence was observed in some germplasms in this study, a finding consistent with previous studies (Jiang et al., 2024a; Tong et al., 2024; Kong et al., 2025). For instance, genome re-sequencing of 1325 accessions indicated that 25 CSP germplasms were commonly nested in the CSS and CSA clades, while 27 CSP accessions from Yunnan and Guizhou provinces formed a specific clade that differentiated from the CSA and CSS clades (Fig. 2a and Table S3). Therefore, a comprehensive genetic analysis strategy constructed through expanded sample sources and integrating multiple methodologies would be beneficial in elucidating the complex phylogenetic relationships between C. sinensis and its wild relatives.

In summary, we analyzed 1076 representative Camellia accessions from China to elucidate their phylogenetic relationships and patterns of genetic diversity. Population structure and introgression analyses revealed frequent interspecific and intraspecific introgressions among Camellia accessions in China, contributing to their abundant genetic diversity. Our findings provide additional evidence for gene exchange in tea and enrich the genetic landscape of Camellia accessions. Additionally, we established a China-specific core collection that exhibits extensive genetic diversity, providing valuable resources for future tea breeding programs and conservation efforts.

Acknowledgments

We thank Saijun Li, Zheng Liu, and Feiyi Huang (Tea Research Institute, Hunan Academy of Agricultural Sciences); Puxiang Yang, Zhihui Wang, and Hua Peng (Jiangxi Sericulture and Tea Research Institute); Xiaofang Jin and Linlong Ma (Fruit and Tea Research Institute, Hubei Academy of Agricultural Sciences); Xiuju Qin and Huiqun Deng (Guangxi Institute of Tea Science and Research); Junming Hu (Agricultural Resource and Environment Research Institute, Guangxi Academy of Agricultural Sciences); Qian Tang (College of Horticulture, Sichuan Agricultural University); and Ming Deng (Tea Research Institute, Chongqing Academy of Agricultural Sciences) for their contribution to the material collection. This study was funded by the National Natural Science Foundation of China (grant no. U20A2045) and the Science and Technology Project of Yunnan Province (grant no. 202102AE090038).

CRediT authorship contribution statement

Xiaoyan Tong: Writing – original draft, Data curation, Formal analysis. Jianbing Hu: Data curation, Methodology. Huilin Wen: Data curation, Investigation. Chunlin Chen: Investigation, Conceptualization. Lingling Tao: Investigation, Data curation. Jun Wu: Formal analysis. Yiping Tian: Formal analysis. Huibing Jiang: Investigation, Data curation. Linbo Chen: Investigation, Data curation. Dahe Qiao: Investigation, Data curation. Mingtao Shu: Investigation, Data curation. Enhua Xia: Conceptualization. Kun Dong: Conceptualization. Yue Fei: Conceptualization. Shengrui Liu: Conceptualization. Chaoling Wei: Writing – review & editing, Conceptualization, Supervision. Junyan Zhu: Writing – review & editing, Conceptualization, Supervision.

Data availability statement

Raw reads from the sequencing data are available at the NCBI Sequence Read Archive (Project numbers: PRJNA1157005, PRJNA1200250, PRJNA665594, and PRJNA1198293).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2025.09.002.

References
Aboul-Maaty, N.A.F., Oraby, H.A.S., 2019. Extraction of high-quality genomic DNA from different plant orders applying a modified CTAB-based method. Bull. Natl. Res. Cent., 43: 25. DOI:10.1186/s42269-019-0066-1
Alexander, D.H., Novembre, J., Lange, K., 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res., 19: 1655-1664. DOI:10.1101/gr.094052.109
Baird, N.A., Etter, P.D., Atwood, T.S., et al., 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, 3: e3376. DOI:10.1371/journal.pone.0003376
Brown, A.H.D., 1989. Core collections: a practical approach to genetic resources management. Genome, 31: 818-824. DOI:10.1139/g89-144
Chang, C.C., Chow, C.C., Tellier, L.C., et al., 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4: 7.
Chen, L., Yu, F.L., Tong, Q.Q., 2000. Discussions on phylogenetic classification and evolution of sect. Thea. J. Tea Sci., 20: 89-94.
Chen, R., Hara, T., Ohsawa, R., et al., 2017. Analysis of genetic diversity of rapeseed genetic resources in Japan and core collection construction. Breed. Sci., 67: 239-247. DOI:10.1270/jsbbs.16192
Chen, S., Wang, P.J., Kong, W.L., et al., 2023. Gene mining and genomics-assisted breeding empowered by the pangenome of tea plant Camellia sinensis. Nat. Plants, 9: 1986-1999. DOI:10.1038/s41477-023-01565-z
Chen, Y.X., Chen, Y.S., Shi, C.M., et al., 2018. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience, 7: gix120. DOI:10.1093/gigascience/giy120
Danecek, P., Auton, A., Abecasis, G., et al., 2011. The variant call format and VCFtools. Bioinformatics, 27: 2156-2158. DOI:10.1093/bioinformatics/btr330
Danecek, P., Bonfield, J.K., Liddle, J., et al., 2021. Twelve years of SAMtools and BCFtools. GigaScience, 10: giab008. DOI:10.1093/gigascience/giab008
De Beukelaer, H., Davenport, G.F., Fack, V., 2018. Core Hunter 3, flexible core subset selection. BMC Bioinformatics, 19: 203. DOI:10.1186/s12859-018-2209-z
Duan, S., Yan, L., Shen, Z., et al., 2024. Genomic analyses of agronomic traits in tea plants and related Camellia species. Front. Plant Sci., 15: 1449006. DOI:10.3389/fpls.2024.1449006
Durand, E.Y., Patterson, N., Reich, D., et al., 2011. Testing for ancient admixture between closely related populations. Mol. Biol. Evol., 28: 2239-2252. DOI:10.1093/molbev/msr048
Elshire, R.J., Glaubitz, J.C., Sun, Q., et al., 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One, 6: e19379. DOI:10.1371/journal.pone.0019379
Frankel, O.H., 1984. Genetic Perspectives of Plant Germplasm Conservation. Genetic Manipulation: Impact on Man and Society. Cambridge: Cambridge University Press: pp. 161-170.
Frankel, O.H., Brown, A.H.D., 1984. Plant genetic resources today, a critical appraisal. International Conference of Genetics: pp. 241-256.
Granato, I.S., Galli, G., de Oliveira Couto, E.G., et al., 2018. snpReady: a tool to assist breeders in genomic analysis. Mol. Breed., 38: 102. DOI:10.1007/s11032-018-0844-8
Gu, R., Fan, S.H., Wei, S.P., et al., 2023. Developments on core collections of plant genetic resources: do we know enough?. Forests, 14: 926. DOI:10.3390/f14050926
He, L.M., Luo, J., Niu, S.Z., et al., 2023. Population structure analysis to explore genetic diversity and geographical distribution characteristics of wild tea plant in Guizhou Plateau. BMC Plant Biol., 23: 255. DOI:10.1186/s12870-023-04239-2
He, W., Zhao, S., Liu, X., et al., 2013. ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysis. Genet. Mol. Res., 12: 6275-6283. DOI:10.4238/2013.December.4.15
Huang, F.Y., Duan, J.H., Lei, Y., et al., 2022. Genetic diversity, population structure and core collection analysis of Hunan tea plant germplasm through genotyping-by-sequencing. Beverage Plant Res., 2: 1-7. DOI:10.48130/bpr-2022-0005
Hyun, D.Y., Gi, G.Y., Raveendar, S., et al., 2020. Utilization of phytochemical and molecular diversity to develop a target-oriented core collection in tea germplasm. Agronomy, 10: 1667. DOI:10.3390/agronomy10111667
Jenkins, T.L., 2024. mapmixture, an R package and web app for spatial visualisation of admixture and population structure. Mol. Ecol. Resour., 24: e13943. DOI:10.1111/1755-0998.13943
Jiang, X.F., Li, W.J., Yang, P.X., et al., 2015. Preliminary report on introduction experiment of new tea cultivars in Jiangxi province. J. South. Agric., 46: 1856-1861.
Jiang, Y., Yang, J., Folk, R.A., et al., 2024a. Species delimitation of tea plants (Camellia sect. Thea) based on super-barcodes. BMC Plant Biol., 24: 181. DOI:10.1007/978-3-031-56241-9_12
Jiang, L.L., Xie, S.Y., Zhou, C.Z., et al., 2024b. Analysis of the genetic diversity in tea plant germplasm in Fujian Province based on restriction site-associated DNA sequencing. Plants, 13: 100. DOI:10.62517/jbm.202409215
Kong, W.L., Kong, X.R., Xia, Z.Q., et al., 2025. Genomic analysis of 1,325 Camellia accessions sheds light on agronomic and metabolic traits for tea plant improvement. Nat. Genet., 57: 997-1007. DOI:10.1038/s41588-025-02135-z
Kottawa-Arachchi, J.D., Ranatunga, M.A.B., Sharma, R.K., et al., 2024. Morpho-molecular genetic diversity and population structure analysis to enrich core collections in tea [Camellia sinensis (L.) O. Kuntze] germplasm of Sri Lanka and India. Genet. Resour. Crop Evol., 71: 2597-2616. DOI:10.1007/s10722-023-01792-5
Leaché, A.D., Oaks, J.R., 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu. Rev. Ecol. Evol. Syst., 48: 69-84. DOI:10.1146/annurev-ecolsys-110316-022645
Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25: 1754-1760. DOI:10.1093/bioinformatics/btp324
Li, H., Handsaker, B.H., Wysoker, A., et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25: 2078-2079. DOI:10.1093/bioinformatics/btp352
Li, H., Durbin, R., 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26: 589-595. DOI:10.1093/bioinformatics/btp698
Li, H.Z., Song, K.K., Zhang, X.H., et al., 2023. Application of multi-perspectives in tea breeding and the main directions. Int. J. Mol. Sci., 24: 12643. DOI:10.3390/ijms241612643
Li, M.M., Meegahakumbura, M.K., Wambulwa, M.C., et al., 2024. Genetic analyses of ancient tea trees provide insights into the breeding history and dissemination of Chinese Assam tea (Camellia sinensis var. assamica). Plant Divers., 46: 229-237. DOI:10.1016/j.pld.2023.06.002
Liu, C.C., Shringarpure, S., Lange, K., et al., 2020. Exploring population structure with admixture models and principal component analysis. In: Dutheil, Julien Y. (Ed.), Statistical Population Genomics. Methods Mol. Biol., 2090, pp. 67–86.
Mao, J., Jiang, H., Bing, Y.R., et al., 2021. Genetic diversity and population structure of wild and cultivated Camellia taliensis populations. J. Tea Sci., 41: 454-462.
McKenna, A., Hanna, M., Banks, E., et al., 2010. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20: 1297-1303. DOI:10.1101/gr.107524.110
Meegahakumbura, M.K., Wambulwa, M.C., Li, M.M., et al., 2017. Domestication origin and breeding history of the tea plant (Camellia sinensis) in China and India based on nuclear microsatellites and cpDNA sequence data. Front. Plant Sci., 8: 2270.
Ming, T.L., 1992. A revision of Camellia sect. Thea. Acta Bot. Yunnan., 14: 115-132.
Ming, T.L., 1999. A systematic synopsis of the genus Camellia. Acta Bot. Yunnan., 21: 149-159.
Ming, T.L., 2000. Monograph of the Genus Camellia. Kunming: Yunnan Science and Technology Press.
Nguyen, L.T., Schmidt, H.A., von Haeseler, A., et al., 2015. IQ-TREE, a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol., 32: 268-274. DOI:10.1093/molbev/msu300
Niu, S.Z., Koiwa, H., Song, Q.F., et al., 2020. Development of core-collections for Guizhou tea genetic resources and GWAS of leaf size using SNP developed by genotyping-by-sequencing. PeerJ, 8: e8572. DOI:10.7717/peerj.8572
Samarina, L.S., Matskiv, A.O., Shkhalakhova, R.M., et al., 2021. Genetic diversity in the Russian genebank collection of tea plant [Camellia sinensis (L.) O. Kuntze]. Front. Plant Sci., 12: 801141.
Shen, Z., Feng, Y., Möller, M., et al., 2025. Genomic DNA barcodes provide novel insights into species delimitation in the complex Camellia sect. Thea (Theaceae). BMC Plant Biol., 25: 570. DOI:10.1186/s12870-025-06612-9
Sun, W.H., Chen, C.L., Xu, L.L., et al., 2024. Genetic diversity analysis and core collection construction of tea plant from the Yunnan Province of China using ddRAD sequencing. BMC Plant Biol., 24: 1163. DOI:10.1186/s12870-024-05821-y
Tao, L.L., Ting, Y.J., Chen, H.R., et al., 2023. Core collection construction of tea plant germplasm in Anhui Province based on genetic diversity analysis using simple sequence repeat markers. J. Integr. Agric., 22: 2576-2592.
Tong, W., Wang, Y.L., Li, F.D., et al., 2024. Genomic variation of 363 diverse tea accessions unveils the genetic diversity, domestication, and structural variations associated with tea adaptation. J. Integr. Plant Biol., 66: 2175-2190. DOI:10.1111/jipb.13737
Wang, X.C., Feng, H., Chang, Y.X., et al., 2020. Population sequencing enhances understanding of tea plant evolution. Nat. Commun., 11: 4447. DOI:10.1038/s41467-020-18228-8
Wei, L.C., Yang, H., Wang, S.B., et al., 2018. Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality. Proc. Natl. Acad. Sci. U.S.A., 115: E4151-E4158.
Xia, E.H., Tong, W., Hou, Y., et al., 2020a. The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Mol. Plant, 13: 1013-1026. DOI:10.1016/j.molp.2020.04.010
Xia, E.H., Tong, W., Wu, Q., et al., 2020b. Tea plant genomics: achievements, challenges and perspectives. Hortic. Res., 7: 7.
Xia, E.H., Zhang, H.B., Sheng, J., et al., 2017. The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Mol. Plant, 10: 866-877. DOI:10.1016/j.molp.2017.04.002
Yamanishi, T., 1995. Special issue tea. Food Rev. Int., 11: 477-505. DOI:10.1080/87559129509541056
Yang, S.X., 2021. Thinking on the taxonomy of Camellia sect. Thea. J. Tea Sci., 41: 439-453.
Yang, Y.J., Liang, Y.R., 2014. Flora of Clonal Tea Plants in China. Shanghai: Shanghai Sci. Tech. Publ..
Zhang, F.C., Ding, W.R., Huang, Y., et al., 1990. Three new species of the genus Camellia from Yunnan. Acta Bot. Yunnan., 12: 31-34.
Zhang, W.Y., Zhang, Y.J., Qiu, H.J., et al., 2020. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nat. Commun., 11: 3719.
Zhang, X.T., Chen, S., Shi, L.Q., et al., 2021. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nat. Genet., 53: 1250-1259. DOI:10.1038/s41588-021-00895-y
Zhao, D.W., Yang, J.B., Yang, S.X., et al., 2014. Genetic diversity and domestication origin of tea plant Camellia taliensis (Theaceae) as revealed by microsatellite markers. BMC Plant Biol., 14: 14. DOI:10.1186/1471-2229-14-14
Zhuang, W.F., Liu, Z.S., Chen, W.H., 1981. A version of the classification of the tea varieties. J. Zhejiang Agric. Univ., 7: 41-48.