Skmer approach improves species discrimination in taxonomically problematic genus Schima (Theaceae)
Han-Ning Duan a,b, Yin-Zi Jiang c, Jun-Bo Yang d, Jie Cai d, Jian-Li Zhao e, Lu Li b,**, Xiang-Qin Yu a,*
a. CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, Yunnan, PR China;
b. College of Forestry, Southwest Forestry University, Kunming 650224, Yunnan, PR China;
c. Genomics and Genetic Engineering Laboratory of Ornamental Plants, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, Zhejiang, PR China;
d. Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, Yunnan, PR China;
e. Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology, Laboratory of Ecology and Evolutionary Biology, School of Ecology and Environmental Sciences, Yunnan University, Kunming 650500, Yunnan, PR China
Abstract: Genome skimming has dramatically extended DNA barcoding from short DNA fragments to next generation barcodes in plants. However, conserved DNA barcoding markers, including complete plastid genome and nuclear ribosomal DNA (nrDNA) sequences, are often inadequate for accurate species identification. Skmer, a recently developed approach that estimates genetic distances among species based on unassembled genome skims, has been proposed to effectively improve the species discrimination rate. In this study, we used Skmer to identify species based on genome skims of 47 individuals representing 10 out of 13 species of Schima (Theaceae) from China. The unassembled reads identified six species, with a species identification rate of 60%, more than twice as high as previous efforts that used plastid genomes (27.27%). In addition, Skmer was able to identify Schima species with only 0.5× sequencing depth, as six species were well-supported with unassembled data sizes as small as 0.5 Gb. These findings demonstrate the potential of the Skmer approach in species identification, where nuclear genomic data play a crucial role. For taxonomically difficult taxa such as Schima, which have diverged recently and have low levels of genetic variation, Skmer is a promising alternative to next generation barcodes.
Keywords: Schima; Genome skimming; Species discrimination; Skmer
1. Introduction

Species identification in some angiosperm taxa is challenging. Several approaches have been developed to identify species in these groups, including the standard DNA barcodes (Li et al., 2011; Bieniek et al., 2014; Jiang et al., 2020), the next generation barcodes (Kane et al., 2012; Li et al., 2015; Hollingsworth et al., 2016), capturing single-copy nuclear genes (Alvarez et al., 2008; Straub et al., 2012; Liu et al., 2021), and the recently developed methods based on all reads generated from genome skims (Yi and Jin, 2013; Ondov et al., 2016; Balaban et al., 2020). Genome skimming, defined as low-depth genome sequencing, is typically applied in organelle genome recovery and species identification (Straub et al., 2012). It is a popular approach in phylogenomics, as it can be applied to total DNA extracted from a small amount of fresh or silica-dried material, or even from long-preserved herbarium specimens (Zeng et al., 2018; Nevill et al., 2020). However, most nuclear DNA data produced by genome skimming are discarded in current species identification practices, reducing its discriminatory power (Rubinoff and Holland, 2005; Straub et al., 2012; Dodsworth, 2015; Yu et al., 2018; Bohmann et al., 2020; Antil et al., 2023). Genome skimming could be optimized by using assembly-free and alignment-free methods. Although methods for assembly-free genetic distance calculation exist, such as Co-phylog (Yi and Jin, 2013), Mash (Ondov et al., 2016) and APPLES (Balaban et al., 2020), these approaches require a high sequencing depth, at least 5× (Ondov et al., 2016), which increases costs and the need for computing power.

An assembly-free and alignment-free approach called 'Skmer' has been proposed to calculate genomic distances using genome skims. Data with 1× sequencing depth might be sufficient for accurate species identification in most cases when using Skmer (Sarmashghi et al., 2019). Species identification with Skmer was shown to be feasible and efficient in marine mollusks (Xu et al., 2022). Furthermore, Skmer has been shown to increase the species identification rate in Cymbidium Sw. (Orchidaceae) (Zhang et al., 2023) and Torreya Arn. (Taxaceae) (Mo et al., 2023). Skmer therefore appears to be a potentially credible approach for species discrimination.

Species delimitation has long been controversial in Schima Reinw. ex Blume (Gordonieae, Theaceae) (Melchior, 1925; Bloembergen, 1952; Willis and Airy Shaw, 1985; Keng, 1994; Chang and Ren, 1998). Schima is distributed in subtropical and tropical areas of Asia. It is characterized by white flowers and globose to oblate fruits containing small reniform seeds with marginal membranous wings (Fig. 1; Ming and Bartholomew, 2007). It is ecologically important as one of the dominant elements of the subtropical evergreen broadleaved forests in East Asia (Yu et al., 2017a, 2017b; Li et al., 2020; Tang et al., 2020). Schima species are also used in traditional Chinese medicine and have recently been shown to have anti-cancer properties (College, 1985; Liang et al., 2019; Wu et al., 2019). Furthermore, these plants are used as horticultural resources, fire-resistant trees, and furniture wood (Yang et al., 2022; Zheng et al., 2022). The current classification system of Schima relies on morphological characters, which overlap in some species, making species identification difficult. Recent research has used DNA barcodes (matK, rbcL, trnH-psbA and nrITS) and next generation barcodes (plastome, nrDNA arrays) to identify species in Schima (Yu et al., 2022). However, next generation barcodes identified only three out of 11 Schima species, whereas standard DNA barcodes identified no species. Thus, the efficacy of current DNA barcodes for species identification in Schima is limited.

Fig. 1 Morphological diversity of 10 species of Schima. (A) S. argentea; (B) S. brevipedicellata; (C) S. crenata; (D) S. khasiana; (E) S. noronhae; (F) S. remotiserrata; (G) S. sericans var. sericans; (H) S. sinensis; (I) S. superba; (J) S. wallichii. p = pedicel.

Here we used genome skimming data from 47 individuals of 10 species of Schima to determine the efficacy of Skmer in identifying species in a taxonomically difficult genus. We also explored the sequencing depths required for accurate species identification.

2. Material and methods

2.1. Sample collection

In total, 47 individuals representing 10 out of 13 Chinese species of Schima were sampled. Leaves of eight individuals representing Chinese species of Schima were newly collected. We also used genome skimming data from 39 individuals from our previous study, representing five additional Schima species (Yu et al., 2022). These data represent at least 2× sequencing depth, assuming a genome size of ca. 900 Mb for Schima argentea (unpublished data). The number of individuals used for each species varied from two to nine, depending on the range of its geographic distribution. Two species of Camellia (C. mingii and C. connata) were also collected as outgroups. Fresh leaves were dried with silica gel and stored at low temperature. Voucher specimens of newly collected individuals were deposited in the Herbarium of Kunming Institute of Botany, Chinese Academy of Sciences (KUN) (Table S1).

2.2. DNA extraction and data collection

Total DNA was extracted from about 100 to 200 mg of leaf material and purified using the Plant Genomic DNA Extraction and Purification Kit. The quality of DNA was examined by agarose gel electrophoresis and a NanoDrop® ND-1000 Spectrophotometer. High-quality DNA (> 1.5 μg) was used for subsequent library construction and sequencing. For library construction, DNA was first randomly sheared with a Covaris ultrasonicator, and DNA fragments of the desired length (300–350 bp) were then recovered by electrophoresis. The NEBNext Ultra II DNA Library Prep Kit for Illumina was used to construct the libraries. Briefly, DNA fragments were end-repaired, A-tailed, and ligated to sequencing adapters. DNA fragments were then purified prior to PCR amplification. Finally, the target fragments were subjected to paired-end sequencing on the Illumina NovaSeq 6000 platform with a read length of 150 bp in each direction. About 2 Gb of sequencing data were generated for each individual.

2.3. Sequence dataset preparation

Fastp v.0.23.2 (Chen et al., 2018) was used for quality control to remove contamination and filter out low-quality reads, with parameters set to default. To explore the minimum sequencing depth required for species discrimination using Skmer in Schima, we used BBTools (Bushnell, 2014) to randomly extract sub-datasets from the genome skimming data of the 47 individuals. Sub-datasets of two different sizes were generated: 0.5 Gb (0.5×) and 1 Gb (1×).
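
The relation between data size and nominal sequencing depth can be checked with simple arithmetic. The following minimal Python sketch assumes the genome size of ca. 900 Mb cited for Schima argentea in Section 2.1 and a read length of 150 bp; the function names are illustrative only and are not part of BBTools or Skmer.

```python
# Illustrative sketch only: helper names are hypothetical.
# Assumptions taken from the text: genome size ca. 900 Mb, reads of 150 bp.
GENOME_SIZE_BP = 900_000_000
READ_LENGTH_BP = 150


def sequencing_depth(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Return the average depth (x) given the total number of sequenced bases."""
    return total_bases / genome_size


def reads_needed(total_bases: int, read_length: int = READ_LENGTH_BP) -> int:
    """Return how many single reads of the given length make up total_bases."""
    return total_bases // read_length


if __name__ == "__main__":
    for label, bases in [("0.5 Gb", 500_000_000),
                         ("1 Gb", 1_000_000_000),
                         ("2 Gb", 2_000_000_000)]:
        print(f"{label}: depth = {sequencing_depth(bases):.2f}x, "
              f"reads = {reads_needed(bases):,}")
```

Under the assumed genome size, 0.5 Gb, 1 Gb and 2 Gb correspond to roughly 0.56×, 1.11× and 2.22×, so the labels 0.5×, 1× and 2× should be read as approximate.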

2.4. Distance calculation

To calculate the mismatch rate, each dataset was analyzed separately according to the Skmer protocol. Sequencing error and coverage were estimated from k-mer frequency curves calculated in JellyFish (Marcais and Kingsford, 2011). The genome skim of each individual was used to create a separate k-mer set. The similarity between any two k-mer sets was computed as the Jaccard index (J) using the hashing technique of Mash (Ondov et al., 2016). The distance matrix was constructed with the Skmer default parameters, generating a main distance matrix and 100 subsampled distance matrices. Finally, the estimated distances were corrected using the corresponding Skmer command.
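
As an illustration of the k-mer comparison described above, the following minimal Python sketch computes an exact Jaccard index between two k-mer sets and converts it to a distance with the formula used by Mash, D = -(1/k) ln(2J / (1 + J)). It is a simplified stand-in, not the Skmer implementation: Skmer additionally corrects for coverage and sequencing error, Mash estimates J from hash sketches rather than full sets, and reverse complements are ignored here. The function names and toy sequences are hypothetical.

```python
import math


def kmer_set(sequence: str, k: int) -> set:
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_index(a: set, b: set) -> float:
    """Jaccard index J = |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        raise ValueError("both k-mer sets are empty")
    return len(a & b) / len(union)


def mash_distance(j: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2J / (1 + J)); defined as 1.0 when J = 0."""
    if j == 0:
        return 1.0
    # max() avoids returning -0.0 when J = 1
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    k = 3  # toy value for illustration; real analyses use much longer k-mers
    set_a = kmer_set("ACGTACGT", k)
    set_b = kmer_set("ACGTTCGT", k)
    j = jaccard_index(set_a, set_b)
    print(f"J = {j:.4f}, D = {mash_distance(j, k):.4f}")
```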

2.5. Phylogenetic analysis

To build phylogenetic trees, FastME 2.0 (Lefort et al., 2015) was used to infer the backbone tree, generating one tree file from the main matrix and 100 tree files from the subsampled matrices. Then the Maximum-Likelihood (ML) trees were built using RAxML-1.2.0 (Stamatakis, 2014), with 1000 rapid bootstrap replicates to compute bootstrap support (BS) and to search for the optimal ML tree. Species identification was considered successful if all conspecific samples clustered in a single clade in the phylogenetic tree with ML bootstrap support above 70% (David and James, 1993; Pamela and Douglas, 2003).
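
The identification criterion above (all conspecific samples form one clade with bootstrap support above 70%) can be expressed as a short check. The sketch below is a hypothetical illustration, not the procedure used in this study: it assumes a tree encoded as nested tuples of the form (support, [children]) with sample names as leaves, and the sample labels are invented.

```python
def leaf_set(node) -> frozenset:
    """Collect the sample names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return frozenset([node])
    _, children = node
    return frozenset().union(*(leaf_set(child) for child in children))


def iter_clades(node):
    """Yield every internal node (clade) of the tree, root first."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from iter_clades(child)


def species_identified(tree, samples, min_support: float = 70.0) -> bool:
    """True if the samples form exactly one clade with support above min_support."""
    target = frozenset(samples)
    for clade in iter_clades(tree):
        if leaf_set(clade) == target:
            return clade[0] > min_support
    return False  # no clade contains exactly these samples (not monophyletic)


if __name__ == "__main__":
    # Toy tree with invented labels: (support, [children])
    toy_tree = (100.0, [
        (100.0, ["wal_1", "wal_2", "wal_3"]),
        (100.0, [(64.0, ["sin_1", "sin_2"]), (100.0, ["nor_1", "sin_3"])]),
    ])
    print(species_identified(toy_tree, ["wal_1", "wal_2", "wal_3"]))
    print(species_identified(toy_tree, ["sin_1", "sin_2", "sin_3"]))
```

In the toy tree, the three "wal" samples satisfy the criterion, whereas the "sin" samples do not, because no clade contains exactly those three samples.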

3. Results

3.1. Species discrimination of Schima estimated by Skmer analysis

Skmer identified six of ten Schima species (i.e., Schima wallichii, S. khasiana, S. sericans, S. noronhae, S. crenata, and S. remotiserrata) with high support based on the 2× (ca. 2 Gb) genome skimming data (Table 1). Specifically, Skmer divided the 47 Schima samples into two main clades (BS = 100%) (Fig. 2). Clade A was formed by 27 individuals from seven species and one variety collected in southwestern China, Myanmar and Nepal. Eight individuals of S. wallichii, which were distributed in southwestern China (four individuals), Myanmar (two individuals), and Nepal (two individuals), clustered into a monophyletic subclade with strong support (BS = 100%). Similarly, three individuals of S. khasiana (collected from Yunnan, China) formed a subclade with 100% bootstrap support. Another monophyletic subclade consisted of three individuals of S. sericans var. sericans and two individuals of S. sericans var. paracrenata. Two individuals of S. noronhae collected from southwestern China formed a monophyletic subclade but were nested within S. sinensis, where five individuals of S. sinensis were subdivided into two groups corresponding to Yunnan (two individuals) and Sichuan (three individuals), respectively. Three individuals of S. argentea and one individual of S. brevipedicellata clustered together in a subclade.

Table 1 Species discrimination rates of traditional barcodes (rbcL, matK, trnH-psbA and nrITS), next generation super barcodes (complete plastomes + nrDNA), and the unassembled reads obtained from genome skimming based on Skmer. "/" indicates that no result is given for that species.
Species                           rbcL + matK + trnH-psbA   Next generation barcoding (complete plastomes + nrDNA)   Skmer based on genome skims
Schima argentea                   No                        No                                                        No
S. brevipedicellata               No                        No                                                        No
S. crenata                        No                        Yes                                                       Yes
S. khasiana                       No                        No                                                        Yes
S. multibracteata                 No                        No                                                        /
S. noronhae                       No                        No                                                        Yes
S. remotiserrata                  No                        No                                                        Yes
S. sericans                       No                        No                                                        Yes
S. sericans var. paracrenata      No                        Yes                                                       No
S. sinensis                       No                        No                                                        No
S. superba                        No                        No                                                        No
S. wallichii                      No                        No                                                        Yes
S. parviflora                     No                        Yes                                                       /
Successfully identified species   0                         3                                                         6
Identification rate (%)           0                         27.27                                                     60
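
The identification rates in the last row of Table 1 follow from the counts given in the text (3 of 11 species for next generation barcodes; 6 of 10 species for Skmer). A minimal Python sketch of this arithmetic, with a hypothetical helper name, is:

```python
def identification_rate(identified: int, assessed: int) -> float:
    """Percentage of assessed species that were successfully identified."""
    if assessed <= 0:
        raise ValueError("assessed must be a positive number of species")
    return round(100.0 * identified / assessed, 2)


if __name__ == "__main__":
    ngb_rate = identification_rate(3, 11)    # next generation barcodes
    skmer_rate = identification_rate(6, 10)  # Skmer on genome skims
    print(ngb_rate, skmer_rate, round(skmer_rate / ngb_rate, 1))
```

The ratio of the two rates is about 2.2, i.e., the Skmer rate is slightly more than double the rate obtained with next generation barcodes.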

Fig. 2 Skmer tree obtained from the original genome skimming data (ca. 2 Gb) of Schima. Taxon names in six different colors indicate the species that were successfully identified. Geographical distributions are indicated after species names according to color coding.

Clade B was composed of 20 individuals of five species mainly from southeastern China. Three individuals of Schima remotiserrata and four individuals of S. crenata (two from China and two from Cambodia) formed two distinct monophyletic subclades, each with strong support (BS = 100%). However, nine individuals of S. superba were divided into two groups with 100% bootstrap support. One group consisted of four individuals of S. superba, while the other group of five individuals of S. superba was accompanied by one individual of S. argentea. Interestingly, the other two individuals of S. brevipedicellata, which were collected from southwestern China (Yunnan) and southeastern China (Guangdong), clustered into a subclade with the one remaining individual of S. argentea from south China (Guangxi).

3.2. Sequencing depths for Schima

To test suitable sequencing depths for Skmer in Schima, two fixed datasets (0.5 Gb and 1 Gb) were extracted randomly from the original dataset (about 2 Gb, ca. 2×). The two phylogenetic trees constructed from the 0.5 Gb (0.5×) and 1 Gb (1×) datasets exhibited topologies similar to that constructed from the original dataset (Figs. 2-4). Skmer successfully discriminated six species of Schima (Schima wallichii, S. khasiana, S. sericans, S. noronhae, S. crenata, and S. remotiserrata) using each dataset (0.5×, 1× and 2× depth). For datasets based on genome skimming at the original (2×) and 0.5× sequencing depths, all 47 individuals of 10 Schima species were divided into two main clades (Figs. 2 and 4), which corresponded well with their geographic distribution. In contrast, the fixed 1× sequencing depth yielded three main clades, with the S. wallichii clade not clustering with the other species found in clade A (Fig. 3). As expected, bootstrap support values for most clades or subclades were higher in the original 2× dataset than in the 0.5× and 1× datasets. However, all three phylogenetic trees formed six monophyletic subclades. Surprisingly, these six monophyletic subclades obtained 100% bootstrap support, even in the 0.5× dataset (Fig. 4).

Fig. 3 Skmer tree with 1 Gb of genome skimming data. Taxon names in six different colors indicate the species that were successfully identified. Geographical distributions are indicated after sample names according to color coding.

Fig. 4 Skmer tree with 0.5 Gb of genome skimming data. Taxon names in six different colors indicate the species that were successfully identified. Geographical distributions are indicated after sample names according to color coding.

4. Discussion

4.1. Skmer approach was efficient in species identification of Schima

Next generation DNA barcodes have been used successfully in species identification of some taxa, such as Taxus (Taxaceae) (Fu et al., 2019), Panax (Araliaceae) (Ji et al., 2019), and Calligonum (Polygonaceae) (Song et al., 2020). However, these barcodes have not been effective for species identification in taxonomically challenging taxa, such as Rhododendron (Ericaceae) (Fu et al., 2022) and Fargesia (Poaceae) (Lv et al., 2023), or in our previous study in Schima, where only 3 out of 11 Schima species were identified (Yu et al., 2022). Here, the Skmer approach, based on the unassembled data, improved the species identification rate in Schima, identifying six out of ten species (60%) (Table 1). Our results indicate that Skmer is more efficient than the next generation barcodes at species discrimination in Schima. This finding is similar to a previous study showing that the Skmer approach improved upon the species identification rate of next generation barcodes (72% vs. 68%) in Cymbidium (Zhang et al., 2023). Our finding is also consistent with a previous study showing that the discrimination rate of the Skmer approach (77.7%) is the same as that of plastid genomes, but higher than that of nrDNA cistrons (62.5%) (Mo et al., 2023).

The improved identification rate of Schima species might be attributed to an extended exploration of the unassembled nuclear reads from genome skims. Nuclear genes have been used to infer angiosperm phylogeny (Hughes et al., 2006; Zimmer and Wen, 2012; Dong et al., 2021). However, nuclear genome sequencing is expensive and requires significant computational power. For example, the minimum sequencing depth required is at least 50× for a relatively straightforward diploid organism (Sohn and Nam, 2018). As a result, a fully assembled nuclear genome may not be the most suitable option for widespread species identification purposes. Genome skimming generates a large amount of nuclear genomic sequence data that are not used for species discrimination and plant phylogenetic studies, but instead discarded. In fact, when Skmer was developed as a new approach, it was proposed that ca. 1× sequencing depth or 2 Gb of genome skimming data might be sufficient for species-level identification in most cases (Sarmashghi et al., 2019; Bohmann et al., 2020). The Skmer approach allows for species identification based on genome skimming data ranging from 0.1 Gb (ca. 0.50×–1.68×) to 4 Gb (ca. 0.93×–2.78×), with increasing depth corresponding to stronger support (Xu et al., 2022). In our study, the phylogenetic trees based on 0.5 Gb (ca. 0.5×), 1 Gb (ca. 1×), and 2 Gb (ca. 2×) of data were similar in topology, with each dataset identifying six Schima species with strong support (BS = 100%). Therefore, Skmer might be an effective, credible, and economical approach for exploring the potential of nuclear genomic data.

Interestingly, our findings based on Skmer analyses conflict with those of our previous study in which we used next generation barcodes. For example, next generation barcodes grouped all individuals of Schima wallichii into two subclades (Yu et al., 2022); however, the Skmer approach grouped these same individuals into one highly supported monophyletic subclade (BS = 100%) (Fig. 2). Similarly, our plastomic tree failed to group S. khasiana into a subclade (Yu et al., 2022), whereas the Skmer tree placed all three individuals within a monophyletic subclade (BS = 100%) (Fig. 2). The same discrepancies occurred when we used these approaches to identify S. noronhae, S. sericans, and S. remotiserrata. These results indicate that the previous failure of the plastome for species discrimination of Schima might be caused by hybridization or introgression among species, such that the maternally inherited plastome could not trace the underlying species boundaries (Small et al., 2004; Greiner et al., 2015). However, in this study, the nuclear reads from genome skims used in the Skmer analyses could provide additional information for species identification and improve the species discrimination rate, avoiding limitations arising from chloroplast capture (Percy et al., 2014; Yi et al., 2015; Ogishima et al., 2019; Liu et al., 2020). Our previous study also found ancient introgression between the common ancestor of Gordonia and that of Schima (Zhang et al., 2022). It remains unclear whether potential hybridization has played a role in Schima speciation, although we suspect this might be an important factor that limited the efficacy of plastome data in species identification in the genus (Twyford and Ennos, 2012; Richard and Erica, 2014).

Even though Skmer analysis could be an alternative to next generation barcodes, using more single- or low-copy genes as next generation nuclear barcodes would provide a more robust framework for species discrimination, as numerous studies have suggested (Ruhsam et al., 2015; Ji et al., 2019; Fu et al., 2022; Yu et al., 2022; Zhang et al., 2023).

4.2. Skmer approach is reliable in species discrimination of Schima

Species discrimination of Schima by Skmer corresponded well with morphological and geographic evidence. The six Schima species identified by Skmer included two species endemic to China (i.e., S. sericans and S. remotiserrata) and four species widely distributed in Yunnan, China and Southeast Asia (Figs. 2-4). Previous studies have divided S. sericans, which is distributed in Xizang and Yunnan, China, into two varieties based only on morphological data, including the presence or absence of trichomes on current year branchlets, petioles, pedicels and the outside of sepals (Ming and Bartholomew, 2007). Furthermore, S. sericans var. sericans and S. sericans var. paracrenata are distributed in overlapping ranges, with the latter narrowly distributed in the Gongshan mountain of Yunnan Province in China, suggesting that relying solely on morphology may be insufficient for species delimitation. Skmer analysis showed that two individuals of S. sericans var. paracrenata were nested with three individuals of S. sericans var. sericans and formed a strongly supported subclade. Thus, our findings indicate that the two varieties of S. sericans should be merged as a single species. We also found that three individuals of S. remotiserrata formed a monophyletic subclade (Figs. 2-4). These individuals, narrowly distributed in southeastern China, share morphological traits (e.g., leaf adaxially glabrous, margin apically sparsely obtusely serrate).

The identification of Schima crenata was strongly supported, as two individuals from China and two from Cambodia clustered into a subclade (Figs. 2-4). This finding is consistent with our previous analysis based on plastomic next generation barcodes (Yu et al., 2022). In China, S. crenata is only found on Hainan Island, although it is common throughout mainland Southeast Asia, Malaysia and Indonesia. It is distinguished morphologically from other Schima species by its long pedicel (4–6 cm), leaf margin apically undulate crenate, and ovary tomentose (Ming and Bartholomew, 2007).

Schima wallichii is widely distributed in southwestern China, Bhutan, N India, Laos, Myanmar, Nepal, Thailand, and Vietnam (Ming and Bartholomew, 2007). It was previously treated as the only species in Schima, but regarded as a 'complex-polymorphic' species, which could be divided into nine geographically separated subspecies and three varieties (Bloembergen, 1952). The chromosome number (2n = 36) of S. wallichii was recorded (Bezbaruah, 1971), which was congruent with that of the whole genus Schima (Yang et al., 2004). A previous study based on next generation barcodes divided ten individuals of S. wallichii from southwestern China, Myanmar, and Nepal into two subclades (Yu et al., 2022). Here, Skmer analysis grouped eight of these individuals into one subclade with strong support (Figs. 2-4). Our results support prior taxonomic treatments that regard S. wallichii as a good species, considering its distinct features, including the current year branchlets pubescent, leaf blade with entire margin, capsule subglobose with white lenticels, etc. (Ming and Bartholomew, 2007).

Schima khasiana is mainly distributed in Yunnan, China, extending to Myanmar, India, and Vietnam. It is distinguished by the largest flower (diameter > 6 cm) and the largest fruit (diameter ca. 2.5–4 cm) within Schima. Skmer analysis grouped three individuals into a monophyletic subclade (Figs. 2-4). In addition, two individuals of S. noronhae, which mainly occurs in Southeast Asia, clustered as a subclade in the Skmer analysis (Figs. 2-4). This identification is consistent with morphological traits, i.e., pale pink flowers with a long pedicel (> 5 cm) (Ming and Bartholomew, 2007).

Skmer did not successfully identify 22 individuals sampled from the other four Schima species (i.e., S. superba, S. argentea, S. brevipedicellata and S. sinensis; Figs. 2-4). The delimitation of these four species has been controversial due to their wide distributions and morphological overlap (Ming and Bartholomew, 2007). For example, S. superba is distributed widely from southern and southeastern China and Taiwan Island in China to Japan. It cannot be easily identified by morphology because its leaf margin, which is undulately obtusely crenate from the basal 1/2 apically, varies heavily. Furthermore, S. superba shares features with S. remotiserrata and S. crenata, e.g., variously toothed leaf margins and glabrous abaxial surface. Skmer grouped four individuals of S. argentea and three individuals of S. brevipedicellata, intermixed, in two distant subclades, with one individual of S. argentea grouping within the S. superba subclade (Figs. 2-4). The geographical distributions of S. argentea and S. brevipedicellata overlap in Jiangxi and Yunnan of China as well as in Vietnam. These two species also share common features, such as leaf margin entire and abaxial surface glaucous, but are distinguished by pedicel length: the pedicel of S. argentea (1–1.5 cm) is longer than that of S. brevipedicellata (< 1 cm). S. sinensis is endemic to southwestern China, and is characterized by solitary flowers, a trait shared with S. khasiana (Ming and Bartholomew, 2007). Skmer also failed to identify S. sinensis, whose individuals clustered into a subclade but with two individuals of S. noronhae nested within (Figs. 2-4). S. sinensis is morphologically similar to S. noronhae, with both having leaves that are glabrous on the abaxial surface and a long pedicel (> 3 cm).

The inability of Skmer to discriminate individuals of these four Schima species might be attributed to their low genetic variation, which is reflected in their overlapping morphological characters. In addition, Schima species may have undergone interspecific hybridization or rapid radiation. Despite our progress, the taxonomy and species identification of Schima remain problematic. Future taxonomic studies should integrate morphological and molecular data as well as sample populations that cover the entire geographic distribution of the species (Dayrat, 2005; Cleusa et al., 2018; He et al., 2022).

5. Conclusions

In this study, Skmer identified six out of ten Schima species. Furthermore, species identification was consistent across unassembled genome skims of 0.5× (ca. 0.5 Gb), 1× (ca. 1 Gb), and 2× (ca. 2 Gb). The species identified by Skmer were also consistent with morphological data and corresponded to their distribution patterns. Our results improve the species identification rate of Schima to 60%, compared with 27.27% in the previous next generation barcoding research. Additionally, our comparative results showed that data at 0.5× sequencing depth captured sufficient nuclear genomic information for downstream species identification in the case study of Schima. Therefore, Skmer can make full use of the scattered nuclear genomic data from shallow sequencing and thus improve species identification rates. However, comprehensive taxonomic revisions of the target taxa would serve as a significant benchmark for Skmer performance.

Acknowledgements

This work was supported by National Natural Science Foundation of China (No. 32070369), the Youth Innovation Promotion Association CAS of China (No. 2021393), the Yunnan Revitalization Talent Support Program "Young Talent" Project, the Applied Fundamental Research Foundation of Yunnan Province (202301AT070308) and the Fund of Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology (YNPRAEC-2023006). The authors are grateful to Prof. Liang Fang (Jiujiang University) and Prof. Xue-Jun Ge (South China Botanical Garden, CAS), to Prof. Hua Peng and colleagues En-De Liu, Ting Zhang, Ji-Dong Ya, Yong-Jie Guo, Cheng Liu and Li Chen (Kunming Institute of Botany, CAS) for assistance with sample collection. We thank the staff at KUN for providing access to study specimens of Schima, and the iFlora High Performance Computing Center of the Germplasm Bank of Wild Species (iFlora HPC Center of GBOWS, KIB, CAS) for computing.

Data availability statement

The data set supporting the conclusions of this article is available on FigShare (https://doi.org/10.6084/m9.figshare.24848427).

CRediT authorship contribution statement

Han-Ning Duan: Writing – original draft, Formal analysis. Yin-Zi Jiang: Writing – review & editing. Jun-Bo Yang: Writing – review & editing. Jie Cai: Writing – review & editing. Jian-Li Zhao: Writing – review & editing. Lu Li: Writing – review & editing. Xiang-Qin Yu: Writing – review & editing, Conceptualization.

Declaration of competing interest

The authors have no conflict of interest.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2024.06.003.

References
Alvarez, I., Costa, A., Feliner, G.N., 2008. Selecting single-copy nuclear genes for plant phylogenetics: a preliminary analysis for the Senecioneae (Asteraceae). J. Mol. Evol., 66: 276-291. DOI:10.1007/s00239-008-9083-7
Antil, S., Abraham, J.S., Sripoorna, S., et al., 2023. DNA barcoding, an effective tool for species identification: a review. Mol. Biol. Rep., 50: 761-775. DOI:10.1007/s11033-022-08015-7
Balaban, M., Sarmashghi, S., Mirarab, S., et al., 2020. APPLES: scalable distance-based phylogenetic placement with or without alignments. Syst. Biol., 69: 566-578. DOI:10.1093/sysbio/syz063
Bezbaruah, H.P., 1971. Cytological investigations in the family Theaceae-I. chromosome numbers in some Camellia species and allied genera. Caryologia, 24: 421-426. DOI:10.1080/00087114.1971.10796449
Bieniek, W., Mizianty, M., Szklarczyk, M., 2014. Sequence variation at the three chloroplast loci (matK, rbcL, trnH-psbA) in the Triticeae tribe (Poaceae): comments on the relationships and utility in DNA barcoding of selected species. Plant Syst. Evol., 301: 1275-1286. DOI:10.1007/s00606-014-1138-1
Bloembergen, S., 1952. A critical study in the complex-polymorphous genus Schima (Theaceae). Reinwardtia, 2: 113-183. DOI:10.55981/reinwardtia.1952.1019
Bohmann, K., Mirarab, S., Bafna, V., et al., 2020. Beyond DNA barcoding: the unrealized potential of genome skim data in sample identification. Mol. Ecol., 29: 2521-2534. DOI:10.1111/mec.15507
Bushnell, B., 2014. BBTools software package. Available online: https://sourceforge.net/projects/bbmap/. (Accessed 1 September 2023).
Chang, H.D., Ren, S.X., 1998. Theaceae. In: Wu, C.Y. (Ed.), Flora Reipublicae Popularis Sinicae. Science Press.
Chen, S., Zhou, Y., Chen, Y., et al., 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34: i884-i890. DOI:10.1093/bioinformatics/bty560
Cleusa, V., Bianca, O., João, R.I., et al., 2018. Integrative taxonomy improves delimitation in Hypericum subspecies. Perspect. Plant Ecol. Evol. Syst., 34: 68-76. DOI:10.1016/j.ppees.2018.08.005
College, J.N.M., 1985. Traditional Chinese Medicine (1st Volume). Shanghai Scientific & Technical Publishers, Shanghai, China.
Dayrat, B., 2005. Towards integrative taxonomy. Biol. J. Linn. Soc., 85: 407-415. DOI:10.1111/j.1095-8312.2005.00503.x
David, M.H., James, J.B., 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol., 42: 182-192. DOI:10.1093/sysbio/42.2.182
Dodsworth, S., 2015. Genome skimming for next-generation biodiversity analysis. Trends Plant Sci., 20: 525-527. DOI:10.1016/j.tplants.2015.06.012
Dong, P.B., Wang, R.N., Afzal, N., et al., 2021. Phylogenetic relationships and molecular evolution of woody forest tree family Aceraceae based on plastid phylogenomics and nuclear gene variations. Genomics, 113: 2365-2376. DOI:10.1016/j.ygeno.2021.03.037
Fu, C.N., Mo, Z.Q., Yang, J.B., et al., 2022. Testing genome skimming for species discrimination in the large and taxonomically difficult genus Rhododendron. Mol. Ecol. Resour., 22: 404-414. DOI:10.1111/1755-0998.13479
Fu, C.N., Wu, C.S., Ye, L.J., et al., 2019. Prevalence of isomeric plastomes and effectiveness of plastome super-barcodes in yews (Taxus) worldwide. Sci. Rep., 9: 2773. DOI:10.1038/s41598-019-39161-x
Greiner, S., Sobanski, J., Bock, R., 2015. Why are most organelle genomes transmitted maternally? Bioessays, 37: 80-94. DOI:10.1002/bies.201400110
He, X., Cao, J.J., Zhang, W., et al., 2022. Integrative taxonomy of herbaceous plants with narrow fragmented distributions: a case study on Primula merrilliana species complex. J. Syst. Evol., 60: 859-875. DOI:10.1111/jse.12726
Hollingsworth, P.M., Li, D.Z., van der Bank, M., et al., 2016. Telling plant species apart with DNA: from barcodes to genomes. Philos. Trans. R. Soc. B Biol. Sci., 371: 20150338. DOI:10.1098/rstb.2015.0338
Hughes, C.E., Eastwood, R.J., Bailey, C.D., 2006. From famine to feast? Selecting nuclear DNA sequence loci for plant species-level phylogeny reconstruction. Philos. Trans. R. Soc. Lond. B Biol. Sci., 361: 211-225. DOI:10.1098/rstb.2005.1735
Ji, Y., Liu, C., Yang, Z., et al., 2019. Testing and using complete plastomes and ribosomal DNA sequences as the next generation DNA barcodes in Panax (Araliaceae). Mol. Ecol. Resour., 19: 1333-1345. DOI:10.1111/1755-0998.13050
Jiang, K.W., Zhang, R., Zhang, Z.F., et al., 2020. DNA barcoding and molecular phylogeny of Dumasia (Fabaceae: Phaseoleae) reveals a cryptic lineage. Plant Divers., 42: 376-385. DOI:10.1016/j.pld.2020.07.007
Kane, N., Sveinsson, S., Dempewolf, H., et al., 2012. Ultra-barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. Am. J. Bot., 99: 320-329. DOI:10.3732/ajb.1100570
Keng, H., 1994. Flora Malesianae Precursores - LVIII, Part Four. The Genus Schima (Theaceae) in Malesia. The Gardens' Bulletin, Singapore.
Lefort, V., Desper, R., Gascuel, O., 2015. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Mol. Biol. Evol., 32: 2798-2800. DOI:10.1093/molbev/msv150
Li, D.Z., Gao, L.M., Li, H.T., et al., 2011. Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants. Proc. Natl. Acad. Sci. U.S.A., 108: 19641-19646. DOI:10.1073/pnas.1104551108
Li, X., Wu, T., Cheng, Y., et al., 2020. Ecophysiological adaptability of four tree species in the southern subtropical evergreen broad-leaved forest to warming. Chin. J. Plant Ecol., 44: 1203-1214. DOI:10.17521/cjpe.2020.0318
Li, X., Yang, Y., Henry, R.J., et al., 2015. Plant DNA barcoding: from gene to genome. Biol. Rev., 90: 157-166. DOI:10.1111/brv.12104
Liang, Q.P., Xu, T.Q., Liu, B.L., et al., 2019. Sasanquasaponin III from Schima crenata Korth induces autophagy through Akt/mTOR/p70S6K pathway and promotes apoptosis in human melanoma A375 cells. Phytomedicine, 58: 152769. DOI:10.1016/j.phymed.2018.11.029
Liu, B.B., Christopher, S.C., Hong, D.Y., et al., 2020. Phylogenetic relationships and chloroplast capture in the Amelanchier-Malacomeles-Peraphyllum clade (Maleae, Rosaceae): evidence from chloroplast genome and nuclear ribosomal DNA data using genome skimming. Mol. Phylogenet. Evol., 147: 106784. DOI:10.1016/j.ympev.2020.106784
Liu, B.B., Ma, Z.Y., Ren, C., et al., 2021. Capturing single-copy nuclear genes, organellar genomes, and nuclear ribosomal DNA from deep genome skimming data for plant phylogenetics: a case study in Vitaceae. J. Syst. Evol., 59: 1124-1138. DOI:10.1111/jse.12806
Lv, S.Y., Ye, X.Y., Li, Z.H., et al., 2023. Testing complete plastomes and nuclear ribosomal DNA sequences for species identification in a taxonomically difficult bamboo genus Fargesia. Plant Divers., 45: 147-155. DOI:10.1016/j.pld.2022.04.002
Marcais, G., Kingsford, C., 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27: 764-770. DOI:10.1093/bioinformatics/btr011
Melchior, H., 1925. Die Natürlichen Pflanzenfamilien (2nd ed.). Verlag von Wilhelm Engelmann, 28: 109-154.
Ming, T.L., Bartholomew, B., 2007. Theaceae. In: Wu, C.Y., Raven, P.H. (Eds.), Flora of China.
Mo, Z.Q., Wang, J., Moller, M., et al., 2023. Phylogenetic relationships and next-generation barcodes in the genus Torreya reveal a high proportion of misidentified cultivated plants. Int. J. Mol. Sci., 24: 13216. DOI:10.3390/ijms241713216
Nevill, P.G., Zhong, X., Tonti-Filippini, J., et al., 2020. Large scale genome skimming from herbarium material for accurate plant identification and phylogenomics. Plant Methods, 16: 1. DOI:10.1186/s13007-019-0534-5
Ogishima, M., Horie, S., Kimura, T., et al., 2019. Frequent chloroplast capture among Isodon (Lamiaceae) species in Japan revealed by phylogenies based on variation in chloroplast and nuclear DNA. Plant Species Biol., 34: 127-137. DOI:10.1111/1442-1984.12239
Ondov, B.D., Treangen, T.J., Melsted, P., et al., 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol., 17: 132. DOI:10.1186/s13059-016-0997-x
Pamela, S.S., Douglas, E.S., 2003. Applying the bootstrap in phylogeny reconstruction. Statist. Sci., 18: 256-267. DOI:10.1214/ss/1063994980
Percy, D.M., Argus, G.W., Cronk, Q.C., et al., 2014. Understanding the spectacular failure of DNA barcoding in willows (Salix): does this result from a trans-specific selective sweep? Mol. Ecol., 23: 4737-4756. DOI:10.1111/mec.12837
Richard, G.H., Erica, L.L., 2014. Hybridization, introgression, and the nature of species boundaries. J. Hered., 105: 795-809. DOI:10.1093/jhered/esu033
Rubinoff, D., Holland, B.S., 2005. Between two extremes: mitochondrial DNA is neither the panacea nor the nemesis of phylogenetic and taxonomic inference. Syst. Biol., 54: 952-961. DOI:10.1080/10635150500234674
Ruhsam, M., Rai, H.S., Mathews, S., et al., 2015. Does complete plastid genome sequencing improve species discrimination and phylogenetic resolution in Araucaria? Mol. Ecol. Resour., 15: 1067-1078. DOI:10.1111/1755-0998.12375
Sarmashghi, S., Bohmann, K., Gilbert, M.T.P., et al., 2019. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol., 20: 34. DOI:10.1186/s13059-019-1632-4
Small, R., Cronn, R., Wendel, J., 2004. Use of nuclear genes for phylogeny reconstruction in plants. Aust. Syst. Bot., 17: 145-170. DOI:10.1071/SB03015
Sohn, J.I., Nam, J.W., 2018. The present and future of de novo whole-genome assembly. Brief. Bioinform., 19: 23-40. DOI:10.1093/bib/bbw096
Song, F., Li, T., Burgess, K.S., et al., 2020. Complete plastome sequencing resolves taxonomic relationships among species of Calligonum L. (Polygonaceae) in China. BMC Plant Biol., 20: 261. DOI:10.1186/s12870-020-02466-5
Stamatakis, A., 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30: 1312-1313. DOI:10.1093/bioinformatics/btu033
Straub, S.C., Parks, M., Weitemier, K., et al., 2012. Navigating the tip of the genomic iceberg: next-generation sequencing for plant systematics. Am. J. Bot., 99: 349-364. DOI:10.3732/ajb.1100335
Tang, C.Q., Han, P.B., Li, S.F., et al., 2020. Species richness, forest types and regeneration of Schima in the subtropical forest ecosystem of Yunnan, southwestern China. For. Ecosyst., 7: 35. DOI:10.1186/s40663-020-00244-1
Twyford, A.D., Ennos, R.A., 2012. Next-generation hybridization and introgression. Heredity, 108: 179-189. DOI:10.1038/hdy.2011.68
Willis, J.C., Airy Shaw, H.K., 1985. A Dictionary of the Flowering Plants and Ferns, eighth ed. Cambridge University Press.
Wu, C., Wu, H.T., Wang, Q., et al., 2019. Anticandidal potential of stem bark extract from Schima superba and the identification of its major anticandidal compound. Molecules, 24: 1587. DOI:10.3390/molecules24081587
Xu, T., Kong, L., Li, Q., 2022. Testing efficacy of assembly-free and alignment-free methods for species identification using genome skims, with Patellogastropoda as a test case. Genes, 13: 1192. DOI:10.3390/genes13071192
Yang, S.X., Yang, J.B., Lei, L.G., et al., 2004. Reassessing the relationships between Gordonia and Polyspora (Theaceae) based on the combined analyses of molecular data from the nuclear, plastid and mitochondrial genomes. Plant Syst. Evol., 248: 45-55. DOI:10.1007/s00606-004-0178-3
Yang, Z., Zhang, R., Zhou, Z., 2022. The XTH gene family in Schima superba: genome-wide identification, expression profiles, and functional interaction network analysis. Front. Plant Sci., 13: 911761. DOI:10.3389/fpls.2022.911761
Yi, H., Jin, L., 2013. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res., 41: e75. DOI:10.1093/nar/gkt003
Yi, T.S., Jin, G.H., Wen, J., 2015. Chloroplast capture and intra- and inter-continental biogeographic diversification in the Asian-New World disjunct plant genus Osmorhiza (Apiaceae). Mol. Phylogenet. Evol., 85: 10-21. DOI:10.1016/j.ympev.2014.09.028
Yu, X., Yang, D., Guo, C., et al., 2018. Plant phylogenomics based on genome-partitioning strategies: progress and prospects. Plant Divers., 40: 158-164. DOI:10.1016/j.pld.2018.06.005
Yu, X.Q., Drew, B.T., Yang, J.B., et al., 2017a. Comparative chloroplast genomes of eleven Schima (Theaceae) species: insights into DNA barcoding and phylogeny. PLoS One, 12: e0178026. DOI:10.1371/journal.pone.0178026
Yu, X.Q., Gao, L.M., Soltis, D.E., et al., 2017b. Insights into the historical assembly of East Asian subtropical evergreen broadleaved forests revealed by the temporal history of the tea family. New Phytol., 215: 1235-1248. DOI:10.1111/nph.14683
Yu, X.Q., Jiang, Y.Z., Folk, R.A., et al., 2022. Species discrimination in Schima (Theaceae): next-generation super-barcodes meet evolutionary complexity. Mol. Ecol. Resour., 22: 3161-3175. DOI:10.1111/1755-0998.13683
Zeng, C.X., Hollingsworth, P.M., Yang, J., et al., 2018. Genome skimming herbarium specimens for DNA barcoding and phylogenomics. Plant Methods, 14: 43. DOI:10.1186/s13007-018-0300-0
Zhang, L., Huang, Y.W., Huang, J.L., et al., 2023. DNA barcoding of Cymbidium by genome skimming: call for next-generation nuclear barcodes. Mol. Ecol. Resour., 23: 424-439. DOI:10.1111/1755-0998.13719
Zhang, Q., Zhao, L., Folk, R.A., et al., 2022. Phylotranscriptomics of Theaceae: generic-level relationships, reticulation and whole-genome duplication. Ann. Bot., 129: 457-471. DOI:10.1093/aob/mcac007
Zheng, W., Ma, Y., Tigabu, M., et al., 2022. Capture of fire smoke particles by leaves of Cunninghamia lanceolata and Schima superba, and importance of leaf characteristics. Sci. Total Environ., 841: 156772. DOI:10.1016/j.scitotenv.2022.156772
Zimmer, E.A., Wen, J., 2012. Using nuclear gene data for plant phylogenetics: progress and prospects. Mol. Phylogenet. Evol., 65: 774-785.