b. State Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518020, China;
c. School of Medical, Molecular and Forensic Sciences, Murdoch University, Murdoch 6149, WA, Australia;
d. National Engineering Laboratory for Resource Development of Endangered Crude Drugs in Northwest China, College of Life Sciences, Shaanxi Normal University, Xi’an 710119, Shaanxi, China;
e. Department of Agricultural Biology, Colorado State University, Fort Collins, CO 80523, USA;
f. Shenzhen Key Laboratory of Southern Subtropical Plant Diversity, Fairy Lake Botanical Garden, Shenzhen & Chinese Academy of Sciences, Shenzhen 518004, Guangdong, China;
g. Marine College, Shandong University, Weihai 264209, Shandong, China;
h. State Key Laboratory of Plant Diversity and Specialty Crops and Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
The extraordinary diversification of angiosperms represents one of the most remarkable evolutionary radiations in Earth's history, encompassing over 350,000 extant species (Qian et al., 2022). This rapid diversification, which Charles Darwin famously termed an “abominable mystery,” continues to challenge our understanding of plant evolution (Guo et al., 2023; Zhang and Ma, 2024). Despite advances in high-throughput sequencing technologies and numerous large-scale phylogenomic studies, relationships among major angiosperm lineages remain contentious. The current framework, established by the Angiosperm Phylogeny Group (APG Ⅳ, 2016), recognizes the early-diverging ANA grade (comprising Amborellaceae, Nymphaeales, and Austrobaileyales) and the more diverse core angiosperms (containing Magnoliids, Chloranthales, Ceratophyllales, monocots, and eudicots). While the classification of 64 orders has been relatively stable, the phylogenetic relationships among the five major lineages within core angiosperms remain unresolved (Fig. 1).
|
| Fig. 1 Alternative topologies for deep angiosperm phylogeny. The figure illustrates eight competing hypotheses (a–h). Topologies are derived from the APG Ⅳ system (a), nuclear data including single-copy genes and orthologous genes (OGs; b–d), plastid genes (e and f), and mitochondrial genes (g and h). Each panel is annotated with its data source and key references. Branch colors correspond to major angiosperm clades. |
Recent phylogenomic studies have employed increasingly comprehensive genomic datasets and broader taxon sampling, yet have yielded conflicting topologies (Leebens-Mack et al., 2019; Zuntini et al., 2024). This persistent uncertainty despite advanced genomic approaches suggests that additional data types and analytical strategies are required. Two major controversies dominate these phylogenetic conflicts. First, the first-branching lineage within core angiosperms varies across datasets: some nuclear gene analyses support monocots as sister to the remaining core angiosperms (Fig. 1b and d) (Leebens-Mack et al., 2019; Yang et al., 2020a; Hu et al., 2023), while plastid genome data favor either Magnoliids or Chloranthales in this position (Fig. 1e and f) (Gitzendanner et al., 2018; Li et al., 2019; Hu et al., 2023). Previous mitochondrial analyses present additional conflicts, with some suggesting that monocots and Magnoliids share a most recent common ancestor (Fig. 1g) (Xue et al., 2022; Lin et al., 2025). Second, the relationship between Ceratophyllales and Chloranthales varies by genomic compartment: mitochondrial data support their sister-group relationship (Fig. 1g), whereas nuclear and plastid datasets typically place Ceratophyllales closer to eudicots (Fig. 1a, b, and d–f) and Chloranthales closer to Magnoliids (Fig. 1b, e).
The challenges in resolving these relationships stem from multiple biological factors, including the rapid radiation of early angiosperms, incomplete lineage sorting, hybridization and introgression, polyploidization, and other confounding influences (Yang et al., 2020a; Ma et al., 2021; Guo et al., 2023; Hu et al., 2023). Methodological limitations in sequencing, assembly and phylogenetic inference further complicate analysis. For example, Wu et al. (2015) identified differences between two Oryza australiensis plastid genomes that were attributable to the sequencing and assembly methods rather than true biological variation. While valuable, many previous studies using mitochondrial sequences for angiosperm phylogeny have relied on reference-based extraction or initial assembly at the contig level rather than complete mitochondrial genome assemblies (Xue et al., 2022; Hu et al., 2023; Lin et al., 2025). This approach introduces systematic errors, including an inability to distinguish true mitochondrial DNA from nuclear mitochondrial DNA (NUMTs) or to differentiate between duplicated genes with partial sequence variation (Wu et al., 2020; Postel et al., 2023). Additionally, complete mitochondrial genomes remain unavailable for two key angiosperm groups, Chloranthales and Ceratophyllales, further hampering efforts to resolve relationships among the five major clades.
Mitochondrial datasets have provided valuable insights into resolving phylogenetic relationships, particularly in rapidly evolving lineages or large-scale phylogenetic reconstructions (Liu et al., 2014a; Lin et al., 2022; Tyszka et al., 2023), despite their frequent exclusion in phylogenetic studies due to the relatively low nucleotide substitution in mitochondrial genomes (Palmer and Herbon, 1988; Christensen, 2021; Kan et al., 2022; Wang et al., 2024b). Plant mitochondrial genomes exhibit extensive structural variation—ranging from dynamic repeat content to unequally abundant chromosomes in some lineages (Wang et al., 2025)—with several features offering highly informative phylogenetic information. The loss and retention patterns of ribosomal protein genes show lineage-specific conservation across plant lineages (Mower, 2020). RNA editing sites track major taxonomic divisions, with substantially more editing sites in basal angiosperms than in model plants like Arabidopsis (Hein et al., 2016; Yu et al., 2025), providing potential synapomorphies for major clades. The distribution of repeat sequences correlates with genome stability and evolutionary history—mosses, with few repeats, exhibit 350 million years of mitochondrial genome stasis, whereas vascular plants, with abundant repeats, show greater structural dynamism (Liu et al., 2014b; Wynn and Christensen, 2019). Additionally, phylogenetically informative structural variants complement sequence-based analyses, as shown in Gossypium species (Kong et al., 2025). Despite these promising features, a systematic exploration of mitochondrial genome utility for resolving deep angiosperm relationships remains lacking, particularly regarding the integration of structural characteristics with sequence-based phylogenetic frameworks.
This study aims to (1) fill critical mitogenome gaps for Chloranthales and Ceratophyllales, thereby establishing a robust case study for dissecting cyto-nuclear conflict; (2) develop and apply a novel diagnostic framework to systematically identify the drivers of mitogenomic signal conflict; (3) empirically validate this framework by demonstrating its ability to resolve incongruence and uncover a more coherent phylogenetic signal; and (4) integrate these sequence-based findings with a multi-faceted analysis of mitogenomic architectural features to provide a comprehensive evolutionary narrative. By integrating these approaches, we provide a comprehensive framework for utilizing mitochondrial data to navigate the complexities of deep angiosperm phylogeny.
2. Materials and methods 2.1. Taxon sampling and data collectionWe sampled 57 angiosperm species representing major lineages: 34 eudicots, 12 monocots, five magnoliids, one Chloranthales, one Ceratophyllales, and four basal angiosperms, covering more than half of core angiosperms orders. One gymnosperm (Cycas taitungensis) served as an outgroup. We downloaded 55 selected published mitochondrial and plastid genome sequences from the NCBI database (Table S1). For the Chloranthales and Ceratophyllales, which lack published mitochondrial genomes, raw Illumina whole-genome sequencing data, Nanopore whole-genome sequencing data, and Illumina RNA-seq data were downloaded from the NCBI SRA under project IDs PRJNA759285 and PRJEB33250 (details provided in Table S1) (Yang et al., 2020b; Ma et al., 2021).
2.2. Organelle genome assembly and annotationFor the two lineages lacking published mitochondrial genomes, we employed a hybrid assembly approach combining Illumina short reads and Nanopore long reads to generate high-quality, complete circular genomes (Camacho et al., 2009; Bankevich et al., 2012; Bolger et al., 2014; Wick et al., 2015; Li, 2018; Kolmogorov et al., 2020). To ensure consistency across all 58 species, both our newly assembled and all downloaded mitogenomes and plastid genomes were systematically re-annotated using a standardized pipeline (IPMGA for mitochondria, CPGAVAS2 for plastids) (Shi et al., 2019). This process included predicting C-to-U RNA editing sites with a high-confidence deep learning model (Edera et al., 2021) and identifying repetitive elements (Zhang et al., 2024), thereby minimizing artifacts from heterogeneous data sources. A detailed description of the assembly and annotation workflow is provided in the Supplementary Methods (Note S1).
2.3. Primary gene identification and dataset constructionTo ensure the selection of robust orthologous genes, we implemented a stringent filtering strategy based on sequence conservation against the Magnolia biondii mitogenome (Camacho et al., 2009), classifying genes to retain only high-confidence primary orthologs. This resulted in a final set of 29 primary protein-coding genes for analysis. To systematically evaluate the impact of different data types, we constructed five distinct mitochondrial datasets from these genes: complete coding sequences (CDS), first and second codon positions (codon12), amino acid sequences (protein), and two datasets computationally adjusted for predicted RNA editing. Parallel datasets were constructed for 73 plastid genes. All individual gene datasets were aligned using MAFFT and MACSE, trimmed with Gblocks, and concatenated into supermatrices in PhyloSuite v.1.2.3 (Talavera and Castresana, 2007; Katoh and Standley, 2013; Ranwez et al., 2018; Zhang et al., 2020). The detailed criteria for gene classification and sequence processing are available in the Supplementary Methods (Note S1).
2.4. Phylogenetic reconstructionFollowing dataset preparation, we implemented a comprehensive analytical framework to assess the phylogenetic utility of mitochondrial data, applying four distinct strategies to each of the five mitochondrial datasets: unpartitioned Maximum Likelihood (ML), partitioned ML, unpartitioned Bayesian Inference (BI), and partitioned BI. The best-fit evolutionary models for each partition were determined using ModelFinder (Kalyaanamoorthy et al., 2017), implemented in IQ-TREE v.2.1.2 (Minh et al., 2020a,b) with appropriate settings for each data type. For BI model selection, the same core options were used with the addition of the -mset mrbayes parameter.
ML trees were inferred using IQ-TREE with ultrafast bootstrap (10,000 replicates) and nearest-neighbor interchange optimization. BI analysis was conducted using MrBayes v.3.2.7 (Ronquist et al., 2012) with two independent runs of four chains each, for two million generations. The same phylogenetic reconstruction strategies were applied to the three plastid datasets, enabling direct comparison between mitochondrial and plastid phylogenies.
To quantify topological differences between trees derived from different datasets and analytical approaches, we calculated both standard Robinson-Foulds (RF) distances and support-weighted RF distance using the phangorn R package (Robinson and Foulds, 1981; Schliep, 2011). While standard RF distances equally count all differing bipartitions between trees, we implemented a modified approach that incorporates node support values (bootstrap percentages or posterior probabilities) as edge weights. This method normalizes support values to a 0–1 scale and assigns them as weights for internal branches, effectively down-weighting conflicts arising from poorly supported nodes. This support-weighted RF metric provides a more biologically relevant assessment of tree similarity by prioritizing well-supported phylogenetic relationships. Both distance matrices were used in Principal Coordinate Analysis (PCoA) to visualize tree topology distributions in multidimensional space.
2.5. PhyloForensics framework for conflict dissectionTo systematically dissect the sources of phylogenetic conflict, we developed and applied a four-stage analytical framework. This “forensic” approach was designed to move from broad conflict quantification to the precise attribution of instability drivers, culminating in a validated set of optimal phylogenetic markers.
In Stage 1 (Conflict Scene Investigation), we quantified gene- and site-level conflict using concordance factors (gCF/sCF) and implemented a cross-validation strategy against multiple competing reference topologies to ensure the robustness of the conflict signal (Minh et al., 2020a,b; Mo et al., 2023; Lanfear and Hahn, 2024).
In Stage 2 (Systematic Conflict Diagnostics), we employed Internode Certainty (IC) metrics to diagnose the nature of these conflicts at key nodes, distinguishing between strong, systematic biases and weaker, stochastic noise (Salichos and Rokas, 2013; Salichos et al., 2014).
In Stage 3 (Gene Stability Profiling), we developed a novel metric, logL_SD, based on the standard deviation of partition log-likelihoods calculated for each gene across the competing reference topologies. This allowed us to rank all genes by their intrinsic signal stability, independent of any single preferred topology, and use the Elbow Method to objectively partition them into a high-stability “Optimal Set” and a lower-stability “Noisy Set.”
Finally, in Stage 4 (Attribution Analysis), we identified the root causes of this instability. We conducted a comprehensive association analysis to screen for gene attributes significantly correlated with logL_SD. After visualizing these patterns with Principal Component Analysis (PCA), we performed a rigorous multicollinearity diagnostic using the Variance Inflation Factor (VIF) to filter for a set of statistically non-redundant predictors. This allowed us to conduct a robust relative importance analysis using the LMG method (Groemping, 2006) to accurately quantify the independent contribution of each key driver to phylogenetic instability.
To empirically validate the efficacy of this framework, we constructed a new supermatrix using only the genes identified as the “Optimal Set”. We then repeated the full suite of phylogenetic analyses (ML-P, ML-NP, BI-P, BI-NP) on this filtered dataset. The resulting tree topologies and node support values were systematically compared against those derived from the full 29-gene dataset to assess whether our marker selection strategy successfully reduced topological inconsistency and enhanced phylogenetic resolution. A detailed description of the methods implemented in each stage is provided in the Supplementary Methods (Note S1).
2.6. Mitogenomic feature analysisWhile sequence-based phylogenetics formed the core of our analysis, we also investigated whether mitochondrial genomic features could provide complementary phylogenetic signals. We analyzed both continuous features (genome size, GC content, number of RNA editing sites, characteristics of repetitive elements, and root-to-tip branch length) and discrete features (intron retention/loss patterns). Genome sizes and GC content were calculated from the verified mitogenome sequences. RNA editing sites were tallied from the IPMGA annotation results. Repetitive sequences longer than 30 bp were identified using BLAST v.2.7.1 with a word size of 7 and an e-value of 1e-6 (Camacho et al., 2009) and classified into three categories based on length: short (30–100 bp), medium (100–1000 bp), and long (> 1000 bp) repeats. We visualized these features across the phylogeny using the R packages ggtree and ggplot2 (Wickham, 2016; Yu et al., 2016). The root-to-tip branch length for each species was calculated from the phylogenetic tree constructed using mitochondrial protein sequences with a Maximum Likelihood non-partitioned strategy.
For statistical analysis of continuous features, we first examined the distribution of each dataset and applied log transformation to approximate normality. We excluded gymnosperm, Chloranthales, and Ceratophyllales samples from statistical testing due to insufficient sample sizes. Normality and homogeneity of variances were verified using the shapiro.test(.) and leveneTest(.) functions in R. We identified and removed outliers based on quartile analysis to ensure robust statistical inference. One-way ANOVA followed by pairwise t-tests with FDR correction (implemented through the rstatix package) were used to assess differences between major angiosperm lineages (Kassambara, 2023b). Results were visualized using the ggpubr package (Kassambara, 2023a).
2.7. Phylogenetic correlation analysis of mitogenomic featuresTo investigate the evolutionary dynamics of genomic features while accounting for shared ancestry, we conducted phylogenetic comparative analyses. We quantified the phylogenetic signal for each feature using Blomberg’s K statistic (Blomberg and Garland Jr, 2002; Blomberg et al., 2003). We then investigated evolutionary correlations between pairs of mitogenomic features using both standard linear models and phylogenetic generalized least squares (PGLS) regression (Felsenstein, 1985). This dual approach allowed us to identify significant trait correlations and to test whether these relationships held true after controlling for phylogenetic non-independence. Specific R packages and model parameters used for these analyses are detailed in the Supplementary Methods (Note S1).
3. Results 3.1. Filling critical phylogenetic gaps: the first mitogenome assemblies for Chloranthales and CeratophyllalesWe assembled the first comprehensive mitogenomes for representatives of Chloranthales (Chloranthus sessilifolius) and Ceratophyllales (Ceratophyllum demersum). While partial mitochondrial genes from these lineages have been used in previous studies, the availability of complete mitogenomes provides a more robust foundation for phylogenomic analyses. The C. demersum mitogenome comprised three circular subgenomes totaling 594,967 bp (Fig. 2a), with sizes of 102,122 bp, 207,723 bp, and 285,122 bp, with consistent GC content across subgenomes (49.56–49.69%). This mitogenome contained 73 unique genes (40 protein-coding genes, 30 tRNAs, and 3 rRNAs), with 25 introns identified in protein-coding genes, including five trans-spliced introns (nad1i394, nad1i669, nad2i542, nad5i1455, nad5i1477).
|
| Fig. 2 Schematic feature maps of the newly assembled mitogenomes for two key angiosperm lineages. (a) Three circular subgenomes of the Ceratophyllum demersum (Ceratophyllales) mitogenome, totaling 594,967 bp. (b) Four circular subgenomes of the Chloranthus sessilifolius (Chloranthales) mitogenome, totaling 862,408 bp. For both species, the features are displayed from the outermost to the innermost layers as follows: genes located on the positive strand (outer ring) and negative strand (inner ring), the scale coordinate axis, the distribution of GC content, and the distribution of microsatellite repeats, tandem repeats, and dispersed repeat sequences. The innermost circle illustrates the connections between dispersed repeat sequences. |
The Chloranthus sessilifolius mitogenome was assembled into four circular subgenomes totaling 862,408 bp (Fig. 2b), with sizes of 34,845 bp, 47,870 bp, 220,922 bp, and 558,771 bp and GC contents ranging from 45.39 to 46.44%. We identified 60 unique genes, including 38 protein-coding genes (six present in multiple copies: atp9, nad4L, rps13, rps14, rps19, sdh3), 19 tRNAs, and 3 rRNAs. One additional trans-spliced intron (nad1i728) was found compared to Ceratophyllum demersum. The larger size of the C. sessilifolius mitogenome is partly attributable to three large segmental duplications in subgenome 1.
Repeat sequences in both mitogenomes were annotated and visualized using PMGmap (Fig. 2, details in Table S2–S5). These sequences constitute an important component of the overall genome structure and contribute significantly to genome size variation, a pattern explored further in subsequent analyses comparing mitogenomic features across angiosperms.
3.2. Dynamic gene content evolution: widespread ribosomal gene loss in angiosperm mitogenomesTo ensure consistent gene identification across mitogenomes from diverse sources, we re-annotated all genomes using a standardized workflow. This approach was critical for distinguishing genes of true mitochondrial origin from potential artifacts or nuclear mitochondrial DNA insertions. Based on sequence similarity and alignment length, we categorized all annotated genes into three types: primary genes (high-confidence mitochondrial genes covering >85% of the reference gene length), secondary genes (potential gene variants due to premature or delayed stop codons), and redundant genes (likely annotation artifacts or pseudogenes).
Our analysis revealed distinct patterns of gene retention and loss across angiosperm lineages. Notably, ribosomal protein genes (rpl, rps) exhibit widespread loss across angiosperm lineages, particularly in monocots and eudicots (Fig. S1). This pattern of ribosomal gene loss contrasts sharply with the relative conservation of plastid genomes, suggesting distinct evolutionary trajectories of mitochondrial genomes during angiosperm diversification. Conversely, genes related to energy metabolism, such as those encoding ATP synthase, cytochrome c oxidase, and the NADH dehydrogenase complex, were highly conserved with fewer instances of gene loss.
Interestingly, we observed multiple copies of energy metabolism-related genes in several species from diverse lineages, including Amborella trichopoda (an early-branching angiosperm), Magnolia officinalis (a magnoliid), and several eudicots (Malania oleifera, Psychotria serpens, Daucus carota, and Brassica napus). These duplications may represent either genuine evolutionary events or, in some cases such as B. napus with its fragmented genome assembly, potential artifacts that require further verification. Overall, the patterns of gene content variation provide valuable insights into the evolutionary dynamics of angiosperm mitogenomes and identified a core set of 29 conserved genes suitable for phylogenetic analysis.
3.3. Phylogenetic signal variations and topological inconsistencies across organellar datasetsOur phylogenetic analyses of the 29 mitochondrial and 73 plastid genes revealed pervasive topological conflict, both between the two organelles and within the mitochondrial dataset itself (Fig. 3). The most fundamental inter-organellar conflict concerned the placement of Ceratophyllum and Chloranthus. Across all 20 of our analyses, the mitochondrial data consistently recovered a monophyletic Ceratophyllum + Chloranthus clade, often with strong support (e.g., protein tree, BI PP = 1.00; Fig. 3a and b). In stark contrast, the predominant signal in the plastid data rejected this pairing, instead placing Ceratophyllum as sister to the eudicots. However, this internal consistency in the mitochondrial C + C signal did not extend to the deeper backbone of the tree. The relationship of this stable C + C clade to the other major lineages (monocots, eudicots, and magnoliids) was highly unstable, yielding at least four different backbone topologies, with no single topology representing a clear majority.
|
| Fig. 3 Phylogenomic analyses of angiosperms based on mitochondrial and plastid data. (a) Phylogenetic relationships among 57 representative angiosperms and one outgroup inferred from mitochondrial (mt, left) and plastid (pt, right) protein datasets. (b) Alternative topologies recovered from different datasets and inference methods, with frequency indicated for mitochondrial (e.g., 7 topologies across 20 trees) and plastid (e.g., 6 topologies across 11 trees) analyses. Labels indicate dataset source and analytical approach (BI-NP: Bayesian non-partitioned; BI-P: Bayesian partitioned; ML-NP: Maximum Likelihood non-partitioned; ML-P: Maximum Likelihood partitioned). Support values are shown near nodes in order of BI-NP/BI-P/ML-NP/ML-P, where “*” indicates a support value of 100% (ML-BS) or 1.00 (BI-PP), and “-” indicates tree construction failure (occurring only in plastid protein dataset with ML-NP due to model overfitting). Detailed trees can be found in Figs. S2–S9. (c) Principal Coordinates Analysis (PCoA) based on standard Robinson-Foulds (RF) distances between all trees. (d) Distribution of standard RF distances within different datasets. (e) PCoA based on support-weighted RF distances. (f) Distribution of support-weighted RF distances across datasets. |
To quantify these topological disparities, we conducted Principal Coordinate Analysis (PCoA) based on Robinson-Foulds (RF) distances. The analysis confirmed a clear separation between the mitochondrial and plastid topological spaces and visualized the greater internal inconsistency within the mitochondrial dataset, which occupied a more scattered area of the PCoA plot compared to the more tightly clustered plastid dataset (Fig. 3c and e). Crucially, this analysis also revealed a key methodological insight: the mitochondrial datasets computationally adjusted for RNA editing produced significantly more congruent topologies (i.e., had smaller internal RF distances) than the unadjusted CDS or protein datasets (Fig. 3d and f). This finding underscores the critical importance of accounting for RNA editing to mitigate analytical noise in plant mitochondrial phylogenomics and sets the stage for a deeper investigation into the sources of signal conflict.
3.4. PhyloForensics dissection reveals robust, systematic, and pervasive signal conflictTo systematically dissect the sources of the pervasive topological inconsistencies observed in our initial analyses (Fig. 3), we applied our four-stage PhyloForensics framework, beginning with a quantitative assessment of conflict across the phylogeny (Fig. 4). The primary CDS-ML reference tree revealed a core analytical challenge: many nodes with high bootstrap support exhibited strikingly low gene (gCF) and site (sCF) concordance factors (Fig. 4a). A key instance is the node uniting Ceratophyllum and Chloranthus, which has 92% bootstrap support. Its gCF of 21.4% indicates that fewer than a quarter of informative genes support this topology, and its sCF of 34.7% approaches the 33.3% baseline expected under random phylogenetic noise. This disparity, where statistical resampling confidence is high but underlying data agreement is low, points toward strong, systematic conflict rather than stochastic error.
|
| Fig. 4 PhyloForensics framework: dissecting phylogenetic conflict and identifying drivers of gene signal instability. (a) The CDS-ML reference tree. Nodes are annotated with support values in the order of Ultrafast Bootstrap (%), Gene Concordance Factor (gCF), and Site Concordance Factor (sCF). (b) Robustness of conflict signal. Scatterplot of gCF values for common nodes between CDS-ML and Protein-ML reference trees, showing a near-perfect Spearman's rank correlation (table inset; ρ = 0.997, p-value < 0.001), indicating the conflict is independent of reference data type. (c) Determination of the “Optimal” and “Noisy” gene sets using the Elbow Method. The plot of logL_SD stability scores for all 22 gene partitions identifies an optimal set size of 19 genes. (d) Boxplots of five gene attributes with significant differences (FDR < 0.05) between the “Optimal” and “Noisy” sets. (e) Correlation network of gene attributes. Edges represent significant correlations (p-value < 0.05), with width proportional to strength (blue = positive, red = negative). Node colors denote functional categories: Compositional Bias (blue), Sequence Information (green), Evolutionary Dynamics (red), and Signal Quality (orange). (f) PCA biplot of gene attributes separating the “Optimal Set” (blue dots) from the “Noisy Set” (red dots). Vectors indicate attribute contributions to the principal components, which together explain 61.9% of the total variance. (g) Relative importance of instability drivers. LMG analysis from a multivariate linear model (R2 = 0.583) identifies topological entropy variance (t_Entropyvar) and proportion of informative sites (info_sites_ratio) as the dominant drivers of gene signal instability. |
This deep conflict is not an artifact of a single analytical choice. The pattern remains robust across our highest-support alternative reference topologies (Figs. S10 and S11). In the Protein-ML reference, the Ceratophyllum + Chloranthus node is highly supported (BS = 95%) but retains a low gCF of 21.4% (sCF = 39.7%). In the CDS-BI reference, the same node achieved maximum posterior probability (BI PP = 1.00) but possessed an identical gCF of 21.4% (sCF = 40.9%). To formally confirm this independence from any single reference, we conducted a cross-validation of gCF values for all common nodes across the three reference topologies. The gCFs remained remarkably consistent, showing near-perfect Spearman correlations (ρ > 0.99, p-value < 0.001; Figs. 4b and S12). This result definitively proves that the observed gene-tree conflict is a robust and intrinsic feature of the dataset, providing a solid foundation for a deeper investigation into its underlying causes.
To diagnose the nature of this conflict, we used Internode Certainty (IC) metrics to analyze key nodes across the three reference topologies (Tables 1 and S10). The analysis revealed that that conflict manifests at different phylogenetic levels. The monophyly of core clades, such as the monocots, was overwhelmingly supported (“Undetermined” conflict pattern) across all references. At deeper levels, some relationships, like the sister relationship between eudicots and monocots, consistently exhibited symmetric conflict (EFp > 0.05) in both CDS-ML and PRO-ML analyses. This pattern indicates the presence of conflict, but one that is diffuse and stochastic, lacking a single, dominant alternative signal.
| Hypothesis | Tree_Type | NodeID | gCF | IC | EFp | Conflict_Pattern |
| Eudicots_Monocots_Magnoliids_split | CDS-ML | 66 | 13.79 | 0.5026 | 4.55E-02 | Asymmetric |
| Eudicots_Monocots_Ceratophyllales_Chloranthales_split | PRO-ML | 64 | 13.79 | 0.5026 | 4.55E-02 | Asymmetric |
| Monocots_Ceratophyllales_Chloranthales_Magnoliids_split | CDS-BI | 96 | 3.45 | 0.6456 | 4.55E-02 | Asymmetric |
| Eudicots_Monocots_split | CDS-ML | 71 | 17.24 | 0.6184 | 3.17E-01 | Symmetric |
| Ceratophyllales_Chloranthales_split | CDS-ML | 65 | 21.43 | 0.528 | 1.57E-01 | Symmetric |
| Eudicots_Monocots_split | PRO-ML | 65 | 17.24 | 0.4048 | 3.17E-01 | Symmetric |
| Mesangiospermae_split | PRO-ML | 63 | 55.56 | 0.5917 | 3.17E-01 | Symmetric |
| Magnoliids | CDS-BI | 110 | 31.03 | 0.5637 | 3.17E-01 | Symmetric |
| Ceratophyllales_Chloranthales_split | CDS-BI | 98 | 21.43 | 0.2287 | 7.06E-01 | Symmetric |
| Monocots_Ceratophyllales_Chloranthales_split | CDS-BI | 97 | 3.45 | 0.7885 | 3.17E-01 | Symmetric |
| Monocots | CDS-ML | 72 | 79.31 | 0.8327 | NA | Undetermined |
| Magnoliids | CDS-ML | 67 | 31.03 | 0.6695 | NA | Undetermined |
| Mesangiospermae_split | CDS-ML | 64 | 55.56 | 0.7028 | NA | Undetermined |
| Monocots | PRO-ML | 99 | 79.31 | 0.8327 | NA | Undetermined |
| Magnoliids | PRO-ML | 111 | 31.03 | 0.6695 | NA | Undetermined |
| Ceratophyllales_Chloranthales_split | PRO-ML | 110 | 21.43 | 0.6995 | NA | Undetermined |
| Monocots | CDS-BI | 99 | 79.31 | 0.8327 | NA | Undetermined |
| Mesangiospermae_split | CDS-BI | 62 | 55.56 | 0.7028 | NA | Undetermined |
| Notes: This table summarizes conflict diagnostics for key nodes in the angiosperm phylogeny, evaluated against three reference trees (CDS-ML, PRO-ML, CDS-BI). gCF: gene concordance factor (%); IC: internode certainty; EFp: p-value from a binomial test for equal frequencies of the two most common conflicting signals. A significant EFp (* p-value <0.05) indicates Asymmetric conflict, suggesting a strong, systematic alternative signal. Non-significant EFp indicates symmetric conflict, attributable to stochastic noise. Only nodes exhibiting conflict are shown. Nodes with high gCF and no significant conflict are not displayed. | ||||||
Crucially, the analysis also detected significant, asymmetric conflict (EFp < 0.05) indicative of strong systematic bias. However, the specific node exhibiting this bias was highly dependent on the reference topology. With the CDS-ML reference, systematic conflict was localized to the node representing the divergence of eudicots, monocots, and magnoliids. The PRO-ML reference identified it at the branching of those three clades and the Ceratophyllales + Chloranthales clade, while the CDS-BI reference placed it at the base of a clade including monocots, Ceratophyllales, Chloranthales, and magnoliids. This demonstrates that while the dataset contains genuine systematic bias, its exact interpretation—which specific clade is being pulled away by this bias—is contingent on the analytical framework. This context-dependent manifestation of systematic conflict illustrates the limitations of topology-dependent metrics, necessitating a shift to assessing gene signal stability across a representative set of topologies rather than against any single reference.
3.5. Driver analysis of gene stability reveals information content and signal heterogeneity as key factorsTo move beyond topology-dependent conflict metrics, we quantified the intrinsic stability of each gene partition by calculating the standard deviation of its site-wise log-likelihood scores (logL_SD) across the set of competing topologies. This metric, which is independent of any single reference, allowed us to rank all gene partitions from most stable (low logL_SD) to most unstable. Using the Elbow Method on this ranked distribution, we objectively classified the 16 most stable partitions (19 genes) as the “Optimal Set” and the remaining 6 partitions (10 genes) as the “Noisy Set” (Fig. 4c; Tables 2 and S11). To identify the drivers of this stability gap, we first screened 14 intrinsic gene attributes for significant differences between the two sets (Table S12). Five attributes emerged as prime suspects, showing significant divergence (FDR < 0.05) in Wilcoxon or T-tests (Fig. 4d and Table S13). These included metrics of information content (proportion of informative sites, info_sites_ratio), evolutionary rate (Evolution_rate), signal heterogeneity (t_Entropyvar), compositional bias (GC3), and sequence structure (Gap_Proportion), suggesting a multifaceted cause for signal instability.
| Stability_Category | genes |
| Optimal set | rpl16, nad3, atp9, nad7, ccmB, atp6, rpl5, cox2, mttB, nad1, nad4L, nad2, atp1, cob, nad9, cox3, nad6, nad4, nad5 |
| Noisy set | cox1, ccmFN, ccmFC, matR, rps12, rps3, atp4, atp8, ccmC, rps4 |
| Notes: Classification was determined using the logL_SD stability metric (the standard deviation of site-wise log-likelihood scores) calculated for each gene partition across three competing reference topologies. The Elbow Method (see Fig. 4c) was used to objectively partition genes into the high-stability “Optimal Set” (19 genes in 16 partitions) and the low-stability “Noisy Set” (10 genes in 6 partitions). See Methods for a full description of the analytical framework. | |
To untangle the relationships among these attributes, we visualized their interdependencies using a correlation network and a corresponding heatmap (Figs. 4e and S13; Tables S14 and S15). This revealed that the stability score logL_SD was most strongly correlated with signal heterogeneity (t_Entropyvar, ρ = 0.58, p-value < 0.01) and compositional bias (GC3, ρ = 0.51, p-value < 0.05). t_Entropyvar emerged as a central hub in a complex web of multicollinearity, showing significant correlations with nine other attributes. The high degree of intercorrelation among variables confirmed that they do not act independently, necessitating a multivariate approach to isolate the most influential drivers. A Principal Component Analysis (PCA) biplot further synthesized this complexity, showing that the separation between the “Optimal” and “Noisy” gene sets is not driven by a single attribute, but rather by the combined influence of multiple, often correlated, vectors (Fig. 4f). This high degree of intercorrelation confirmed that the attributes do not act independently, necessitating a multivariate approach to isolate the most influential drivers.
To definitively isolate the primary drivers from this web of multicollinearity, we employed a multivariate linear regression model incorporating iterative Variance Inflation Factor (VIF) elimination. This statistical interrogation systematically removed redundant variables with high collinearity, such as GC_Content (VIF >10ˆ9) and Evolution_rate (VIF = 37.0), resulting in a robust, minimized set of predictors (Table S16). A subsequent relative importance analysis (LMG) on this filtered model delivered the final verdict (Fig. 4g). Signal heterogeneity (t_Entropyvar) and the proportion of informative sites (info_sites_ratio) were identified as the two dominant factors, collectively explaining 48.1% of the variance in gene stability within the model (R2 = 0.583). The analysis further pinpointed relative compositional variability (RCV) as the third most important factor, contributing an additional 10.0%. This result pinpoints the quantity and, more critically, the heterogeneity of phylogenetic signal as the most decisive determinants of a gene's reliability.
3.6. Validation of the PhyloForensics framework: the optimal gene set recovers a more congruent phylogenyTo empirically validate our PhyloForensics framework, we constructed a new supermatrix using only genes from the “Optimal Set” and repeated the suite of phylogenetic analyses. The results provide compelling evidence for the framework’s efficacy in reducing signal conflict (Fig. 5). The successful analyses recovered a single, highly congruent topology that was absent from any of the 20 trees derived from the full dataset. This novel topology places Monocots as sister to Magnoliids, with this clade in turn sister to the Ceratophyllales + Chloranthales clade; this entire group is then recovered as sister to the Eudicots. The emergence of this consistent topology after removing the most unstable genes demonstrates the decisive impact of signal heterogeneity on resolving deep phylogenetic relationships. Furthermore, this new topology is more robustly supported at key nodes, as the primary bifurcation in the unpartitioned ML tree received maximum bootstrap support (ML BS = 100%)—a dramatic improvement over the weakly supported backbone in the full-dataset ML tree (BS 41–47%). The successful resolution of deep-level incongruence confirms that our framework effectively identifies and removes phylogenetically misleading data, thereby uncovering a strong, coherent, albeit novel, phylogenetic signal.
|
| Fig. 5 Contrasting phylogenetic signals between the “Optimal” and “Noisy” gene sets. (a–c) Phylogenetic trees inferred from the “Optimal Set” of 19 mitochondrial genes using (a) Unpartitioned Bayesian Inference (BI-NP), (b) Partitioned Bayesian Inference (BI–P), and (c) Unpartitioned Maximum Likelihood (ML-NP). All three methods converged on a congruent topology supporting a Ceratophyllum + Chloranthus clade. Support values are shown near nodes (Ultrafast Bootstrap % for ML, posterior probabilities for BI). An asterisk (*) indicates maximum support. (d and e) In contrast, representative Maximum Likelihood trees for two of the most unstable genes from the “Noisy Set”: (d) rps4 and (e) atp4. The Partitioned Maximum Likelihood (ML-P) analysis for the optimal set is not shown as the model selection process failed. |
In stark contrast to the remarkable congruence recovered from the “Optimal Set,” an examination of individual gene trees from the “Noisy Set” directly visualizes the source of the underlying conflict. We present the ML phylogenies for rps4 and atp4, two genes identified by our logL_SD metric as among the most unstable in the entire mitogenome (Fig. 5d and e). Both trees fail to recover the Ceratophyllum + Chloranthus sister relationship that is highly supported by the combined optimal dataset. The atp4 tree, for instance, places Ceratophyllum as sister to the eudicots, a topology reminiscent of some nuclear and plastid gene analyses. Critically, in both gene trees, Ceratophyllum exhibits a conspicuously long branch, a classic indicator of potential susceptibility to analytical artifacts such as Long-Branch Attraction. These individual gene trees provide a compelling visual validation of our PhyloForensics framework, demonstrating its ability to pinpoint and partition genes that carry strong but conflicting and potentially misleading phylogenetic signals.
3.7. Lineage-specific patterns in mitochondrial genome architectureBeyond sequence-based phylogenetics, we examined whether mitochondrial genomic features could provide complementary phylogenetic signals. We analyzed multiple continuous features (genome size, GC content, RNA editing sites, repeat characteristics) and discrete features (gene structure variation) across lineages (Fig. 6; Tables S6 and S7). Genome size showed remarkable variation across angiosperms, spanning a nearly 20-fold range. Basal angiosperms exhibited the largest average mitogenome size (1777.4 ± 1440.93 kb), with Amborella trichopoda possessing the largest mitogenome in our dataset (3866.04 kb). In contrast, the smallest mitogenome belonged to the seagrass Zostera marina (191.48 kb). Eudicots and monocots displayed similar average genome sizes (543.89 ± 290.72 kb and 534.39 ± 247.83 kb, respectively), suggesting potential convergence in optimal genome size. GC content also varied considerably, ranging from 43.4% in Spinacia oleracea to 49.6% in Ceratophyllum demersum, while RNA editing site counts ranged from 327 sites in Populus alba to 1182 sites in Magnolia officinalis. Repetitive sequences constituted a variable proportion of mitochondrial genomes, with non-redundant repeats accounting for 2.37%–80.0% (in Brassica napus) of total genome size (Fig. 6a, red portions). Among the three size categories of repeats, long repeats (> 1000 bp) contributed the most to total repeat length (43% on average), followed by short repeats (30–100 bp; 37%) and medium-length repeats (100–1000 bp; 21%) (Fig. 6b and c). This distribution pattern highlights the significant but variable role of repeats in mitogenome size evolution.
|
| Fig. 6 Mitochondrial genome characteristics and repeat sequences across angiosperms. (a) Distribution of genome size (kb) with red portions indicating non-redundant repeat sequences, GC content (%), RNA editing sites, and structural variation in cox2 and rpl2 genes. Vertical lines and bottom values indicate the mean values across all angiosperm species. (b) Repeat sequences categorized by length: 30–100 bp (blue), 100–1000 bp (orange), and >1k bp (green). (c) Proportional distribution of repeat sequence lengths across species. |
We identified distinct lineage-specific patterns in gene structure, particularly evident in cox2 and rpl2 genes. The cox2 gene exhibited complex structural variation across major angiosperm lineages, with two introns (cox2i373 and cox2i691) showing different evolutionary trajectories. The first intron (cox2i373) displayed a relatively stable pattern in basal angiosperms (Amborella, Nymphaeales) but showed sporadic losses in derived lineages, particularly in recently diverged eudicots, where the gene was reduced to a single long exon. In contrast, cox2i691 demonstrated more dramatic variation, including complete losses in some lineages and substantial size expansions in others. Notably, we observed significant expansions of cox2i691 in Lilium (approximately 40 kb) and Nymphaea (approximately 10 kb), suggesting independent evolution of intron size across different angiosperm lineages. The monocots and early-diverging eudicots generally maintained the ancestral multi-exon structure, though with varying intron sizes. This complex pattern of intron retention, loss, and size variation in cox2 provides additional phylogenetic information beyond sequence data alone. The ribosomal protein gene rpl2 showed even more dramatic evolutionary patterns, having been widely lost in both eudicots and monocots. Interestingly, we also detected its loss in Saururus (Magnoliids), suggesting independent loss events across major angiosperm lineages. When present, rpl2 typically maintained a conserved structure with two introns, though some lineage-specific variations in intron size were observed.
To statistically evaluate differences between major angiosperm lineages, we conducted variance analysis on log-transformed feature data after removing outliers (Fig. 7 and Table S8). Among 63 pairwise comparisons between lineages, 21 showed significant differences. Notably, monocots and eudicots showed no significant differences across all comparisons, suggesting potential convergence in mitogenomic features. Branch length comparisons revealed no significant difference between magnoliids and basal angiosperms, though both had significantly lower values than monocots and eudicots (adjusted p-value = 0.000569 and 0.000729, respectively). Magnoliids exhibited distinctive patterns in several features, including significantly higher GC content and more RNA editing sites compared to monocots and eudicots. They also showed significant differences in repeat characteristics, with higher abundance of short to medium-length repeats (Fig. 7e, i, j, l). However, total mitogenome size, total repeat length, and long repeat characteristics showed no significant variation among lineages, suggesting these features may evolve more randomly or be subject to different selective pressures.
|
| Fig. 7 One-way ANOVA of mitogenomic features across angiosperm lineages. Data were natural log-transformed (ln), and sample numbers vary slightly across tests due to outlier removal in certain lineages. Abbreviations: TN, Total number; TL, Total length. Significance levels: ns, p-value >0.05; *, p-value ≤ 0.05; **, p-value ≤ 0.01; ***, p-value ≤ 0.001; ****, p-value ≤ 0.0001. |
To determine whether mitogenomic features contain phylogenetic signal and evolve in correlated patterns, we conducted phylogenetic comparative analyses based on the tree derived from the complete set of protein-coding genes. This approach ensures that the analysis of whole-genome features is contextualized by the average evolutionary signal of the entire mitochondrial proteome, providing a robust and conservative test of macro-evolutionary correlations. We calculated Blomberg’s K statistic for each feature, which quantifies the strength of phylogenetic signal relative to what would be expected under a Brownian motion model of trait evolution (Table 3). Branch length exhibited the strongest phylogenetic signal with a K-value of 5.525 (p-value = 0.001), indicating that closely related species have much more similar branch lengths than expected under random evolution. GC content and RNA editing site number showed K-value close to 1 (0.948 and 1.092, respectively; both p-value = 0.001), suggesting these traits evolve following a Brownian motion model, with closely related species showing similar values due to shared ancestry. Most repeat-related features also exhibited significant phylogenetic signals, with short and medium-length repeat characteristics showing particularly strong signals (K-value between 1.8 and 2.5; all p-value = 0.001). Interestingly, mitogenome size showed a K-value of 0.510 (p-value = 0.001), indicating a weaker phylogenetic signal than expected under Brownian motion. This suggests that genome size evolves more rapidly or erratically than other features, with less predictable patterns across the phylogeny. Even more striking, the length and number of large repeats (> 1 kb) showed no significant phylogenetic signal (K = 0.209 and 0.231; p-value = 0.872 and 0.645, respectively), indicating these features likely evolve independently of phylogenetic relationships.
| log-transformed data | K | Observed Variance | Randomized Mean | p-value | Z-score |
| branch length (N = 58) | 5.525 | 0.029 | 0.444 | 0.001 | −1.006 |
| genome size (N = 58) | 0.510 | 0.929 | 4.544 | 0.001 | −0.863 |
| GC% (N = 58) | 0.948 | 0.002 | 0.013 | 0.001 | −0.991 |
| RNA editing number (N = 58) | 1.092 | 0.228 | 1.337 | 0.001 | −1.040 |
| Total number of repeats (N = 58) | 2.440 | 7.783 | 71.357 | 0.001 | −0.925 |
| Total length of repeats (N = 58) | 1.128 | 11.615 | 56.045 | 0.002 | −0.886 |
| Number of 30–100 bp repeats (N = 58) | 2.424 | 8.101 | 70.702 | 0.001 | −0.904 |
| Length of 30–100 bp repeats (N = 58) | 2.479 | 7.865 | 77.394 | 0.001 | −0.887 |
| Number of 100–1k bp repeats (N = 58) | 1.800 | 9.748 | 64.843 | 0.001 | −0.790 |
| Length of 100–1k bp repeats (N = 58) | 2.267 | 7.153 | 58.088 | 0.001 | −0.804 |
| Number of > 1k bp repeats (N = 47) | 0.231 | 6.754 | 6.718 | 0.645 | 0.011 |
| Length of > 1k bp repeats (N = 47) | 0.209 | 12.519 | 9.141 | 0.872 | 1.037 |
| Notes: Each column represents the following: Blomberg’s K statistic, Phylogenetically Independent Contrasts (PIC) Observed Variance, PIC Variance Randomized Mean, PIC Variance p-value, and PIC Variance Z-score. A K-value of 1 indicates that the trait follows a Brownian motion model of evolution along the phylogeny. A K-value greater than 1 suggests that the trait exhibits a stronger phylogenetic signal than expected under a Brownian motion model, while a K-value less than 1 indicates a weaker phylogenetic signal than expected under such a model. | |||||
To explore evolutionary correlations between mitogenomic features, we performed regression analyses using both standard linear models and phylogenetic generalized least squares (PGLS) to account for shared evolutionary history. After examining all pairwise comparisons, we selected 12 feature pairs showing potential correlations for detailed analysis (Fig. 8; Tables 4 and S9). We found strong negative correlations between branch length and both RNA editing site number and GC content. Species with longer branch lengths consistently showed fewer RNA editing sites, with editing site number explaining 65.7% of variation in branch length in our linear model (dotted line, Fig. 8c; t = −10.085, p-value = 6.26e-14, adjusted R2 = 0.651). This relationship remained highly significant even after accounting for phylogenetic non-independence in our PGLS model (solid black line, t = −3.737, p-value = 0.0005). Similar patterns emerged between branch length and GC content (Fig. 8b; linear model: t = −7.013, p-value = 4.29e-09, adjusted R2 = 0.472; PGLS: t = −2.722, p-value = 0.0088). These relationships were also evident within major lineages. Both eudicots and monocots showed significant negative correlations between branch length and RNA editing site number (eudicots: R = −0.74, p-value = 4.5e-7; monocots: R = −0.8, p-value = 0.0017), though magnoliids did not show a significant correlation (R = −0.9, p-value = 0.1), possibly due to limited sample size. As expected, RNA editing site number and GC content showed a significant positive correlation (Fig. 8d; linear model: p-value = 2.93e-05, adjusted R2 = 0.274; PGLS: t = 2.838, p-value = 0.0065), consistent with the biochemical mechanism of cytosine-to-uracil RNA editing.
|
| Fig. 8 Regression analysis of mitogenomic features using a standard linear model and a phylogenetic generalized least squares (PGLS) model. For each sub-plot, the dashed black line represents the overall trend from the standard linear model (LM), while the solid black line represents the overall trend from the PGLS model, which accounts for phylogenetic relationships. Colored solid lines depict the PGLS trends within major angiosperm lineages. Statistical annotations on the plots summarize the results for the overall trend lines only. For detailed PGLS results within each major lineage, please refer to Table 4. |
| Dependent ~ Independent | Group | slope | t value | p-value | Residual Std. Error |
| branch length ~ genome size | Total (N = 56) | −0.003 | −0.137 | 0.892 | 0.134 |
| Eudicots (N = 34) | 0.791 | 0.435 | 0.124 | ||
| Magnolia (N = 5) | −1.311 | 0.281 | 0.063 | ||
| Monocots (N = 12) | −0.321 | 0.755 | 0.100 | ||
| branch length ~ GC% | Total (N = 55) | −1.439 | −2.722 | 0.009 | 0.132 |
| Eudicots (N = 34) | −2.877 | 0.007 | 0.112 | ||
| Magnolia (N = 5) | −0.659 | 0.557 | 0.074 | ||
| Monocots (N = 11) | −1.901 | 0.090 | 0.089 | ||
| branch length ~ RNA editing number | Total (N = 55) | −0.164 | −3.737 | 0.001 | 0.120 |
| Eudicots (N = 34) | −2.511 | 0.017 | 0.115 | ||
| Magnolia (N = 4) | −2.526 | 0.127 | 0.038 | ||
| Monocots (N = 12) | −3.182 | 0.010 | 0.071 | ||
| RNA editing number ~ GC% | Total (N = 54) | 3.656 | 2.838 | 0.007 | 0.352 |
| Eudicots (N = 34) | 3.209 | 0.003 | 0.353 | ||
| Magnolia (N = 4) | 0.051 | 0.964 | 0.140 | ||
| Monocots (N = 12) | 1.483 | 0.169 | 0.159 | ||
| genome size ~ GC% | Total (N = 55) | −0.195 | −0.067 | 0.947 | 0.800 |
| Eudicots (N = 34) | −1.514 | 0.140 | 0.725 | ||
| Magnolia (N = 5) | −0.515 | 0.642 | 0.529 | ||
| Monocots (N = 12) | 1.070 | 0.310 | 0.555 | ||
| genome size ~ RNA editing number | Total (N = 55) | 0.218 | 0.769 | 0.446 | 0.774 |
| Eudicots (N = 34) | 0.176 | 0.861 | 0.750 | ||
| Magnolia (N = 5) | 0.470 | 0.685 | 0.297 | ||
| Monocots (N = 12) | 1.296 | 0.224 | 0.542 | ||
| genome size ~ total length of repeats | Total (N = 55) | 0.156 | 4.878 | <0.0001 | 0.658 |
| Eudicots (N = 34) | 5.159 | <0.0001 | 0.555 | ||
| Magnolia (N = 5) | 2.735 | 0.072 | 0.295 | ||
| Monocots (N = 12) | 0.012 | 0.990 | 0.586 | ||
| genome size ~ total number of repeats | Total (N = 54) | 0.226 | 5.299 | <0.0001 | 0.626 |
| Eudicots (N = 34) | 5.000 | <0.0001 | 0.562 | ||
| Magnolia (N = 5) | 1.688 | 0.190 | 0.395 | ||
| Monocots (N = 11) | 2.190 | 0.056 | 0.404 | ||
| genome size ~ length of 30–100bp repeats | Total (N = 54) | 0.220 | 5.049 | <0.0001 | 0.636 |
| Eudicots (N = 34) | 4.673 | 0.000 | 0.579 | ||
| Magnolia (N = 5) | 1.573 | 0.214 | 0.409 | ||
| Monocots (N = 11) | 2.214 | 0.054 | 0.403 | ||
| genome size ~ length of 100–1k bp repeats | Total (N = 54) | 0.224 | 5.250 | <0.0001 | 0.953 |
| Eudicots (N = 34) | 6.050 | <0.0001 | 0.513 | ||
| Magnolia (N = 5) | 3.063 | 0.055 | 0.272 | ||
| Monocots (N = 11) | 0.649 | 0.533 | 0.489 | ||
| genome size ~ number of 30–100bp repeats | Total (N = 54) | 0.207 | 4.700 | <0.0001 | 0.988 |
| Eudicots (N = 34) | 4.343 | 0.000 | 0.595 | ||
| Magnolia (N = 5) | 1.590 | 0.210 | 0.407 | ||
| Monocots (N = 11) | 2.420 | 0.039 | 0.390 | ||
| genome size ~ number of 100–1k bp repeats | Total (N = 54) | 0.200 | 5.778 | <0.0001 | 0.920 |
| Eudicots (N = 34) | 6.396 | <0.0001 | 0.497 | ||
| Magnolia (N = 5) | 2.560 | 0.083 | 0.309 | ||
| Monocots (N = 11) | 0.870 | 0.407 | 0.481 |
Genome size showed moderate positive correlations with various repeat features, with different repeat metrics explaining 16.6%–25.5% of genome size variation (Fig. 8e–i and Table S9). These correlations were significant in both standard and phylogenetic models, suggesting that the accumulation of repetitive elements—particularly short-to medium-length repeats—plays an important role in mitogenome size evolution across angiosperms.
4. Discussion 4.1. Beyond a single tree: navigating layers of conflict in the angiosperm mitogenomeOur analyses, anchored by the first complete mitogenomes for Ceratophyllum and Chloranthus, consistently recover a strongly supported sister relationship between these two lineages. This finding is not novel, but rather a robust confirmation of a persistent phylogenetic signal previously detected in other mitochondrial studies (Qiu et al., 2010; Xue et al., 2022; Lin et al., 2025). More importantly, this result places our work at the epicenter of a classic, unresolved botanical enigma: the deep incongruence between phylogenies derived from the three genomic compartments (Wickett et al., 2014; Leebens-Mack et al., 2019; Li et al., 2019; Hu et al., 2023). The stark discordance between high bootstrap support and low gene- and site-concordance factors (gCF/sCF) at key nodes is a classic signature of strong, systematic conflict, not just random error (Fig. 4a). Notably, the sCF values for nodes defining the relationships among the five major clades are near or below the 33.3% baseline expected under random noise, highlighting profound site-level heterogeneity.
This complex conflict landscape is likely the product of multiple, superimposed evolutionary and analytical forces. First, a substantial portion of the conflict appears to stem from the unique, often tumultuous, evolutionary histories of individual genes. The “Noisy Set” we identified is not a random assortment, but is significantly enriched with genes whose biological roles predispose them to conflicting evolutionary pressures. These include genes with auxiliary functions like cytochrome c maturation (ccm genes), or functionally distinct maturase (matR), which may be subject to different selective pressures than core respiratory genes. This instability is exacerbated by events that disrupt vertical inheritance. For instance, our “Noisy Set” is replete with genes frequently implicated in horizontal gene transfer (HGT), such as atp4 and cox1 (Park et al., 2015; Yu et al., 2023), and ribosomal protein (rps) genes with well-documented histories of repeated loss and functional transfer to the nucleus (Adams and Palmer, 2003; Mower, 2020). Beyond these gene-specific histories, the symmetric conflict patterns detected by our IC analysis (Table 1) point towards non-directional processes like incomplete lineage sorting (ILS), which remains a plausible factor in the context of a rapid, ancient radiation (Minh et al., 2020a,b; Lanfear and Hahn, 2024). Furthermore, complex inheritance patterns, such as paternal leakage or hybridization-driven organellar capture, provide another mechanism for mitochondrial introgression that can create topological conflict (Camus et al., 2022; Wang et al., 2024a, 2024c). Finally, systematic analytical artifacts cannot be discounted, as evidenced by the conspicuously long branch of Ceratophyllum in the trees of unstable genes like atp4 (Fig. 5e), a classic indicator of potential susceptibility to Long-Branch Attraction.
Given this multi-causal conflict landscape, our study’s central innovation was to develop the PhyloForensics framework, a diagnostic approach designed to move beyond simple conflict metrics and identify the intrinsic properties that render a gene’s signal unreliable. Our driver analysis (Section 3.5) delivered a clear verdict: a gene’s stability is not primarily dictated by its evolutionary rate, but by the quality and heterogeneity of its signal. We identified topological entropy variance (t_Entropyvar) and the proportion of informative sites (info_sites_ratio) as the two dominant factors explaining signal instability (Fig. 4g). The critical role of t_Entropyvar resonates with a growing body of research demonstrating that high signal heterogeneity and substitution saturation are primary obstacles to deep-time phylogenetic inference (Duchene et al., 2022). This reinforces the idea that the most reliable genes are those that have evolved under consistent, rather than necessarily slow, selective pressures (Salichos and Rokas, 2013; Salichos et al., 2014). The seemingly counterintuitive role of a high info_sites_ratio in the ‘Noisy Set’ is key: it confirms these genes are not merely uninformative, but are “loudly conflicted,” saturated with strong but contradictory signals. The subsequent empirical validation, where the “Optimal Set” of genes—those with low internal signal conflict—recovered a single, highly congruent topology (Section 3.6; Fig. 5), provides compelling evidence that our framework successfully identifies and removes these noisy markers. It succeeds not by simply filtering for data quantity, but by diagnosing data quality, thereby uncovering the coherent phylogenetic signal that was previously obscured by internal conflict. While promising, the generalizability of this framework warrants further validation. Future applications could test its stability across broader taxonomic scales or use a robust nuclear species tree as a reference, which may allow for a more direct identification of genes involved in cyto-nuclear conflict.
4.2. Beyond sequence: mitogenomic architecture as a source of deep evolutionary signalBeyond the conflicts inherent in sequence data, the architecture of the mitogenome itself provides a complementary and often more stable source of phylogenetic signal. Our analysis of genomic features reveals that their evolution is not random, but is instead deeply intertwined with the phylogenetic history of angiosperms, opening a valuable avenue for exploring evolutionary relationships that might be obscured by sequence-level noise.
Many mitogenomic features exhibited strong phylogenetic signals, indicating their evolution is significantly correlated with shared evolutionary history. As measured by Blomberg’s K statistic (Table 3), branch length showed the strongest signal, followed by most repeat-related features, and then GC content and RNA editing site (Table 3). Most striking was the discovery of unexpected correlations between branch length and both GC content and RNA editing site abundance (Fig. 8b and c; Table 4). Species with longer branch lengths in the phylogenetic tree consistently exhibited lower GC content and fewer RNA editing sites, relationships that remained significant even after accounting for phylogenetic non-independence in PGLS models. The mechanisms underlying these correlations likely involve fundamental aspects of mitochondrial genome evolution. High GC content may result from either adaptive pressures (selection for translational efficiency) or non-adaptive processes like biased gene conversion, and is often linked to elevated levels of C-to-U RNA editing, explaining the positive correlation observed between these two features (Smith, 2012) (Fig. 8d and Table 4). The negative correlation between branch length and both GC content and RNA editing sites suggests a coordinated evolution between sequence substitution rate and editing requirements. Indeed, previous work has shown that accelerated mutation rates can correlate with reduced RNA editing, possibly as lineages with faster sequence evolution are selected for reduced dependency on the complex editing machinery (Liu et al., 2020). This pattern is consistent with the broader observation that early-diverging angiosperms generally retain significantly more ancestral RNA editing sites than more derived clades like monocots and eudicots (Hein et al., 2016; Yu et al., 2025).
In contrast to the strong phylogenetic signal in GC content and RNA editing, genome size evolution showed more complex patterns. Exhibiting a significant but weaker phylogenetic signal (K = 0.510, p-value = 0.001; Table 3) than for other features, indicating that genome size may be more influenced by lineage-specific factors such as recombination rates, repeat dynamics, and foreign DNA integration than by shared evolutionary history alone. Genome size correlated significantly with various repeat metrics (explaining 16.6–25.5% of variation; Fig. 8e–i and Table 4), particularly for short and medium-length repeats which themselves carried strong phylogenetic signal (Table 3). However, large repeats (> 1 kb) showed no significant phylogenetic signal, suggesting they evolve more independently, potentially through large, sporadic insertion or deletion events. This complexity in mitogenome size evolution reflects the dynamic nature of plant mitochondrial genomes, which undergo frequent rearrangements and content fluctuations while maintaining core functional elements (Smith and Keeling, 2015; Wang et al., 2024b).
Beyond these quantitative features, gene structural variations also revealed lineage-specific patterns that corroborate phylogenetic relationships. The cox2 intron, for instance, exhibits remarkable variation patterns across angiosperm lineages, as clearly demonstrated in our comparative analysis (Fig. 6a). This variation results from a complex evolutionary history involving multiple mechanisms. Within our sampled taxa, cox2 contains up to two introns (cox2i373 and cox2i691) with striking lineage-specific presence/absence and size patterns—the first intron (cox2i373) is consistently present in basal angiosperms like Amborella and Nymphaeales but shows sporadic losses in monocots and eudicots, while the second intron (cox2i691) displays more dramatic variation, including significant size expansions in Lilium and Nymphaea (Fig. 6a). These patterns align with previous findings that cox2i691 undergoes more frequent losses primarily through retroprocessing (Edera et al., 2023). Alternative mechanisms like horizontal gene transfer might also affect cox2 intron evolution (Hepburn et al., 2012), potentially contributing to the inconsistent intron distribution observed. While complex, the observed lineage-specific patterns in cox2 intron structure provide valuable qualitative phylogenetic information that might complements sequence-based analyses, particularly in highlighting distinctions between major angiosperm groups.
Together, these findings demonstrate that diverse mitogenomic features, ranging from GC content and RNA editing to repeat dynamics and gene structure, carry significant phylogenetic signal and offer valuable insights into understanding angiosperm evolutionary history. They provide valuable independent lines of evidence for interrogating contentious relationships and reveal the complex interplay between genome structure, RNA editing, molecular evolution rates, and lineage-specific diversification patterns.
AcknowledgementsWe thank Dr. Mark Pitkin Simmons (Colorado State University) for the valuable comments and kind help in manuscript preparation. This work was funded by the Shenzhen Science and Technology Program (Grant No. JCYJ20241202130723030); the National Natural Science Foundation of China (Grant No. 32170238); the Guangdong Pearl River Talent Program (Grant No. 2021QN02N792); the Shenzhen Fundamental Research Program (Grant No. JCYJ20220818103212025); the Chinese Academy of Agricultural Sciences Elite Youth Program (110243160001007) to Z.W.; and this work was also supported by the Shenzhen Key Laboratory of Southern Subtropical Plant Diversity.
CRediT authorship contribution statement
Li-Yun Nie: Formal analysis, Software, Visualization, Investigation, Data curation, Writing – original draft, Writing – review & editing. Jie Wang: Formal analysis, Software, Visualization, Investigation, Data curation. Lei Huang: Methodology, Data curation, Writing – review & editing. Jia-Li Kong: Methodology, Data curation. Bao Nie: Writing – review & editing. Luke R. Tembrock: Writing – review & editing. Shan-Shan Dong: Funding acquisition, Writing – review & editing. Ravi Tiwari: Writing – review & editing. Hui Wang: Funding acquisition, Writing – review & editing. Sheng-Long Kan: Conceptualization, Supervision, Writing – review & editing. Xin-Hui Zou: Supervision, Methodology, Writing–review & editing. Zhi-Qiang Wu: Conceptualization, Supervision, Project administration, Funding acquisition, Writing – review & editing.
Data availability statement
The newly assembled mitochondrial genomes generated in this study have been deposited in GenBank under accession numbers PV557754-PV557756 for the three subgenomes of Ceratophyllum demersum and PV557757-PV557760 for the four subgenomes of Chloranthus sessilifolius. All primary sequence data, alignment matrices, and phylogenetic trees necessary to reproduce the analyses reported in this study are available in the data repository of Plant Diversity at Science Data Bank (https://doi.org/10.57760/sciencedb.30119). Analysis scripts and workflows are available at GitHub (https://github.com/Liyun-Nie/mtAPG.git, https://github.com/Liyun-Nie/PhyloForensics).
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2025.10.003.
Adams, K.L., Palmer, J.D., 2003. Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Mol. Phylogenet. Evol., 29: 380-395. DOI:10.1016/S1055-7903(03)00194-5 |
APG Ⅳ, 2016. An update of the angiosperm phylogeny group classification for the orders and families of flowering plants: APG Ⅳ. Bot. J. Linn. Soc., 181: 1-20. |
Bankevich, A., Nurk, S., Antipov, D., et al., 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19: 455-477. DOI:10.1089/cmb.2012.0021 |
Blomberg, S.P., Garland Jr, T., 2002. Tempo and mode in evolution: phylogenetic inertia, adaptation and comparative methods. J. Evol. Biol., 15: 899-910. DOI:10.1046/j.1420-9101.2002.00472.x |
Blomberg, S.P., Garland Jr., T., Ives, A.R., 2003. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution, 57: 717-745. DOI:10.1111/j.0014-3820.2003.tb00285.x |
Bolger, A.M., Lohse, M., Usadel, B., 2014. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics, 30: 2114-2120. DOI:10.1093/bioinformatics/btu170 |
Camacho, C., Coulouris, G., Avagyan, V., et al., 2009. BLAST+: architecture and applications. BMC Bioinformatics, 10: 421. DOI:10.1186/1471-2105-10-421 |
Camus, M.F., Alexander-Lawrie, B., Sharbrough, J., et al., 2022. Inheritance through the cytoplasm. Heredity (Edinb), 129: 31-43. DOI:10.1038/s41437-022-00540-2 |
Christensen, A.C., 2021. Plant mitochondria are a riddle wrapped in a mystery inside an enigma. J. Mol. Evol., 89: 151-156. DOI:10.1007/s00239-020-09980-y |
Duchene, D.A., Mather, N., Van Der Wal, C., et al., 2022. Excluding loci with substitution saturation improves inferences from phylogenomic data. Syst. Biol., 71: 676-689. DOI:10.1093/sysbio/syab075 |
Edera, A.A., Howell, K.A., Nevill, P.G., et al., 2023. Evolution of cox2 introns in angiosperm mitochondria and efficient splicing of an elongated cox2i691 intron. Gene, 869: 147393. DOI:10.1016/j.gene.2023.147393 |
Edera, A.A., Small, I., Milone, D.H., et al., 2021. Deepred-Mt: deep representation learning for predicting C-to-U RNA editing in plant mitochondria. Comput. Biol. Med., 136: 104682. DOI:10.1016/j.compbiomed.2021.104682 |
Felsenstein, J., 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39: 783-791. DOI:10.1111/j.1558-5646.1985.tb00420.x |
Gitzendanner, M.A., Soltis, P.S., Wong, G.K.S., et al., 2018. Plastid phylogenomic analysis of green plants: a billion years of evolutionary history. Am. J. Bot., 105: 291-301. DOI:10.1002/ajb2.1048 |
Groemping, U., 2006. Relative importance for linear regression in R: the package relaimpo. J. Stat. Softw., 17: 1-27. |
Guo, C., Luo, Y., Gao, L.M., et al., 2023. Phylogenomics and the flowering plant tree of life. J. Integr. Plant Biol., 65: 299-323. DOI:10.1111/jipb.13415 |
Hein, A., Polsakiewicz, M., Knoop, V., 2016. Frequent chloroplast RNA editing in early-branching flowering plants: pilot studies on angiosperm-wide coexistence of editing sites and their nuclear specificity factors. BMC Evol. Biol., 16: 23. DOI:10.1186/s12862-016-0589-0 |
Hepburn, N.J., Schmidt, D.W., Mower, J.P., 2012. Loss of two introns from the Magnolia tripetala mitochondrial cox2 gene implicates horizontal gene transfer and gene conversion as a novel mechanism of intron loss. Mol. Biol. Evol., 29: 3111-3120. DOI:10.1093/molbev/mss130 |
Hu, H.Y., Sun, P.C., Yang, Y.Z., et al., 2023. Genome-scale angiosperm phylogenies based on nuclear, plastome, and mitochondrial datasets. J. Integr. Plant Biol., 65: 1479-1489. DOI:10.1111/jipb.13455 |
Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K., et al., 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods, 14: 587-589. DOI:10.1038/nmeth.4285 |
Kan, S.L., Liao, X.Z., Wu, Z.Q., 2022. The roles of mutation and selection acting on mitochondrial genomes inferred from intraspecific variation in seed plants. Genes (Basel), 13: 1036. DOI:10.3390/genes13061036 |
Kassambara, A., 2023a. ggpubr: 'ggplot2' based publication ready plots. R package version, 0.6.0. |
Kassambara, A., 2023b. rstatix: Pipe-friendly framework for basic statistical tests. R package version, 0.7.2. |
Katoh, K., Standley, D.M., 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30: 772-780. DOI:10.1093/molbev/mst010 |
Kolmogorov, M., Bickhart, D.M., Behsaz, B., et al., 2020. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods, 17: 1103-1110. DOI:10.1038/s41592-020-00971-x |
Kong, J.L., Wang, J., Nie, L.Y., et al., 2025. Evolutionary dynamics of mitochondrial genomes and intracellular transfers among diploid and allopolyploid cotton species. BMC Biology, 23: 9. DOI:10.1186/s12915-025-02115-z |
Lanfear, R., Hahn, M.W., 2024. The meaning and measure of concordance factors in phylogenomics. Mol. Biol. Evol., 41: msae214. DOI:10.1093/molbev/msae214 |
Leebens-Mack, J.H., Barker, M.S., Carpenter, E.J., et al., 2019. One thousand plant transcriptomes and the phylogenomics of green plants. Nature, 574: 679-685. DOI:10.1038/s41586-019-1693-2 |
Li, H.T., Yi, T.S., Gao, L.M., et al., 2019. Origin of angiosperms and the puzzle of the Jurassic gap. Nat. Plants, 5: 461-470. DOI:10.1038/s41477-019-0421-0 |
Li, H., 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34: 3094-3100. DOI:10.1093/bioinformatics/bty191 |
Lin, D.L., Shao, B.Y., Gao, Z.Y., et al., 2025. Phylogenomics of angiosperms based on mitochondrial genes: insights into deep node relationships. BMC Biology, 23: 45. DOI:10.1186/s12915-025-02135-9 |
Lin, Q.S., Braukmann, T.W.A., Gomez, M.S., et al., 2022. Mitochondrial genomic data are effective at placing mycoheterotrophic lineages in plant phylogeny. New Phytol., 236: 1908-1921. DOI:10.1111/nph.18335 |
Liu, F., Fan, W.S., Yang, J.B., et al., 2020. Episodic and guanine-cytosine-biased bursts of intragenomic and interspecific synonymous divergence in Ajugoideae (Lamiaceae) mitogenomes. New Phytol., 228: 1107-1114. DOI:10.1111/nph.16753 |
Liu, Y., Cox, C.J., Wang, W., et al., 2014a. Mitochondrial phylogenomics of early land plants: mitigating the effects of saturation, compositional heterogeneity, and codon-usage bias. Syst. Biol., 63: 862-878. DOI:10.1093/sysbio/syu049 |
Liu, Y., Medina, R., Goffinet, B., 2014b. 350 my of mitochondrial genome stasis in mosses, an early land plant lineage. Mol. Biol. Evol., 31: 2586-2591. DOI:10.1093/molbev/msu199 |
Ma, J.X., Sun, P.C., Wang, D.D., et al., 2021. The Chloranthus sessilifolius genome provides insight into early diversification of angiosperms. Nat. Commun., 12: 6929. DOI:10.1038/s41467-021-26931-3 |
Minh, B.Q., Schmidt, H.A., Chernomor, O., et al., 2020a. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37: 1530-1534. DOI:10.1093/molbev/msaa015 |
Minh, B.Q., Hahn, M.W., Lanfear, R., 2020b. New methods to calculate concordance factors for phylogenomic datasets. Mol. Biol. Evol., 37: 2727-2733. DOI:10.1093/molbev/msaa106 |
Mo, Y.K., Lanfear, R., Hahn, M.W., et al., 2023. Updated site concordance factors minimize effects of homoplasy and taxon sampling. Bioinformatics, 39: btac741. DOI:10.1093/bioinformatics/btac741 |
Mower, J.P., 2020. Variation in protein gene and intron content among land plant mitogenomes. Mitochondrion, 53: 203-213. DOI:10.1016/j.mito.2020.06.002 |
Palmer, J.D., Herbon, L.A., 1988. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J. Mol. Evol., 28: 87-97. DOI:10.1007/BF02143500 |
Park, S., Grewe, F., Zhu, A.D., et al., 2015. Dynamic evolution of Geranium mitochondrial genomes through multiple horizontal and intracellular gene transfers. New Phytol., 208: 570-583. DOI:10.1111/nph.13467 |
Postel, Z., Sloan, D.B., Gallina, S., et al., 2023. The decoupled evolution of the organellar genomes of Silene nutans leads to distinct roles in the speciation process. New Phytol., 239: 766-777. DOI:10.1111/nph.18966 |
Qian, H., Zhang, J., Zhao, J.C., 2022. How many known vascular plant species are there in the world? An integration of multiple global plant databases. Biodivers. Sci., 30: 22254. DOI:10.17520/biods.2022254 |
Qiu, Y.L., Li, L.B., Wang, B., et al., 2010. Angiosperm phylogeny inferred from sequences of four mitochondrial genes. J. Syst. Evol., 48: 391-425. DOI:10.1111/j.1759-6831.2010.00097.x |
Ranwez, V., Douzery, E.J.P., Cambon, C., et al., 2018. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol. Biol. Evol., 35: 2582-2584. DOI:10.1093/molbev/msy159 |
Robinson, D.F., Foulds, L.R., 1981. Comparison of phylogenetic trees. Math. Biosci., 53: 131-147. DOI:10.1016/0025-5564(81)90043-2 |
Ronquist, F., Teslenko, M., Van Der Mark, P., et al., 2012. MrBayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space. Syst. Biol., 61: 539-542. DOI:10.1093/sysbio/sys029 |
Salichos, L., Rokas, A., 2013. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature, 497: 327-331. DOI:10.1038/nature12130 |
Salichos, L., Stamatakis, A., Rokas, A., 2014. Novel information theory-based measures for quantifying incongruence among phylogenetic trees. Mol. Biol. Evol., 31: 1261-1271. DOI:10.1093/molbev/msu061 |
Schliep, K.P., 2011. phangorn: phylogenetic analysis in R. Bioinformatics, 27: 592-593. DOI:10.1093/bioinformatics/btq706 |
Shi, L.C., Chen, H.M., Jiang, M., et al., 2019. CPGAVAS2, an integrated plastome sequence annotator and analyzer. Nucleic Acids Res., 47: W65-W73. DOI:10.1093/nar/gkz345 |
Smith, D.R., 2012. Updating our wiew of organelle genome nucleotide landscape. Front. Genet., 3: 175. DOI:10.5040/9781350275911.ch-012 |
Smith, D.R., Keeling, P.J., 2015. Mitochondrial and plastid genome architecture: reoccurring themes, but significant differences at the extremes. Proc. Natl. Acad. Sci. U.S.A., 112: 10177-10184. DOI:10.1073/pnas.1422049112 |
Talavera, G., Castresana, J., 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol., 56: 564-577. DOI:10.1080/10635150701472164 |
Tyszka, A.S., Bretz, E.C., Robertson, H.M., et al., 2023. Characterizing conflict and congruence of molecular evolution across organellar genome sequences for phylogenetics in land plants. Front. Plant Sci., 14: 1125107. DOI:10.3389/fpls.2023.1125107 |
Wang, H.C., Wang, D.Y., Shao, B.Y., et al., 2025. Unequally abundant chromosomes and unusual collections of transferred sequences characterize mitochondrial genomes of Gastrodia (Orchidaceae), one of the largest Mycoheterotrophic plant genera. Mol. Biol. Evol., 42: msaf082. DOI:10.1093/molbev/msaf082 |
Wang, H.Y., Zhang, W., Yu, Y.N., et al., 2024a. Biased gene introgression and adaptation in the face of chloroplast capture in Aquilegia amurensis. Syst. Biol., 73: 886-900. DOI:10.1093/sysbio/syae039 |
Wang, J., Kan, S.L., Liao, X.Z., et al., 2024b. Plant organellar genomes: much done, much more to do. Trends Plant Sci., 29: 754-769. DOI:10.1016/j.tplants.2023.12.014 |
Wang, J., Zou, Y., Mower, J.P., et al., 2024c. Rethinking the mutation hypotheses of plant organellar DNA. Genom. Commun. Now., 1: e003. |
Wick, R.R., Schultz, M.B., Zobel, J., et al., 2015. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31: 3350-3352. DOI:10.1093/bioinformatics/btv383 |
Wickett, N.J., Mirarab, S., Nam, N., et al., 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc. Natl. Acad. Sci. U.S.A., 111: E4859-E4868. |
Wickham, H., 2016. Data Analysis, ggplot2: Elegant Graphics for Data Analysis. Springer, pp. 189–201.
|
Wu, Z.Q., Tembrock, L.R., Ge, S., 2015. Are differences in genomic data sets due to true biological variants or errors in genome assembly: an example from two chloroplast genomes. PLoS One, 10: e0118019. DOI:10.1371/journal.pone.0118019 |
Wu, Z.Q., Waneka, G., Broz, A.K., et al., 2020. MSH1 is required for maintenance of the low mutation rates in plant mitochondrial and plastid genomes. Proc. Natl. Acad. Sci. U.S.A., 117: 16448-16455. DOI:10.1073/pnas.2001998117 |
Wynn, E.L., Christensen, A.C., 2019. Repeats of unusual size in plant mitochondrial genomes: identification, incidence and evolution. G3: Genes Genomes Genet., 9: 549-559. DOI:10.1534/g3.118.200948 |
Xue, J.Y., Dong, S.S., Wang, M.Q., et al., 2022. Mitochondrial genes from 18 angiosperms fill sampling gaps for phylogenomic inferences of the early diversification of flowering plants. J. Syst. Evol., 60: 773-788. DOI:10.1111/jse.12708 |
Yang, L.X., Su, D.Y., Chang, X., et al., 2020a. Phylogenomic insights into deep phylogeny of angiosperms based on broad nuclear gene sampling. Plant Commun., 1: 100027. DOI:10.1016/j.xplc.2020.100027 |
Yang, Y.Z., Sun, P.C., Lv, L., et al., 2020b. Prickly waterlily and rigid hornwort genomes shed light on early angiosperm evolution. Nat. Plants, 6: 215-222. DOI:10.1038/s41477-020-0594-6 |
Yu, G.C., Smith, D.K., Zhu, H.C., et al., 2016. GGTREE: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol., 8: 28-36. |
Yu, R.X., Chen, X.D., Long, L.J., et al., 2023. De novo assembly and comparative analyses of mitochondrial genomes in Piperales. Genome Biol. Evol., 15: evad041. DOI:10.1093/gbe/evad041 |
Yu, R.X., Liu, L.M., Jost, M., et al., 2025. Evolution of mitochondrial RNA editing sites and stop codon-lacking transcripts in angiosperms. Commun. Biol., 8: 977. DOI:10.1038/s42003-025-08418-9 |
Zhang, D., Gao, F.L., Jakovlic, I., et al., 2020. PhyloSuite: an integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies. Mol. Ecol. Resour., 20: 348-355. DOI:10.1111/1755-0998.13096 |
Zhang, G.J., Ma, H., 2024. Nuclear phylogenomics of angiosperms and insights into their relationships and evolution. J. Integr. Plant Biol., 66: 546-578. DOI:10.1111/jipb.13609 |
Zhang, X.Y., Chen, H.M., Ni, Y., et al., 2024. Plant mitochondrial genome map (PMGmap): a software tool for the comprehensive visualization of coding, noncoding and genome features of plant mitochondrial genomes. Mol. Ecol. Resour., 24: e13952. DOI:10.1111/1755-0998.13952 |
Zuntini, A.R., Carruthers, T., Maurin, O., et al., 2024. Phylogenomics and the rise of the angiosperms. Nature, 629: 843-850. DOI:10.1038/s41586-024-07324-0 |


