Genomic diversity at the Glu-D1 locus in Aegilops tauschii reveals the origin of elite high-molecular-weight glutenin genes in bread wheat (Triticum aestivum)

Citation

Wanxin Xu, Shuaifeng Geng, Shaoshuai Liu, Andrea González-Muñoz, Yifan Liu, Zhongyin Deng, Yuqing Che, Dada Cui, Xinyu Zou, Ziying Wang, Xiang Wang, Daowen Wang, Dongcheng Liu, Yun Zhou, Dengcai Liu, Maria Itria Ibba, Brande B.H. Wulff, Aili Li, Long Mao. Genomic diversity at the Glu-D1 locus in Aegilops tauschii reveals the origin of elite high-molecular-weight glutenin genes in bread wheat (Triticum aestivum)[J]. Plant Diversity, 2026, 48(3): 557-566.

Genomic diversity at the Glu-D1 locus in Aegilops tauschii reveals the origin of elite high-molecular-weight glutenin genes in bread wheat (Triticum aestivum)

Wanxin Xu (徐婉欣)^a^,1, Shuaifeng Geng (耿帅锋)^a^,1, Shaoshuai Liu (刘少帅)^a^,1, Andrea González-Muñoz^b, Yifan Liu (刘一凡)^c, Zhongyin Deng (邓中印)^a, Yuqing Che (车育青)^a, Dada Cui (崔达达)^a, Xinyu Zou (邹新宇)^a, Ziying Wang (王梓颖)^a, Xiang Wang (王翔)^c, Daowen Wang (王道文)^c, Dongcheng Liu (刘冬成)^d, Yun Zhou (周云)^e, Dengcai Liu (刘登才)^f, Maria Itria Ibba^g, Brande B.H. Wulff^b, Aili Li (李爱丽)^a^,*, Long Mao (毛龙)^a^,**

a. State Key Laboratory of Crop Gene Resources and Breeding, National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, China;
b. King Abdullah University of Science and Technology (KAUST), Plant Science Program, Biological and Environmental Science and Engineering Division (BESE), Thuwal, Saudi Arabia;
c. College of Agronomy, Longzi Lake Campus, Henan Agricultural University, Zhengzhou, China;
d. State Key Laboratory of North China Crop Improvement and Regulation, College of Agronomy, Hebei Agricultural University, Baoding 071001, China;
e. State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, Henan University, Kaifeng, China;
f. Triticeae Research Institute, Sichuan Agricultural University, Sichuan, 611130, China;
g. Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), DF, Mexico

Received 10 March 2025; Received in revised form 4 June 2025; Accepted 5 June 2025; Available online 11 June 2025

Corresponding author: E-mail address: liaili@caas.cn (A. Li);
E-mail address: maolong@caas.cn (L. Mao).
Peer review under responsibility of Editorial Office of Plant Diversity.

^* Corresponding author.
^** Corresponding author.
¹ These authors contributed equally to this work.

Abstract: The Glu-D1 locus of the Aegilops tauschii genome carries two high-molecular-weight glutenin genes, Dx and Dy, that are essential for viscoelastic properties of bread wheat dough, contributing to its success as a global staple crop. Here, we examined 48 Ae. tauschii high-quality genome assembles and identified a large dataset of Glu-D1, a locus that has remained recalcitrant to high-resolution analysis due to its genomic complexity. Phylogenetic analysis supported six major clades, slightly differing from the geography-based classification. Despite a short genomic distance, gene-based haplotype analysis detected possible ancestral recombination between Dx and Dy genes that were separated by distinctive repetitive sequences. Biochemically, glutenins of the same length can vary in isoelectric points, causing deviations in migration on traditional SDS-PAGE gels. Differential selection pressures were detected among clades and between Dx and Dy glutenin genes. Two clades, L2E-1 and L2W-2, with relatively lower coeliac motifs, were identified as the most probable ancestral contributors to bread wheat. Furthermore, key amino acids were identified as conceptually suitable for single-base editing to create novel elite alleles. Dissecting genomic diversity of the Glu-D1 loci deepens our understanding of the evolutionary trajectory of these long-studied seed storage proteins and offers new strategies for wheat grain-quality improvement.

Keywords: HMW glutenin Goat grass Grain quality Dx5+Dy10

1. Introduction

The hybridization of the wild goat grass Aegilops tauschii (DD, 2x = 14) with the cultivated emmer wheat (Triticum dicoccum, AABB, 4x = 28) created hexaploid bread wheat (Triticum aestivum, AABBDD, 6x = 42). This eventually became a dominant staple crop, providing essential calories and proteins for the mankind. However, the domestication bottleneck resulted in limited genetic diversity being utilized in modern wheat, despite the vast array of germplasm available for further exploration.

Due to its pivotal role in the evolution of bread wheat, Aegilops tauschii has been the subject of extensive study. Recently, large-scale re-sequencing of more than 900 Ae. tauschii accessions confirmed the existence of three major lineages, L1, L2, and L3 (Gaurav et al., 2022; Li et al., 2022; Cavalet-Giorsa et al., 2024). Among them, L2 accessions were identified as the major contributors of the D subgenome of bread wheat (Wang et al., 2013; Zhou et al., 2021; Cavalet-Giorsa et al., 2024). Geographically, L1 accessions are distributed across the native range of Ae. tauschii, while those of L2 are primarily found near the Caspian Sea in Iran and Azerbaijan (Singh et al., 2019). L2 accessions can be additionally classified into eastern (L2E) and western (L2W) sublineages that can be further differentiated into L2E-1, L2E-2, L2W-1, and L2W-2, respectively (Mizuno et al., 2010). The L3 lineage, which originated from a small area near present-day Tbilisi in Georgia, also contributed to the genetic blueprint of bread wheat (Gaurav et al., 2022).

The incorporation of high-molecular-weight (HMW) glutenin proteins from the Glu-D1 locus of Aegilops tauschii was pivotal to revolutionize the end-use quality of bread wheat flour by imparting unique elasticity and viscosity to wheat dough, ideal for leavened bread (Shewry, 2019). These unique physico-biochemical properties of wheat dough have established bread wheat as a staple food cherished by diverse cultures worldwide (Delorean et al., 2021). At the genomic level, the Glu-D1 locus comprises two HMW glutenin subunit genes, named Dx and Dy, which are separated by an intergenic region of ~55 kilobases (kb) enriched with repetitive sequences (Gu et al., 2006). The HMW glutenin proteins themselves contain repeated motifs in the middle. These complicate the cloning and sequencing of Glu-D1 genomic regions using conventional short-read sequencing technologies. Thus, to date, only a limited number of full-length glutenin sequences are available in GenBank (Bromilow et al., 2017; Shewry, 2019).

HMW glutenins are usually characterized according to their migration patterns using sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE). For bread wheat, the two most common alleles at the Glu-D1 locus are Glu-D1a and Glu-D1d (Rooke et al., 1999; Delorean et al., 2021). The SDS-PAGE allele designation of Glu-D1a is Dx2 + Dy12 (for short, 2 + 12) while that of Glu-D1d is Dx5 + Dy10 (or 5 + 10); of these, the 5 + 10 allele provides a better baking quality due to an extra cysteine in the Dx glutenin (Payne et al., 1980). To date, the method of Glu-D1 allele discrimination has mostly relied on SDS-PAGE separation of their proteins, without insight into the underlying biochemical properties, amino acid composition, or origins. In other words, despite their pivotal roles in end-product quality, a comprehensive understanding of the biochemical nature of HMW glutenins and the evolutionary diversity of the Glu-D1 locus at the population level remains incomplete.

The Aegilops tauschii pangenome assembled from long sequence reads gives megabase (Mb) long genomic sequences that cover the complex Glu-D1 locus. These provide an opportunity to collect full-length HMW glutenin genes from a representative set of accessions and to study the evolutionary relationship of these genes relative to those in modern wheat cultivars. To this end, we extracted the Glu-D1 locus genomic sequences from 48 high-quality genome assemblies of Ae. tauschii and studied their conservation and divergence as well as their relationship with those in bread wheat. Our work reveals new insights into the relationships among Ae. tauschii accessions through HMW glutenin genes, providing a basis for designing new strategies to expedite wheat grain quality improvement.

2. Material and methods 2.1. Acquisition of genomic sequencing data

Genome sequences of Aegilops tauschii accessions as described in (Cavalet-Giorsa et al., 2024) were downloaded and two additional accessions AS2386 and AS2399 were locally sequenced using the same platform. De novo genome anchored was carried out using RagTag (v.2.1.0) using the AL8/78 genome as a reference. The final anchored genomes of AS2386 and AS2399 have a total size of ~4.1 Gb, with a scaffold N50 value of 93, 808, 092 bp and 195, 811, 519 bp respectively. The completeness of the assembly was assessed using BUSCO (Benchmarking Universal Single-Copy Orthologs), which indicated a completeness score of 92.0% and 94.5%. Using the makeblastdb function of the Blast + software, we established a database for each assembly (Camacho et al., 2009). The GenBank accession ncbi-p:CAA27052.1 (Dx) and the sequence reported for Dy12 (Yang et al., 2023), named Yao-Dy12, were used respectively as query sequences for tBLASTn against the local databases, with parameters set to e-value ≤ 1e-5 and identity ≥ 80%. Additionally, Dx and Dy sequences of six representative wheat varieties (JM22, XY6, Abo, MZM, CM42, and ZM366) from the recent wheat pan-genome work (Jiao et al., 2025) were extracted using similar methods and included in the analysis (Table S2; Table S3). This allowed us to better determine the phylogenetic positions of Ae. tauschii glutenin genes during wheat domestication and evolution. Custom Perl scripts were developed to extract the genomic sequences from pangenome assemblies.

2.2. Phylogenetic analysis

Glutenin protein sequences were aligned using ClustalX (Larkin et al., 2007) and were imported into MEGA 11, where phylogenetic trees were constructed using the Neighbor-Joining (NJ) method with 1000 bootstrap replicates and 50% cut-off threshold. The branching patterns of the phylogenetic tree were used for classification of glutenin proteins.

2.3. Haplotype analysis

Haplotype analysis of Dx and Dy genes and their 5-kb upstream and downstream sequences was performed by firstly converting FASTA format to FASTQ format. BWA (v0.7.18) was employed for sequence assemby using AL8/78 as a reference genome to establish an index with BWA and SAM files (Li and Durbin, 2009). After converting the SAM files to BAM files using SAMtools (v.1.19.2), VCF files were generated using VCFtools individually and then merged to obtain a total VCF file (Li et al., 2009; Danecek et al., 2011). Bcftools was used to filter out sites with DP < 10, and plink was employed for further filtering, retaining SNPs with a genotype missing rate below 20% and a minor allele frequency (MAF) above 5%. Haplotype analysis was conducted using the GeneHapR package in R, and results of the haplotype analysis were visualized using the ggplot2 function in R (Zhang et al., 2023).

2.4. HMW gene and protein feature analysis

To investigate the characteristics of sequences within each clade, length, number, and amino acid composition of glutenin proteins were calculated for each accession. pI values were calculated using the Bio: : Tools: : pICalculator module from Bioperl. Data were visualized using the ggplot2 function in R language (Wickham, 2016). SOPMA was used for protein secondary structure prediction. A custom Perl script was employed to calculate the positions and counts of the secondary structures of the amino acid sequences of Dx and Dy proteins. Results were visualized using the gggenes package in R (Geourjon and Deléage, 1995; Zhang et al., 2023).

2.5. Selection pressure analysis

Ka/Ks values between pairs of sequences within each clade were calculated using the Simple Ka/Ks Calculator in TBtools v.2.119 (Chen et al., 2020). Fst values for each clade were computed using the default parameters of VCFtools. The Pi values were calculated using the Compute Pairwise Distances feature in the DISTANCE module of MEGA 11, applying the Jukes-Cantor model (Danecek et al., 2011).

2.6. Repeat sequence annotation

RepeatMasker v.4.0.2 (Tarailo-Graovac and Chen, 2009) was used to analyze the repeat sequences in the Aegilops tauschii Glu-D1 region including 10-kb upstream and downstream sequences from all accessions. Sequence alignment was performed using rmblastn from Rmblast 2.14.0, with the repeat sequence database designated as Triticum. A custom Perl script was employed to classify and organize the output files generated by RepeatMasker according to the categories of repeat sequences. Synteny was visualized using NGenomeSyn 1.41 (He et al., 2023).

2.7. Analysis of celiac disease motifs

Celiac disease immunogenic sequences were downloaded from the celiac disease database (http://www.allergenonline.org/celiacbrowse.shtml). Additional HMW glutenins were downloaded from GenBank using key words "HMW glutenin" and only sequences close to the lengths of Aegilops tauschii glutenin proteins were retained with redundancy removed. A total of 44 Dx and Dy proteins were obtained, among which 17 Dx and 14 Dy HMW subunits were annotated as from Triticum aestivum (Tae), with lengths close to the average length of Ae. tauschii Dx (> 800 aa) and Dy (> 620 aa) proteins. The remaining was from Ae. tauschii, four Dx subunits of the L2 lineage and, among nine Dy subunits, three belonging to L1 and six belonging to L2 accessions (Table S3). A custom Perl script was used to count the number of celiac disease motif sequences matched in the Dx and Dy sequences.

2.8. Cysteine and CRISPR/Cas9 editing site selection

Cysteines provide disulfide bonds in the gluten network, which helps enhance gluten strength. A custom Perl script was used to count cysteine sites and their occurrences in HWM glutenins. Based on the principle of single-base gene editing (Zong et al., 2018; Ni et al., 2023), we identified sites in the 20 bp upstream of the PAM site (NGG) that could be converted to cysteine through single-base substitution mutations. The results obtained were visualized using the trackViewer package in R (Ou and Zhu, 2019).

3. Results 3.1. Genomics sequences of Glu-D1 loci in the Aegilops tauschii pangenome

Previously, 46 Ae. tauschii accessions out of 920 surveyed were longread-sequenced to generate a pangenome-like dataset that represented 72.5% of the diversity in the panel (Cavalet-Giorsa et al., 2024), including 11 L1, 34 L2, and one L3 (Table S1). Sequences of two additional L2 accessions, AS2386 and AS2399, were included in the present HMW glutenin study. We extracted 48 genomic sequences corresponding to the Glu-D1 locus, anchored by single-copy genes as reported by Gu et al. (2006) and identified 48 pairs of Dx and Dy glutenins (Table S2). Interestingly, all HMW glutenin genes were single-exon genes, with average lengths of 844 ± 13 amino acids (aa) for Dx and 649 ± 13 aa for Dy glutenins (Table S3).

3.2. Refining Aegilops tauschii sublineages using HMW glutenin sequences

Previously, the L1 lineage was subdivided into L1E and L1W corresponding to their respective eastern and western origins, while the eastern and western L2E and L2W could be further divided into subclasses L2E-1 (southwestern Caspian Sea), L2E-2 (southeastern Caspian Sea), L2W-1 (Caucasus), and L2W-2 (Turkmenistan and northern Iran). Among these, the L2E population from the southern Caspian Sea was considered the most likely donor of the bread wheat D subgenome (Wang et al., 2013; Zhou et al., 2021). Such a classification was largely confirmed by whole genome resequencing data (Gaurav et al., 2022) and the Aegilops tauschii pangenome (Cavalet-Giorsa et al., 2024).

We developed phylogenetic trees using glutenin protein sequences. As shown in Fig. 1, using a NCBI-derived B subgenome sequence (ABY59654.1) as the outgroup, Dx proteins can be divided into seven sublineages corresponding to the major classification of Aegilops tauschii, i.e. two L1 subclasses, four L2 subclasses and one L3. However, each clade appeared to contain members from other sublineages defined by their geographical locations with high bootstrap support. We therefore added "clade" before their original naming for the new groups. For example, clade L2E-1 contained 10 members including six originally designated as L2E-1 accessions. It also contained two accessions originally designated as L2E-2, and two originally designated as admixed accessions. Our phylogenetic tree showed that clade L2E-2 appeared to be more complicated, comprising only three accessions originally designated as L2E-2 accessions, with the rest comprising two L2E-1, three L2W-1, and one admixed accession. The L3 accession TA2576 stood by itself as a separate clade.

Fig. 1 Neighbor-joining phylogenetic tree developed using Dx glutenin protein sequences. Bx glutenin from GenBank accession ABY59654.1 was used as an outgroup. Original geography-based sublineage classification of accessions are indicated in the gene names. Accessions are marked with a black dot when their classifications are consistent with previous ones. Bootstraps greater than 50% are shown.

Figure options

The new phylogenetic relationship of Aegilops tauschii accessions defined by Dy sequences were largely consistent with that defined by Dx glutenin phylogenetic analysis, except for five accessions that changed their groupings (Fig. S1). For the Dy phylogenetic tree, the two clades L2E-1 and L2E-2 remained the same. However, the L3 accession TA2576 appeared to be grouped with accessions of clade L2E-2 with high bootstrap support. For the two L2W sub-lineages, clade L2W-1 received a new member, TA1668, that was classified as a member of clade L1E in the Dx-based tree, whereas clade L2W-2 received three members of Dx clade L2W-1, i.e. TA1675, TA2564, and TA10946 (Table S3). The discrepancy in the phylogenetic relationship of Ae. tauschii accessions as demonstrated by the two HMW glutenins may indicate different evolutionary paths between the two glutenin genes, or genetic recombination among geographically separated sublineages despite their relatively short genomic distances.

3.3. Haplotype analysis of the Dx and Dy genic regions

To address Dx and Dy glutenin phylogenetic discrepancy, we performed gene-based haplotype analysis using genic regions of the two genes including 5-kb of each of their 5' and 3' sequences among 48 accessions (Table S4). Network relationship analysis identified a total of 11 haplotypes for Dx genic regions and 12 for Dy genic regions among 48 pangenome accessions (Fig. S2; Fig. S3). A trace of haploid type resources found different numbers of haplotypes for corresponding clades of Dx and Dy. For instance, nine accessions from clade L2E-2 gave only one haplotype for Dx genic regions (Fig. S3). In contrast, Dy genic regions of 10 accessions provided three haplotypes. For closely related accessions, 5' and 3' non-coding regions often provided characteristic SNPs. Dx of L1E accessions gave four haplotypes while the most divergent Dy regions appeared in clades L2E-2 and L2W-1, both of which gave three haplotypes. Thus, haplotype analysis demonstrated differential divergence at Dx and Dy genic regions.

We then compared the haplotypes of each pair of Dx and Dy glutenin genes in a given accession and found that five Dx-Dy pairs may be derived from ancestral recombination between different sublineages. We found that the Dx-Dy pair in TA1668 (an L2W-1 accession) was derived from the two far related lineages, L1 and L2, with Dx from the former and Dy from the latter (Fig. 2a), consistent with that reported by Delorean et al. (2021) using whole genome SNP data. Three new Dx-Dy recombination pairs were found in TA1675, TA2564, and TA10946 within closely related clades L2W-1 and L2W-2 (Fig. 2b and c). The single L3 accession TA2576 had one Dx haplotype L3-x2 and one Dy haplotype L3-y4 (Fig. S3). Network relationship showed that L3-y4 was most closely related to L2E-2 haplotypes, whereas L3-x2 was nearly equal distance to those of L1E, L2w-2, and L2E-1. Such a pattern suggests that L2E-2-y2 could be the L3-y4 ancestral donor, while the origin of the L3-x2 remained unclear (Fig. 2d). The results from haplotype analysis are consistent with those from phylogenetic analysis using protein sequences.

Fig. 2 Haplotype analyses of recombinant Glu-D1 regions. Five accessions were found to be derived from ancestral recombination events between Dx and Dy genes, supporting their phylogenetic categorization. Rectangles indicate the source of Dx or Dy haplotypes that form the recombinant ones (middle) as indicated. SNPs in the coding region plus 5 kb up- and down-stream were used.

Figure options

3.4. SDS-PAGE patterns of HMW glutenins

For the Glu-D1 locus, HMW glutenin genes were previously found to encode two main subunit combinations, i.e. Dx2+Dy12 and Dx5+Dy10, which are identified by their SDS-PAGE patterns to be closely related to the final grain quality in bread wheat. The genotype corresponding to Dx2+Dy12 represents the major HMW glutenin allele combination, whereas the one for Dx5+Dy10 represents a rare elite allele providing better baking quality for bread wheat due to the presence of an extra cysteine (Delorean et al., 2021).

As mentioned earlier, the average lengths of the Dx and Dy glutenins were 844 ± 7 and 649 ± 13 aa respectively in Aegilops tauschii (Table S3). Traditionally, HMW glutenins are classified by their migration rates on SDS-PAGE gels, which are affected by both the molecular weight and electronic charges of their proteins. However, our full-length glutenin protein sequences derived from the high-quality genome assemblies showed that their length and annotation did not largely correspond to the SDS annotation. Table 1 shows previously identified SDS migration patterns of Ae. tauschii glutenins and their real protein sequences. We observed discrepancies for several glutenins. For example, among five Dx subunits annotated as pattern "2" or Dx2, four were 845-aa long and one was 12-aa shorter (833 aa). For Dy subunits, among eight annotated as patterns 10, 10.1, or 10.2 on SDS-PAGE gels, two of them, from TA1675 and TA10124 of clades L2W-1 and L1W, respectively, were both 658 aa long, whereas the remaining five varied in length, ranging from 648 aa to 681 aa.

Table 1 Protein lengths and SDS-PAGE gel annotation of Aegilops tauschii HMW glutenins.

Accession (synonym)	Subunit	Dx-length (aa)	Dx-PI	Dy-length (aa)	Dy-PI
TA2576	1.5 + 10.2	851	6.91	681	7.84
TA10105 (TOWWC202)	1^t + 12	845	5.04	660	7.49
TA1577 (TOWWC242)	2 + 10	845	5.66	654	7.36
TA1675	2 + 10.1	833	6.11	658	7.59
TA1668 (TOWWC112)	2 + 10.2	845	5.66	685	7.59
TA2451 (TOWWC142)	2 + 12	845	6.51	624	7.83
TA1592 (TOWWC243)	4 + 10	845	6.01	660	7.59
TA2529 (TOWWC178)	4 + 10	854	6.91	648	7.36
TA10124 (TOWWC050)	4 + 10.2	827	5.66	658	7.59
TA1662 (TOWWC107)	4 + 12.1∗	845	6.91	648	7.59

Table options

Except for accessions with ancestral recombination between Dx and Dy, the majority of glutenins of the new clades had similar protein sequence lengths (Fig. 3 and Table S3). Accessions in clade L2W-1 appeared to be more diverse. This clade also contained most of the recombined accessions and, even for those originally annotated as members of L2W-1, there were three types of lengths for Dx and Dy glutenins, respectively, indicating that accessions in this clade were genetically diversified.

Fig. 3 Physico-biochemical features of Aegilops tauschii glutenins. (a) Protein sequence lengths in number of amino acids (aa). (b) Isoelectric points (pI) of Dx and Dy proteins as indicated. Circles represent Dy, triangles represent Dx, and hollow shapes represent subunits from accessions with ancestral recombination between Dx and Dy genes.

Figure options

For isoelectric point (pI) values, it appears that similar protein lengths of clade L2E-2 accessions gave quite similar pI values. Clade L2E-1 accessions also had identical lengths for Dx and Dy proteins, but Dx glutenins had two types of pI values (Fig. 3, bottom panel). In comparison, clade L1 accessions seemed to be more diversified in pI values for both Dx and Dy glutenins. The pI values of all Dy proteins were above 7, indicating their potentially basic nature, while all Dx proteins, but one from TA10171, exhibited pI values below 7, indicating their potentially acidic nature. For TA10171, there was an eight-amino-acid insertion immediately after the 454th amino acid, resulting in a pI value of 6.8. The two acidic amino acids in the insertion may be responsible for the unique alteration in electronic charge of the Dy glutenin in TA10171, demonstrating ongoing evolution of HMW glutenins. The differences in protein sequence lengths and electronic charges among HMW glutenins provide useful information for refining wheat grain quality in the future.

3.5. Additional biochemical features of Aegilops tauschii HMW glutenin proteins

Among the four major classes of amino acids, Dx proteins in all clades carried significantly more hydrophobic and hydrophilic amino acids than acidic and basic amino acids on average (Fig. 4a). In contrast, basic amino acids occupied a higher proportion among total amino acids in Dy subunits (Fig. 4b), consistent with lower and acidic pIs (pH < 7) of Dx glutenins and higher and basic pIs (pH > 7) of Dy glutenins as shown in Fig. 3.

Fig. 4 Amino acid composition and coeliac motif content of Aegilops tauschii glutenins. (a, b) Four major amino acid classes in HMW Dx (a) and Dy (b) glutenins in six sublineages. (c, d) coeliac motifs in Dx (c) and Dy (d) glutenins from Ae. tauschii and bread wheat. Wheat glutenins are designated as group "Tae". The error bars represent the standard deviation. Different letters at top of each bar indicate a significant difference at P < 0.05 determined by Tukey's multiple comparisons test.

Figure options

As expected, glycine (G), proline (P), and glutamine (Q) were three major amino acids (aa) for both Dx and Dy, representing on average 67.6% and 59.8% of total amino acids respectively for the two types of glutenins (Table S3). In wheat flour, cysteines provide sulfhydryl (-SH) groups as branching sites for gluten subunit networking and dough quality. Although Dy glutenins were much shorter than Dx glutenins, all Dy glutenins contained seven cysteines (C), while all Dx glutenins contained four cysteines (Table S3). Therefore, Dy glutenins provide more amino acids for cross linking between gluten subunits than Dx glutenins for the formation of the gluten network during bread making. No extra –SH groups were found in bread wheat, except for Dx5 glutenins (Delorean et al., 2021).

We then examined the secondary structures of Aegilops tauschii HMW glutenins. As shown in Fig. S4, both Dx and Dy contained α-helices at the two ends of the proteins. Beta-sheets appeared at the C-terminal of Dx glutenins (Fig. S4a), whereas some were observed in the middle of Dy glutenins, displaying clade-specific patterns (Fig. S4b). Most of the remaining regions contained random coils.

HMW glutenins occupy more than 10% of wheat grain proteins and are sources of immunogenic epitopes and toxic motifs causing coeliac disease in a sub-population of humans (Shewry, 2019). The epitopes and motifs are located especially at the middle repetitive regions of the protein. Using more than 1000 motifs reported in the Coeliac Database (Bromilow et al., 2017), we identified on average 86 immunogenic epitopes or toxic motifs on Dx glutenins and 50 on Dy glutenins in Aegilops tauschii (Fig. 4c and d; Table S3). Among them, Dx glutenins in clade L2E-1 carried the smallest number of coeliac disease-associated motifs (CD motifs), whereas for Dy glutenins clade L2E-2 contained the least motifs, coinciding with these two sublineages being the most probable D genome donors for bread wheat. Searching of wheat Dx and Dy glutenins showed that they also contained low CD motifs, comparable to those in L2-lineage Ae. tauschii accessions, consistent with their close phylogenetic relationship.

3.6. Genomic features of the Glu-D1 loci in Aegilops tauschii

We then examined the transposable element (TE) distribution in the Glu-D1 region. As shown in Fig. 5, the genomic regions were variable in length with up to 80-kb between Dx and Dy genes for L2E-1 accessions and as low as 50-kb in L2E-2. We annotated the sequences using RepeatMasker and identified that TEs occupied 42%–55% in the genomic regions between the Dx and Dy genes, well below the total repeat content of the genome. Among TEs, 64%–76% were Copia-like retrotransposons, whereas Gypsy-like retrotransposons occupied 22%–32% (Table S5). Both Copia- and Gypsy-TEs were clustered between Dx and Dy, but close to the Dy side. Each clade showed its own characteristic distribution patterns, especially for Copia-like retrotransposons. Some clades such as L2E-1 had multiple TE patterns across accessions, suggesting further diversification within sublineages, while others such as L2E-2 were quite consistent. Such distribution of TEs largely supports our new classification of Ae. tauschii accessions using HMW glutenins.

Fig. 5 Repetitive sequence distribution at the Glu-D1 region. Sequences are grouped according to their phylogenetic clades. Accessions with Dx–Dy recombination are shown in blue. Copia-like transposable elements (TEs) are indicated in yellow, whereas blue indicates Gypsy-like TEs. DNA TEs are highlighted in gray. The blue connecting line indicates the location of the Dx gene, while the location of Dy is indicated by the green connecting line. Low-copy anchor genes in the region: G1, protein kinase; G2, transcriptase zinc-binding domain-containing protein; G3, globin; G4, serine/threonine protein kinase, as reported by Gu et al. (2006).

Figure options

Calculation of fixation index (Fst) showed a range of 0.17–1 for Dx genes, whereas for Dy genes the range was 0.62–1 (Fig. S5), indicating that Dy genes are more diverged than those of Dx, an observation correlating with results from other diversity measures. Despite previous studies reporting an overall greater genetic diversity amongst accessions belonging to L2 compared to L1 (Zhou et al., 2021; Gaurav et al., 2022), we found that the Glu1-1D genetic diversity (π) of L1 clades was much higher than that of L2, whereas significant differences were observed for the two sub-lineages of L2 clades (Fig. S5). This observation is consistent with selection pressure inside each clade. We then calculated pairwise Ka/Ks values for glutenin sequences in each subclade. As shown in Fig. S6, asymmetrical selection on Aegilops tauschii Dx and Dy glutenin genes can be observed for a number of clades, with L2E-1 and L2W-1 having wider Ka/Ks distribution and L2W-2 and L1E having an opposite pattern, showing to some extent positive selection, while the remaining appeared to be negatively selected. For bread wheat, both Dx and Dy genes appeared to be positively selected with wider Ka/Ks distributions, particularly those of Dx. These observations indicate that selection for Ae. tauschii glutenins genes may be adaptive to ecological pressure, whereas for bread wheat stronger human interference may play a larger role in selecting grain quality for various end-products.

3.7. The origin of bread wheat HMW glutenin genes

As mentioned earlier, bread wheat HMW glutenins can be classified into Dx2+Dy12 and Dx5+Dy10 alleles, with the latter being considered as elite alleles due to an extra cysteine in the Dx glutenin that gives better baking quality (Payne et al., 1980). Phylogenetic analysis of bread wheat and Aegilops tauschii HMW glutenins showed that wheat Dx5 glutenins grouped with clade L2E-1, with the Dx from the L3 accession TA2576 at the base, suggesting possible origin of this allele from the L3 lineage (Fig. 6). In contrast, Dx2 glutenins were clustered with those from L2W-2 (Fig. 6). There was only one bread wheat glutenin that grouped with clade L2W-1. For Dy glutenins, the close relationship between L3 and L2E-2 was supported with high bootstraps (75%) (Fig. S7), suggesting non-coordinated evolution between Dx and Dy glutenin genes.

Fig. 6 Origin of elite HMW Dx glutenin gene alleles. (a) Phylogenetic tree developed using Dx protein sequences of both Ae. tauschii and bread wheat (indicated by green dots). (b) Multiple sequence alignment of Dx5 glutenins and Dx2 glutenins. The cysteine codon at 352–354 bp (C¹¹⁸) in Dx5 sequences is marked in green. A codon at 298–300 bp was identified in the CS Dx2 allele to be suitable for single-base gene editing to produce a C to T transition with nearby Protospacer Adjacent Motif (PAM) sequence required for Cas9 proteins. Accordingly, an arginine (R¹⁰⁰) will be converted into a cysteine (C¹⁰⁰), resulting in the conversion of a Dx2 allele (top, four cysteines) into a Dx5-like elite allele with an extra C (bottom, five cysteines).

Figure options

The so-called elite HMW glutenin combination Dx5+Dy10 features an extra cysteine (C) at around 118 aa in Dx5 proteins at the position of a serine (S) in Dx2 proteins (Fig. 6b), which provides additional –SH for chain elongation in gluten networks hence improving dough quality. We checked the DNA sequences between non-Dx5 and Dx5 genes and found that a C to G transversion at 353 bp was responsible for the replacement of a serine (S¹¹⁸) with a cysteine (C¹¹⁸) (Fig. 6b). This raises the possibility of using CRISPR/Cas9-based single-base editing technology currently available for wheat codon modification (Zong et al., 2018; Lin et al., 2020) for targeted improvement of glutenin. As shown in Fig. 6b, we identified a codon at 298 bp of the Chinese Spring Dx2 gene that encoded an arginine (R¹⁰⁰), which could be edited using nicked Cas9-based single-base editing by virtue of a nearby PAM (Protospacer Adjacent Motif) required for Cas9 functions. Thus, our HMW glutenin evolutionary analysis provided gene editing targets for improving bread wheat grain quality directly in elite, high-yielding varieties.

4. Discussion 4.1. D genome glutenins are essential to make bread wheat a staple crop

The D genome of Aegilops tauschii offers several advantages for hexaploid bread wheat, one of which is the ability to produce versatile end products that adapt to the dietary preferences of diverse ethnic groups around the world. The addition of Glu-D1 HMW glutenins allows more flexible HMW glutenin combinations in wheat, such as the consistent silencing of the A genome glutenin subunit (Ay) in bread wheat (Shewry, 2019, 2022; Li et al., 2022). For Ae. tauschii, both Dx and Dy HMW glutenins are intact and expressed in nearly all accessions studied so far, indicating indispensable roles of D-genome HMW glutenins for diploid wheat relatives.

It is interesting that Dy proteins were largely basic based on their pI values, whereas Dx proteins were acidic. Such a combination of basic Dx and acidic Dy glutenin subunits should give a close to neutral pH for proteins in the dough, making it more suitable for a variety of end-use purposes. On the other hand, by examining the number of epitopes associated with coeliac disease, we found that the founding accessions, particularly those from L2 sublineages, contained fewer such epitopes. Further comparison with bread wheat HMW glutenins revealed similar low CD motifs, consistent with the idea that Aegilops tauschii accessions from clades L2 are the most probable donors for wheat D-subgenomes. Our data suggest that the polyploid nature in hexaploid bread may provide relaxed conditions for wider genetic changes and allow for higher human selection pressure for low CD motifs and high grain quality. It is possible that, with emerging biotechnology, such as gene editing techniques, we may further remove or reduce these epitopes without significantly changing the biochemical features of HMW glutenins so as to alleviate the problems for those affected by coeliac disease.

Our analysis of protein sequence length and pI values showed that conventional SDS-PAGE is a relatively coarse method for estimating the biochemical features of HMW glutenins, especially for Dx glutenins, which exhibit greater variability in pI values and therefore more complex migration patterns on SDS-PAGE gels. The deeper insight provided by pangenome-derived full-length sequences underscore the importance of continuous, high-quality genomic data. Using this information, we identified full-length Dx and Dy glutenin proteins and their biochemical features, offering a more comprehensive understanding of these long-studied storage proteins that are essential for human life.

4.2. Re-examination of the Glu-D1 locus in wheat

Phylogenetic analysis using full-length HMW glutenin protein sequences revealed different structures to previous ones for sublineages of L1 and L2, while the phylogenetic status of the single L3 accession was not congruent from Dx and Dy analyses. More sequences may be needed to gain a conclusion comparable to that reported by Delorean et al. (2021). Similarly, Ka/Ks evaluation showed that Dx and Dy were under different selection restrictions, the significance of which may request further investigation. Our global analysis demonstrated that all Dx and Dy glutenin genes are single-exon genes, which are considered to evolve more rapidly than multi-exon genes and are often involved in environmental responses (Hu et al., 2023). This does not seem to be the case for HMW glutenin genes in Aegilops tauschii, which are relatively conserved in both sequence and function. We hypothesize that single-exon genes may be more efficient by bypassing the posttranscriptional splicing step, which could be advantageous for seed storage protein genes that are heavily transcribed during grain filling.

Moreover, the full-length sequence-based phylogenetic relationships were largely consistent with and were further confirmed by gene-based haplotype analysis, advancing the results of previous reports using short-read sequencing (Delorean et al., 2021). Our genomic structure analysis supported the phylogenetic trees based on various TE composition and distribution patterns, except for clade L2E-1 in which some accessions showed higher TEcontent than others. Such changes in TE composition and content may affect the chromatin structure, causing differential epigenetic modifications, especially during polyploidization to form bread wheat. It is notable that TEs were mostly distributed on the 3' side of the Dx and Dy HMW glutenin genes, which may have less effect on the expression of the genes through potential epigenetic modification.

4.3. Improving wheat grain quality by modifying HMW glutenin genes

Our data cannot precisely locate the potential origin of the so-called Dx5+Dy10 elite allele pairs in bread wheat since Dx5 and Dx10 glutenins gave different phylogenetic topologies when compared with those of Aegilops tauschii. Despite this, comparing Dx proteins annotated as Dx5 with those of Ae. tauschii and those non-Dx5 glutenins from bread wheat in GenBank identified the codon that led to the extra cysteine in Dx5. By analyzing codon composition, we identified targets for modification using CRISPR/Cas9-assisted single-base editing techniques, demonstrating an additional value of the pangenome resource. With technological advances, the precise codon modification corresponding to the original Dx5 glutenins can be created using the prime editing technique derived from the conventional CRISPR/Cas9 technology (Ni et al., 2023).

Since their discovery nearly 300 years ago (Beccari, 1745; Bailey, 1941), glutenins have been extensively studied for their unique roles in the production of various end products, including leavened bread, pasta, and noodles. They have also been widely studied due to their ability to cause adverse reactions in some individuals. Modern wheat breeding has pursued higher grain yield at the cost of grain protein content, a trend that needs attention in future breeding practices. Our pangenome-based analysis of glutenin genes in wild wheat relatives as well as those from bread wheat revealed the complex genomic composition and history of the Glu-D1 locus in Aegilops tauschii, offering new opportunities to utilize wild germplasm for wheat improvement. With the continued advancement of biotechnology, wheat varieties that combine higher yield with improved quality and low immune disease epitopes should become an increasingly viable possibility.

Acknowledgements

This work was supported by National Key Research and Development Program of China (2021YFF1000204), the National Science Foundation (W2411025), and CAAS Agricultural Science and Technology Innovation Program (CAAS-ZDRW202002).

CRediT authorship contribution statement

Wanxin Xu: Writing – original draft, Visualization, Software, Formal analysis, Data curation. Shuaifeng Geng: Validation, Conceptualization. Shaoshuai Liu: Validation, Supervision. Andrea González-Muñoz: Supervision. Yifan Liu: Validation. Zhongyin Deng: Validation. Yuqing Che: Validation, Software, Formal analysis. Dada Cui: Validation. Xinyu Zou: Validation. Wang Ziying: Validation, Software. Xiang Wang: Writing – review & editing, Supervision. Daowen Wang: Supervision. Dongcheng Liu: Validation. Yun Zhou: Supervision. Dengcai Liu: Supervision, Resources. Maria Itria Ibba: Validation, Resources. Brande B.H. Wulff: Writing – review & editing, Supervision. Aili Li: Writing – review & editing, Supervision. Long Mao: Writing – review & editing, Writing – original draft, Supervision, Conceptualization.

Data availability

Genomic sequences of the 48 Glu-D1 loci reported in this paper have been deposited in GenBase (Bu et al., 2024) hosted by the National Genomics Data Center (CNCB-NGDC Members and Partners, 2024), Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation (Bai et al., 2024), under the accession numbers C_AA104581.1 to C_AA104628.1 that are publicly accessible at https://ngdc.cncb.ac.cn/genbase.

Code availability

Data analysis and visualization scripts are available at GitHub (https://github.com/Xuwanxings/A-pangenome-view-of-the-high-molecular-weight-glutenin-genes-locus-in-Aegilops-tauschii.git). R codes used to generate figures are available at https://github.com/Xuwanxings/Glu-D1_Aegilops-tauschii-R/tree/main.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2025.06.002.

References

Bai, X., Bao, Y., Bei, S., et al., 2024. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic Acids Res., 52: D18-D32. DOI:10.1093/nar/gkad1078

Bailey, C.H., 1941. A translation of Beccari's "Concerning grain" 1728. Cereal Chem., 18: 555-561.

Beccari, 1745. De frumento. De Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii. 6.

Bromilow, S., Gethings, L.A., Buckley, M., et al., 2017. A curated gluten protein sequence database to support development of proteomics methods for determination of gluten in gluten-free foods. J. Proteomics, 163: 67-75. DOI:10.1016/j.jprot.2017.03.026

Bu, C., Zheng, X., Zhao, X., et al., 2024. GenBase: a nucleotide sequence database. Genom. Proteom. Bioinform., 22. DOI:10.1093/gpbjnl/qzae047

Camacho, C., Coulouris, G., Avagyan, V., et al., 2009. BLAST+: architecture and applications. BMC Bioinformatics, 10. DOI:10.1186/1471-2105-10-421

Cavalet-Giorsa, E., González-Muñoz, A., Athiyannan, N., et al., 2024. Origin and evolution of the bread wheat D genome. Nature, 633: 848-855. DOI:10.1038/s41586-024-07808-z

Chen, C., Chen, H., Zhang, Y., et al., 2020. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol. Plant, 13: 1194-1202. DOI:10.1016/j.molp.2020.06.009

Danecek, P., Auton, A., Abecasis, G., et al., 2011. The variant call format and VCFtools. Bioinformatics, 27: 2156-2158. DOI:10.1093/bioinformatics/btr330

Delorean, E., Gao, L., Lopez, J.F.C., et al., 2021. High molecular weight glutenin gene diversity in Aegilops tauschii demonstrates unique origin of superior wheat quality. Commun. Biol., 4. DOI:10.1038/s42003-021-02563-7

Gaurav, K., Arora, S., Silva, P., et al., 2022. Population genomic analysis of Aegilops tauschii identifies targets for bread wheat improvement. Nat. Biotechnol., 40: 422-431. DOI:10.1038/s41587-021-01058-4

Geourjon, C., Deléage, G., 1995. SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments. Bioinformatics, 11: 681-684. DOI:10.1093/bioinformatics/11.6.681

Gu, Y.Q., Salse, J.r. m., Coleman-Derr, D., et al., 2006. Types and rates of sequence evolution at the high-molecular-weight glutenin locus in hexaploid wheat and its ancestral genomes. Genetics, 174: 1493-1504. DOI:10.1534/genetics.106.060756

He, W., Yang, J., Jing, Y., et al., 2023. NGenomeSyn: an easy-to-use and flexible tool for publication-ready visualization of syntenic relationships across multiple genomes. Bioinformatics, 39. DOI:10.1093/bioinformatics/btad121

Hu, H., Dong, B., Fan, X., et al., 2023. Mutational bias and natural selection driving the synonymous codon usage of single-exon genes in rice (Oryza sativa L.). Rice, 16. DOI:10.1186/s12284-023-00627-2

Jiao, C., Xie, X., Hao, C., et al., 2025. Pan-genome bridges wheat structural variations with habitat and breeding. Nature, 637: 384-393. DOI:10.1038/s41586-024-08277-0

Larkin, M.A., Blackshields, G., Brown, N.P., et al., 2007. Clustal W and clustal X version 2.0. Bioinformatics, 23: 2947-2948. DOI:10.1093/bioinformatics/btm404

Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25: 1754-1760. DOI:10.1093/bioinformatics/btp324

Li, H., Handsaker, B., Wysoker, A., et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25: 2078-2079. DOI:10.1093/bioinformatics/btp352

Li, H., Nie, F., Zhu, L., et al., 2022. New insights into the dispersion history and adaptive evolution of taxon Aegilops tauschii in China. J. Genet. Genomics, 49: 185-194. DOI:10.1016/j.jgg.2021.11.004

Lin, Q., Zong, Y., Xue, C., et al., 2020. Prime genome editing in rice and wheat. Nat. Biotechnol., 38: 582-585. DOI:10.1038/s41587-020-0455-x

Mizuno, N., Yamasaki, M., Matsuoka, Y., et al., 2010. Population structure of wild wheat D-genome progenitor Aegilops tauschii Coss.: implications for intraspecific lineage diversification and evolution of common wheat. Mol. Ecol., 19: 999-1013. DOI:10.1111/j.1365-294X.2010.04537.x

Ni, P., Zhao, Y., Zhou, X., et al., 2023. Efficient and versatile multiplex prime editing in hexaploid wheat. Genome Biol., 24. DOI:10.1186/s13059-023-02990-1

Ou, J., Zhu, L.J., 2019. trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data. Nat. Methods, 16: 453-454. DOI:10.1038/s41592-019-0430-y

Payne, P.I., Law, C.N., Mudd, E.E., 1980. Control by homoeologous group 1 chromosomes of the high-molecular-weight subunits of glutenin, a major protein of wheat endosperm. Theor. Appl. Genet., 58: 113-120. DOI:10.1007/bf00263101

Rooke, L., Békés, F., Fido, R., et al., 1999. Overexpression of a gluten protein in transgenic wheat results in greatly increased dough strength. J. Cereal. Sci., 30: 115-120. DOI:10.1006/jcrs.1999.0265

Shewry, P., 2019. What is gluten—why is it special?. Front. Nutr., 6. DOI:10.3389/fnut.2019.00101

Shewry, P., 2022. Wheat grain proteins: past, present, and future. Cereal Chem., 100: 9-22. DOI:10.1002/cche.10585

Singh, N., Wu, S., Tiwari, V., et al., 2019. Genomic analysis confirms population structure and identifies inter-lineage hybrids in Aegilops tauschii. Front. Plant Sci., 10. DOI:10.3389/fpls.2019.00009

Tarailo-Graovac, M., Chen, N., 2009. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinf., 25. DOI:10.1002/0471250953.bi0410s25

Wang, J., Luo, M.C., Chen, Z., et al., 2013. Aegilops tauschii single nucleotide polymorphisms shed light on the origins of wheat D-genome genetic diversity and pinpoint the geographic origin of hexaploid wheat. New Phytol., 198: 925-937. DOI:10.1111/nph.12164

Wickham, H., 2016. ggplot2: Elegant Graphics for Data Analysis New York: Springer-Verlag.

Yang, C., Chen, Q., Xin, M., et al., 2023. A highly conserved amino acid in high molecular weight glutenin subunit 1Dy12 contributes to gluten functionality and processing quality in wheat. J. Genet. Genomics, 50: 909-912. DOI:10.1016/j.jgg.2022.11.002

Zhang, R., Jia, G., Diao, X., 2023. geneHapR: an R package for gene haplotypic statistics and visualization. BMC Bioinformatics, 24. DOI:10.1186/s12859-023-05318-9

Zhou, Y., Bai, S., Li, H., et al., 2021. Introgressing the Aegilops tauschii genome into wheat as a basis for cereal improvement. Nat. Plants, 7: 774-786. DOI:10.1038/s41477-021-00934-w

Zong, Y., Song, Q., Li, C., et al., 2018. Efficient C-to-T base editing in plants using a fusion of nCas9 and human APOBEC3A. Nat. Biotechnol., 36: 950-953. DOI:10.1038/nbt.4261

Article Information

Article history

Workspace