Applying image clustering to phylogenetic analysis: A trial
Li-Dan Tao a,b, Wei-Bang Sun a,c,*
a. Yunnan Key Laboratory for Integrative Conservation of Plant Species with Extremely Small Populations, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan 650201, China;
b. University of Chinese Academy of Sciences, 100049 Beijing, China;
c. Kunming Botanical Garden, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China

Phylogenetic studies have increased in recent years, largely due to rapid developments in sequencing techniques (Tucker et al., 2017). However, molecular phylogenetic studies rely on collecting biomaterials, which limits their applicability to many plants, especially those that are small, rare, or inaccessible. Recently, image classification using supervised or semi-supervised approaches (Dyrmann et al., 2016; Wäldchen et al., 2018; Draper et al., 2020; Pushpanathan et al., 2021) has been widely used in botanical studies to identify plants. Furthermore, several highly accurate pretrained open-source image classification models have been published. This raises the possibility that, if image clustering analysis is combined with pretrained machine learning models, databases of plant photographs can be used in phylogenetic studies. However, the use of image clustering in phylogenetic studies is rare, particularly as an update of traditional morphological clustering or a supplement to molecular phylogenetic approaches. Here, we introduce an image clustering approach to reconstruct the phylogeny of Corybas Salisb. (Orchidaceae). Corybas is a genus of approximately 170 terrestrial orchid species with tiny flowers (Lyon, 2016; Wang et al., 2020). Although Corybas is distributed broadly from the southernmost Macquarie Island near Antarctica to Xizang, China, most Corybas species are endemic to small areas in New Guinea, Australia, New Zealand, Southeast Asia, or the Pacific islands (Clements et al., 2007). The distribution of Corybas, along with the small size of these orchids, makes it difficult to collect biomaterials, thus limiting our ability to analyze their phylogeny.

In this study, we evaluated the ability of image cluster analysis to reconstruct the phylogeny of Corybas based on two pretrained deep convolutional neural network (CNN) architectures, ResNet50 (He et al., 2016) and InceptionResNetV2 (Szegedy et al., 2017). Both are widely used in image processing, including plant species recognition, and have proven to be accurate (Heredia, 2017; Krause et al., 2019). Our image data set consisted of freely available public images downloaded from GBIF (https://www.gbif.org/, 2021-11-8), PPBC (http://ppbc.iplant.cn/, 2021-11-8), and published literature (Jin et al., 2012; Lyon, 2014; Go et al., 2015; Faria, 2016; Hsu et al., 2016; Lehnebach et al., 2016; Tandang et al., 2020a, 2020b; Inuthai et al., 2021). To evaluate the accuracy of our approach, we compared phylogenetic trees generated by image clustering to high-quality phylogenetic trees of over 80 taxa of Corybas that were previously reconstructed using four chloroplast genes and three nuclear genes (Lyon, 2014).

Images of specimens and images not focusing on Corybas plants were removed. A total of 4736 images of 78 Corybas species were used (Table S1 in Supplementary file 1). Three sampling strategies were compared: 1) all species with at least 1 image, 2) species with at least 5 images, and 3) species with at least 10 images. Image clustering was conducted using Keras (https://keras.io/) in Python 3.8.5 following official guidelines from Keras (Supplementary file 2; Fig. 1). Basic models, excluding fully connected layers, were fine-tuned on a data set containing images of Corybas species with more than 100 images (training set : validation set = 3 : 1). The last 5 layers of ResNet50 and the last 281 layers of InceptionResNetV2 were set as trainable.
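
The three sampling strategies and the 3 : 1 split can be illustrated with a minimal sketch. It is not the authors' code (that is in Supplementary file 2). The function names `select_species` and `split_train_val`, the species labels, and the image counts are hypothetical and chosen only for illustration.

```python
import random

def select_species(image_counts, min_images):
    """Return the species that have at least `min_images` images, sorted by name."""
    return sorted(sp for sp, n in image_counts.items() if n >= min_images)

def split_train_val(images, ratio=3, seed=0):
    """Shuffle images and split them so that training : validation is about ratio : 1."""
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    n_val = len(shuffled) // (ratio + 1)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical image counts, NOT the values reported in Table S1.
counts = {"sp_A": 1, "sp_B": 5, "sp_C": 12, "sp_D": 150}

# The three sampling strategies: at least 1, 5, or 10 images per species.
for threshold in (1, 5, 10):
    print(threshold, select_species(counts, threshold))

# Species with more than 100 images form the fine-tuning data set.
fine_tune_species = select_species(counts, 101)
train, val = split_train_val([f"img_{i}.jpg" for i in range(counts["sp_D"])])
print(fine_tune_species, len(train), len(val))
```

With real data, `counts` would be built from the filtered image folders, and the resulting file lists would be passed to the Keras data pipeline.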

Fig. 1 Flow chart of this research.

Image features of all 78 Corybas species were extracted using fine-tuned models (Figs. S2–7 in Supplementary file 1). To simplify the process, image segmentation, specific feature extraction from images (texture, width, shapes, etc.), or other complicated processes were not conducted. All images were treated equally without any weighting or grouping.

To obtain binary trees, we used an agglomerative clustering method. At each clustering step (node), the Euclidean distance between each species pair was calculated using one randomly selected image from each species; this calculation was repeated five times to obtain an average distance for each species pair. The nearest pair was then grouped, and the procedure was iterated until all species were grouped together, yielding a binary tree. The whole process was repeated 1000 times to obtain 1000 trees, which were then combined to form a 50% majority consensus tree using PAUP* v.4.0a (Swofford, 2003) (Fig. 1).
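
The clustering procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation. In particular, the sketch assumes that a merged cluster pools the images of its member species, which the text does not specify. The clade counting at the end is only a stand-in for the consensus step that was run in PAUP*. All function names and the toy feature vectors are hypothetical.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sampled_distance(images_a, images_b, rng, n_draws=5):
    """Mean Euclidean distance over n_draws randomly drawn image pairs."""
    total = 0.0
    for _ in range(n_draws):
        total += euclidean(rng.choice(images_a), rng.choice(images_b))
    return total / n_draws

def build_tree(features, rng, n_draws=5):
    """Greedy agglomerative clustering; returns a nested-tuple binary tree.

    features: dict mapping species name -> list of feature vectors.
    Assumption: a merged cluster pools the images of its member species.
    """
    clusters = [(name, list(vecs)) for name, vecs in sorted(features.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sampled_distance(clusters[i][1], clusters[j][1], rng, n_draws)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])

def clades(tree):
    """Leaf sets of all internal nodes below the root."""
    found = set()
    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaves(node))
        walk(node[0], False)
        walk(node[1], False)
    walk(tree, True)
    return found

def majority_clades(trees, threshold=0.5):
    """Clades found in more than `threshold` of the trees, with their frequencies."""
    counts = {}
    for t in trees:
        for c in clades(t):
            counts[c] = counts.get(c, 0) + 1
    return {c: n / len(trees) for c, n in counts.items() if n / len(trees) > threshold}

# Toy 2-D "features" for four hypothetical species in two well-separated groups.
features = {
    "A": [(0.0, 0.0), (0.1, 0.0)],
    "B": [(0.0, 0.2), (0.1, 0.2)],
    "C": [(10.0, 10.0), (10.1, 10.0)],
    "D": [(10.0, 10.2), (10.1, 10.2)],
}
rng = random.Random(42)
trees = [build_tree(features, rng) for _ in range(100)]
support = majority_clades(trees)
```

Because each distance is estimated from random image pairs, repeated runs on real data can yield different trees; summarizing many runs by clade frequency is what makes the 50% majority consensus informative.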

The Simplified Silhouette Index (SSI) was used to evaluate the clustering performance at each node in the 50% majority consensus tree (Rousseeuw, 1987). The average distances between every two clusters were calculated separately using features of all images (Supplementary file 3). Topological differences between trees were evaluated by symmetric differences calculated using PAUP* (Penny and Hendy, 1985); consensus trees were visualized using iTOL (https://itol.embl.de/) (Letunic and Bork, 2021).
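
As an illustration of the index, the sketch below implements one common formulation of the simplified silhouette, in which each point is compared with cluster centroids rather than with all other points. The authors' exact computation is given in Supplementary file 3 and may differ. The function names and toy data are hypothetical.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def simplified_silhouette(clusters):
    """Mean simplified silhouette over all points.

    clusters: dict mapping cluster label -> list of feature vectors
    (at least two clusters are required).
    For each point, a is the distance to its own centroid and b is the
    distance to the nearest other centroid; the score is (b - a) / max(a, b).
    """
    centers = {label: centroid(vecs) for label, vecs in clusters.items()}
    scores = []
    for label, vectors in clusters.items():
        for x in vectors:
            a = euclidean(x, centers[label])
            b = min(euclidean(x, c) for other, c in centers.items() if other != label)
            denom = max(a, b)
            scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Two well-separated toy clusters score close to 1.
separated = {"left": [(0.0, 0.0), (0.0, 2.0)], "right": [(10.0, 0.0), (10.0, 2.0)]}
# Two interleaved toy clusters with identical centroids score 0.
mixed = {"left": [(0.0, 0.0), (10.0, 2.0)], "right": [(10.0, 0.0), (0.0, 2.0)]}
print(simplified_silhouette(separated), simplified_silhouette(mixed))
```

Values near 1 indicate compact, well-separated clusters at a node, whereas values near 0 or below indicate overlapping clusters.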

Flowers of Corybas from Australia are spherical, cystic, or crescent shaped, with lateral sepals and nearly absent petals; flowers of Corybas from New Zealand are usually spherical, ovate, or cystic with a beaked helmet, and their lateral sepals and petals are much longer than the dorsal sepals. The morphologies of Corybas species from Asia and the Pacific islands are intermediate between these two morphologies. Image clustering analysis grouped the morphological characters of Corybas in accordance with their distribution and dispersal history. These morphologically based clusters were also consistent with phylogenetic trees based on molecular data (Lyon, 2014).

Images of some species tended to cluster together, such as species from China and species from New Zealand, whereas species from Australia and Pacific Islands were separated and had variable placement in clusters, consistent with phylogenetic trees based on morphology and chloroplast genes in Lyon (2014) (Fig. 2 and Figs. S8–24 in Supplementary file 1). Our results also showed that images from some species do not cluster correctly, such as Corybas hispidus, C. rotundifolius, and C. acuminatus. Moreover, image clustering left some clades unresolved (Fig. 2).

Fig. 2 Image clustering of 50 Corybas species, each with at least five images, based on an ensemble of fine-tuned ResNet50 and InceptionResNetV2 models (images from GBIF; see Table S4 in Supplementary file 1 for information on image authors).

Fine-tuned InceptionResNetV2 had the lowest symmetric difference within its trees, the highest symmetric difference compared with random trees, and the highest silhouette score, which indicates that InceptionResNetV2 performed better than ResNet50. This finding is consistent with other studies (Mahdianpari et al., 2018). In addition, fine-tuning significantly improved the performance of InceptionResNetV2 (Tables 2 and 3; Fig. S1 in Supplementary file 1; Supplementary file 4).
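
The two symmetric-difference comparisons (among a model's own trees, and against random trees) can be illustrated with a short sketch. It is a simplification: it counts clades of rooted trees written as nested tuples, whereas PAUP* compares bipartitions, so values need not match PAUP* output. The function names and example trees are hypothetical.

```python
import random

def clades(tree):
    """Return the set of non-trivial clades of a rooted nested-tuple tree."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members
    root = walk(tree)
    found.discard(root)  # the clade containing every taxon is uninformative
    return found

def symmetric_difference(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

def random_tree(taxa, rng):
    """Random binary tree built by repeatedly joining two randomly chosen subtrees."""
    nodes = list(taxa)
    while len(nodes) > 1:
        i, j = rng.sample(range(len(nodes)), 2)
        joined = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [joined]
    return nodes[0]

t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ((("A", "B"), "C"), "D")
print(symmetric_difference(t1, t1))  # identical trees share all clades
print(symmetric_difference(t1, t2))  # no clade is shared
print(symmetric_difference(t1, t3))  # only the clade {A, B} is shared

baseline = random_tree(["A", "B", "C", "D", "E"], random.Random(1))
```

Under this measure, a model whose repeated trees differ little from one another but differ strongly from random trees such as `baseline` is producing a stable, non-random topology.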

Even though our method was not fully optimized, and our data set was not sufficient to train a new neural network model, our models performed well because they were derived from pretrained models. The high quality of pretrained models and the increasing availability of plant images from online databases make image clustering with pretrained machine learning models cheap, efficient, and of great research value. Like morphological analysis, image clustering may be unable to fully reflect the evolutionary history of taxa. However, image clustering approaches may be useful as supplements to, not substitutes for, molecular phylogenetic analysis. Notably, image clustering analysis may be able to resolve relationships in clades for which molecular data are inadequate. Recently, integrative methods for species identification using both morphological and DNA sequence data have received much attention (Yang et al., 2022). A similar integrative method could be used to resolve phylogenetic relationships, with image clustering filling in gaps in unresolved branches and flagging clades in which image and molecular clusters are inconsistent.

Below are suggestions for optimizing image clustering analyses:

1) The performance of image clustering may be dramatically affected by the choice of basic model, clustering method, and hyperparameters. Other basic models should be assessed, including AlexNet (Krizhevsky et al., 2017), Inception v.4 (Szegedy et al., 2017), and Xception (Chollet, 2017), as should other clustering methods, such as self-organizing maps, maximum parsimony trees, and Bayesian methods.

2) Images from public databases contain substantial noise; thus, developing automatic image filtering methods is necessary.

3) It is also necessary to add more images or more data augmentation treatments, such as segmentation, sliding window lattice, and rotation. When species with fewer than five images were included in clustering, the number of unresolved clades increased dramatically, and the resulting clusters were inconsistent with morphological features. Thus, we strongly recommend using at least five images per taxon (Supplementary file 1). However, too many missing taxa can also lead to biased clustering. In this study, there were fewer images of Corybas species from the Pacific Islands than from other regions. Therefore, image clustering would benefit from more attention to public plant image databases and deeper cooperation between volunteers, plant lovers, and botanists.
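
Two of the treatments named in suggestion 3, sliding windows and rotation, can be illustrated with a minimal sketch. It assumes that "sliding window lattice" means cropping an image on a regular grid of fixed-size windows, which is an interpretation rather than something the text specifies. The function names and toy image are hypothetical.

```python
def sliding_windows(width, height, window, stride):
    """Yield (left, top, right, bottom) crop boxes on a regular grid."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, left + window, top + window)

def rotate90(pixels):
    """Rotate a row-major 2-D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]

# A 4 x 4 image cropped with 2 x 2 windows and a stride of 2 gives four crops.
boxes = list(sliding_windows(4, 4, window=2, stride=2))
print(boxes)

# Rotating a tiny 2 x 2 "image" moves its top row to the right-hand column.
rotated = rotate90([[1, 2], [3, 4]])
print(rotated)
```

Each crop or rotation yields an additional training or clustering image from the same photograph, which may help taxa that are represented by only a few images.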

Acknowledgement

This work was equally supported by the Second Tibetan Plateau Scientific Expedition and Research Program (2019QZKK0502 to H.S.), National Natural Science Foundation of China (Grant No. 32001230 to L.T.), and the Science and Technology Basic Resources Investigation Program of China (Grant No. 2017FY100100 to W.S.). We thank Prof. Gao Chen and Dr. Yongsheng Chen from the Kunming Institute of Botany, Chinese Academy of Sciences, for their valuable suggestions.

Author contributions

L.T. performed research and wrote the manuscript; W.S. revised the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pld.2022.11.001.

References
Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800-1807. https://doi.org/10.1109/CVPR.2017.195.
Clements, M.A., Mackenzie, A.M., Copson, G.R., et al., 2007. Biology and molecular phylogenetics of Nematoceras sulcatum, a second endemic orchid species from subantarctic Macquarie Island. Polar Biol., 30: 859-869. DOI:10.1007/s00300-006-0246-y
Dyrmann, M., Karstoft, H., Midtiby, H.S., 2016. Plant species classification using deep convolutional neural network. Biosyst. Eng., 151: 72-80. DOI:10.1016/j.biosystemseng.2016.08.024
Draper, F.C., Baker, T.R., Baraloto, C., et al., 2020. Quantifying tropical plant diversity requires an integrated technological approach. Trends Ecol. Evol., 35: 1100-1109. DOI:10.1016/j.tree.2020.08.003
Faria, E., 2016. Diversity in the genus Corybas Salisb. (Orchidaceae, Diurideae) in New Caledonia. Adansonia, 38: 175-198. DOI:10.5252/a2016n2a4
Go, R., Ching, T.M., Nuruddin, A.A., et al., 2015. Extinction risks and conservation status of Corybas (Orchidaceae; Orchidoideae; Diurideae) in peninsular Malaysia. Phytotaxa, 233: 273-280. DOI:10.11646/phytotaxa.233.3.4
He, K., Zhang, X., Ren, S., et al., 2016. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 7: 171-180.
Heredia, I., 2017. Large-scale plant classification with deep neural networks. 14th ACM International Conference on Computing Frontiers 259-262. https://doi.org/10.1145/3075564.3075590.
Hsu, T.C., Yang, T.Y.A., Pitisopa, F., et al., 2016. New records and name changes for the Orchids in the Solomon Islands. Taiwania, 61: 21-26.
Inuthai, J., Chantanaorrapint, S., Poopath, M., et al., 2021. Corybas papillatus (Orchidaceae), a new orchid species from peninsular Thailand. PhytoKeys, 187: 1-7. DOI:10.3897/phytokeys.183.71167
Jin, N.Y., Go, R., Nulit, R., 2012. Orchids of cloud forest in Genting Highlands, Pahang, Malaysia. Sains Malays., 41: 505-526.
Krause, J., Baek, K., Lim, L., 2019. A guided multi-scale categorization of plant species in natural images. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 2639-2647. https://doi.org/10.1109/CVPRW.2019.00320.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60: 84-90. DOI:10.1145/3065386
Lehnebach, C.A., Zeller, A.J., Frericks, J., et al., 2016. Five new species of Corybas (Diurideae, Orchidaceae) endemic to New Zealand and phylogeny of the Nematoceras clade. Phytotaxa, 270: 1-24. DOI:10.11646/phytotaxa.270.1.1
Letunic, I., Bork, P., 2021. Interactive Tree of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res., 49: W293-W296. DOI:10.1093/nar/gkab301
Lyon, S.P., 2014. Molecular Systematics, Biogeography, and Mycorrhizal Associations in the Acianthinae (Orchidaceae), with a Focus on the Genus Corybas. University of Wisconsin, Madison.
Lyon, S.P., 2016. Six new species of New Guinea Corybas. Malesian Orchid J., 18: 85-103.
Mahdianpari, M., Salehi, B., Rezaee, M., et al., 2018. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Rem. Sens., 10: 1119. DOI:10.3390/rs10071119
Penny, D., Hendy, M.D., 1985. The use of tree comparison metrics. Syst. Zool., 34: 75-82. DOI:10.2307/2413347
Pushpanathan, K., Hanafi, M., Mashohor, S., et al., 2021. Machine learning in medicinal plants recognition: a review. Artif. Intell. Rev., 54: 305-327. DOI:10.1007/s10462-020-09847-0
Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20: 53-65. DOI:10.1016/0377-0427(87)90125-7
Swofford, D.L., 2003. PAUP*. Phylogenetic Analysis Using Parsimony. Sinauer Associates, Sunderland, Massachusetts. Version 4.
Szegedy, C., Ioffe, S., Vanhoucke, V., et al., 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. Thirty-first AAAI Conference on Artificial Intelligence.
Tandang, D.N., Reyes, T., Bustamante, R.A.A., et al., 2020a. Corybas boholensis (Orchidaceae): a new jewel orchid species from Bohol Island, Central Visayas, Philippines. Phytotaxa, 477: 261-268. DOI:10.11646/phytotaxa.477.2.10
Tandang, D.N., Bustamante, R.A.A., Ferreras, U., et al., 2020b. Corybas circinatus (Orchidaceae), a new species from Palawan, the Philippines. Phytotaxa, 446: 135-140. DOI:10.11646/phytotaxa.446.2.7
Tucker, C.M., Cadotte, M.W., Carvalho, S.B., et al., 2017. A guide to phylogenetic metrics for conservation, community ecology and macroecology. Biol. Rev., 92: 698-715. DOI:10.1111/brv.12252
Wäldchen, J., Rzanny, M., Seeland, M., et al., 2018. Automated plant species identification—trends and future directions. PLoS Comput. Biol., 14: e1005993. DOI:10.1371/journal.pcbi.1005993
Wang, H.C., Yang, J., Sun, W.B., 2020. Complete chloroplast genome of the endangered Corybas taliensis (Orchidaceae), a plant species with extremely small populations endemic to China. Mitochondrial DNA B, 5: 1884-1885. DOI:10.1080/23802359.2020.1753591
Yang, B., Zhang, Z.X., Yang, C.Q., et al., 2022. Identification of species by combining molecular and morphological data using convolutional neural networks. Syst. Biol., 71: 690-705. DOI:10.1093/sysbio/syab076