Computer vision techniques for fine-grained image classification have attracted considerable attention, e.g., for bird species[1], flower types[2], and dog species[3]. The task is challenging because some fine-grained categories (e.g., "eared plover" and "horned plover") can only be recognized by domain experts. Unlike general classification tasks, fine-grained image classification requires localizing and representing very small visual differences among subcategories.
To address these challenges, we propose a novel dual network (DNet) with hierarchical attention for fine-grained classification that requires no bounding box or part annotations. Its loss function couples the DNet with the hierarchical attention model to obtain a more accurate representation of fine-grained categorical features. Our contributions can be summarized as follows:
1) The DNet is built on paired images. The differences between the paired images are compared and incorporated into the loss function in terms of intra-class and inter-class variance. The loss function thus consists of two parts: the original classification loss of the network and the comparison loss of the hierarchical attention features.
2) A hierarchical attention model is introduced to select salient features. From the top layer to the bottom layer, the attention of each layer describes its own characteristics. Through the max-K response block, the attention model carries multi-scale information and learns more discriminative features.
3) We performed comprehensive experiments and analyses on four datasets (CUB-200-2011, Stanford dogs, Oxford-flower-17, and Disaster-scene) and demonstrated superior performance on all of them.
1 Related work
Research on fine-grained image classification focuses on two main areas: discriminative part localization and weakly supervised methods.
1.1 Discriminative part localization
Early approaches to fine-grained image classification required strong annotations, often using bounding boxes and part annotations to locate salient regions. Lam et al.[4] proposed generating a set of bounding boxes in the image, whose "informativeness" was evaluated by a heuristic function. Detectors[5-6] have also been designed to automatically localize the object or discriminative regions and encode discriminative features. The hierarchical classification method[7] obtains salient features by localizing the attention region for accurate recognition. These methods provide a way of encoding salient features for distinguishing subcategories, but such annotations are expensive and unrealistic in many practical applications.
1.2 Weakly supervised methods
Weakly supervised methods have been studied extensively for fine-grained image classification. To better model the subtle differences present in fine-grained categories, a bilinear structure[8] was proposed to capture local differences with two independent CNNs, achieving superior results on the CUB-200-2011 dataset. Compact bilinear representation[9] is a kernelized form of bilinear pooling that provides the discriminative power of bilinear pooling with only a few dimensions. Kong and Fowlkes[10] proposed a classifier that factorizes the collection of bilinear classifiers into a common factor and compact terms. Nie et al.[11] proposed a novel module, named the forcing module, to force the network to extract more diverse features. Özdemir[12] proposed a custom pooling method that calculates a weighted average of the dominant features. Li et al.[13] proposed a semantic bilinear pooling method to learn semantic information from hierarchical levels. Ji et al.[14] borrowed concepts from self-supervised learning and Siamese networks, taking a step toward alleviating overfitting. A region enhancement and suppression approach[15] introduces a plug-and-play significant region diffusion (SRD) module for explicitly enhancing significant features. The attention module for two-branch networks (DAL-Net)[16] acquires features in a weakly supervised manner. The weakly supervised spatial group attention network (WSSGA-Net)[17] highlights the correct semantic feature regions for more accurate classification by establishing a semantic enhancement mechanism.
2 Approach
In this section, we introduce the DNet with hierarchical attention for fine-grained image classification, which can be trained end to end.
2.1 Overall structure
Figure 1 shows the overall structure of our approach. It consists of four parts: CONV layers, attention block, interaction model, and loss function.
Fig. 1 Structure of network
The network takes two images as input and generates features through convolutional layers. On one side, the hierarchical attention mechanism selects salient features from the images; comparing the differences between the paired images yields loss0 and loss1. On the other side, the predictions of the interaction model yield loss2 and loss3. The total loss is then used to iteratively adjust the network parameters so that the network learns to select salient features.
Implementation notes: the CONV blocks are the last three conv layers of VGG-16; for ResNet-18, they are conv3_x, conv4_x, and conv5_x. In the testing stage, only the forward branch of input1 or input2 is used to obtain the test result. In the interaction model, multiple methods are implemented, including bilinear CNN (BCNN)[8], compact bilinear pooling (CBP)[9], hierarchical bilinear pooling (HBP)[18], pyramid hybrid pooling quantization (PHPQ)[19], and guided cluster aggregation (GCA)[20].
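As a concrete illustration (a minimal sketch, not the authors' released code), the following PyTorch snippet collects feature maps from the last three conv layers of a pretrained VGG-16 with forward hooks; the torchvision layer indices and the choice of hooking the conv outputs rather than the subsequent ReLU outputs are our assumptions.

```python
import torch
import torchvision

# Sketch: grab the feature maps of the last three conv layers of VGG-16.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
hook_idx = [24, 26, 28]            # last three 3x3 conv layers in torchvision's VGG-16
feature_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output          # each is (B, 512, 14, 14) for 224x224 inputs
    return hook

for i in hook_idx:
    backbone[i].register_forward_hook(make_hook(f"conv_{i}"))

x = torch.randn(2, 3, 224, 224)              # a pair of images stacked as one batch
_ = backbone(x)
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))
```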
2.2 Region selection
To obtain salient features, the framework adopts the hierarchical attention model (HA model). Given the feature maps F from the last three CONV blocks, Fc(x, y) denotes the value at spatial location (x, y) of channel c. The attention block is divided into two parts, attention maps and attention features, as shown in Fig. 2.
Fig. 2 Selection of salient features
We compute the saliency map of the attention mechanism, maskc(x, y), as
$ \operatorname{mask}_{c}(x, y)=\operatorname{sigmoid}\left(F_{c}(x, y)\right). $   (1)
The attention features in each CONV block are then obtained as
$ \boldsymbol{M}=\boldsymbol{F}\left(1+\operatorname{mask}_{c}(x, y)\right). $   (2)
Attention maps can be regarded as a set of spatial maps encoding the salient spatial areas of the inputs. Based on the generated attention map, the object region can be localized. The hierarchical attention model focuses on the distinguishing differences of parts among subcategories, which strengthens the salient feature values and suppresses noisy pixels and useless information.
After obtaining the salient features from the attention maps, irrelevant regions should be filtered out. As shown in Fig. 2(b), the output of the attention map is more salient on the objects.
The outputs of the attention map are then flattened into vectors and sorted in ascending order. In this process, the max-K response block is added to the framework to find the main objects or the relevant region bounded by the red box. Finally, the max-K salient values are selected. The main steps of the method are listed in Algorithm 1 below.
Here, the salient object features are extracted automatically without any object or part annotations. The salient feature representation (512 × 3 dimensions in our experiments) comes entirely from the hierarchical attention block.
Algorithm 1: the main process of attention feature extraction
Input: the feature outputs F of the CONV blocks
Output: attention feature S
1: Process all feature maps by Eqs. (1) and (2).
2: Reshape the outputs M of the attention map into column vectors.
3: Sort all pixels of the attention map M in ascending order.
4: Select the max-K response values in each attention block (K = 512 in our experiments).
5: Obtain S by concatenating the K salient feature values of each layer.
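The following PyTorch sketch is our reading of Eqs. (1)-(2) and Algorithm 1 for a single CONV block (not the authors' code); the ResNet-18-style dummy feature-map shapes are assumptions used only for illustration.

```python
import torch

def attention_features(feat, k=512):
    """Top-K attention responses of one CONV block, following our reading of Algorithm 1.
    feat: feature maps of shape (B, C, H, W). Returns a (B, k) tensor."""
    mask = torch.sigmoid(feat)                 # Eq. (1): saliency map
    m = feat * (1.0 + mask)                    # Eq. (2): attention features
    flat = m.flatten(start_dim=1)              # reshape into column vectors
    topk, _ = flat.topk(k, dim=1)              # keep the max-K responses (K = 512 here)
    return topk

# Hierarchical attention feature S: concatenate the top-K values of each block.
f3 = torch.randn(2, 128, 28, 28)               # conv3_x-like output (assumed shape)
f4 = torch.randn(2, 256, 14, 14)               # conv4_x-like output
f5 = torch.randn(2, 512, 7, 7)                 # conv5_x-like output
S = torch.cat([attention_features(f) for f in (f3, f4, f5)], dim=1)   # (B, 512 * 3)
```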
After the hierarchical attention features are generated from the different blocks, they are used to compare the difference between the paired inputs by computing the Euclidean distance. Figure 3 illustrates the flowchart of the loss. Distance0, distance1, and distance2 are the distances between the paired images computed from the attention blocks. Loss0 is the difference between paired images from the same category, and loss1 is the difference between paired images from different categories.
Fig. 3 Loss of hierarchical attention block
Based on the intra-class variance and inter-class variance, we add these comparison loss terms. The total loss function is expressed as
$ \text{loss}=\text{loss0}-\text{loss1}+\text{loss2}+\text{loss3}, $   (3)
where loss2 and loss3 are the losses of the paired images from the main network and use the classification cross-entropy loss. Ideally, the variance should be smaller when the paired images belong to the same category and larger when they belong to different categories.
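A minimal sketch of Eq. (3) as we read it (not the authors' implementation); how the same-category and different-category pairs are weighted within a batch is our assumption.

```python
import torch
import torch.nn.functional as F

def dnet_loss(s1, s2, logits1, logits2, y1, y2):
    """Total loss of Eq. (3) for a batch of image pairs (a sketch, not official code).
    s1, s2: hierarchical attention features of the paired images, shape (B, D).
    logits1, logits2: class predictions of the two branches; y1, y2: labels."""
    dist = F.pairwise_distance(s1, s2)          # Euclidean distance between paired features
    same = (y1 == y2).float()
    loss0 = (same * dist).mean()                # intra-class pairs: distance should be small
    loss1 = ((1.0 - same) * dist).mean()        # inter-class pairs: distance should be large
    loss2 = F.cross_entropy(logits1, y1)        # classification loss of branch 1
    loss3 = F.cross_entropy(logits2, y2)        # classification loss of branch 2
    return loss0 - loss1 + loss2 + loss3        # Eq. (3)
```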
The proposed method uses paired images as inputs for training, but for testing and inference, only a single branch is used to classify images.
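For completeness, a minimal inference sketch under our assumptions about the interface: at test time a single trained branch classifies one image.

```python
import torch

@torch.no_grad()
def predict(branch, image):
    """branch: one forward path of the trained DNet; image: a (3, 224, 224) tensor."""
    branch.eval()
    logits = branch(image.unsqueeze(0))   # add a batch dimension
    return logits.argmax(dim=1).item()    # predicted class index
```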
3 Experiments
All experiments are implemented in PyTorch and run on NVIDIA GeForce RTX 2060 GPUs. The proposed approach is compared with multiple methods to verify its effectiveness. In all experiments, we only use image labels without any part or bounding box annotations. Accuracy is adopted as the metric for evaluating classification performance. In the original paper, CBP[9] uses both random Maclaurin (RM) and tensor sketch (TS) projections; in our experiments, the CBP model only uses TS projections. To reduce complexity, DNet on CBP only uses the features of the final attention block to compute the distance between paired images (distance2 in Fig. 3).
3.1 Datasets and baselines
The experiments are conducted on three challenging public fine-grained image classification datasets and a self-built disaster-scene dataset: CUB-200-2011, Stanford dogs, Oxford-flower-17, and Disaster-scene. Detailed statistics, with category numbers and data splits, are summarized in Table 1.
Table 1 Statistics of fine-grained datasets used in this paper
We use two basic CNNs, VGG-16 and ResNet-18, which achieved state-of-the-art performance on ImageNet. VGG-16 has 16 layers, including 13 convolutional layers with ReLU activations and 3 fully connected layers. ResNet-18 has four basic conv blocks, each of which includes two residual units. A weight decay of 0.000 1 and a momentum of 0.9 are used in training, and the initial learning rate is set to 0.01. Both networks are initialized with parameters pretrained on ImageNet. During training, the learning rate is divided by 10 every 20 iterations. The training stage uses pairs of images as inputs, whereas in the test stage only a single image is input to measure the accuracy.
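A sketch of this training schedule with the stated hyperparameters; that the optimizer is SGD and that the decay is a step schedule are our assumptions based on the momentum and decay values reported above.

```python
import torch

model = torch.nn.Linear(512 * 3, 200)   # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# learning rate divided by 10 every 20 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```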
3.2 Experiments on CUB-200-2011
Table 2 shows the comparison results on the CUB-200-2011 dataset. The training images are randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. In HACBP, the baseline network is VGG-16, and the three attention models are built on the last three convolutional layers of VGG-16.
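Our reading of the reported preprocessing as torchvision transforms; the resize size, the horizontal flip, and the ImageNet normalization are assumptions not stated in the text.

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.Resize(256),                      # assumed resize before cropping
    T.RandomCrop(224),                  # random 224x224 training crops
    T.RandomHorizontalFlip(),           # assumed augmentation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
test_tf = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),                  # 224x224 center crop at test time
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```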
Table 2 Comparison results on CUB-200-2011 dataset
Table 2 shows that the HA model boosts performance over the original models. The accuracy of CBP[9] is 72.75%, while that of HACBP is 73.50%. With ResNet-18 as the baseline, the accuracy of HBP[18] is 69.69% and that of HAHBP is 71.30%. We then evaluated the DNet: when DNet is added on top of CBP, the accuracy reaches 74.56%, the best result. We also tested some recent methods, such as PHPQ[19] and GCA[20]. PHPQ[19] captures and retains fine-grained semantic information in multi-level features; GCA[20] divides the data into small clusters and then aggregates them. As shown in Table 2, the proposed approach on CBP achieves the highest classification accuracy without object or part annotations.
We also visualize the whole process from the original image to the output of the attention model (Fig. 4). The first image is the original image, the second is the output of data augmentation, the third is the output of the attention map, and the last shows the box of the extracted features. Overall, the DNet achieves a significant improvement.
Fig. 4 Process from input to attention model on CUB-200-2011
The classification accuracy on the Stanford dogs dataset is summarized in Table 3. The training images are randomly cropped into 224×224 patches, and the test images are processed with 11 center crops (each of size 224×224). Following the same procedure, we added the HA model to different networks and obtained impressive gains. The accuracy of HACBP (78.74%) is higher than that of CBP[9], and the accuracy of HAHBP (78.26%) is higher than that of HBP[18]. The accuracy of DNet on CBP is the highest at 80.02%, which is 0.19% higher than PHPQ[19].
Table 3 Comparison results on Stanford dogs dataset
Figure 5 shows the transformation process of a sample image, from the original image to the output after data augmentation. The main feature points are then obtained from the attention map. The proposed attention model can extract salient features from the discriminative region.
Fig. 5 Process from input to attention model on Stanford dogs
Table 4 shows the comparison results on the Oxford-flower-17 dataset. The training images are also randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. The accuracy of HACBP is 89.41%, compared with 89.12% for the original CBP[9]; the accuracy of HAHBP is 90.15%, compared with 88.09% for the original HBP[18]. The accuracy of DNet on CBP is the highest at 95.15%.
Table 4 Comparison results on Oxford-flower-17 dataset
We also show the process for a sample image from the original input to the attention model (Fig. 6). First, the attention map is obtained from the designed attention model; then, salient features are extracted from the red box.
Fig. 6 Process from input to attention model on Oxford-flower-17
The proposed methods are also applied to disaster-scene estimation, covering earthquakes, tsunamis, and tornadoes, which are extremely complex at a finer scale. The database contains 1 757 color images of three disaster types: 760 earthquake images, 556 tsunami images, and 441 tornado images. Specifically, the earthquake data come from the 2010 Haiti earthquake and the 2011 Christchurch, New Zealand earthquake; the tsunami data come from the 2011 Tohoku tsunami in Japan and the 2004 Indonesian tsunami; and the tornado data were taken after the 2013 Moore tornado in Oklahoma, USA.
The annotation and labeling of the dataset were done by our team at the University of Missouri-Kansas City. Additionally, we divided the dataset into three damage levels, DL1, DL2, and DL3, which stand for no damage, mild damage, and severe damage, respectively.
The different methods are compared using VGG-16 as the baseline. The training images are randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. Data augmentation is also used to increase the number of images. The results are shown in Tables 5 and 6. Table 5 reports the prediction accuracy for the different disaster scenes, which are divided into three types (tornado, earthquake, and tsunami), each with training and test data.
Table 5 Prediction accuracy of different disaster-scenes (base model: VGG-16)
Table 6 Prediction accuracy of damage level in different disasters
In Table 5, the original CBP[9] achieves 86.58% accuracy, and HACBP achieves 86.89%. We also test the paired network with the attention model by applying DNet to CBP; the accuracy of DNet on CBP is the highest at 88.47%.
Table 6 reports the prediction accuracy of the damage levels. The first column is the prediction of tornado damage level, the second column the earthquake damage level, and the third column the tsunami damage level. Each disaster type is divided into three levels: no damage, mild damage, and severe damage.
In Table 6, the prediction of tornado damage level is the best: the original CBP[9] achieves 84.10% accuracy, HACBP achieves 87.12%, and DNet on CBP achieves 87.33%. The predictions for earthquake and tsunami damage levels are less satisfactory; although all exceed 50% accuracy, they require further research.
4 Conclusions
In this paper, we propose a hierarchical attention DNet for weakly supervised fine-grained image classification, which jointly integrates hierarchical attention and the dual network. The experimental results show that the hierarchical attention improves multi-scale feature learning and that the DNet learns discriminative features by comparing paired images. In conclusion, the proposed method performs well in fine-grained image classification without bounding box or part annotations, and it can be readily combined with various neural network models.
References
[1] Jagiello Z, Reynolds S J, Nagy J, et al. Why do some bird species incorporate more anthropogenic materials into their nests than others?[J]. Philosophical Transactions of the Royal Society B: Biological Sciences, 2023, 378(1884). DOI: 10.1098/rstb.2022.0156.
[2] Yang X D, Su J S, Qu Y X, et al. Dissecting the inheritance pattern of the anemone flower type and tubular floral traits of chrysanthemum in segregating F-1 populations[J]. Euphytica, 2023, 219(1): 16. DOI: 10.1007/s10681-022-03141-6.
[3] Park C, Moon N. Dog-species classification through CycleGAN and standard data augmentation[J]. Journal of Information Processing Systems, 2023, 19(1): 67-79. DOI: 10.3745/JIPS.02.0190.
[4] Lam M, Mahasseni B, Todorovic S. Fine-grained recognition as HSnet search for informative image parts[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. IEEE, 2017: 6497-6506. DOI: 10.1109/CVPR.2017.688.
[5] He X T, Peng Y X. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification[C]//Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. February 4-9, 2017, San Francisco, California, USA. ACM, 2017: 4075-4081. https://dl.acm.org/doi/10.5555/3298023.3298160.
[6] He X T, Peng Y X, Zhao J J. Fine-grained discriminative localization via saliency-guided faster R-CNN[C]//Proceedings of the 25th ACM International Conference on Multimedia. October 23-27, 2017, Mountain View, California, USA. ACM, 2017: 627-635. DOI: 10.1145/3123266.3123319.
[7] Zhou L, Wang W Q, Lyu K. FACR: a fast and accurate car recognizer[J]. Journal of University of Chinese Academy of Sciences, 2021, 38(1): 130-136. DOI: 10.7523/j.issn.2095-6134.2021.01.016 (in Chinese).
[8] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNNs for fine-grained visual recognition[EB/OL]. 2015: arXiv: 1504.07889. (2015-04-29)[2023-12-02]. http://arxiv.org/abs/1504.07889.
[9] Gao Y, Beijbom O, Zhang N, et al. Compact bilinear pooling[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA. IEEE, 2016: 317-326. DOI: 10.1109/CVPR.2016.41.
[10] Kong S, Fowlkes C. Low-rank bilinear pooling for fine-grained classification[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. IEEE, 2017: 7025-7034. DOI: 10.1109/CVPR.2017.743.
[11] Nie X, Chai B S, Wang L Y, et al. Learning enhanced features and inferring twice for fine-grained image classification[J]. Multimedia Tools and Applications, 2023, 82(10): 14799-14813. DOI: 10.1007/s11042-022-13619-z.
[12] Özdemir C. Avg-topk: new pooling method for convolutional neural networks[J]. Expert Systems with Applications, 2023, 223: 119892. DOI: 10.1016/j.eswa.2023.119892.
[13] Li X J, Yang C, Chen S L, et al. Semantic bilinear pooling for fine-grained recognition[C]//2020 25th International Conference on Pattern Recognition (ICPR). Milan, Italy. IEEE, 2021: 3660-3666. DOI: 10.1109/ICPR48806.2021.9412252.
[14] Ji R Y, Li J Y, Zhang L B. Siamese self-supervised learning for fine-grained visual classification[J]. Computer Vision and Image Understanding, 2023, 229: 103658. DOI: 10.1016/j.cviu.2023.103658.
[15] Pan W Y, Yang S Y, Qian X H, et al. Learn more: sub-significant area learning for fine-grained visual classification[C]//2023 IEEE International Conference on Image Processing (ICIP). Kuala Lumpur, Malaysia. IEEE, 2023: 485-489. DOI: 10.1109/ICIP49359.2023.10222241.
[16] Jung Y, Syazwany N S, Kim S, et al. Fine-grained classification via hierarchical feature covariance attention module[J]. IEEE Access, 2023, 11: 35670-35679. DOI: 10.1109/ACCESS.2023.3265472.
[17] Xie J J, Zhong Y J, Zhang J G, et al. A weakly supervised spatial group attention network for fine-grained visual recognition[J]. Applied Intelligence, 2023, 53(20): 23301-23315. DOI: 10.1007/s10489-023-04627-z.
[18] Yu C J, Zhao X Y, Zheng Q, et al. Hierarchical bilinear pooling for fine-grained visual recognition[C]//Computer Vision - ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI. 2018: 595-610. DOI: 10.1007/978-3-030-01270-0_35.
[19] Zeng Z Y, Wang J P, Chen B, et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval[J]. Pattern Recognition Letters, 2024, 178: 106-114. DOI: 10.1016/j.patrec.2023.12.022.
[20] Otholt J, Meinel C, Yang H J. Guided cluster aggregation: a hierarchical approach to generalized category discovery[C/OL]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2024: 2618-2627. [2024-03-12]. https://openaccess.thecvf.com/content/WACV2024/html/Otholt_Guided_Cluster_Aggregation_A_Hierarchical_Approach_to_Generalized_Category_Discovery_WACV_2024_paper.html.