Computer vision techniques for fine-grained image classification have attracted considerable attention, e.g., for bird species[1], flower types[2], and dog species[3]. The task is challenging because some fine-grained categories (e.g., "eared plover" and "horned plover") can only be recognized by domain experts. Unlike general classification tasks, fine-grained image classification requires localizing and representing very small visual differences among subcategories.
To address these challenges, we propose a novel dual network (DNet) with hierarchical attention for fine-grained classification that requires no bounding box or part annotations. Its loss function couples the DNet with the hierarchical attention model to obtain a more accurate representation of fine-grained categorical features. Our contributions can be summarized as follows:
1) The DNet is built on paired images. The differences between the paired images are compared and incorporated into the loss function in terms of intra-class and inter-class variance. The loss function thus consists of two parts: the original classification loss of the network and the comparison loss of the hierarchical attention features.
2) A hierarchical attention model is introduced to select salient features. From the top layer to the bottom layer, the attention of each layer describes its own characteristics. Through the max-K response block, the attention model carries multi-scale information and learns more discriminative features.
3) We performed comprehensive experiments and analyses on four datasets (CUB-200-2011, Stanford dogs, Oxford-flower-17, and Disaster-scene) and demonstrated superior performance on all of them.
1 Related work
Research on fine-grained image classification focuses on two main areas: discriminative part localization and weakly supervised methods.
1.1 Discriminative part localization
Early approaches to fine-grained image classification required strong annotations, often using bounding boxes and part annotations to locate salient regions. Lam et al.[4] proposed generating a set of bounding boxes in the image, whose "informativeness" was evaluated by a heuristic function. Detectors[5-6] have also been designed to automatically localize the object or discriminative regions and encode discriminative features. The hierarchical classification method[7] obtains salient features by localizing the attention region for accurate recognition. These methods provide a way of encoding salient features for distinguishing subcategories, but such annotations are expensive and unrealistic in many practical applications.
1.2 Weakly supervised methods
Weakly supervised methods have been studied extensively for fine-grained image classification. To better model the subtle differences present in fine-grained categories, a bilinear structure[8] was proposed to capture local differences with two independent CNNs, achieving superior results on the CUB-200-2011 dataset. Compact bilinear representation[9] is a kernelized form of bilinear pooling that provides the discriminative power of bilinear pooling with only a few dimensions. Kong and Fowlkes[10] proposed a classifier that factorizes the collection of bilinear classifiers into a common factor and compact terms. Nie et al.[11] proposed a novel module, named the forcing module, to force the network to extract more diverse features. Özdemir[12] proposed a custom pooling method that calculates a weighted average of the dominant features. Li et al.[13] proposed a semantic bilinear pooling method to learn semantic information from hierarchical levels. Ji et al.[14] borrowed concepts from self-supervised learning and Siamese networks, taking a step toward alleviating overfitting. A region enhancement and suppression approach[15] introduces a plug-and-play significant region diffusion (SRD) module for explicitly enhancing significant features. The attention module for two-branch networks (DAL-Net)[16] acquires features in a weakly supervised manner. The weakly supervised spatial group attention network (WSSGA-Net)[17] highlights the correct semantic feature regions for more accurate classification by establishing a semantic enhancement mechanism.
2 Approach
In this section, we introduce the DNet with hierarchical attention for fine-grained image classification, which can be trained end to end.
2.1 Overall structure
Figure 1 shows the overall structure of our approach. It consists of four parts: CONV layers, attention block, interaction model, and loss function.
Fig. 1 Structure of network
The network takes two images as input and generates features through convolutional layers. On one side, the hierarchical attention mechanism selects salient features from the images; comparing the differences between the paired images yields loss0 and loss1. On the other side, the predictions of the interaction model yield loss2 and loss3. The total loss is then used to iteratively adjust the network parameters so that the network learns to select salient features.
Implementation notes: the CONV blocks are the last three conv layers of VGG-16; for ResNet-18, they are conv3_x, conv4_x, and conv5_x. In the testing stage, only the forward branch of input1 or input2 is used to obtain the test result. In the interaction model, multiple methods are implemented, including bilinear CNN (BCNN)[8], compact bilinear pooling (CBP)[9], hierarchical bilinear pooling (HBP)[18], pyramid hybrid pooling quantization (PHPQ)[19], and guided cluster aggregation (GCA)[20].
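As a concrete illustration (a minimal sketch, not the authors' released code), the following PyTorch snippet collects feature maps from the last three conv layers of a pretrained VGG-16 with forward hooks; the torchvision layer indices and the choice of hooking the conv outputs rather than the subsequent ReLU outputs are our assumptions.

```python
import torch
import torchvision

# Sketch: grab the feature maps of the last three conv layers of VGG-16.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
hook_idx = [24, 26, 28]            # last three 3x3 conv layers in torchvision's VGG-16
feature_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output          # each is (B, 512, 14, 14) for 224x224 inputs
    return hook

for i in hook_idx:
    backbone[i].register_forward_hook(make_hook(f"conv_{i}"))

x = torch.randn(2, 3, 224, 224)              # a pair of images stacked as one batch
_ = backbone(x)
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))
```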
2.2 Region selection
To obtain salient features, the framework adopts the hierarchical attention model (HA model). Given the feature maps F from the last three CONV blocks, Fc(x, y) denotes the value at spatial location (x, y) of channel c. The attention block is divided into two parts, attention maps and attention features, as shown in Fig. 2.
Fig. 2 Selection of salient features
We compute the saliency map of the attention mechanism, maskc(x, y), as
$ \operatorname{mask}_{c}(x, y)=\operatorname{sigmoid}\left(F_{c}(x, y)\right). $   (1)
The attention features in each CONV block are then obtained as
$ \boldsymbol{M}=\boldsymbol{F}\left(1+\operatorname{mask}_{c}(x, y)\right). $   (2)
Attention maps can be regarded as a set of spatial maps encoding the salient spatial areas of the inputs. Based on the generated attention map, the object region can be localized. The hierarchical attention model focuses on the distinguishing differences of parts among subcategories, which strengthens the salient feature values and suppresses noisy pixels and useless information.
After obtaining the salient features from the attention maps, irrelevant regions should be filtered out. As shown in Fig. 2(b), the output of the attention map is more salient on the objects.
The outputs of the attention map are then flattened into vectors and sorted in ascending order. In this process, the max-K response block is added to the framework to find the main objects or the relevant region bounded by the red box. Finally, the max-K salient values are selected. The main steps of the method are listed in Algorithm 1 below.
Here, the salient object features are extracted automatically without any object or part annotations. The salient feature representation (512 × 3 dimensions in our experiments) comes entirely from the hierarchical attention block.
Algorithm 1: the main process of attention feature extraction
Input: the feature outputs F of the CONV blocks
Output: attention feature S
1: Process all feature maps by Eqs. (1) and (2).
2: Reshape the outputs M of the attention map into column vectors.
3: Sort all pixels of the attention map M in ascending order.
4: Select the max-K response values in each attention block (K = 512 in our experiments).
5: Obtain S by concatenating the K salient feature values of each layer.
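The following PyTorch sketch is our reading of Eqs. (1)-(2) and Algorithm 1 for a single CONV block (not the authors' code); the ResNet-18-style dummy feature-map shapes are assumptions used only for illustration.

```python
import torch

def attention_features(feat, k=512):
    """Top-K attention responses of one CONV block, following our reading of Algorithm 1.
    feat: feature maps of shape (B, C, H, W). Returns a (B, k) tensor."""
    mask = torch.sigmoid(feat)                 # Eq. (1): saliency map
    m = feat * (1.0 + mask)                    # Eq. (2): attention features
    flat = m.flatten(start_dim=1)              # reshape into column vectors
    topk, _ = flat.topk(k, dim=1)              # keep the max-K responses (K = 512 here)
    return topk

# Hierarchical attention feature S: concatenate the top-K values of each block.
f3 = torch.randn(2, 128, 28, 28)               # conv3_x-like output (assumed shape)
f4 = torch.randn(2, 256, 14, 14)               # conv4_x-like output
f5 = torch.randn(2, 512, 7, 7)                 # conv5_x-like output
S = torch.cat([attention_features(f) for f in (f3, f4, f5)], dim=1)   # (B, 512 * 3)
```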
After the hierarchical attention features are generated from the different blocks, they are used to compare the difference between the paired inputs by computing the Euclidean distance. Figure 3 illustrates the flowchart of the loss. Distance0, distance1, and distance2 are the distances between the paired images computed from the attention blocks. Loss0 is the difference between paired images from the same category, and loss1 is the difference between paired images from different categories.
Fig. 3 Loss of hierarchical attention block
Based on the intra-class variance and inter-class variance, we add these comparison loss terms. The total loss function is expressed as
$ \text{loss}=\text{loss0}-\text{loss1}+\text{loss2}+\text{loss3}, $   (3)
where loss2 and loss3 are the losses of the paired images from the main network and use the classification cross-entropy loss. Ideally, the variance should be smaller when the paired images belong to the same category and larger when they belong to different categories.
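A minimal sketch of Eq. (3) as we read it (not the authors' implementation); how the same-category and different-category pairs are weighted within a batch is our assumption.

```python
import torch
import torch.nn.functional as F

def dnet_loss(s1, s2, logits1, logits2, y1, y2):
    """Total loss of Eq. (3) for a batch of image pairs (a sketch, not official code).
    s1, s2: hierarchical attention features of the paired images, shape (B, D).
    logits1, logits2: class predictions of the two branches; y1, y2: labels."""
    dist = F.pairwise_distance(s1, s2)          # Euclidean distance between paired features
    same = (y1 == y2).float()
    loss0 = (same * dist).mean()                # intra-class pairs: distance should be small
    loss1 = ((1.0 - same) * dist).mean()        # inter-class pairs: distance should be large
    loss2 = F.cross_entropy(logits1, y1)        # classification loss of branch 1
    loss3 = F.cross_entropy(logits2, y2)        # classification loss of branch 2
    return loss0 - loss1 + loss2 + loss3        # Eq. (3)
```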
The proposed method uses paired images as inputs for training, but for testing and inference, only a single branch is used to classify images.
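For completeness, a minimal inference sketch under our assumptions about the interface: at test time a single trained branch classifies one image.

```python
import torch

@torch.no_grad()
def predict(branch, image):
    """branch: one forward path of the trained DNet; image: a (3, 224, 224) tensor."""
    branch.eval()
    logits = branch(image.unsqueeze(0))   # add a batch dimension
    return logits.argmax(dim=1).item()    # predicted class index
```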
3 Experiments
All experiments are implemented in PyTorch and run on NVIDIA GeForce RTX 2060 GPUs. The proposed approach is compared with multiple methods to verify its effectiveness. In all experiments, we only use image labels without any part or bounding box annotations. Accuracy is adopted as the metric for evaluating classification performance. In the original paper, CBP[9] uses both random Maclaurin (RM) and tensor sketch (TS) projections; in our experiments, the CBP model only uses TS projections. To reduce complexity, DNet on CBP only uses the features of the final attention block to compute the distance between paired images (distance2 in Fig. 3).
3.1 Datasets and baselines
The experiments are conducted on three challenging public fine-grained image classification datasets and a self-built disaster-scene dataset: CUB-200-2011, Stanford dogs, Oxford-flower-17, and Disaster-scene. Detailed statistics, with category numbers and data splits, are summarized in Table 1.
Table 1 Statistics of fine-grained datasets used in this paper
We use two basic CNNs, VGG-16 and ResNet-18, which achieved state-of-the-art performance on ImageNet. VGG-16 has 16 layers, including 13 convolutional layers with ReLU activations and 3 fully connected layers. ResNet-18 has four basic conv blocks, each of which includes two residual units. A weight decay of 0.000 1 and a momentum of 0.9 are used in training, and the initial learning rate is set to 0.01. Both networks are initialized with parameters pretrained on ImageNet. During training, the learning rate is divided by 10 every 20 iterations. The training stage uses pairs of images as inputs, whereas in the test stage only a single image is input to measure the accuracy.
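A sketch of this training schedule with the stated hyperparameters; that the optimizer is SGD and that the decay is a step schedule are our assumptions based on the momentum and decay values reported above.

```python
import torch

model = torch.nn.Linear(512 * 3, 200)   # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# learning rate divided by 10 every 20 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```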
3.2 Experiments on CUB-200-2011
Table 2 shows the comparison results on the CUB-200-2011 dataset. The training images are randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. In HACBP, the baseline network is VGG-16, and the three attention models are built on the last three convolutional layers of VGG-16.
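Our reading of the reported preprocessing as torchvision transforms; the resize size, the horizontal flip, and the ImageNet normalization are assumptions not stated in the text.

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.Resize(256),                      # assumed resize before cropping
    T.RandomCrop(224),                  # random 224x224 training crops
    T.RandomHorizontalFlip(),           # assumed augmentation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
test_tf = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),                  # 224x224 center crop at test time
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```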
Table 2 Comparison results on CUB-200-2011 dataset
Table 2 shows that the HA model boosts performance over the original models. The accuracy of CBP[9] is 72.75%, while that of HACBP is 73.50%. With ResNet-18 as the baseline, the accuracy of HBP[18] is 69.69% and that of HAHBP is 71.30%. We then evaluated the DNet: when DNet is added on top of CBP, the accuracy reaches 74.56%, the best result. We also tested some recent methods, such as PHPQ[19] and GCA[20]. PHPQ[19] captures and retains fine-grained semantic information in multi-level features; GCA[20] divides the data into small clusters and then aggregates them. As shown in Table 2, the proposed approach on CBP achieves the highest classification accuracy without object or part annotations.
We also visualize the whole process from the original image to the output of the attention model (Fig. 4). The first image is the original image, the second is the output of data augmentation, the third is the output of the attention map, and the last shows the box of the extracted features. Overall, the DNet achieves a significant improvement.
Fig. 4 Process from input to attention model on CUB-200-2011
The classification accuracy on the Stanford dogs dataset is summarized in Table 3. The training images are randomly cropped into 224×224 patches, and the test images are processed with 11 center crops (each of size 224×224). Following the same procedure, we added the HA model to different networks and obtained impressive gains. The accuracy of HACBP (78.74%) is higher than that of CBP[9], and the accuracy of HAHBP (78.26%) is higher than that of HBP[18]. The accuracy of DNet on CBP is the highest at 80.02%, which is 0.19% higher than PHPQ[19].
Table 3 Comparison results on Stanford dogs dataset
Figure 5 shows the transformation process of a sample image, from the original image to the output after data augmentation. The main feature points are then obtained from the attention map. The proposed attention model can extract salient features from the discriminative region.
Fig. 5 Process from input to attention model on Stanford dogs
Table 4 shows the comparison results on the Oxford-flower-17 dataset. The training images are also randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. The accuracy of HACBP is 89.41%, compared with 89.12% for the original CBP[9]; the accuracy of HAHBP is 90.15%, compared with 88.09% for the original HBP[18]. The accuracy of DNet on CBP is the highest at 95.15%.
Table 4 Comparison results on Oxford-flower-17 dataset
We also show the process for a sample image from the original input to the attention model (Fig. 6). First, the attention map is obtained from the designed attention model; then, salient features are extracted from the red box.
Fig. 6 Process from input to attention model on Oxford-flower-17
The proposed methods are also applied to disaster-scene estimation, covering earthquakes, tsunamis, and tornadoes, which are extremely complex at a finer scale. The database contains 1 757 color images of three disaster types: 760 earthquake images, 556 tsunami images, and 441 tornado images. Specifically, the earthquake data come from the 2010 Haiti earthquake and the 2011 Christchurch, New Zealand earthquake; the tsunami data come from the 2011 Tohoku tsunami in Japan and the 2004 Indonesian tsunami; and the tornado data were taken after the 2013 Moore tornado in Oklahoma, USA.
The annotation and labeling of the dataset were done by our team at the University of Missouri-Kansas City. Additionally, we divided the dataset into three damage levels, DL1, DL2, and DL3, which stand for no damage, mild damage, and severe damage, respectively.
The different methods are compared using VGG-16 as the baseline. The training images are randomly cropped into 224×224 patches, and the test images are center-cropped to 224×224. Data augmentation is also used to increase the number of images. The results are shown in Tables 5 and 6. Table 5 reports the prediction accuracy for the different disaster scenes, which are divided into three types (tornado, earthquake, and tsunami), each with training and test data.
Table 5 Prediction accuracy of different disaster-scenes (base model: VGG-16)
Table 6 Prediction accuracy of damage level in different disasters
In Table 5, the original CBP[9] achieves 86.58% accuracy, and HACBP achieves 86.89%. We also test the paired network with the attention model by applying DNet to CBP; the accuracy of DNet on CBP is the highest at 88.47%.
Table 6 reports the prediction accuracy of the damage levels. The first column is the prediction of tornado damage level, the second column the earthquake damage level, and the third column the tsunami damage level. Each disaster type is divided into three levels: no damage, mild damage, and severe damage.
In Table 6, the prediction of tornado damage level is the best: the original CBP[9] achieves 84.10% accuracy, HACBP achieves 87.12%, and DNet on CBP achieves 87.33%. The predictions for earthquake and tsunami damage levels are less satisfactory; although all exceed 50% accuracy, they require further research.
4 Conclusions
In this paper, we propose a hierarchical attention DNet for weakly supervised fine-grained image classification, which jointly integrates hierarchical attention and the dual network. The experimental results show that the hierarchical attention improves multi-scale feature learning and that the DNet learns discriminative features by comparing paired images. In conclusion, the proposed method performs well in fine-grained image classification without bounding box or part annotations, and it can be readily combined with various neural network models.
References
[1] Jagiello Z, Reynolds S J, Nagy J, et al. Why do some bird species incorporate more anthropogenic materials into their nests than others?[J]. Philosophical Transactions of the Royal Society B: Biological Sciences, 2023, 378(1884). DOI: 10.1098/rstb.2022.0156.
[2] Yang X D, Su J S, Qu Y X, et al. Dissecting the inheritance pattern of the anemone flower type and tubular floral traits of chrysanthemum in segregating F-1 populations[J]. Euphytica, 2023, 219(1): 16. DOI: 10.1007/s10681-022-03141-6.
[3] Park C, Moon N. Dog-species classification through CycleGAN and standard data augmentation[J]. Journal of Information Processing Systems, 2023, 19(1): 67-79. DOI: 10.3745/JIPS.02.0190.
[4] Lam M, Mahasseni B, Todorovic S. Fine-grained recognition as HSnet search for informative image parts[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. IEEE, 2017: 6497-6506. DOI: 10.1109/CVPR.2017.688.
[5] He X T, Peng Y X. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification[C]//Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. February 4-9, 2017, San Francisco, California, USA. ACM, 2017: 4075-4081. https://dl.acm.org/doi/10.5555/3298023.3298160.
[6] He X T, Peng Y X, Zhao J J. Fine-grained discriminative localization via saliency-guided faster R-CNN[C]//Proceedings of the 25th ACM International Conference on Multimedia. October 23-27, 2017, Mountain View, California, USA. ACM, 2017: 627-635. DOI: 10.1145/3123266.3123319.
[7] Zhou L, Wang W Q, Lyu K. FACR: a fast and accurate car recognizer[J]. Journal of University of Chinese Academy of Sciences, 2021, 38(1): 130-136. DOI: 10.7523/j.issn.2095-6134.2021.01.016 (in Chinese).
[8] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNNs for fine-grained visual recognition[EB/OL]. 2015: arXiv: 1504.07889. (2015-04-29)[2023-12-02]. http://arxiv.org/abs/1504.07889.
[9] Gao Y, Beijbom O, Zhang N, et al. Compact bilinear pooling[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA. IEEE, 2016: 317-326. DOI: 10.1109/CVPR.2016.41.
[10] Kong S, Fowlkes C. Low-rank bilinear pooling for fine-grained classification[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. IEEE, 2017: 7025-7034. DOI: 10.1109/CVPR.2017.743.
[11] Nie X, Chai B S, Wang L Y, et al. Learning enhanced features and inferring twice for fine-grained image classification[J]. Multimedia Tools and Applications, 2023, 82(10): 14799-14813. DOI: 10.1007/s11042-022-13619-z.
[12] Özdemir C. Avg-topk: new pooling method for convolutional neural networks[J]. Expert Systems with Applications, 2023, 223: 119892. DOI: 10.1016/j.eswa.2023.119892.
[13] Li X J, Yang C, Chen S L, et al. Semantic bilinear pooling for fine-grained recognition[C]//2020 25th International Conference on Pattern Recognition (ICPR). Milan, Italy. IEEE, 2021: 3660-3666. DOI: 10.1109/ICPR48806.2021.9412252.
[14] Ji R Y, Li J Y, Zhang L B. Siamese self-supervised learning for fine-grained visual classification[J]. Computer Vision and Image Understanding, 2023, 229: 103658. DOI: 10.1016/j.cviu.2023.103658.
[15] Pan W Y, Yang S Y, Qian X H, et al. Learn more: sub-significant area learning for fine-grained visual classification[C]//2023 IEEE International Conference on Image Processing (ICIP). Kuala Lumpur, Malaysia. IEEE, 2023: 485-489. DOI: 10.1109/ICIP49359.2023.10222241.
[16] Jung Y, Syazwany N S, Kim S, et al. Fine-grained classification via hierarchical feature covariance attention module[J]. IEEE Access, 2023, 11: 35670-35679. DOI: 10.1109/ACCESS.2023.3265472.
[17] Xie J J, Zhong Y J, Zhang J G, et al. A weakly supervised spatial group attention network for fine-grained visual recognition[J]. Applied Intelligence, 2023, 53(20): 23301-23315. DOI: 10.1007/s10489-023-04627-z.
[18] Yu C J, Zhao X Y, Zheng Q, et al. Hierarchical bilinear pooling for fine-grained visual recognition[C]//Computer Vision - ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI. 2018: 595-610. DOI: 10.1007/978-3-030-01270-0_35.
[19] Zeng Z Y, Wang J P, Chen B, et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval[J]. Pattern Recognition Letters, 2024, 178: 106-114. DOI: 10.1016/j.patrec.2023.12.022.
[20] Otholt J, Meinel C, Yang H J. Guided cluster aggregation: a hierarchical approach to generalized category discovery[C/OL]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2024: 2618-2627. [2024-03-12]. https://openaccess.thecvf.com/content/WACV2024/html/Otholt_Guided_Cluster_Aggregation_A_Hierarchical_Approach_to_Generalized_Category_Discovery_WACV_2024_paper.html.