Journal of Ocean University of China  2023, Vol. 22 Issue (3): 665-674  DOI: 10.1007/s11802-023-5296-z

Citation  

YANG Yuyi, CHEN Liang, ZHANG Jian, et al. UGC-YOLO: Underwater Environment Object Detection Based on YOLO with a Global Context Block[J]. Journal of Ocean University of China, 2023, 22(3): 665-674.

Corresponding author

CHEN Liang, E-mail: kentchen@163.com.

History

Received December 12, 2021
Revised September 8, 2022
Accepted October 26, 2022
UGC-YOLO: Underwater Environment Object Detection Based on YOLO with a Global Context Block
YANG Yuyi, CHEN Liang, ZHANG Jian, LONG Lingchun, and WANG Zhenfei
School of Information and Electrical Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
Abstract: With the continuous development and utilization of marine resources, underwater target detection has gradually become a popular research topic in the field of underwater robot operations and target detection. However, it is difficult for detection algorithms to combine environmental semantic information with the semantic information of targets at different scales in the complex underwater environment. In this paper, a cascade model based on the UGC-YOLO network structure with high detection accuracy is proposed. The YOLOv3 convolutional neural network is employed as the baseline structure. By fusing the global semantic information between two residual stages in the parallel structure of the feature extraction network, the perception of underwater targets is improved and the detection rate of hard-to-detect underwater objects is raised. Furthermore, deformable convolution is applied to capture long-range semantic dependencies, and PPM pooling is introduced in the highest network layer to aggregate semantic information. Finally, a multi-scale weighted fusion approach is presented for learning semantic information at different scales. Experiments are conducted on an underwater test dataset, and the results demonstrate that the proposed algorithm can detect aquatic targets in complex degraded underwater images. Compared with the baseline network, the proposed algorithm improves the Common Objects in Context (COCO) evaluation metric by 4.34%.
Key words: object detection    underwater environment    semantic information    semantic features    deep learning algorithm    
1 Introduction

As the most extensive body of water on Earth, the ocean accounts for about 71% of the Earth's surface area. Its exploration and utilization play an important role in human survival and development. In the past, traditional methods of underwater image analysis required a large amount of a priori information, and multiple sensors were usually deployed on the vehicle for information fusion, leading to a high cost of underwater detection. With the development of deep learning, effective methods for detecting aquatic targets through optical vision sensors can assist underwater robots in environmental perception and improve the accuracy of target detection (Chen et al., 2022; Wang et al., 2022a; Wang et al., 2022b).

Deep learning-based detection algorithms can be categorized according to whether region-of-interest extraction is performed. Two-stage algorithms (such as Ren et al., 2016; Wan et al., 2020) use region proposal networks and anchor mechanisms to enhance the effectiveness of target detection. However, the real-time requirements of underwater robots make single-stage algorithms the mainstay of target detection for underwater robots (Yang et al., 2021a; Yu et al., 2021; Zhou et al., 2022). Target detection algorithms for underwater robots are limited by high underwater turbidity and a lack of salient features, which can prevent the extraction of effective image features and render underwater single-stage detection algorithms ineffective. The existing effective solution is to provide convolutional neural networks (CNNs) with an efficient computing process: the network topology is reconfigured to make parameter gradient propagation more efficient, and multiscale feature information is combined to improve detection accuracy. A feature fusion structure with a cross-feature map (Liu et al., 2020) was developed by fusing high-dimensional feature information to obtain better feature crossover capability, but it failed to provide effective environmental semantic information in the feature extraction stage, which compromises accuracy. Li et al. (2020) used DenseNet to reduce the feature loss of convolutional operations in order to adapt to the complex environment on the water surface, but this method also failed to maintain high accuracy gains in underwater environments with diverse target morphologies. The Single Shot Detection (SSD) network algorithm applied to shallower features (Liu et al., 2016; Jiang and Wang, 2020) has been used to enhance the detection of small targets, but its fusion of multi-scale information is rigid; with the global information of the feature map incompletely utilized, the detection accuracy of the algorithm is not significantly improved.

A network algorithm that combines underwater global features to fuse high-level semantic information across residual stages is proposed in this paper, using YOLOv3 as the baseline. Deformable convolution is applied to extract effective downsampled features, and pyramid pooling is applied to aggregate multiscale semantic information as a supplement. A soft fusion approach for multi-scale features with learnable weights is introduced, effectively improving the detection capability for underwater targets.

2 Related Works

2.1 Self-Attention Algorithm

According to the biological model of human vision, in a cluttered foreground environment, human vision focuses on a specific region that is usually information-rich. The attention mechanism processes excessive information and assigns higher weights to features in a priori calibrated regions by combining internal experience with external perception.

Therefore, self-attention network structures such as convolutional block attention modules (CBAMs) and non-local (NL) networks have been proposed successively (Yang et al., 2021b; Yin et al., 2021). The former weights the input feature layer at two levels through serial channel attention and spatial attention algorithms. The latter mitigates the effects of long-distance dependency by capturing correlations among local data or features. A CBAM is implemented with two global computations, often needs to integrate all nodes of the network, and involves complex calculation. The NL network, by contrast, establishes a specific global context dependency for each data element. Experiments have shown that the global context is largely independent of the query position, so traversing every data point to compute its own context information is redundant. Therefore, it is necessary to build a network model that reduces unnecessary computation while retaining the context information extraction capability of an NL network.

2.2 Semantic Feature Expression

In a deep convolutional neural network, regional feature operations carried out by standard convolution kernels obtain increasingly distinct semantic information as the network deepens. Image information has strong spatial location constraints, and a network built from standard convolutions can, to a certain extent, adapt to the row-and-column structure of images. However, the spatial construction of the convolution kernel still leaves room for improvement in the feature extraction ability for targets with changeable shapes.

At present, effective improvement schemes include the 1 × 3 and 3 × 1 convolution structures proposed in the asymmetric convolution network (ACNet) (Ding et al., 2019). Recent evidence suggests that summing the outputs of convolving a feature map with different convolution kernels gives the same result as adding the kernels point by point and performing a single convolution, so the fused kernel incurs no extra computation. The feature extraction capability can thus be enhanced without increasing the amount of computation, and this method has a certain robustness to flipped and rotated targets.
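This additivity can be checked directly: convolution is linear in its kernel, so summing the outputs of two kernels equals one convolution with their point-wise sum. The PyTorch sketch below demonstrates this with a 3 × 3 kernel and a 1 × 3 kernel zero-padded to 3 × 3; all shapes are illustrative, not the paper's network.

```python
# Minimal check of the convolution additivity that ACNet relies on.
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)           # dummy feature map
k1 = torch.randn(16, 8, 3, 3)           # a 3x3 kernel
k2 = torch.zeros(16, 8, 3, 3)
k2[:, :, 1, :] = torch.randn(16, 8, 3)  # a 1x3 kernel zero-padded into 3x3

out_sum = F.conv2d(x, k1, padding=1) + F.conv2d(x, k2, padding=1)
out_fused = F.conv2d(x, k1 + k2, padding=1)  # single fused convolution

print(torch.allclose(out_sum, out_fused, atol=1e-4))  # True
```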

However, for underwater image targets, this method yields no significant improvement. A possible explanation is that targets often merge with surroundings such as silt and gravel through shielding, where such individuals and their backgrounds are often symmetrically consistent. Considering this, a convolution kernel scheme that better expresses the features is adopted to extract image feature information. Feature fusion is carried out with semantic information from a high-level feature map to provide more significant input features for the attention module. The performances and limitations of the different algorithms in related works are listed in Table 1.

Table 1 Performances and limitations of enhanced semantic expression methods
3 The Proposed Method

3.1 The Network Structure

Redmon et al. (2016) first proposed the YOLO series of algorithms in 2015, introducing end-to-end detection with an anchor-free method. YOLOv3 increases the number of detection frames per unit time and improves detection accuracy by combining current mainstream network modules with a well-designed anchor mechanism.

On the basis of the above works, this paper proposes an improvement of the network structure, taking the underwater environment as the starting point to improve the detection accuracy and adaptability of the network in this scene. Underwater scenes are often distorted and degraded, and the shapes of aquatic creatures are inconsistent. The prior boxes obtained by common clustering methods cannot guarantee effective extraction of target feature information. Therefore, this paper introduces deformable convolution into the convolution kernel to adapt to the different morphological features of aquatic organisms and extract targeted features.

Organisms in the underwater environment often use the environment to protect themselves, which complicates the detection process. The density problem caused by target clusters and the occlusion problem caused by water quality and sediments are the main limiting aspects of image detection. In this paper, the exposed part of the target is treated as the key feature for extraction. When extracting the target in the prior box, the association between the target and the scene should be captured by combining the surrounding environment information, and an image context information model should be established. Therefore, information fusion is carried out on the feature layers of the last three stages of the YOLOv3 structure, and the results are used as the input of the global context module. The network topology diagram is shown in Fig. 1.

Fig. 1 Illustration of the algorithm network structure.

As shown in Fig. 1, the blue dashed line indicates the optimized jump-weighted fusion method used to fuse multiscale features, where the dimensionality is adjusted by 1 × 1 convolution and the robustness is enhanced by feature sparsification with a stride of 2. To enable the network to focus on underwater targets with variable shapes during down-sampling, a deformable convolution is introduced to enhance feature extraction; it contains a convolution kernel offset branch for locating the effective convolution region. The fused feature maps are used as the input of the global contextual attention module. The multi-scale feature information is context-modeled by dimensional transformation and point-wise convolution to obtain the global feature information and generate the attention matrix for information transfer. The low-level feature maps of the network have high resolution and contain a large amount of detailed information. This structure allows long-range dependency computation at different semantic levels, optimizes the semantic association between targets and the environment, and incorporates the influence of the surrounding environment when detecting targets. Finally, the pyramid pooling module is used in the high-level feature layer to mitigate the loss of fine-grained information and enhance the spatial mapping capability of the scene.

3.2 Deformable Convolution Block

At present, most units of deep convolutional network structures are still standard convolutions, with 1 × 1 and 3 × 3 kernels used as calculation units to reduce the number of floating-point operations on the device; however, this does not greatly improve the ability to extract features. The receptive field of a standard convolution kernel is a fixed rectangular structure, which limits the ability to extract target features. Thus, the deformable convolution block was proposed (Dai et al., 2017), in which the offsets of the convolution kernel elements are learned as parameters. The feature extraction capability of the network is enhanced by adjusting the convolution kernel with these bias parameters. The structure of the deformable convolution block is shown in Fig. 2.

Fig. 2 Architecture of the deformable convolution block.

The structure is divided into two layers: the upper branch learns the offset parameters from the input feature maps, and the lower branch combines the input feature maps with the deformable convolution calculation to obtain the output. This method can adapt to underwater targets with different sizes and shapes and does not require much prior expert knowledge for data enhancement.
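A minimal PyTorch sketch of this two-branch layout is shown below, using torchvision's DeformConv2d; the channel sizes and input resolution are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # upper branch: learns a 2D offset for every kernel element
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        # lower branch: samples the input at the learned offset positions
        self.dconv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.dconv(x, self.offset(x))

y = DeformableBlock(64, 128)(torch.randn(1, 64, 52, 52))
print(y.shape)  # torch.Size([1, 128, 52, 52])
```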

3.3 Attention Block

To capture salient features of underwater targets, detection algorithms are usually combined with a self-attention mechanism. This performs well in common detection tasks but has some limitations. 1) The neuron composition is complex, and the calculation parameters are difficult to simplify. 2) Excessive network branches make optimization difficult. 3) Common attention mechanisms only establish semantic relations within a block or weight-relation models over multi-dimensional space.

Compared with the widely used CBAM, this paper focuses on the role of the global context attention block in underwater detection scenarios. Global context attention can optimize the computation of long-distance dependencies, model the semantics of the whole image, and simplify the network branching. This paper adds the module to the network structure and makes targeted improvements. On the one hand, the global context module simplifies the NL structure to address its redundancy. On the other hand, a mask convolution layer is used to obtain a semantic mask map of the context, and only two convolution transforms are used to compress the channel dimension and reduce the number of calculation parameters. In addition, to recover accuracy after the computational load is reduced, the global context attention module is designed with reference to the classical squeeze-and-excitation (SE) block, in which the feature map fully integrates the contextual semantic information.

In other words, the global context attention module reduces the complexity of the model to achieve more robust results based on the non-local algorithm. The architecture of the global context attention module is shown in Fig. 3, where C × H × W denotes the channels, height, and width of the input feature map, and ⊕ denotes broadcast element-wise addition. The module enhances the feature expression of the model, which to some extent overcomes the limitations of the convolution kernel and enlarges the receptive field at each depth.

Fig. 3 Architecture of the global context attention block.

This module focuses on the weighted average values between features acquired from the overall information of the image. To calculate the response of a certain position in the image sequence, the following formula can be used.

$ y_i = \frac{1}{C(x)} \sum\limits_{\forall j} f(x_i, x_j)\, g(x_j), $ (1)

where xi and yi represent the coordinate and the response, respectively, at a certain position i in the image, C(x) is the normalization factor, f is the scalar similarity measure between two coordinate positions, and g computes the feature representation of the input signal at position j. g(xj) is a mapping function by which points are mapped to feature vectors, and the positions can be modelled by a learnable weight matrix Wg.

$ g(x_j) = \boldsymbol{W}_g\, x_j. $ (2)

After the point mapping, the relations between a position and each other point are calculated by their similarity. An embedded Gaussian model is used to compute the similarity, and a Softmax function generates the attention weights in the network structure to obtain contextual features. Meanwhile, the above weight matrix can be extracted so that the terms become position-independent and can be designed as a feature transfer structure. Therefore, the global context can be expressed as follows.

$ z_i = x_i + W_{v2}\, \text{ReLU}\!\left(\text{BN}\!\left(W_{v1} \sum\limits_{j = 1}^{N_p} \frac{e^{W_k x_j}}{\sum_{m = 1}^{N_p} e^{W_k x_m}}\, x_j\right)\right). $ (3)

Wv1 and Wv2 form a two-layer bottleneck structure, and the size of the input feature map is H × W. In the bottleneck structure, the channel dimension is scaled by a factor r, and the double convolution improves performance; however, it also increases the difficulty of optimization due to the weight transfer. Therefore, a data normalization operation is added before the nonlinear rectified linear unit (ReLU) activation function, increasing the generalization ability. The residual connection preserves previously learned features to prevent gradient problems during propagation.
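The following PyTorch sketch implements Eq. (3) under stated assumptions: the channel scaling factor r = 16 is a common default rather than a value from the paper, and LayerNorm stands in for the BN written in Eq. (3) because the pooled context tensor has a 1 × 1 spatial size (as in the original GCNet).

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Sketch of Eq. (3): softmax context pooling (W_k), a bottleneck
    transform (W_v1, normalization, ReLU, W_v2), and a broadcast
    residual addition."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(channels // r, 1)                      # scaling factor r
        self.wk = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits
        self.softmax = nn.Softmax(dim=2)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),     # W_v1
            # Eq. (3) writes BN; LayerNorm is used here because the
            # pooled context tensor is 1x1 in space (as in GCNet).
            nn.LayerNorm([mid, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),     # W_v2
        )

    def forward(self, x):
        n, c, h, w = x.shape
        attn = self.softmax(self.wk(x).view(n, 1, h * w))           # (n,1,HW)
        ctx = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2))  # (n,c,1)
        return x + self.transform(ctx.view(n, c, 1, 1))             # z_i

print(GlobalContextBlock(256)(torch.randn(1, 256, 13, 13)).shape)
```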

3.4 Learnable Semantic Fusion

The essence of the attention mechanism lies in extracting the spatial weights of the output features between dimension transformations. To highlight key information, the network should disregard irrelevant features; this is the core of the mechanism. In terms of weight allocation, a self-attention network uses information from different levels to extract features from the whole image and reduce training costs. From another perspective, the fusion of semantic features is an important part of the self-attention mechanism. In DenseNet, information transmission is maximized by dense mappings between the output feature maps and the residual blocks of the extraction network (Qiu et al., 2022), which has been shown to make full use of the output of each residual block and realize weighted-coefficient fusion. Experiments in this paper show that extracting the outputs between two residual stages achieves a similar effect. A cross-stage fusion structure is therefore proposed using the output dimensions of the YOLOv3 residual stages. The learnable semantic fusion structure designed in this paper is shown in Fig. 4.

Fig. 4 Illustration of the improved feature semantic fusion architecture.

Fig. 4 shows the overall framework and its details. This paper defines a method, different from the traditional weighted feature map, to establish feature relations between different stages. By assigning the weights, we aim to highlight the expressive ability of the features extracted by the network. Following the residual formulation, the semantic fusion structure can be expressed as follows.

$ Z_{\text{out}} = \mathbb{F}_c(x_i) + \alpha \cdot \delta(x_1), $ (4)

where $ \mathbb{F}(\cdot) $ is the basic residual block sequence expression, c is the network depth between residual stages, α is the learnable parameter, and δ(·) is the jump sampling and dimensional stretching transformation for the feature graph of the last stage.
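A minimal sketch of Eq. (4) is given below; the zero initialization of α and the realization of δ(·) as a strided 1 × 1 convolution (matching the dimension adjustment and stride-2 sparsification described for Fig. 1) are assumptions.

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Sketch of Eq. (4): Z_out = F_c(x_cur) + alpha * delta(x_prev)."""
    def __init__(self, stage_blocks, prev_ch, out_ch):
        super().__init__()
        self.stage = stage_blocks                   # F_c: residual block sequence
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable fusion weight
        # delta: strided 1x1 conv = jump sampling + dimensional stretching
        self.delta = nn.Conv2d(prev_ch, out_ch, kernel_size=1, stride=2)

    def forward(self, x_cur, x_prev):
        return self.stage(x_cur) + self.alpha * self.delta(x_prev)

stage = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
fuse = LearnableFusion(stage, prev_ch=64, out_ch=256)
x_prev = torch.randn(1, 64, 26, 26)    # previous-stage output
x_cur = torch.randn(1, 128, 26, 26)    # current-stage input
print(fuse(x_cur, x_prev).shape)       # torch.Size([1, 256, 13, 13])
```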

3.5 PPM Block

In the underwater environment, the ability to detect targets is affected in the same manner as described above. Considering that aquatic products are difficult to detect due to environmental constraints, the ability to analyze such scenes should be strengthened. Currently, popular scene parsing frameworks are usually based on Feature Pyramid Networks (FPNs), and their disadvantages are evident.

1) Lost relation matching. Context relation matching is vital for complex scenes. For instance, sea cucumbers in gravel, whose texture and colour are similar to the loose sediments on the seabed, are often falsely identified.

2) Confused recognition. Sea urchins tend to cluster together and attach to rock walls. In this state, rock blocks close to the targets are easily misidentified as sea urchins.

3) Missed overlapping and small targets. In the blurred visual range, the FPN framework is more likely to accept the intersections of biological targets with the image border than large targets, and it also ignores small targets. For example, creatures closer to the camera visually overlap with those farther away and cannot be separated. To alleviate these shortcomings of the common FPN and to effectively process inter-scene and global information, a pyramid pooling module (PPM) (He et al., 2017) is used at the high-dimensional semantic feature layer to upgrade the parsing ability for high-level semantic scenes in the 3-layer pyramid of the YOLO network.

The PPM structure, shown in Fig. 5, is based on the FPN framework and the nesting of scene information features: the feature maps output by convolution blocks undergo multiscale partition pooling to aggregate high-level semantic information from different regions, and bilinear interpolation is used for upsampling to unify the dimensions. In this way, the fixed-size constraint of the standard convolutional receptive field is removed, and the information loss across regions is reduced.

Fig. 5 PPM structure optimized for scenarios.

In this paper, four PPM branches with 1 × 1, 3 × 3, 6 × 6 and 8 × 8 bins are designed based on the biological sizes in the data set and the number of feature pyramid layers of the scene. In addition, the cross-stage semantic information fusion method simultaneously stacks the feature outputs of different residual stages after upsampling. This differs from common global average pooling, which handles global information as a whole and easily blurs spatial relations. The results of pyramid pooling are used for dimensional integration, incorporating the multiscale global information of different regions, which strengthens the semantic association of the output features of the three-layer feature pyramid in YOLOv3.
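The sketch below shows a four-branch PPM with the 1 × 1, 3 × 3, 6 × 6 and 8 × 8 bins named above; the channel-reduction scheme and BN/ReLU placement are conventional choices, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Sketch of the four-branch pyramid pooling on the top feature layer."""
    def __init__(self, in_ch, bins=(1, 3, 6, 8)):
        super().__init__()
        branch_ch = in_ch // len(bins)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                     # bxb region pooling
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),  # channel reduction
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [x] + [
            F.interpolate(b(x), size=(h, w), mode="bilinear", align_corners=False)
            for b in self.branches
        ]
        return torch.cat(outs, dim=1)  # concat multi-scale context with input

print(PPM(512)(torch.randn(2, 512, 13, 13)).shape)  # (2, 1024, 13, 13)
```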

4 Experimental Verification and Discussion

4.1 Data Analysis and Preprocessing

The video images used for underwater environmental operation in this study were captured in the Bohai Sea, China, and the images that include targets were retained. The image set includes clear imaging, motion-blurred imaging, low-illumination imaging, and gamut-distorted imaging, among other types. The images were compiled into a data set after the targets were manually labelled as training labels. Aquatic creatures in the data set include sea cucumbers, sea urchins, starfish and shells. Due to camera motion and the forest of reef rocks, the target images often include streamers, ghosts and large-light-ratio pixels. In the underwater environment, the long wavelength of red light makes it attenuate faster in water, resulting in chromatic deviation of the image. Suspended media in the water can be treated analogously to those in atmospheric imaging: they attenuate the direct signal from the scene at close range and add backscatter signals reflected by suspended objects (Raihan et al., 2021; Zhang et al., 2021). Therefore, this paper estimates the computational cost of restoring the degradation of the imaging data set. The parameters of image preprocessing are set with reference to the underwater degradation model. The dark channel restoration and adaptive histogram equalization methods are applied to a certain proportion of the images. A preview of the processed results is shown in Fig. 6.

Fig. 6 Data-enhancement results.

The enhanced effects can be observed in the restored images. Contrast-limited histogram equalization can enhance the dynamic range of images in underwater environments with high light ratios and uneven lighting, improve image details, stretch the data values of various foregrounds to a wider distribution range, and strengthen feature expression. Dark channel restoration based on a physical model can effectively mitigate the degradation caused by underwater suspended objects. Seemingly, the reason for a poor imaging effect may be the absence of adaptive brightness adjustment and colour balance; in fact, the data distribution is disrupted by subsequent image enlargement (Feng et al., 2021). For data used in target detection, the dark channel-based recovery algorithm better reflects the data distribution of the real scene and can effectively filter out interference noise.
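As a sketch of the contrast-limited equalization half of this preprocessing (the dark channel restoration step is omitted), OpenCV's CLAHE can be applied to the luminance channel; the clip limit and tile grid below are illustrative values, not the paper's parameters.

```python
import cv2
import numpy as np

def clahe_enhance(bgr, clip=2.0, grid=(8, 8)):
    """Contrast-limited adaptive histogram equalization on the L channel."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip, tileGridSize=grid).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

img = (np.random.rand(416, 416, 3) * 255).astype("uint8")  # stand-in image
out = clahe_enhance(img)
```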

4.2 Data Augmentation

The dataset is acquired by extracting images from video frames. Large deep convolutional neural networks require large datasets as the driver. However, it is difficult to sample underwater scenes in real time, and the collected data cannot realistically cover all types of underwater scenes. The generalization ability of the network can be enhanced by adding noise and by distorting or enhancing the images. Therefore, the dataset is augmented and the images are adjusted; after data pre-processing, the dataset is expanded to 3.75 times its original size, as shown in Table 2 and sketched after it.

Table 2 Data-augmentation scheme
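A hedged sketch of an image-level augmentation pipeline follows; the paper's exact scheme and its 3.75× expansion factor are specified in Table 2, so every operation and parameter here is illustrative only.

```python
import torch
from torchvision import transforms

# Illustrative augmentations; for detection data, geometric transforms
# such as flips must also update the bounding-box labels accordingly.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    # additive Gaussian noise to mimic sensor and turbidity noise
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),
])
```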

In terms of hyperparameter settings, the PyTorch deep learning framework is used for training, and the computing device is a host with an Nvidia GTX1660 GPU. The weight parameters are optimized by the stochastic gradient descent (SGD) algorithm with a momentum of 0.9, a weight decay of 0.005, and an initial learning rate of 0.003. The original YOLOv3 model is trained on 416 × 416 inputs.
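These reported hyperparameters map directly onto a standard PyTorch optimizer setup; the placeholder model below stands in for the UGC-YOLO network built elsewhere.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder; the real network is UGC-YOLO
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.003,           # initial learning rate
    momentum=0.9,
    weight_decay=0.005,
)
```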

4.3 Transfer Learning

The data are processed by image preprocessing and augmentation and then fed into the deep feedforward neural network. COCO pretraining model parameters are used for transfer learning, and the network detector is fine-tuned while the learning rate is gradually expanded. In this paper, YOLOv3 is implemented as the baseline network, and an ablation experiment is used to verify the proposed method and each added module. Then, other popular algorithms are tested on the image data taken in this environment, and the mean average precision (mAP) is used to evaluate the quality of the target detection algorithms; the results are compared to verify the theoretical and practical feasibility of the proposed method. In the evaluation, mAP@0.5 (the mAP when the intersection-over-union (IOU) threshold is 0.5) is used to evaluate the performance of the algorithms. The performances of YOLOv3, YOLOv3-CBAM, YOLOv3-GC and the improved YOLOv3 of this paper on underwater images are compared, with training and verification conducted on a GTX1660 graphics processing unit (GPU). As shown in Table 3, the fused attention module and pooling pyramid module implemented in this paper improve the ability of the baseline network to learn key feature information, and they also have advantages over common algorithm modules in the final detection results. The modules in this paper improve the detection performance of the target detection algorithm in the underwater scene.
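For reference, mAP@0.5 counts a detection as a true positive when its IOU with a ground-truth box reaches 0.5; a minimal IOU function is sketched below, with corner-format (x1, y1, x2, y2) boxes assumed.

```python
def iou(box_a, box_b):
    """IOU of two (x1, y1, x2, y2) boxes; under mAP@0.5 a detection is a
    true positive when its IOU with a ground-truth box is >= 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```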

Table 3 Ablation experiment by using transfer learning
4.4 Training Performance on the Data Set

In this paper, UGC-YOLO is trained on the data sets. Fig. 7 shows the precision and recall of each aquatic product category under this algorithm (PR diagrams). Fig. 8 shows the detection results of UGC-YOLO. UGC-YOLO detects most targets in the underwater environment well, but its performance on poorly imaged, densely populated targets is not as good as on clearly exposed targets with significant biological characteristics, because UGC-YOLO divides the image into detection grids. The improvement is aimed at distinguishing the foreground from the background; if foreground targets overlap drastically, the occluded targets are easily ignored.

Fig. 7 PR diagrams of the UGC-YOLO algorithm.
Fig. 8 AP curves and detection result maps.

This paper also compares the performance of each model on the divided test data sets in this scenario, as presented in Table 4. The image sizes are set according to each algorithm's recommendations. Among similar detection works, YOLOv4, as the culmination of this series, has succeeded in learning rate scheduling, optimizer improvement, data enhancement, structure reconstruction, loss design and many other aspects, and it is superior to UGC-YOLO with respect to accuracy. However, the UGC-YOLO algorithm achieves good detection results without changing the basic framework, representing a compromise between accuracy and construction cost.

Table 4 Comparison of common detection algorithms
4.5 Attention Visualization

In the research process, visualizing feature activations on high-level semantic feature maps helps to intuitively assess the feature activation after the attention modules are added and the contextual semantic relations are enhanced. The feature activation diagrams show that the network highlights feature extraction in the regions of interest during transmission; such extraction therefore pays more attention to the dependence of the target decision on environmental information. The original network performs well on clear single targets but is greatly affected by environmental interference. In this paper, the network is improved to reduce the misjudgments caused by degraded environments, as shown in Fig. 9.

Fig. 9 Feature activation graphs.

For a given input image, the feature activation map constructs basic elements according to the weights, highlights the actual areas that deserve enhanced focus, and assigns the weight of each position to the original feature data to help suppress irrelevant information. The attention mechanism is therefore applied to strengthen the model's focus on the regions of interest; meanwhile, the context extraction structure is reinforced, improving the detection accuracy of the model.
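One simple way to produce such feature activation maps is to average a high-level feature tensor over channels, upsample it to the input size, and normalize it for overlay; the sketch below follows this scheme (it is not Grad-CAM), with the 416 × 416 output size assumed from the training input.

```python
import torch
import torch.nn.functional as F

def activation_map(feature, out_size=(416, 416)):
    """Channel-mean activation map, upsampled and normalized to [0, 1]."""
    amap = feature.mean(dim=1, keepdim=True)               # (n,1,h,w)
    amap = F.interpolate(amap, size=out_size, mode="bilinear",
                         align_corners=False)
    amap = amap - amap.amin(dim=(2, 3), keepdim=True)      # min-max normalize
    return amap / (amap.amax(dim=(2, 3), keepdim=True) + 1e-9)

print(activation_map(torch.randn(1, 256, 13, 13)).shape)  # (1, 1, 416, 416)
```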

5 Conclusions

In summary, the proposed network structure UGC-YOLO has a convolution kernel with strong capability to extract target morphology and a pyramid pooling module that helps aggregate high-level semantic information. The global contextual attention module is applied after fusing the semantic information of each residual stage, which achieves better fusion of high-level semantic information, mitigates the negative impact of the underwater environment, and improves the detection accuracy. The fusion strategy with learnable weights makes full use of feature outputs at different scales, reduces redundant information, and preserves the learned feature information, allowing the head detector to fully utilize image features for object detection. UGC-YOLO improves the detection accuracy on severely degraded underwater images and can be quickly transplanted to other classical algorithm models without changing the main network framework. Future research will focus mainly on optimizing algorithm complexity and compressing model parameters, so that the model and the related research ideas can be applied to a wider range of underwater detection and operation tasks.

Acknowledgements

The work is supported by the National Natural Science Foundation of China (No. 62271199), the Natural Science Foundation of Hunan Province, China (No. 2020JJ5170), and the Scientific Research Fund of Hunan Provincial Education Department (No. 18C0299).

References
Chen, L., Zhou, F., Wang, S., Dong, J., Li, N., Ma, H., et al., 2022. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognition, 132: 108926. DOI:10.1016/j.patcog.2022.108926
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., et al., 2017. Deformable convolutional networks. The Proceedings of the IEEE International Conference on Computer Vision. Venice, 764-773.
Ding, X., Guo, Y., Ding, G., and Han, J., 2019. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. The Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, 1911-1920.
Feng, H., Xu, L., Yin, X., and Chen, Z., 2021. Underwater salient object detection based on red channel correction. 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE). Nanchang, 446-449.
He, K., Gkioxari, G., Dollár, P., and Girshick, R., 2017. Mask R-CNN. The Proceedings of the IEEE International Conference on Computer Vision. Venice, 2980-2988.
Jiang, Z., and Wang, R., 2020. Underwater object detection based on improved single shot multibox detector. The 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence. Sanya, 1-7.
Li, Y., Guo, J., Guo, X., Liu, K., Zhao, W., Luo, Y., et al., 2020. A novel target detection method of the unmanned surface vehicle under all-weather conditions with an improved YOLOv3. Sensors, 20(17): 4885. DOI:10.3390/s20174885
Liu, T., Pang, B., Ai, S., and Sun, X., 2020. Study on visual detection algorithm of sea surface targets based on improved YOLOv3. Sensors, 20(24): 7263. DOI:10.3390/s20247263
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., et al., 2016. SSD: Single shot multibox detector. The European Conference on Computer Vision. Amsterdam, 21-37.
Qiu, M., Huang, L., and Tang, B., 2022. ASFF-YOLOv5: Multielement detection method for road traffic in UAV images based on multiscale feature fusion. Remote Sensing, 14(14): 3498. DOI:10.3390/rs14143498
Raihan, A. J., Abas, P. E., and De Silva, L. C., 2021. Role of restored underwater images in underwater imaging applications. Applied System Innovation, 4(4): 96. DOI:10.3390/asi4040096
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., 2016. You Only Look Once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, 779-788. DOI:10.1109/CVPR.2016.91
Ren, S., He, K., Girshick, R., and Sun, J., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149.
Wang, J., He, X., Shao, F., Lu, G., Jiang, Q., Hu, R., et al., 2022a. A novel attention-based lightweight network for multiscale object detection in underwater images. Journal of Sensors, 2022: 1-14. DOI:10.1155/2022/2582687
Wang, X., Zhu, Y., Li, D., and Zhang, G., 2022b. Underwater target detection based on reinforcement learning and ant colony optimization. Journal of Ocean University of China, 21(2): 323-330. DOI:10.1007/s11802-022-4887-4
Yang, H., Liu, P., Hu, Y., and Fu, J., 2021a. Research on underwater object recognition based on YOLOv3. Microsystem Technologies, 27(4): 1837-1844. DOI:10.1007/s00542-019-04694-8
Yang, J., Xie, K., and Qiu, K., 2021b. Integrate YOLOv3 with a self-attention mechanism for underwater object detection based on forward-looking sonar images. 2021 7th International Conference on Robotics and Artificial Intelligence. Guangzhou, 1-7.
Yin, W., Lu, P., Zhao, Z., and Peng, X., 2021. Yes, 'attention is all you need', for exemplar based colorization. Proceedings of the 29th ACM International Conference on Multimedia. New York, 2243-2251.
Yu, Y., Zhao, J., Gong, Q., Huang, C., Zheng, G., and Ma, J., 2021. Real-time underwater maritime object detection in side-scan sonar images based on transformer-YOLOv5. Remote Sensing, 13(18): 3555. DOI:10.3390/rs13183555
Zhang, X., Fang, X., Pan, M., Yuan, L., Zhang, Y., Yuan, M., et al., 2021. A marine organism detection framework based on the joint optimization of image enhancement and object detection. Sensors, 21(21): 7205. DOI:10.3390/s21217205
Zhou, T., Si, J., Wang, L., Xu, C., and Yu, X., 2022. Automatic detection of underwater small targets using forward-looking sonar images. IEEE Transactions on Geoscience and Remote Sensing, 60: 1-12. DOI:10.1109/TGRS.2022.3181417