International Journal of Automation and Computing, 2018, Vol. 15, Issue (6): 656-672
An Overview of Contour Detection Approaches
Xin-Yi Gong1,2, Hu Su1,2, De Xu1,2,3, Zheng-Tao Zhang1,2, Fei Shen1,2, Hua-Bin Yang1,2
1 Research Center of Precision Sensing and Control, Institute of Automation, Chinese Academy of Science, Beijing 100190, China;
2 University of Chinese Academy of Science, Beijing 100049, China;
3 Tianjin Intelligent Technology Institute of Institute of Automation, Chinese Academy of Science Co., Ltd, Tianjin 300300, China
Abstract: Object contour plays an important role in fields such as semantic segmentation and image classification. However, contour extraction is a difficult task, especially when the contour is incomplete or unclosed. In this paper, the existing contour detection approaches are reviewed and roughly divided into three categories: pixel-based, edge-based, and region-based. In addition, although traditional contour detection approaches have reached a high degree of sophistication, deep convolutional neural networks (DCNNs) have shown good performance in image recognition; therefore, DCNN-based contour detection approaches are also covered in this paper. Moreover, the future development of contour detection is analyzed and predicted.
Key words: Contour detection, contour salience, Gestalt principle, contour grouping, active contour
1 Introduction

Contour is known as a representative feature of an imaged object, and thus, its detection is an essential problem in the field of computer vision. Contour detection algorithms are fundamentally required for performing practical tasks such as object recognition and scene understanding. Until now, numerous researchers have studied the problem constantly and have made significant achievements[1].

As described in [2], a contour is defined as an outline representing or bounding the shape or form of an object. Contour detection attempts to extract curves which represent object shapes from images. In fact, the concept of contour is based on humans' common experience and does not have a formal mathematical definition. Contour is closely related to two additional concepts, i.e., edge[3] and boundary[4], which correspond to the discontinuities in photometrical, geometrical and physical characteristics of objects in images. It is believed that a clear distinction among the three concepts facilitates feature selection when designing certain special detectors.

In [4], a boundary is described as a contour in the image representing a change in pixel ownership from one object or surface to another. Some researchers tend to regard contours as the boundaries of interesting regions. However, contour detectors cannot be guaranteed to produce closed contours that divide the image into different regions[5]. Such cases frequently occur when contours do not arise from region boundaries. In this sense, the concepts of contour and boundary are closely related, but not identical.

In addition, an edge is represented in the image by changes in the intensity function[3] and can be detected through certain low-level image features such as brightness or color[4]. Therefore, edge detection is a low-level technique that can be applied toward the goal of boundary or contour detection. As illustrated in Fig. 1, the four classical means by which contours are observed are luminance change, texture change, perceptual grouping, and illusory contour[1]. In the first and second cases, the contours arise from region boundaries, whereas in the third and fourth cases, global relations give rise to the perception of a contour.

Fig. 1. Four classical means of observing contours

Considering the variety that may emerge in practical applications, contour detection is a difficult task. Generally, detectors based only on local features do not achieve satisfactory results, especially for textures, low-contrast objects, or severely noisy images. To pursue better performance, more sophisticated detectors have been developed in which high-level information and sometimes prior knowledge (e.g., the shape of the contour to be detected) are also deployed.

In summary, contour detection is of theoretical and practical significance. Despite this, overviews of research in this field are lacking. To the best of our knowledge, the most recent overview was published in 2011[1]. It focuses mainly on "edge and line oriented approaches", while other types of approaches are not discussed in depth. Moreover, that review to some degree lags behind the development of the field, since remarkable progress has been made on this subject. In particular, deep learning techniques[6] have been successfully applied to the problem of contour detection, achieving performance comparable with the other state-of-the-art methods[7]. Therefore, an overview of contour detection approaches that includes the recent progress is necessary.

This paper presents a taxonomy of the existing contour detection algorithms. We attempt to distinguish the following categories. The first one, pixel-based approach, directly determines whether each pixel belongs to a contour according to certain extracted features. The second one, edge-based approach, detects edges and then contours are obtained by grouping or optimizing the edge fragments. The third one, region-based approach, regards contours as boundaries of interesting regions and internal information of the region is taken into account. The final category is deep-network-based approaches, in which deep networks with good generalization abilities are proposed for contour detection.

The rest of this paper is organized as follows. In Section 2, pixel-based approaches are discussed in two steps: feature selection and classification. Edge-based and region-based approaches are discussed in Sections 3 and 4, respectively. In Section 5, recently published papers on deep networks for contour detection are reviewed. Section 6 presents specific applications of contours. Finally, the paper is concluded in Section 7.

2 Pixel-based approaches

In pixel-based approaches, features are constructed and then employed to determine whether each pixel of the image belongs to a contour. Discontinuity in grey-scale intensity is the most salient feature in an image and was the first to be exploited. By convolving the image with local filters, this feature can be detected as the pixels with the highest gradient magnitude in their local neighborhood. To extract the discontinuity feature arising from step edges, several linear filters[8-10] were introduced, such as the Sobel, Prewitt and Canny operators. However, as Perona and Malik[11] pointed out, real image edges are usually not step functions, but are more typically a combination of steps, peaks, and roof profiles. Step edges are thus an inadequate model for the discontinuities in the image that result from composite edges. A class of nonlinear filters, known as quadratic filters, is suggested in [11]. In addition, filters of different scales[12, 13] and orientations[14] are considered as well for providing a richer description.
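As a minimal illustration of the linear-filter route, the sketch below (Python with NumPy/SciPy, an assumption of this overview rather than any cited implementation) convolves an image with the two Sobel kernels and takes the gradient magnitude; the response peaks exactly at the step discontinuity.

```python
import numpy as np
from scipy.ndimage import convolve

def sobel_gradient(img):
    """Gradient magnitude via the Sobel operator."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = convolve(img, kx, mode="nearest")  # horizontal derivative
    gy = convolve(img, ky, mode="nearest")  # vertical derivative
    return np.hypot(gx, gy)

# A vertical step edge: the magnitude peaks on the two columns
# adjacent to the discontinuity and is zero in flat regions.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
mag = sobel_gradient(img)
```

For a roof or composite edge, this single filter would respond weakly or ambiguously, which is precisely the limitation that motivates the quadratic filters of [11].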

Nevertheless, the discontinuity feature is not robust, especially in the case of textures and ambiguities. Therefore, it is not appropriate to employ this feature alone to accurately localize an object contour, although it has been widely used in edge detection. In view of this, recently proposed contour detection approaches attempt to employ higher-level features. Considering the motivations of the features, these approaches can be divided into two categories, namely, approaches based on brain-inspired features and approaches based on natural features, both of which are discussed below.

2.1 Methods based on brain-inspired feature

Visual neural mechanisms form the theoretical basis for methods based on brain-inspired features, which imitate the reaction of human perception in contour detection. Very significant progress has been made in research on the visual neural mechanism, which we now briefly introduce. Firstly, several concepts need to be clarified to smooth the following discussion. The primate primary visual cortex is in general referred to as V1. In V1, the receptive field (center area) is called the classical receptive field (CRF) and its surrounding area is called the non-classical receptive field (NCRF). As identified in studies[15-19], stimuli falling only in the NCRF are not sufficient for driving spiking responses, but can modulate the activity evoked by the stimulus located within the CRF. This modulatory effect is referred to as center-surround interaction. Intense research on cortical neurons in the CRF and NCRF has revealed that the visual cortex plays an important role in feature extraction, and that center-surround interaction is desirable for contour perception. Different models have been developed to construct features in accordance with the responses of different areas in V1, which are described in detail as follows.

Daugman et al.[20, 21] describe the response to stimuli in the CRF of simple cells using two-dimensional (2D) Gabor filters. The filters fit the 2D spatial response profile of each simple cell in the least-squared-error sense (with a simplex algorithm) to estimate whether each pixel is on an edge. In addition, to describe the response to stimuli in the CRF of complex cells, the concept of Gabor energy is proposed in [22]. It is computed by using a bank of orthogonal pairs of Gabor filters for texture analysis, obtaining better results. Zhao et al.[23] calculate the Gabor energy maps and the preferred orientations by convolution with a series of directional Gabor functions. Histograms of the preferred orientations are counted to calculate the modulations and then to ameliorate the Gabor energy. These studies demonstrate that Gabor filters can approximately simulate the neuronal response characteristics of the CRF, such as orientation and frequency. To model the inhibition of the NCRF, various models have been proposed, represented by the difference of Gaussians (DoG) proposed by Sceniak et al.[24, 25] and the ratio of Gaussians (RoG) proposed by Cavanaugh et al.[26] For the sake of texture suppression in contour detection, Grigorescu et al.[27-29] utilize the NCRF inhibition mechanism, which consists of isotropic and anisotropic inhibition. They consider isolated lines and edges to be non-texture features, which are not affected by the inhibition, while groups of lines and edges viewed as texture features are suppressed. Thereafter, the NCRF inhibition mechanism has been further investigated to improve contour detection. For example, Papari et al.[30, 31] split the inhibition surround into two truncated half-rings oriented along the concerned edge, which reduces the impact of self-inhibition.
Ursino and La Cara[32] present a model of contour extraction based on inhibition mechanisms that combine feed-forward and feedback, the former of which suppresses non-optimal stimuli and ensures contrast invariance, while the latter suppresses noise. Long et al.[33-35] propose an adaptive inhibition approach to remove non-meaningful texture elements for effective contour extraction. Azzopardi et al.[36, 37] propose a computational model of a simple cell referred to as the combination of receptive fields (CORF) model. The model uses responses of model lateral geniculate nucleus (LGN) cells with center-surround receptive fields as afferent inputs. In [38], multiple features are suggested for the classical center-surround inhibition that is common to most cortical neurons, such as orientation, luminance, and luminance contrast. For NCRF modulation, in addition to the inhibitory effect, the facilitatory effect, the importance of which has been fully validated[39], is also applied to contour detection[40, 41]. Tang et al.[42, 43] establish a unified contour model that fully utilizes the configurational feature of smooth contours to renovate incomplete contours. The method is appropriate for complex natural scenes and more in line with physiological mechanisms. As demonstrated, the NCRF-related methods can dramatically suppress trivial edges arising from texture and thus improve contour detection performance. For more details on this issue, please refer to [44].
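The isotropic surround-inhibition idea can be sketched as follows. This is a simplified toy, not the model of any cited work: the surround term is the convolution of an edge-response map with a normalized, half-wave-rectified DoG ring (zero inside the CRF, positive in the annulus around it); subtracting it barely affects an isolated contour but strongly suppresses dense texture.

```python
import numpy as np
from scipy.ndimage import convolve

def dog_ring_kernel(sigma, k=4.0, radius=None):
    """Half-wave-rectified DoG: a non-negative ring outside the CRF."""
    if radius is None:
        radius = int(3 * k * sigma)
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    r2 = x**2 + y**2
    g = lambda s: np.exp(-r2 / (2 * s**2)) / (2 * np.pi * s**2)
    dog = np.maximum(g(k * sigma) - g(sigma), 0.0)  # keep only the ring
    return dog / dog.sum()

def surround_inhibition(response, sigma=1.0, alpha=1.0):
    """Subtract the ring-averaged surround activity, rectifying at zero."""
    surround = convolve(response, dog_ring_kernel(sigma), mode="nearest")
    return np.maximum(response - alpha * surround, 0.0)

# An isolated contour (one line) vs. a texture of parallel lines:
# the texture gathers much more surround activity and is suppressed.
response = np.zeros((41, 41))
response[:, 5] = 1.0           # isolated contour
response[:, 20:35:2] = 1.0     # dense texture
inh = surround_inhibition(response)
```

The design choice mirrors the qualitative claim in the text: inhibition strength grows with the amount of edge energy in the surround, so grouped (texture) edges are penalized while isolated ones survive.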

2.2 Methods based on natural features

Approaches based on natural features use the features that are extracted according to natural attributes of the image, such as brightness and color. The approaches include two crucial components in implementation, feature selection and determination.

2.2.1 Feature selection

As mentioned above, texture and noise are important factors that influence the effectiveness of contour detection. To overcome this difficulty, the discontinuity feature is combined with other gradient features, including color and texture gradients, resulting in a probabilistic detector Pb[4]. Motivated by the gradient features in Pb, several related features have been proposed. In [45, 46], color and gradient information are considered via local, textural, oriented, and profile gradient-based features. In [47], additional localization and relative contrast cues are defined and fed to the boundary classifier. The localization cue captures the distance from a pixel to the nearest peak response, and the relative contrast cue normalizes the contrasts in two half-discs around each pixel. In [48], the hand-designed gradient features in Pb are replaced by sparse code gradient (SCG) features, bringing a significant improvement in detection accuracy. SCG features play a role similar to gradient features in measuring local contrast; however, they are learned automatically from image data through sparse coding, thus minimizing human involvement. Dollar et al.[49] make further efforts to reduce the human burden of designing and selecting appropriate features. In their work, thousands of simple features computed on image patches are considered to establish a discriminative model. These features include gradients at multiple scales and locations, differences between histograms computed on filter responses at multiple scales and locations, and Haar wavelets.
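The half-disc construction behind the Pb-style gradient cues can be illustrated with a toy sketch (a hypothetical helper, not the reference implementation): the cue at a pixel, here for a vertical boundary orientation only, is the chi-squared distance between intensity histograms of the two half-discs on either side of the pixel.

```python
import numpy as np

def half_disc_gradient(img, cy, cx, radius=5, n_bins=8):
    """Pb-style brightness cue at (cy, cx) for a vertical boundary:
    chi-squared distance between the left and right half-disc
    intensity histograms (intensities assumed in [0, 1])."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = x**2 + y**2 <= radius**2
    patch = img[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1]
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    h_left = np.histogram(patch[disc & (x < 0)], bins=bins)[0].astype(float)
    h_right = np.histogram(patch[disc & (x > 0)], bins=bins)[0].astype(float)
    h_left /= h_left.sum()
    h_right /= h_right.sum()
    denom = h_left + h_right
    valid = denom > 0
    return 0.5 * np.sum((h_left[valid] - h_right[valid])**2 / denom[valid])

# Step edge at column 15: maximal cue on the edge, zero in flat areas.
img = np.zeros((31, 31))
img[:, 15:] = 1.0
g_edge = half_disc_gradient(img, 15, 15)
g_flat = half_disc_gradient(img, 15, 6)
```

Because it compares distributions rather than pointwise differences, the same construction extends naturally to texture channels, which is what makes the cue robust where plain gradients fail.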

In addition to the above approaches that aim to extract local information, spectral graph theory, particularly its normalized cuts criterion, provides a means to explore the role of global information in contour detection. The normalized cuts criterion is investigated and applied to image segmentation in [14, 50]. In [5, 51], it is extended to improve the performance of the detector Pb. As a result, a novel detector gPb is developed which involves both local and global information. In the framework, an affinity matrix W is defined, the elements of which measure the similarity of pixels. Afterwards, a diagonal matrix D is calculated with elements ${D_{ii}} = \Sigma_j {W_{ij}}$ and the generalized eigenvectors of the system $(D - W){{v}} = \lambda D{{v}}$ are solved. Local contrasts extracted from the eigenvectors are combined to provide global information. Despite its good performance, the detector gPb is limited in scalability as it has a high computational cost and memory footprint. In view of this point, an improved normalized cuts algorithm is proposed[52, 53], providing a 20× speed-up in the eigenvector computation.
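The globalization step can be demonstrated on a toy affinity matrix (a sketch assuming SciPy, not the gPb implementation): with W defined over six "pixels" forming two tightly connected clusters, the second generalized eigenvector of $(D - W)v = \lambda Dv$ separates the clusters by sign, which is the global grouping signal gPb mines for contours.

```python
import numpy as np
from scipy.linalg import eigh

# Toy affinity over 6 "pixels": two tight clusters, weak links between.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
# Generalized eigenproblem (D - W) v = lambda D v; D is positive definite,
# so scipy.linalg.eigh solves it directly. Eigenvalues come back ascending.
vals, vecs = eigh(D - W, D)

# The first eigenvector is constant (eigenvalue 0); the second one
# splits the two clusters by sign.
v1 = vecs[:, 1]
```

In gPb the analogous eigenvectors are reshaped back into images and their oriented derivatives are added to the local Pb cues.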

2.2.2 Determination

After the features are extracted, it must be determined whether each pixel belongs to a contour; the issue is briefly discussed here. Traditional classification methods[9] are based mainly on dividing the filter responses with a predetermined threshold, or on finding local maxima of the responses. In recent research, learning-based methods have attracted increasing attention. Compared with traditional methods, learning-based methods can exploit various image cues to achieve higher robustness. In [4], comparison experiments are conducted with a wide array of classifiers, including hierarchical mixtures of experts (HME), support vector machine (SVM), etc. Accordingly, the logistic model is chosen therein. Consistent with the observations reported in [4], the logistic model is also adopted in [47, 51]. Another classification method, known as the probabilistic boosting tree (PBT), is employed in [49] to deal with thousands of features. Up to this point, we have reviewed some of the existing pixel-based contour detection methods; a clearer descriptive summary is offered in Table 1.
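The learning-based route can be sketched with a toy logistic model (a hypothetical two-cue example, not the classifier of [4]): per-pixel cue vectors, say a brightness gradient and a texture gradient, are mapped to a boundary probability and the weights are fitted by gradient descent.

```python
import numpy as np

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic model p(boundary | cues) by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                             # logistic-loss gradient
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy training set: boundary pixels (label 1) have large cue values,
# non-boundary pixels (label 0) have small ones.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic(X, y)
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

The appeal noted in [4] is exactly this simplicity: the logistic model combines heterogeneous cues into a single calibrated probability with very few parameters.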

Table 1
Summary of the pixel-based contour detection methods

2.3 Discussions

In general, it can be concluded that continuous effort has been invested in developing novel features to improve detection accuracy. Different features have been proposed, involving both local and global image information. The features can be combined to improve robustness, especially when textures and noise occur. The first attempt is reported in [4], which combines brightness, color, and texture gradient features. However, the extent to which the features are sufficient and necessary has not yet been completely understood. Dollar et al.[49] attempt to overcome the difficulty by employing a large number of simple features to train a discriminative model. The primary advantage of such an approach is that the human effort required is minimized; the work is instead shifted to the classification algorithm. Up to now, few research studies have been conducted on this topic and further discussion is necessary.

For pixel-based approaches, another significant concern is to determine the range of scales at which contours may appear in images. This issue was addressed in early research studies on edge detection. In [12, 55], focusing on the Laplacian of Gaussian (LoG) operator, a rigorous analysis of linear edges at different scales is carried out to study the mutual influence of edges. Basic knowledge about the behavior of edges in scale space is provided and can be used for scale-space reasoning. In [56], a mechanism is proposed to automatically determine the scale levels that are appropriate for extracting a given edge. For contour detection, the accuracy of several of the detectors mentioned above has been further improved by taking advantage of multi-scale information. Ren[47] demonstrates the benefit of combining information from multiple scales of the Pb detector developed in [4]. Furthermore, a multi-scale version of the Pb detector, named mPb, is proposed in [5, 51]. In [49], gradient features at multiple scales are employed in the discriminative model. However, some aspects of contour detection have not yet been sufficiently investigated and more research is needed, such as selecting the scales of operators for an image and effectively combining the information recovered at different scales.

3 Edge-based approaches

Edge-based approaches start from contour-related edges or curves provided by edge detectors or human prior experience, aiming to determine whether they are contained in a certain contour. In many of these approaches, global optimization is conducted, whereby information from the entire image can be taken into consideration simultaneously. The difficulty is that, in most cases, the optimization problem is NP-hard, which makes it unlikely that the global optimum can be found in polynomial time. Edge-based approaches can be categorized into two classes: edge grouping and parametric active contour.

3.1 Contour extraction

Although it has not been strictly defined, contour extraction has been mentioned in the literature. In [57], it is regarded as a process of identifying a subset of fragments produced by preprocessing and connecting the fragments in that subset. Cox et al.[58] point out that contour extraction is a two-step process: edge detection followed by edge grouping. As stated, the grouping process, which is primarily addressed in this section, involves collecting individual edge points together to form continuous contours.

Note that perceptual grouping is a fairly general process that may emerge from different domains. As explained in [59], grouping processes rearrange the given data by eliminating irrelevant data items and sorting the rest into groups, each of which corresponds to a particular object, as illustrated in Fig. 2[60]. In [59], a general grouping approach is proposed, which can be used for grouping various types of measurements, and for incorporating different grouping cues. In our context, measurements are edge elements, and the Gestalt laws which have been used as guidelines for many edge grouping methods, need to be introduced. Gestalt laws are the factors that lead to human perceptual grouping, which can be summarized as similarity, proximity, continuity, symmetry, parallelism, closure and familiarity.

Fig. 2. Edge grouping

Earlier research studies[58, 61, 62] concentrated on probability modeling, which determines the likelihood that certain edge elements belong to the same contour. These studies differ in the way the probability is calculated: in [58], it is based on empirical statistics, whereas in [61, 62], Bayesian inference is adopted. Subsequent work on probabilistic edge grouping focuses on saliency measures, which can be derived from transition probabilities between different edge elements. In [60], the concepts of link saliency and contour saliency are introduced and used to identify smooth closed contours bounding objects of unknown shape in real images; a method named stochastic completion fields[63, 64] is adopted for the calculation of transition probabilities. The method[63, 64] models the Gestalt laws of proximity and continuity by the distribution of smooth curves traced by particles that move with constant speed in directions undergoing Brownian motion. The method is not claimed as a saliency measure, but it is easy and natural to use it to compute saliency. In addition, different saliency measures have been proposed[64-68]. Each of the measures is a function of a set of affinity values assigned to pairs of edges, incorporating the Gestalt principles of good continuation and proximity in some form. However, the affinity between a pair of edges is defined somewhat differently in different measures. Their performances are compared in [69]. As reported in [69], the measure of perceptual saliency significantly affects the performance of the grouping method and thus needs to be emphasized. In addition to grouping edge elements, the saliency measure is proven to be effective in the detection of illusory contours, as shown in Fig. 3.
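The pairwise affinities these measures build on can be sketched as follows. The particular functional form here is a hypothetical illustration (each cited measure defines it differently): proximity decays with distance, and good continuation rewards tangents that are aligned with the line joining the two edge elements.

```python
import numpy as np

def edge_affinity(p1, t1, p2, t2, sigma_d=10.0):
    """Affinity of two edge elements (position p, unit tangent t):
    Gaussian proximity times a good-continuation term measuring how
    well both tangents align with the joining direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    d = np.linalg.norm(p2 - p1)
    proximity = np.exp(-d**2 / (2 * sigma_d**2))
    if d > 0:
        u = (p2 - p1) / d
        continuation = abs(np.dot(t1, u)) * abs(np.dot(t2, u))
    else:
        continuation = abs(np.dot(t1, t2))
    return proximity * continuation

# Nearby collinear edges bind strongly; perpendicular or distant
# edges do not.
a_collinear = edge_affinity((0, 0), (1, 0), (5, 0), (1, 0))
a_perp = edge_affinity((0, 0), (1, 0), (5, 0), (0, 1))
a_far = edge_affinity((0, 0), (1, 0), (80, 0), (1, 0))
```

A full saliency measure then aggregates such pairwise affinities along candidate chains or cycles of edge elements.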

Fig. 3. Detection of illusory contours[63, 64]

The optimization framework is pioneered by Jermyn and Ishikawa[70] and generalized in [71]. The difference between them is that the latter imposes curvature regularity on the region boundary, whereas the former does not. The authors also provide graph-based methods to solve the optimization problems, which, in most cases, are NP-hard, meaning that it is unlikely that the global optimum can be found in polynomial time. Based on the framework, the approaches called ratio contour are proposed in [57, 72]. It is noteworthy that the above-mentioned methods are designed for extracting salient region boundaries, and in particular, the approach presented in [72] is intended to detect boundaries with symmetry. As the concept of boundary is much narrower than that of contour, their application areas are limited. The restriction can be overcome by using the graph topology proposed in [73]. In this topology, each edgel is represented by two duplicated nodes with opposite directions. Based on the two nodes, open curves can be represented by cycles once the connections between the nodes are added[74]. The same topology is adopted in [75]. Unlike the method in [73], the number of groupings recovered by the algorithm in [75] can be specified by a human user, so the application scope is expanded. The edgel mentioned here is not a universal concept and has been referred to in some of the literature above[73, 75, 76] without a strict definition. We take the description in [76] to illustrate the concept:

Edge pixels with very similar gradient direction in a local neighborhood are combined to edgels. The edgel is located at the center position and has the mean direction of the local estimate.

In addition, other grouping algorithms for specific purposes have been proposed. For example, a path-based grouping approach is proposed in [76], in which the connectedness of image elements via mediating elements is sufficiently explored. This grouping approach yields superior results when objects are distributed on low-dimensional extended manifolds in a feature space. In [77], based on defined cost functions, the proposed grouping approach tends to extract a small set of smooth curves, from which a global interpretation of the image can be derived. The approach proposed in [78] is designed to find convex sets of line segments. In essence, the optimization approaches mentioned in this paragraph differ mainly in their choice of grouping principles and optimization methods. A clearer illustration is summarized in Table 2.

Table 2
Summary of the grouping principles and optimization methods

3.2 Contour completion

An additional point that plays a critical role in human perception and thus needs to be emphasized is the closure principle. Enforcement of the closure principle can produce more meaningful contours. However, given noise and ambiguities, contours frequently break into fragments, even with state-of-the-art edge detectors. Unlike other relatively local Gestalt principles such as proximity and continuity, the closure principle is more global, and thus its enforcement is not a simple task. Some attempts to address the issue are reported in [2, 79, 80]. These approaches adopt the conditional random field (CRF) model to achieve contour closure; the main differences between them lie in the means used to complete contours across gaps and the potential functions employed in the CRF model. The approaches have the limitation that they can only be applied to contours with small and simple gaps; otherwise, large deviations will occur. A more general gap filling strategy is proposed in [57], which can, to some extent, overcome not only this limitation, but also the negative impact of noise at fragment endpoints. However, there is a possibility that the filled contour will deviate significantly from the original curve. Given this difficulty, the topic still requires more discussion and more efficient approaches.

3.3 Edge-based active contour

Active contour is a variational approach to contour localization, which is also referred to as a snake or deformable model in the literature. The main idea is to evolve a hand-drawn curve into the specific line or contour by minimizing a predefined energy function, as shown in Fig. 4. The basic concept of active contour was first proposed by Kass et al.[81] in the 1980s. Another pioneering approach worth mentioning is the application of level sets to solve the active contour model, carried out by Caselles et al.[82-84] In their approaches, the contour can be expressed implicitly, and thus, its application range is expanded. Building on these works, further research has been conducted and significant progress has been made in recent years. Active contour approaches can be divided into two categories: edge-based approaches and region-based approaches. They are differentiated by whether the luminance inside the curve-bounded region is involved in the definition of the energy function. More explicitly, the region-based approaches derive a contour representation from a segmentation of the image into well-defined regions, while in the edge-based approaches, a contour representation fitting the boundary points is characterized by differential properties. Edge-based approaches are reviewed in this section and region-based approaches follow in the next section. With the parametric representation of a snake v(s) = (x(s), y(s))[81], the energy function is expressed as

 $\begin{split}E_{snake}^* & = \mathop \int \nolimits_0^1 {E_{snake}}(v(s))\,{\rm d}s=\\ & \mathop \int \nolimits_0^1 ({E_{int}}(v(s)) + {E_{img}}(v(s)) + {E_{con}}(v(s))){\rm d}s\end{split}$ (1)

where Eint, Eimg and Econ respectively represent different aspects of the forces that drive the curve to the boundary of the object. In particular, Eint represents the internal energy of the curve, Eimg gives rise to the image forces, and Econ gives rise to external constraint forces. According to [81], Eimg can be combined with Econ to form the external energy, the value of which represents how closely the curve approximates the actual contour[83]. By minimizing the energy function, the curve is evolved until it reaches the boundary of the object.
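A toy discretization of the energy in Eq. (1) can make the terms concrete (a sketch; the point sampling, the gradient-magnitude image term, and the coefficients are illustrative assumptions, not the formulation of [81]): the internal energy is built from first and second finite differences of the sampled curve, and the image term rewards high gradient magnitude under the snake points.

```python
import numpy as np

def snake_energy(pts, grad_mag, alpha=0.1, beta=0.1):
    """Discrete snake energy for a closed curve: elasticity (first
    difference) + bending (second difference) + an image term that
    lowers the energy where gradient magnitude is high."""
    pts = np.asarray(pts, float)
    d1 = np.roll(pts, -1, axis=0) - pts                           # elasticity
    d2 = np.roll(pts, -1, axis=0) - 2 * pts + np.roll(pts, 1, axis=0)  # bending
    e_int = alpha * np.sum(d1**2) + beta * np.sum(d2**2)
    ix = np.clip(pts[:, 0].round().astype(int), 0, grad_mag.shape[0] - 1)
    iy = np.clip(pts[:, 1].round().astype(int), 0, grad_mag.shape[1] - 1)
    e_img = -np.sum(grad_mag[ix, iy])                             # image term
    return e_int + e_img

# A curve lying on a strong edge has lower energy than the same curve
# displaced into a flat region.
grad_mag = np.zeros((20, 20))
grad_mag[10, :] = 1.0
on_edge = [(10, c) for c in range(20)]
off_edge = [(5, c) for c in range(20)]
e_on = snake_energy(on_edge, grad_mag)
e_off = snake_energy(off_edge, grad_mag)
```

Minimizing this energy by gradient descent over the point positions is what evolves the initial curve toward the object boundary.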

The model proposed in [81] combines local features with a priori information such as location, shape, and size, giving it the advantages of high accuracy and good adaptability. Moreover, it can effectively detect illusory contours. An example is presented in Fig. 4: with the initial configuration drawn by the user (Fig. 4 (a)), the model can detect both edges and boundaries (Fig. 4 (b))[85]. However, it also has some drawbacks, such as slow convergence, strong dependence on the initial configuration, and poor performance in detecting low-contrast contours. To overcome these limitations, a large variety of improvements have been developed. For example, the balloon snake[86, 87], the gradient vector flow (GVF) snake[88], and the model in [89] all aim to reduce the dependence on the initial curve. Different strategies have been adopted in these models. In contrast to the original model proposed in [81], additional forces are introduced in the balloon snake and the GVF snake. Furthermore, in the models of [81, 86-88], the curve can move in only one direction, whereas in the model of [89], the curve is allowed to evolve with more freedom, in the inward and outward directions simultaneously.

Fig. 4. Example of active contour[85]

In addition, the geodesic active contour (GAC) model proposed in [83] improves real-time performance, but has the risk of leaking through the boundary. This risk is avoided in the model of [90], which takes shape information into account. In [91], the snake model is extended to color images. Other examples, including the Fourier snake[92], the dual snakes[93, 94], the water snake[95], and the directional snakes[96], are also typical modifications of the active contour model, which, to some extent, overcome the disadvantages of the original model. For an overview, please refer to [85, 97].

3.4 Discussions

We classify the edge-based approaches into two categories according to whether or not they require an initialization from the user. The first category, discussed in Sections 3.1 and 3.2, aims to group edge pixels together into contours according to the Gestalt laws. In the second category, reviewed in Section 3.3, the initial curve is evolved by minimizing an energy function.

In the former category, optimization is usually performed in a graph-based framework. The nodes in the graph have different representations in different optimization methods. The most common representations are image pixels[70], line segments[71], and edgels[73, 75]. In general, the edgel representation may be a more convenient choice when considering the Gestalt principles of closure, proximity, and continuity. In addition, as analyzed in [57], the choice of the most-salient boundary may depend on image scale. In this sense, another consideration should be given to design scale-invariant proximity measures.

In the latter category, active contour is an energy-minimizing approach guided by external forces and influenced by image forces that attract the initially drawn curve to edges and contours. As demonstrated in [98], the approach has the unique advantage of locating the object accurately and efficiently. Moreover, it can effectively detect illusory contours. Nevertheless, the original model also has limitations such as poor real-time performance, strong dependence on the initial configuration, and sensitivity to noise. Many modifications have been proposed that, to some extent, overcome these disadvantages. However, we should note that most of them still have their own limitations, and few can satisfy practical requirements. Considering this, we suggest that it is necessary to investigate how to remove the requirement of drawing an initial curve and how to improve robustness.

4 Region-based approaches

Regarding contours as boundaries of regions of interest, region-based approaches take advantage of the internal information of the regions to enhance their effectiveness and robustness. The advantages of these approaches are indisputable, but they also have a clear drawback: they are invalid for contours that do not arise from region boundaries. These approaches are considered in this section from two different aspects: region segmentation approaches and region-based active contours.

4.1 Region segmentation approaches

In fact, region segmentation approaches are closely related to another research area, image segmentation, which plays an important role in object recognition and scene understanding[99-101]. A detailed discussion of segmentation is not provided here because of space limitations. In this section, only those approaches that have a direct application to contour detection are discussed. A classical region-based approach, named ultrametric contour map (UCM), is proposed in [102, 103], which addresses region-based segmentation and contour extraction as a single task. The key of the approach is the definition of ultrametric distances, in which local contrast and region contribution are involved in measuring the dissimilarities between adjacent regions. Furthermore, Jones et al.[18] propose an oriented watershed transform (OWT) algorithm to form the initial regions for the construction of the UCM. The study is further extended in [52, 53] by effectively utilizing multiscale information. Ming et al.[104, 105] introduce a contour detection framework that combines region and contour representations based on the winding number concept. In their approach, region similarity and contour cues such as curvature are involved.
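To make the idea of merging adjacent regions by dissimilarity concrete, here is a toy one-dimensional sketch of greedy hierarchical region merging. It is only an analogue of the idea behind ultrametric-style region hierarchies, not the UCM algorithm itself; the dissimilarity (difference of region means) and the stopping threshold are illustrative assumptions:

```python
import numpy as np

def greedy_merge(values, threshold):
    """Repeatedly merge the pair of adjacent regions with the smallest
    mean contrast, stopping when the smallest dissimilarity exceeds
    `threshold`. Toy 1-D sketch of region-merging, not UCM."""
    regions = [[v] for v in values]          # start from single-pixel regions
    while len(regions) > 1:
        means = [np.mean(r) for r in regions]
        diffs = [abs(means[i + 1] - means[i]) for i in range(len(regions) - 1)]
        i = int(np.argmin(diffs))
        if diffs[i] > threshold:
            break
        regions[i] = regions[i] + regions.pop(i + 1)   # merge the closest pair
    return regions

# Two flat segments separated by a strong step survive as two regions;
# the boundary between them is the detected "contour".
signal = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
regions = greedy_merge(signal, threshold=1.0)
```

Varying the threshold traces out a hierarchy of segmentations, which is the intuition behind reading contours off an ultrametric distance.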

4.2 Region-based active contour

As analyzed in Section 3.2, region-based active contour models compute the sum of integrals over the regions bounded by the curve, rather than over the boundaries. Different from edge-based models, region-based models can detect boundaries that are not necessarily defined by gradient or that are not very smooth. Consequently, region-based models are usually more robust to noise and less sensitive to initialization than edge-based models.

Cohen et al.[106] first suggest that region information should be considered in the energy function. Based on an energy function in which a region term is involved, the evolution speed of the curve can be determined. Subsequently, using the Mumford-Shah functional and level sets, Chan and Vese[107] propose a region-based model which can effectively detect boundaries that are not defined by gradient. The energy function is defined as

 $\begin{split}E =\; & \mu \, Length(C) + \upsilon \, Area(inside(C)) \, + \\ & {\lambda _1}\iint_{inside(C)} {|I(x,y) - {c_1}|^2\,{\rm d}x\,{\rm d}y} + \\ & {\lambda _2}\iint_{outside(C)} {|I(x,y) - {c_2}|^2\,{\rm d}x\,{\rm d}y} \end{split}$ (2)

where $\mu \geq 0$, $\upsilon \geq 0$, and $\lambda_1, \lambda_2 > 0$ are fixed parameters. For efficient computation, level set methods are utilized in [107] to transform the mixed integrals over regions and boundaries into single integrals over boundaries. Vese and Chan[108] then propose a multiphase level set framework for image segmentation, which can be seen as a generalization of the model proposed in [107]. However, in the models proposed in [107, 108], only region information is used. To further improve performance, fusion methods based on both region information and edge information have been suggested. The model presented in [109] combines the perceptual notions of edge/shape information with gray-level homogeneity. The authors attempt to form a unified approach by integrating the two types of information, enhancing robustness to noise and to poor initialization. Zhu and Yuille[110] then present a novel statistical and variational approach to image segmentation based on an algorithm called region competition. A more general expression of the energy function is obtained, which is

 $\begin{split}E(C(p),\{ {a_i}\} ) = & \sum\limits_{i = 1}^M \Bigg\{ \frac{\mu }{2}\int_{\partial {R_i}} {{\rm d}s} - \Bigg.\\ & \Bigg.\log P(\{ I(x,y):(x,y) \in {R_i}\} \,|\,{a_i}) + \lambda \Bigg\}. \end{split}$ (3)

As demonstrated in [110], the snake model is a special case of (3) and can be directly deduced from (3) with fixed parameters $a_i$. Moreover, other models, such as the geodesic active region (GAR) model[111], the diffusion snake[112], and the polar snake[113], improve performance in some respects.
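To illustrate the region terms of the Chan-Vese energy (2) numerically, the following sketch fits each side of the curve by its mean intensity and sums the squared residuals on a toy image. For brevity the length and area terms ($\mu$, $\upsilon$) are omitted, and the function name and test image are illustrative assumptions:

```python
import numpy as np

def cv_region_energy(img, mask, lam1=1.0, lam2=1.0):
    """Region terms of the Chan-Vese energy (2): c1 and c2 are the mean
    intensities inside and outside the curve (here, a binary mask), and
    the energy is the weighted sum of squared residuals on both sides."""
    inside, outside = img[mask], img[~mask]
    c1 = inside.mean() if inside.size else 0.0
    c2 = outside.mean() if outside.size else 0.0
    return lam1 * np.sum((inside - c1) ** 2) + lam2 * np.sum((outside - c2) ** 2)

# Toy image: a bright square on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
good = np.zeros_like(img, dtype=bool); good[4:12, 4:12] = True  # matches object
bad = np.zeros_like(img, dtype=bool); bad[0:8, 0:8] = True      # misaligned
```

The energy is minimized when the curve coincides with the object boundary, which is exactly why evolving the curve to decrease (2) drives it toward the contour even in the absence of strong gradients.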

4.3 Discussions

Thus far, the investigation of region-based approaches is relatively limited in the existing publications. Region-based approaches consider more global statistics than edge-based approaches; therefore, they are more robust to noise and can adapt to contours that are not very smooth. In particular, region-based active contour models can even overcome the strong dependence on the initial configuration. However, the disadvantage is also obvious: the approaches are invalid for contours that do not arise from region boundaries. This limitation may be overcome by the idea from [73] that open curves can be represented by circles through adding several connections between duplicated nodes. Thus, a more general region-based approach can be derived, and the applicability of such approaches can be further extended.

5 Deep-learning based approaches

5.1 Existing methods

Deep convolutional neural networks (DCNNs) have recently achieved impressive performance in various tasks such as classification, image and video detection, and segmentation[114]. The most notable network (named AlexNet) is the one from Krizhevsky et al.[115], which was the first to achieve a significant improvement in large-scale image classification. Subsequently, a number of attempts have been made to improve the original architecture of AlexNet, several of which are prevalent and worth mentioning. Their differences lie mainly in the depth and the convolutional strategy. While AlexNet is 8 layers deep, VGG net[116] and GoogLeNet[117] are pushed to 16-19 layers and 22 layers, respectively. Concerning the convolutional strategy, VGG net adopts smaller convolutional filters, while GoogLeNet takes the “Inception module” to increase the width of the net. In GoogLeNet, the “Inception module” is used only at higher layers, while the lower layers are kept in the traditional convolutional fashion. He et al.[118] suggest a novel residual learning framework, based on which they design residual nets with depths of up to 152 layers. These nets are 8 times deeper than the VGG nets and substantially deeper than those used previously. The nets are successfully trained with the proposed learning framework and have shown impressive performance in the ILSVRC & COCO 2015 competitions. In these nets, several additional strategies are also employed, such as batch normalization[119] and dropout[120]. Inheriting the ideas originally proposed by LeCun[121], these models are typically referred to as standard DCNNs. According to previous publications, DCNNs functioning as hierarchical feature extractors deliver strikingly better results than systems relying on carefully engineered representations.

However, despite their great success in image classification, standard DCNNs cannot be directly applied to pixel-wise tasks, such as segmentation and detection. The repeated combination of max-pooling and striding at consecutive layers of the networks significantly reduces the spatial resolution of the resulting feature maps. In light of this, a number of effective improvements on standard DCNNs have been made, and new achievements obtained. The fully convolutional network (FCN)[122] converts the fully-connected layers of VGG into convolutional ones and attempts to harness information from multiple layers to better estimate object boundaries. Within the same FCN framework, denser score maps are obtained by using atrous (dilated) convolutions[123-125], and the label map is refined with a fully connected CRF in [123, 124]. These networks are shown to produce high-resolution segmentations. The upsampling layer first employed in FCN is extended to deep deconvolutional networks in [126, 127], which are composed of deconvolution, unpooling and rectified linear unit (ReLU) layers. As demonstrated, the network in [126] performs significantly better than FCN, even though it requires more complex training and inference. This burden is alleviated by the architecture proposed in [127], which discards the fully connected layers of the VGG net. Furthermore, [128] presents the fully convolutional feature extractor named OverFeat, based on which an integrated framework is presented for classification, localization and detection.
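The key property of atrous (dilated) convolution mentioned above is that it enlarges the receptive field without pooling or striding, so the output keeps the input's spatial resolution. A minimal one-dimensional numpy sketch (function name and zero-padding scheme are illustrative assumptions):

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with zero padding: taps of the
    filter `w` are spaced `rate` samples apart, enlarging the receptive
    field while the output length equals the input length."""
    k = len(w)
    pad = (k - 1) * rate // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * rate] for j in range(k))
                     for i in range(len(x))])

x = np.arange(8, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), rate=2)
# y has the same length as x, yet each output sees a span of
# (k - 1) * rate + 1 = 5 input samples.
```

This is why stacking dilated convolutions with increasing rates yields dense, high-resolution score maps, in contrast to the pooled feature maps of standard DCNNs.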

An orthogonal line of work on segmentation focuses on semi-supervised or weakly-supervised learning, aiming to alleviate the problem of inadequate training data by exploiting weak annotations. These methods investigate the problem of DCNN learning for segmentation from either weakly annotated training data[129, 130] or a combination of a few strongly labeled and many weakly labeled data[131, 132]. This is reasonable, since weak annotations require only a fraction of the effort compared with strong annotations. Concerning the type of weak annotations, [129, 132] use bounding box annotations, while [130, 131] are based only on image-level annotations. Using these weak annotations, and sometimes a few strong annotations, the methods achieve results that are competitive with those of some fully supervised methods.

Girshick et al.[133] explicitly point out the difference between detection and classification: detection requires localizing objects within an image, while classification does not. To solve the detection task, a method called region convolutional neural network (RCNN) is proposed in [133]. Given an input image, the method first generates region proposals, extracts a fixed-length feature vector from each proposal using AlexNet, and then classifies each region with linear SVMs. Considering the excellent performance of the method on detection, several variants have been proposed to further improve its computational efficiency, including the spatial pyramid pooling (SPP) net[134] and faster RCNN[135].

A popular means of introducing deep learning techniques into the contour detection problem is to replace traditional hand-designed features with deep features. For example, N4-fields[136] extracts patch features using the pre-trained AlexNet, and then maps them to the nearest-neighbor annotation from a predefined dictionary. Shen et al.[137] cluster contour patches into shape classes and then learn deep features for contour detection using the proposed positive-sharing loss function. The authors adopt a convolutional neural network (CNN) architecture that is simpler than the usual one; in their opinion, a contour can be represented by a local image patch of smaller size than generic objects. While the above methods in [136, 137] can be categorized as patch-level feature extraction, the methods in [138, 139] perform pixel-level feature extraction, which extracts a deep feature per pixel and classifies it into the edge or non-edge class. In [139], a top-down detection method is proposed, which constituted the first attempt to use higher-level object cues to predict contours. The authors insist that higher-level object features lead to a substantial gain in contour detection accuracy.

Another series of studies can be summarized as image-level prediction[140-143], which predicts the entire contour map in an end-to-end manner. Motivated by the architecture of the VGG net, Xie and Tu[140] propose the holistically-nested edge detection (HED) method, which performs holistic image training and prediction. By adopting a deep supervision strategy[144] and fully utilizing multiple side outputs, the HED method achieves state-of-the-art performance. On this basis, Liu and Lew[141] suggest a similar diverse supervision strategy that fits all of the intermediate layers for edge detection. In contrast to the HED method, supervision in [141] varies from coarse to fine as the deep features become more discriminative. Liu et al.[142] propose a novel network that combines multi-scale and multi-level information from all the convolutional layers to realize image-to-image edge detection. In fact, it is the first attempt to adopt such rich convolutional features. In [7], a fully convolutional encoder-decoder network is proposed, aiming to detect higher-level object contours. The architecture takes advantage of the high efficiency of FCN[122] and the high resolution achieved by the deconvolutional network[126], yielding superior performance.
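The multi-side-output idea behind HED can be pictured with a small fusion sketch: side outputs produced at decreasing resolutions are upsampled to the input size and combined by a weighted average. HED learns its fusion weights end-to-end; here they are fixed, and the nearest-neighbor upsampling and function name are illustrative assumptions:

```python
import numpy as np

def fuse_side_outputs(side_maps, weights=None):
    """HED-style fusion sketch: upsample each side output to the largest
    resolution by nearest-neighbor repetition, then take a weighted
    average (fixed uniform weights unless given)."""
    target = max(m.shape[0] for m in side_maps)
    ups = []
    for m in side_maps:
        f = target // m.shape[0]
        ups.append(np.kron(m, np.ones((f, f))))   # nearest-neighbor upsampling
    if weights is None:
        weights = [1.0 / len(ups)] * len(ups)
    return sum(w * u for w, u in zip(weights, ups))

# Three side outputs at decreasing resolutions, as from successive stages.
sides = [np.random.rand(8, 8), np.random.rand(4, 4), np.random.rand(2, 2)]
fused = fuse_side_outputs(sides)
```

Coarse side outputs contribute context while fine ones contribute localization, which is why supervising and fusing all of them improves the final contour map.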

However, the approach proposed in [145] is very different from the approaches mentioned above and thus cannot be included in those classes. In this method, a variety of situations are defined, and a specialized object boundary detector is trained for each of them. Given an input image, the situation is first determined based on the features extracted by AlexNet, and only the boundary detectors corresponding to that situation are applied. In addition, similar to the segmentation problem, weak-supervision techniques for contour detection have also been investigated, since both segmentation masks and instance-wise boundaries are among the most expensive types of annotations. To relax the requirement of carefully annotated images, Khoreva et al.[146] show that bounding box annotations alone are sufficient to reach high-quality object boundaries. In addition, they propose several means of generating object boundary annotations with different levels of supervision.

5.2 Discussions

Recent DCNN-based methods for contour detection demonstrate promising performance improvements over traditional methods. Compared with classification networks such as AlexNet[115] and VGG net[116], these methods adopt a similar architecture but different implementation strategies. Different deep learning architectures and implementation strategies are summarized in Fig. 5[140]. In particular, Figs. 5(b) and 5(c) represent multi-level and multi-scale information, respectively. In the method described in [142], both types of information are involved in edge detection; this marks the first attempt to adopt such rich convolutional features. As pointed out by the authors, this can be seen as a development direction for fully convolutional networks, like FCN[122] and HED[140]. It is also noteworthy that in several recent methods, the deep network is employed only for feature extraction, while traditional machine learning is applied in the classification stage, such as nearest-neighbor search[136], SVM[137] and structured forests[137]. This strategy was first studied by Girshick et al.[133] and intends to further improve prediction accuracy. Another consideration is the requirement for extensive training data. This problem is more prominent for contour detection, as labeling object boundaries is one of the most expensive types of annotation. In [146], this issue is addressed, and it is shown that bounding box annotations alone are sufficient to reach high-quality object boundaries. However, research on this topic is limited. This could be another development direction for contour detection; methods that utilize weaker annotations such as image-level annotations[130] need further study.

Fig. 5. Illustration of different multi-scale deep learning architecture configurations[140]

6 Challenging issues

As concluded from the literature, a typical trend is to combine different kinds of features for contour detection. The first attempt was made in [4] and extended in [49], where a large number of simple features are employed. In [142], the proposed network uses multi-level and multi-scale information simultaneously to obtain high-quality edges. However, the sufficiency and necessity of these features have not been completely explored. Hence, it would be a promising research topic to analyze the interrelationships among these features and then reduce the feature dimension, so as to improve computational efficiency.

Another consideration is to introduce prior shape information about the contours, so that the desired contours will be detected. This is important for specialized tasks such as tracking. The methods should have the ability to reject undesired contours and to detect objects even in the presence of occlusion. A related approach is presented in [147], which aims to detect semantic contours. However, the accuracy of the approach in [147] is not yet satisfactory, which makes this a particularly attractive research area.

Finally, another future research direction is to analyze the relationship between semantic visual attributes and trained DCNNs. On this basis, human experience can be fully utilized, and the less critical features learned by a DCNN can be discarded to further improve efficiency. In addition, it would become possible to control the trade-off between recall and precision, which is important for tasks requiring high recall regardless of precision, or vice versa.

7 Conclusions

In this paper, we have discussed and reviewed various approaches for contour detection proposed in the last two decades. To distinguish contour detection from edge detection, the differences among edges, boundaries and contours are first introduced. Then, traditional contour extraction approaches, divided into three categories, are surveyed and analyzed in detail, and the differences and relations among those three categories are revealed. Furthermore, we also present a survey of recent theoretical progress in DCNN-based methods for contour detection. Finally, we suggest several potential development issues that are valuable to be further investigated.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61503378, 61473293, 51405485 and 61403378), the Project of Development in Tianjin for Scientific Research Institutes, and the Tianjin Government (No. 16PTYJGX00050).

References