2. Key Laboratory of Pattern Recognition, Beijing Academy of Science and Technology, Beijing 100094, China
Saliency detection is the problem of identifying the points that attract the visual attention of human beings. Le Callet and Niebur introduced the concepts of overt and covert visual attention and of bottom-up and top-down processing [1]. Visual attention, an important characteristic of the human visual system (HVS), selectively processes important visual information by filtering out less important information; it is one of the most important mechanisms deployed in the HVS to cope with large amounts of visual information and to reduce the complexity of scene analysis. Visual attention models have been successfully applied in many domains, including multimedia delivery, visual retargeting, quality assessment of images and videos, medical imaging, and 3D image applications [1].
Borji and Itti provided an excellent overview of the state of the art in 2D visual attention modeling, including a taxonomy of models (cognitive, Bayesian, decision theoretic, information theoretical, graphical, spectral analysis, pattern classification, and more) [2]. Many saliency measures have emerged that simulate the HVS, which tends to find the most informative regions in 2D scenes [3]-[10]. However, most saliency models disregard the fact that the HVS operates in 3D environments and can thus infer saliency only from 2D images. Eye fixation data are captured while looking at 2D scenes, but depth cues provide additional important information about the content of the visual field and can therefore also be considered relevant features for saliency detection. Stereoscopic content carries important additional binocular cues for enhancing human depth perception [11], [12]. Today, with the development of 3D display technologies and devices, various applications for 3D multimedia are emerging, such as 3D video retargeting [13], 3D video quality assessment [14], [15], and 3D ultrasound image processing [16], [17]. Overall, the emerging demand for visual attention-based applications for 3D multimedia has increased the need for computational saliency detection models for 3D multimedia content. In contrast to saliency detection for 2D images, the depth factor must be considered when performing saliency detection for RGBD images. Therefore, two important challenges in designing 3D saliency models are how to estimate saliency from depth cues and how to combine the saliency from depth features with that from other 2D low-level features.
In this paper, we propose a new computational saliency detection model for RGBD images that considers both color- and depth-based contrast features within a Bayesian framework. The main contributions of our approach are twofold: 1) to estimate saliency from depth cues, we propose creating depth feature maps based on superpixel contrast computation with spatial priors and model the depth saliency map by approximating the density of depth-based contrast features using a Gaussian distribution, and 2) by assuming that color-based and depth-based features are conditionally independent given the classes, the discriminative mixed-membership naive Bayes (DMNB) model is used to calculate the final saliency map by applying Bayes' theorem.
The remainder of this paper is organized as follows. Section 2 introduces the related work in the literature. In Section 3, the proposed model is described in detail. Section 4 provides the experimental results on eye tracking databases. The final section concludes the paper.
2 Related Work

As introduced in Section 1, many computational models of visual attention have been proposed for various 2D multimedia processing applications. However, compared with the set of 2D visual attention models, only a few computational models of 3D visual attention have been proposed [18]-[36]. These models all contain a stage in which 2D saliency features are extracted and used to compute 2D saliency maps. However, depending on how they use depth information, these models can be classified into three categories:
1) Depth-weighting Models: This type of model uses depth information to weight a 2D saliency map to calculate the final saliency map for RGBD images through feature map fusion [18]-[21]. Fang et al. proposed a 3D saliency detection framework based on color, luminance, texture and depth contrast features, with a new fusion method to combine the feature maps into the final saliency map for RGBD images [18]. Ciptadi et al. proposed a computational model of visual saliency that incorporates depth information and demonstrated the method by explicitly constructing 3D layout and shape features from depth measurements [19]. In [20], color contrast features and depth contrast features are combined through an effective multi-feature fusion to generate saliency maps, and multiscale enhancement is performed on the saliency map to further improve the detection precision for 3D salient object detection. The models in this category combine 2D features with a depth feature to calculate the final saliency map, but they do not include a depth saliency map in their computation processes. Apart from detecting salient areas using 2D visual features, these models share a common step in which depth information is used as a weighting factor for the 2D saliency.
2) Depth-pooling Models: This type of model combines depth saliency maps with traditional 2D saliency maps to obtain saliency maps for RGBD images [11], [12], [22]-[32]. Ouerhani et al. aimed to extend a visual attention model to the depth component of the scene, attempting to integrate depth into a computational model built around conspicuity and saliency maps [23]. Desingh et al. investigated the role of depth in saliency detection in the presence of competing saliencies due to appearance, depth-induced blur and centre-bias, and proposed a 3D saliency formulation used in conjunction with 2D saliency models through nonlinear regression using a support vector machine (SVM) to improve saliency [12]. Xue et al. proposed an effective visual object saliency detection model via RGB and depth cues mutually guided manifold ranking and obtained the final result by fusing RGB and depth saliency maps [24]. Ren et al. presented a two-stage 3D salient object detection framework, which first integrates the contrast region with the background, depth and orientation priors to achieve a saliency map and then reconstructs the saliency map globally [25]. Song et al. proposed an effective saliency model to detect salient regions in RGBD images through a location prior of salient objects integrated with color saliency and depth saliency to obtain the regional saliency map [26]. Guo et al. proposed a salient object detection method for RGBD images based on saliency fusion and propagation, in which saliency maps based on color cues, location cues and depth cues are independently fused to provide high-precision detection results, and saliency propagation is utilized to improve the completeness of the salient objects [27]. Fan et al. proposed an effective saliency model that combines region-level saliency maps generated using depth, color and spatial information to detect salient regions in RGBD images [28]. Peng et al. 
proposed a simple fusion framework that combines existing RGB-produced saliency with new depth-induced saliency: the former is estimated from existing RGB models, while the latter is based on a multi-contextual contrast model [29]. In [30], stereo saliency is computed based on disparity contrast analysis and domain knowledge from stereoscopic photography. Furthermore, Ju et al. proposed a saliency method that works on depth images based on anisotropic centre-surround difference [31]. Wang et al. proposed two different ways of integrating depth information in the modeling of 3D visual attention, where the measures of depth saliency are derived from eye movement data obtained from an eye tracking experiment using synthetic stimuli [32]. Lang et al. analyzed the major discrepancies between 2D and 3D human fixation data of the same scenes, which are further abstracted and modelled as novel depth priors with a mixture of Gaussians [11]. To investigate whether depth saliency is helpful for determining 3D saliency, some existing 2D saliency detection methods have been combined with it [12], [22], [31]. Iatsun et al. proposed a 3D saliency model relying on 2D saliency features jointly with depth obtained from monocular cues, in which 3D perception is significantly based on monocular cues [22]. The models in this category rely on the existence of "depth saliency maps". Depth features are extracted from the depth map to create additional feature maps, which are then used to generate depth saliency maps (DSM). These depth saliency maps are finally combined with 2D saliency maps using a saliency map pooling strategy to obtain the final 3D saliency map.
3) Learning-based Models: Instead of using a depth saliency map directly, this type of model uses machine learning techniques to build a 3D saliency detection model for RGBD images based on extracted 2D features and depth features [33]-[36]. Iatsun et al. proposed a visual attention model for 3D video using a machine learning approach, employing artificial neural networks to define adaptive weights for the fusion strategy based on eye tracking data [33]. Inspired by the recent success of machine learning techniques in building 2D saliency detection models, Fang et al. proposed a learning-based model for RGBD images using a linear SVM [34]. Zhu et al. proposed a learning-based approach for extracting saliency from RGBD images, in which discriminative features are automatically selected by learning several decision trees from the ground truth, and these features are further utilized to search for the saliency regions via the predictions of the trees [35]. Bertasius et al. developed an EgoObject representation, which encodes shape, location, size and depth features from an egocentric RGBD image, and trained a random forest regressor to predict the saliency of a region using ground truth salient objects [36].
As the above discussion shows, the key to 3D saliency detection models is determining how to integrate depth cues with traditional 2D low-level features. In this paper, we propose a learning-based 3D saliency detection model within a Bayesian framework that considers both color- and depth-based contrast features. Instead of simply combining a depth map with 2D saliency maps as in previous studies, we propose a computational saliency detection model for RGBD images based on the DMNB model [37]. Experimental results on a public eye tracking database demonstrate the improved performance of the proposed model over other strategies.
3 The Proposed Approach

In this section, we introduce a method that integrates the color saliency probability with the depth saliency probability computed from Gaussian distributions based on multiscale superpixel contrast features and yields a prediction of the final 3D saliency map using the DMNB model within a Bayesian framework. First, the input RGBD images are represented by superpixels using multiscale segmentation, and the color and depth feature maps are computed by weighted summation and normalization of the color- and depth-based contrast features, respectively, at different scales. Second, the probability distributions of the color and depth saliency are modelled as Gaussian distributions based on the color and depth feature maps, respectively. The parameters of the Gaussian distributions can be estimated in the DMNB model using a variational inference-based expectation maximization (EM) algorithm. The general architecture of the proposed framework is presented in Fig. 1.
We introduce a color-based contrast feature and a depth-based contrast feature to capture the contrast information of salient regions with spatial priors based on multiscale superpixels, which are generated at various grid interval parameters
1) Multiscale Superpixel Segmentation of RGBD Images: For an RGBD image pair, superpixels are segmented according to both color and depth cues. We notice that when the SLIC algorithm [38] is applied directly to the RGB image and depth map, the segmentation result is unsatisfactory due to the lack of a mutual context relationship. We redefine the distance measurement to incorporate depth as shown in (1):
$ D_s=\sqrt{d_{lab}^2+\omega_dd_d^2+\frac{m}{\mathcal{S}}d_{xy}^2} $  (1) 
where
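As a concrete illustration of (1), the following Python/NumPy sketch computes the depth-augmented distance between a pixel and a cluster centre. The weight `w_d`, compactness `m` and grid interval `S` are hypothetical settings (the paper's implementation is in MATLAB and its parameter values are not reproduced here).

```python
import numpy as np

def slic_distance(lab1, lab2, d1, d2, xy1, xy2, w_d=1.0, m=10.0, S=20.0):
    """Depth-augmented SLIC distance of Eq. (1).

    lab1/lab2: CIELab color vectors; d1/d2: scalar depths;
    xy1/xy2: pixel coordinates; w_d, m, S are hypothetical settings.
    """
    d_lab = np.linalg.norm(np.asarray(lab1) - np.asarray(lab2))  # color term
    d_depth = abs(d1 - d2)                                       # depth term
    d_xy = np.linalg.norm(np.asarray(xy1) - np.asarray(xy2))     # spatial term
    return np.sqrt(d_lab ** 2 + w_d * d_depth ** 2 + (m / S) * d_xy ** 2)
```

With equal depths and coordinates, the distance reduces to the plain CIELab distance, which matches the structure of (1).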
Algorithm 1. Superpixel segmentation of the RGBD images 
Input: Initialization: Initialize clusters Set label Output: 1: Perturb cluster centres in a 2: for 3: for each cluster centre 4: Assign the best matching pixels from a for each pixel p in a Compute the distance if Set Set end if end for 5: end for 6: Compute new cluster centres. After all the pixels are associated with the nearest cluster center, a new center is computed as the average 7: end for 8: Enforce connectivity. 
We obtain more accurate segmentation results as shown in Fig. 2 by considering the color and depth cues simultaneously. The boundary between the foreground and the background is segmented more accurately.
2) Color-based Contrast Feature: An input image is oversegmented at
$ f(p_c^l)=\omega_c^lSC_{GMR}^l $  (2) 
where
$ SC_{GMR}^l=SC_{GMR_{t}}^l\times SC_{GMR_{l}}^l\times SC_{GMR_{r}}^l\times SC_{GMR_{b}}^l $  (3) 
where
With multiscale fusion, the color feature map is constructed by weighted summation of
3) Depth-based Contrast Feature: Similar to the construction of the color feature maps, we formulate the depth feature maps based on multiscale superpixels in the depth maps:
$ f(p_d^l)=\omega_d^lSD_{GMR}^l $  (4) 
where
$ \omega_{ij}^l=e^{-\frac{(\overline{d}_j^l-\overline{d}_i^l)^2}{\sigma^2}} $  (5) 
where
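The Gaussian similarity weight in (5) can be sketched as follows; the bandwidth `sigma` is a hypothetical choice, since the paper's value is not reproduced here.

```python
import numpy as np

def depth_weight(mean_d_i, mean_d_j, sigma=0.5):
    """Gaussian weight of Eq. (5) between the mean depths of
    superpixels i and j at scale l (sigma is a hypothetical bandwidth)."""
    return np.exp(-(mean_d_j - mean_d_i) ** 2 / sigma ** 2)
```

The weight is 1 for superpixels at the same mean depth and decays as the depth difference grows, so nearby-in-depth regions contribute more to the contrast.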
4) Bayesian Framework for Saliency Detection: Let the binary random variable
$ p(\pmb{z}_s|\pmb{x}_c, \pmb{x}_d) = \frac{p(\pmb{z}_s, \pmb{x}_c, \pmb{x}_d)}{p(\pmb{x}_c, \pmb{x}_d)} $  (6) 
where
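Under the conditional-independence assumption adopted later in this section, the posterior in (6) factorizes over the two features. A minimal sketch of the resulting pixel-wise posterior, assuming the class-conditional likelihoods have already been evaluated, is:

```python
def saliency_posterior(prior_s, p_xc_given_s, p_xd_given_s,
                       p_xc_given_b, p_xd_given_b):
    """Pixel-wise posterior of Eq. (6) with conditionally independent
    color (xc) and depth (xd) features:
    p(z=1|xc,xd) ∝ p(z=1) p(xc|z=1) p(xd|z=1), 'b' denotes background."""
    num = prior_s * p_xc_given_s * p_xd_given_s
    den = num + (1.0 - prior_s) * p_xc_given_b * p_xd_given_b
    return num / den
```

For example, with an uninformative prior of 0.5, a pixel whose color and depth likelihood ratios both favor the salient class receives a posterior well above 0.5.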
In this paper, the class-conditional mutual information (CMI) is used as a measure of the dependence between two features
We employ a CMI threshold
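The paper's exact CMI estimator is not reproduced here, but a simple histogram-based estimate of I(X_c; X_d | Z) from samples can be sketched as follows; the bin count is a hypothetical choice.

```python
import numpy as np

def class_conditional_mi(xc, xd, z, bins=8):
    """Histogram estimate of the class-conditional mutual information
    I(Xc; Xd | Z). xc, xd: 1-D feature arrays; z: class labels."""
    xc_b = np.digitize(xc, np.linspace(xc.min(), xc.max(), bins))
    xd_b = np.digitize(xd, np.linspace(xd.min(), xd.max(), bins))
    cmi = 0.0
    for c in np.unique(z):
        m = z == c
        pz = m.mean()                                  # p(Z = c)
        joint, _, _ = np.histogram2d(xc_b[m], xd_b[m], bins=bins)
        joint /= joint.sum()                           # p(xc, xd | Z = c)
        px = joint.sum(axis=1, keepdims=True)          # marginals
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        cmi += pz * (joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum()
    return cmi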
By assuming that color and depth features are conditionally independent given the class, the DMNB model is adopted to calculate the final saliency map from the depth saliency probability and the color saliency probability by applying Bayes' theorem. DMNB can be considered a generalization of the naive Bayes (NB) classifier, extended in the following aspects. First, NB shares one component among all features, whereas DMNB has a separate component for each feature and maintains a Dirichlet-multinomial prior over all possible combinations of component assignments. Second, NB uses the shared component as a class indicator, whereas DMNB uses the mixed membership over the separate components as the input to a logistic regression model, which finally generates the class label. In this paper, the DMNB model uses a Gaussian distribution for each color and depth feature and is applied to predict the final saliency map.
Given the graphical model of DMNB for saliency detection shown in Fig. 6, the generative process for
Algorithm 2. Generative process for saliency detection following the DMNB model 
1: Input: 2: Choose a component proportion: 3: For each feature: choose a component choose a feature value 4: Choose the label: 
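The generative process of Algorithm 2 can be sketched as follows under the Gaussian assumption used in this paper; all parameter values passed in are hypothetical, and the sketch is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dmnb_generate(alpha, mu, sigma, eta):
    """One draw from the DMNB generative process (Algorithm 2):
    theta ~ Dirichlet(alpha); for each feature j, z_j ~ Mult(theta) and
    x_j ~ N(mu[j, z_j], sigma[j, z_j]); y ~ Bernoulli(sigmoid(eta^T zbar))."""
    K, N = len(alpha), mu.shape[0]
    theta = rng.dirichlet(alpha)                    # component proportions
    z = rng.choice(K, size=N, p=theta)              # component per feature
    x = rng.normal(mu[np.arange(N), z], sigma[np.arange(N), z])
    zbar = np.bincount(z, minlength=K) / N          # mixed membership
    y = rng.random() < 1.0 / (1.0 + np.exp(-eta @ zbar))
    return x, int(y)
```

Note how the label depends on the *average* component assignment `zbar`, which is exactly the mixed-membership input to the logistic regression stage described above.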
In this work, each feature
$ \begin{align}\label{eqn_graph} &p(\pmb{x}_{1:N}, \pmb{y}|\alpha, \Omega, \eta)=&\nonumber\\ &\qquad \int p(\theta|\alpha)\left(\prod\limits_{j=1}^N\sum\limits_{\pmb{z}_j} p(\pmb{z}_j|\theta)p(\pmb{x}_j|\pmb{z}_j, \Omega_j)p(\pmb{y}|\pmb{z}_j, \eta)\right)d\theta& \end{align} $  (7) 
where
Due to the latent variables, the computation of the likelihood in (7) is intractable. In this paper, we use a variational inference method, which alternates between obtaining a tractable lower bound on the true log-likelihood and choosing the model parameters to maximize that lower bound.
For each feature value, to obtain a tractable lower bound to
$ \begin{align}\label{eqn_inequality} &\log p(\pmb{y}, \pmb{x}_{1:N}|\alpha, \Omega, \eta)\nonumber\\ &\qquad\geq{E}_q(\log p(\pmb{y}, \pmb{x}_{1:N}, \pmb{z}_{1:N}|\alpha, \Omega, \eta))+\pmb{H}(q(\pmb{z}_{1:N}, \theta|\gamma, \phi)). \end{align} $  (8) 
Noticing that
$ q(\pmb{z}_{1:N}, \theta|\gamma, \phi)=q(\theta|\gamma)\prod\limits_{j=1}^Nq(\pmb{z}_j|\phi) $  (9) 
where
$ \begin{align}\label{eqn_lowerbound} \mathcal{L}=&\ {E}_q[\log p(\theta|\alpha)]+{E}_q[\log p(\pmb{z}_{1:N}|\theta)]&\nonumber\\ &\, +{E}_q[\log p(\pmb{x}_{1:N}|\pmb{z}_{1:N}, \gamma)]-{E}_q[\log q(\theta)]&\nonumber\\ &\, -{E}_q[\log q(\pmb{z}_{1:N})]+{E}_q[\log p(\pmb{y}|\pmb{z}_{1:N}, \eta)]& \end{align} $  (10) 
where
$ \phi_k\propto e^{(\Psi(\gamma_k)-\Psi(\sum\limits_{l=1}^K\gamma_l)+\frac{1}{N}(\eta_k\pmb{y}_i-\frac{e^{\eta_k}}{\xi}-\sum\limits_{j=1}^N\frac{(\pmb{x}_{ij}-\mu_{jk})^2}{2\sigma_{jk}^2}))} $  (11) 
$ \gamma_k=\alpha + N\phi_k $  (12) 
$ \xi=1+\sum\limits_{k=1}^K\phi_{k}e^{\eta_k}. $  (13) 
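Putting (11)-(13) together, one coordinate-ascent pass of the variational E-step for a single instance can be sketched as follows (Python/NumPy, with the digamma function Ψ from SciPy); the grouping of terms in the exponent follows the displayed form of (11), and the iteration count is a hypothetical setting.

```python
import numpy as np
from scipy.special import digamma  # the digamma function, Psi

def e_step(x, y, gamma, alpha, mu, sigma, eta, iters=20):
    """Coordinate-ascent updates of Eqs. (11)-(13) for a single instance.
    x: (N,) feature values, y: scalar label, mu/sigma: (N, K), eta: (K,)."""
    N, K = len(x), len(eta)
    phi = np.full(K, 1.0 / K)
    xi = 1.0 + phi @ np.exp(eta)
    for _ in range(iters):
        sq = ((x[:, None] - mu) ** 2 / (2.0 * sigma ** 2)).sum(axis=0)
        log_phi = (digamma(gamma) - digamma(gamma.sum())
                   + (eta * y - np.exp(eta) / xi - sq) / N)  # Eq. (11) exponent
        phi = np.exp(log_phi - log_phi.max())
        phi /= phi.sum()                                     # normalize Eq. (11)
        gamma = alpha + N * phi                              # Eq. (12)
        xi = 1.0 + phi @ np.exp(eta)                         # Eq. (13)
    return phi, gamma, xi
```

Subtracting the maximum before exponentiating keeps the normalization of φ numerically stable, a standard trick that does not change the fixed point of the updates.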
The variational parameters
Variational parameters
$ \begin{align*} &\mu_{jk}=\frac{\sum\limits_{i=1}^\mathcal{M}\phi_{ik}\pmb{x}_{ij}}{\sum\limits_{i=1}^\mathcal{M}\phi_{ik}}\\[2mm] &\sigma_{jk}=\frac{\sum\limits_{i=1}^\mathcal{M}\phi_{ik}(\pmb{x}_{ij}-\mu_{jk})^2}{\sum\limits_{i=1}^\mathcal{M}\phi_{ik}}\\[2mm] &\eta_k=\log\left(\frac{\sum\limits_{i=1}^\mathcal{M}\phi_{ik}\pmb{y}_i}{\sum\limits_{i=1}^\mathcal{M}\frac{\phi_{ik}}{\xi_i}}\right).\end{align*} $ 
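These closed-form M-step updates translate directly into code; the sketch below vectorizes them over M training instances (a Python/NumPy illustration of the equations above, not the authors' MATLAB implementation).

```python
import numpy as np

def m_step(X, y, phi, xi):
    """Closed-form M-step for the DMNB Gaussian parameters and eta.
    X: (M, N) feature values, y: (M,) labels, phi: (M, K), xi: (M,)."""
    Nk = phi.sum(axis=0)                                   # effective counts (K,)
    mu = (phi.T @ X).T / Nk                                # mu_{jk}, shape (N, K)
    dev = (X[:, :, None] - mu[None, :, :]) ** 2            # squared deviations
    sigma = np.einsum('ik,ijk->jk', phi, dev) / Nk         # sigma_{jk}, (N, K)
    eta = np.log((phi * y[:, None]).sum(axis=0)
                 / (phi / xi[:, None]).sum(axis=0))        # eta_k, (K,)
    return mu, sigma, eta
```

With a single component and unit responsibilities, μ and σ reduce to the sample mean and mean squared deviation, as the update formulas require.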
Based on the variational inference and parameter estimation updates, it is straightforward to construct a variational EM algorithm to estimate
Algorithm 3. Variational EM algorithm for DMNB 
1: repeat 2: Estep: Given Then, 3: Mstep: Improved estimates of the model parameters 4: until 
After obtaining the DMNB model parameters from the EM algorithm, we can use
$ \begin{align}\label{eqn_prediction} & {E}[\log p(\pmb{y}|\pmb{x}_{1:N}, \alpha, \Omega, \eta)]\notag\\[2mm] &\qquad=\begin{cases} \eta^T{E}[\overline{\pmb{z}}]-{E}[\log(1+e^{\eta^T\overline{\pmb{z}}})], &{\pmb{y}=1}\\[2mm] -{E}[\log(1+e^{\eta^T\overline{\pmb{z}}})], &{\pmb{y}=0} \end{cases} \end{align} $  (14) 
where
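A plug-in version of the decision rule in (14), replacing the expectations with the variational mean membership z̄, can be sketched as:

```python
import numpy as np

def predict_saliency(zbar, eta):
    """Label decision from Eq. (14): compare the expected log-likelihoods
    of y=1 and y=0 given the mean membership vector zbar (plug-in
    approximation of the expectations). Returns (label, probability)."""
    s = eta @ zbar
    log_p1 = s - np.log1p(np.exp(s))   # eta^T zbar - log(1 + e^{eta^T zbar})
    log_p0 = -np.log1p(np.exp(s))      #            - log(1 + e^{eta^T zbar})
    label = 1 if log_p1 > log_p0 else 0
    return label, 1.0 / (1.0 + np.exp(-s))
```

Comparing the two branches of (14) amounts to checking the sign of η^T z̄, i.e., the decision boundary of the logistic regression stage.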
1) Dataset: In this section, we conduct experiments to demonstrate the performance of our method. We use four databases to evaluate the proposed model, as shown in Table Ⅱ. We distinguish between two cases. The first case includes images that show a single salient object over an uninteresting background. For such images, we expect that only the object's pixels will be identified as salient. This case is represented by the NLPR dataset^{1} and the NJUDS400 dataset^{2}. The NLPR dataset includes 1000 images of diverse scenes in real 3D environments, where the ground truth was obtained by asking five participants to select the regions where objects are present, i.e., the salient regions were marked by hand. The NJUDS400 dataset includes 400 images of different scenes, where the ground truth was obtained by four volunteers labelling the salient object masks. The second case includes images of complex scenes. The EyMIR dataset^{3} and NUS dataset^{4} are somewhat different: in these datasets, the images were presented to human observers for several seconds each, and eye tracking data were collected and averaged. In the NUS dataset, Lang et al. collected a large human eye fixation database from a pool of 600 2D-vs-3D image pairs viewed by 80 subjects, where the depth information is directly provided by the Kinect camera and the eye tracking data are captured in both 2D and 3D free-viewing experiments. In the EyMIR dataset, 10 images were selected from the Middlebury 2005/2006 image dataset, and the rest of the database consists of the set of images from the IVC 3D image dataset, which contains two outdoor scenes and six indoor scenes. To create the ground-truth map, observers viewed the stereoscopic stimuli through a pair of passive polarized glasses at a distance for 15 seconds.
^{1}http://sites.google.com/site/rgbdsaliency
^{2}http://mcg.nju.edu.cn/en/resource.html
^{3}http://www.irccyn.ecnantes.fr/spip.php?article1102&lang=en
^{4}https://sites.google.com/site/vantam/nus3dsaliencydataset
2) Evaluation Metrics: To date, there are no specific and standardized measures for computing the similarity between fixation density maps and saliency maps created by computational models in 3D situations. Nevertheless, a range of measures is widely used to compare saliency maps for 2D content. We introduce two types of measures to evaluate algorithm performance on the benchmark. The first is the gold standard: the F-measure, the overall performance measurement computed as the weighted harmonic mean of precision and recall:
$ F_\beta=\frac{(1+\beta^2)Precision\times Recall}{\beta^2Precision+Recall} $  (15) 
where we set
$ T=\frac{2}{W\times H}\sum\limits_{x=1}^W\sum\limits_{y=1}^HS(x, y) $  (16) 
where
The second is the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). By thresholding over the saliency maps and plotting true positive rate vs. false positive rate, an ROC curve is acquired. The AUC score is calculated as the area underneath the ROC.
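Both metrics are straightforward to implement. The sketch below uses the adaptive threshold of (16), twice the mean saliency, for the F-measure (with β² = 0.3 as an assumed value, the choice common in the salient object detection literature) and the rank-sum identity for the AUC.

```python
import numpy as np

def f_measure(sal, gt, beta2=0.3):
    """F-measure of Eq. (15) with the adaptive threshold of Eq. (16),
    i.e., twice the mean saliency. beta2 = 0.3 is an assumed setting."""
    t = 2.0 * sal.mean()                       # Eq. (16)
    pred = sal >= t
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def auc_score(sal, gt):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity,
    equivalent to sweeping all thresholds over the saliency map."""
    ranks = np.argsort(np.argsort(sal.ravel())) + 1
    pos = gt.ravel().astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A saliency map that perfectly matches the binary ground truth scores 1.0 on both metrics, which is a useful sanity check for any implementation.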
3) Parameter Setting: To evaluate the quality of the proposed approach, we divided the datasets into two subsets according to their CMI values, and we held out 90
Our algorithm is implemented in MATLAB v7.12 and tested on an Intel Core(TM)2 Quad CPU at 2.33 GHz with 4 GB RAM. A simple computational comparison on the EyMIR, NUS, NLPR and NJUDS400 datasets is shown in Table Ⅲ. It should be noted that much work remains for computational optimization, including optimization of the prior parameters and of the variational inference algorithm used during prediction.
4) The Effect of the Parameters: In particular, we performed the experiments while varying
The parameter
The parameter
We are also interested in the contributions of the different features in our model. The ROC curves of saliency estimation from the individual features are shown in Fig. 10. The color and depth saliency maps show comparable performance, whereas their combination produces a much better result.
During the experiments, we compare our algorithm with five state-of-the-art saliency detection methods, among which three were developed for RGBD images and two for traditional 2D image analysis. One RGBD method performs saliency detection at the low-level, mid-level, and high-level stages and is therefore referred to as LMH [29]. Another RGBD method is based on anisotropic centre-surround difference and is denoted ACSD [31]. The third RGBD method exploits global priors, including the background, depth, and orientation priors, to achieve a saliency map and is denoted GP [25]. The two 2D methods are the frequency-tuned method of Achanta et al. [5], denoted FT, and the graph-based manifold ranking approach [10], denoted GMR. For the two 2D saliency approaches, we also add and multiply their results with the DSM produced by our proposed depth feature map; these results are denoted FT
Fig. 11 compares our results with FT [5], FT
The comparison with the ACSD [31], LMH [29] and GP [25] RGBD approaches is presented in Figs. 12-15. ACSD works on depth images under the assumption that salient objects tend to stand out from the surrounding background, taking relative depth into consideration. In Fig. 13, ACSD generates unsatisfactory results without color cues. LMH uses a simple fusion framework that takes advantage of both depth and appearance cues from the low, mid, and high levels. In [29], the background is nicely excluded; however, many pixels on the salient object are not detected as salient, as shown in Fig. 14. Ren et al. proposed two priors, namely the normalized depth prior and the global-context surface orientation prior [25]. Because their approach relies on these two priors, it has problems when the priors are invalid, as shown in Fig. 12. We can see that the proposed method can accurately locate the salient objects and produce nearly equal saliency values for the pixels within the target objects.
1) Comparison of the 2D Models Combined With DSM: In this experiment, we first compare the performances of existing 2D saliency models before and after DSM fusing. We select two state-of-the-art 2D visual attention models: FT [5] and GMR [10].
Figs. 16 and 17 present the experimental results, where
2) Comparison of 3D Models: To obtain a quantitative evaluation, we compared ROC curves and F-measures on the EyMIR, NUS, NLPR and NJUDS400 datasets. We compared the proposed model with the other existing models, i.e., GP, LMH, and ACSD, described in [25], [29] and [31], respectively. In this paper, the GP, LMH and ACSD models are classified as depth-pooling models. Figs. 18 and 19 show the quantitative comparisons among these methods on the constructed RGBD datasets in terms of ROC curves and F-measures. Methods such as [31] are not designed for such complex scenes but rather for single dominant-object images. For the case of a single salient object over an uninteresting background in the NJUDS400 dataset, ACSD presents impressive results, as shown in Figs. 18(d) and 19(d). In the NJUDS400 dataset, we do not have experimental results for the LMH [29] and GP [25] methods due to the lack of depth information, which is required by their codes.
Due to the lack of global-context surface orientation priors in the EyMIR dataset, GP [25] is not able to apply the orientation prior to refine the saliency detection and thus has lower performance than the ACSD method, as shown in Figs. 18(a) and 19(a). Interestingly, the LMH method, which fuses depth and RGB saliency by simple multiplication within a Bayesian formulation, performs worse than the GP method, which uses a Markov random field model as its fusion strategy, as shown in Figs. 18(c) and 19(c). However, LMH and GP achieve better performance than ACSD by using fusion strategies. The proposed RGBD method is superior to the baselines in terms of all the evaluation metrics. Although the ROC curves are very similar, Fig. 19 shows that the proposed method improves the recall and F-measure compared to LMH and GP, particularly on the NLPR dataset. This is mainly because the feature extraction using multiscale superpixels enhances the consistency and compactness of the salient patches.
3) Limitations: Because our approach requires training on large datasets to adapt to specific environments, properly tuning the parameters for new tasks is important to the performance of the DMNB model. The DMNB model performs classification in one shot via a combination of mixed-membership models and logistic regression, where the results may depend on different choices of
In this study, we proposed a saliency detection model for RGBD images that considers both color- and depth-based contrast features within a Bayesian framework. The experiments verify that the proposed model's depth-produced saliency can serve as a helpful complement to existing color-based saliency models. Compared with other competing 3D models, the experimental results on four recent eye tracking databases show that the performance of the proposed saliency detection model is promising. We hope that our work helps stimulate further research in the area of 3D saliency detection.
References

[1] P. Le Callet and E. Niebur, "Visual attention and applications in multimedia technologies," Proc. IEEE, 2013, 101(9): 2058-2067. DOI:10.1109/JPROC.2013.2265801
[2] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35(1): 185-207. DOI:10.1109/TPAMI.2012.89
[3] A. Borji, D. N. Sihite, and L. Itti, "Salient object detection: A benchmark," in Proc. 12th European Conf. Computer Vision, Florence, Italy, 2012, pp. 414-429.
[4] L. Itti, C. Koch and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., 1998, 20(11): 1254-1259. DOI:10.1109/34.730558
[5] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proc. 2009 IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 1597-1604.
[6] X. D. Hou and L. Q. Zhang, "Saliency detection: A spectral residual approach," in Proc. 2007 IEEE Conf. Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 2007, pp. 1-8.
[7] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2006, pp. 545-552.
[8] M. M. Cheng, G. X. Zhang, N. J. Mitra, X. L. Huang, and S. M. Hu, "Global contrast based salient region detection," in Proc. 2011 IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, 2011, pp. 409-416.
[9] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," in Proc. 2010 IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010, pp. 2376-2383.
[10] C. Yang, L. H. Zhang, H. C. Lu, X. Ruan, and M. H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. 2013 IEEE Conf. Computer Vision and Pattern Recognition, Portland, OR, USA, 2013, pp. 3166-3173.
[11] C. Y. Lang, T. V. Nguyen, H. Katti, K. Yadati, M. Kankanhalli, and S. C. Yan, "Depth matters: Influence of depth cues on visual saliency," in Proc. 12th European Conf. Computer Vision, Florence, Italy, 2012, pp. 101-115.
[12] K. Desingh, K. M. Krishna, D. Rajan, and C. V. Jawahar, "Depth really matters: Improving visual salient region detection with depth," in Proc. 2013 British Machine Vision Conf., Bristol, England, 2013, pp. 98.1-98.11.
[13] J. L. Wang, Y. M. Fang, M. Narwaria, W. S. Lin, and P. Le Callet, "Stereoscopic image retargeting based on 3D saliency detection," in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014, pp. 669-673.
[14] H. Kim, S. Lee and A. C. Bovik, "Saliency prediction on stereoscopic videos," IEEE Trans. Image Process., 2014, 23(4): 1476-1490. DOI:10.1109/TIP.2014.2303640
[15] Y. Zhang, G. Y. Jiang, M. Yu, and K. Chen, "Stereoscopic visual attention model for 3D video," in Proc. 16th Int. Multimedia Modeling Conf., Chongqing, China, 2010, pp. 314-324.
[16] M. Uherčík, J. Kybic, Y. Zhao, C. Cachard and H. Liebgott, "Line filtering for surgical tool localization in 3D ultrasound images," Comput. Biol. Med., 2013, 43(12): 2036-2045. DOI:10.1016/j.compbiomed.2013.09.020
[17] Y. Zhao, C. Cachard and H. Liebgott, "Automatic needle detection and tracking in 3D ultrasound using an ROI-based RANSAC and Kalman method," Ultrason. Imaging, 2013, 35(4): 283-306. DOI:10.1177/0161734613502004
[18] Y. M. Fang, J. L. Wang, M. Narwaria, P. Le Callet and W. S. Lin, "Saliency detection for stereoscopic images," IEEE Trans. Image Process., 2014, 23(6): 2625-2636. DOI:10.1109/TIP.2014.2305100
[19] A. Ciptadi, T. Hermans, and J. M. Rehg, "An in depth view of saliency," in Proc. 2013 British Machine Vision Conf., Bristol, England, 2013, pp. 913.
[20] P. L. Wu, L. L. Duan, and L. F. Kong, "RGB-D salient object detection via feature fusion and multi-scale enhancement," in Proc. 2015 Chinese Conf. Computer Vision, Xi'an, China, 2015, pp. 359-368.
[21] F. F. Chen, C. Y. Lang, S. H. Feng, and Z. H. Song, "Depth information fused salient object detection," in Proc. 2014 Int. Conf. Internet Multimedia Computing and Service, Xiamen, China, 2014, pp. 66.
[22] I. Iatsun, M. C. Larabi, and C. Fernandez-Maloigne, "Using monocular depth cues for modeling stereoscopic 3D saliency," in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014, pp. 589-593.
[23] N. Ouerhani and H. Hugli, "Computing visual attention from scene depth," in Proc. 15th Int. Conf. Pattern Recognition, Barcelona, Spain, 2000, pp. 375-378.
[24] H. Y. Xue, Y. Gu, Y. J. Li, and J. Yang, "RGB-D saliency detection via mutual guided manifold ranking," in Proc. 2015 IEEE Int. Conf. Image Processing, Quebec City, QC, Canada, 2015, pp. 666-670.
[25] J. Q. Ren, X. J. Gong, L. Yu, W. H. Zhou, and M. Y. Yang, "Exploiting global priors for RGB-D saliency detection," in Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 2015, pp. 25-32.
[26] H. K. Song, Z. Liu, H. Du, G. L. Sun, and C. Bai, "Saliency detection for RGBD images," in Proc. 7th Int. Conf. Internet Multimedia Computing and Service, Zhangjiajie, Hunan, China, 2015, Article ID 72.
[27] J. F. Guo, T. W. Ren, J. Bei, and Y. J. Zhu, "Salient object detection in RGB-D image based on saliency fusion and propagation," in Proc. 7th Int. Conf. Internet Multimedia Computing and Service, Zhangjiajie, Hunan, China, 2015, Article ID 59.
[28] X. X. Fan, Z. Liu, and G. L. Sun, "Salient region detection for stereoscopic images," in Proc. 19th Int. Conf. Digital Signal Processing, Hong Kong, China, 2014, pp. 454-458.
[29] H. W. Peng, B. Li, W. H. Xiong, W. M. Hu, and R. R. Ji, "RGBD salient object detection: A benchmark and algorithms," in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 92-109.
[30] Y. Z. Niu, Y. J. Geng, X. Q. Li, and F. Liu, "Leveraging stereopsis for saliency analysis," in Proc. 2012 IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 454-461.
[31] R. Ju, L. Ge, W. J. Geng, T. W. Ren, and G. S. Wu, "Depth saliency based on anisotropic center-surround difference," in Proc. 2014 IEEE Int. Conf. Image Processing, Paris, France, 2014, pp. 1115-1119.
[32] J. L. Wang, M. P. Da Silva, P. Le Callet and V. Ricordel, "Computational model of stereoscopic 3D visual saliency," IEEE Trans. Image Process., 2013, 22(6): 2151-2165. DOI:10.1109/TIP.2013.2246176
[33] I. Iatsun, M. C. Larabi, and C. Fernandez-Maloigne, "Visual attention modeling for 3D video using neural networks," in Proc. 2014 Int. Conf. 3D Imaging, Liege, Belgium, 2014, pp. 1-8.
[34] Y. M. Fang, W. S. Lin, Z. J. Fang, P. Le Callet, and F. N. Yuan, "Learning visual saliency for stereoscopic images," in Proc. 2014 IEEE Int. Conf. Multimedia and Expo Workshops, Chengdu, China, 2014, pp. 1-6.
[35] L. Zhu, Z. G. Cao, Z. W. Fang, Y. Xiao, J. Wu, H. P. Deng, and J. Liu, "Selective features for RGB-D saliency," in Proc. 2015 Chinese Automation Congr., Wuhan, China, 2015, pp. 512-517.
[36] G. Bertasius, H. S. Park, and J. B. Shi, "Exploiting egocentric object prior for 3D saliency detection," arXiv preprint arXiv:1511.02682, 2015.
[37] H. H. Shan, A. Banerjee, and N. C. Oza, "Discriminative mixed-membership models," in Proc. 2009 IEEE Int. Conf. Data Mining, Miami, FL, USA, 2009, pp. 466-475.
[38] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34(11): 2274-2282. DOI:10.1109/TPAMI.2012.120
[39] I. Rish, "An empirical study of the naive Bayes classifier," J. Univ. Comput. Sci., 2001, 3(22): 41-46.
[40] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayes. Anal., 2006, 1(1): 121-143. DOI:10.1214/06-BA104