^{2} Nanjing University of Science and Technology, Nanjing 210094, China
Mapping, as the means by which a mobile robot perceives and understands a scene, is the premise and basis for correct motion decisions.
In the past, a significant amount of work has been done on building 2D grid maps, and many effective navigation and planning algorithms based on 2D grid maps have been developed. At the same time, the increasing complexity of application scenarios demands a more powerful perception of the surroundings. Therefore, 3D reconstruction of the environment has become an important way to improve navigation and localization accuracy in mobile robot applications. However, traditional 3D reconstruction focuses on recovering the spatial structure of the environment and ignores the necessary understanding of it, which seriously restricts the practical application of 3D maps.
Against this background, this paper proposes a new method of constructing a "dense 3D semantic map" by introducing semantic labels into the existing 3D reconstruction. Maps provided with semantic labels are called semantic maps. Semantic maps provide semantic information for scene understanding, which can improve the efficiency of robot navigation, localization and automatic driving applications, such as lane recognition and free space extraction. Semantic information is also important for robots that interact with objects in the scene, including target grasping, change detection^{[1, 2]} and object search^{[3]}.
The system uses a binocular camera to capture color images as input, and each occupied voxel contains a predefined category, such as vegetation, vehicle and road. Meanwhile, the system is able to detect and remove dynamic obstacles, which reduces the impact of moving vehicles and pedestrians. Our system uses a hash table as the data storage structure to avoid dependence upon CPU memory. In addition, we adopt the recent deep learning based image semantic segmentation algorithm SegNet^{[4]}, which enables processing speed close to real time.
Fig. 1 shows the process of dense 3D semantic mapping based on the KITTI dataset^{[5]} in an outdoor scene, which includes two parts: the construction of the local semantic map (single frame) and the construction of the global semantic map (multiple frames).
Fig. 1. Architecture of the dense 3D semantic map reconstruction 
Our system decomposes the construction of the local semantic map into four parallel steps: 3D reconstruction, image semantic segmentation, image motion segmentation and camera motion estimation. Then, a fully connected voxel conditional random field (voxel-CRF) inference algorithm is used to optimize the semantic labels by considering the spatial relations of the point cloud obtained from the 3D reconstruction. After that, we use the semantic information to refine the image motion segmentation, which yields a motion mask to remove dynamic obstacles. We keep the static point set as the local 3D semantic map. Incremental dense 3D semantic mapping for large-scale scenarios is challenging, so we fuse a keyframe sequence into the system. Specifically, we use visual odometry to estimate the camera pose of each frame, which converts the local semantic map into global map coordinates.
The paper is organized as follows. Section 3 introduces the 3D reconstruction method used in this paper. Semantic segmentation and its refinement are introduced in Sections 4 and 5. Then, motion segmentation and its refinement are introduced in Sections 6 and 7, respectively. In Section 8, we present an efficient keyframe fusion method. Finally, the experimental results are shown in Section 9.
2 Related works
A dense 3D map can provide more information for the robot.
As dynamic obstacles directly affect the precision of environment map construction, reliable real-time detection of dynamic obstacles is fundamental to mapping an unknown environment. Most existing methods^{[12]} do not consider dynamic obstacles, which is incomplete for the actual mapping process. Reference [13] focuses more on estimating the trajectories of the camera and the moving objects, with less consideration of the speed and accuracy of scene reconstruction.
Offline map construction methods^{[1, 14]} can usually build a semantic map of a wide-scale scene with high accuracy, but they cannot be used under real-time conditions. A common way to construct a map online is simultaneous localization and mapping (SLAM)^{[15]} based on sparse feature points, but such a map does not contain most environment details, such as traffic lights and telegraph poles.
Another important point is that the dense 3D semantic mapping of [16] claims to run online, but it is not a complete online pipeline in the strict sense: the method requires the image sequences to be obtained beforehand and performs offline feature extraction for image semantic segmentation. To solve the above-mentioned problems, we propose an effective method for the construction of dense 3D semantic maps.
3 3D reconstruction
The input of our system is rectified stereo color images. The model used for stereo vision depth measurement is shown in Fig. 2.
Fig. 2. Depth measurement using stereo vision 
$ \begin{equation} \left[ \begin{array}{c} x_p \\ y_p \\ z_p \end{array} \right] = \left[ \begin{array}{c} (u_l-c_u)\times b/\Delta-b/2 \\ (v_l-c_v)\times b /\Delta \\ b\times f/\Delta \end{array} \right] \label{stereoFunc} \end{equation} $  (1) 
wherein, $(u_l, v_l)$ is the pixel coordinate in the left image, $(c_u, c_v)$ is the principal point, $b$ is the baseline between the two cameras, $f$ is the focal length, and $\Delta$ is the disparity.
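The back-projection of (1) can be sketched in a few lines (an illustrative sketch; the function and argument names are ours, not from the paper):

```python
import numpy as np

def triangulate(u_l, v_l, disparity, b, f, c_u, c_v):
    """Back-project a left-image pixel with known disparity to a 3D point,
    following Eq. (1); x is measured from the midpoint of the baseline."""
    x = (u_l - c_u) * b / disparity - b / 2.0
    y = (v_l - c_v) * b / disparity
    z = b * f / disparity
    return np.array([x, y, z])
```

Note that the depth $z$ is inversely proportional to the disparity, so distant points have larger depth uncertainty.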
Computing the disparity map, namely the stereo matching process, is one of the most important problems of 3D reconstruction. Based on the taxonomy of Scharstein and Szeliski^{[17]}, a stereo matching algorithm can be divided into four parts, i.e., matching cost computation, cost aggregation, disparity optimization and disparity refinement. SGM^{[18]} is a well-known stereo algorithm whose matching cost is initialized by computing pixel-wise mutual information. The algorithm becomes more computationally efficient when the matching cost is replaced by a block-based sum of absolute differences (SAD), and this variant is named SGBM. In order to obtain more accurate results, LIBELAS^{[19]} builds a prior on the disparities by triangulating a set of robustly matched support points, which is useful for low-textured areas; moreover, MC-CNN^{[20]} uses a convolutional neural network to predict stereo matching costs. However, considering the real-time requirement, SGBM is more suitable for this project. After the depth map computation, we apply a multi-frame fusion scheme to improve the accuracy of the 3D reconstruction, which will be introduced in Section 8.
4 Image semantic segmentation
In this paper, we predefine object categories such as road, building, pole, traffic sign, pedestrian and vehicle. These categories are common in outdoor environments, and image semantic segmentation aims at classifying each pixel of the image into one of them.
Different from existing methods^{[1, 12, 16]} that use a CRF model for image semantic segmentation, this paper employs a semantic segmentation algorithm based on the deep learning architecture SegNet^{[4]}. The segmentation obtained from SegNet is both real-time and accurate.
Fig. 3 shows the architecture of SegNet, which is composed of an encoder network, a decoder network and a softmax classifier in the last layer. The input of SegNet is an RGB image. Each encoder applies a convolution, ReLU, max-pooling and subsampling pipeline; each decoder upsamples its input using the pooling indices transferred from its corresponding encoder; and the output of the last decoder layer is passed to the softmax classifier. The semantic segmentation result is obtained by a single forward pass.
Fig. 3. Deep learning based semantic segmentation method: the architecture of SegNet 
The structure of the encoder and decoder in SegNet is shown in Fig. 4. SegNet transforms the fully connected layers of a classical convolutional neural network (CNN) into convolution layers, and each encoder consists of convolution, batch normalization, a nonlinearity and max-pooling. An upsampling operation is applied to the feature maps in each decoder, which gradually restores them to the same size as the input image. Once trained, the network can process input images of any size and output a segmentation of the same size.
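The index-passing mechanism between encoder and decoder can be illustrated with a minimal numpy sketch (function names and the 2x2 window size are our assumptions for illustration; SegNet itself is a full convolutional network):

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the argmax positions (flat
    indices), mimicking the pooling indices SegNet's encoder passes on."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    indices = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            local = np.argmax(window)
            # convert the window-local argmax to a flat index in x
            indices[i, j] = (i*k + local // k) * w + (j*k + local % k)
            pooled[i, j] = window.flat[local]
    return pooled, indices

def unpool(pooled, indices, shape):
    """SegNet-style sparse upsampling: each pooled value is written back
    to the location it came from; all other entries stay zero."""
    out = np.zeros(shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out
```

In SegNet the sparse unpooled maps are then densified by subsequent convolution layers in the decoder.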
Fig. 4. Structure of encoder and decoder in SegNet 
Moreover, SegNet is entirely data-driven and can learn better features with its deep structure; both properties lead to superior performance and extensibility compared with the traditional CRF based approaches to the segmentation problem.
5 Semantic segmentation refinement
The raw accuracy of image semantic segmentation is generally not high, and the pipeline usually needs a refinement step. CRF inference^{[21]} is the most common choice, but the traditional CRF inference algorithm only considers the two-dimensional features of an image, such as the color, position and semantic labels of pixel pairs, during optimization.
In the previous sections, we obtained the spatial position and semantic label of each pixel. In order to make full use of this information, a fully connected voxel-CRF method^{[21]} is introduced to refine the result of SegNet by maximizing label consistency between similar voxels, as shown in Fig. 5. Here, we define several kernels to model the relationship between voxels, including spatial and semantic relations.
Fig. 5. Adjusting semantic segmentation by voxel-CRF inference 
We divide the 3D space into voxels and assign a label variable $x_i$ to each voxel $i$. The optimal label assignment $X$ minimizes the energy function
$ \begin{equation} E(X) = \sum\limits_i \psi _u (x_i) + \sum\limits_{i<j} \psi_p (x_i, x_j) \end{equation} $  (2) 
with the unary potential
$ \begin{equation} \psi _u (x_i) = -\log{P(x_i)} \end{equation} $  (3) 
where $P(x_i)$ is the probability of voxel $i$ taking label $x_i$, given by the SegNet output. The pairwise potential is defined as
$ \begin{equation} \psi _p (x_i, x_j) = \mu(x_i, x_j) \sum\limits_{m} k^{(m)}({\mathit{\boldsymbol{f_i}}}, {\mathit{\boldsymbol{f_j}}}) \end{equation} $  (4) 
where $\mu(x_i, x_j)$ is a label compatibility function, and each $k^{(m)}$ is a Gaussian kernel over the feature vectors ${\mathit{\boldsymbol{f_i}}}$ and ${\mathit{\boldsymbol{f_j}}}$ of voxels $i$ and $j$:
$ \begin{equation} k^{(m)}({\mathit{\boldsymbol{f_i}}}, {\mathit{\boldsymbol{f_j}}}) = \omega ^{(m)}\exp\left(-\frac{1}{2}({\mathit{\boldsymbol{f_i}}}-{\mathit{\boldsymbol{f_j}}})^{\rm T}{\mathit{\boldsymbol{\Lambda}}}^{(m)}({\mathit{\boldsymbol{f_i}}}-{\mathit{\boldsymbol{f_j}}})\right) \end{equation} $  (5) 
where $\omega^{(m)}$ is the weight of the $m$-th kernel and ${\mathit{\boldsymbol{\Lambda}}}^{(m)}$ is a symmetric, positive-definite precision matrix.
Three Gaussian kernel functions are defined as the constraints. The first one is a smoothness kernel:
$ \begin{equation} k^{(1)} = \omega ^{(1)}\exp\left(-\frac{\|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}\|^2}{2{\theta_p^2}}\right) \end{equation} $  (6) 
where ${\mathit{\boldsymbol{p_i}}}$ is the 3D position of voxel $i$ and $\theta_p$ controls the spatial extent of the smoothness.
The second one is also a smoothness kernel:
$ \begin{equation} k^{(2)} = \omega ^{(2)}\exp\left(-\frac{\|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}\|^2}{2{\theta_{p, n}^2}} - \frac{\|{\mathit{\boldsymbol{n_i}}}-{\mathit{\boldsymbol{n_j}}}\|^2}{2{\theta_n^2}}\right) \end{equation} $  (7) 
where ${\mathit{\boldsymbol{n_i}}}$ is the surface normal at voxel $i$; this kernel encourages nearby voxels with similar normals to take the same label.
The third one is the appearance kernel:
$ \begin{equation} k^{(3)} = \omega ^{(3)}\exp\left(-\frac{\|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}\|^2}{2{\theta_{p, a}^2}} - \frac{\|{\mathit{\boldsymbol{c_i}}}-{\mathit{\boldsymbol{c_j}}}\|^2}{2{\theta_c^2}}\right) \end{equation} $  (8) 
where ${\mathit{\boldsymbol{c_i}}}$ is the color vector of voxel $i$.
This model takes the color, position and normal vector into consideration. We can set a fixed value for each kernel weight $\omega^{(m)}$ and bandwidth parameter $\theta$.
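The kernels of (5)-(8) can be evaluated directly; the following sketch is ours (function names, default bandwidths and weights are illustrative assumptions, not values from the paper):

```python
import numpy as np

def gaussian_kernel(f_i, f_j, weight, precision):
    """Generic pairwise kernel of Eq. (5):
    weight * exp(-1/2 (f_i - f_j)^T Lambda (f_i - f_j))."""
    d = np.asarray(f_i, float) - np.asarray(f_j, float)
    return weight * np.exp(-0.5 * d @ precision @ d)

def appearance_kernel(p_i, p_j, c_i, c_j, w3=1.0, theta_pa=1.0, theta_c=1.0):
    """Appearance kernel of Eq. (8): nearby voxels with similar color
    are encouraged to take the same label."""
    dp = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    dc = np.sum((np.asarray(c_i, float) - np.asarray(c_j, float)) ** 2)
    return w3 * np.exp(-dp / (2 * theta_pa ** 2) - dc / (2 * theta_c ** 2))
```

Identical features yield the maximum kernel response (the weight itself), and the response decays with feature distance, which is what drives neighboring, similar-looking voxels toward a common label during CRF inference.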
6 Feature matching verification
According to the definition of feature point matching, the projections of the same object point in two images are corresponding matching points, so the object class labels of a matching pair should also be the same. Therefore, for two consecutive images, we discard feature matches whose semantic labels disagree, as illustrated in Fig. 6.
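This semantic consistency check amounts to a simple filter over candidate matches (a hedged sketch; the data layout, with labels indexed as `labels[v][u]` and matches as pixel-coordinate pairs, is our assumption):

```python
def verify_matches(matches, labels_prev, labels_curr):
    """Reject feature matches whose endpoints carry different semantic
    labels: a true correspondence images the same object in both frames."""
    verified = []
    for (u1, v1), (u2, v2) in matches:
        if labels_prev[v1][u1] == labels_curr[v2][u2]:
            verified.append(((u1, v1), (u2, v2)))
    return verified
```

Matches surviving this filter are then used for camera motion estimation, which reduces drift caused by mismatched features.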
Fig. 6. Feature point matching verification using semantic labels 
7 Motion segmentation
Moving objects often occur in outdoor scenes and affect the accuracy of mapping. As shown in Fig. 7, we present a method to eliminate moving objects in a dynamic background by leveraging semantic information^{[22]}.
Fig. 7. Removal of moving object from a single frame 
7.1 Coarse motion segmentation
Traditional motion segmentation follows two main approaches. One is background modeling combined with motion compensation, but this kind of method does not perform well in complex environments. The other is the optical flow method: dense optical flow is computationally heavy, while sparse optical flow based on feature points struggles to extract the exact shape of moving objects. In this paper, a feature matching method is used to generate moving seeds, and a U-disparity map^{[23]} is employed for coarse motion segmentation.
The process of moving object segmentation in a dynamic background is illustrated in Fig. 8. In the coarse part, we first calculate the disparity map, and the U-disparity map is constructed by accumulating, along each image column, the number of pixels that take each disparity value.
Fig. 8. Motion segmentation process 
In parallel, image features are extracted, matched and tracked^{[25]}. We use random sample consensus (RANSAC) for feature point selection, which divides the feature points into two categories: inliers (matched) and outliers (unmatched). The outliers are obviously different from the background, indicating that they are likely to belong to moving targets. Meanwhile, the inliers can be used to calculate the translation and rotation matrix of the camera.
Then, we project both the inliers and the outliers into the U-disparity map and treat them as seeds for region growing, which segments U-disparity pixels with the same disparity into image patches. Since the outliers contain noise, only U-disparity pixels with an intensity higher than a threshold are candidates for segmentation. In order to obtain more candidates of moving objects, an object is counted as detected if it contains at least one pixel classified as moving. Finally, motion segmentation is done by projecting the moving patches back to the RGB image: for each column, pixels whose disparity falls into a moving patch of that column are marked as moving.
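The U-disparity accumulation itself is a histogram per image column and can be sketched as follows (a naive illustration in our own naming; real implementations vectorize this):

```python
import numpy as np

def u_disparity(disp, d_max):
    """Build a U-disparity map: for each image column u, count how many
    pixels take each disparity value d. Vertical obstacles, which keep a
    nearly constant disparity over many rows, pile up as bright cells."""
    h, w = disp.shape
    u_map = np.zeros((d_max + 1, w), dtype=int)
    for u in range(w):
        for v in range(h):
            d = disp[v, u]
            if 0 <= d <= d_max:
                u_map[d, u] += 1
    return u_map
```

Thresholding this map suppresses road-surface pixels (whose disparity varies row by row) and keeps obstacle candidates for the region-growing step.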
Some falsely detected moving pixels can be corrected using semantic information. After projecting the motion mask onto the semantic segmentation map, we remove the overlap regions that cannot be in a moving state according to the corresponding semantic labels. As shown in Fig. 9, we only keep the areas that have the potential to be moving. Here, cars, cyclists and pedestrians are considered as potential moving objects.
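This semantic validation reduces to intersecting the coarse motion mask with a "movable class" mask (a sketch under our naming; the set of movable classes follows the text above):

```python
import numpy as np

MOVABLE = {"car", "cyclist", "pedestrian"}  # classes with motion potential

def refine_motion_mask(motion_mask, semantic_labels):
    """Keep a 'moving' pixel only if its semantic class can actually move;
    buildings, roads, etc. flagged by the coarse stage are discarded."""
    movable = np.vectorize(lambda c: c in MOVABLE)(semantic_labels)
    return np.logical_and(motion_mask, movable)
```

A static structure mislabeled as moving by the coarse stage (e.g. a building edge) is thereby removed before map fusion.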
Fig. 9. Refining motion segmentation. ① Contour extraction of potentially moving objects. ② Area validation. ③ Semantic-motion overlap validation. ④ Motion masks merged into one 
It should be noted that the result of the previous section is a semantic point cloud, so it is necessary to project the point cloud back onto the image plane in order to get the refined semantic segmentation image:
$ \begin{equation} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_u \\ 0 & f_y & c_v \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_s \\ y_s \\ z_s \end{pmatrix} \end{equation} $  (9) 
where $(u, v)$ is the pixel coordinate (after normalization by the depth $z_s$), $(x_s, y_s, z_s)$ is the 3D point in the camera coordinate system, $f_x$ and $f_y$ are the focal lengths, and $(c_u, c_v)$ is the principal point.
In this paper, 12 semantic labels are defined to represent the objects in the outdoor scene. We consider cars and pedestrians to have the potential to move, while buildings and roads have none. Once the refined semantic segmentation is given, we extract the contours of the image patches that have the potential to be moving. Contours whose area is below a threshold are discarded as noise.
In this section, we have introduced the motion segmentation used in our system and shown how to make full use of the semantic information to improve its accuracy, which guarantees the robustness of mapping.
8 Keyframe fusion
In the previous sections, we introduced the process of obtaining the local semantic map from a single frame. To generate a large-scale global semantic map from an image sequence, we need to estimate the camera pose of each frame and fuse the new data into the global map. However, if the point cloud of every frame were simply put into the generated map, difficulties would arise from the huge amount of data. Therefore, the main topic of this section is how to efficiently and accurately integrate a new frame into the global map. The overall workflow of data fusion is shown in Fig. 10.
Fig. 10. Overall workflow of data fusion 
8.1 Camera ego-motion estimation
For each new frame, we first estimate the camera ego-motion in the global coordinate system at that moment. The rotation and translation are calculated to convert the local coordinate system into the global one.
We use visual odometry^{[25]} to estimate the rotation matrix $R$ and translation vector $t$ between adjacent frames, written as
$ \begin{equation} (Rt) = \left( \begin{array}{cccc} r_{00} & r_{01} & r_{02} & t_0 \\ r_{10} & r_{11} & r_{12} & t_1 \\ r_{20} & r_{21} & r_{22} & t_2 \end{array} \right). \end{equation} $  (10) 
As shown in Fig. 11, we set the first frame as the origin of the global map, and the motion of camera can be accumulated:
$ \begin{equation} (Rt)_{t, 1} = \prod\limits_{t>1}{(Rt)_{t, t-1}} \end{equation} $  (11) 
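The accumulation of (11) is just a chain of homogeneous transforms (an illustrative sketch; function names are ours):

```python
import numpy as np

def to_homogeneous(R, t):
    """Stack (R|t) into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate(relative_transforms):
    """Chain frame-to-frame transforms into the pose of the latest frame
    relative to the first frame, as in Eq. (11)."""
    pose = np.eye(4)
    for T in relative_transforms:
        pose = pose @ T
    return pose
```

Because errors multiply along this chain, drift grows with the number of frames, which is exactly what the semantic verification of Section 6 mitigates.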
Fig. 11. Camera egomotion estimation 
8.2 Keyframe selection
In this paper, we use a 3D grid map as the storage structure for the fusion process. Keyframe selection reduces memory consumption and speeds up fusion. A simple strategy detects keyframes as follows: we set the first frame as a keyframe, then calculate the accumulated rotation and translation since the last keyframe, and create a new keyframe once either exceeds its threshold:
$ \begin{equation} KeyFrame({Frame_{t}}) = \left\{ \begin{aligned} (R)_{t, 1} &> THR_{R} \\ \textrm{or}~~(t)_{t, 1} &> THR_{t}. \end{aligned} \right. \end{equation} $  (12) 
In order to improve the depth accuracy of keyframes, we also maintain a historical queue that stores the most recent frames and fuses their depth measurements into the keyframe. The 3D points of the keyframe are then transformed into the global (world) coordinate system:
$ \begin{equation} \begin{pmatrix} x_w \\ y_w \\ z_w \end{pmatrix} = (Rt)_{t, 1} \begin{pmatrix} x_s \\ y_s \\ z_s \\ 1 \end{pmatrix}. \end{equation} $  (13) 
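Keyframe selection and the point transform of (13) can be sketched together (a hedged illustration; the threshold values and function names are our assumptions, and the rotation magnitude is measured as the rotation angle):

```python
import numpy as np

def is_keyframe(R_rel, t_rel, thr_R=0.1, thr_t=1.0):
    """Flag a frame as a keyframe when it has rotated or translated far
    enough since the last keyframe, per Eq. (12); thresholds illustrative."""
    # rotation angle from the trace of the relative rotation matrix
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    return angle > thr_R or np.linalg.norm(t_rel) > thr_t

def to_world(T, p_local):
    """Transform a local 3D point into global map coordinates, Eq. (13);
    T is the 3x4 or 4x4 accumulated pose of the frame."""
    return (np.asarray(T) @ np.append(p_local, 1.0))[:3]
```

Only points of selected keyframes are pushed into the global voxel map, which bounds the fusion workload per unit of traveled distance.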
Keyframes reduce the frequency of fusion, but this alone is not efficient enough for large-scale dense 3D mapping. In order to solve the problem of memory explosion, we use a hash table as the data storage structure.
Because the trajectory of a mobile robot is not predictable, a traditional dense grid map has to reserve a great deal of memory covering the whole possible scene volume, as shown in Fig. 12, even though only a small fraction of the voxels is ever occupied.
Fig. 12. Traditional memory storage for 3D grid map 
As illustrated in Fig. 13, a hash table is a compromise between an array and a linked list: it uses both indexing and list traversal to store and retrieve data elements. Assuming that the voxelization resolution is $r$, a point $(x, y, z)$ falls into the voxel with integer coordinates $(i, j, k) = (\lfloor x/r \rfloor, \lfloor y/r \rfloor, \lfloor z/r \rfloor)$, and its hash value is computed as
$ \begin{equation} H(x, y, z)=(i\cdot p_1 \oplus j\cdot p_2\oplus k\cdot p_3)~\textrm{mod}~n \end{equation} $  (14) 
Fig. 13. Hash table structure for storage 
where $p_1$, $p_2$ and $p_3$ are large prime numbers, $\oplus$ denotes the bitwise XOR operation, and $n$ is the size of the hash table. Each entry of the hash table is defined as
struct HashEntry {
    int position[3];
    char label;
};
where position is the position of the voxel and label represents its semantic label. We adopt the winner-takes-all method for data fusion: in case multiple points fall into the same voxel, we count the frequency of each object category and select the most frequent label. This provides a naive probability estimate for a voxel taking a given label.
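To make the hashing of (14) and the winner-takes-all fusion concrete, here is a small Python sketch (the class, method names and default resolution are ours; the three primes follow common voxel-hashing practice and are assumptions, not values from the paper):

```python
from collections import Counter

# large primes commonly used for spatial hashing (assumed values)
P1, P2, P3 = 73856093, 19349669, 83492791

def voxel_hash(i, j, k, n):
    """Spatial hash of Eq. (14): XOR of prime-scaled integer voxel
    coordinates, reduced modulo the table size n."""
    return (i * P1 ^ j * P2 ^ k * P3) % n

class VoxelMap:
    """Sparse voxel map with winner-takes-all semantic label fusion."""
    def __init__(self, resolution=0.1):
        self.res = resolution
        self.votes = {}  # (i, j, k) -> Counter of observed labels

    def insert(self, x, y, z, label):
        key = (int(x // self.res), int(y // self.res), int(z // self.res))
        self.votes.setdefault(key, Counter())[label] += 1

    def label_of(self, i, j, k):
        """Most frequent label among all points that fell into the voxel."""
        return self.votes[(i, j, k)].most_common(1)[0][0]
```

In this Python sketch the dictionary performs the bucketing that the paper's hash table implements explicitly; `voxel_hash` only illustrates the formula of (14).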
We know that storing and retrieving items in a linked list becomes less efficient as the list grows longer, so it makes sense to keep these linked lists as short as possible. To this end, we set a distance threshold: voxels farther than this distance from the current camera position are moved out of the hash table.
9 Experiments
The effectiveness of our approach is demonstrated on the KITTI dataset^{[5]}, and qualitative and quantitative results are provided.
9.1 Dense 3D semantic map
We reconstruct labeled large-scale scenes from the KITTI odometry dataset, which contains over 20 sequences of rectified stereo image pairs.
Fig. 14. Semantic map of the reconstructed scene of Seq05 overlaid with the corresponding Google Earth image 
In Fig. 14, a large-scale semantic reconstruction comprising 1 250 frames from KITTI odometry Seq05 is illustrated, and Fig. 15 shows further large-scale reconstruction results. An overhead view of each reconstructed semantic map is shown together with the corresponding Google Earth image. Roads (purple), buildings (red) and vegetation (yellow) can be clearly distinguished.
Fig. 15. Semantic maps of the reconstructed scenes overlaid with the corresponding Google Earth images 
9.2 Visual odometry
As shown in Fig. 16, the ground truth camera trajectory of Seq07 is compared with the result of the localization method proposed in this paper. The red points represent the ground truth trajectory, the blue points represent the trajectory of visual odometry without the semantic verification proposed in Section 6, and the green points show the trajectory with semantic verification. As can be seen, the plain visual odometry trajectory drifts more and more as the number of frames increases, whereas the trajectory using semantic verification drifts considerably less.
Fig. 16. Trajectory comparison between ground truth and visual odometry 
We evaluate the translation error and rotation error over an increasing number of frames against the provided ground truth of Seq07. In Table 1, visual odometry with semantic verification is denoted as sVO and without it as VO. It can be observed that both the translation and rotation errors of sVO are overall lower than those of VO.
In addition, we have demonstrated the proposed method on our driverless platform IN$^{2}$Bot, shown in Fig. 17; a qualitative result on our own dataset is given in Fig. 18.
Fig. 17. Our platform – IN^{2}Bot 
Fig. 18. Qualitative result using our dataset and semantically verified visual odometry 
9.3 Semantic segmentation
We also use 60 images annotated with per-pixel class labels from [12] to evaluate semantic segmentation. The class labels are road, building, vehicle, pedestrian, pavement, tree, sky, signage, post/pole and wall/fence. We compare our results on moving scenes with other approaches in Table 2, where "Avg" refers to the average of the per-class measures and "Global" refers to the overall percentage of correctly classified pixels. From Table 2, we can see that, while the "Global" measure changes little, the proposed method performs better on most individual classes.
To demonstrate the effectiveness of our method more clearly, we also present maps generated from portions of the sequences in Figs. 19 and 20. These pictures show that not only large objects such as roads, buildings and vegetation, but also details such as traffic signs and lanes, can be well reconstructed.
Fig. 19. Single frame 3D semantic map reconstruction 
Fig. 20. Multi-frame 3D semantic map reconstruction 
9.4 Motion segmentation
We use the KITTI Optical Flow Evaluation 2015 benchmark^{[28]} to evaluate the performance of the proposed algorithm. As shown in Fig. 21, using semantic labels clearly improves motion segmentation accuracy. For a quantitative evaluation, we choose the average intersection-over-union (IOU) as the measurement, defined as the ratio of the intersection to the union of the predicted and ground-truth moving regions.
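Over pixel sets, the IOU measure reduces to a one-liner (a sketch in our naming; here moving regions are represented as sets of pixel indices):

```python
def iou(pred, truth):
    """Intersection-over-union between predicted and ground-truth
    moving-pixel sets; 1.0 when both sets are empty."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0
```

The per-image IOU values are then averaged over the evaluation set to obtain the reported measure.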
Fig. 21. Motion segmentation result 
9.5 Computing efficiency
In order to evaluate the efficiency of our approach, we list the runtime of all the operations involved in this method in Table 4. Notice that the reported timings depend on the keyframe distance thresholds.
Compared with methods that need 30 s per frame for semantic segmentation^{[30]}, the deep learning based semantic segmentation used in this paper greatly improves computing efficiency. In fact, the most time-consuming part of the proposed method, namely the disparity calculation, can be further accelerated with a GPU implementation, since the operation is highly parallel. Meanwhile, the highly modular design ensures the scalability and further improvement of the system.
10 Conclusions
This paper proposes a framework for dense 3D semantic mapping. We design as many steps as possible in a parallel fashion, which makes the mapping process more efficient. We use depth information to refine semantic segmentation and, in turn, use semantic information to optimize motion segmentation, which yields more accurate results. Meanwhile, the introduction of deep learning also greatly improves the scalability and speed of semantic mapping.
For future work, we intend to build an end-to-end learning system that integrates all steps into deep learning, which will eventually lead to a purely data-driven system.
[1]  G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, A. M. Lopez. Vision-based offline-online perception paradigm for autonomous driving. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Waikoloa, USA, pp. 231-238, 2015. DOI: 10.1109/WACV.2015.38. 
[2]  J. Mason, B. Marthi. An object-based semantic world model for long-term change detection and semantic querying. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Vilamoura, Portugal, pp. 3851-3858, 2012. DOI: 10.1109/IROS.2012.6385729. 
[3]  A. Nüchter, J. Hertzberg. Towards semantic maps for mobile robots. Robotics and Autonomous Systems, vol. 56, no. 11, pp. 915-926, 2008. DOI: 10.1016/j.robot.2008.08.001. 
[4]  V. Badrinarayanan, A. Kendall, R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 2017. DOI: 10.1109/TPAMI.2016.2644615. 
[5]  A. Geiger, P. Lenz, R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, USA, pp. 3354-3361, 2012. DOI: 10.1109/CVPR.2012.6248074. 
[6]  S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, R. Szeliski. Building Rome in a day. Communications of the ACM, vol. 54, no. 10, pp. 105-112, 2011. DOI: 10.1145/2001269.2001293. 
[7]  D. Munoz, J. A. Bagnell, N. Vandapel, M. Hebert. Contextual classification with functional max-margin Markov networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, USA, pp. 975-982, 2009. DOI: 10.1109/CVPR.2009.5206590. 
[8]  B. Douillard, D. Fox, F. Ramos, H. Durrant-Whyte. Classification and semantic mapping of urban environments. The International Journal of Robotics Research, vol. 30, no. 1, pp. 5-32, 2011. DOI: 10.1177/0278364910373409. 
[9]  R. Zhang, S. A. Candra, K. Vetter, A. Zakhor. Sensor fusion for semantic segmentation of urban scenes. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Seattle, USA, pp. 1850-1857, 2015. DOI: 10.1109/ICRA.2015.7139439. 
[10]  F. Endres, J. Hess, J. Sturm, D. Cremers, W. Burgard. 3D mapping with an RGB-D camera. IEEE Transactions on Robotics, vol. 30, no. 1, pp. 177-187, 2014. DOI: 10.1109/TRO.2013.2279412. 
[11]  M. Gunther, T. Wiemann, S. Albrecht, J. Hertzberg. Building semantic object maps from sparse and noisy 3D data. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Tokyo, Japan, pp. 2228-2233, 2013. DOI: 10.1109/IROS.2013.6696668. 
[12]  S. Sengupta, E. Greveson, A. Shahrokni, P. H. S. Torr. Urban 3D semantic modelling using stereo vision. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Karlsruhe, Germany, pp. 580-585, 2013. DOI: 10.1109/ICRA.2013.6630632. 
[13]  N. D. Reddy, P. Singhal, V. Chari, K. M. Krishna. Dynamic body VSLAM with semantic constraints. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Hamburg, Germany, pp. 1897-1904, 2015. DOI: 10.1109/IROS.2015.7353626. 
[14]  J. P. C. Valentin, S. Sengupta, J. Warrell, A. Shahrokni, P. H. S. Torr. Mesh based semantic modelling for indoor and outdoor scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, USA, pp. 2067-2074, 2013. DOI: 10.1109/CVPR.2013.269. 
[15]  J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, J. M. M. Montiel. Towards semantic SLAM using a monocular camera. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, San Francisco, USA, pp. 1277-1284, 2011. DOI: 10.1109/IROS.2011.6094648. 
[16]  V. Vineet, O. Miksik, M. Lidegaard, M. Niessner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Pérez, P. H. S. Torr. Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Seattle, USA, pp. 75-82, 2015. DOI: 10.1109/ICRA.2015.7138983. 
[17]  D. Scharstein, R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7-42, 2002. DOI: 10.1023/A:1014573219977. 
[18]  H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328-341, 2008. DOI: 10.1109/TPAMI.2007.1166. 
[19]  A. Geiger, M. Roser, R. Urtasun. Efficient large-scale stereo matching. In Proceedings of the 10th Asian Conference on Computer Vision, Springer, Queenstown, New Zealand, pp. 25-38, 2010. DOI: 10.1007/978-3-642-19315-6_3. 
[20]  J. Žbontar, Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 1592-1599, 2015. DOI: 10.1109/CVPR.2015.7298767. 
[21]  P. Krähenbühl, V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Proceedings of Advances in Neural Information Processing Systems, Granada, Spain, pp. 109-117, 2011. 
[22]  F. Qiu, Y. Yang, H. Li, M. Y. Fu, S. T. Wang. Semantic motion segmentation for urban dynamic scene understanding. In Proceedings of IEEE International Conference on Automation Science and Engineering, IEEE, Fort Worth, USA, pp. 497-502, 2016. DOI: 10.1109/COASE.2016.7743446. 
[23]  Z. Hu, K. Uchimura. U-V-disparity: An efficient algorithm for stereovision based scene analysis. In Proceedings of IEEE Intelligent Vehicles Symposium, IEEE, Las Vegas, USA, pp. 48-54, 2005. DOI: 10.1109/IVS.2005.1505076. 
[24]  Y. Li, Y. Ruichek. Occupancy grid mapping in urban environments from a moving on-board stereo-vision system. Sensors, vol. 14, no. 6, pp. 10454-10478, 2014. DOI: 10.3390/s140610454. 
[25]  A. Geiger, J. Ziegler, C. Stiller. StereoScan: Dense 3D reconstruction in real-time. In Proceedings of IEEE Intelligent Vehicles Symposium, IEEE, Baden-Baden, Germany, pp. 963-968, 2011. DOI: 10.1109/IVS.2011.5940405. 
[26]  M. Niessner, M. Zollhöfer, S. Izadi, M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, vol. 32, no. 6, Article number 169, 2013. DOI: 10.1145/2508363.2508374. 
[27]  R. Mur-Artal, J. M. M. Montiel, J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015. DOI: 10.1109/TRO.2015.2463671. 
[28]  M. Menze, A. Geiger. Object scene flow for autonomous vehicles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3061-3070, 2015. DOI: 10.1109/CVPR.2015.7298925. 
[29]  L. Ladický, C. Russell, P. Kohli, P. H. S. Torr. Associative hierarchical random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1056-1077, 2014. DOI: 10.1109/TPAMI.2013.165. 
[30]  S. Sengupta, P. Sturgess, L. Ladický, P. H. S. Torr. Automatic dense visual semantic mapping from street-level imagery. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Vilamoura, Portugal, pp. 857-862, 2012. DOI: 10.1109/IROS.2012.6385958. 
[31]  H. He, B. Upcroft. Nonparametric semantic segmentation for 3D street scenes. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Tokyo, Japan, pp. 3697-3703, 2013. DOI: 10.1109/IROS.2013.6696884. 
[32]  A. Kundu, K. M. Krishna, J. Sivaswamy. Moving object detection by multi-view geometric techniques from a single camera mounted robot. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, St. Louis, USA, pp. 4306-4312, 2009. DOI: 10.1109/IROS.2009.5354227. 
[33]  T. H. Lin, C. C. Wang. Deep learning of spatiotemporal features with geometric-based moving point detection for motion segmentation. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Hong Kong, China, pp. 3058-3065, 2014. DOI: 10.1109/ICRA.2014.6907299. 
[34]  N. D. Reddy, P. Singhal, K. M. Krishna. Semantic motion segmentation using dense CRF formulation. In Proceedings of Indian Conference on Computer Vision Graphics and Image Processing, ACM, Bangalore, India, Article number 56, 2014. DOI: 10.1145/2683483.2683539. 