International Journal of Automation and Computing  2018, Vol. 15 Issue (2): 194-206 PDF
Large-scale 3D Semantic Mapping Using Stereo Vision
Yi Yang1, Fan Qiu1, Hao Li1, Lu Zhang1, Mei-Ling Wang1, Meng-Yin Fu1,2
1 School of Automation and National Key Laboratory of Intelligent Control and Decision of Complex Systems, Beijing Institute of Technology, Beijing 100081, China;
2 Nanjing University of Science and Technology, Nanjing 210094, China
Abstract: In recent years, there have been a lot of interests in incorporating semantics into simultaneous localization and mapping (SLAM) systems. This paper presents an approach to generate an outdoor large-scale 3D dense semantic map based on binocular stereo vision. The inputs to system are stereo color images from a moving vehicle. First, dense 3D space around the vehicle is constructed, and the motion of camera is estimated by visual odometry. Meanwhile, semantic segmentation is performed through the deep learning technology online, and the semantic labels are also used to verify the feature matching in visual odometry. These three processes calculate the motion, depth and semantic label of every pixel in the input views. Then, a voxel conditional random field (CRF) inference is introduced to fuse semantic labels to voxel. After that, we present a method to remove the moving objects by incorporating the semantic labels, which improves the motion segmentation accuracy. The last is to generate the dense 3D semantic map of an urban environment from arbitrary long image sequence. We evaluate our approach on KITTI vision benchmark, and the results show that the proposed method is effective.
Key words: Semantic map     stereo vision     motion segmentation     visual odometry     simultaneous localization and mapping (SLAM)
1 Introduction

Mapping, as a means of mobile robot perception and understanding of the scene, is the premise and basis to ensure the correct decision of the movement.

In the past, a significant amount of work has been done in building 2D grip maps, and a lot of effective algorithms of navigation and planning based on 2D grid maps have been developed. At the same time, the increasing complexity of application scenarios requires more powerful perception ability for the surroundings. Therefore, 3D reconstruction of environment has been an important way to improve navigation and location accuracy in mobile robot application. However, the traditional 3D reconstruction focuses on the reconstruction of the spatial structure of environment and ignores the necessary understanding of the environment, which seriously restricts the practical application of 3D maps.

With this background, this paper proposes a new method of constructing a "dense 3D semantic map" by introducing semantic labels into the existing 3D reconstruction. Maps provided with semantic labels are called semantic maps. Semantic maps provide semantic information for scene understanding, which can improve the efficiency of robot navigation, localization and automatic driving applications, such as lane recognition and free space extraction. It is important for robots to interact with the objects in the scene, including target capture, change detection[1, 2] and object search[3].

The system uses a binocular camera to capture color images as inputs, and each occupied voxel contains a pre-defined category, such as the vegetable, vehicle and road. In the meanwhile, it is able to detect and remove dynamic obstacles, which reduces the impact of moving vehicles and pedestrians. Our system uses a hash table as the structure of data storage to avoid the dependence upon CPU memory. In addition, we have used the latest deep learning based image semantic segmentation algorithm -SegNet[4], which enables the processing speed to be close to real time.

Fig. 1 shows the process of the dense 3D semantic mapping based on the KITTI datasets[5] in an outdoor scene, which includes two parts: the construction of local semantic map (single frame) and the construction of global semantic map (multiple frames).

 Download: larger image Fig. 1. Architecture of the dense 3D semantic map reconstruction

Our system decomposes the process of construction of the local semantic map into four parallel steps: 3D reconstruction, image semantic segmentation, image motion segmentation and camera motion estimation. Then, a fully connected voxel conditional random field inference algorithm (Voxel conditional random field (CRF)) is used to optimize the semantic labels by considering the spatial relation of the point cloud reconstructed from the 3D reconstruction. After that, we use semantic information to refine the image motion segmentation, which gives a motion mask to remove dynamic obstacles. We keep the static point set as local 3D semantic map. The process for incremental dense 3D semantic mapping for large-scale scenarios is challenging, so we fuse the key frame sequence into the system. Specifically, we use visual odometry to estimate the camera pose of each frame, which converts local semantic map to global map coordinates.

The paper is organized as follows. Section 3 introduces the 3D reconstruction method used in this paper. Semantic segmentation and its refinement are introduced in Sections 4 and 5.Then, motion segmentation and its refinement are introduced in Sections 6 and 7, respectively. In Section 8, we present an efficient key-frame fusion method. Finally, the experimental results are shown in Section 9.

2 Related works

A dense 3D map can provide more information for the robot$'$s perception, but it is limited to the size of the memory when building a large scale of scenes[6]. Classification based on 3D LiDAR point cloud[3, 7, 8] and the fusion of images and 3D point cloud[9] can directly construct the semantic map of the environment, but the use of expensive laser or radar restricts its application. The depth camera like Kinect is cheaper and has fast access to depth information[10, 11] which can reduce dependence on computational capacity and storage memory. But this kind of sensors work only in a small scale of indoor environment which cannot be applied to outdoor environment. In the field of large-scale reconstruction of outdoor scenes, the use of reconstruction technology can reconstruct virtual reality scenes, such as a digital city, digital tourism, city planning, disaster relief, etc. However, the classic methods, such as [6], for large scenes reconstruction run out of memory frequently.

As dynamic obstacles directly affect the environment map construction precision, reliable real-time detection of dynamic obstacles is the fundamental of the mapping of unknown environment. Most of the existing methods[12] have not considered the dynamic obstacles, which is incomplete for the actual process of mapping. Reference [13] focuses more on estimating the trajectory of the camera and the moving objects, with less consideration of the speed and accuracy of the scene reconstruction.

Offline map construction methods[1, 14] can usually construct a semantic map of wide scale of scene with high accuracy. However, it cannot be used in real-time conditions. A common way to construct map online is simultaneous localization and mapping (LAM)[15] consisting of sparse feature points, but it does not contain most of the environment details such as traffic lights and telegraph poles.

Another important point is that the dense 3D semantic mapping[16] claims to be able to run online, but it is not a complete online processing process in the strict sense, the method requires to obtain image sequences beforehand, and to do offline feature extraction for image semantic segmentation. To solve the above mentioned problems, we propose an effective method for the construction of dense 3D semantic maps.

3 3D reconstruction

The input of our system is rectified stereo color images. The model used for stereo vision depth measurement is shown in Fig. 2.

$P$ is a 3D point in the space, $p_l, p_r$ are the two projection points on the left and right image plane, whose respective corresponding coordinates are $(u_l, v_l)$ and $(u_r, v_r)$ in the image coordinate. We assume the coordinate of point $P$ is $(x_p, y_p, z_p)$ in the local coordinate system. According to the similar triangle principle, the coordinate of $P$ can be derived as

 $$$\left[ \begin{array}{c} x_p \\ y_p \\ z_p \end{array} \right] = \left[ \begin{array}{c} (u_l-c_u)\times b/\Delta-b/2 \\ (v_l-c_v)\times b /\Delta \\ b\times f/\Delta \end{array} \right] \label{stereoFunc}$$$ (1)

wherein, $f$ is the focal length and $b$ is the baseline of the stereo camera. We can obtain the depth map of the scene according to (1) once the disparity value $\Delta$ is valid.

Computing the disparity map, namely the stereo matching process, is one of the most important problems of 3D reconstruction. Based on the taxonomy of Scharstein and Szeliski[17], the stereo matching algorithm can be divided into four parts, i.e., matching cost computation, cost aggregation, disparity optimization and disparity refinement. SGM[18] is a well-known stereo algorithm whose matching cost is initialized by computing the pixel-wise mutual information. This algorithm is implemented in a more computationally efficient way by using batch-based sum of absolute difference (SAD) cost in the matching cost calculation procedure and renamed as SGBM. In order to obtain more accurate results, LIBELAS[19] builds a prior on the disparities by triangulating several robustly matched support points which is useful in dealing with low-textured areas, moreover, MC-CNN[20] uses a convolutional neural network to predict the value of stereo matching costs. However, considering the real-time situation, SGBM is more suitable for this project. After the depth map computation, we propose a multi-frame fusion scheme to improve the accuracy of the 3D reconstruction which will be introduced in Section 6.

4 Image semantic segmentation

In this paper, we pre-defined the road, building, pole, traffic sign, pedestrian, vehicle, etc. These kinds of object categories are common in the outdoor environment, and image semantic segmentation aims at classifying each pixel in the image into one of the categories.

Different from the existing methods[1, 12, 16] using CRF model for image semantic segmentation, this paper employs a new semantic segmentation algorithm based on deep learning architecture -SegNet[4]. The semantic segmentation obtained from SegNet is real-time and accurate.

Fig. 3 shows the architecture of the SegNet, which is composed of an encoder network, decoder network and a soft-max classifier in the last layer. The input of SegNet is a RGB image. An encoder uses the convolution-ReLU-max pooling-subsampling pipeline, a decoder upsamples its input using the transferred pool indices from its encoder, and the output of the last layer of the decoder is passed to the soft-max classifier. The semantic segmentation result can be obtained by a forward calculation.

 Download: larger image Fig. 3. Deep learning based semantic segmentation method -the architecture of SegNet

The structure of encoder and decoder in SegNet is shown in Fig. 4. It transforms fully connected layers in the classical convolutional neural network (CNN) into convolution layers, and the encoder consists of convolution, batch normalization, nonlinear and max-pooling. Up-sampling operation is implemented to feature map in each decoder, which makes the feature maps gradually restored to the same size of the input image. Once trained, it can process input images of any size and output the segmentation of the same size.

 Download: larger image Fig. 4. Structure of encoder and decoder in SegNet

Moreover, the SegNet is totally data-driven and can learn better features with deep structures, both of these lead to superior performance and extensibility when comparing with the traditional CRF based approaches of solving the segmentation problem.

5 Semantic segmentation refinement

The direct accuracy of image semantic segmentation is generally not high, and the pipeline usually needs a refinement process. CRF inference[21] is the most common choice, but traditional CRF inference algorithm only considers the two-dimensional spatial features of an image, such as the color, position, semantic labels between two pixels in the image for optimizing.

In the previous sections, we have obtained the space position and semantic label of each pixel. In order to make full use of these information, a fully connected voxel-CRF method[21] is introduced to refine the result of SegNet by maximizing the similarity consistency, which is shown in Fig. 5. Here, we define several kernels to model the relationship of pixels, including spatial and semantic relations.

We divide 3D space into $N$ voxels and define an observed set $V=\{v_1, \cdots, v_N\}$, where $v_i$ represents a voxel in the 3D grid, whose corresponding hidden variables belong to set $X=\{x_1, \cdots, x_N\}$. Each variable $x_i$ corresponds to a voxel $v_i$ and is assigned a label $l_i$. The label category set is $\mathscr{L}=\{l_1, \cdots, l_L\}$. Each $v_i$ contains some attributes like position in space, color vector and normal vector. $l_i\in \mathscr{L}$ represents a specific label category such as the vehicle, road or building. Voxel-CRF inference aims at deducing $X$ from $V$, which is converted into a formula to minimize the energy function:

 $$$E(X) = \sum\limits_i \psi _u (x_i) + \sum\limits_{i with$1 \leq i, j \leq N$. Here, we define the unary potential$\psi _u (x_i)$as $ $$\psi _u (x_i) = -\textrm{log}{P(x_i)}$$ $(3) where$P(x_i)$denotes the probability of$x_i$belonging to each category. This probability value can be obtained directly from the last layer of the output of SegNet. Unary potential term measures the initial probability of$x_i$. The form of pairwise potential$\psi_p (x_i, x_j)$is $ $$\psi _p (x_i, x_j) = \mu(x_i, x_j) \sum\limits_{m} k^{(m)}({\mathit{\boldsymbol{f_i}}}, {\mathit{\boldsymbol{f_j}}})$$ $(4) where$\mu (x_i, x_j) = [x_i\neq x_j]$, and$k^{(m)}$represents a kind of Gauss kernel function for evaluating feature vector similarity metric between$v_i$and$v_j$: $ $$k^{(m)}({\mathit{\boldsymbol{f_i}}}, {\mathit{\boldsymbol{f_j}}}) = \omega ^{(m)}\exp\left(-\frac{1}{2}({\mathit{\boldsymbol{f_i}}}-{\mathit{\boldsymbol{f_j}}})^{\rm T}{\mathit{\boldsymbol{\wedge^{m}}}}({\mathit{\boldsymbol{f_i}}}-{\mathit{\boldsymbol{f_j}}})\right)$$ $(5) where${\mathit{\boldsymbol{f_i}}}$and${\mathit{\boldsymbol{f_j}}}$denote the feature vector of$v_i$and$v_j$,$\omega^{(m)}$is a linear combination of weights,${\mathit{\boldsymbol{\wedge^{m}}}}$is a symmetric and positive definite matrix. Three kinds of Gauss kernel functions are defined as constraint to be minimized. The first one is a smoothness kernel: $ $$k^{(1)} = \omega ^{(1)}\exp\left(-\frac{|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}|}{2{\theta_p^2}}\right)$$ $(6) where${\mathit{\boldsymbol{p_i}}}$and${\mathit{\boldsymbol{p_j}}}$denote the space position of$v_i$and$v_j$and$\theta_p$controls the influence range of the kernel. This smoothness kernel attends to removing small isolated regions and assigning the same semantic label to two close voxels. The second one is also a smoothness kernel: $ $$k^{(2)} = \omega ^{(2)}\exp\left(-\frac{|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}|}{2{\theta_{p, n}^2}} - \frac{|{\mathit{\boldsymbol{n_i}}}-{\mathit{\boldsymbol{n_j}}}|}{2{\theta_n}^2}\right)$$ $(7) where${\mathit{\boldsymbol{n_i}}}$and${\mathit{\boldsymbol{n_j}}}$denote the normal vectors,$\theta_{p, n}$controls the influence range of the kernel,$\theta_{n}$defines the degree of similarity of the normals, and$\theta_{n}$controls the measure of the similarity of the normal vectors. It attends to assigning the same semantic label to two close voxels with a similar surface orientation. The convergence speed will accelerate while adding normal vectors into consideration. The third one is the appearance kernel: $ $$k^{(3)} = \omega ^{(3)}\exp\left(-\frac{|{\mathit{\boldsymbol{p_i}}}-{\mathit{\boldsymbol{p_j}}}|}{2{\theta_{p, a}^2}} - \frac{|{\mathit{\boldsymbol{c_i}}}-{\mathit{\boldsymbol{c_j}}}|}{2{\theta_c^2}}\right)$$ $(8) where${\mathit{\boldsymbol{c_i}}}$and$ {\mathit{\boldsymbol{c_j}}}$denote the color vectors of$v_i$and$v_j$,$\theta_{p, a}$controls the influence range of the kernel, and$\theta_{c}$defines the degree of similarity of the color. It attends to assigning the same semantic label to two voxels with a similar color vector. This model takes the color, position and normal vector into consideration. We can set a fixed value for each$\omega ^{m}$to adjust the proportion of their influence in the whole optimization. Finally, we can apply a highly efficient inference method[21] to obtain refined semantic segmentation result. 6 Feature point matching verification According to the definition of feature point matching, the projection of the same object in two images is considered as the corresponding matching point, which should also be the same for the object class label of the matching point. Therefore, for the two consecutive images$I_1$and$I_2$, if the matching feature points have the same semantic label,$ label _ {{I_1}(i, j)} = label _ {{I_2}(i ^ {'}, J ^ {'})} $, this pair of matching points is preserved. Otherwise it is considered as an error match. As shown in Fig. 6, different colors represent different semantic labels in the consecutive images. Matches with different semantic labels are eliminated, so that many error matches are filtered in the calculation of visual odometry, and the robustness of pose estimation is ensured.  Download: larger image Fig. 6. Feature point matching verification using semantic labels 7 Motion segmentation The moving objects often occur in the outdoor scene, which affects the accuracy of mapping. As shown in Fig. 7, we present a method to eliminate the moving objects in dynamic background by leveraging semantic information[22].  Download: larger image Fig. 7. Removal of moving object from a single frame 7.1 Coarse motion segmentation The traditional solution of motion segmentation provides two ways. One is the background modeling combined with motion compensation, but this kind of method does not perform well in a complex environment. The other one is the optical flow method, where the dense optical flow computation is heavy while sparse optical flow method based on feature points is difficult to extract the exact shape of moving objects. In this paper, a feature matching method is used to generate the moving seeds, and a U-disparity map[23] is employed for coarse motion segmentation. The process of moving objects segmentation in dynamic background is illustrated in Fig. 8. In the coarse part, we first calculate the disparity map, and the U-disparity map is constructed by accumulating the same disparity value into columns($u$-axis). Perpendicular objects within the environment appear as horizontal lines in the U-disparity map. Assume the intensity of a pixel in U-disparity$I(u, \Delta)$represents the number of pixels with the same disparity$\Delta$in column$u$in the disparity image. A brighter pixel in U-disparity indicates there are many pixels with the same disparity, which suggests a perpendicular object. Then, objects higher than a threshold measured in pixels are identified. In consideration that objects close to the stereo-camera would be captured with more pixels than those far away from the stereo-camera, we apply intensity adjustment for the situation similar to [24]. We use inverse sigmoid function, a function that will amplify the intensity of an area far away from the stereo-camera, to adjust the intensity of U-disparity.  Download: larger image Fig. 8. Motion segmentation process In parallel, the image features are extracted, matched and tracked[25]. We can use random sample consensus (RANSAC) for feature point selection, and the feature points can be divided into two categories: the inliers (matched) and the outliers (un-matched). The outliers are obviously different from the background, indicating that it is likely a moving target. Meanwhile, inliers can be used to calculate the translation and rotation matrix of camera$'$s motion (visual odometry). Then, we project both the inliers and the outliers into U-disparity, and treat them as seeds of region growing to segment pixels in U-disparity with the same disparity into image patches. Since the outliers comprise noises, only pixels in U-disparity with an intensity higher than a threshold are candidates for segmentation. In order to obtain more candidates of moving objects, a moving object would be counted as detected if it contains at least one pixel that is classified as moving. Finally, motion segmentation is done by projecting moving patches back to the RGB image. Specifically, for each column$u$, if pixels have the same disparity value with the disparity of outliers$\Delta_{outliers} (u)$, they are deemed to be moving objects. Finally, by back-projecting them into the$I_\Delta$, the coarse motion segmentation mask can be obtained. 7.2 Refined motion segmentation Some falsely moving pixels can be corrected using semantic information. After projecting the motion mask into the semantic segmentation map, we remove the overlap region which is impossible to be in a moving state according to the corresponding semantic segmentation map. As shown in Fig. 9, we only keep the area that has potential to be in a moving state. Here cars, cyclists and pedestrians are considered as potential moving objects.  Download: larger image Fig. 9. Refine motion segmentation. ① Contour of potential to be moving object extraction. ② Area validation. ③ Semanticmotion overlap validation. ④ Motion mask merged into one It should be noticed that the result of previous section is the semantic point cloud, so it is necessary to project the point cloud back to the image plane in order to get the refined semantic segmentation image: $ $$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_u \\ 0 & f_y & c_v \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_s \\ y_s \\ z_s \end{pmatrix}$$ $(9) where$(c_u, c_v)$is the optical center of camera. Because we treat each camera of stereo vision as the same model,$f_x = f_y$for the camera focal length.$u, v$is the semantic image coordinate projected from semantic point$(x_s, y_s, z_s)$. In this paper, 12 semantic labels are defined to represent the objects in the outdoor scene. We consider cars and pedestrians have potential to be moving. While buildings and roads have no potential of movement. Once the refined semantic segmentation has been given, we extract the contours of image patches which have the potential to be moving. The contour area which is less than$s_\alpha$will be eliminated. Preserved contours will be used to overlap with the coarse motion mask. If the overlapping area is greater than threshold$s_\beta$, we keep the contour as a motion mask. Finally, the motion masks are merged as the refined motion segmentation. In this section, we have introduced the motion segmentation used in our system and shown the way to make full use of the semantic information to improve the accuracy of the motion segmentation, which provides a guarantee for the robustness of mapping. 8 Key-frame fusion In the previous sections, we introduced the process of getting the local semantic map from a single frame. To map from image sequence to generate a large-scale global semantic map, we need to estimate the camera pose of each frame and fuse the new data into the global map. However, difficulties will arise due to the huge amount of data for each frame if the point cloud in each frame is simply put together into the generated map. Therefore, the main content we will discuss in this section is how to efficiently and accurately integrate the new frame into the global map. The overall workflow of data fusion is shown in Fig. 10.  Download: larger image Fig. 10. Overall workflow of data fusion 8.1 Camera egomotion estimation For the new frame, we first estimate the camera egomotion in the global coordinate system at that moment. The rotation and translation will be calculated to convert the local coordinate$(X_s, Y_s, Z_s)$into the global coordinate$(X_w, Y_w, Z_w)$. We use visual odometry[25] to estimate the rotation matrix$R_{t|t-1}$and translation$t_{t|t-1}$between consecutive frames. $ $$(R|t) = \left( \begin{array}{cccc} r_{00} & r_{01} & r_{02} & t_0 \\ r_{10} & r_{11} & r_{12} & t_1 \\ r_{20} & r_{21} & r_{22} & t_2 \end{array} \right).$$ $(10) As shown in Fig. 11, we set the first frame as the origin of the global map, and the motion of camera can be accumulated: $ $$(R|t)_{t|1} = \prod\limits_{t>1}{(R|t)_{t|t-1}}$$ $(11)  Download: larger image Fig. 11. Camera egomotion estimation 8.2 Key-frame selection In this paper, we use 3D grid maps as the storage structure to do the fusion process. Key-frame selection can release the memory consumption and speed up the fusion process. A simple strategy to detect the key-frame: We set the first frame image as the key-frame, then calculate the$L_2$-norm distance from the key-frame to the current frame, if the distance of rotation or translation is greater than$THR_{R}$or$THR_{t}$, we deem the current frame to be the new key-frame. $ KeyFrame({Frame_{t}}) = \left\{ \begin{aligned} ||(R)_{t|1}|| &> THR_{R} \\ \textrm{or} \\ ||(t)_{t|1}|| &> THR_{t}. \\ \end{aligned} \right. $(12) 8.3 Data fusion In order to improve the depth accuracy of key-frames, we also maintain a historical queue to store$K$nearest frames, and the key-frame will be first fused by reprojecting reconstructed 3D points of these frames into the image plane of the current frame[25]. In case a point falls onto a valid disparity, we fuse both 3D points by computing their 3D mean. Then, the key-frame will be fused into the global coordinate system. The global coordinate$(X_w, Y_w, Z_w)$can be calculated by transfer local coordinate$(X_s, Y_s, Z_s)$using camera egomotion: $ $$\begin{pmatrix} x_w \\ y_w \\ z_w \end{pmatrix} = (R|t)_{t|1} \begin{pmatrix} x_s \\ y_s \\ z_s \\ 1 \end{pmatrix}.$$ $(13) The key-frame reduces the frequency of fusion, which is not a very efficient approach to large-scale dense 3D mapping. In order to solve the problem of memory explosion, we use the hash table as a data storage structure. Because the trajectory of a mobile robot is not predictable, we have to occupy a great deal of memory of size$length_x\times length_y\times length_z$for 3D grid map maintaining. However, as shown in Fig. 12, most of the space has not been observed (transparent cube), and only a part of it is filled with obstacles (colored cube). This structure will cause a great waste of memory. For this reason, we choose to adopt hash table for storing data.  Download: larger image Fig. 12. Traditional memory storage for 3D grid map As illustrated in Fig. 13, a hash table is a compromise between an array and a linked list. It uses both indexing and list traversal to store and retrieve data elements. Assuming that the resolution of voxelization is$L$, a point$(x, y, z)$in space is converted into$(i, j, k)$after voxelization.$i=\textrm{round}(x/L), j=\textrm{round}(y/L), k=\textrm{round}(z/L)$. We employ a hash function[26] to obtain the key: $ $$H(x, y, z)=(i\cdot p_1 \oplus j\cdot p_2\oplus k\cdot p_3)~\textrm{mod}~n$$ $(14)  Download: larger image Fig. 13. Hash table structure for storage where large prime numbers$p_1, p_2, p_3$are respectively correspondent with$73 856 093$,$19 349 669$,$83 492 791$.$n$denotes the length of the hash table. Each voxel stores the position after voxelization and a semantic label: Struct HashEntry { int position [3]; char label; } where position is the position of voxel, and the label represents the semantic label of the voxel. We adopt the Winner-takes-all method for data fusion. In case multiple points fall into the same voxel, we count the frequency of each category of objects and select the label which is more frequent than others. This provides a naive probability estimation for a voxel taking label$l$. We know that storing and retrieving items in a linked list grow more inefficient as the list grows longer. So it makes sense to keep these linked lists as short as possible. To solve this problem, we set a distance threshold$d_w$. When the distance from current position to initial position exceeds a fixed$d_w$, the built map and pose at this moment will be stored to the hard disk, and then we use a "soft reset" to swap the space of memory, so that we can build a map of arbitrary size. 9 Experiment results and analysis The effectiveness of our approach is demonstrated on the KITTI dataset[5], and qualitative and quantitative results of our approach are provided. 9.1 Dense 3D semantic map We reconstruct the labeled large-scale scene from the KITTI odometry dataset. The odometry dataset contains over 20 sequences of stereo images of size about$1241\times376$acquired from a moving vehicle. The results of the proposed semantic map construction method on the odometry dataset of KITTI are shown in Fig. 14.  Download: larger image Fig. 14. Semantic map of the reconstructed scene of Seq05 overlayed with the corresponding Google Earth image In Fig. 14, a large scale semantic reconstruction, comprising of 1 250 frames from KITTI Odometry dataset Seq05, is illustrated. In Fig. 15, some other large scale semantic reconstruction results are illustrated. An overhead view of the reconstructed semantic map is shown along with the corresponding Google Earth image. Roads (purple), buildings (red) and vegetation (yellow) can be clearly distinguished.  Download: larger image Fig. 15. Semantic maps of the reconstructed scenes overlayed with the corresponding Google Earth images 9.2 Visual odometry As shown in Fig. 16, the ground truth of camera trajectory of Seq07 is compared with the result of localization method proposed in this paper. In Fig. 16, the red point represents the ground truth value of the trajectory, the blue point represents the trajectory of visual odometry without using the semantic verification proposed in Section 6, and the green point is the trajectory of visual odometry using the semantic verification. As can be seen, the visual odometry trajectory drifts more and more as the number of frames increases. However, the visual odometry trajectory using semantic verification drifts to a lesser extent.  Download: larger image Fig. 16. Trajectory comparison between ground truth and visual odometry We evaluate the translation error and rotation error over an increasing number of frames with the provided ground truth of Seq07. In Table 1, visual odometry using semantic verification is denoted as sVO and unused as VO. It can be observed that the translation and rotation error of visual odometry using semantic verification is overall lower than the method without using semantic verification. Table 1 Visual odometry evaluation: Translational and rotational error with increasing number of frames In addition, we have demonstrated the proposed method on our driverless platform -IN$^2$Bot. As shown in Fig. 17, this platform is remodeled from Polaris Ranger 800 XP all-terrain vehicle, with a top speed of 55 miles per hour. It uses gasoline as power source and has autonomous driving ability in urban and wild environment. The panoramic camera -Ladybug is used as the data source, and the top view of the surrounding roads is obtained by the inverse projection transformation (IPM) to restore the actual relative position of objects with the autonomous vehicle. We clip the front-view part of the panoramic camera as the input, and apply the semantic verification to visual odometry used in [27] to estimate the camera pose. The qualitative result is shown in Fig. 18.  Download: larger image Fig. 17. Our platform – IN2Bot  Download: larger image Fig. 18. Qualitative result using our dataset and semantic purificated visual odometry 9.3 Semantic segmentation We also get 60 annotated images with per-pixel class labels from [12] for evaluation of semantic segmentation. The class labels are the road, building, vehicle, pedestrian, pavement, tree, sky, signage, post/pole and wall/fence. We compare our results on moving scenes with other approaches, the results of which are summarized in Table 2. The "Avg" refers to the average of the per class measures, and the "Global" refers to the overall percentage of pixels correctly classified. From Table 2, we can tell that under the circumstance of unobvious change of "Global" the proposed method performs better in details of most classes. Table 2 Semantic evaluation (Pixel-level accuracy) In order to demonstrate the effectiveness of our method more clearly, we also present the map generated from a portion of the sequence image in Figs. 19 and 20. From these pictures, we can see that not only the large objects such as the road, building and vegetation, even the details can be well reconstructed such as the traffic sign and lane.  Download: larger image Fig. 19. Single frame 3D semantic map reconstruction  Download: larger image Fig. 20. Some frames 3D semantic map reconstruction 9.4 Motion segmentation We use Optical Flow Evaluation 2015[28] to evaluate the performance of the proposed algorithm. As shown in Fig. 21, we can clearly observe an improvement of motion segmentation in accuracy using semantic labels. To present a quantitative evaluation of motion segmentation, we choose the average intersection-over-union (IOU) as the evaluation measurement. It is defined as$TP/(TP+FP+FN)$, where$TP$represents the true positive,$FP$the false positive and$FN$as the false negative. Table 3 represents the motion accuracy evaluation of the proposed approach. In Table 3, an interesting fact is observed that our approach attains a better motion segmentation performance. In the experiment, the accuracy degenerates when pedestrians appear. This is because pedestrians usually move slower than cars, which causes the detected feature points likely to be classified as static. Besides, our method is not sensitive to longitudinal motion. Table 3 Motion evaluation  Download: larger image Fig. 21. Motion segmentation result 9.5 Computing efficiency In order to evaluate the effectiveness of our approach, we have made a detailed list of all the operations involved in this method in Table 4. Notice that key-frame distance thresholds are set to$THR_{R}=5$m,$THR_{t}=25$deg, and "reset" distance$d_w=2$km. The experiments are executed on an Intel Core i7-5930K CPU and GTX980Ti with 16 GB RAM and the codes are all written in C++. As shown in Table 4, since the first four steps are parallel processes with multithreading, the total computing time$T_{sum}=\$ 250 ms + 60 ms + 10 ms + 20 ms = 340 ms.

Table 4
Computing time

Compared with those methods which need 30 s to do semantic segmentation[30], in this paper, semantic segmentation based on deep learning technology is implemented, and the algorithm has a great improvement in computing efficiency. In fact, the most time-cost part of the proposed method, namely, disparity calculation, can be further accelerated by using GPU implementation since the operation is highly independent. Meanwhile, the highly modular design ensures the scalability and further improvement of the system.

10 Conclusions

This paper proposes a framework of dense 3D semantic mapping. We design as many processes as possible in a parallel fashion, which makes the mapping process more efficient. We use depth information to refine semantic segmentation. We, in turn, use semantic information to optimize the motion segmentation, which obtains a more accurate result. Meanwhile, the introduction of deep learning technology also greatly improves the scalability and the speed of semantic mapping.

For future work, we intend to build an end-to-end learning and integrate all steps into deep learning, which will eventually lead to a purely data driven system.

References