1 Introduction
Reconstructing real-world scenes is a particularly challenging problem in computer vision. Many sensors have been applied to perceive an accurate 3D world, including stereo cameras, laser range finders, monocular cameras, and depth cameras.
The emergence of consumer depth cameras, in particular the Microsoft Kinect, provides an opportunity to develop reconstruction systems conveniently. Izadi et al.^{[1, 2]} introduced the KinectFusion algorithm, which uses a volumetric representation of the scene, known as the truncated signed distance function (TSDF)^{[3]}, in conjunction with fast iterative closest point (ICP)^{[4]} pose estimation to provide a real-time fused dense model of the scene. KinectFusion works within a fixed grid space and has no loop closure detection or global optimization. Therefore, it is effective only for small local scenes.
When we reconstruct complete and high-quality real-world scenes with consumer-grade depth cameras, the principal problems are serious sensor noise and accumulated visual odometry errors, which may result in distortions in the reconstructed 3D models. Over the past few years, researchers have explored a number of approaches to address these issues.
Some systems achieve highly accurate localization by combining the depth data with red green blue (RGB) images^{[5-7]} or an inertial measurement unit (IMU)^{[8-10]}. But most depth cameras are not accompanied by color cameras. Even when a color camera is present, the two viewpoints differ and the shutters may not be perfectly synchronized. Besides, a consumer-grade IMU also suffers from sensor noise and is subject to large drift over time.
Other systems detect loop closures more explicitly and distribute the accumulated error across a pose graph^{[11-13]}. Choi et al.^{[11]} demonstrated impressive globally optimized 3D surface models, extending the frame-to-model incremental tracking and reconstruction technique used in KinectFusion. The key idea of Choi et al. is to reconstruct locally smooth scene fragments and then globally register these fragments against each other.
In this paper, we present an elaborate and robust scene reconstruction method, which can be applied to real-world scenes and achieves high reconstruction quality. The main contributions of our work contain three aspects: First, in order to increase the accuracy of the 3D model, we smooth the depth images by a depth adaptive bilateral filter derived from the noise model of the depth camera. Second, we design a content-based segmentation strategy for local-to-global registration, which reduces accumulated visual odometry drift. Third, we propose an adaptive weighted TSDF for volumetric integration, which preserves more detail in accurate regions and regions of interest.
This paper is structured as follows. Section 2 discusses the related work on indoor scene 3D reconstruction, while Section 3 describes the pipeline of our 3D reconstruction system. The details of the proposed method are presented in Section 4. Section 5 presents the experimental results and discussions, and Section 6 gives some concluding remarks.
2 Related work
Many algorithms have been designed for depth image augmentation, complete scene processing, and volumetric integration. We briefly discuss the related work below and state the detailed motivations of our methods.
The raw depth images obtained from commercially available depth cameras are easy to use, but are affected by significant amounts of noise. Researchers have extensively analyzed the accuracy and resolution of depth data^{[14-18]}. A commonly used remedy is the bilateral filter^{[19]}, which modifies the weighting to account for variation of intensity, thereby effectively carrying out a robust smoothing operation. However, the bilateral filter applied to depth images implicitly assumes that depth values have uniform uncertainty. Xiao et al.^{[20]} improved the depth map by using TSDF to voxelize the space, accumulating the depth maps from nearby frames using the camera poses, and then using ray casting to get a reliable depth map for each frame. Chen and Koltun^{[21]} developed a global high-resolution Markov random field (MRF) optimization approach to improve the accuracy of depth images. The algorithm performs block coordinate descent by optimally updating a horizontal or vertical line in each step. The idea of joint bilateral upsampling^{[22]} is to apply a spatial filter to the low resolution image, while a range filter is jointly applied on another full resolution image; it is used to augment the quality of a depth image with the help of a high resolution color image. In contrast to these, we smooth the depth image by a depth adaptive bilateral filter which is derived from the noise model of a structured-light stereo depth camera, and can be used easily.
A complete scene is reconstructed from views acquired along the camera trajectory, and each view exposes only a small part of the environment. Whelan et al.^{[12, 23]} permitted the area mapped by the TSDF to move over time, which allows the reconstructed surface to be continuously augmented in an incremental fashion as the camera translates and rotates in the real world. An inherent problem is dealing with the tracking drift caused by accumulated pose estimation errors. Zeng et al.^{[24]} introduced 3DMatch to robustly match local geometry; it is a data-driven local feature learner that jointly learns a geometric feature representation and an associated metric function from a large collection of real-world scanning data. Halber and Funkhouser^{[25]} introduced a fine-to-coarse algorithm that detects planar structures spanning multiple RGB-D frames and establishes geometric constraints between them as they become aligned. Detection and enforcement of these structural constraints in the inner loop of a global registration algorithm guides the solution towards more accurate global registrations, even without detecting loop closures. Choi et al.^{[11, 13]} dealt with the accumulated pose estimation errors by reconstructing locally smooth scene fragments and deforming these fragments in order to align them to each other. However, this is not very effective for the reconstruction of real-world scenes with a handheld camera. Therefore, we extend this method and design a content-based segmentation strategy to increase the accuracy of local fusion and global registration.
In volumetric integration, the TSDF is discretized into a voxel grid to represent a physical volume of space. Each voxel stores the truncated signed distance to the nearest surface together with an accumulated weight, and the surface can be extracted by ray casting^{[26]} or marching cubes^{[27]}. In contrast to the standard uniform weighting, we propose an adaptive weighting function of TSDF based on the camera noise model and a region-of-interest model.
3 System overview
An overview of our scene reconstruction framework is shown in Fig. 1. The scene reconstruction pipeline consists of three main stages, each of which is briefly described as follows.
Fig. 1. Pipeline of the proposed scene reconstruction system 
Image capture and processing. The raw depth images are captured with a depth camera based on structured-light stereo, such as the Microsoft Kinect for Windows and the Asus Xtion Pro Live. Before scene reconstruction, we improve the quality of the depth images with the proposed depth adaptive bilateral filter, which effectively removes the noise of these depth cameras.
Local-to-global registration. We introduce a local-to-global registration strategy to reduce visual odometry drift and achieve complete scene reconstruction. The large scene is partitioned into fragments of various sizes with the proposed content-based segmentation method. All fragments are locally fused with the ICP registration algorithm, and a global loop closure is detected for each pair of fragments with a geometric registration algorithm^{[11, 30]}. The benefit of this registration is that we can obtain more reliable geometric information, because the information extracted from a content-based fragment is more complete than that of an individual depth image. The poses of all fragments are then jointly refined to produce a globally consistent alignment.
Weighted volumetric integration. The registered fragments are fused into a global model through volumetric integration. The proposed weighting function of TSDF is based on the camera noise model and a region-of-interest model, so that accurate and interesting regions retain more detail.
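To make the data flow concrete, the following minimal sketch strings the three stages together. Every helper name here is a placeholder for the corresponding component described above (and detailed in Section 4), not an API from the paper.

```python
def reconstruct(depth_stream):
    """Illustrative flow of the pipeline in Fig. 1 (all helpers are placeholders)."""
    # Stage 1: image capture and processing (Section 4.1).
    frames = [depth_adaptive_bilateral(Z) for Z in depth_stream]
    # Stage 2: local-to-global registration (Section 4.2).
    fragments = content_based_segment(frames)             # variable-size fragments
    models = [fuse_with_icp(frag) for frag in fragments]  # local fusion
    poses = global_geometric_registration(models)         # loop closures, cf. [11, 30]
    # Stage 3: weighted volumetric integration (Section 4.3).
    return integrate_wtsdf(models, poses)
```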
4 Proposed method
The proposed new techniques specifically include three aspects: depth adaptive bilateral filter, content-based segmentation, and adaptive weighted TSDF (WTSDF). The following subsections describe the core methods in our system.
4.1 Depth adaptive bilateral filter
A consumer depth camera based on structured-light stereo can be treated as a pair of stereo cameras in a canonical configuration^{[16]}. The depth z is recovered from the disparity D as z = fB/D, where f is the focal length and B is the baseline. Differentiating with respect to the disparity gives
$ \begin{equation} \frac{\partial z}{\partial D}=\frac{z^2}{fB}. \end{equation} $  (1) 
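Since the disparity is quantized, its noise can be assumed to have a constant standard deviation σ_D (consistent with the analysis in ^{[16]}); first-order propagation of this noise through (1) gives

$ \begin{equation*} \sigma_z \approx \left|\frac{\partial z}{\partial D}\right|\sigma_D = \frac{\sigma_D}{fB}\, z^2 \propto z^2. \end{equation*} $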
That is, the standard deviation (STD) of the noise in a depth measurement is proportional to the square of the depth. Thus, we propose a depth adaptive bilateral filtering method which smooths depth images more effectively than the standard bilateral filter^{[19]}.
Consider an observed depth image Z. For each pixel u, the filtered depth value is computed over its neighborhood N(u) as
$ \begin{equation} \hat{{\boldsymbol Z}}({\boldsymbol u})=\frac{1}{W} \sum\limits_{{\boldsymbol u}_k\in N({\boldsymbol u})}w_s({\boldsymbol u}-{\boldsymbol u}_k)\, w_c ({\boldsymbol Z}({\boldsymbol u})-{\boldsymbol Z}({\boldsymbol u}_k))\, {\boldsymbol Z}({\boldsymbol u}_k) \end{equation} $  (2)
where W is the normalization factor, i.e., the sum of all the weights over the neighborhood N(u), and the spatial weight w_s and the range weight w_c are defined as
$ \begin{eqnarray} \left\{ \begin{aligned} w_s &= {\rm exp}\left(-{\frac{({\boldsymbol u}-{\boldsymbol u}_k)^2}{2\delta_s^2}}\right)\\ w_c &= {\rm exp}\left(-{\frac{({\boldsymbol Z}({\boldsymbol u})-{\boldsymbol Z}({\boldsymbol u}_k))^2}{2\delta_c^2}}\right). \end{aligned} \right. \end{eqnarray} $  (3)
Unlike the standard bilateral filter, here the value of δ_c is not a constant but adapts to the measured depth:
$ \begin{equation} {\delta_c} = K{\boldsymbol Z}({\boldsymbol u})^2 \end{equation} $  (4) 
where K is a constant scale factor determined by the noise level of the depth camera.
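A minimal NumPy sketch of the filter defined by (2)-(4) is given below. The neighborhood radius, the value of K, and the handling of invalid (zero) depth pixels are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def depth_adaptive_bilateral(Z, radius=3, delta_s=2.0, K=5e-4):
    """Depth adaptive bilateral filter, eqs. (2)-(4).

    Z       : (H, W) depth image in meters; zeros mark invalid pixels.
    radius  : half-width of the square neighborhood N(u) (assumed value).
    delta_s : spatial kernel width delta_s.
    K       : scale factor in eq. (4), delta_c = K * Z(u)^2 (assumed value).
    """
    H, W = Z.shape
    out = Z.copy()
    # Precompute the spatial Gaussian weights over the neighborhood.
    dy, dx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    w_s = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * delta_s ** 2))
    Zp = np.pad(Z, radius, mode="edge")
    for y in range(H):
        for x in range(W):
            z = Z[y, x]
            if z <= 0:                       # skip invalid measurements
                continue
            patch = Zp[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            delta_c = K * z * z              # eq. (4): range width grows with depth^2
            w_c = np.exp(-(z - patch) ** 2 / (2.0 * delta_c ** 2))
            w = w_s * w_c * (patch > 0)      # ignore invalid neighbors
            out[y, x] = np.sum(w * patch) / np.sum(w)   # eq. (2)
    return out
```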
Fig. 2 compares the results of filtering a depth image with the standard bilateral filter and with the proposed depth adaptive bilateral filter. The color is for visualization only. We can see from the point cloud and mesh models that the depth adaptive bilateral filter removes noise and preserves edges more effectively. Both the foreground and background regions are appropriately smoothed while preserving depth discontinuity features, since the proposed filter adapts to the variation of depth.
Fig. 2. Results of filtering on the depth image. Raw data shows the fusion image of a floor with the raw depth and color information. (a)-(c): Results with the raw depth image, the depth image after bilateral filtering, and the depth image after the proposed adaptive bilateral filtering, respectively. Top: point clouds; Bottom: mesh models.
4.2 Content-based segmentation
The segmentation of a depth image sequence is the key to the local-to-global registration. Segmentation based on visual content can effectively reduce the odometry drift and make the global loop closure more reliable.
The data obtained from a handheld depth camera are closely related to the camera motion and the operator's scanning habits, so fragments of a fixed size do not suit all sequences. We therefore segment the sequence according to the co-visibility between the current frame and the first frame of the current fragment, as illustrated in Fig. 3.
Fig. 3. Co-visibility between two depth images
Consider the estimated pose T_i of the ith depth frame relative to the first frame of the current fragment. The co-visibility ratio between the two frames is computed in three steps.
First, we reconstruct the 3D point from each valid pixel of the ith depth frame using the camera intrinsic parameters.
Second, we transform the 3D point into the coordinate frame of the first depth frame with the estimated pose, obtaining p_q = (x_q, y_q, z_q)^T, and project it onto the image plane of the first frame:
$ \begin{equation} {\boldsymbol u}_q=\left(\frac{f_xx_q}{z_q}+c_x, \frac{f_yy_q}{z_q}+c_y\right)^{\rm T} \end{equation} $  (5) 
where f_x and f_y are the focal lengths and (c_x, c_y) is the principal point of the depth camera.
Third, we compute the number of available pixels of the ith depth frame and of the first depth frame, respectively, and then obtain the co-visibility ratio between the two frames. When this ratio drops below a given threshold, the current fragment is closed and a new fragment begins, as sketched below.
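The three steps can be summarized in a short NumPy sketch. The intrinsics, the pixel-validity test, and the threshold tau are illustrative assumptions; the paper does not specify these values.

```python
import numpy as np

def covisibility_ratio(Z_i, T_i, fx, fy, cx, cy):
    """Fraction of valid pixels of the ith frame that project into the first
    frame of the current fragment; T_i maps frame i to frame 0 (Section 4.2)."""
    H, W = Z_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = Z_i.ravel()
    valid = z > 0
    # Step 1: back-project every pixel of frame i to a 3D point.
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    P = np.stack([x, y, z, np.ones_like(z)])
    # Step 2: transform into the first frame and project with eq. (5).
    Q = T_i @ P
    zq = np.maximum(Q[2], 1e-6)
    uq = fx * Q[0] / zq + cx
    vq = fy * Q[1] / zq + cy
    # Step 3: count points that land inside the first image with positive depth.
    inside = (Q[2] > 0) & (uq >= 0) & (uq < W) & (vq >= 0) & (vq < H)
    n_valid = max(int(np.count_nonzero(valid)), 1)
    return np.count_nonzero(inside & valid) / n_valid

# A fragment is closed once the ratio falls below a threshold, e.g.,
#   if covisibility_ratio(Z_i, T_i, fx, fy, cx, cy) < tau: start_new_fragment()
# (tau and start_new_fragment are illustrative placeholders.)
```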
Segmenting the input depth image sequence into fragments of the same size is a simple method, but it is difficult to select an appropriate number of frames per fragment for reconstructing a good 3D scene model. We ran reconstruction experiments on the augmented ICL-NUIM dataset with no segmentation, uniform segmentation (50 frames), and the proposed content-based segmentation of the depth image sequence. The odometry drifts, measured by the median error and the root mean square error (RMSE), are shown in Table 1. The results show that the proposed segmentation effectively reduces the odometry drift on average.
We also ran reconstruction experiments on real-world scenes with uniform segmentation and with the content-based segmentation of the depth image sequence. The comparison can be seen by contrasting Figs. 4(a) and 5(a) with Figs. 4(c) and 5(c). The depth image sequences processed with the method of Choi et al.^{[11]} are partitioned into fragments of 50 frames. Compared with uniform segmentation, the content-based segmentation automatically adjusts the size of the fragments according to different datasets and to data scanned by different operators. It provides a good initial value for pose graph optimization and increases robustness, since the number of frames in each fragment is adaptive to the scanning process.
Fig. 4. Reconstruction results of the fr1/room scene from the RGB-D simultaneous localization and mapping (SLAM) dataset. (a) Results with the method of Choi et al.^{[11]}. (b) Surfel model with ElasticFusion^{[32]}. (c) Results with the proposed method.
Fig. 5. Reconstruction results of an indoor scene scanned by a robot equipped with a Microsoft Kinect for Windows. (a) Results with the method of Choi et al.^{[11]}. (b) Surfel model with ElasticFusion^{[32]}. (c) Results with the proposed method.
4.3 Adaptive weighted TSDF
In this subsection, a TSDF with new adaptive weights is proposed to merge the registered data into a complete scene model, taking the different positions of points into account. This gives sufficient detail to the regions with high accuracy and high interest.
For a given voxel v, the TSDF value F(v) and its weight W(v) are obtained by integrating over all n registered depth images:
$ \begin{eqnarray} \left\{ \begin{aligned} F({\boldsymbol v}) &= \frac{\sum\limits_{i=1}^{n} f_i({\boldsymbol v})w_i({\boldsymbol v})}{W({\boldsymbol v})}\\ W({\boldsymbol v}) &= \sum\limits_{i=1}^{n} w_i({\boldsymbol v}) \end{aligned} \right. \end{eqnarray} $  (6)
where the signed distance function f_i(v) of the ith depth image is computed as
$ \begin{equation}\label{eq:F} f_i({\boldsymbol v})=[{\boldsymbol K}^{-1}{\boldsymbol Z}_i({\boldsymbol u})[{\boldsymbol u}^{\rm T}, 1]^{\rm T}]_{z}-[{\boldsymbol v}]_{z} \end{equation} $  (7)
where K is the intrinsic matrix of the depth camera, u is the pixel onto which voxel v projects in the ith depth image, and [·]_z denotes the z-component of a vector.
The weighting function w_i(v) determines how much each observation contributes to the fused model. We design it based on two considerations.
On the one hand, as discussed in Section 4.1, the main noise in depth measurements is quantization noise, and the depth estimate becomes less reliable as the depth increases: the STD of the noise grows with the square of the depth, so the variance grows with the fourth power of the depth. We therefore weight each observation by the inverse of this variance.
On the other hand, since a consumer depth camera has a narrow field of view, when we scan around the objects in a scene we usually point the principal axis of the camera directly at the regions of interest, because the error grows as the distance from a point to the principal axis increases. In order to emphasize the regions of interest, we give high weights to points close to the principal axis.
Let d_i denote the distance from a point to the principal axis of the camera in the ith view; the resulting region-of-interest (ROI) model is illustrated in Fig. 6.
Fig. 6. Regions of interest model 
Thus, motivated by the depth noise and the ROI model, we assign the adaptive weighting function as follows:
$ \begin{align} w_i({\boldsymbol v})=\left\{ \begin{aligned} & \frac{{\rm exp}\left(-\frac{d_i^2}{2\delta_r^2}\right)}{z_i^4}, ~~ 0<z_i<d\\ & 0, ~~ z_i\ge d \end{aligned} \right. \end{align} $  (8)
where z_i is the depth of the observed point, d is the maximum reliable depth of the camera, and we use a Gaussian exponential model which takes the Gaussian lateral noise^{[16]} as the exponent to express the ROI model.
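A compact sketch of one integration step with the proposed weight (8) follows. The voxel-grid layout, the truncation distance, and the parameter values delta_r, d_max and trunc are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def integrate_frame(tsdf, weight, centers, Z, K, T,
                    delta_r=0.5, d_max=4.0, trunc=0.04):
    """Fuse one registered depth image into the volume, eqs. (6)-(8).

    tsdf, weight : (N,) running TSDF values F(v) and accumulated weights W(v).
    centers      : (N, 3) voxel centers in world coordinates.
    Z            : (H, W) depth image in meters (0 = invalid).
    K            : (3, 3) depth camera intrinsic matrix.
    T            : (4, 4) camera-to-world pose of this frame.
    """
    H, W_img = Z.shape
    R, t = T[:3, :3], T[:3, 3]
    v_cam = (centers - t) @ R            # voxel centers in the camera frame
    z_v = v_cam[:, 2]
    zs = np.maximum(z_v, 1e-6)
    uvw = v_cam @ K.T                    # pinhole projection
    u = np.round(uvw[:, 0] / zs).astype(int)
    v = np.round(uvw[:, 1] / zs).astype(int)
    ok = (z_v > 0) & (u >= 0) & (u < W_img) & (v >= 0) & (v < H)
    z_i = np.where(ok, Z[np.clip(v, 0, H - 1), np.clip(u, 0, W_img - 1)], 0.0)
    # Eq. (7): signed distance along the optical axis, then truncation.
    f_i = np.clip(z_i - z_v, -trunc, trunc) / trunc
    # Eq. (8): Gaussian ROI term over the fourth power of the depth.
    d_i = np.linalg.norm(v_cam[:, :2], axis=1)   # distance to the principal axis
    w_i = np.where((z_i > 0) & (z_i < d_max),
                   np.exp(-d_i ** 2 / (2 * delta_r ** 2)) /
                   np.maximum(z_i, 1e-6) ** 4,
                   0.0)
    # Eq. (6): running weighted average.
    new_w = weight + w_i
    upd = new_w > 0
    tsdf[upd] = (tsdf[upd] * weight[upd] + f_i[upd] * w_i[upd]) / new_w[upd]
    weight[:] = new_w
```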
Fig. 7 shows a comparison of volumetric integration by the standard TSDF with uniform weights and by the proposed adaptive WTSDF. The adaptive weighting preserves finer geometric detail in the regions of interest.
Fig. 7. Comparison of volumetric integration with the standard TSDF and the proposed adaptive WTSDF 
5 Experiments
To illustrate the effectiveness of the proposed reconstruction method, we carried out a series of experiments to evaluate the qualitative and quantitative performance of the system.
5.1 Hardware
For all experiments, we ran the proposed system on a standard desktop PC with an Intel Core i7-4790 3.6 GHz CPU and an Nvidia GeForce GTX 750 Ti 2 GB GPU.
5.2 Data
One part of the data used in our experiments is captured by us with a Microsoft Kinect for Windows, which streams VGA resolution (640 × 480) depth images at 30 Hz. The rest comes from the following public datasets.
RGB-D SLAM dataset. This dataset is provided by Sturm et al. for the evaluation of visual odometry and visual SLAM systems. Our experiments are conducted on the fr1/room sequence, which covers a complete indoor scene captured with a Microsoft Kinect for Windows. The data is recorded at full frame rate (30 Hz) and sensor resolution (640 × 480).
3D Scene dataset. This dataset is provided by Zhou et al.^{[30]}. They used an Asus Xtion Pro Live camera, which streams VGA resolution (640 × 480) depth images at 30 Hz.
Augmented ICL-NUIM dataset. The original ICL-NUIM dataset is based on the synthetic environments provided by Handa et al.^{[33]}. Choi et al.^{[11]} augmented it in a number of ways to adapt it for the evaluation of complete scene reconstruction pipelines. The average trajectory length is 36 m and the average surface area coverage is 88%.
5.3 Real-world scenes
From the overall analysis of our experimental results, our reconstruction system is more robust than the state-of-the-art systems of Choi et al.^{[11]} and ElasticFusion^{[32]}, especially in real-world scenes scanned by a robot. The precision of the models reconstructed from both our own captured data and the public datasets is better than that of the state-of-the-art offline reconstruction method.
Fig. 4 shows the reconstruction results of the fr1/room scene from the RGB-D SLAM dataset. Fig. 5 shows the reconstruction results of a real-world indoor scene scanned by a robot equipped with a Microsoft Kinect for Windows. The reconstructions of these scenes with the method of Choi et al.^{[11]} and with ElasticFusion^{[32]} are shown in Figs. 4(a), 5(a), 4(b) and 5(b). Neither reconstruction is satisfactory, owing to erroneous alignments. Figs. 4(c) and 5(c) show the reconstruction results with the proposed method. The numbers of depth frames used in Figs. 4 and 5 are 1 352 and 2 082, respectively. Fig. 8 shows the reconstruction result of a dynamic working area scanned by the robot equipped with a Microsoft Kinect for Windows. Although the data captured by the robot contain less surface information, since the movements of the robot are less flexible, we still obtain complete scene models with good geometric structure because the proposed method is more robust.
Fig. 8. Reconstruction results of the dynamic working area (with 10 000 frames) scanned by a robot equipped with a Microsoft Kinect for Windows
Fig. 9 shows the reconstruction result of a real-world indoor scene manually scanned with a Microsoft Kinect for Windows. The left of Fig. 9 illustrates the complete scene model reconstructed by the proposed reconstruction system. The room is 4 m wide and 5 m long. The number of depth frames is 6 000, and the total camera trajectory length is about 68 m. At the center of the room is a desk with a laptop on it. The trailing wires introduce dynamic disturbances and extra noise: during the scanning process, the data link of our Kinect is connected to the laptop, and its power cord is plugged into a power strip with a long trailing cable. Nevertheless, we still obtain a complete model of the scene. As shown in the top right of Fig. 9, the Venus statue and the sofa, in which we are interested, are reconstructed with high fidelity. The bottom right of Fig. 9 shows the reconstruction results of the corresponding objects with the state-of-the-art method. Both the desk and the laptop models show weak geometric detail due to the reflective laptop surface and the moving wires. We can clearly see from Fig. 9 that the 3D models reconstructed with the proposed method are better, since our system is more robust for reconstructing real-world scenes with a consumer depth camera.
Fig. 9. Reconstruction results of an indoor scene manually scanned with a Microsoft Kinect for Windows. The left shows the complete scene model and the camera trajectory obtained with the proposed method. The right shows enlarged views of A, B and C in the room: the top row shows the results with the proposed method, and the bottom row shows the results with the method of Choi et al.^{[11]}
Fig. 10 shows the reconstruction results of real-world scenes from the 3D Scene dataset, manually scanned with an Asus Xtion Pro Live camera. The reconstructions with the method of Choi et al.^{[11]} and with ElasticFusion^{[32]} are shown in Figs. 10(a) and 10(b). The corresponding details of the 3D models of the burghers and the copy room are shown in Figs. 7 and 11, respectively. The depth image sequences of the burghers and the copy room are reconstructed with the method of Choi et al.^{[11]}, which segments the sequence into fragments of 50 frames each, and with the proposed method, respectively. The statues of the burghers are 2 m tall and the total camera trajectory length is about 184 m. The size of the copy room is 14 m
Fig. 10. Reconstruction results of real-world scenes from the 3D Scene dataset, manually scanned with an Asus Xtion Pro Live. The top shows the burghers; the bottom shows the copy room. (a) Results with the method of Choi et al.^{[11]}. (b) Surfel model with ElasticFusion^{[32]}. (c) Results with the proposed method.
Fig. 11. Reconstruction details of the copy room from the 3D Scene dataset
5.4 Synthetic scenes
To evaluate the accuracy of the camera trajectories and the 3D model surfaces, we use four depth image sequences of the augmented ICL-NUIM scenes. The accuracies are estimated for Kintinuous^{[23]}, DVO SLAM^{[34]}, SUN3D SfM^{[20]}, Choi et al.^{[11]}, ElasticFusion^{[32]} and the proposed method. Note that we ran the methods of Choi et al.^{[11]} and ElasticFusion^{[32]} ourselves, while the results of DVO SLAM^{[34]} and SUN3D SfM^{[20]} are taken from the paper of Choi et al.^{[11]}
Camera trajectory evaluation. Table 2 reports the accuracy of the camera trajectories using the root mean square error (RMSE) metric described by Handa et al.^{[33]} The RMSE of the trajectory error is given in meters. As can be seen from Table 2, the average accuracy of the trajectories estimated with the proposed method is higher, since it reduces the accumulated odometry errors.
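For reference, a minimal sketch of this RMSE computation, assuming the estimated and ground-truth trajectories are already associated pose-by-pose and aligned in a common frame (as in the evaluation protocol of ^{[33]}):

```python
import numpy as np

def trajectory_rmse(est, gt):
    """RMSE of the translational trajectory error in meters.

    est, gt : (N, 3) associated camera positions of the estimated and
              ground-truth trajectories, already aligned to a common frame.
    """
    err = np.linalg.norm(est - gt, axis=1)   # per-pose translational error
    return float(np.sqrt(np.mean(err ** 2)))
```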
Assessing the quality of 3D reconstruction. We use the open-source tool CloudCompare to evaluate the surface reconstruction quality. The reconstructed surfaces of the augmented ICL-NUIM scenes are compared against the ground-truth 3D model surfaces. The median distance of each reconstructed model to the ground-truth surface is reported in Table 3. It indicates that our method effectively reduces the average error.
6 Conclusions
We presented a robust approach to elaborate scene reconstruction with a consumer depth camera. The main contribution of our research is the local-to-global registration used to obtain complete scene reconstructions, with the accuracy of the 3D scene models further improved by depth image filtering and weighted volumetric integration. The experimental results demonstrate that the proposed approach improves the robustness of reconstruction and enhances the fidelity of the 3D models produced from a consumer depth camera.
Acknowledgements
This work was supported by the National Key Technologies R&D Program (No. 2016YFB0502002) and in part by the National Natural Science Foundation of China (Nos. 61472419, 61421004 and 61572499).
References
[1] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, ACM, Santa Barbara, USA, pp. 559-568, 2011.
[2] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, IEEE, Basel, Switzerland, pp. 127-136, 2011.
[3] B. Curless, M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, New York, USA, pp. 303-312, 1996.
[4] S. Rusinkiewicz, M. Levoy. Efficient variants of the ICP algorithm. In Proceedings of the 3rd International Conference on 3D Digital Imaging and Modeling, IEEE, Quebec, Canada, pp. 145-152, 2001.
[5] C. Kerl, J. Sturm, D. Cremers. Robust odometry estimation for RGB-D cameras. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Karlsruhe, Germany, pp. 3748-3754, 2013.
[6] F. Steinbrücker, J. Sturm, D. Cremers. Real-time visual odometry from dense RGB-D images. In Proceedings of IEEE International Conference on Computer Vision Workshops, IEEE, Barcelona, Spain, pp. 719-722, 2011.
[7] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, J. McDonald. Real-time large-scale dense RGB-D SLAM with volumetric fusion. International Journal of Robotics Research, vol. 34, no. 4-5, pp. 598-626, 2015. DOI: 10.1177/0278364914551008.
[8] J. Huai, Y. Zhang, A. Yilmaz. Real-time large scale 3D reconstruction by fusing Kinect and IMU data. In Proceedings of ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, ISPRS, La Grande Motte, France, vol. II-3/W5, pp. 491-496, 2015.
[9] M. Niessner, A. Dai, M. Fisher. Combining inertial navigation and ICP for real-time 3D surface reconstruction. Eurographics, E. Galin, M. Wand, Eds., Strasbourg, France: The Eurographics Association, pp. 13-16, 2014.
[10] K. H. Yang, W. S. Yu, X. Q. Ji. Rotation estimation for mobile robot based on single-axis gyroscope and monocular camera. International Journal of Automation and Computing, vol. 9, no. 3, pp. 292-298, 2012. DOI: 10.1007/s11633-012-0647-z.
[11] S. Choi, Q. Y. Zhou, V. Koltun. Robust reconstruction of indoor scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 5556-5565, 2015.
[12] T. Whelan, M. Kaess, J. J. Leonard, J. McDonald. Deformation-based loop closure for large scale dense RGB-D SLAM. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Tokyo, Japan, pp. 548-555, 2013.
[13] Q. Y. Zhou, S. Miller, V. Koltun. Elastic fragments for dense scene reconstruction. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 473-480, 2013.
[14] A. Chatterjee, V. M. Govindu. Noise in structured-light stereo depth cameras: Modeling and its applications. arXiv:1505.01936, 2015.
[15] K. Khoshelham. Accuracy analysis of Kinect depth data. In Proceedings of International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, ISPRS, Calgary, Canada, vol. XXXVIII-5/W12, pp. 133-138, 2011.
[16] K. Khoshelham, S. O. Elberink. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors, vol. 12, no. 2, pp. 1437-1454, 2012. DOI: 10.3390/s120201437.
[17] C. V. Nguyen, S. Izadi, D. Lovell. Modeling Kinect sensor noise for improved 3D reconstruction and tracking. In Proceedings of the 2nd International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, IEEE, Zurich, Switzerland, pp. 524-530, 2012.
[18] J. Smisek, M. Jancosek, T. Pajdla. 3D with Kinect. In Proceedings of IEEE International Conference on Computer Vision Workshops, IEEE, Barcelona, Spain, pp. 1154-1160, 2011.
[19] C. Tomasi, R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of the 6th International Conference on Computer Vision, IEEE, Bombay, India, pp. 839-846, 1998.
[20] J. X. Xiao, A. Owens, A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 1625-1632, 2013.
[21] Q. F. Chen, V. Koltun. Fast MRF optimization with application to depth reconstruction. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 3914-3921, 2014.
[22] J. Kopf, M. F. Cohen, D. Lischinski, M. Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics, vol. 26, no. 3, Article number 96, 2007.
[23] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, J. McDonald. Kintinuous: Spatially extended KinectFusion. In Proceedings of RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, Sydney, Australia, pp. 314, 2012.
[24] A. Zeng, S. Song, M. Niessner, M. Fisher, J. X. Xiao, T. Funkhouser. 3DMatch: Learning the matching of local 3D geometry in range scans. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, 2017.
[25] M. Halber, T. Funkhouser. Structured global registration of RGB-D scans in indoor environments. arXiv:1607.08539, 2016.
[26] S. Parker, P. Shirley, Y. Livnat, C. Hansen, P. P. Sloan. Interactive ray tracing for isosurface rendering. In Proceedings of the Conference on Visualization, IEEE, North Carolina, USA, pp. 233-238, 1998.
[27] W. E. Lorensen, H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, vol. 21, no. 4, pp. 163-169, 1987. DOI: 10.1145/37402.
[28] M. Zollhöfer, A. Dai, M. Innmann, C. L. Wu, M. Stamminger, C. Theobalt, M. Niessner. Shading-based refinement on volumetric signed distance functions. ACM Transactions on Graphics, vol. 34, no. 4, Article number 96, 2015.
[29] Q. Y. Zhou, V. Koltun. Dense scene reconstruction with points of interest. ACM Transactions on Graphics, vol. 32, no. 4, Article number 112, 2013.
[30] Q. Y. Zhou, J. Park, V. Koltun. Fast global registration. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, pp. 766-782, 2016.
[31] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, W. Burgard. g2o: A general framework for graph optimization. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Shanghai, China, pp. 3607-3613, 2011.
[32] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, A. J. Davison. ElasticFusion: Dense SLAM without a pose graph. In Proceedings of Robotics: Science and Systems, Rome, Italy, 2015.
[33] A. Handa, T. Whelan, J. McDonald, A. J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Hong Kong, China, pp. 1524-1531, 2014.
[34] C. Kerl, J. Sturm, D. Cremers. Dense visual SLAM for RGB-D cameras. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Tokyo, Japan, pp. 2100-2106, 2013.