Deep Learning for Robotic Scene Understanding
Date
2022
Authors
Sun, Libo
Advisors
Shen, Chunhua
Liu, Yifan
Pang, Guansong
Type
Thesis
Abstract
Scene understanding is a complex yet essential task for intelligent robots, and achieving it reliably remains a challenging problem. Deep learning has been applied with wide success and has produced breakthroughs in many areas. In this thesis, we investigate how deep learning-based methods can significantly improve the scene understanding ability of robots. Specifically, our work involves four fundamental robotic scene understanding subtasks: road detection, semantic segmentation, depth estimation, and visual odometry (VO). We present the proposed deep learning-based approaches in detail and show how they improve performance on these subtasks.

To begin with, because drivable area detection is critically important for autonomous driving and robotics, we propose a road detection method that reduces device reliance while maintaining performance. Unlike previous road detection methods that rely on LiDAR, our method obtains state-of-the-art performance with RGB images only. In our framework, we derive a pseudo-LiDAR from monocular depth estimation and propose a feature fusion network to fuse RGB and pseudo-LiDAR information. To optimize the network architecture and improve its efficiency, we propose a method to search for the information propagation paths. Additionally, we design a modality distillation strategy that significantly reduces network parameters and inference time.

Furthermore, because autonomous vehicles and robots are commonly equipped with stereo cameras that capture binocular images, we propose a stereo vision-based semantic segmentation framework that enables current monocular architectures to exploit stereo image data and thereby improve semantic segmentation performance. The improvements are obtained via two approaches: (1) label generation and pre-training, and (2) stereo vision-based information fusion. Comprehensive experiments using different well-known semantic segmentation architectures on different datasets demonstrate the efficacy of our method.

Finally, to obtain better 3D scene understanding, we propose a framework that exploits monocular depth estimation to improve monocular VO. The core of this framework is a monocular depth estimation module with a strong generalization capability for diverse scenes. The module has two separate working modes that assist localization and mapping, respectively. Given a single monocular image, it predicts a relative depth that helps the localization module improve its accuracy. Given a sparse depth map and an RGB image, it generates accurate, scale-consistent depth for dense mapping. Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes. More significantly, our framework boosts the performance of existing geometry-based VO methods by a large margin.
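As an illustration of the pseudo-LiDAR idea mentioned in the abstract, the sketch below back-projects a dense depth map into a 3D point cloud using the standard pinhole camera model. It is a minimal sketch under stated assumptions, not the thesis implementation: the function name `depth_to_pseudo_lidar`, the intrinsics parameters `fx`, `fy`, `cx`, `cy`, and the `max_depth` cut-off are all hypothetical choices made for this example, and the abstract does not specify how the pseudo-LiDAR is actually constructed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_pseudo_lidar(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    max_depth: float = 80.0,  # assumed cut-off, not taken from the thesis
) -> List[Point]:
    """Back-project a dense depth map into 3D points in the camera frame.

    depth[v][u] is the depth (in metres) at pixel row v, column u.
    Pixels with non-positive depth or depth beyond max_depth are skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0 or z > max_depth:
                continue  # invalid or too distant to be useful
            # Pinhole model: u = fx * x / z + cx, v = fy * y / z + cy
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with made-up values and intrinsics.
example_depth = [[2.0, 0.0], [4.0, 100.0]]
example_points = depth_to_pseudo_lidar(example_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In this toy input, two of the four pixels are discarded (one has zero depth, one exceeds the assumed `max_depth`), so only two 3D points remain. A real pipeline would take the depth map from a monocular depth network and the intrinsics from camera calibration.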
School/Discipline
School of Computer Science
Dissertation Note
Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 2022
Provenance
This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third-party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third-party copyright material that you wish to have removed from this electronic version, please complete the take-down form located at: http://www.adelaide.edu.au/legals.