Please use this identifier to cite or link to this item:
|Title:||Real-Time Structure and Object Aware Semantic SLAM|
|School/Discipline:||School of Computer Science|
|Abstract:||Simultaneous Localization And Mapping (SLAM) is one of the fundamental problems in mobile robotics and addresses the reconstruction of a previously unseen environment while simultaneously localising a mobile robot with respect to it. For visual-SLAM, the simplest representation of the map is a collection of 3D points that is sparse and efficient to compute and update, particularly for large-scale environments, however it lacks semantic information and is not useful for high-level tasks such as robotic grasping and manipulation. Although methods to compute denser representations have been proposed, these reconstructions remain equivalent to a collection of points and therefore carry no additional semantic information or relationship. Man-made environments contain many structures and objects that carry high-level semantics and can potentially act as landmarks of a SLAM map, while encapsulating semantic information as opposed to a set of points. For instance, planes are good representations for feature deprived regions, where they provide information complimentary to points and can also model dominant planar layouts of the environment with very few parameters. Furthermore, a generic representation for previously unseen objects can be used as a general landmark that carries semantics in the reconstructed map. Integrating visual semantic understanding and geometric reconstruction has been studied before, however due to various reasons, including high- level geometric entities in the SLAM framework has been restricted to a slow, offline structure-from-motion context, or high-level entities merely act as regulators for points in the map instead of independent landmarks. One of those critical reasons is the lack of proper mathematical representation for high-level landmarks and the other main reasons are the challenge of detection and tracking of these landmarks and formulating an observation model – a mapping between corresponding image observable quantities and estimated parameters of the representations. In this work, we address these challenges to achieve an online real-time SLAM framework with scalable maps consisting of both sparse points and high-level structural and semantic landmarks such as planes and objects. We explicitly target real-time performance and keep that as a beacon which influences critically the representation choice and all the modules of our SLAM system. In the context of factor graphs, we propose novel representations for structural entities as planes and general unseen and not-predefined objects as bounded dual quadrics that decompose to permit clean, fast and effective real-time implementation that is amenable to the nonlinear leastsquare formulation and respects the sparsity pattern of the SLAM problem. In this representation we are not concerned with high-fidelity reconstruction of individual objects, but rather to represent the general layout and orientation of objects in the environment. Also the minimal representations of planes is explored leading to a representation that can be constructed and updated online in a least-squares framework. Another challenge that we address in this work is to marry high-level landmark detections based on deep-learned frameworks, with geometric SLAM systems. Due to the recent success of CNN-based object detections and also depth and surface normal estimations from single image, it is feasible now to detect and estimate these semantic landmarks from single RGB images, therefore leading us seamlessly from RGB-D SLAM system to pure monocular SLAM thanks to the real-time predictions of the trained CNN and appropriate representations. Furthermore, to benefit from deep-learned priors, we incorporate high-fidelity single-image reconstructions and hallucinations of objects on top of the coarse quadrics to enrich the sparse map semantically, while constraining the shape of the coarse quadrics even more. Pertinent to our beacon, proposed landmark representations in the map also provide the potential for imposing additional constraints and priors that carry crucial semantic information about the scene, without incurring great extra computational cost. In this work, we have explored and proposed constraints such as priors on the extent and shape of the objects, point-plane regularizer, plane-plane (Manhattan assumption), and plane-object (supporting affordance) constraints. We evaluate our proposed SLAM system extensively using different input sensor modalities from RGB-D to monocular in almost all publicly available benchmarks both indoors and outdoors to show its applicability as a general-purpose SLAM solution. The extensive experiments show the efficacy of our SLAM through different comparisons and ablation studies including high-level structures and objects with imposed constraints among them in various scenarios. In particular, the estimated camera trajectories have been improved significantly in varied sequences of visual SLAM datasets and also our own captured sequences with UR5 robotic arm equipped with a depth camera. In addition to more accurate camera trajectories, our system yields enriched sparse maps with semantically meaningful planar structures and generic objects in the scene along with their mutual relationships|
|Dissertation Note:||Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2019|
|Provenance:||This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals|
|Appears in Collections:||Research Theses|
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.