Domain Generalisation in Reinforcement Learning
Date
2025
Authors
Orenstein, Adrian
Advisors
Reid, Ian
Abbasnejad, Ehsan
Type
Thesis
Abstract
Deep reinforcement learning (RL) aims to learn a general policy for an agent acting in an environment. The policy learns state representations from observations using deep neural networks, and so deep RL inherits their weaknesses. For instance, neural networks are prone to shortcut learning, in which models latch on to superficial input patterns, leading to poor generalisation. Common remedies are to (1) obtain larger datasets for better data coverage, (2) train neural networks with more parameters, or (3) include more input variations, either by simply adding them to the training set (i.e. data augmentation) or by additionally regularising the objective with information about these variations (i.e. domain information, as in domain generalisation approaches). However, while these approaches have been investigated in supervised learning, their effectiveness remains underexplored in RL. In this thesis, we investigate methods for improving the generalisation capability of on-policy RL agents. We conduct our investigation on a relatively new procedural benchmark named ProcGen (Cobbe et al., 2020), where, for a particular game, levels are procedurally generated, each receiving its own domain label, giving us an ideal platform to investigate domain generalisation methods developed for supervised learning and their efficacy in on-policy RL.
In our investigation we find that utilising domain information does indeed improve generalisation performance. We apply a supervised learning method, AND-mask (Parascandolo et al., 2021), to a PPO (Schulman et al., 2017) agent and find that when the agent does not get many variations of the same environment to learn from, AND-mask effectively regularises the learned representations and enables the agent to generalise better to novel domains. We also identify a limitation of AND-mask that restricts scalability when learning from more training domains. This investigation yields agents that generalise more effectively when the diversity of training domains is limited, which is beneficial when gathering more diverse data in the real world is costly.
Lastly, we explore how model scale, the number of training samples, and the diversity of those samples contribute to generalisation performance. Our investigation focuses on the on-policy case in RL, where data gathered by the policy is more difficult for deep neural networks to learn from because the data is highly correlated with the policy. In other words, the data gathered by the policy is not independent and identically distributed (i.i.d.), which is often an assumption required for generalisation. Furthermore, as the agent learns and its behaviour continually changes, the dataset the agent gathers in its experience replay is non-stationary. Given these two complications, non-i.i.d. data and non-stationarity, we observe that larger models using a single backbone to extract features are more effective at generalisation than smaller, decoupled networks, regardless of the number of domains provided during training.
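To make the gradient-masking idea concrete, the following is a minimal sketch of the AND-mask agreement rule as it could be applied to per-domain gradients; the function name, tensor shapes, and threshold tau are illustrative assumptions rather than the thesis' actual implementation.

import torch

def and_mask_gradient(per_domain_grads: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # per_domain_grads: (num_domains, num_params) -- one flattened gradient per
    # training domain (e.g. per ProcGen level). Hypothetical names for illustration.
    signs = torch.sign(per_domain_grads)                    # entries in {-1, 0, +1}
    agreement = signs.mean(dim=0).abs()                     # per-component sign agreement in [0, 1]
    mask = (agreement >= tau).to(per_domain_grads.dtype)    # keep only components whose sign agrees
    return mask * per_domain_grads.mean(dim=0)              # masked average gradient

In a PPO setting, the per-domain gradients would typically be obtained by splitting each rollout batch by its domain (level) label before the update step; with tau = 1.0 the rule reduces to the strict logical AND over domains described by Parascandolo et al. (2021).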
School/Discipline
School of Computer and Mathematical Sciences
Dissertation Note
Thesis (MPhil) -- University of Adelaide, School of Computer and Mathematical Sciences, 2025
Provenance
This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals