Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/100723
Type: Theses
Title: Improving partial mutual information based input variable selection for data driven environmental and water resources models
Author: Li, Xuyuan
Issue Date: 2015
School/Discipline: School of Civil, Environmental and Mining Engineering
Abstract: Artificial neural networks (ANNs), one of the most commonly used types of data driven model for environmental and water resources problems, have been applied successfully and extensively over the last two decades and are still gaining in popularity. Careful consideration of the methods used at each step of ANN development (data collection, data processing, input variable selection, data division, calibration and validation) is vitally important, as ANN model development is based on data rather than on an understanding of the underlying physical processes. Among these steps, input variable selection (IVS) plays a significant role, as the performance of the developed model can be compromised if inputs that have a pronounced relationship with the modelled output are omitted. Conversely, if redundant or superfluous inputs are included, calibration becomes extremely challenging and model validation and knowledge extraction become problematic. Consequently, various techniques have been developed to make IVS more accurate. Partial mutual information (PMI) is one of the most promising approaches to IVS, as it has a number of desirable properties, such as the ability to account for input relevance, the ability to cater to both linear and non-linear input-output relationships and the ability to check the redundancy of selected inputs. PMI IVS is a stepwise algorithm that selects one variable per iteration. In each iteration, the strength of the relationship between each potential input and the output is quantified using mutual information (MI), and input redundancy is accounted for by removing the influence of already selected inputs. This is achieved by developing models between the selected inputs and the output and, during the next iteration, assessing the strength of the relationship (in terms of MI) between the remaining potential inputs and the residuals of these models; this quantity is referred to as PMI.
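The stepwise procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis software: all function names are hypothetical, the densities use a Gaussian kernel on standardised data with a single bandwidth from the Gaussian reference rule, residuals come from a kernel regression of the kind a GRNN performs, and the loop stops after a fixed number of inputs (`n_select`) because the abstract does not specify a termination criterion. Following the usual PMI formulation, the candidate inputs are residualised as well as the output; the abstract itself only mentions the output residuals.

```python
import numpy as np

def reference_bandwidth(n, d):
    """Gaussian reference rule for n standardised samples in d dimensions."""
    return (4.0 / (d + 2.0)) ** (1.0 / (d + 4.0)) * n ** (-1.0 / (d + 4.0))

def standardise(a):
    """Scale each column to zero mean and unit variance; returns shape (n, d)."""
    a = np.asarray(a, dtype=float)
    a = a.reshape(len(a), -1)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def kernel_weights(z, h):
    """Pairwise Gaussian kernel weights exp(-|z_i - z_j|^2 / (2 h^2))."""
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * h ** 2))

def density_at_samples(z):
    """Kernel density estimate evaluated at each sample of z, shape (n, d)."""
    n, d = z.shape
    h = reference_bandwidth(n, d)
    return kernel_weights(z, h).sum(axis=1) / (n * (np.sqrt(2.0 * np.pi) * h) ** d)

def mutual_information(x, y):
    """Sample-average estimate of the MI between two 1-D variables."""
    x, y = standardise(x), standardise(y)
    joint = density_at_samples(np.hstack([x, y]))
    marginals = density_at_samples(x) * density_at_samples(y)
    return float(np.mean(np.log(joint / marginals)))

def residuals(target, conditioning):
    """Target minus its kernel-regression estimate given the conditioning inputs."""
    target = np.asarray(target, dtype=float)
    z = standardise(conditioning)
    w = kernel_weights(z, reference_bandwidth(*z.shape))
    return target - (w @ target) / w.sum(axis=1)

def pmi_select(X, y, n_select):
    """Return n_select column indices of X, chosen one per iteration."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        if selected:
            # Remove the influence of the inputs chosen so far.
            chosen = X[:, selected]
            out = residuals(y, chosen)
            cand = {j: residuals(X[:, j], chosen) for j in remaining}
        else:
            # First iteration: plain MI between each candidate and the output.
            out, cand = y, {j: X[:, j] for j in remaining}
        scores = {j: mutual_information(cand[j], out) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling `pmi_select(X, y, 2)` on an `(n, p)` array of candidate inputs returns the indices of the two inputs chosen in the first two iterations. A candidate that nearly duplicates an already selected input scores poorly in later iterations, because little of it remains once that input's influence has been removed.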
Although PMI IVS has already been applied successfully in a number of hydrological and water resources modelling studies, present implementations predominantly depend on the assumption that the data used to develop the model follow a Gaussian distribution. This assumption has the potential to affect two steps in the PMI process: the estimation of MI/PMI and the estimation of the residuals. MI/PMI estimation requires kernel density estimates of the modelling data in order to estimate the marginal and joint probability density functions (PDFs). These estimates rely on kernel bandwidths (or smoothing parameters), and most studies obtain them using the Gaussian reference rule, which only results in optimal bandwidth estimates if the modelling data follow a Gaussian distribution. However, this is unlikely to be the case when dealing with water resources and other environmental data. Residual estimation (RE) has generally been done using general regression neural networks (GRNNs), which also require estimates of kernel bandwidths and therefore suffer from the same issues as MI/PMI estimation. The purpose of this thesis is to assess how the assumption that the data follow a Gaussian distribution affects the performance of PMI IVS, and to evaluate the efficacy of potential methods for overcoming any problems associated with this assumption.
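For reference, the quantities mentioned above can be written out as follows. This is a sketch in commonly used notation rather than the notation of the thesis: it assumes n samples standardised to unit variance, a Gaussian kernel with a single bandwidth h in d dimensions, and the symbols z, h, v and the hats are introduced here for illustration only. The bandwidth h enters both the density estimates behind the MI estimate and the GRNN-style regression behind the residuals, which is why a single distributional assumption affects both steps.

```latex
\begin{align}
% Kernel density estimate from standardised samples z_1, ..., z_n in d dimensions
\hat{f}(z) &= \frac{1}{n\,(\sqrt{2\pi}\,h)^{d}}
              \sum_{j=1}^{n} \exp\!\left(-\frac{\lVert z - z_j \rVert^{2}}{2h^{2}}\right) \\
% Gaussian reference rule for the bandwidth
h &= \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)} \\
% Sample-average estimate of mutual information
\hat{I}(X;Y) &= \frac{1}{n} \sum_{i=1}^{n}
              \ln \frac{\hat{f}_{X,Y}(x_i, y_i)}{\hat{f}_{X}(x_i)\,\hat{f}_{Y}(y_i)} \\
% GRNN (kernel regression) estimate of the output and the resulting residual
\hat{y}(x) &= \frac{\sum_{j=1}^{n} y_j \exp\!\left(-\lVert x - x_j \rVert^{2} / 2h^{2}\right)}
                   {\sum_{j=1}^{n} \exp\!\left(-\lVert x - x_j \rVert^{2} / 2h^{2}\right)},
\qquad v_i = y_i - \hat{y}(x_i)
\end{align}
```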
In order to achieve this, a large number of numerical tests are conducted on synthetic data with different degrees of normality and non-linearity. These tests investigate the effectiveness of a range of options for (i) bandwidth estimation (an issue caused by making Gaussian assumptions in non-Gaussian circumstances when adopting kernel based estimation in both MI/PMI and RE) and (ii) dealing with boundary issues (caused by using a symmetrical kernel for bounded or asymmetrical data when implementing kernel based estimation in both MI/PMI and RE), as well as methods for RE that do not require kernel density estimates. The results from these tests are used to develop preliminary guidelines for the selection of the most appropriate bandwidth and the most effective treatment of the boundary issue. These guidelines are then validated on two water resources case studies with different data properties and problem linearity: forecasting of river salinity in the River Murray, Australia, and rainfall-runoff modelling in the Kentucky River, USA. The major research contributions are presented in three journal publications. The motivations underlying these publications include: 1) the development and testing of rigorous and novel analytical procedures for assessing if, and to what degree, the performance of residual and MI estimates is affected by bandwidth selection and boundary issues; 2) a clear explanation of the inaccurate performance of conventional PMI IVS under the influence of bandwidth selection and boundary issues; 3) the development of effective preliminary guidelines, based upon synthetic studies, for dealing with both bandwidth selection and boundary issues under different scenarios categorised by data normality and problem linearity; 4) the development of more robust and reliable PMI IVS software for realistic environmental and water resource problems.
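The boundary issue can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden example, not one of the thesis's tests: it draws non-negative, skewed samples from an exponential distribution, measures how much probability mass a symmetrical Gaussian kernel places below the physical bound of zero, and then applies the reflection method, which is one common boundary correction and not necessarily one of the options evaluated in the thesis. The function names are hypothetical.

```python
import numpy as np

def plain_kde(samples, grid, h):
    """Gaussian kernel density estimate of 1-D samples, evaluated on grid."""
    u = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

def reflected_kde(samples, grid, h, lower=0.0):
    """Reflection-corrected estimate for data bounded below at `lower`."""
    mirrored = 2.0 * lower - grid
    estimate = plain_kde(samples, grid, h) + plain_kde(samples, mirrored, h)
    return np.where(grid >= lower, estimate, 0.0)

def integrate(values, grid):
    """Trapezoidal rule on a 1-D grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

rng = np.random.default_rng(1)
samples = rng.exponential(scale=1.0, size=500)  # non-negative, skewed synthetic data

# One-dimensional Gaussian reference rule, scaled by the sample standard deviation.
h = (4.0 / 3.0) ** 0.2 * samples.std() * len(samples) ** (-0.2)

below = np.linspace(-5.0 * h, 0.0, 400)
above = np.linspace(0.0, samples.max() + 5.0 * h, 4000)

leaked_mass = integrate(plain_kde(samples, below, h), below)
corrected_mass = integrate(reflected_kde(samples, above, h), above)

print(f"mass below zero (plain KDE): {leaked_mass:.3f}")
print(f"mass on [0, inf) after reflection: {corrected_mass:.3f}")
```

The plain estimate assigns a non-trivial share of probability to impossible negative values, so the density near the bound is underestimated. The reflected estimate folds that mass back onto the valid range, so it integrates to approximately one there.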
Overall, the research outcomes suggest that the performance of PMI IVS is significantly influenced by bandwidth selection and boundary issues and can be effectively improved by following the proposed empirical guidelines, although the findings of this work could be tested more broadly, including for data sets with a wider range of attributes, such as different degrees of noise, collinearity and interdependency, as well as incomplete information.
Advisor: Maier, Holger R.
Zecchin, Aaron Carlo
Dissertation Note: Thesis (Ph.D.) (Research by Publication) -- University of Adelaide, School of Civil, Environmental and Mining Engineering, 2015.
Keywords: PMI IVS
GRNNs
ANNs
kernel estimation
bandwidth estimation
boundary issues
water resources and environmental modelling
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections: Research Theses

Files in This Item:
File | Description | Size | Format
01front.pdf | | 447.34 kB | Adobe PDF
02whole.pdf | | 19.69 MB | Adobe PDF
Permissions (Restricted Access) | Library staff access only | 487.27 kB | Adobe PDF
Restricted (Restricted Access) | Library staff access only | 19.86 MB | Adobe PDF

