On how data are partitioned in model development and evaluation: Confronting the elephant in the room to enhance model generalization

Maier, H.R.; Zheng, F.; Gupta, H.; Chen, J.; Mai, J.; Savic, D.; Loritz, R.; Wu, W.; Guo, D.; Bennett, A.; Jakeman, A.; Razavi, S.; Zhao, J.

doi:10.1016/j.envsoft.2023.105779

dc.contributor.author	Maier, H.R.
dc.contributor.author	Zheng, F.
dc.contributor.author	Gupta, H.
dc.contributor.author	Chen, J.
dc.contributor.author	Mai, J.
dc.contributor.author	Savic, D.
dc.contributor.author	Loritz, R.
dc.contributor.author	Wu, W.
dc.contributor.author	Guo, D.
dc.contributor.author	Bennett, A.
dc.contributor.author	Jakeman, A.
dc.contributor.author	Razavi, S.
dc.contributor.author	Zhao, J.
dc.date.issued	2023
dc.description.abstract	Models play a pivotal role in advancing our understanding of Earth’s physical nature and environmental systems, aiding in their efficient planning and management. The accuracy and reliability of these models heavily rely on data, which are generally partitioned into subsets for model development and evaluation. Surprisingly, how this partitioning is done is often not justified, even though it determines what model we end up with, how we assess its performance and what decisions we make based on the resulting model outputs. In this study, we shed light on the paramount importance of meticulously considering data partitioning in the model development and evaluation process, and its significant impact on model generalization. We identify flaws in existing data-splitting approaches and propose a forward-looking strategy to effectively confront the “elephant in the room”, leading to improved model generalization capabilities.
dc.description.statementofresponsibility	Holger R. Maier, Feifei Zheng, Hoshin Gupta, Junyi Chen, Juliane Mai, Dragan Savic, Ralf Loritz, Wenyan Wu, Danlu Guo, Andrew Bennett, Anthony Jakeman, Saman Razavi, Jianshi Zhao
dc.identifier.citation	Environmental Modelling and Software, 2023; 167:105779-1-105779-8
dc.identifier.doi	10.1016/j.envsoft.2023.105779
dc.identifier.issn	1364-8152
dc.identifier.issn	1873-6726
dc.identifier.orcid	Maier, H.R. [0000-0002-0277-6887]
dc.identifier.orcid	Wu, W. [0000-0003-3907-1570]
dc.identifier.uri	https://hdl.handle.net/2440/139464
dc.language.iso	en
dc.publisher	Elsevier BV
dc.relation.grant	http://purl.org/au-research/grants/arc/DE210100117
dc.rights	© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
dc.source.uri	https://doi.org/10.1016/j.envsoft.2023.105779
dc.subject	Calibration
dc.subject	Data partitioning
dc.subject	Data splitting
dc.subject	Earth systems
dc.subject	Model development
dc.subject	Model evaluation
dc.subject	Uncertainty
dc.subject	Validation
dc.title	On how data are partitioned in model development and evaluation: Confronting the elephant in the room to enhance model generalization
dc.type	Journal article
pubs.publication-status	Published

Files

Original bundle

Name: hdl_139464.pdf
Size: 1.95 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Civil and Environmental Engineering publications