Cross-validation for supervised learning with tuning parameters

dc.contributor.author: Winderbaum, L.
dc.contributor.author: Koch, I.
dc.date.issued: 2022
dc.description.abstract: Recent advances in machine learning and data science have led to widespread adoption of complex predictive modelling. Increasing awareness of the ‘reproducibility crisis’ has led to calls for improved transparency and accountability in scientific reporting. One important aspect of veridical data science is the robust estimation of prediction error. Availability of computational resources has led to cross-validation (CV) as a main tool for such estimation. We consider CV estimation in supervised learning for high-dimensional data, and focus on linear regression and discriminant analysis approaches based on variable selection with direct dimension reduction as well as lasso-type sparsity criteria. We highlight how the same description of a method could in fact apply to any one of several different cross-validation implementations. We outline key principles underpinning good cross-validation practice, several ‘pitfall’ implementations which subtly violate these principles in different ways as well as a more complex and computationally intensive implementation which does not. We demonstrate the differences in the estimated error resulting from these different implementations with real data relating to endometrial cancer, in the context of high-stakes decision making where accurate and robust estimation of prediction error is critical. We use simulated data to illustrate how these different implementations result in estimators for prediction error with very different properties and relationships to the true prediction error. We call for increased detail in method-reporting, present principles for good practice in the implementation of cross-validation, and make recommendations to guide cross-validation implementation.
dc.identifier.citation: Journal of Statistics and Computer Science, 2022; 1(1):55-76
dc.identifier.doi: 10.47509/JSCS.2022.v01i01.05
dc.identifier.issn: 2583-5068
dc.identifier.uri: https://hdl.handle.net/11541.2/35775
dc.language.iso: en
dc.publisher: ARF India
dc.rights: Copyright 2022 ARF India. ARF India provides online open access to all its journals. (https://www.arfjournals.com/ethics-policy)
dc.source.uri: https://www.arfjournals.com/jscs/issue/163
dc.subject: cross-validation
dc.subject: prediction
dc.subject: proteomics
dc.subject: reproducibility
dc.title: Cross-validation for supervised learning with tuning parameters
dc.type: Journal article
pubs.publication-status: Published
ror.fileinfo: 12275806980001831 13275847190001831 Open Access Published Version
ror.mmsid: 9916794923801831

Files

Original bundle
Name: 9916794923801831_12275806980001831_5_Lyron Winderbaum_Final-new.pdf
Size: 823 KB
Format: Adobe Portable Document Format
Description: Published version
