Cross-validation for supervised learning with tuning parameters
| dc.contributor.author | Winderbaum, L. | |
| dc.contributor.author | Koch, I. | |
| dc.date.issued | 2022 | |
| dc.description.abstract | Recent advances in machine learning and data science have led to widespread adoption of complex predictive modelling. Increasing awareness of the ‘reproducibility crisis’ has led to calls for improved transparency and accountability in scientific reporting. One important aspect of veridical data science is the robust estimation of prediction error. Availability of computational resources has led to cross-validation (CV) as a main tool for such estimation. We consider CV estimation in supervised learning for high-dimensional data, and focus on linear regression and discriminant analysis approaches based on variable selection with direct dimension reduction as well as lasso-type sparsity criteria. We highlight how the same description of a method could in fact apply to any one of several different cross-validation implementations. We outline key principles underpinning good cross-validation practice, several ‘pitfall’ implementations which subtly violate these principles in different ways as well as a more complex and computationally intensive implementation which does not. We demonstrate the differences in the estimated error resulting from these different implementations with real data relating to endometrial cancer, in the context of high-stakes decision making where accurate and robust estimation of prediction error is critical. We use simulated data to illustrate how these different implementations result in estimators for prediction error with very different properties and relationships to the true prediction error. We call for increased detail in method-reporting, present principles for good practice in the implementation of cross-validation, and make recommendations to guide cross-validation implementation. | |
| dc.identifier.citation | Journal of Statistics and Computer Science, 2022; 1(1):55-76 | |
| dc.identifier.doi | 10.47509/JSCS.2022.v01i01.05 | |
| dc.identifier.issn | 2583-5068 | |
| dc.identifier.uri | https://hdl.handle.net/11541.2/35775 | |
| dc.language.iso | en | |
| dc.publisher | ARF India | |
| dc.rights | Copyright 2022 ARF India. ARF India provides online open access to all its journals. (https://www.arfjournals.com/ethics-policy) | |
| dc.source.uri | https://www.arfjournals.com/jscs/issue/163 | |
| dc.subject | cross-validation | |
| dc.subject | prediction | |
| dc.subject | proteomics | |
| dc.subject | reproducibility | |
| dc.title | Cross-validation for supervised learning with tuning parameters | |
| dc.type | Journal article | |
| pubs.publication-status | Published | |
| ror.fileinfo | 12275806980001831 13275847190001831 Open Access Published Version | |
| ror.mmsid | 9916794923801831 |
Files
Original bundle
- Name: 9916794923801831_12275806980001831_5_Lyron Winderbaum_Final-new.pdf
- Size: 823 KB
- Format: Adobe Portable Document Format
- Description: Published version