Abstract

In this paper we test and compare advanced predictive approaches for estimating wine age in the context of the production of a high-quality fortified wine, Madeira wine. We consider four different data sets: volatile composition, polyphenols, organic acids and UV–vis spectra. Each of these data sets contains chemical information of a different nature and presents a different data structure, namely a different dimensionality, level of collinearity and degree of sparsity. These differences may call for different modelling approaches in order to better exploit the information content of each data set, in particular its predictive potential for wine age. This is because different regression methods make different prior assumptions regarding the predictors, the response variable(s) and the data-generating mechanism, and these assumptions may or may not hold for the case study under analysis. In order to cover a wide range of modelling domains, we have included in this work methods belonging to four very distinct classes of approaches that cover most applications found in practice: linear regression with variable selection, penalized regression, latent variable regression and tree-based ensemble methods. We have also developed a rigorous comparison framework, based on a double Monte Carlo cross-validation scheme, to assess the relative performance of the various methods. Upon comparison, models built using the polyphenols and volatile composition data sets led to better wine age predictions, showing lower errors under testing conditions. Furthermore, the results obtained for the polyphenols data set suggest a sparser structure that can be further explored in order to reduce the number of measured variables.
In terms of regression methods, tree-based methods, and boosted regression trees in particular, presented the best results for the polyphenols, volatile composition and organic acids data sets, suggesting the possible presence of a nonlinear relationship between predictors and response. For the UV–vis data set, penalized regression methods (ridge regression, LASSO and elastic net) presented the best results, even though methods such as partial least squares (PLS) or principal component regression (PCR) are often the practitioners' preferred choice.
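
The abstract does not spell out the double Monte Carlo cross-validation scheme, so the following is only a minimal sketch of one common reading of it: an outer loop of random train/test splits estimates the test error, and an inner loop of random splits within each training set tunes the hyperparameter. Everything here is an assumption made for illustration: the function names (`fit_ridge`, `tune_lambda`, `double_mccv`), the 80/20 split fraction, the iteration counts, the candidate penalty values and the synthetic collinear data, which stand in for the real wine measurements. Only ordinary least squares and ridge regression are compared, not the full set of methods studied in the paper.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge fit on centred data; lam = 0 reduces to ordinary least squares."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    p = X.shape[1]
    # Solve the penalized problem as an augmented least-squares system.
    A = np.vstack([X - x_mean, np.sqrt(lam) * np.eye(p)])
    b = np.concatenate([y - y_mean, np.zeros(p)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef, y_mean - x_mean @ coef

def rmse(X, y, model):
    coef, intercept = model
    return float(np.sqrt(np.mean((y - (X @ coef + intercept)) ** 2)))

def random_split(n, held_out_fraction, rng):
    """One Monte Carlo split: random indices for fitting and for holding out."""
    idx = rng.permutation(n)
    n_out = max(1, int(round(held_out_fraction * n)))
    return idx[n_out:], idx[:n_out]

def tune_lambda(X, y, lambdas, n_inner, rng):
    """Inner loop: pick the penalty with the lowest summed validation RMSE."""
    scores = np.zeros(len(lambdas))
    for _ in range(n_inner):
        tr, va = random_split(len(y), 0.2, rng)
        for j, lam in enumerate(lambdas):
            scores[j] += rmse(X[va], y[va], fit_ridge(X[tr], y[tr], lam))
    return lambdas[int(np.argmin(scores))]

def double_mccv(X, y, methods, n_outer=20, n_inner=10, seed=0):
    """Outer loop: every method is scored on the same random test sets."""
    rng = np.random.default_rng(seed)
    errors = {name: [] for name in methods}
    for _ in range(n_outer):
        tr, te = random_split(len(y), 0.2, rng)
        for name, lambdas in methods.items():
            if len(lambdas) == 1:          # nothing to tune
                lam = lambdas[0]
            else:                          # tune on the training part only
                lam = tune_lambda(X[tr], y[tr], lambdas, n_inner, rng)
            model = fit_ridge(X[tr], y[tr], lam)
            errors[name].append(rmse(X[te], y[te], model))
    return {name: np.array(v) for name, v in errors.items()}

# Synthetic, strongly collinear predictors (illustrative only, not wine data).
data_rng = np.random.default_rng(1)
n, p = 60, 30
latent = data_rng.normal(size=(n, 3))
X = latent @ data_rng.normal(size=(3, p)) + 0.1 * data_rng.normal(size=(n, p))
y = latent[:, 0] + 0.3 * data_rng.normal(size=n)

methods = {"ols": [0.0], "ridge": [0.01, 0.1, 1.0, 10.0]}
errors = double_mccv(X, y, methods)
for name, e in errors.items():
    print(f"{name}: mean test RMSE = {e.mean():.3f} (sd {e.std():.3f})")
```

Because all methods are evaluated on the same outer test sets, the per-split errors are paired, which makes a relative assessment between methods more meaningful than comparing two independent averages. The test sets are never seen during tuning, which is the point of nesting the two loops.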
