Modelling wine astringency from its chemical composition using machine learning algorithms

Aims: The present work aims to predict sensory astringency from wine chemical composition using machine learning algorithms. Methods and results: Moristel grapes from different vineblocks and at different stages of ripening were collected. Eleven different wines were produced in 75 L tanks in triplicate, and their sensory properties were then described by a trained panel of participants using the rate-all-that-apply method. The polyphenolic composition of the wines was characterised by measuring the concentration and activity of tannins using UHPLC-UV/VIS, and the mean degree of polymerisation (mDP) and composition of tannins using thiolysis followed by UHPLC-MS. Conventional oenological parameters were analysed using FTIR and UV-Vis. Machine learning was applied to build models for predicting a wine's astringency from its chemical composition. The best model was obtained using the support vector regressor (radial kernel) algorithm, which presented a root-mean-square error (RMSE) of 0.190. Conclusions: The main variables of the astringency model were the % of procyanidins constituting tannins and ethanol content, followed by eight other variables related to tannin structure and acidity. Significance of the study: These results increase the knowledge of chemical variables related to the perception of wine astringency and provide tools to control and optimise grape and wine production stages to modulate astringency and maximise the quality and consumer appeal of wines.


INTRODUCTION
The consumption experience, and thus wine appreciation, is the result of interactions between the consumer and the product's properties (Charters and Pettigrew, 2007; Prescott et al., 2002). Product-related factors refer to both intrinsic and extrinsic categories (Jover et al., 2004). Intrinsic cues are related to flavour, while extrinsic cues refer to properties that are not physically part of the wine, such as bottle weight, bottling place, type of wine or appellation. While consumers rely on both types of cues when selecting a product (D'Alessandro and Pecotich, 2013), there is a wide body of work focused on understanding the impact of extrinsic cues on wine appreciation. However, much less is known about the intrinsic cues, i.e. flavour. A product's flavour is the result of the interaction between sensory modalities including colour, aroma, taste and mouthfeel. In complex systems, the formation of mouthfeel is the least understood of these. This is especially true for astringency perception, which is reported to be mainly driven by alcohol content, polyphenolic compounds, their interaction with oral components (i.e. saliva, mucosa and oral receptors) and brain processing (Canon et al., 2018). There is a lack of consensus in the scientific literature when defining the perceptual phenomenon and the mechanisms driving astringency; however, the interactions between phenolics and proteins seem to be the most important driver (García-Estévez et al., 2018). It is important to note that there is great interindividual variability in astringency perception, which has been attributed to differences in saliva composition and buccal microbiota among consumers (Lamy, 2018). This variability in the salivary proteome is related to differences in sensory sensitivity to astringency (Lamy et al., 2017) and has been shown to influence the acceptability of phenol-rich foods (Dinnella et al., 2011; Masi et al., 2015).
Tannins have been reported as important drivers of wine astringency, which is mainly understood as dryness perception. These molecules are constituted by catechin or epicatechin (procyanidins), epigallocatechin (prodelphinidins) or epicatechin gallate (gallocatechins) units linked through C4-C8 interflavanoid bonds. Differences in constitutive molecules, concentrations and mean degree of polymerisation (mDP) (Ma et al., 2014) have been reported to have an impact on wine astringency.
Recently, tannin activity, measured as the enthalpy of interaction between polyphenols and a hydrophobic surface, has been hypothesised to correlate with wine astringency (Revelette et al., 2014; Watrelot et al., 2016). However, the correlation of sensory astringency with the different chemical variables that characterise phenolic compounds is far from clear. This difficulty in establishing consistent relationships between phenolic compounds and astringency has been attributed to: 1) a lack of enough chemical variability among the samples studied to induce significant sensory differences; 2) a lack of analytically relevant variables among those measured in the samples; or 3) the presence of cross-modal sensory interactions between aroma or taste properties and astringency, and thus the capacity of other sensory modalities to modulate astringent perception (Watrelot et al., 2016). Besides these limitations, in the present work it is hypothesised that relationships between phenolic chemical properties and astringency do not necessarily have to be linear, as most studies have tried to show.
In this context, the present work aimed first to generate wines with enough phenolic variability to induce astringency differences and then to model astringency by applying different machine learning algorithms.
The first hypothesis was that grapes from the same variety (Moristel in this case), harvested at different maturity levels and processed with a similar winemaking strategy would yield wines with maximal variability in phenolic composition and most probably astringency, and with minimal aroma variability (as observed among wines from different varieties and winemaking processes). This would reduce the presence of cross-modal interactions. The second hypothesis was that sensory-chemical relationships between phenolics and astringency do not necessarily have to be linear.

Samples and winemaking
Eleven different wines were produced in 2017 with Moristel red grapes (Vitis vinifera L.) at the Pirineos winery (Barbastro, Spain). Four different vineyard blocks were selected based on historical data related to maturity evolution as measured by Dyostem® (Vivelys, France). According to commercial information, this tool monitors sugar loading and changes in the colour of the fruit to determine the maturity level of grapes (i.e. polyphenolic and technological maturities related to sugar and acidity levels) and optimal harvest date. Based on this approach, one block did not experience any significant maturity evolution in a four-week period, and thus it was harvested only once. In contrast, the other three blocks were harvested at three or four different points, each separated by one or two weeks. According to the commercial system, the second point of maturity was the optimal point to harvest; therefore the fruit was harvested one week before and one and/or two weeks after this optimal point, to ensure grapes with different maturity levels and thus, a priori, maximal variability in chemical composition and most probably in astringency.
Maria P. Sáenz-Navajas et al.
Grapes were manually harvested from the Somontano region (Huesca, Spain). For each of the 11 selected vineblocks and/or maturity points, 150 kg of fruit was harvested and processed with an automatic crusher/destemmer. Each lot was divided into three stainless steel tanks (75 L capacity); sulphur dioxide was adjusted to 50 mg L⁻¹ and the fruit was further inoculated with commercial yeasts (Lalvin ICV D254, Lallemand) at 10⁶ cells mL⁻¹. Alcoholic fermentations (FOH) took place on skins for an average of 10 days. Once FOH was finished, lactic bacteria (Lalvin VP41) were inoculated at a rate of 9 mg L⁻¹. Wines were bottled around 3 months after FOH (free SO₂ adjusted to 30 mg L⁻¹). Glass bottles (750 mL capacity) were sealed with natural cork closures.

Conventional oenological analysis
The total polyphenol index (TPI) was estimated as absorbance at 280 nm (Ribéreau-Gayon, 1970) and colour intensity (CI) as the sum of absorbance at 420, 520 and 620 nm (Glories, 1984). For TPI determination, the absorbance at 280 nm of samples diluted 1:100 in deionised water was measured in 1-cm quartz cuvettes. For CI, absorbance of undiluted samples was measured in 2-mm crystal cuvettes. Reducing sugars, ethanol content, pH, malic and lactic acid, as well as titratable and volatile acidities, were analysed by Fourier-transform infrared spectrometry (FTIR) with a WineScan™ FT 120 (FOSS®, Barcelona, Spain), which was previously calibrated with the official OIV methods.

Analysis of anthocyanin-derived pigments
Determination of monomeric pigments (MP), small polymeric pigments (SPP) and large polymeric pigments (LPP) in wines and fractions was carried out as described elsewhere (Harbertson et al., 2003). MP were the group of compounds bleachable with bisulphite, while SPP and LPP were resistant to bisulphite bleaching. SPP did not precipitate with ovalbumin, in contrast with LPP, which did. Levels of MP, SPP and LPP were expressed as absorbance at 520 nm.

Characterisation of tannins
Acid-catalysed degradation in the presence of toluene-α-thiol was performed according to the method described by Labarde et al. (1999) but with some modifications as described by Gonzalo-Diago et al. (2013). Quantification was performed in the negative mode from the extracted ion chromatogram (EIC) for flavan-3-ols and in the positive mode for malvidin-3-O-glucoside. The areas under the peaks of malvidin-3-O-glucoside and flavan-3-ol monomers (terminal units) before and after thiolysis, as well as of the toluene-α-thiol adducts (extension units) released from the depolymerisation reaction, were integrated.
Calibration curves were established with malvidin-3-O-glucoside, (+)-catechin, (−)-epicatechin, (−)-epicatechin-3-O-gallate and (−)-epigallocatechin. In the absence of standards for the thiol derivatives, and considering that the thiolytic derivatives have been shown to have response factors similar to those of the corresponding monomeric units, their concentrations were calculated from the respective monomer calibration curves. The mean degree of polymerisation (mDP) was calculated as the ratio of total units (extension + terminal) to terminal units (calculated as the difference before and after thiolysis). The % of tannins linked to malvidin-3-O-glucoside (%T-M) was calculated as the molar ratio of malvidin-3-O-glucoside linked to tannins (calculated as the difference before and after thiolysis) to the sum of the total units of terminal malvidin-3-O-glucoside and extension + terminal units of (+)-catechin, (−)-epicatechin, (−)-epicatechin-3-O-gallate and (−)-epigallocatechin (i.e. total units of tannins). The % of procyanidins (%PC) was calculated as the ratio of total units (extension and terminal) of catechin and epicatechin to total units of tannins. The % of prodelphinidins (%PD) and galloylated (%G) units was calculated as the ratio of total units of PD and G to the total units of tannins, respectively.
The % of malvidin-3-O-glucoside linked to tannins (%M-T) was calculated as the molar ratio of the malvidin linked to flavanols (difference of malvidin before and after thiolysis) to total malvidin (before and after thiolysis).
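The ratios defined above can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the molar amounts are hypothetical and are not taken from the study; as a simplifying assumption, the total units of tannins are taken here as flavan-3-ol units only, ignoring the malvidin terminal units.

```python
# Hypothetical sketch of the thiolysis-derived indices (mDP, %PC, %PD, %G).
# Assumption: "total units of tannins" = flavan-3-ol extension + terminal units.
FLAVANOLS = ("catechin", "epicatechin", "epigallocatechin", "epicatechin_gallate")

def tannin_metrics(extension, terminal):
    """extension / terminal: dicts mapping subunit name -> molar amount."""
    total = {u: extension.get(u, 0.0) + terminal.get(u, 0.0) for u in FLAVANOLS}
    total_units = sum(total.values())
    terminal_units = sum(terminal.get(u, 0.0) for u in FLAVANOLS)
    return {
        # mean degree of polymerisation: total units / terminal units
        "mDP": total_units / terminal_units,
        # procyanidins: catechin + epicatechin units
        "%PC": 100 * (total["catechin"] + total["epicatechin"]) / total_units,
        # prodelphinidins: epigallocatechin units
        "%PD": 100 * total["epigallocatechin"] / total_units,
        # galloylated units: epicatechin gallate
        "%G": 100 * total["epicatechin_gallate"] / total_units,
    }

# Made-up molar amounts, for illustration only
extension = {"catechin": 1.0, "epicatechin": 6.0,
             "epigallocatechin": 2.0, "epicatechin_gallate": 0.5}
terminal = {"catechin": 1.5, "epicatechin": 0.5,
            "epigallocatechin": 0.0, "epicatechin_gallate": 0.5}
metrics = tannin_metrics(extension, terminal)
```

With these made-up amounts there are 12 total units and 2.5 terminal units, so the sketch returns an mDP of 4.8 and a %PC of 75.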
Concentration and activity of tannins were estimated using HPLC-UV-Vis following the method proposed by Revelette et al. (2014). Tannin activity is related to the thermodynamics of interaction between tannins and a hydrophobic surface (polystyrene divinylbenzene HPLC column) as discussed elsewhere (Barak and Kennedy, 2013; Revelette et al., 2014).

Participants and procedure
The 33 wines produced (11 different wines in triplicate) were sensory characterised in February and March 2018 by 17 participants at the Instituto de Ciencias de la Vid y del Vino (ICVV) and Universidad de La Rioja (Spain). The participants were mainly oenology students and oenologists (11 women and five men, age range 22-34 years, average age 28) recruited on the basis of interest and availability and were not paid for their participation. They attended a total of 13 sessions spread over four weeks, comprising nine training sessions (90 min each and taking place at 12 p.m.) and four sessions to describe the wines studied (eight or nine samples per session). The participants worked in two subgroups and followed the same guidelines. The first session was devoted to generating aroma terms that differed among samples. During the following training sessions, reference standards (prepared at Universidad de Zaragoza) were presented for the 12 selected aroma terms and the six taste and mouthfeel terms. For in-mouth terms, solutions were prepared containing different concentrations of table sugar (0-7 g L⁻¹) for sweetness, tartaric acid (0-3 g L⁻¹) for acidity, quinine sulphate (0-40 mg L⁻¹) for bitterness, potassium aluminium sulphate (0-5 g L⁻¹) for astringency, absolute alcohol (0-15% v/v) for alcoholic sensation and carboxymethylcellulose (0-1.5 g L⁻¹) for viscosity. During a typical training session, the participants were presented with references illustrating the different aroma, taste and mouthfeel terms. Next, between two and four wines were first individually described and then the ratings were discussed until the participants achieved consensus.
The wines were described in the last four sessions: participants were asked to taste the wines and rate the intensity on a 7-point scale (1 = very low; 7 = very high) using only those terms (out of 18 available terms) that applied to the sample, according to rate-all-that-apply (RATA) methodology (Ares et al., 2014). Terms that did not apply to the wine were allocated a value of 0 when collecting data. To avoid bias due to order of presentation, terms in the list appeared in different and randomised orders for each assessor. The use of a sip (rinsing solutions: water and 1 g L⁻¹ pectin solution) and spit protocol between each sample was imposed as described elsewhere (Colonna et al., 2004). Participants tasted samples in a sequential monadic manner: 20-mL samples were served in dark wine glasses labelled with random three-digit codes and covered with plastic Petri dishes according to a random arrangement that was different for each participant. Samples were served at room temperature and evaluated in a ventilated and air-conditioned tasting room at around 20 °C.

Data analysis
Only the data for astringency are reported here.
The discriminability potential of chemical variables among wines was calculated as the ratio between the maximum and minimum levels (max/min) of each variable.
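As a minimal sketch of this index, the following Python function computes the max/min ratio for one variable. The function name and the example values are hypothetical; how zero or non-detected levels were handled in the study is not stated, so the sketch simply rejects them.

```python
def max_min_ratio(values):
    """Discriminability potential of one chemical variable: max level / min level."""
    lowest, highest = min(values), max(values)
    if lowest <= 0:
        # Assumption: the ratio is only meaningful for strictly positive levels
        raise ValueError("max/min ratio requires strictly positive values")
    return highest / lowest

# Made-up levels of one variable across wines, for illustration only
example_ratio = max_min_ratio([0.5, 1.0, 2.0])  # 2.0 / 0.5 = 4.0
```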
Two-way ANOVAs (participants as random and wines as fixed factors) were calculated for the term astringency. Next, a pair-wise comparison test (Fisher test) was applied (5% risk) using XLSTAT (2015).
The first step in modelling was to search for the best simple models; the best learning algorithms were then merged to obtain better predictive performance, a technique known as model ensembling. Machine learning algorithms were boosted by SDG using the DataRobot Platform. Therefore, dozens of independent challenger models were developed and validated by cross-validation. Model accuracy was evaluated by root-mean-square error (RMSE), i.e. the differences between the astringency scores predicted by a model and the scores observed. A robust k-fold cross-validation framework was employed to test the out-of-sample stability of each model. In addition to the cross-validation partitioning, a holdout sample was used to further test out-of-sample model performance and ensure that overfitting did not occur. Therefore, 18% of the training data was set aside as a holdout dataset. This dataset was used to verify that the final model performs well on data that has not been examined throughout the training process. For further model validation, the remainder of the data was divided into five cross-validation partitions (selected by random sampling). For the best models, five-fold cross-validation training and scoring was completed. Then, the mean score of the complete model cross-validation was calculated across all folds.
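The validation scheme described above (holdout set, five random folds, RMSE averaged across folds) can be sketched in plain Python as follows. This is an illustration only and not the DataRobot implementation: the function names, the random seed and the baseline predictor (the mean of the training folds) are assumptions, and the sketch treats the 33 wine samples as the modelling unit.

```python
import math
import random

def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted scores."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def holdout_and_folds(n_samples, holdout_fraction=0.18, k=5, seed=0):
    """Set aside a random holdout, then split the remainder into k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    holdout, rest = indices[:n_holdout], indices[n_holdout:]
    folds = [rest[i::k] for i in range(k)]
    return holdout, folds

def cv_rmse_mean_baseline(y, folds):
    """Mean RMSE across folds for a baseline that predicts the training-fold mean."""
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        train_mean = sum(y[j] for j in train_idx) / len(train_idx)
        scores.append(rmse([y[j] for j in test_idx], [train_mean] * len(test_idx)))
    return sum(scores) / len(scores)

# 33 samples: 6 go to the holdout, 27 are spread over five folds
holdout, folds = holdout_and_folds(33)
```

Any candidate model can replace the mean baseline inside the loop; the holdout indices are scored only once, after model selection.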
The best model for astringency consisted of applying a smooth ridit transformation followed by the calculation of the support vector regressor (SVR; radial kernel). The ridit (or score for a variable) transformation can be interpreted as an adjusted percentile score and extends the Bross ridit method (Bross, 1958) by applying the method to numerical values and normalising the score such that the mean calculated for the reference population will always be 0 and the score will be in the interval [-1, 1]. The SVR is a generalised version of support vector machine classifiers that uses a special loss function to convert a regression problem into a classification problem. Support vector machines (SVMs) are extremely robust machine learning models and are very efficient in high-dimensional spaces. In addition, a "kernel" function was used, which allows for a non-linear transformation of the data before fitting the SVM. Kernel functions can be a very useful way to transform a non-linear problem into a linear domain.
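To make the two ingredients more concrete, the sketch below shows a signed ridit score and a radial (RBF) kernel in plain Python. Both are assumptions made for illustration: the ridit shown is a simple rank-based version that has mean 0 over the reference population and lies in [-1, 1], and it may differ from the "smooth" variant used by the platform; the kernel width `gamma` is a hypothetical parameter.

```python
import math

def ridit_scores(reference, values=None):
    """Signed ridit: (share of reference below v) - (share of reference above v)."""
    values = reference if values is None else values
    n = len(reference)
    scores = []
    for v in values:
        below = sum(1 for r in reference if r < v)
        above = sum(1 for r in reference if r > v)
        scores.append((below - above) / n)
    return scores

def rbf_kernel(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Made-up ethanol contents (% v/v), for illustration only
ethanol = [12.5, 13.0, 14.2, 15.1]
ethanol_ridit = ridit_scores(ethanol)
```

The kernel equals 1 for identical samples and decays towards 0 as samples move apart, which is what lets a linear method in the kernel space follow a non-linear relationship in the original variables.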
The permutated impact of variables on the models is calculated by observing the effect on model scores when altering the input data of a given variable. The algorithm employed normalises the scores so that the values of chemical variables included in the model are

RESULTS AND DISCUSSION
The first objective of the present work was to produce Moristel wines with important chemical variabilities, focusing on parameters typically known to be related to astringency perception. Results show that for the 20 chemical variables analysed, highly significant differences (P < 0.0001 in all cases) were observed among the 33 wines. Table 1 shows ranges and median values of the parameters measured. These data show important chemical variability among wines, with particular importance placed on the differences, and thus the discriminability potential (measured as the max/min ratio), among wines for lactic acid (max/min = 70.8), % of galloylation of tannins (max/min = 55), mean degree of polymerisation of tannins (max/min = 41) and % of prodelphinidins constituting tannins (gallocatechins and epigallocatechins) (max/min = 11).
It is important to note that bitterness and astringency do not present significant linear correlations (r = 0.40, P > 0.1), which confirms that the participants were not confused and were able to differentiate both sensations (Lea and Arnold, 1978). Astringency scores ranged between 0 and 4 (6 being the maximum possible score) and significant differences among wines (F = 15.13; P < 0.0001) were observed. These data confirm our first hypothesis related to the strategy followed (selection of grapes from different vineblocks at different maturity points) to generate wines with different chemical compositions, inducing sensory differences in astringency. Table 1 shows that astringency scores present significant (P < 0.05) positive linear correlations with six of the 20 chemical variables studied: tannin activity, TPI, ethanol content, % of prodelphinidins in tannins, colour intensity and monomeric pigments.
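The linear correlations reported above are Pearson coefficients. As a minimal sketch, with made-up numbers rather than data from the study, the function below computes r and illustrates why a purely linear index can miss a relationship that changes direction over the range of a variable.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: a perfectly linear and a U-shaped relationship
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_linear = pearson_r(x, [2 * v + 1 for v in x])   # r = 1
r_u_shaped = pearson_r(x, [v ** 2 for v in x])    # r = 0 despite a clear dependence
```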
Our second hypothesis was that sensory and chemical composition do not necessarily follow a linear correlation (i.e. the higher/lower the concentration, the higher the astringency). Thus, astringency scores were modelled from the 20 chemical variables using machine learning algorithms. A highly satisfactory model was obtained. Residual error (measured through root-mean-square error, RMSE, by full cross-validation) was 0.19. The best algorithm was SVR (radial kernel). Figure 1 shows the lift chart, which confirms model performance and thus its capability to predict astringency. Model performance was evaluated by calculating possible predictions partitioned into subsegments, deciles or bins. For each bin, average predicted astringency scores (blue line) were compared to average actual values (orange line). Predicted and actual scores were very closely projected and both lines increased consistently, which indicates satisfactory model performance and accuracy.
These results confirm the importance of tannin structure, ethanol content and acids (measured as titratable acids or total acidity and volatile acidity) on the modulation of astringency perception, which is not surprising given the many publications that mention these variables as important drivers of this sensory perception in wine (Ma et al., 2014; Sáenz-Navajas et al., 2012; Soares et al., 2017; Watrelot et al., 2016). However, most of the existing literature tries to establish linear correlations between astringency and chemical composition (i.e. higher/lower levels of a component generate higher astringency), which could be the main source of contradictory results reported when establishing sensory-chemical relationships. In the present work, different tendencies in astringency perception can be observed depending on the levels of a given chemical variable (Figure 3 and supplementary material).
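The lift chart described for Figure 1 can be sketched as follows: samples are sorted by predicted score, split into bins, and the average predicted and average actual scores are compared per bin. The function name and the example scores are hypothetical, and the exact binning used by the platform is an assumption.

```python
def lift_chart(actual, predicted, n_bins=10):
    """Return (mean predicted, mean actual) per bin, bins ordered by predicted score."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        idx = order[start:stop]
        if not idx:
            continue  # skip empty bins when there are fewer samples than bins
        mean_predicted = sum(predicted[i] for i in idx) / len(idx)
        mean_actual = sum(actual[i] for i in idx) / len(idx)
        bins.append((mean_predicted, mean_actual))
    return bins

# Made-up scores, for illustration only
chart = lift_chart(actual=[1.0, 2.0, 3.0, 4.0],
                   predicted=[1.1, 1.9, 3.2, 3.8], n_bins=2)
```

A well-calibrated model gives pairs that are close to each other in every bin and that rise together from the first bin to the last.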
Figure 3a shows the partial dependence plot of astringency and % procyanidins (%PC) in tannin structures. Three main tendencies could be identified depending on the %PC: 1) a steep negative linear relationship for %PC < 68%; 2) a moderate positive linear relationship for the 68-76% range; and 3) no change in astringency associated with different %PC for %PC > 76%. Interestingly, the % of total (extension + terminal) catechins and epicatechins (%PC) presents a significant correlation with the % of epicatechin units (r = 0.80; P < 0.001). Thus, this modulation of astringency with %PC could be attributed to changes in the stereochemistry of tannins related to the % of epicatechins in procyanidins. At low levels of epicatechin subunits, astringency decreases with increases in terminal and extension epicatechins in tannins. However, at intermediate levels (68-76%) of epicatechin units in tannins, increasing epicatechins generate higher astringency, while at higher levels no effect is observed. Results observed for intermediate levels (68-76%) are in agreement with results reported in the literature (Quijada-Morín et al., 2012), in which higher astringency is observed for tannins with higher proportions of epicatechin than catechin subunits. However, in the present non-linear model, two further trends could be identified depending on the % levels of PC (one with a negative effect for low levels of PC and the other with no effect for higher levels). These results could be explained in terms of structural/conformational differences of tannins with different structural properties that have different site-specific bindings with tannin (De Freitas and Mateus, 2001; Thorngate and Noble, 1995).
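A partial dependence curve such as the one discussed above is obtained by fixing one variable at each value of a grid for all samples and averaging the model predictions. The sketch below shows the computation in plain Python; the toy model, its coefficients and the sample values are arbitrary placeholders and do not reproduce the fitted SVR or the data of the study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve

def toy_model(row):
    """Arbitrary non-linear stand-in for a fitted model (illustration only)."""
    return 0.05 * abs(row["pc"] - 68.0) + 0.1 * row["ethanol"]

# Made-up samples: %PC and ethanol content (% v/v)
rows = [{"pc": 70.0, "ethanol": 13.0}, {"pc": 74.0, "ethanol": 15.0}]
curve = partial_dependence(toy_model, rows, feature="pc", grid=[64.0, 68.0, 72.0])
```

Because the other variables keep their observed values, the curve shows the average effect of the chosen variable on the prediction, which is how changes of slope over different ranges become visible.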
Further research on this topic should be carried out to find a plausible explanation.
The positive relationship observed between ethanol content and astringency is well in accordance with data reported for commercial wines containing ethanol levels of 13-17% (v/v) (Sáenz-Navajas et al., 2010; Sáenz-Navajas et al., 2012; Watrelot et al., 2016) but contradicts studies carried out with model wines at typical wine ethanol levels of 11-15% (Fontoin et al., 2008; Vidal et al., 2004). These studies report a decrease in astringency with ethanol content, which has been attributed to a decrease in the interaction power between tannins and proteins from 10% of ethanol (hydrophobic + hydrogen-bond interactions) to 15% (hydrogen-bond interactions) (McRae et al., 2015). Thus, the positive correlation between ethanol content and astringency observed in the present work could be attributed to an indirect relationship with phenolic content. Grapes harvested at earlier stages present lower levels of extractable polyphenols but also lower sugar content, yielding wines with lower ethanol levels and polyphenolic concentration and resulting in lower astringency. However, it cannot be ruled out that ethanol can induce astringency-related sensations by mechanisms other than polyphenol-protein interactions. This would be supported by the considerable number of papers that have established positive relationships between ethanol content and astringency perception (Sáenz-Navajas et al., 2010; Sáenz-Navajas et al., 2012; Watrelot et al., 2016). Additional investigation is needed to understand the relationship between ethanol and astringent sensations.
For the remaining chemical variables included in the model, three different tendencies were globally identified (see supplementary material).
The first trend is observed for the mean degree of polymerisation. For low mDP values (up to approximately 1.4) astringency increases with mDP, while for higher values astringency decreases. This result is well in accordance with the positive linear relationships observed between astringency and DP with low molecular weight tannins by Peleg et al. (1999). Interestingly, Chira et al. (2009) also found significant (P = 0.04) positive correlations, but only for skin tannins in one of the two years studied. This sample set presented an average mDP of 21 (range 4-49), which is far out of the range for the wines studied here. This lack of significance for the rest of the sample sets (year 2016 and seed tannins of 2006 and 2007) could be attributed to the presence of different relationships depending on the level of mDP, as observed in the present work. The effect of the size of tannins (measured through the mDP) on astringency could be explained in terms of tannin hydrophobicity. Thus, even if higher tannin polymerisation can bring more hydrophobic parts, and thus higher astringency (due to higher tannin-protein interactions), this relationship is thought not to be linear and is attributed to conformational arrangements and aggregation processes (Ma et al., 2014).
The second trend is observed for total acidity, volatile acidity, % of prodelphinidins (%PD), total polyphenol index (TPI) and tannin activity (measured as the enthalpy of interaction of tannins with a hydrophobic surface). These present positive linear relationships with astringency, with this relationship more pronounced at higher values of the corresponding chemical variable and astringency perception. It has been shown that the effect of acidity on astringency is attributed to changes in pH. Thus, for similar pH values, changes in acidity do not have significant effects on astringency (Fontoin et al., 2008). This increase of astringency is attributed to the presence of more phenolate forms and an augmentation of charged molecules, susceptible to participating in protein binding (Siebert and Euzen, 2008). Concerning the positive relationship observed between astringency and %PD, it is interesting to note that this is more important at higher levels (range 1.8-3.2) of astringency. This result is in apparent contradiction with other studies, which have shown in simple model solutions that procyanidins present a faster and stronger interaction with salivary proteins than prodelphinidins (Ferrer-Gallego et al., 2015). At present it is difficult to explain such a relationship, because it is likely that astringency differences related to polyphenolic structures are the result of conformational differences among tannins that cannot yet be measured in a mixture as complex as wine. In this regard, the measure of tannin activity by HPLC seems to be a promising index. Tannin activity is a parameter that measures the enthalpy of interaction of tannins with a hydrophobic surface. It appeared as an interesting measure of tannin affinity to proteins and thus of wine astringency (Revelette et al., 2014).
OENO One 2019, 3, 471-486
However, until now no direct relationship with tannins could be established, which was attributed to the presence of strong interactions (with polysaccharides or aroma perception) appearing in wines with very different chemical and sensory spaces (i.e. different varieties, winemaking processes, origins, etc.) (Watrelot et al., 2016). Thus, working with the same grape variety from a similar origin and processed with the same winemaking protocol could have helped establish such interesting linear relationships between tannin activity and sensory astringency (i.e. drying sensation). It is interesting to note that this is the first time a relationship of this chemical variable with sensory perception has been established.
The third trend is related to the % of galloylated tannins (%G) and tannins linked to malvidin (%T-M), which show a similar relationship with astringency as %PC. Thus, two segments are observed: one for low levels of the chemical variable with a negative linear relationship with astringency and a second for high values with a positive linear correlation. As explained above, the constitutive units of tannins as well as their polymerisation degree play an important role in tannin conformational structures, which drives tannin hydrophobicity, tannin-protein interactions and thus perceived astringency.

CONCLUSIONS
The present work has successfully modelled the perception of wine astringency from its chemical composition by applying machine learning approaches. This strategy has explained non-linear relationships by means of the SVR (radial kernel) algorithm, which showed a very low residual error between actual and predicted astringency scores. This and the fact that sensory perception is distinctly non-linear show the necessity of considering non-linear models to explain sensory percepts from chemical composition.
The main drivers of the astringency model were related to ethanol content (potentially elicited by a mechanism different from polyphenol-protein interactions), acidity (related to pH variations), as well as to effects of chemical variables linked to tannin structure, such as 1) the constitutive subunits of tannins (%PC, %PD, %G and %T-M), 2) tannin activity measured as the enthalpy of interaction with a hydrophobic surface, and 3) the size of tannins measured by the mean degree of polymerisation.
The results presented here increase understanding of astringency perception and provide wine producers with objective tools to help them control and optimise grape and wine production stages for further modulating astringency and thus maximising the quality and consumer appeal of their wines.