Multivariate characterisation of Italian monovarietal red wines using MIR spectroscopy

ABSTRACT
Aim: The aim of this study was to investigate the application of mid-infrared (MIR) spectroscopy combined with multivariate analysis as a rapid screening tool for discriminating among different Italian monovarietal red wines, based on the relationship between grape variety and wine composition, in particular phenolic compounds.
Methods and results: The MIR spectra (from 4000 to 700 cm⁻¹) of 110 monovarietal Italian red wines, vintage 2016, were collected and evaluated by selected multivariate data analyses, including principal component analysis (PCA), linear discriminant analysis (LDA), support vector machine (SVM), and soft independent modelling of class analogy (SIMCA). Samples were collected directly from companies across different regions of Italy and included 11 grape varieties: Sangiovese, Nebbiolo, Aglianico, Nerello Mascalese, Primitivo, Raboso, Cannonau, Teroldego, Sagrantino, Montepulciano and Corvina. PCA showed five wavenumbers that mainly contributed to PC1, including a closely neighbouring peak at 1043 cm⁻¹, which corresponds to the C–O stretch absorption bands that are important regions for glycerol, whereas ethanol peaks at around 1085 cm⁻¹. The band at 877 cm⁻¹ is related to the C–C stretching vibration of organic molecules, whereas the asymmetric stretching for C–O in the aromatic –OH group of polyphenols lies within the spectral region from 1050 to 1165 cm⁻¹. In particular, the (1175)–1100–1060 cm⁻¹ vibrational bands are combination bands involving C–O stretching and O–H deformation of phenolic rings. The 1166–1168 cm⁻¹ peak is attributable to in-plane bending deformations of C–H and C–O groups of polyphenols, for which polymerisation may cause a slight peak shift due to the formation of H-bridges. The best result was obtained with SVM, which achieved an overall correct classification of up to 72.2% for the training set and 44.4% for the validation set of wines. The Sangiovese wines (n=19) were split into two sub-groups (Sang-Romagna, n=12 and Sang-Tuscany, n=7) because the origin of this variety is disputed between Romagna and Tuscany. Although the classification of three grape varieties was problematic (Nerello Mascalese, Raboso and Primitivo), the remaining wines were for the most part correctly assigned to their actual classes.
Conclusions: MIR spectroscopy coupled with chemometrics represents an interesting approach for the classification of monovarietal Italian red wines, which is important in quality control and authenticity monitoring.
Significance and impact of the study: Authenticity is a major issue in winemaking in terms of quality evaluation and adulteration, in particular for origin certified/protected wines, for which the added marketing value is related to the link between grape variety and area of origin. This study is part of the D-wine project “The diversity of tannins in Italian red wines”.


INTRODUCTION
With over 300 grape varieties, Italy has one of the richest ampelographic heritages worldwide. Among these grapes, several varieties are extremely rich in tannins (Mattivi et al., 2002; Mattivi et al., 2009). From a marketing point of view, the ability to associate each wine with specific sensory attributes (Gambuti et al., 2013; García-Estévez et al., 2017) is often a tool for commercial success. Moreover, the assessment of wine authenticity is of great importance for consumers, producers and regulatory agencies to guarantee the wine labelling in terms of geographical origin and grape variety (Arvanitoyannis et al., 1999; Versari et al., 2014; Villano et al., 2017).
As wine is a complex mixture of organic and inorganic compounds whose composition is affected by several variables (e.g. soil, climate, year, grape variety, winemaking practices), from a chemical point of view the authenticity of wine should rely on those parameters that are not affected during production or are difficult to forge. Thus, UV, visible (Vis), near infrared (NIR) and mid-infrared (MIR) spectroscopies with multivariate data analysis are suitable tools to ascertain wine composition, including phenolic compounds (Pique et al., 2001; Fernández and Agosin, 2007; Bauer et al., 2008; Jensen et al., 2008; Laghi et al., 2011; Dambergs et al., 2012; Martelo-Vidal and Vázquez, 2014; Aleixandre-Tudo et al., 2015; Saad et al., 2016; Palade and Popa, 2018).
Several authors have focused on predicting wine authenticity using a combination of wine composition and multivariate data analysis. For instance, Edelmann et al. (2001) correctly discriminated more than 97% of Austrian red wines (n=38), including the cultivars Cabernet Sauvignon, Merlot, Pinot Noir, Blaufränkisch (Lemberger), St. Laurent and Zweigelt, using MIR with soft independent modelling of class analogy (SIMCA) of the phenolic extracts. Cozzolino et al. (2003) correctly classified 100% of Australian Riesling wines (n=144) and up to 96% of Chardonnay wines (n=125) using Vis-NIR in combination with discriminant partial least squares (DPLS) regression. Around 86% of the Sauvignon Blanc wines (n=64) from Australia and New Zealand were correctly classified using the MIR spectrum with full cross-validation (leave-one-out) and partial least squares discriminant analysis (PLS-DA) (Cozzolino et al., 2011). Similarly, MIR coupled with linear discriminant analysis (LDA) with cross-validation could discriminate around 88% of five Marsala wines with different ageing times (Condurso et al., 2018). Louw et al. (2009) obtained a 98.3% correct separation of South African wines (n=496), including Pinotage, Merlot, Cabernet Sauvignon, Shiraz, Chardonnay and Sauvignon Blanc, based on their MIR spectra and LDA. Similarly, Bevin et al. (2007) applied MIR for varietal classification of three red (n=119) and four white (n=72) Australian wines, and achieved a classification rate of more than 90% using LDA.
Furthermore, the combination of MIR spectroscopy and LDA made it possible to correctly classify more than 75% of the red and white wines sourced from organic (n=57) and non-organic (n=115) production systems from 13 growing regions in Australia (Cozzolino et al., 2009). Recently, Basalekou et al. (2016) correctly classified four wines (n=88) made from two white (Vilana and Dafni) and two red grape varieties (Kotsifali and Mandilari) based on their phenolic content and colour parameters using the 1800-900 cm⁻¹ MIR spectral region. Basalekou et al. (2017) later extended the dataset of wines (n=154) and obtained 82% correct classification using LDA and cross-validation.
As part of the D-wine project "The diversity of tannins in Italian red wines", this study aimed to investigate the application of MIR spectroscopy combined with multivariate analysis to provide a rapid screening tool for discriminating among different Italian monovarietal red wines.

MATERIALS AND METHODS

Samples
A total of 110 monovarietal red wines, vintage 2016, were collected directly from companies across different regions of Italy (Figure 1) and included 11 grape varieties: Sangiovese (n=19; seven from Tuscany and 12 from Romagna); Nebbiolo (Piemonte: n=11); Aglianico (Puglia: n=10); Nerello Mascalese (Sicilia: n=3); Primitivo (Puglia: n=11); Raboso (Veneto: n=10); Cannonau (Sardegna: n=9); Teroldego (Trentino: n=11); Sagrantino (Umbria: n=11); Montepulciano (Abruzzo: n=8); and Corvina (Veneto: n=7). All samples were monovarietal, i.e. 100% from a single grape variety, and were not blended with wines from other regions. Therefore, although they do not represent commercially available products, these wines exhibit their respective varietal uniqueness and were selected based on their importance at regional level. The aim of this study was to gain insight into the composition of real wines in terms of authenticity of grape and wine; therefore, each winery followed its own vinification protocol, optimised for each specific grape variety.
Guiseppina.P. Parpinello et al.

Tannin assay
All wines were analysed for iron-reactive tannins [tannins-Fe] according to the literature (Harbertson et al., 2003) and as described elsewhere in detail (Versari et al., 2007). Briefly, the wine tannins are precipitated with bovine serum albumin (BSA), then the pellet is dissolved in buffer and the tannins are determined by reaction with ferric chloride, yielding a coloured product quantified at 510 nm (UV-Vis spectrophotometer Cary 60, Agilent Technologies, Santa Clara, CA) using (+)-catechin as calibration standard (mg/L CE) (Sigma, Milano, Italy).
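The conversion from absorbance at 510 nm to catechin equivalents can be sketched as a linear calibration. The snippet below is a minimal illustration only: the standard concentrations, absorbances, function names and the `dilution_factor` parameter are hypothetical and are not taken from the study.

```python
# Hypothetical (+)-catechin standards: (concentration in mg/L, absorbance at 510 nm).
# These values are invented for illustration; they are not data from the study.
STANDARDS = [(0.0, 0.000), (50.0, 0.105), (100.0, 0.210), (200.0, 0.420), (300.0, 0.630)]


def fit_line(points):
    """Ordinary least-squares fit of absorbance = slope * concentration + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def tannin_mg_per_l_ce(absorbance, slope, intercept, dilution_factor=1.0):
    """Invert the calibration line to express a reading as mg/L catechin equivalents."""
    return (absorbance - intercept) / slope * dilution_factor


slope, intercept = fit_line(STANDARDS)
example_ce = tannin_mg_per_l_ce(0.315, slope, intercept)
```

In practice the calibration range and any dilution would follow the cited assay protocol rather than the placeholder values used here.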

Mid-infrared analysis
Infrared analysis was carried out using a diamond Attenuated Total Reflectance (ATR) Smart Orbit accessory (Thermo Optec), equipped with a deuterated triglycine sulfate detector and a KBr window for measuring the mid-infrared (MIR) region. The incident beam had a 45° geometry with respect to the diamond surface, yielding 25 internal reflections. The wines were analysed without any pretreatment or purification, simply by pouring a few drops of the sample over the ATR crystal. For each sample, the whole MIR spectral range from 4000 to 700 cm⁻¹ was averaged over 128 consecutive scans with a resolution of 4 cm⁻¹ (Figure 2). The samples were analysed in duplicate against an air background that was collected immediately prior to analysis, and the averaged spectrum was processed as a single sample spectrum for further multivariate analysis.

Data processing and multivariate analyses
The whole MIR spectra and several spectral regions were tested for chemometric elaboration, including the so-called "fingerprint region" from 1500 to 700 cm⁻¹. According to Louw et al. (2009), some spectral regions (5000-2970 cm⁻¹; 1716-1543 cm⁻¹) can be excluded prior to multivariate analysis to avoid strong absorption of water and spectral features that are little related to wine composition.
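The region selection described above can be sketched as a simple filter on the wavenumber axis. This is an illustrative sketch, not the software procedure used in the study: the function names are hypothetical, and the 4 cm⁻¹ grid spacing is an assumption made only to build a demonstration axis.

```python
def in_any_window(value, windows):
    """True if value falls inside any (low, high) window, bounds included."""
    return any(low <= value <= high for low, high in windows)


def select_region(wavenumbers, absorbances, keep=(700, 1500), exclude=()):
    """Return (wavenumber, absorbance) pairs inside `keep` and outside every `exclude` window."""
    low, high = keep
    return [
        (w, a)
        for w, a in zip(wavenumbers, absorbances)
        if low <= w <= high and not in_any_window(w, exclude)
    ]


# Demonstration axis from 4000 down to 700 cm^-1 (assumed 4 cm^-1 point spacing).
wavenumbers = list(range(4000, 699, -4))
absorbances = [0.0] * len(wavenumbers)  # placeholder absorbance values

# Fingerprint region only.
fingerprint = select_region(wavenumbers, absorbances, keep=(700, 1500))

# Whole spectrum minus the two windows quoted from Louw et al. (2009).
trimmed = select_region(
    wavenumbers, absorbances, keep=(700, 4000), exclude=[(2970, 5000), (1543, 1716)]
)
```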
To attempt the classification of wines based on the relationship between grape variety and wine composition, the raw MIR spectra were mean-centred and scaled to the same variance by standardisation throughout all multivariate statistical procedures (Unscrambler software version 10.3, Camo AS, Nedre Vollgate, Norway; XLSTAT 2018, Addinsoft, Paris, France), using principal component analysis (PCA), linear discriminant analysis (LDA), soft independent modelling of class analogy (SIMCA), and support vector machine (SVM). The selected multivariate tools have been previously used for the classification of grapevine varieties (Yu et al., 2017; Canuti et al., 2018) and the authentication of grape nectars (Whei Miaw et al., 2018).
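Mean-centring followed by scaling to unit variance (often called autoscaling) can be written in a few lines. The sketch below assumes the sample standard deviation (n − 1 denominator); whether the cited software uses the same convention is not stated in the text, and the function name is hypothetical.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Column-wise mean-centring and scaling to unit variance.

    `rows` is a list of spectra, one list of absorbances per sample.
    The sample standard deviation (n - 1) is assumed here.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    if any(sd == 0 for sd in sds):
        raise ValueError("constant variable: cannot scale to unit variance")
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Three hypothetical samples measured at two wavenumbers.
X_demo = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z_demo = autoscale(X_demo)
```

After this transformation every variable carries the same weight in the subsequent models, regardless of its original absorbance scale.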
In particular, PCA is an unsupervised technique that reduces the dimensionality of the response matrix (i.e. MIR spectra) to a few new principal components (PCs) and was used to examine the hidden structure of the dataset, to determine correlations between observations and variables and to describe the overall variation in the data (Esbensen, 2002; Naes et al., 2002).
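As a rough illustration of how PCA yields scores, loadings and explained variance, the following sketch uses a singular value decomposition in NumPy. It is not the Unscrambler implementation; the function name and the random demonstration matrix are assumptions.

```python
import numpy as np


def pca(X, n_components):
    """PCA by SVD of the mean-centred matrix.

    Returns scores (samples x components), loadings (variables x components)
    and the fraction of total variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]


# Hypothetical data: 20 spectra measured at 50 wavenumbers.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 50))
scores, loadings, explained = pca(X_demo, n_components=5)
```

The score columns are mutually orthogonal and fewer in number than the samples, which is the property that makes them convenient inputs for a subsequent discriminant analysis.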
LDA, a supervised classification technique, was used to assign the wine samples according to variety, and the classification performance was evaluated by comparing the number of correctly assigned objects to their total number. LDA maximises between-group variance and performs best when there are fewer variables than samples and when the variables are orthogonal, i.e. uncorrelated (Naes et al., 2002). LDA classification matrices were developed using a full cross-validation (leave-one-out) method on the PCA sample scores for the five principal components (PC1-5) that gave the highest level of separation (high variance) in the PCA models developed. PCA reduces the spectral data, thus allowing two important criteria to be satisfied: (i) the data are orthogonal and (ii) the number of variables is lower than the number of samples.
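The leave-one-out evaluation of LDA on a score matrix can be sketched as follows. This is a minimal NumPy illustration assuming a pooled within-class covariance and priors proportional to class size; it is not the XLSTAT or Unscrambler routine, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def lda_fit(X, y):
    """Fit LDA: class means, inverse pooled within-class covariance, class priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    resid = X - means[np.searchsorted(classes, y)]
    pooled = resid.T @ resid / (len(X) - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, np.linalg.pinv(pooled), priors


def lda_predict(model, X):
    """Assign each row of X to the class with the largest linear discriminant score."""
    classes, means, inv_cov, priors = model
    linear = X @ inv_cov @ means.T
    const = -0.5 * np.sum((means @ inv_cov) * means, axis=1) + np.log(priors)
    return classes[np.argmax(linear + const, axis=1)]


def loo_accuracy(X, y):
    """Fraction of samples correctly assigned when each is left out in turn."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        model = lda_fit(X[keep], y[keep])
        hits += int(lda_predict(model, X[i:i + 1])[0] == y[i])
    return hits / len(X)


# Synthetic, well-separated "scores" for three hypothetical classes of 10 samples.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(10, 2)) for c in centres])
y_demo = np.repeat([0, 1, 2], 10)
accuracy = loo_accuracy(X_demo, y_demo)
```

The returned value corresponds to the ratio of correctly assigned objects to their total number described above.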
SIMCA is a supervised classification technique based on PCA, which exploits the similarity among samples within a class more than the difference between the classes (Ballabio and Todeschini, 2009). The SIMCA was modelled in two steps: (i) the PCA was carried out for each grape variety and the number of components was determined through cross-validation; (ii) the classification of new samples was carried out by means of a SIMCA model formed through PCA. Thus, the sample set was divided at random into calibration and validation sets, grouping 70% and 30% of wine samples, respectively. In particular, the modelling power (MP) is a measure of the influence of a variable over a given model. This measure has values between 0 and 1; the closer to 1, the more that variable is taken into account in the class model, the higher the influence of that variable, and the more relevant it is to that particular class. Usually, any variable with a MP higher than 0.3 is considered relevant in the model (Wold et al., 1981). The discrimination power (DP) of a variable shows the ability of each variable to discriminate between two models, i.e. wines. Thus, a variable with a high DP (regarding two particular models) is very important for the differentiation between the two corresponding classes.
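One common way to express the modelling power of a variable is one minus the ratio of its residual standard deviation (after fitting the class PCA model) to its original standard deviation. The sketch below follows that simplified definition and ignores degrees-of-freedom corrections, so it should be read as an assumption rather than the exact formula used by the software; the function name and synthetic data are hypothetical.

```python
import numpy as np


def modelling_power(X_class, n_components):
    """Simplified modelling power per variable for one class PCA model.

    MP_j = 1 - s_j(residual) / s_j(centred data), without degrees-of-freedom
    corrections (an assumption made for this sketch).
    """
    X = np.asarray(X_class, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                 # loadings of the class model
    E = Xc - Xc @ P @ P.T                   # residuals after projection
    s_resid = np.sqrt(np.sum(E ** 2, axis=0))
    s_total = np.sqrt(np.sum(Xc ** 2, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(s_total > 0, s_resid / s_total, 1.0)  # constant variable -> MP = 0
    return 1.0 - ratio


# Synthetic class: variables 0 and 1 share one latent factor, variable 2 is noise.
rng = np.random.default_rng(0)
latent = rng.normal(scale=5.0, size=30)
X_demo = np.outer(latent, [1.0, 1.0, 0.0]) + rng.normal(scale=0.5, size=(30, 3))
mp = modelling_power(X_demo, n_components=1)
relevant = mp > 0.3  # threshold quoted in the text
```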
SVM is a supervised learning technique that uses an optimal separation hyperplane to separate samples in multi-class feature space. SVM can model cases of fewer samples and nonlinear relations (Bishop, 2006) and it does not require the use of PCA to reduce the dimensionality of the input matrix. The parameters of nonlinear

RESULTS

Tannin composition
Results of tannin composition from the whole dataset are summarised in Figure 3. As expected, the tannin content of red wines showed great variability (range, 6-2327 mg/L CE) due to the varietal diversity, with a slightly right-skewed distribution (data not shown), meaning that a larger proportion of samples lay on the left (low-tannin) side of the distribution than would be expected for a Gaussian distribution. The sampling strategy effectively selected red wines with a wide range of tannins and therefore the dataset is also suitable for further study on astringency.
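Right-skewness can be checked numerically with the moment coefficient of skewness, which is positive when the long tail lies on the high-value side. The values below are hypothetical placeholders, not the measured tannin contents.

```python
from statistics import mean, median


def sample_skewness(values):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical tannin contents (mg/L CE): most wines low, a few very high.
tannins_demo = [100, 150, 200, 250, 300, 400, 2000]
skew = sample_skewness(tannins_demo)
mean_above_median = mean(tannins_demo) > median(tannins_demo)
```

A positive coefficient, together with a mean above the median, is the usual signature of the right-skewed shape described in the text.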

Principal component analysis
The raw spectra within the fingerprint region (i.e. 1500-700 cm⁻¹) were analysed without rotation to improve the interpretation of the X-loadings. The original X-matrix, when analysed by PCA, was reduced to five principal components (PC1-5) that together explained 98% of the total variability, and these five PC scores were later used for LDA. The PCA plot scattered the samples in two dimensions and therefore the data are suitable to be independently treated. Although PC1 was responsible for 78% of the variance, there was a lack of visual grouping according to grape variety on the first two PCs (PC2 = 12% of variance explained) (data not shown). The X-loading plot showed five wavenumbers (1043, 877, 1083, 1164 and 1178 cm⁻¹) that mainly contributed to PC1.
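The arithmetic behind the explained-variance figures can be illustrated with a cumulative sum. In this sketch the values for PC1 (78%), PC2 (12%) and the five-component total (98%) come from the text, whereas the split of the remaining 8% across PC3 to PC5 is a hypothetical placeholder.

```python
# Percent of variance per component. PC1, PC2 and the 98% total are from the text;
# the individual PC3-PC5 values are hypothetical and only chosen to sum to 8%.
EXPLAINED_PCT = [78, 12, 4, 3, 1]


def cumulative(values):
    """Running total of a sequence."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def components_needed(values, threshold_pct):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    for k, c in enumerate(cumulative(values), start=1):
        if c >= threshold_pct:
            return k
    return None  # threshold not reachable with the listed components


first_two = cumulative(EXPLAINED_PCT)[1]
n_for_98 = components_needed(EXPLAINED_PCT, 98)
```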

LDA
LDA was used in the first attempt to classify the red wines according to their grape variety. Due to the limited classification rate achieved (Table 1), the SVM approach was then used to improve the classification of the wines.

SIMCA
SIMCA was modelled on the fingerprint spectra region (i.e. MIR spectra 1500-700 cm⁻¹) to counterbalance the limited sample size and the number of independent variables. Although SIMCA can work with as few as ten samples per class, the Nerello Mascalese wines were not considered for SIMCA due to the limited number of samples available (n=3). SIMCA does not provide a single plot for looking at all the groups concurrently as it uses different PC models for each group. While SIMCA showed limited performance in modelling the identification of red wines due to the complexity of the dataset, it provides useful information to visualise the important factors in terms of modelling power (MP) and discrimination power (DP). The heat plot of MP for the selected red wines showed the more relevant variables (i.e. MIR wavenumbers) for each particular class of wine (Figure 4). The variables with higher influence for each class model are coloured in green and show that some wines/cultivars (e.g., Sangiovese Toscana, Primitivo, and Raboso) have many variables that are potentially relevant for helping each principal component to model variation in the data.
The DP, i.e. the ability of each variable to discriminate between two models, showed two main groups: (i) Corvina, Primitivo, Montepulciano, Teroldego and Sagrantino; and (ii) Cannonau, Sangiovese, Raboso, Aglianico and Nebbiolo. The former group has variables with a high DP (with regard to two particular models), which are therefore important for the differentiation between the two corresponding classes (Figure 5). For practical reasons, the Sangiovese wine from Tuscany was selected as the reference class due to the great importance of this variety at national level.

SVM
SVM was then attempted to improve the class modelling of wines. Like PCA, the SVM method used the fingerprint spectra region (i.e. MIR spectra 1500-700 cm⁻¹), together with a cross-validation approach due to the small number of samples. The results obtained with the SVM classification are encouraging, with 72.2% overall correct classification for the training set (Table 2) and 44.4% for the validation set of wines.
OENO One, 2019, vol. , x

DISCUSSION
The D-wine project "The diversity of tannins in Italian red wines" focuses on polyphenols due to the great contribution of this class of compounds to red wine colour, mouthfeel and aroma longevity. The tannin content of Italian red wines is consistent with the literature (Harbertson et al., 2008), in which a range from 30 to 1895 mg/L catechin equivalents (CE) was reported for five grape varieties: Pinot noir (n = 261 wines), Syrah (n = 266), Merlot (n = 197), Zinfandel (n = 182), and Cabernet Sauvignon (n = 364). As a matter of fact, the Italian red wines with high tannin content (e.g. Sagrantino, Nebbiolo, and Aglianico) are more suitable for ageing, whereas the other wines with low tannin concentration (e.g. Corvina, Teroldego, and Montepulciano) seem more appropriate for young wines. There are, however, several well-known practices that can modify the 'natural' tannin content of red wine, including the addition of exogenous tannins, ageing in barriques and blending. It is worth noting that all the wines sampled in this study were obtained without contact with any wood source or added tannin (e.g. barriques, staves, chips, oenological tannins, etc.); therefore, the current findings disclose the 'natural' tannin content of Italian monovarietal red wines.
Regarding PCA, the lack of grouping according to grape variety on the first two PCs may be due to several factors, including the high number of grape varieties (n=11), the unequal group sizes and the limited sample size. Bevin et al. (2008), using MIR spectra of wines, failed to discriminate among wines from four white grape varieties (Chardonnay, Riesling, Sauvignon Blanc and Viognier) with the first two PCs, which explained 94% of the variation in the spectra. Similarly, the red wines (Cabernet Sauvignon, Shiraz, and Merlot) showed little grouping using the first two PCs (PC1 = 69% and PC2 = 20%). Regarding the wavenumbers that mainly contributed to PC1, Zhang et al. (2010) showed that the closely neighbouring peak at 1043 cm⁻¹ corresponds to the C-O stretch absorption bands, which are important regions for glycerol, whereas ethanol peaks at about 1085 cm⁻¹ (Shurvell, 2001). The band at 877 cm⁻¹ would be related to the C-C stretching vibration of organic molecules. Some authors located the asymmetric and symmetric stretching for C-O in the aromatic -OH group of hydrolysable tannins within the spectral region from 1050 to 1165 cm⁻¹ (Pantoja-Castroa and González-Rodríguez, 2011). The band close to the peak at 1176 cm⁻¹ was found to be a typical feature of C-O stretching of hydrolysable tannins (Agatonovic-Kustrin et al., 2013). According to Jensen et al. (2008), two MIR regions (1485-1425 and 1060-995 cm⁻¹) were likely to be particularly important for tannin quantification.
For optimum performance, LDA needs a balanced design (i.e. a similar number of objects in the various classes), and the relevant information should be in the mean of the data (not in the variance). Moreover, LDA is a parametric method that provides inferior results for nonlinear problems, so SVM was used to address non-linearly separable cases by applying the kernel approach with a radial basis function (RBF). The Sangiovese wines (n=19) were split into two sub-groups (Sang-Romagna n=12 and Sang-Tuscany n=7) because the origin of this variety is disputed between Romagna and Tuscany.
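The RBF kernel mentioned above measures similarity between two spectra as exp(−γ‖x − z‖²), which is what allows an SVM to draw nonlinear class boundaries. The sketch below only illustrates the kernel itself; the value of `gamma`, the function names and the toy vectors are assumptions, since the kernel parameters used in the study are not given in this section.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Pairwise kernel values for a list of samples (symmetric, ones on the diagonal)."""
    return [[rbf_kernel(x, z, gamma) for z in samples] for x in samples]


# Toy two-variable "spectra"; gamma = 0.5 is an arbitrary illustrative choice.
samples_demo = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
K_demo = gram_matrix(samples_demo, gamma=0.5)
```

Identical samples give a kernel value of 1, and the value decays towards 0 as samples move apart, with `gamma` controlling how quickly.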
Although the SVM classification of three grape varieties was problematic (Nerello Mascalese, Raboso and Primitivo), the remaining wines were correctly assigned up to 100% (Table 2).
These accuracy values are comparable to those obtained for grapevine varietal classification using near infrared (NIR) spectroscopy with cross-validation and SVM (Yu et al., 2017).
SIMCA may not provide satisfactory results if sample distribution in variable space is not uniform (Di Egidio et al., 2011). Moreover, the high dimensionality of wine identification requires the selection of variables to ensure the model's performance with fewer variables. SIMCA creates an individual submodel for each class based on a supervised pattern recognition approach, and therefore the variables selected can differ from the global PCA. Thus, the MP of SIMCA outlined the important variables (i.e. wavenumbers) for model variation, whereas the discriminatory power related the contribution of the variables to the identification of wines in the data set. Deleting variables with both a low MP and a low DP may sometimes help in improving the classification. However, the current wine dataset is problematic due to the high number of classes and low number of samples, and therefore it was a challenge to select (few) features to advance the performance of the model and to simplify the analysis of the results. In future, the SIMCA approach could be useful to refine the model once further wines are sampled.
For this reason, SVM with a nonlinear kernel was attempted, considering that the method does not need a large number of samples to be trained and is not affected by the presence of outliers. The superior results of SVM confirmed its ability to classify samples when linear functions are not adequate to achieve complete class separation. Acevedo et al. (2007) discriminated Spanish red and white wines, with classification rates above 96%, according to their denomination of origin by ultraviolet (UV)-visible spectrophotometric techniques combined with SVM.
In conclusion, according to the literature the MIR spectrum of wine can be used to discriminate the varietal origin of wines, and this preliminary study tested, for the first time, a large number of grape varieties (11). Although the number of red wines under investigation (110 samples) was too limited to provide a definitive classification of each and every variety based on their MIR spectra, the current findings showed the occurrence of a distinctive MIR pattern for some Italian grape cultivars. In particular, Nebbiolo, Corvina, Teroldego, Sangiovese Romagna and Sagrantino wines were, to a large extent, correctly classified using the current SVM approach. Further study is needed to disclose the effect of additional variables, such as vintage.
Further analysis is in progress to provide full information about the distinctive physicochemical and sensory characteristics of the selected monovarietal red wines.