Quantitative determination of macro components and classification of some cultivated mushrooms using near-infrared spectroscopy

Fourier transform near-infrared spectroscopy

primarily due to their protein content, a significant part of which is made up of essential amino acids. The amino acid composition of mushroom proteins is much closer to that of animal proteins, which makes them more valuable than plant proteins (Meenu & Xu, 2019; Wang et al., 2014).
The carbohydrate content of mushrooms is important from a nutritional-physiological point of view, as carbohydrates play an important role in cell wall composition and energy supply. Recent studies have emphasized the importance of the carbohydrate content of mushrooms, as functional nutrients and bioactive components help to preserve health (Castro-Alves & do Nascimento, 2016; Rodrigues Barbosa, dos Santos Freitas, da Silva Martins, & de Carvalho, 2020).
Diffuse reflectance Fourier transform near-infrared spectroscopy has been employed for species identification and taxa delimitation of Pleurotus mushrooms. Principal component analysis (PCA) of FTIR spectra of Agaricus bisporus revealed that physical damage to mushrooms has a significant effect on tissue structure and the aging process (Meenu & Xu, 2019).
The accurate determination of the nutritionally and physiologically important components of mushrooms (protein, carbohydrate, and energy content) is a key task of quality control. At present, classical wet chemical methods are used to determine these components; they require a considerable amount of time, chemicals, and energy and have a significant impact on the environment (Table 1).
In modern analytical chemistry, examination methods based on chemometric and platform-free multivariate statistical processing of "big data" sets have become more common. One of these is Fourier transform near-infrared spectroscopy (FT-NIRS), which, being fast, chemical-free, and environmentally friendly, makes it possible to replace wet chemical methods.
The purpose of this work was twofold. The first aim was to investigate and explain the differences between the mushroom species

Dry matter content
The collected, freshly chopped mushroom samples were dried to constant weight in a programmable air circulating drying oven (Memmert, Schwabach, Germany). Results were expressed as % w/w. The estimation function was based on the spectra of fresh samples.

Protein content
The measurement of the protein content of the mushroom samples was carried out according to ISO standards (ISO, 2013) with Kjeldahl digestion method. Calculated nitrogen was multiplied by 4.38 (Ferreira, Morales, & Barros, 2017). Results were expressed as % w/w.
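The conversion from Kjeldahl nitrogen to protein can be sketched as follows. This is only an illustration of the arithmetic: the factor 4.38 is the one stated above, while the function name and the sample nitrogen value are hypothetical.

```python
# Nitrogen-to-protein factor reported in the text (Ferreira, Morales, & Barros, 2017).
N_TO_PROTEIN = 4.38


def protein_percent(nitrogen_percent: float, factor: float = N_TO_PROTEIN) -> float:
    """Convert Kjeldahl nitrogen (% w/w) to crude protein (% w/w)."""
    if nitrogen_percent < 0:
        raise ValueError("nitrogen content cannot be negative")
    return nitrogen_percent * factor


# Hypothetical sample containing 5.0 % w/w nitrogen.
print(round(protein_percent(5.0), 2))
```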

Carbohydrate content
The measurement of the carbohydrate content of the dried samples was carried out by the Luff-Schoorl method (Gafta, 2018 II is subsequently determined iodometrically. As a preparatory step, prolonged acidic hydrolysis (25 g/L HCl, 100°C, 3 hr) was applied before the polysaccharide quantification. Results were expressed as % w/w.

Total carbohydrate content
The total carbohydrate content was calculated by difference from the protein, fat, and ash contents according to the following equation (Gezer et al., 2016): Total carbohydrates (g) = 100 − (g protein + g fat + g ash). The ash content was determined by ashing the samples for 12 hr at 600°C (Labor MIM, Budapest, Hungary). This method makes it possible to determine the polysaccharide components that act as fiber and cannot be regarded as hydrolysates of starch. Results were expressed as % w/w.
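The calculation by difference given in the equation above can be written as a short function. The sketch assumes all inputs are expressed in g per 100 g of dry sample; the function name and the example values are hypothetical.

```python
def total_carbohydrate(protein_g: float, fat_g: float, ash_g: float) -> float:
    """Total carbohydrate by difference, in g per 100 g of dry sample."""
    total = 100.0 - (protein_g + fat_g + ash_g)
    if total < 0:
        raise ValueError("protein + fat + ash exceeds 100 g")
    return total


# Hypothetical composition: 25 g protein, 3 g fat, 9 g ash per 100 g dry matter.
print(total_carbohydrate(25.0, 3.0, 9.0))
```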

Energy content
The energy content was calculated in accordance with Regulation 1169/2011/EU of the European Council (European Commission, 2011). According to the regulation, the fat and fiber content of the sample is needed to calculate the energy content. Mushrooms typically have a low fat content (0.1-0.5 g/100 g fresh weight), so we did not measure it but used literature data for the individual species (Ferreira et al., 2017).
In the case of fiber content, we considered that all the polysaccharides (glycogen, chitin, beta-glucan, hemicellulose, and pectin derivatives) making up the fiber content of mushrooms can be broken down to monomers by long-time acid hydrolysis.
As only some of the listed polysaccharides are utilizable, the energy content was calculated based on the amount of hydrolysable polysaccharides.
Energy was calculated according to the following equations: wherein the carbohydrate value is the data obtained after 3 hr of hydrolysis.
Given that the hydrolysate does not represent the total carbohydrate content, this value is defined as "usable" energy. Results were expressed as kJ/100 g.
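The equation itself is not reproduced in the text above, so the following is only a sketch of how such a calculation could look. It assumes the general conversion factors of Annex XIV of Regulation (EU) No 1169/2011 (17 kJ/g for carbohydrate and protein, 37 kJ/g for fat) and assumes that no separate fiber term is used, because the hydrolysable polysaccharides are counted as carbohydrate. The function name and the sample values are hypothetical.

```python
# Conversion factors from Annex XIV of Regulation (EU) No 1169/2011, in kJ per gram.
# Assumption: the regulation's 8 kJ/g fiber factor is NOT applied in this sketch.
KJ_PER_G = {"carbohydrate": 17.0, "protein": 17.0, "fat": 37.0}


def usable_energy_kj(carbohydrate_g: float, protein_g: float, fat_g: float) -> float:
    """Usable energy in kJ/100 g from macro-component contents given in g/100 g.

    carbohydrate_g is meant to be the value obtained after the 3 hr hydrolysis.
    """
    return (
        KJ_PER_G["carbohydrate"] * carbohydrate_g
        + KJ_PER_G["protein"] * protein_g
        + KJ_PER_G["fat"] * fat_g
    )


# Hypothetical fresh sample: 3.0 g carbohydrate, 2.5 g protein, 0.3 g fat per 100 g.
print(round(usable_energy_kj(3.0, 2.5, 0.3), 1))
```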

Spectral measurements and pre-processing
The samples were subsequently measured on a Bruker MPA Multipurpose FT-NIR analyzer (Bruker Optik GmbH, Ettlingen, Germany) equipped with a quartz beam splitter, an integrated Rocksolid interferometer, and a PbS detector, working in the 12,500-3,800 cm⁻¹ range, combined with OPUS 7.2 software (Bruker Optik GmbH, Ettlingen, Germany). Absorption spectra were collected in diffuse reflectance mode. The resolution was 8 cm⁻¹. The scanner speed was 10 kHz and each spectrum was the average of 32 subsequent scans. The internal background was measured using the integrated gold-coated surface of the integrating sphere. For spectral pre-processing standard normal variate (SNV), multiplicative scatter correction (MSC), first derivative (1st)
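The three pre-processing methods named above can be illustrated with a minimal NumPy sketch. It is not the OPUS implementation: the first derivative is taken here as a plain finite difference (commercial software typically applies a smoothing derivative), and all function names and the synthetic spectra are assumptions made for illustration.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)  # row ~ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected


def first_derivative(spectra, wavenumbers):
    """Finite-difference first derivative along the wavenumber axis."""
    return np.gradient(np.asarray(spectra, dtype=float),
                       np.asarray(wavenumbers, dtype=float), axis=1)


# Synthetic spectra that differ only by a multiplicative and an additive effect.
wavenumbers = np.linspace(12500, 3800, 50)
base = np.sin(wavenumbers / 1000.0)
raw = np.array([a * base + b for a, b in [(1.0, 0.0), (1.5, 0.2), (0.8, -0.1)]])

print(np.allclose(snv(raw), snv(raw)[0]))       # scatter differences removed by SNV
print(np.allclose(msc(raw), raw.mean(axis=0)))  # MSC maps every row onto the reference
```

SNV followed by the first derivative, the combination reported as most favorable for the energy model, would be `first_derivative(snv(raw), wavenumbers)` in this sketch.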

Statistics
Principal Component Analysis (PCA) was used for detecting the spectral outliers and linear discriminant analysis (LDA) was used to classify the six species of mushrooms. Partial Least Squares Regression (PLSR) was used for building the prediction models and interval Partial Least Squares Regression (iPLSR) was used for building the prediction model of energy.

Principal component analysis
PCA is a frequently applied method for multivariate overview analysis of spectral data; it is a method of data reduction or data compression. Applied to a data matrix of samples by variables it constructs new variables, known as principal components (PCs

Linear discriminant analysis
LDA is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications to find a linear combination of features that characterizes or separates two or more classes of objects or events. The LDA technique is developed to transform the features into a lower dimensional space, which maximizes the ratio of the between-class variance to the within-class variance, thereby guaranteeing maximum class separability (Tharwat, Gaber, Ibrahim, & Hassanien, 2017).
LDA, as implemented in STATISTICA 12.0 (Tulsa, Oklahoma, USA), has different options to select the significant variables for model building, such as forward stepwise, backward stepwise, or all effects. All effects model building method and X-scrambling randomization test were used five times.
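The general idea of running LDA on PCA scores can be sketched in NumPy as follows. This is not the STATISTICA procedure: it assumes equal class priors and a pooled within-class covariance, it evaluates the model on the training data only, and the data are synthetic (six classes, standing in for the six species). All function names are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components):
    """Scores of the first n_components principal components (mean-centred X)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def lda_fit(scores, labels):
    """Class means and inverse pooled within-class covariance."""
    classes = np.unique(labels)
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    resid = scores - means[np.searchsorted(classes, labels)]
    pooled = resid.T @ resid / (len(scores) - len(classes))
    return classes, means, np.linalg.pinv(pooled)


def lda_predict(model, scores):
    """Assign each sample to the class with the largest linear discriminant score."""
    classes, means, inv = model
    d = scores @ inv @ means.T - 0.5 * np.sum((means @ inv) * means, axis=1)
    return classes[np.argmax(d, axis=1)]


def confusion_matrix(true, pred, classes):
    """Rows: true class, columns: predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true, pred):
        cm[index[t], index[p]] += 1
    return cm


# Synthetic "spectra": 6 classes x 10 samples, 50 variables.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 10)
centres = rng.normal(0.0, 1.0, size=(6, 50))
X = centres[labels] + rng.normal(0.0, 0.1, size=(60, 50))

scores = pca_scores(X, n_components=20)  # 20 scores, as in the study
model = lda_fit(scores, labels)
cm = confusion_matrix(labels, lda_predict(model, scores), model[0])
print(cm)
```

In practice the confusion matrix would be computed on cross-validated or held-out predictions rather than on the training samples.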

Partial least squares regression
The idea behind PLSR is to find a few linear combinations (compo-

Interval partial least squares regression
The interval partial least squares regression (iPLSR) method is a common choice for variable selection, especially for near-infrared spectra, because spectral data are highly correlated and working with windows of variables is a better option than examining each variable individually. The process is very similar to the original PLS method, but here the spectra are divided into a number of intervals (of equal length or defined manually). The number of intervals can be chosen freely. A PLS model is built for each interval, and the aim of the method is to select the few intervals that have the smallest RMSECV. Using these intervals gives better predictions than using the whole spectrum (Islam et al., 2018

RESULTS AND DISCUSSION
Comparing the average spectra (Figure 1) of the studied species it can be seen that the spectra mainly differ in height. This difference was due to the different particle sizes and reflections deriving from it. The first derivative curve is an excellent way of detecting fine characteristic differences. However, no characteristic difference was observed here.

Principal component analysis
Prior to creating a model, PCA must be performed in every case in order to examine spectral outliers (Figure 2). The ellipse indicates a 95% confidence interval.
The residual statistics on the ordinate axis (F-residuals) describe the distance of the sample to the model, whereas Hotelling's T² describes how well the sample is described by the model.
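The two outlier statistics can be illustrated with a small NumPy sketch. It computes Hotelling's T² from the PCA scores and, as a stand-in for the F-residuals reported by the software, the squared residual distance of each sample to the model (often called the Q residual). The function name, the synthetic data, and the injected outlier are assumptions for illustration.

```python
import numpy as np


def pca_outlier_stats(X, n_components):
    """Return Hotelling's T2 and squared residual distance for every sample."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]            # scores
    P = Vt[:n_components].T                                # loadings
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)   # variance of each score
    t2 = np.sum(T ** 2 / score_var, axis=1)                # distance inside the model
    residuals = Xc - T @ P.T
    q = np.sum(residuals ** 2, axis=1)                     # distance to the model
    return t2, q


# Synthetic data: 30 samples driven by 2 latent factors, plus one off-model spike.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40)) + 0.01 * rng.normal(size=(30, 40))
X[7, 5] += 5.0  # hypothetical spectral outlier

t2, q = pca_outlier_stats(X, n_components=2)
print(int(np.argmax(q)))  # index of the sample farthest from the model
```

A sample is usually flagged when either statistic exceeds a critical limit; the limits themselves (for example the 95% ellipse in Figure 2) are not computed in this sketch.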

Linear discriminant analysis
In the first step, the average spectra of the samples from 12,500 to 3,800 cm⁻¹ were used for principal component analysis. No data pre-processing was applied. The first 20 PCA scores were used for further analysis with LDA. The confusion matrix (

Partial least squares regression
We determined the dry matter, protein, carbohydrate, and total carbohydrate content of the samples according to the procedure described in 2.3.1. and then calculated the energy content from these data as described above. Each parameter was determined in triplicate.
The calibration models were built on the original dataset, which contains 192 samples, over the spectral range between 12,500 and 3,800 cm⁻¹.
The models were optimized with different data pre-processing methods. Ten PLS components were used for the calibration model, and random fivefold cross-validation and test set validation were used as validation procedures for all our models. For test set validation, 33 percent of the samples were selected with "automatic selection". This automatic selection method is based on the scores of the first two principal components. The closer the samples are in the scores plot, the more similar they are with respect to PC1 and PC2. Conversely, samples far away from each other are different from each other (Optik, 2011).
The best estimation functions are summarized in Table 3. The selection of the evaluation ranges was based on the characteristic vibration areas of the examined macro component (Workman & Weyer, 2012).
Cross-validation and test set validation methods were used to test the calibration model.
The Q² value of the cross-validations was above 0.9 for dry matter, carbohydrate, total carbohydrate, and protein content alike. The Q²_T values of the test set validations are lower than these but, except for the carbohydrate content, they also exceeded 0.9 in all cases.
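How RMSECV and Q² follow from a random fivefold cross-validation can be sketched with a compact NIPALS PLS1 implementation in NumPy. This is an illustration only, not the OPUS workflow: the data are synthetic, three latent variables are used instead of the ten allowed in the study, and all function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """NIPALS PLS1; returns (regression vector, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # scores
        p = Xr.T @ t / (t @ t)          # X loadings
        c = yr @ t / (t @ t)            # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    coef = W @ np.linalg.solve(P.T @ W, np.array(q))
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean


def cross_validate(X, y, n_components, n_folds=5, seed=0):
    """Random k-fold cross-validation; returns (RMSECV, Q2)."""
    order = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(order, test)
        pred[test] = pls1_predict(pls1_fit(X[train], y[train], n_components), X[test])
    press = np.sum((y - pred) ** 2)
    rmsecv = float(np.sqrt(press / len(y)))
    q2 = float(1.0 - press / np.sum((y - y.mean()) ** 2))
    return rmsecv, q2


# Synthetic "spectra" driven by three latent factors.
rng = np.random.default_rng(0)
T = rng.normal(size=(80, 3))
X = T @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(80, 200))
y = T @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=80)

rmsecv, q2 = cross_validate(X, y, n_components=3)
print(rmsecv, q2)
```

The rank of a model can then be chosen by repeating `cross_validate` for 1 to 10 components and taking the minimum RMSECV.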

TABLE 2 Confusion matrix of LDA
TABLE 3 Performance parameters, scaling methods, number of latent variables (Rank), and evaluation ranges of the best models
Note: R², percentage of variance accounted for by the calibration; Q², Q²_T, percentage of variance accounted for by the fivefold cross-validation and the test set validation; RMSEC, root mean square error of calibration; RMSECV, root mean square error of fivefold cross-validation; RMSEP, root mean square error of test set validation; Rank, number of PLS vectors.
The rank was chosen based on the global minimum of the root mean square error of cross-validation (RMSECV). The Q² of the carbohydrate prediction is the lowest value, which can be explained by the fact that the reference method (Luff-Schoorl method) carries considerable uncertainty and error.

TABLE 4 Determination of utilized energy content: PLS-R data for the whole measurement range
As these results meet the general fit-for-purpose requirements for the quantification of macronutrients in cultivated mushrooms, the FT-NIR related methods can be adapted for routine applications.
Similar NIR prediction functions have been developed to determine the total polysaccharide content of some Ganoderma species, which can be characterized by RMSEP = 0.224% w/w and 0.603% w/w (Chen et al., 2012; Ma et al., 2018). However, their results cannot be compared with ours, because the spectra of the Ganoderma samples differ so much from the spectra of the examined cultivated mushroom species that they were spectral outliers.

Interval partial least squares regression
Given that energy content is a complex property, calculated from the joint contribution of several macro-components (carbohydrates, proteins, fat), interval PLS regression (iPLSR) was used to predict it.
As a first step, we built models over the whole measurement range (12,500-3,800 cm⁻¹) with different data pre-processing methods. In accordance with the earlier models, a maximum of 10 latent variables was taken into consideration.
In this case, the models were also validated by sevenfold random-segment cross-validation (Table 4).
We obtained the most favorable results (highest Q² and lowest RMSECV) with the combined SNV + first derivative pre-processing.
During the iPLSR examination, the whole measurement range was divided into 20 and into 30 equal intervals. A PLS model was built for each interval, its RMSECV value was examined (Figure 4a,b), and the ranges with the lowest RMSECV values were selected.
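The interval search described here can be sketched in NumPy as follows. The example is illustrative only: it uses synthetic data in which a single interval carries the information, a compact NIPALS PLS1 routine, and fivefold rather than sevenfold cross-validation. All function names and parameter values are assumptions.

```python
import numpy as np


def pls1_coef(X, y, n_components):
    """NIPALS PLS1 regression vector for mean-centred X and y."""
    Xr, yr = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        p = Xr.T @ t / (t @ t)
        c = yr @ t / (t @ t)
        Xr -= np.outer(t, p)
        yr -= c * t
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))


def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """Root mean square error of random k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(order, test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        coef = pls1_coef(X[train] - x_mean, y[train] - y_mean, n_components)
        pred[test] = (X[test] - x_mean) @ coef + y_mean
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def ipls(X, y, n_intervals, n_components):
    """Build one PLS model per interval and return (intervals, RMSECV per interval)."""
    intervals = np.array_split(np.arange(X.shape[1]), n_intervals)
    errors = [rmsecv(X[:, cols], y, min(n_components, len(cols))) for cols in intervals]
    return intervals, np.array(errors)


# Synthetic data: only variables 30-39 (interval 3 of 20) are related to y.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 200))
y = X[:, 30:40].sum(axis=1) + 0.1 * rng.normal(size=150)

intervals, errors = ipls(X, y, n_intervals=20, n_components=3)
best = int(np.argmin(errors))
print(best, intervals[best][0], intervals[best][-1])
```

Extended selections such as the 30_2 method would correspond to concatenating the columns of several low-RMSECV intervals and fitting one PLS model on the combined range.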
We also examined the models that can be built by extending the evaluation ranges with two further ranges whose RMSECV value is only slightly higher than the average RMSECV value (Figure 4b). Using method 30_2, two areas that are only slightly above the average RMSECV were added to the ranges of method 30_1. The results obtained by PLSR and iPLSR are summarized in Table 5.
By examining the ranges resulting from the iPLSR procedure, it can be seen that, in accordance with the earlier presented models, the high-energy region (12,500-9,000 cm⁻¹) does not contain useful information.
The most favorable result was achieved with the extended 30-interval iPLS (method 30_2) (Figure 5).
Applied data preprocessing was SNV + first derivative combined procedure. To set up the PLS model, eight PLS components were needed.
The R² value was 0.907 for the calibration and the Q² value was 0.867 for the validation. The RMSEC value was 36.8 kJ/100 g for the calibration and the RMSECV value was 44.4 kJ/100 g for the validation.

CONCLUSION
We successfully applied the FT-NIRS technique combined with chemometrical methods to evaluate the quantitative characteristics of the nutritional values of cultivated mushrooms.
The developed prediction function can be used to determine the dry matter content with a root mean square error of 0.37% w/w in a concentration range of 5.5 to 12.4% w/w.
The prediction function developed for the determination of carbohydrate content can be used with an average error of 0.867% w/w in a concentration range of 5.4 to 36.3% w/w.
Prediction of total carbohydrate content can be performed with an average error of 0.936% w/w in a concentration range of 5.4%-36.3% w/w, while for the protein content, this average prediction error is 0.974% w/w in the concentration range of 12.9%-34.6% w/w.
Using iPLS, the root mean square error of the original prediction was improved from 81.8 kJ/100 g to 44.4 kJ/100 g. These data also demonstrate the benefit of using a variable selection method for such a complex property.
All six investigated species were successfully discriminated by the LDA-supervised pattern recognition method solely based on the NIR spectra.
The established prediction models and the successful pattern recognition can help mushroom growers choose the right species and variety.
For processing plants and the processing trade, high-quality raw materials are the primary concern, because they are the basis of high-quality products.
The FT-NIR method provides a rapid answer to this need and, owing to its versatility, also offers the possibility of in-line, on-line, or at-line application.

ACKNOWLEDGMENTS
The Project was supported by the European Union and cofinanced by the European Social Fund (grant agreement no. EFOP-3.6.3-VEKOP-16-2017-00005). We are grateful for the support of the Szent István University's Doctoral School of Food Sciences.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

DATA AVAILABILITY STATEMENT
Data available on request from the authors.