Machine learning for a rapid discrimination of ginseng cultivation age using 1H-NMR spectra

The scientific and systematic classification of cultivation age is important for preventing age falsification and ensuring the quality of ginseng. Therefore, we applied deep learning to classify the cultivation age of ginseng. Deep learning, which is based on artificial neural networks, is a relatively new, state-of-the-art class of machine learning models. It is a powerful tool and has been used to solve complex problems in many fields. In the present study, powdered samples of 4-, 5-, and 6-year-old ginseng were measured using high-resolution magic angle spinning nuclear magnetic resonance (HR-MAS NMR) spectroscopy. The NMR data were analyzed with partial least-squares discriminant analysis (PLS-DA) and with deep learning to improve accuracy. The accuracy of the PLS-DA model was 87.1% and that of the deep learning model was 93.9%. NMR spectroscopy combined with deep learning can be a useful tool for discriminating ginseng cultivation age.


Introduction
Ginseng, a perennial crop, has been used as a natural medicinal ingredient for thousands of years in East Asia. Ginseng cultivated for 4-6 years is used as a medicine [1]. Of these, 6-year-old ginseng is known to be the most effective and is therefore the most expensive [2]. It is thus important to establish an efficient and accurate model for classifying and predicting ginseng cultivation age.
Previous studies conducted metabolic profiling and multivariate statistical analyses of the origin, species, variety, age, and processing method of ginseng using nuclear magnetic resonance (NMR) spectroscopy, and all of these studies analyzed ginseng extracts [3][4][5][6][7][8]. In this study, which analyzed a large number of samples, the ginseng powder was used directly without extraction because the solvents or procedures used for extraction can affect the data. Several previous studies used various tools for age discrimination of ginseng, such as a combination of NMR spectroscopy, ultrahigh-performance liquid chromatography quadrupole time-of-flight mass spectrometry (UPLC-QTOF/MS), and gas chromatography quadrupole time-of-flight mass spectrometry (GC-TOF/MS) [9], and Fourier-transform infrared (FT-IR) spectroscopy [10]. The HR-MAS technique combines the advantages of solid-state and solution-state NMR. Spinning the sample rapidly at the magic angle (54.7°) with respect to the static magnetic field (B0) reduces the line broadening caused by dipolar coupling and chemical shift anisotropy (Fig. 1), so high resolution can be achieved [11,12]. This technique yields spectra with a resolution similar to that of solution-state NMR without a sample extraction step [13]. The advantage of HR-MAS over solution-state NMR is that samples are measured in a state close to their intact condition. Furthermore, because HR-MAS does not require an extraction process, it reduces the experiment time [14].
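The magic-angle value quoted above can be checked numerically. Under magic angle spinning, the dipolar coupling and chemical shift anisotropy terms are scaled by the second Legendre polynomial (3cos²θ - 1)/2, which vanishes at θ = arccos(1/√3). The short sketch below is only an illustration of that relation; the function name `second_legendre` is a hypothetical helper and is not part of the original study.

```python
import math


def second_legendre(theta_rad):
    """Scaling factor (3*cos^2(theta) - 1) / 2 for anisotropic interactions."""
    c = math.cos(theta_rad)
    return (3.0 * c * c - 1.0) / 2.0


# Angle at which the scaling factor is zero: theta = arccos(1 / sqrt(3))
magic_angle_rad = math.acos(1.0 / math.sqrt(3.0))
magic_angle_deg = math.degrees(magic_angle_rad)

print(round(magic_angle_deg, 1))          # rounds to the 54.7 degrees quoted in the text
print(second_legendre(magic_angle_rad))   # numerically zero
```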
In this study, to discriminate the age of ginseng harvested over several years, 1H HR-MAS NMR spectroscopy data were analyzed by partial least-squares discriminant analysis (PLS-DA), a machine learning method that is widely used in metabolic analyses [15][16][17][18]. We also applied deep learning to improve the accuracy of ginseng age discrimination. Deep learning architectures such as deep neural networks (DNN) are a relatively new, state-of-the-art class of machine learning models [19]. They are powerful tools and have been used to solve complex problems in many fields. Many important advances in machine learning have been made recently following improvements in computing power and in the amount of data available [20]. Moreover, open source machine learning libraries and frameworks have made deep learning more accessible to non-experts. This study is meaningful because it is the first to apply machine learning to HR-MAS NMR data.

Plant materials
Panax ginseng roots were cultivated in the experimental field at Kyung Hee University and at the Department of Herbal Crop Research located in Gangwon Province, according to the protocol described in the 'ginseng GAP standard cultivation guide' developed by the Rural Development Administration, Republic of Korea (Rural Development Administration, 2009). Four-, five-, and six-year-old ginseng roots were harvested in October 2014, October 2015, and October 2016, respectively. The voucher specimens (NIHHS141010 and NIHHS161001) were deposited at the herbarium of the Department of Herbal Crop Research, NIHHS, RDA, Eumseong, Republic of Korea.

Sample preparation
We analyzed a total of 385 ginseng samples, comprising 162 4-year-old, 108 5-year-old, and 115 6-year-old samples. Each sample was washed, dried at 40 °C in a forced-air convection-drying oven for 48 h, and then weighed. The main roots were used for the experiments after removing the lateral and fine roots. The roots were ground (< 0.5 mm) and thoroughly mixed using a Hanil Scientific Inc. mixer (Seoul, Korea), and the subsamples were homogenized further using a Retsch MM 400 mixer mill (Retsch GmbH, Haan, Germany) for the analyses.
All NMR spectra were measured using a 600.167 MHz Agilent spectrometer equipped with a 4-mm gHX NanoProbe (Agilent Technologies). The spinning rate was 2050 Hz for HR-MAS. We collected 128 transients using the Carr-Purcell-Meiboom-Gill (CPMG) pulse sequence with PRESAT for the suppression of water and high-molecular-mass compounds [21]. The one-dimensional (1D) spectra were obtained using a 1.704 s acquisition time and a 1 s relaxation delay. The TSP-d4 peak at 0.00 ppm was used as a reference to calibrate the chemical shift [22]. Correlation spectroscopy (COSY), a homonuclear two-dimensional (2D) experiment, was recorded and processed with 128 scans and 256 t1 increments.

Metabolic profiling
All spectra were phased and baseline-corrected manually. Metabolites in the spectra were assigned using Chenomx NMR Suite 7.1 professional with the Chenomx 600 MHz library database (Chenomx Inc., Canada). Metabolite concentrations were calculated relative to the concentration of TSP-d4 (2 mM). Overlapping signal regions were analyzed using 2D correlation spectroscopy (COSY) NMR spectra [23].

Data pre-processing
Each NMR spectrum was binned from 0.83 to 6.8 ppm with a bin size of 0.01 ppm, and the water peak and spinning sideband regions (1.15-1.185, 1.948-1.995, and 4.68-4.88 ppm) were excluded using Chenomx NMR Suite 7.1 professional. To minimize concentration differences between samples, the binned data were normalized to the total area. In this procedure, each NMR spectrum was divided into 480 variables. The pre-processed dataset was used to train and validate the model for classifying ginseng cultivation age. The dataset (385 samples) was randomly divided into a training dataset (70%) and a test dataset (30%).
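The pre-processing steps described above (fixed-width binning, exclusion of regions, total-area normalization, and a random 70/30 split) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in Chenomx: the function names are hypothetical, a bin is dropped when its centre falls inside an excluded region, and the random seed is arbitrary, so the resulting number of variables need not match the count reported in the text.

```python
import random


def bin_spectrum(ppm, intensity, lo=0.83, hi=6.8, width=0.01, excluded=()):
    """Sum intensities into fixed-width ppm bins and drop excluded regions."""
    n_bins = int(round((hi - lo) / width))
    bins = [0.0] * n_bins
    for x, y in zip(ppm, intensity):
        if lo <= x < hi:
            idx = min(int((x - lo) / width), n_bins - 1)
            bins[idx] += y
    kept = []
    for i, value in enumerate(bins):
        centre = lo + (i + 0.5) * width
        # Assumption: a bin is excluded when its centre lies in an excluded region.
        if any(a <= centre <= b for a, b in excluded):
            continue
        kept.append(value)
    return kept


def normalize_total_area(bins):
    """Scale the binned intensities so that they sum to 1."""
    total = sum(bins)
    if total == 0:
        raise ValueError("spectrum has zero total area")
    return [value / total for value in bins]


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Regions excluded in the text (water peak and spinning sidebands), in ppm.
EXCLUDED = [(1.15, 1.185), (1.948, 1.995), (4.68, 4.88)]
```

For example, `normalize_total_area(bin_spectrum(ppm, intensity, excluded=EXCLUDED))` would turn one spectrum into a feature vector, and `split_indices(385)` would give index lists for the training and test subsets.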

PLS-DA
Partial least-squares discriminant analysis (PLS-DA) is a multivariate data analysis method and is currently a popular classification method for metabolic analysis. The procedure was conducted using scikit-learn 0.19.2 [24] along with custom Python code. The PLS-DA model was optimized by finding the number of components that minimized the validation loss.

Deep learning
Deep learning was carried out using the Keras package 2.2.2 in Python 3.6. Figure 2 illustrates the construction of the deep neural network (DNN) model. Deep learning uses a multi-layered neural network with multiple hidden layers between the input and output layers. Each neuron takes the weighted inputs on its connections and provides an activation value as output through an activation function. Through this non-linear processing, raw data can be automatically transformed into learned features and representations [25][26][27][28]. The DNN comprised four layers: an input layer, two hidden layers, and an output layer. The rectified linear unit (ReLU) was used as the activation function in each layer [29] except for the output layer, in which softmax was used instead. Softmax is often used in the output layer for multiclass classification [30]. Dropout (dropout ratio = 0.1) and batch normalization [31] were used in the present model. Dropout is effective for avoiding overfitting, and batch normalization enables higher learning rates and acts as a regularizer. We used the 'categorical_crossentropy' loss function, the 'Adam' optimizer, and the 'accuracy' metric function to compile the model. Other hyper-parameters were set to their defaults. When training the deep learning models, we used 'ModelCheckpoint'
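To make the roles of ReLU, softmax, and categorical cross-entropy concrete, the following plain-Python sketch computes a forward pass through small dense layers and the loss for one sample. It is not the Keras model used in the study: the function names, the layer sizes, and the weights in the usage note are hypothetical, and dropout and batch normalization are omitted.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum per unit; weights[j][i] connects input i to unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Convert output-layer scores into class probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs, eps=1e-12):
    """Loss for one sample: negative log-probability of the true class."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, probs))
```

A toy usage with made-up weights: `softmax(dense(relu(dense(x, W1, b1)), W2, b2))` gives three class probabilities for an input vector `x`, and `categorical_crossentropy([0, 0, 1], probs)` is small when the third class receives a high probability.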

Results and discussion
A total of 385 ginseng samples were measured by 1H high-resolution magic angle spinning (HR-MAS) nuclear magnetic resonance (NMR) spectroscopy. HR-MAS NMR spectroscopy allows simple sample preparation without an extraction step and a short analysis time. This minimal sample handling in turn minimizes the variation introduced during sample preparation. Representative 1H-NMR spectra are shown in Fig. 3. They reflect the individual metabolites of ginseng measured by HR-MAS NMR spectroscopy. Metabolites in the spectra were assigned and quantified using Chenomx NMR Suite 7.1 professional and 2D COSY spectra, which provide information on spin-spin coupling between protons (Fig. 4) [23]. Metabolites in each spectrum were identified and quantified by targeted analysis (Table 1). However, this approach is time-consuming and costly for routine use in the quality control of ginseng. Therefore, we focused on non-targeted analyses, namely PLS-DA and deep learning, which compare the patterns of the entire spectra. Since this approach compares values calculated by digitizing the spectra, subjective interpretation can be excluded and a large amount of data can be handled simultaneously. The classification report (Table 2) and confusion matrix (Fig. 5) were generated using scikit-learn 0.19.2 [24] to show the performance of the classification models. The accuracy of the PLS-DA model in distinguishing between 4-, 5-, and 6-year-old ginseng was 87.1%. Table 2a shows the classification report of the PLS-DA model. The classification report includes the precision, recall, and F1 score for each class of the test dataset. The precision is the fraction of true positives among all observations predicted as positive. The recall is the fraction of true positives among all observations that actually belong to the class. The F1 score is the harmonic mean of the precision and recall; the optimum F1 score is 100%.
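The three definitions above translate directly into a few lines of Python. The sketch below is only an illustration of the formulas; the function name and the counts in the example are hypothetical and are not taken from Table 2.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Per-class precision, recall, and F1 score from raw counts."""
    predicted_pos = true_pos + false_pos   # everything predicted as this class
    actual_pos = true_pos + false_neg      # everything truly in this class
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for one class: 8 correct, 2 false alarms, 4 misses.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(precision, recall, f1)
```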
In the PLS-DA model, the precision, recall, and F1 score for the 4-year-old ginseng samples were better than those for the 5- and 6-year-old ginseng samples. The weighted average values for all classes were 87.4% precision, 87.1% recall, and 87.0% F1 score. Figure 5a shows the confusion matrix of the PLS-DA model. The confusion matrix provides an evaluation of the quality of the output of a classification model for the ginseng test dataset. The diagonal elements represent the number of samples for which the predicted class is equal to the true class, whereas the off-diagonal elements are those that were mislabeled by the classification model. The higher the diagonal values, the more accurate the predictions. The accuracy of the DNN model was 93.9%. In the DNN model (Table 2b), the precision, recall, and F1 score for the 4-year-old ginseng samples were also better than those for the 5- and 6-year-old ginseng samples. The average values for all classes were 93.9% precision, 93.9% recall, and 93.9% F1 score. Figure 5b shows the confusion matrix for the DNN model. All classification report values were higher for the DNN model than for the PLS-DA model. Therefore, the DNN model achieved better classification performance than the PLS-DA model. Previous studies conducted multivariate statistical analyses of NMR data to discriminate ginseng ages, using ginseng samples extracted with a solvent consisting of methanol and water [5][6][7]. As in our study, those papers compared 4- to 6-year-old ginseng, the ages most widely distributed on the market, and found that the discrimination rate for these ages was not high compared with their overall discrimination rate. In this study, using the DNN model further increased the discrimination rate, which was lower with the PLS-DA model, despite the variation between samples from different regions and harvesting times.
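The relationship between the confusion matrix, the accuracy, and the weighted-average recall can be illustrated with a short sketch. It follows the scikit-learn convention of true classes in rows and predicted classes in columns. The labels in the example are made up for illustration and are not the study's test data; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def weighted_recall(matrix):
    """Per-class recall averaged with weights equal to the class support."""
    total = sum(sum(row) for row in matrix)
    result = 0.0
    for i, row in enumerate(matrix):
        support = sum(row)
        if support:
            result += (support / total) * (row[i] / support)
    return result


# Hypothetical cultivation ages (years) for eight test samples.
y_true = [4, 4, 4, 5, 5, 6, 6, 6]
y_pred = [4, 4, 4, 5, 6, 6, 6, 5]
cm = confusion_matrix(y_true, y_pred, labels=[4, 5, 6])
print(cm, accuracy(cm), weighted_recall(cm))
```

Because each class's recall is weighted by its share of the samples, the weighted-average recall always equals the overall accuracy, which is consistent with the PLS-DA figures reported above (87.1% for both).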
These results demonstrate that HR-MAS NMR spectroscopy with deep learning, which achieved better classification performance than PLS-DA, might be a useful model for the classification of ginseng cultivation age. The schematic workflow for the classification of ginseng cultivation age using HR-MAS NMR spectroscopy with deep learning is shown in Fig. 6. In the present study, the number of ginseng samples was relatively small, which might have increased overfitting in the classification models. A limited amount of training data relative to the number of learnable parameters can lead to overfitting, which is a major challenge in training models. Nevertheless, we obtained an encouraging result by applying dropout, batch normalization, and model checkpointing to prevent overfitting. The classification capability could be improved further by using more data. Moreover, by using HR-MAS NMR, we were able to reduce the sampling variation and achieve reproducibility. Furthermore, samples from two different cultivation areas and three different harvesting years helped to improve the reliability of the results in classifying ginseng cultivation age.
Fig. 6 The schematic workflow for the classification of ginseng cultivation age. Powdered ginseng samples were analysed using HR-MAS NMR, and the acquired 1H NMR spectra were preprocessed. After preprocessing, deep learning was used to build a prediction model for classifying ginseng cultivation age. When new data become available, the deep learning model is retrained and updated.