Machine learning for a rapid discrimination of ginseng cultivation age using 1H-NMR spectra
Applied Biological Chemistry volume 63, Article number: 64 (2020)
The scientific and systematic classification of cultivation age is important for preventing age falsification and ensuring the quality of ginseng. We therefore applied deep learning to classify the cultivation age of ginseng. Deep learning, which is based on artificial neural networks, is a recent, state-of-the-art class of machine learning models. It is a powerful tool that has been used to solve complex problems in many fields. In the present study, powdered samples of 4-, 5-, and 6-year-old ginseng were measured using high-resolution magic angle spinning nuclear magnetic resonance (HR-MAS NMR) spectroscopy. The NMR data were analyzed with partial least-squares discriminant analysis (PLS-DA) and with deep learning to improve accuracy. The accuracy of the PLS-DA model was 87.1% and that of the deep learning model was 93.9%. NMR spectroscopy combined with deep learning can be a useful tool for discriminating ginseng cultivation age.
Ginseng, a perennial crop, has been used as a natural medicinal ingredient for thousands of years in East Asia. Ginseng cultivated for 4–6 years is used as a medicine. However, 6-year-old ginseng is known to be the most effective and is therefore the most expensive. It is thus important to establish an efficient and accurate model for classifying and predicting ginseng cultivation age.
Previous studies conducted metabolic profiling and multivariate statistical analyses of the origin, species, variety, age, and processing method of ginseng using nuclear magnetic resonance (NMR) spectroscopy, and all of these studies analyzed ginseng extracts [3,4,5,6,7,8]. In this study, which analyzed a large number of samples, the ginseng powder was instead used directly without extraction, because extraction solvents and procedures can affect the data. Several previous studies used a variety of tools for age discrimination of ginseng, such as a combination of NMR spectroscopy, ultrahigh-performance liquid chromatography quadrupole time-of-flight mass spectrometry (UPLC-QTOF/MS), and gas chromatography quadrupole time-of-flight mass spectrometry (GC-TOF/MS), as well as Fourier-transform infrared (FT-IR) spectroscopy. The HR-MAS technique combines the advantages of solid-state and solution-state NMR. Spinning the sample rapidly at the magic angle (54.7°) with respect to the static magnetic field (B0) reduces the line broadening caused by dipolar coupling and chemical shift anisotropy (Fig. 1), so high resolution can be achieved [11, 12]. The technique yields spectra with a resolution similar to that of solution-state NMR without a sample extraction step. An advantage of HR-MAS over solution-state NMR is that samples are measured in a state close to intact. Furthermore, because HR-MAS requires no extraction, it reduces the experiment time.
In this study, to discriminate the age of ginseng harvested over several years, 1H HR-MAS NMR spectroscopy data were analyzed by partial least-squares discriminant analysis (PLS-DA), a machine learning method. PLS-DA is widely used in metabolic analyses [15,16,17,18]. We also applied deep learning to improve the accuracy of ginseng age discrimination. Deep learning architectures such as deep neural networks (DNN) are a recent, state-of-the-art class of machine learning models. They are powerful tools that have been used to solve complex problems in many fields. Many important advances in machine learning have been made recently following improvements in computing power and in the amount of data available. Moreover, open-source machine learning libraries and frameworks have made deep learning more accessible to non-experts. This study is meaningful because it is the first to apply machine learning to HR-MAS NMR data.
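The 54.7° value quoted above follows from the standard result that dipolar coupling and chemical shift anisotropy scale with the second Legendre polynomial (3cos²θ − 1)/2, which vanishes at θ = arccos(1/√3). The short check below is only an illustration of that relation; the function name `p2` is introduced here and does not come from the study.

```python
import math

def p2(theta_deg):
    """Second Legendre polynomial (3*cos^2(theta) - 1) / 2 for an angle in degrees."""
    c = math.cos(math.radians(theta_deg))
    return 0.5 * (3.0 * c * c - 1.0)

# The magic angle is the angle at which p2 vanishes: arccos(1 / sqrt(3)).
magic_angle_deg = math.degrees(math.acos(1.0 / math.sqrt(3.0)))

print(f"magic angle = {magic_angle_deg:.2f} degrees")
print(f"scaling factor at 0 degrees     = {p2(0.0):.3f}")
print(f"scaling factor at magic angle   = {p2(magic_angle_deg):.1e}")
```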
Materials and methods
Panax ginseng roots were cultivated in the experimental field at Kyung Hee University and at the Department of Herbal Crop Research located in Gangwon Province, according to the protocol described in the ‘ginseng GAP standard cultivation guide’ developed by the Rural Development Administration, Republic of Korea (Rural Development Administration, 2009). Four-, five-, and six-year-old ginseng roots were harvested in October 2014, October 2015, and October 2016, respectively. The voucher specimens (NIHHS141010 and NIHHS161001) were deposited at the herbarium of the Department of Herbal Crop Research, NIHHS, RDA, Eumseong, Republic of Korea.
We analyzed a total of 385 ginseng samples: 162 4-year-old, 108 5-year-old, and 115 6-year-old samples. Each sample was washed, dried at 40 °C in a forced-air convection-drying oven for 48 h, and then weighed. The main roots were used for the experiments after removing the lateral and fine roots. The roots were ground (< 0.5 mm) and thoroughly mixed using a Hanil Scientific Inc. mixer (Seoul, Korea), and the subsamples were homogenized further using a Retsch MM 400 mixer mill (Retsch GmbH, Haan, Germany) for the analyses.
For the high-resolution magic angle spinning (HR-MAS) nuclear magnetic resonance (NMR) spectroscopy measurements, each ginseng sample (3 mg) was transferred to a 4-mm HR-MAS NMR sample tube (Agilent Technologies, Santa Clara, CA, USA). Heavy water (D2O; 37 μL) containing 2 mM of 3-(trimethylsilyl) propionic-2,2,3,3-d4 acid sodium salt (TSP-d4) was added to each sample tube.
All NMR spectra were measured using a 600.167 MHz Agilent spectrometer equipped with a 4-mm gHX NanoProbe (Agilent Technologies). The spinning rate was 2050 Hz for HR-MAS. We collected 128 transients using the Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence with PRESAT for the suppression of water and high-molecular-mass compounds. The one-dimensional (1D) spectra were obtained with a 1.704 s acquisition time and a 1 s relaxation delay. The TSP-d4 peak at 0.00 ppm was used as a reference to calibrate the chemical shift. In addition, correlation spectroscopy (COSY), a homonuclear two-dimensional (2D) experiment, was recorded and processed with 128 scans and 256 t1 increments.
All spectra were phased and baseline-corrected manually. Metabolites in the spectra were assigned using Chenomx NMR Suite 7.1 professional with the Chenomx 600 MHz library database (Chenomx Inc., Canada). Concentrations of metabolites were calculated based on the concentration of TSP-d4 at 2 mM. Overlapped signal regions were analyzed using 2D correlation spectroscopy (COSY) NMR spectra.
Each NMR spectrum was binned from 0.83 to 6.8 ppm, and the water peak area and spinning sideband areas (1.15–1.185, 1.948–1.995, and 4.68–4.88 ppm) were excluded using Chenomx NMR Suite 7.1 professional. The binning size was 0.01 ppm. To minimize the difference in concentration between the samples, the binned data were normalized to the total area. In this procedure, the NMR spectrum was divided into 480 variables. The pre-processed dataset was used to train and validate the model for classifying ginseng cultivation age. The dataset (385 samples) was randomly divided into a training dataset (70%) and a test dataset (30%).
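The pre-processing steps described above (fixed-width binning, exclusion of the water and sideband regions, total-area normalization, and a random 70/30 split) can be sketched with NumPy as follows. This is a minimal illustration, not the Chenomx procedure used in the study: the function names, the bin-center rule for deciding which bins fall in an excluded region, the random seed, and the synthetic spectrum are all assumptions, so the number of variables it produces need not match the 480 reported.

```python
import numpy as np

# Excluded regions (ppm) as listed in the text: spinning sidebands and water.
EXCLUDED_PPM = [(1.15, 1.185), (1.948, 1.995), (4.68, 4.88)]

def bin_spectrum(ppm, intensity, lo=0.83, hi=6.8, width=0.01, excluded=EXCLUDED_PPM):
    """Sum intensities into fixed-width bins, drop excluded bins, normalize to total area."""
    n_bins = int(round((hi - lo) / width))
    edges = lo + width * np.arange(n_bins + 1)
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = np.ones(n_bins, dtype=bool)
    for a, b in excluded:
        # Assumption: a bin is excluded when its center lies inside the region.
        keep &= ~((centers >= a) & (centers <= b))
    binned = sums[keep]
    return centers[keep], binned / binned.sum()

def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and test sets."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(train_fraction * n_samples)
    return order[:n_train], order[n_train:]

# Synthetic spectrum for demonstration only (one Gaussian peak on a flat baseline).
ppm = np.linspace(0.0, 10.0, 20001)
intensity = np.exp(-((ppm - 3.0) / 0.05) ** 2) + 0.01
centers, features = bin_spectrum(ppm, intensity)

train_idx, test_idx = split_indices(385)
print(len(features), round(float(features.sum()), 6), len(train_idx), len(test_idx))
```

Normalizing each binned spectrum to unit total area makes the features comparable between samples that differ in the amount of material in the rotor.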
Partial least-squares discriminant analysis (PLS-DA) is a multivariate data analysis method and is currently a popular classification method for metabolic analysis. The procedure was conducted using scikit-learn 0.19.2 along with custom Python code. The PLS-DA model was optimized by finding the number of components that minimizes the validation loss.
Deep learning was carried out using the Keras package 2.2.2 in Python 3.6. Figure 2 illustrates the construction of the deep neural network (DNN) model. A deep learning model is a multi-layered neural network with multiple hidden layers between the input and output layers. Each neuron takes a weighted sum of its inputs and produces an activation value as output through an activation function. Through this non-linear processing, raw data can be automatically transformed into learned features and representations [25,26,27,28]. The DNN comprises four layers: an input layer, two hidden layers, and an output layer. The rectified linear unit (ReLU) was used as the activation function in each layer except the output layer, in which softmax was used instead. Softmax is often used in the output layer for multiclass classification. Dropout (dropout ratio = 0.1) and batch normalization were used in the present model. Dropout is effective for avoiding overfitting, and batch normalization enables higher learning rates and provides regularization. We compiled the model with the ‘categorical_crossentropy’ loss function, the ‘Adam’ optimizer, and the ‘accuracy’ metric. Other hyper-parameters were left at their defaults. When training the deep learning models, we used ‘ModelCheckpoint’ so that a model was saved only when the validation loss improved on the best value so far. The stored best model was then used to make predictions on the test dataset.
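To make the layer arithmetic concrete, the following NumPy sketch runs a forward pass through a network with the shape described above (an input layer, two ReLU hidden layers, and a softmax output over three age classes) and evaluates the categorical cross-entropy loss. It is not the Keras model from the study: the hidden-layer width of 64, the random weights, and the random inputs are hypothetical, and dropout and batch normalization are left out for brevity.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(z, 0) element-wise."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, params):
    """Two ReLU hidden layers followed by a softmax output layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)

def categorical_crossentropy(Y_onehot, P, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(Y_onehot * np.log(P + eps), axis=1)))

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 480, 64, 3   # n_hidden = 64 is a hypothetical width
shapes = [(n_features, n_hidden), (n_hidden, n_hidden), (n_hidden, n_classes)]
params = [(rng.normal(0.0, 0.05, size=s), np.zeros(s[1])) for s in shapes]

X = rng.random((5, n_features))              # five made-up binned spectra
Y = np.eye(n_classes)[[0, 1, 2, 0, 1]]       # made-up one-hot age labels
P = forward(X, params)
loss = categorical_crossentropy(Y, P)
print(P.shape, round(loss, 4))
```

Each row of `P` is a probability distribution over the three classes, which is why softmax is paired with the categorical cross-entropy loss in multiclass problems.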
Results and discussion
A total of 385 ginseng samples were measured by 1H high-resolution magic angle spinning (HR-MAS) nuclear magnetic resonance (NMR) spectroscopy. HR-MAS NMR spectroscopy allows simple sample preparation without an extraction step and a short analysis time. This minimal sample handling reduces the variation introduced during sample preparation. Representative 1H-NMR spectra are shown in Fig. 3. They reflect the individual metabolites of ginseng measured by HR-MAS NMR spectroscopy. Metabolites in the spectra were assigned and quantified using Chenomx NMR Suite 7.1 professional and 2D COSY spectra, which provide information on spin–spin coupling between protons (Fig. 4). Metabolites in each spectrum were identified and quantified by targeted analysis (Table 1). However, this approach is time-consuming and costly for routine use in the quality control of ginseng. Therefore, we focused on non-targeted analyses, namely PLS-DA and deep learning, which compare the patterns of entire spectra. Because this approach compares values calculated from digitized spectra, subjective interpretation can be excluded and a large amount of data can be handled simultaneously.
The classification report (Table 2) and confusion matrix (Fig. 5) were generated using scikit-learn 0.19.2 to show the performance of the classification models. The accuracy of the PLS-DA model in distinguishing between 4-, 5-, and 6-year-old ginseng was 87.1%. Table 2a shows the classification report of the PLS-DA model. The classification report includes the precision, recall, and F1 score for each class of the test dataset. The precision is the fraction of true positives among all observations predicted as positive. The recall is the fraction of true positives among all observations that actually belong to the class. The F1 score is the harmonic mean of the precision and recall; the optimum F1 score is 100%. In the PLS-DA model, the precision, recall, and F1 score for the 4-year-old ginseng samples were better than those for the 5- and 6-year-old ginseng samples. The weighted average values for all classes were 87.4% precision, 87.1% recall, and 87.0% F1 score. Figure 5a shows the confusion matrix of the PLS-DA model. The confusion matrix provides an evaluation of the quality of the output of a classification model for the ginseng test dataset. The diagonal elements represent the number of samples for which the predicted class is equal to the true class, whereas the off-diagonal elements are those that are mislabeled by the classification model. The higher the diagonal values, the more accurate the predictions are. The accuracy of the DNN model was 93.9%. In the DNN model (Table 2b), the precision, recall, and F1 score for the 4-year-old ginseng samples were also better than those for the 5- and 6-year-old ginseng samples. The average values for all classes were 93.9% precision, 93.9% recall, and 93.9% F1 score. Figure 5b shows the confusion matrix for the DNN model. All classification report values were higher in the DNN model than in the PLS-DA model. Therefore, the DNN model achieved better classification performance than the PLS-DA model.
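The definitions above can be written directly in terms of the confusion matrix. The sketch below computes per-class precision, recall, and F1 score together with the support-weighted averages. The matrix values are hypothetical and chosen only for illustration; they are not the values in Fig. 5, and the function name is introduced here.

```python
import numpy as np

def report_from_confusion(cm):
    """Per-class and weighted metrics from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # correctly classified samples per class
    precision = tp / cm.sum(axis=0)        # column sums: samples predicted as each class
    recall = tp / cm.sum(axis=1)           # row sums: samples truly in each class
    f1 = 2.0 * precision * recall / (precision + recall)
    weights = cm.sum(axis=1) / cm.sum()    # class support as a fraction of all samples
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "weighted_precision": float(weights @ precision),
        "weighted_recall": float(weights @ recall),
        "weighted_f1": float(weights @ f1),
    }

# Hypothetical confusion matrix for 4-, 5-, and 6-year-old classes (illustration only).
cm = [[18, 2, 0],
      [1, 16, 3],
      [0, 2, 18]]
report = report_from_confusion(cm)
for name in ("accuracy", "weighted_precision", "weighted_recall", "weighted_f1"):
    print(f"{name}: {report[name]:.3f}")
```

The support-weighted recall is algebraically identical to the overall accuracy, which is consistent with the weighted recall and the accuracy of the PLS-DA model both being reported as 87.1%.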
Previous studies conducted multivariate statistical analyses of NMR data to discriminate ginseng ages, using ginseng samples extracted with a solvent consisting of methanol and water [5,6,7]. In those papers, the discrimination rate for 4- to 6-year-old ginseng, the ages compared in our study and the most widely distributed on the market, was not high relative to the overall discrimination rate. In this study, the DNN model further increased the discrimination rate, which was lower in the PLS-DA model, despite the variation between samples from different regions and harvest times.
These results demonstrate that HR-MAS NMR spectroscopy with deep learning, which achieved better classification performance than PLS-DA, might be a useful model for the classification of ginseng cultivation age. The schematic workflow for the classification of ginseng cultivation age using HR-MAS NMR spectroscopy with deep learning is shown in Fig. 6. In the present study, the ginseng dataset was relatively small, which might have increased overfitting in the classification models. A limited amount of training data relative to the number of learnable variables can lead to overfitting, which is a major challenge in training models. Nevertheless, we obtained an encouraging result by applying dropout, batch normalization, and model checkpointing to limit overfitting. The classification capability could be improved by using more data. Moreover, by using HR-MAS NMR, we were able to reduce the sampling variation and achieve reproducibility. Furthermore, samples from two different cultivation areas and three different harvesting years helped to improve the reliability of the results in classifying ginseng cultivation age.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Yun TK (2001) Brief introduction of Panax ginseng CA Meyer. J Korean Med Sci 16(4):16–18
Lee DY, Cho JG, Bang MH, Han MW, Lee MH, Yang DC, Baek NI (2011) Discrimination of Korean ginseng (Panax ginseng) roots using rapid resolution LC-QTOF/MS combined by multivariate statistical analysis. Food Sci Biotechnol 20(4):1119–1124
Kang J, Lee S, Kang S, Kwon HN, Park JH, Kwon SW, Park S (2008) NMR-based metabolomics approach for the differentiation of ginseng (Panax ginseng) roots from different origins. Arch Pharm Res 31(3):330–336
Lee EJ, Rustem S, Weljie AM, Vogel HJ, Facchini PJ, Park SU, Kim YK, Yang TJ (2009) Quality assessment of ginseng by 1H NMR metabolite fingerprinting and profiling analysis. J Agric Food Chem 57(16):7513–7522
Yang SO, Shin YS, Hyun SH, Cho S, Bang KH, Lee D, Choi PC, Choi HK (2012) NMR-based metabolic profiling and differentiation of ginseng roots according to cultivation ages. J Pharm Biomed Anal 58(1):19–26
Lin WN, Lu HY, Lee MS, Yang SY, Chen HJ, Chang YS, Chang WT (2010) Evaluation of the cultivation age of dried ginseng radix and its commercial products by using (1)HNMR fingerprint analysis. Am J Chin Med 38(1):205–218
Shin YS, Bang KH, In DS, Kim OT, Hyun DY, Ahn IO, Ku BC, Kim SW, Seong NS, Cha SW, Lee D, Choi HK (2007) Fingerprinting analysis of fresh ginseng roots of different ages using 1H-NMR spectroscopy and principal components analysis. Arch Pharm Res 30:1625–1628
Kim SH, Hyun SH, Yang SO, Choi HK, Lee BY (2010) (1)H NMR-based discrimination of thermal and vinegar treated ginseng roots. J Food Sci 75(6):C577–C581
Yoon D, Choi BR, Ma S, Lee JW, Jo IH, Lee YS, Kim GS, Kim S, Lee DY (2019) Metabolomics for age discrimination of ginseng using a multiplex approach to HR-MAS NMR spectroscopy, UPLC–QTOF/MS, and GC × GC–TOF/MS. Molecules 24(13):2381
Lee BJ, Kim HY, Lim SR, Huang L, Choi HK (2017) Discrimination and prediction of cultivation age and parts of Panax ginseng by Fourier-transform infrared spectroscopy combined with multivariate statistical analysis. PLoS ONE 12(10):e0186664
Flores IS, Martinelli BC, Pinto VS, Queiroz LH Jr, Lião LM (2019) Important issues in plant tissues analyses by HR-MAS NMR. Phytochem Anal 30(1):5–13
Cruciani O, Mannina L, Sobolev AP, Segre A, Luisi P (2004) Multilamellar liposomes formed by phosphatidyl nucleosides: an NMR-HR-MAS characterization. Langmuir 20(4):1144–1151
Mazzei P, Piccolo A (2017) HRMAS NMR spectroscopy applications in agriculture. Chem Bio Agro 4(1):1–13
Taylor JL, Wu CL, Cory D, Gonzalez RG, Bielecki A, Cheng LL (2003) High-resolution magic angle spinning proton NMR analysis of human prostate tissue with slow spinning rates. Magn Reson Med 50(3):627–632
Vignoli A, Rodio DM, Bellizzi A, Sobolev AP, Anzivino E, Mischitelli M, Tenori L, Marini F, Priori R, Scrivo R, Valesini G, Francia A, Morreale M, Ciardi MR, Iannetta M, Campanella C, Capitani D, Luchinat C, Pietropaolo V, Mannina L (2017) NMR-based metabolomic approach to study urine samples of chronic inflammatory rheumatic disease patients. Anal Bioanal Chem 409(5):1405–1413
Worley B, Powers R (2013) Multivariate analysis in metabolomics. Curr Metabol 1(1):92–107
Hong YS (2011) NMR-based metabolomics in wine science. Magn Reson Chem 49:13–21
Hu B, Gao J, Xu S, Zhu J, Fan X, Zhou X (2020) Quality evaluation of different varieties of dry red wine based on nuclear magnetic resonance metabolomics. Appl Biol Chem 63(1):1–8
Mohsen H, El-Dahshan ESA, El-Horbaty ESM, Salem ABM (2017) Classification using deep learning neural networks for brain tumors. Future Comput Inform J 3(1):68–71
Shen D, Wu G, Suk HI (2017) Deep learning in medical image analysis. Annu Rev Biomed Eng 19(1):221–248
Mun JH, Lee H, Yoon D, Kim BS, Kim MB, Kim S (2016) Discrimination of basal cell carcinoma from normal skin tissue using high-resolution magic angle spinning 1H NMR spectroscopy. PLoS ONE 11(3):1–10
Raja G, Kim S, Yoon D, Yoon C, Kim S (2017) 1H-NMR-based metabolomics studies of the toxicity of mesoporous carbon nanoparticles in Zebrafish (Danio rerio). Bull Korean Chem Soc 38(2):271–277
Féraud B, Govaerts B, Verleysen M, de Tullio P (2015) Statistical treatment of 2D NMR COSY spectra in metabolomics: data preparation, clustering-based evaluation of the metabolomic informative content and comparison with 1H-NMR. Metabolomics 11(6):1756–1768
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Ron W, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830. http://scikit-learn.org
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, Walliander M, Lundin M, Haglund C, Lundin J (2018) Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 8(1):1–11
Goh GB, Hodas NO, Vishnu A (2017) Deep learning for computational chemistry. J Comput Chem 38(16):1291–1307
Chollet F (2017) Deep learning with python. Manning Publications
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10); p. 807–814
Nogueira K, Penatti OA, Dos Santos JA (2017) Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recogn 61:539–556
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167
This work was carried out with the support of “Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01175703)” Rural Development Administration, Republic of Korea.
Rural Development Administration (Republic of Korea), Project Number PJ01175703
The authors declare that they have no competing interests.
Cite this article
Lee, W., Yoon, D., Ma, S. et al. Machine learning for a rapid discrimination of ginseng cultivation age using 1H-NMR spectra. Appl Biol Chem 63, 64 (2020). https://doi.org/10.1186/s13765-020-00548-4
- Deep learning
- Non-targeted analysis