A comprehensive database of human and livestock fecal microbiome for community-wide microbial source tracking: a case study in South Korea

Fecal waste from livestock farms contains numerous pathogens, and improperly managed waste may flow into water bodies, causing water-borne diseases. Along with the popularization of high-throughput technologies, community-wide microbial source-tracking methods have been actively developed in recent years. This study aimed to construct a comprehensive fecal microbiome database for community-wide microbial source tracking and apply the database to identify contamination sources in the Miho River, South Korea. Total DNA was extracted from the samples, and the 16S rRNA gene was amplified to characterize the microbial communities. The fecal microbiome database was validated by developing machine-learning models that predict host species based on microbial community structure. All machine learning models developed in this study showed high performance, where the area under the receiver operating characteristic curve was approximately 1. Community-wide microbial source tracking results showed a higher contribution of fecal sources to the contamination of the main streams after heavy rain. In contrast, the contribution of fecal sources remained comparatively stable in tributaries after rainfall. Considering that farms are more concentrated upstream of tributaries compared to the main streams, this result implies that the pathway for manure contaminants to reach the main streams could be groundwater rather than surface runoff. Systematic monitoring of the water quality, which encompasses river water and groundwater, should be conducted in the future. In addition, continuous efforts to identify and plug abandoned wells are necessary to prevent further water contamination.


Introduction
Global demand for meat has increased significantly, resulting in extensive livestock farming and increased manure generation [1]. A large amount of manure is stockpiled before it is used as fertilizer or temporarily stored before it is transported to livestock waste facilities for energy production and disposal [2][3][4][5][6]. Many farms, owing to a lack of indoor spaces, store livestock manure in open areas where it can leach into stream waters through surface runoff or into groundwater. Because fecal materials from livestock contain a myriad of pathogens that may cause waterborne diseases, it is important to properly manage livestock manure and track leaching events to prevent disease outbreaks [2,7].
With the advent of high-throughput sequencing methods, community-wide approaches to microbial source tracking (MST) have emerged. Knights et al. [18] introduced a bioinformatics tool for community-wide MST called SourceTracker. SourceTracker models the contributions of source communities to the contamination of sink communities. Staley et al. [19] reported the applicability of SourceTracker for MST through double-blind tests using samples spiked with one to five source libraries. Unlike previous MST methods, which are limited to detecting predetermined fecal indicator bacteria, community-wide approaches directly estimate source proportions with much higher resolution.
Community-wide MST methods are powerful, but can sometimes be resource-intensive because of the increased size of microbial community sequence data. Recently, other machine learning-based community-wide source-tracking bioinformatics tools, such as FEAST (fast expectation-maximization for microbial source tracking) [20] and STENSL (Microbial Source Tracking with Environment SeLection) [21], have been developed to overcome a few of the drawbacks of SourceTracker. FEAST is based on a highly efficient expectation-maximization method that enables community-wide MST within a practical time frame. STENSL identifies true contributing sources and reduces the noise introduced by noncontributing sources by incorporating sparsity into the model.
In South Korea, 142,155 tons of livestock manure are generated daily [22]. To understand the contribution of fecal sources to water contamination, we conducted a case study in the Miho River, South Korea, where the average number of total coliforms in 2022 reached 16,870 CFU (colony forming unit)/100 mL [23] (at sampling point MH10; Fig. 1). We first developed a comprehensive fecal microbiome database and validated it using machine learning models that predict host species based on fecal microbial community structures. Based on the constructed database, we performed a community-wide MST to track the contamination sources of the Miho River in South Korea. We aimed to diagnose the current status of fecal pollution in detail, identify its causes, and suggest appropriate control methods.

Sample collection and physicochemical measurements
In total, 633 fecal samples (125 human samples, 144 poultry samples, 116 swine samples, 42 horse samples, and 206 cow samples) were collected in Jeju and Gwangju, South Korea, between 2016 and 2020 (Supplementary Table S1) and stored at -20 °C before DNA extraction. River samples were collected from the mainstream (MH08, MH09, SY01, BR01, and MH10) and tributaries (BC02, JO02, and WH01) of the Miho River watershed, located across Cheongju (upstream) and Sejong (downstream) in South Korea, where the three major livestock species are cows, pigs, and poultry (chicken and duck) (Supplementary Table S2) (Fig. 1). Three replicate samples were collected at each sampling point before and after the heavy rain on June 26, 2023 (daily rainfall of 34.1 mm) (Supplementary Fig. S1). Water temperature, pH, and electrical conductivity (EC) were measured using a multifunction meter CX-401 (Elmetron, Poland). Water samples were filtered using cellulose nitrate filters (pore size of 0.45 μm and diameter of 47 mm) (Whatman, UK) and stored at -20 °C before DNA extraction.
The pooled library was sent to Macrogen Inc. (Seoul, South Korea) for sequencing. Fastq-formatted sequence data have been deposited in the Sequence Read Archive under project ID PRJNA1071275 for Miho River samples and PRJNA1071195 for fecal samples, except for 21 human samples that have already been published in another study [26] (Sequence Read Archive project ID PRJNA544370).

Sequence processing
Sequences were processed using Mothur software [27] following the MiSeq SOP (https://mothur.org/wiki/miseq_sop/). Sequences with ambiguous bases or homopolymers (> 8 base pairs) were removed. Sequences of < 250 bp or > 550 bp were removed. Sequences were aligned against the Silva database v. 138 [28], and the "pcr.seqs" command (with a start option of 11,895 and an end option of 25,316, which covers the 515-805 bp region of the bacterial 16S rRNA gene) was used to trim all sequences to the same region, ensuring consistency between the V4 amplicons and the V3-4 amplicons. The chimeric sequences were removed using the VSEARCH algorithm [29]. Sequences were classified based on RDP database v. 18 [30], and the sequences annotated as "Chloroplast," "Mitochondria," "unknown," and "Eukaryota" were removed. Sequences with a similarity greater than 97% were clustered into operational taxonomic units (OTUs) using the OptiClust algorithm [31].
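The ambiguity, length, and homopolymer screens described above can be illustrated with a minimal Python sketch. It is not the Mothur implementation: the function names longest_homopolymer and passes_screen are hypothetical, the input is assumed to be an unaligned read (no gap characters), and the thresholds are simply those stated in the text.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of identical bases."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def passes_screen(seq: str, min_len: int = 250, max_len: int = 550,
                  max_homop: int = 8) -> bool:
    """Apply the three screens from the text to one unaligned read.

    Hypothetical helper for illustration only; thresholds follow the text.
    """
    seq = seq.upper()
    # Any character other than A, C, G, or T counts as ambiguous (e.g. N).
    if any(base not in "ACGT" for base in seq):
        return False
    # Reads shorter than 250 bp or longer than 550 bp are removed.
    if len(seq) < min_len or len(seq) > max_len:
        return False
    # Reads with a homopolymer longer than 8 bp are removed.
    return longest_homopolymer(seq) <= max_homop
```

For example, passes_screen("ACGT" * 70) accepts a 280 bp read without ambiguous bases, whereas the same read with an added N, or with a trailing run of nine A's, is rejected.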

Statistical analysis, machine learning, and microbial source tracking
To test whether the microbial community structure varies significantly depending on the host, a pairwise permutational multivariate analysis of variance (PERMANOVA) test was performed using the "vegan" package [32] and the "ranacapa" package [33] in R. Before performing pairwise PERMANOVA, we subsampled 10,575 reads per sample and calculated the Bray-Curtis distance between samples based on the square-root transformed OTU abundance data. To visualize the distance between the samples, we generated a non-metric multidimensional scaling (nMDS) plot using the "vegan" package in R.
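To make the distance measure concrete, the following is a minimal plain-Python sketch of the Bray-Curtis dissimilarity computed on square-root transformed OTU counts. It illustrates the definition only and is not the R code used in the study; the function name bray_curtis_sqrt is hypothetical, and both samples are assumed to be count vectors over the same ordered set of OTUs.

```python
import math


def bray_curtis_sqrt(counts_a, counts_b):
    """Bray-Curtis dissimilarity after a square-root transform.

    counts_a and counts_b are OTU counts for two samples, listed in the
    same OTU order (an assumption of this sketch).
    """
    a = [math.sqrt(x) for x in counts_a]
    b = [math.sqrt(x) for x in counts_b]
    # Numerator: summed absolute differences; denominator: summed totals.
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    # Two empty samples are treated as identical in this sketch.
    return numerator / denominator if denominator else 0.0
```

The value ranges from 0 (identical composition) to 1 (no shared OTUs). The square-root transform reduces the weight of the most abundant OTUs, so that less abundant taxa contribute more to the distance.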
We constructed machine learning classification models to predict hosts based on microbial communities. The relative abundances of the genera were used as features (independent variables) of the models, and five hosts (poultry, cow, horse, human, and pig) were used as class labels (dependent variables). Unclassified genera were excluded when building machine learning models. We used five machine-learning algorithms: (1) random forest (RF) [34], (2) extreme gradient boosting (XGBoost) [35], (3) support vector machine (SVM) [36], (4) logistic regression (Logr), and (5) K-nearest neighbor (KNN) [37]. The Python "xgboost" module was used to construct the XGBoost model, and the "scikit-learn" module [38] was used to construct the other four models. Hyperparameters were tuned using the "GridSearchCV" function in "sklearn.model_selection," except for the XGBoost model, for which we used the default values due to resource limitations. The samples were randomly divided into training and test data 100 times, and the models were evaluated using 5-fold cross-validation. To identify important features in the random forest model, the "feature_importances_" attribute in the Python "scikit-learn" package was used.
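The random forest workflow described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the data are synthetic (each host is given one artificially enriched "marker" genus), the hyperparameter grid is arbitrary, and only a single train/test split is shown rather than the 100 random splits used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
hosts = ["cow", "horse", "human", "pig", "poultry"]
n_per_host, n_genera = 20, 12

# Synthetic genus-level relative abundances (each row sums to 1).
# Assumption for illustration: genus i is strongly enriched in host i.
X_parts, y = [], []
for i, host in enumerate(hosts):
    alpha = np.ones(n_genera)
    alpha[i] = 30.0
    X_parts.append(rng.dirichlet(alpha, size=n_per_host))
    y += [host] * n_per_host
X = np.vstack(X_parts)
y = np.array(y)

# One stratified split; the study repeated the random split 100 times.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Grid search with 5-fold cross-validation (illustrative grid only).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
# Impurity-based importances of the refitted best model, one per genus.
importances = search.best_estimator_.feature_importances_
top_genera = np.argsort(importances)[::-1][:5]  # indices of the top 5
```

With real data, X would hold the genus-level relative abundances of the fecal samples and y the host labels. The impurity-based importances sum to 1, so they are best read as a relative ranking of genera rather than as effect sizes.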
Community-wide MST was performed at the OTU level using the "FEAST" function in the R "FEAST" package [20]. Fecal microbiome data were used as sources, and Miho River data were used as sinks. The source contribution values of the fecal samples were summed for each host.
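The final aggregation step, summing per-sample source contributions by host, can be sketched in Python as below. The function name sum_by_host, the sample identifiers, and the numbers are hypothetical; the sketch assumes that the estimated proportions for one sink sample are available as a mapping from source identifier to proportion, and that any identifier without a host label (such as an unknown-source term) is grouped under "unknown".

```python
from collections import defaultdict


def sum_by_host(contributions, sample_to_host):
    """Sum per-source proportions for one sink sample by host group.

    contributions: {source_sample_id: estimated proportion}
    sample_to_host: {source_sample_id: host label}
    Identifiers missing from sample_to_host are pooled as "unknown"
    (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for sample_id, proportion in contributions.items():
        host = sample_to_host.get(sample_id, "unknown")
        totals[host] += proportion
    return dict(totals)


# Hypothetical example values for a single river sample.
example = sum_by_host(
    {"cow_01": 0.02, "cow_02": 0.03, "pig_01": 0.10, "Unknown": 0.85},
    {"cow_01": "cow", "cow_02": "cow", "pig_01": "pig"},
)
```

Summing by host in this way yields one contribution value per host and sink sample, which is the form needed for a per-host comparison across sampling points.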

Community composition of the fecal samples and the Miho River freshwater samples
At the phylum level, Firmicutes comprised over 60% of the gut microbiome in poultry samples and were dominant in other fecal sources (Fig. 2A). Bacteroidetes were dominant in fecal samples, except in poultry. The freshwater samples collected before heavy rain were dominated by Proteobacteria, Bacteroidetes, and Cyanobacteria. In contrast, freshwater samples collected after heavy rainfall were dominated by Proteobacteria, followed by Actinobacteria and Bacteroidetes.
At the genus level, Prevotella was dominant in human and pig samples but not in the other samples (Fig. 2B). In human samples, Phocaeicola, Bacteroides, Faecalibacterium, and Bifidobacterium were dominant. Lactobacillus was dominant in both pig and poultry samples. In poultry samples, Romboutsia and Streptococcus were dominant. Cow samples were dominated by Phocaeicola and Alistipes, whereas horse samples were dominated by Treponema and Methanocorpusculum. No overlap exists in the major (top five) genera between the fecal and freshwater samples.
The nMDS results showed strong clustering for each sample group (Fig. 2C). PERMANOVA results showed significant differences between the different sample groups (global R² = 0.51637, p < 0.001; pairwise test results are shown in Supplementary Table S3). The microbial community in freshwater samples shifted towards the fecal microbiomes after heavy rain (Fig. 2C, Supplementary Table S3). On the nMDS plot, the horse samples were located farthest from the freshwater samples.

Verification of the fecal microbiome database using machine learning
To evaluate the fecal microbiome database constructed in this study, we built machine-learning models that predicted hosts based on fecal microbial community composition. The hyperparameters used to construct the final models are listed in Supplementary Table S4. All five machine learning classification models performed well, with areas under the receiver operating characteristic curves of approximately 1 (Fig. 3A, Supplementary Fig. S2, and Table S5). The 20 most important features identified in the RF model included major genera represented in Fig. 2B, such as Faecalibacterium, Alistipes, Methanobrevibacter, Treponema, and Romboutsia, as well as minor genera such as Paeniclostridium, Clostridioides, Paludibacter, Monoglobus, and Ihubacter (Fig. 3B and C).

Community-wide microbial source tracking of the Miho River
The water temperature was approximately 24-27 °C during the sampling period, both before and after the heavy rain (Table 1). The pH and EC decreased after rainfall at most sampling points. The MST results demonstrated that freshwater samples were contaminated with the fecal microbiome of humans, cows, pigs, and poultry to a minor extent before heavy rain (Fig. 4). However, the source contributions of human, cow, pig, and especially poultry samples increased after rain in the mainstream. In contrast, there were relatively smaller changes in the source contribution profiles of tributary samples, such as BC02 and WH01, after the rain. Overall, the downstream samples (BR01 and MH10) showed a highly contaminated profile after rain compared to the other samples. Regardless of sampling time, the horse fecal microbiome had nearly zero contribution.

Discussion
In this study, we constructed a comprehensive fecal microbiome database based on 633 fecal samples collected from poultry, cows, horses, pigs, and humans in South Korea for community-wide MST. The database constructed in this study can save time and effort in collecting fecal samples and facilitate comparative studies. Traditional MST methods, such as rep-PCR and PFGE, have a high possibility of producing false negatives (failure to identify a source when present), as the target indicator bacteria (Escherichia coli, Enterococci, etc.) constitute only a small proportion of the overall community. This issue can be resolved in community-wide MST, as it does not target a single bacterial species but instead considers multiple species together.
Traditional MST methods have limitations in distinguishing different host species from contamination sources. In this study, the fecal microbiomes were distinguishable by different host groups, as reported in many other studies [39,40]. The machine learning models constructed in this study showed nearly 100% accuracy in predicting host groups based on fecal microbial community structures. The features that contributed most to accurate prediction in the RF model included the major genera and the minor ones, which have often been neglected. This indicates that these minor genera can function as important indicators and help enhance the resolution of community-wide MST.
The MST results for the Miho River revealed that humans, chickens, cows, and pigs were the main contributors to fecal contamination. The contribution of poultry samples was generally higher than that of other sources, particularly after rainfall. This could be due to the high number of poultry farms and poultry individuals in the study area (Supplementary Table S2). Horses contributed marginally to the contamination of the Miho River both before and after the rain. This corresponds to the low number of horse farms and horses in the study area (Supplementary Table S2).
Tributary samples, such as WH01 and BC02, were only minimally contaminated before and after the rain, even though several livestock farms were located upstream of these tributaries (Fig. 1). In contrast, in the mainstream, especially downstream (BR01 and MH10), there was more severe contamination after rain, even though few livestock farms were located nearby. This suggests that the contamination of the mainstream may not originate from surface water flushing from tributaries or nearby surface water, but likely from groundwater. In South Korea, it has been estimated that there are more than one million abandoned tubular wells [41]. Abandoned wells function as direct channels for surface contaminants to pollute groundwater; therefore, an appropriate plugging method is generally required. In the study area, it has been reported that the quality of groundwater originating from abandoned tubular wells is lower than that of general groundwater [41]. Although the local governments of Cheongju and Sejong have made considerable efforts to plug abandoned tubular wells annually, the results of this study suggest that there are still many unmanaged wells that contaminate groundwater and, subsequently, the mainstream of the Miho River. Further source tracking of groundwater could be helpful for assessing the contamination source locations in detail.
The results of this study show that the aquatic environment in farming areas can be contaminated by diverse fecal sources after rainfall. A strong demand exists for proper monitoring plans and surveillance systems to improve water quality. In addition, government support for proper livestock waste storage systems may be helpful. Further efforts are necessary to identify and plug the abandoned tubular wells.
In this study, we identified the possible causes of water contamination by applying community-wide microbial source tracking methods. In recent years, more sophisticated MST methods have been developed, such as SNV-FEAST, which uses single nucleotide variants for MST [42]. However, these metagenome-based methods are resource-intensive and require high-performance servers, which limits their range of applications. Currently, 16S rRNA gene-based community-wide source tracking is a reliable and cost-effective MST method. The database construction and validation methods used here and the case study of the Miho River can be applied to other source-tracking studies and can aid policy decision-making processes. As the fecal microbiome can vary geographically [43,44], region- or country-specific databases must be developed before performing community-wide MST.

Fig. 1 A map showing the land-use types of the study area and the sampling points. Water flows from the north to the south. Land-use information was collected from the National Geographic Information Institute of Korea. The map image was generated using QGIS 3.34.3.

Fig. 2 (A) Phylum-level composition of the studied samples. The top 5 phyla (except unclassified) for each sample group were chosen. (B) Genus-level composition of the studied samples. The top 5 genera (except unclassified) for each sample group were chosen. (C) An nMDS plot generated based on the OTU-level composition of the samples

Fig. 4 Boxplots representing the source contributions of the fecal microbiomes to the Miho River samples

Table 1 Physicochemical conditions of the Miho River before and after heavy rain