Introduction
Infections caused by Campylobacter spp. represent a severe threat to public health worldwide. National action plans have included source attribution studies as a way to quantify the contribution of specific sources and to understand the transmission dynamics of foodborne pathogens such as Salmonella and Campylobacter. Such information is crucial for implementing targeted interventions. The aim of this study was to predict the sources of human campylobacteriosis cases across multiple countries using available whole-genome sequencing (WGS) data and to explore the impact of data availability and sample size distribution in a multi-country source attribution model.

Methods
We constructed a machine-learning model using k-mer frequency patterns as input data to predict the source of human campylobacteriosis cases. We then constructed a multi-country model based on data from all countries. Results obtained with different sampling strategies were compared to assess the impact of unbalanced datasets on the prediction of cases.

Results
The variety of sources sampled and the number of samples from each source affected the performance of the model. Most cases were attributed to broilers or cattle in both the individual and the multi-country models. The proportion of cases that could be attributed to a source with 70% probability decreased when using the down-sampled data set (535 vs. 273 of 2627 cases). The baseline model showed a higher sensitivity than the down-sampled model, in which samples per source were more evenly distributed. The proportion of cases attributed to non-domestic sources was higher but varied depending on the sampling strategy.
Both models showed that most cases could be attributed to domestic sources in each country (baseline: 248/273 cases, 91%; down-sampled: 361/535 cases, 67%).

Discussion
The sample sizes per source and the variety of sources included in the model influence the accuracy of the model and, consequently, the uncertainty of the predicted estimates. The attribution estimates for sources with a high number of samples available tend to be overestimated, whereas the estimates for sources with only a few samples tend to be underestimated. Recommendations for future sampling strategies include aiming for a more balanced sample distribution to improve the overall accuracy and utility of source attribution efforts.
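
The two data-preparation ideas named in the Methods, k-mer frequency features and down-sampling to balance sources, can be illustrated with a minimal Python sketch. This is not the study's pipeline: the abstract does not state the k-mer length, the classifier, or the exact down-sampling scheme, so the value of k, the rule of reducing every source to the size of the smallest one, the function names, and the toy sequences below are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def kmer_frequencies(sequence, k=4):
    """Return relative frequencies of the k-mers in a DNA sequence.

    k=4 is an arbitrary illustrative default, not a value from the study.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid  # skip k-mers with ambiguous bases
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def downsample_by_source(labelled_isolates, seed=0):
    """Randomly reduce every source to the size of the smallest source.

    This balancing rule is an assumption; other schemes are possible.
    """
    by_source = defaultdict(list)
    for isolate, source in labelled_isolates:
        by_source[source].append(isolate)
    if not by_source:
        return []
    target = min(len(items) for items in by_source.values())
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    balanced = []
    for source, items in sorted(by_source.items()):
        for isolate in rng.sample(items, target):
            balanced.append((isolate, source))
    return balanced


# Toy data: sequences and source labels are made up for illustration.
isolates = [
    ("ACGTACGTAC", "broiler"),
    ("ACGTTTGTAC", "broiler"),
    ("TTGTACGTAA", "broiler"),
    ("GGGTACGCAC", "cattle"),
]
balanced = downsample_by_source(isolates)
features = [(kmer_frequencies(seq, k=3), source) for seq, source in balanced]
```

In this toy example the three broiler isolates are reduced to one so that both sources contribute equally. The resulting feature dictionaries would then be passed to a classifier of choice; the trade-off reported in the Results, fewer confidently attributed cases after down-sampling, follows from discarding data from the well-sampled sources.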