Absence reduction in entomological surveillance data to improve niche-based distribution models for Culicoides imicola

J Peters, B De Baets, J Van Doninck, C Calvete, J Lucientes, E.M De Clercq, E Ducheyne, N.E.C Verhoest

doi:10.1016/j.prevetmed.2011.03.004

Abstract

Data-driven models for the prediction of bluetongue vector distributions are valuable tools for the identification of areas at risk for bluetongue outbreaks. Various models have been developed during the last decade, and the majority of them use linear discriminant analysis or logistic regression to infer vector–environment relationships. This study presents a performance assessment of two established models compared to a distribution model based on a promising ensemble learning technique called Random Forests. Additionally, the impact of false absences, i.e. data records of suitable vector habitat that are, for various reasons, incorrectly labelled as absent, on the model outcome was assessed using alternative calibration–validation schemes. Three reduction methods were applied to reduce the number of false absences in the calibration data, without loss of information on the environmental gradient of suitable vector habitat: random reduction, and stratified reduction based on the distance between absence and presence records in either geographical space (Euclidean distance) or environmental space (Mahalanobis distance). The results indicated that the vector distribution predicted by the Random Forest model was significantly more accurate than the vector distributions predicted by the two established models (McNemar test, p < 0.01) when the calibration data were not reduced with respect to false absences. The performance of the established models, however, improved considerably when stratified false absence reductions were applied. Model validation revealed no significant difference between the performance of the three distinct Culicoides imicola distribution models for the majority of alternative stratified reduction schemes.
The main conclusion of this study is that the application of Random Forests, or of linear discriminant analysis and logistic regression on the condition that the calibration data are first reduced using geographical or environmental information, potentially leads to better vector distribution models.
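
The distance-based reduction described above can be illustrated with a minimal sketch. The abstract does not give the stratification rule, so the code below assumes the simplest possible one: an absence record is treated as a potential false absence, and dropped, when its distance to the nearest presence record is at most a chosen threshold. The function names, the `threshold` parameter, and the choice to estimate the Mahalanobis covariance from the presence records are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def min_distance_to_presences(absences, presences, metric="euclidean"):
    """Distance from each absence record to its nearest presence record.

    Both inputs are arrays of shape (n_records, n_features): coordinates for
    geographical space, environmental predictors for environmental space.
    """
    absences = np.asarray(absences, dtype=float)
    presences = np.asarray(presences, dtype=float)
    # Pairwise differences, shape (n_absences, n_presences, n_features).
    diff = absences[:, None, :] - presences[None, :, :]
    if metric == "euclidean":
        d2 = np.einsum("ijk,ijk->ij", diff, diff)
    elif metric == "mahalanobis":
        # Assumption: the covariance is estimated from the presence records.
        cov = np.atleast_2d(np.cov(presences, rowvar=False))
        inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular cov
        d2 = np.einsum("ijk,kl,ijl->ij", diff, inv_cov, diff)
    else:
        raise ValueError(f"unknown metric: {metric!r}")
    # Clip tiny negative values caused by round-off before the square root.
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)


def reduce_absences(absences, presences, threshold, metric="euclidean"):
    """Keep only absences farther than `threshold` from every presence.

    Returns the retained absence records and the boolean mask used.
    """
    distances = min_distance_to_presences(absences, presences, metric)
    keep = distances > threshold
    return np.asarray(absences, dtype=float)[keep], keep


if __name__ == "__main__":
    # Hypothetical toy coordinates, not data from the study.
    presences = [[0.0, 0.0], [1.0, 0.0]]
    absences = [[0.5, 0.0], [5.0, 5.0], [0.0, 0.2]]
    kept, mask = reduce_absences(absences, presences, threshold=1.0)
    print(mask)
    print(kept)
```

With `metric="euclidean"` the inputs would be trap coordinates; with `metric="mahalanobis"` they would be the environmental predictor values at each trap. The reduced absence set, together with all presence records, would then form the calibration data for the distribution model.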

Full Text