Anomalies in species abundance data can cause classification errors in ecological forecasting models, and accurately locating these anomalies can improve the models' predictive capacity. This study proposes an approach for precisely identifying and correcting anomalies within imbalanced species abundance data, thereby addressing the challenges posed by both anomalous and imbalanced species abundance distributions (SADs). A model-agnostic statistical tool, Confident Learning (CL) theory, is introduced to estimate the probability of each sample being misclassified during the prediction phase. Specifically, the approach treats records misclassified by models trained on imbalanced SADs as anomalies and targets them for data cleansing. The approach is applied to tuna fisheries datasets, focusing on bigeye tuna (Thunnus obesus), a target species in longline fishing, and albacore tuna (Thunnus alalunga), a non-target species, in the tropical Atlantic Ocean. These datasets span 2016 to 2019 at a spatial resolution of 0.5° × 0.5° and a daily temporal resolution, providing a comprehensive view of imbalanced data scenarios. The results demonstrate that all four predictors, Support Vector Machine (SVM), Logistic Regression (LR), Random Forests (RF), and Extreme Gradient Boosting (XGBoost), improved considerably after training on the cleaned datasets. Notably, SVM and LR achieved overall accuracy of over 90% in predicting both low- and high-abundance fishing grounds. These results indicate that eliminating anomalies can enhance the robustness of ecological forecasting models to imbalanced SADs, offering new insights and technical support for the fine-grained prediction and assessment of ecological resources.
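The core Confident Learning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-class encoding (0 = low abundance, 1 = high abundance), and the probability values are all hypothetical toy choices. It assumes that out-of-sample predicted probabilities are already available from some classifier (for example via cross-validation), computes a per-class confidence threshold, and flags a record as a suspected anomaly when the model is confident in a class other than the recorded one.

```python
def class_thresholds(labels, pred_probs, n_classes):
    """Per-class threshold: mean predicted probability of class k
    over the samples whose recorded label is k."""
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        if not probs_k:
            raise ValueError(f"class {k} has no labeled samples")
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def flag_anomalies(labels, pred_probs):
    """Return indices of samples whose confidently predicted class
    differs from the recorded label (simplified Confident Learning rule)."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability reaches that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # model is not confident in any class; leave the record alone
        best = max(confident, key=lambda k: p[k])
        if best != y:
            flagged.append(i)
    return flagged


# Toy, made-up values (NOT data from the study): an imbalanced set with
# six low-abundance records (0) and two high-abundance records (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],  # recorded as low, but the model is confident it is high
    [0.95, 0.05],
    [0.70, 0.30],
    [0.20, 0.80],
    [0.30, 0.70],
]

flagged = flag_anomalies(labels, pred_probs)
cleaned_labels = [y for i, y in enumerate(labels) if i not in flagged]
print(flagged)  # indices of suspected anomalies
```

Because the thresholds are computed per class, a minority class is judged against its own typical confidence rather than a single global cutoff, which is one reason this kind of rule is a natural fit for imbalanced data. Records flagged this way would be removed or corrected before retraining a predictor such as SVM, LR, RF, or XGBoost.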