Abstract
Maps of bottom type are essential to the management of marine resources and biodiversity because of their foundational role in characterizing species' habitats. They are also urgently needed as countries work to define marine protected areas. Current approaches are time consuming, focus largely on grain size, and tend to overlook shallow waters. Our random forest classification of almost 200,000 observations of bottom type is a timely alternative, providing maps of coastal substrate at a combination of resolution and extent not previously achieved. We correlated the observations with depth, depth derivatives, and estimates of energy to predict marine substrate at 100 m resolution for Canada's Pacific shelf, a study area of over 135,000 km². We built five regional models with the same data at 20 m resolution. In addition to standard tests of model fit, we used three independent data sets to test model predictions. We also tested for regional, depth, and resolution effects. We guided our analysis by asking: 1) does weighting for prevalence improve model predictions? 2) does model resolution influence model performance? and 3) is model performance influenced by depth? All our models fit the build data well, with true skill statistic (TSS) scores ranging from 0.56 to 0.64. Weighting models with class prevalence improved fit and the correspondence with known spatial features. Class-based metrics showed differences across both resolutions and spatial regions, indicating non-stationarity across these spatial categories. Predictive power was lower (TSS from 0.10 to 0.36) based on independent data evaluation. Model performance was also a function of depth and resolution, illustrating the challenge of accurately representing heterogeneity. Our work shows the value of regional analyses for assessing model stationarity, and how independent data evaluation and the use of error metrics can improve understanding of model performance and sampling bias.
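The true skill statistic (TSS) reported above is defined as sensitivity plus specificity minus one. A minimal sketch of the computation from binary confusion-matrix counts (the counts below are hypothetical, for illustration only):

```python
def tss(tp, fn, tn, fp):
    """True skill statistic from confusion-matrix counts.

    TSS = sensitivity + specificity - 1, ranging from -1 to +1,
    where 0 indicates no skill beyond random assignment.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0

# Hypothetical counts: 80/100 positives and 70/100 negatives correct
print(tss(tp=80, fn=20, tn=70, fp=30))  # 0.8 + 0.7 - 1 = 0.5
```

Unlike overall accuracy, TSS is insensitive to class prevalence, which is one reason it is a common choice for evaluating habitat and substrate models.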
Highlights
Coastal management depends on understanding how marine species are distributed
We suggest that using build data compiled from several different sampling contexts will improve model performance because the diversity of biases will force the classifications to be more general, much like the generalization of processes described above regarding stationarity
We found differences in how accuracy metrics respond to class weighting, with the True Negative Rate (TNR) more responsive than Overall Accuracy (Table 3), corroborating the observation by Allouche et al. [17] that prevalence has a greater influence on TNR than on Accuracy
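The differing sensitivity of TNR and Overall Accuracy to prevalence can be illustrated with a small numerical sketch (the counts are hypothetical, not from Table 3): when the negative class is rare, a model can score high overall accuracy while misclassifying most negatives.

```python
def accuracy(tp, fn, tn, fp):
    """Overall accuracy: fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

def tnr(tn, fp):
    """True negative rate: fraction of negatives classified correctly."""
    return tn / (tn + fp)

# Hypothetical imbalanced sample: 900 positives, 100 negatives.
# The model gets most positives right but most negatives wrong.
tp, fn, tn, fp = 880, 20, 30, 70

print(accuracy(tp, fn, tn, fp))  # (880 + 30) / 1000 = 0.91
print(tnr(tn, fp))               # 30 / 100 = 0.30
```

Overall accuracy (0.91) is dominated by the prevalent class, while TNR (0.30) exposes the poor performance on the rare class, which is why class weighting moves TNR far more than it moves accuracy.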
Summary
Our main objective was to build a comprehensive, ecologically relevant coastwide map of marine substrate to support predictions of quality habitat for benthic species, among other applications. The importance of such predictions to marine spatial planning makes timeliness an additional objective, and necessitates using the best available data. The effect of class prevalence on classification models has been well described [e.g., 17, 18], and recent work [8] confirms that the random forest algorithm favors the over-sampled class [19]. This challenge is a significant area of research in the machine learning community [19], where well-balanced classes are encouraged. Finding little on this topic in the marine substrate classification literature, we tested the effect of class prevalence using two parallel sets of models with and without class-size weighting.
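The parallel weighted/unweighted comparison described above can be sketched with scikit-learn, where `class_weight="balanced"` reweights each class inversely to its prevalence. This is not the authors' actual pipeline; the data are synthetic, and the three predictors are stand-ins for depth, a depth derivative, and an energy estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))  # stand-ins for depth, slope, energy

# Imbalanced synthetic substrate classes: class 1 is rare
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, random_state=0, stratify=y
)

scores = {}
for weight in (None, "balanced"):
    rf = RandomForestClassifier(
        n_estimators=100,
        class_weight=weight,  # None = unweighted; "balanced" = prevalence-weighted
        random_state=0,
    ).fit(X_tr, y_tr)
    scores[weight] = balanced_accuracy_score(y_te, rf.predict(X_te))
    print(weight, round(scores[weight], 3))
```

Comparing a prevalence-sensitive metric such as balanced accuracy (or TSS) between the two fits shows how much the weighting shifts performance on the rare class.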