ABSTRACT Uncertainty is a common problem in spatial modeling and geographical information systems (GIS), and urban gain modeling (UGM) involves uncertainty along several dimensions and components. Data sampling is an important step in UGM: it can introduce uncertainty into model results and affect their precision and accuracy. A poorly sampled or biased dataset can lead to inaccurate predictions and reduced model performance. This paper develops and presents novel strategies for sampling and building training datasets that can enhance the performance of data-driven models. Specifically, the study used maximum entropy (ME) and ecological niche factor analysis (ENFA) models to select pure non-change samples with minimal uncertainty for the training datasets in UGM of the cities of Isfahan and Tabriz in Iran. Urban gain over two time intervals was used for each city: 1992–2002 and 2002–2012 for Tabriz, and 1994–2004 and 2004–2014 for Isfahan. Nine and 14 urban gain drivers were used in the UGM of Isfahan and Tabriz, respectively. After the ME and ENFA models produced training datasets of change and non-change samples with the lowest uncertainty, three well-known models, namely random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used for the modeling. In addition, the ME and ENFA models used to investigate the uncertainty of the sampling procedure were also applied as one-class prediction models. Compared to extant studies, the proposed ME-based sampling strategy increased the area under the receiver operating characteristic curve (AUROC), figure of merit, producer’s accuracy, and overall accuracy by 5.5%, 5%, 5%, and 3%, respectively, in the validation phase for Isfahan and by 5%, 6%, 14%, and 17%, respectively, for Tabriz.
For Isfahan, the accuracies of the ME (AUROC = 0.649) and ENFA (AUROC = 0.661) one-class models were close to those of the ANN-ME (AUROC = 0.646), ANN-ENFA (AUROC = 0.619), and RF-ENFA (AUROC = 0.631) models but differed significantly from that of the RF-ME model (AUROC = 0.737). For Tabriz, the accuracies of the ME (AUROC = 0.657) and ENFA (AUROC = 0.688) one-class models were lower than those of the two-class RF-ME (AUROC = 0.852) and ANN-ME (AUROC = 0.778) models. The results showed that the ME model was able to identify relatively pure non-change samples and properly remove impure non-change samples from the training dataset. This study found that binary models are preferable to one-class models and showed that an optimal sampling strategy is an essential step in UGM because it can decrease uncertainty. Modelers must therefore adopt efficient sampling methods.
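
The sampling idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a one-class model such as ME or ENFA yields a per-cell suitability score, and that the non-change cells with the lowest scores are treated as the "purest" non-change samples; the abstract does not state the exact selection rule. The function names, the `suitability` values, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def select_pure_non_change(
    suitability: Sequence[float],
    changed: Sequence[bool],
    n_samples: int,
) -> List[int]:
    """Return indices of up to n_samples non-change cells with the lowest suitability.

    suitability: hypothetical one-class model score per cell (higher = more change-like).
    changed: True where the cell was observed to become urban in the interval.
    """
    # Candidates are the cells that did not change in the observed interval.
    candidates = [i for i, was_changed in enumerate(changed) if not was_changed]
    # Assumption: a low score marks a cell as an unambiguous non-change sample.
    candidates.sort(key=lambda i: suitability[i])
    return candidates[:n_samples]


def build_training_set(
    suitability: Sequence[float],
    changed: Sequence[bool],
    n_non_change: int,
) -> Tuple[List[int], List[int]]:
    """Combine all observed change cells with the selected non-change cells."""
    change_idx = [i for i, was_changed in enumerate(changed) if was_changed]
    non_change_idx = select_pure_non_change(suitability, changed, n_non_change)
    labels = [1] * len(change_idx) + [0] * len(non_change_idx)
    return change_idx + non_change_idx, labels


# Toy example: six cells, two of which changed (values are made up).
scores = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]
changed = [True, False, False, False, True, False]
indices, labels = build_training_set(scores, changed, n_non_change=2)
```

In this sketch, the resulting indices and labels would then be used to train a two-class model such as RF, ANN, or SVM, while non-change cells with high scores (here cells 2 and 3) are left out of the training data as potentially impure.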