Information retrieval (IR) methods seek to locate meaningful documents in large collections of textual and other data. Few studies apply these techniques to discover descriptions in historical documents for physical geography applications. This absence is noteworthy given the use of qualitative historical descriptions in physical geography and the amount of historical documentation available online. This study therefore introduces an IR approach for finding meaningful and geographically resolved historical descriptions in large digital collections of historical documents. Presenting a biogeography application, it develops a ‘search engine’ that uses a boosted regression trees (BRT) model to help find forest compositional descriptions (FCDs) based on textual features in a collection of county histories. The study then investigates whether FCDs corroborate existing estimates of relative abundances and spatial distributions of tree taxa from presettlement land survey records (PLSRs) and existing range maps. The BRT model is trained using portions of text from 458 US county histories. Evaluated on a spatially independent test dataset, the model helps discover 97.5% of FCDs while reducing the amount of text to search through to 0.3% of the total. The prevalence rank of taxa in FCDs (i.e. the number of FCDs in which a taxon is mentioned at least once, divided by the total number of FCDs, then ranked) is strongly related to the abundance rank in PLSRs. Patterns in species mentions from FCDs generally match relative abundance patterns from PLSRs. However, analyses suggest that FCDs contain biases towards large and economically valuable tree taxa and against smaller taxa. Overall, the study demonstrates the potential of IR approaches for developing novel datasets over large geographic areas, corroborating existing historical datasets, and providing spatial coverage of historic phenomena.
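
As a minimal sketch of the prevalence-rank definition given above, the calculation might look like the following in Python. It assumes each FCD has already been reduced to a list of taxon names, which is a hypothetical input format rather than one the study specifies. The function name `prevalence_ranks`, the example data, and the alphabetical tie-breaking rule are likewise illustrative assumptions.

```python
from collections import Counter


def prevalence_ranks(fcds):
    """Rank taxa by the share of FCDs that mention them at least once.

    `fcds` is a list of descriptions, each given as a list of taxon names
    (an assumed input format). Repeated mentions within one FCD count once.
    """
    counts = Counter()
    for taxa in fcds:
        counts.update(set(taxa))  # each taxon counted at most once per FCD

    total = len(fcds)
    prevalence = {taxon: n / total for taxon, n in counts.items()}

    # Rank 1 is the most prevalent taxon. Ties are broken alphabetically,
    # which is an assumption made here only to keep the output deterministic.
    ordered = sorted(prevalence, key=lambda t: (-prevalence[t], t))
    ranks = {taxon: i + 1 for i, taxon in enumerate(ordered)}
    return prevalence, ranks


# Made-up example data, not taken from the study.
example = [
    ["oak", "hickory", "oak"],
    ["oak", "maple"],
    ["maple", "oak", "beech"],
    ["hickory"],
]
prevalence, ranks = prevalence_ranks(example)
```

The resulting ranks could then be compared with PLSR abundance ranks using a rank correlation; the abstract does not state which statistic the study uses.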