Natural language has proven to be a valuable source of data for various scientific inquiries including landscape perception and preference research. However, large high quality landscape relevant corpora are scare. We here propose and discuss a natural language processing workflow to identify landscape relevant documents in large collections of unstructured text. Using a small curated high quality collection of actively crowdsourced landscape descriptions we identify and extract similar documents from two different corpora (Geograph and WikiHow) using sentence-transformers and cosine similarity scores. We show that 1) sentence-transformers combined with cosine similarity calculations successfully identify similar documents in both Geograph and WikiHow effectively opening the door to the creation of new landscape specific corpora, 2) the proposed sentence-transformer approach outperforms traditional Term Frequency - Inverse Document Frequency based approaches and 3) the identified documents capture similar topics when compared to the original high quality collection. The presented workflow is transferable to various scientific disciplines in need of domain specific natural language corpora as underlying data.
Read full abstract