This work presents a general approach to a software architecture for data collection and analysis that uses machine learning methods to predict the price of agricultural land. A brief review of recently published literature is given. New research results on predicting land price with the LightGBM, Fast Tree, and Fast Forest methods are provided. The stages highlighted are data collection, storage, exploration, transformation, aggregation, optimization, and the application of machine learning methods. Several software applications are proposed to implement these stages, and an architecture diagram of their interaction is provided. The use of inflation data and its impact on forecast quality are reviewed. In addition to the dataset collected from auction and classifieds sites (first data source), data available on the website of the State Service of Ukraine for Geodesy, Cartography, and Cadastre (second data source) were used. The dataset collected from the first data source consists of 2123 rows; the one collected from the second data source contains 815529 rows (further filtered to fewer rows by contract type, year, etc.). The list of datasets on which machine learning was performed has been modified: attributes for soil type, distance from large cities, and inflation were added. The data available on the website of the State Service of Ukraine for Geodesy, Cartography, and Cadastre were converted into a dataset, which was further filtered and used at sizes ranging from 75843 to 102924 rows. From the initial dataset (D1), two additional ones were created: one without data on land rent (D2) and one with prices adjusted for inflation (according to the National Bank of Ukraine) (D3). Experiments on removing the value of the regulatory monetary assessment of land are reviewed. The data from the second data source were converted into four datasets. Dataset D4 was constructed using date filters for 2022 and 2023 and the "sale-purchase agreement" contract type. 
Rows without a specified price were removed from the dataset. The total number of rows in dataset D4 is 102924. Dataset D5 additionally applies a filter on the price per hectare of land, from no less than 10000 UAH to no more than 150000 UAH; after this filtering, 82241 parcels remain in the dataset. Dataset D6 expands dataset D5 with the distances from the largest cities in Ukraine to the land parcel. Dataset D7 further expands dataset D6 with information about inflation. Thus, the LightGBM, Fast Tree, and Fast Forest machine learning methods were applied to datasets collected from two different sources. A price prediction is considered correct if its error is no more than 10%. Depending on the algorithm and dataset, R² reaches up to 0.897 and the percentage of correct predictions up to 91%. The research shows that the quality of prediction improves as the number of rows in the initial dataset increases. The second data source gives slightly worse prediction results than the first, but its use shows how quickly a dataset can be extended with new parameters (e.g., soil type, distances from cities, inflation) to significantly increase prediction quality. Another possible way to improve forecast quality is to integrate datasets on non-market factors affecting land value (e.g., a map of potential contamination by explosive objects). The summary suggests further steps to improve the quality of predictions.
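The price-per-hectare filter used for dataset D5 and the two quality measures reported here (R² and the share of predictions within 10%) can be illustrated with a minimal Python sketch. It is not the implementation used in the work: the field name `price_per_ha`, the function names, and the sample values are hypothetical, and the 10% error is assumed to be measured relative to the actual price, which the abstract does not state.

```python
def filter_by_price_per_hectare(rows, low=10_000, high=150_000):
    """Keep rows whose price per hectare (UAH) is present and within [low, high].

    `price_per_ha` is a hypothetical field name chosen for this sketch.
    """
    return [
        row for row in rows
        if row.get("price_per_ha") is not None
        and low <= row["price_per_ha"] <= high
    ]


def share_within_tolerance(actual, predicted, tolerance=0.10):
    """Fraction of predictions whose error is at most `tolerance` of the actual price.

    Assumption: the error is taken relative to the actual price.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for a, p in zip(actual, predicted)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Made-up sample values for illustration only (UAH per hectare).
sample_rows = [
    {"price_per_ha": 9_000},    # below the lower bound, dropped
    {"price_per_ha": 10_000},   # on the lower bound, kept
    {"price_per_ha": 150_000},  # on the upper bound, kept
    {"price_per_ha": 200_000},  # above the upper bound, dropped
    {"price_per_ha": None},     # no price specified, dropped
]
kept_rows = filter_by_price_per_hectare(sample_rows)

actual_prices = [50_000, 80_000, 120_000, 40_000]
predicted_prices = [52_000, 90_000, 118_000, 40_500]
correct_share = share_within_tolerance(actual_prices, predicted_prices)
r2 = r_squared(actual_prices, predicted_prices)
```

With these sample values, three of the four predictions fall within the 10% tolerance. The two measures answer different questions: R² summarizes the variance explained over the whole dataset, while the share of correct predictions counts parcels individually.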