This study investigates residential rental pricing in the city of São Paulo, combining machine learning techniques with geospatial and natural language processing (NLP) analyses. The dataset comprises 47,243 rental listings gathered through web scraping. After data cleaning and preprocessing, the study focused on 35,486 instances, incorporating variables that go beyond conventional metrics, including textual descriptions and geographic information. Several regression models were implemented and compared, including linear approaches, Support Vector Machines, and ensemble methods such as Gradient Boosting, LightGBM, and XGBoost. The Blending model, which integrates multiple modeling techniques, was the most accurate, achieving a Root Mean Squared Logarithmic Error (RMSLE) of 0.2923 on the test set. This result suggests that hybrid modeling strategies are advantageous in complex pricing tasks. The findings have practical implications: they give landlords and tenants a data-driven tool for informed decision-making that reflects the nuances and complexity of São Paulo’s real estate market. The model was deployed in an interactive web application, which demonstrates its utility in a real-world setting and can serve as a template for future applications in real estate analysis. By accurately estimating rental prices in a dynamic urban market, this work helps reduce the time and effort spent searching for and pricing residential rentals in a large city, supporting better-informed and more economical decisions from a combined social, sustainability, and technological perspective.
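For readers unfamiliar with the reported metric, the sketch below shows how RMSLE is conventionally computed and how predictions from several models might be blended. It is a minimal illustration, not the study's implementation: the abstract does not say how the Blending model combines its base models, so the weighted average used here, the function names `rmsle` and `blend`, and all numeric values are assumptions made for the example.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for paired non-negative values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), which keeps the metric defined when a value is 0.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def blend(predictions, weights):
    """Weighted average of per-model predictions (assumed blending scheme).

    predictions: one list of predicted prices per model, all the same length.
    weights: one non-negative weight per model.
    """
    if len(predictions) != len(weights) or len(predictions) == 0:
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        / total_weight
        for i in range(n_samples)
    ]


# Hypothetical monthly rents and predictions from two hypothetical models.
actual_rents = [2500.0, 4200.0, 1800.0]
model_a = [2300.0, 4500.0, 1900.0]
model_b = [2700.0, 4000.0, 1600.0]

blended = blend([model_a, model_b], weights=[0.5, 0.5])
score = rmsle(actual_rents, blended)
```

Because RMSLE compares logarithms, it penalizes relative rather than absolute errors, which suits rental prices that span a wide range.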