As the world economy recovers and activity in the second-hand market increases, the second-hand sailboat market has shown enormous potential. To help a Hong Kong second-hand sailboat broker better understand the market and make accurate predictions, we supplemented the given data with other relevant data on second-hand sailboats and, on this basis, established a multiple linear regression model and several other models.
 In Task 1, we first supplemented the existing data with six additional variables. After cleaning and encoding the data, we tested for multicollinearity and found that it was present. We used stepwise regression for feature selection to reduce the multicollinearity, then established a multiple linear regression model and calculated the mean absolute error for the different sailboat variants. A Shapiro-Wilk test on the errors showed only a slight deviation from a normal distribution, so we considered the model acceptable. The results showed that Length, Make, and Year had the greatest impact on the listing price of each sailboat, and that the model estimated the prices of the different variants with high accuracy.
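The regression and per-variant error step in Task 1 can be illustrated with a minimal sketch. It assumes ordinary least squares solved through the normal equations; the function names (`fit_ols`, `mae_by_group`) and the six toy listings are hypothetical and are not taken from the paper's data. Stepwise selection and the Shapiro-Wilk test are left out.

```python
from collections import defaultdict


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def mae_by_group(coef, rows, y, groups):
    """Mean absolute error of the fitted model, computed per group label."""
    errors = defaultdict(list)
    for row, target, g in zip(rows, y, groups):
        errors[g].append(abs(target - predict(coef, row)))
    return {g: sum(e) / len(e) for g, e in errors.items()}


# Toy listings (made up): features are (length in ft, age in years).
rows = [(40, 5), (45, 8), (50, 3), (42, 10), (48, 6), (38, 12)]
# Prices built as 10 + 2*length - 3*age, so the fit should recover these values.
prices = [75.0, 76.0, 101.0, 64.0, 88.0, 50.0]
variants = ["monohull", "catamaran", "monohull",
            "catamaran", "monohull", "catamaran"]

coef = fit_ols(rows, prices)
group_mae = mae_by_group(coef, rows, prices, variants)
print(coef, group_mae)
```

Because the toy prices are exactly linear in the features, the per-variant errors here are essentially zero; on real listings they would show how well the model fits each variant.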
 In Task 2, we conducted one-way ANOVA on the regions, calculated the between-group differences and total dispersion of each feature, and then grouped and summarized the data to obtain the average price for each feature. We found that region had an impact on price and that the regional effects were not consistent across all sailboat variants. In addition, we found differences in the lengths of second-hand sailboats across countries, with most lengths falling between 40 ft and 50 ft.
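As a rough illustration of the one-way ANOVA in Task 2, the sketch below computes the between-group sum of squares, the total sum of squares, and the F statistic for prices grouped by region. The function name `one_way_anova` and the three toy groups are hypothetical; the paper's actual regions and prices are not reproduced here.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_total, f_stat, df_between, df_within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-group differences: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group dispersion: spread of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    # Total dispersion: spread of every value around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_total, f_stat, df_between, df_within


# Toy prices for three made-up regions.
region_prices = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
ss_between, ss_total, f_stat, df_between, df_within = one_way_anova(region_prices)
print(ss_between, ss_total, f_stat, df_between, df_within)
```

A large F statistic relative to the F distribution with (df_between, df_within) degrees of freedom suggests that mean price differs between regions; for this toy input F is 13 up to floating-point rounding.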
 In Task 3, we first made box plots of the average prices in different geographical regions, then made radar charts of each feature in these three regions, and finally re-established the multiple linear regression model with a 0-1 variable indicating whether each sailboat type was sold in Hong Kong. We found that the second-hand sailboat markets in the United States and Hong Kong were similar. Prices in Hong Kong were generally higher than those in other regions, and the Hong Kong effect differed between Monohulled Sailboats and Catamarans.
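One plausible way to write the re-established model with the 0-1 Hong Kong variable is shown below. The symbols ($P_i$, $x_{ij}$, $D_i$, $\gamma$) are assumptions introduced for illustration; the abstract does not give the exact specification.

```latex
P_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \gamma D_i + \varepsilon_i,
\qquad
D_i =
\begin{cases}
1, & \text{if sailboat type } i \text{ is sold in Hong Kong},\\
0, & \text{otherwise.}
\end{cases}
```

Under this form, $\gamma$ measures the average price difference associated with Hong Kong after controlling for the other features $x_{ij}$. Fitting the model separately for Monohulled Sailboats and Catamarans would give one $\gamma$ per variant, which is one way the differing effect could be observed.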
 In Task 4, we calculated the average prices of second-hand sailboats in each country, visualized the data, and established a map model. We found that GDP per capita (USD) and GDP (USD billion) differed significantly in their impact on length.
 In Task 5, using the second-hand sailboat transaction data we collected for Hong Kong, we established a model of transaction price and frequency by hull type (single or double). We concluded that brokers should pay attention to the main features of second-hand sailboats, to successful cases in the US second-hand sailboat market, and to the price range of sailboats.
 To improve our model, we adopted a random forest model, which can handle nonlinear relationships, is more robust, reduces the risk of overfitting, and increases prediction accuracy.