Efficiently extracting and analyzing large volumes of urban traffic data, accurately predicting traffic conditions, and improving urban traffic management all require careful selection of an appropriate data sample size, which is also important for sustainable transportation development. This paper investigates the relationship between traffic flow prediction performance and data sample size, considering three aspects of the sample: missing rate, duration, and road segment coverage. Real traffic flow data from 13 road sections in Changsha, China, are analyzed using Decision Tree, Support Vector Machine, Gaussian Process Regression, and Artificial Neural Network models. The key findings are: (1) lower data missing rates improve prediction accuracy because traffic flow patterns are captured more effectively, while higher missing rates reduce accuracy; (2) an optimal data sample duration of around 7 days balances prediction accuracy and data stability, with longer durations providing more historical data at the risk of added complexity; and (3) broader road segment coverage provides more comprehensive traffic flow information, but excessive coverage introduces noise and limits further gains in prediction accuracy. The results highlight the significant impact of data sample size on prediction performance. Prediction reliability can be enhanced by reducing data loss, selecting a suitable sample duration, and choosing appropriate road segment coverage, which in turn supports improved traffic management and route planning.