Data scarcity and privacy concerns in various fields, including transportation, have fueled a growing interest in synthetic data generation. Synthetic datasets offer a practical solution to address data limitations, such as the underrepresentation of minority classes, while maintaining privacy when needed. Notably, recent studies have highlighted the potential of combining real and synthetic data to enhance the accuracy of demand predictions for shared transport services, thereby improving service quality and advancing sustainable transportation. This study introduces a systematic methodology for evaluating the quality of synthetic transport-related time series datasets. The framework incorporates multiple performance indicators addressing five aspects of quality: fidelity, distribution matching, diversity, coverage, and novelty. By combining distributional measures like Hellinger distance with time-series-specific metrics such as dynamic time warping and cosine similarity, the methodology ensures a comprehensive assessment. A clustering-based evaluation is also included to analyze how well distinct sub-groups within the data are represented. The methodology was applied to two datasets: passenger counts on an intercity bus route and vehicle speeds along an urban road. While the synthetic speed dataset adequately captured the diversity and patterns of the real data, the passenger count dataset failed to represent key cluster-specific variations. These findings demonstrate the proposed methodology's ability to identify both satisfactory and unsatisfactory synthetic datasets. Moreover, its sequential design enables the detection of gaps in deeper layers of similarity, going beyond basic distributional alignment. This work underscores the value of tailored evaluation frameworks for synthetic time series, advancing their utility in transportation research and practice.
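A minimal sketch of how three of the metrics named above (Hellinger distance on value distributions, dynamic time warping, and cosine similarity) could be computed for one real/synthetic series pair is given below. This is not the authors' implementation; the function names, bin counts, and toy daily profiles are illustrative assumptions only.

```python
# Illustrative sketch only (not the paper's code): pairwise similarity metrics
# between a real and a synthetic time series, using plain NumPy.
import numpy as np


def hellinger_distance(real, synth, bins=20):
    """Hellinger distance between the value distributions of two samples."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length series."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Hypothetical daily profiles at 15-minute resolution, for demonstration only.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 2 * np.pi, 96)
    real = 50 + 20 * np.sin(t) + rng.normal(0, 2, t.size)
    synth = 50 + 20 * np.sin(t + 0.1) + rng.normal(0, 3, t.size)

    print(f"Hellinger distance: {hellinger_distance(real, synth):.3f}")
    print(f"DTW distance:       {dtw_distance(real, synth):.1f}")
    print(f"Cosine similarity:  {cosine_similarity(real, synth):.4f}")
```

In a full evaluation along the lines described in the abstract, such pairwise scores would be aggregated across many series (and, per the clustering-based step, within each identified sub-group) rather than reported for a single pair.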