Abstract
Traffic crash data is often highly imbalanced: most records are non-fatal crashes and only a small number are fatal crashes. This imbalance poses a challenge for crash severity modeling, especially for classifying and interpreting fatal crashes, for which very few samples are available. To address it, resampling techniques such as under-sampling and over-sampling are commonly used to rebalance the number of samples across the categories of a dataset. However, most traditional and existing deep learning-based resampling methods, e.g., the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), have difficulty handling both the continuous and the discrete risk factors found in traffic crash datasets, because they are built on smooth, continuous functions that are not applicable to discrete variables. Although some resampling methods can handle both continuous and discrete variables, they may suffer from mode collapse associated with sparse discrete risk factors: they oversample repetitive, similar samples and therefore fail to capture the diversity of the underlying data distribution. To address these issues, the current study proposes a traffic crash data generation method based on the Conditional Tabular GAN (CTGAN) that rebalances crash datasets in order to improve crash severity classification and interpretation. Experiments are designed to evaluate, for various resampling strategies, the contribution of the synthetic data to crash severity classification, the distribution consistency between synthetic and benchmark datasets, and the parameter recovery (i.e., the accuracy of parameter estimation and probability prediction). A 4-year real-world dataset collected in Washington State, U.S., and Monte Carlo simulations are used in these experiments.
The results indicate that crash severity modeling using synthetic data generated by mix-resampling, which combines CTGAN with random under-sampling (CTGAN-RU), outperforms all baseline methods. In addition, the proposed deep generative method is capable of maintaining distribution consistency and achieving accurate parameter recovery. This study can provide traffic safety researchers and engineers with valuable insights into crash severity modeling, especially when handling imbalanced crash data of various types.
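As an illustration of the mix-resampling idea only, the following minimal Python sketch combines random under-sampling of the majority class with generator-based over-sampling of the minority class. It is not the authors' implementation. The function names (`mix_resample`, `bootstrap_sampler`), the binary fatal/non-fatal split, the `target_per_class` parameter, and the toy records are all assumptions made for this example. The `synthesize` argument marks the place where a trained CTGAN sampler would be plugged in; the placeholder used here simply resamples real rows with replacement and is not a generative model.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, object]


def mix_resample(
    rows: List[Row],
    label_key: str,
    minority_label: object,
    target_per_class: int,
    synthesize: Callable[[List[Row], int], List[Row]],
    seed: int = 0,
) -> List[Row]:
    """Rebalance a binary dataset to roughly target_per_class rows per class."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]

    # Random under-sampling: draw majority rows without replacement.
    kept_majority = rng.sample(majority, min(target_per_class, len(majority)))

    # Over-sampling: ask the generator for the missing minority rows.
    n_missing = max(0, target_per_class - len(minority))
    synthetic = synthesize(minority, n_missing) if n_missing else []

    return kept_majority + minority + synthetic


def bootstrap_sampler(minority: List[Row], n: int) -> List[Row]:
    """Hypothetical stand-in for a trained generator (NOT CTGAN):
    copies real minority rows chosen at random with replacement."""
    rng = random.Random(1)
    return [dict(rng.choice(minority)) for _ in range(n)]


# Toy data (invented for illustration): 95 non-fatal and 5 fatal records.
crashes = [{"severity": "non-fatal", "speed_limit": 30 + i % 40} for i in range(95)]
crashes += [{"severity": "fatal", "speed_limit": 60 + i} for i in range(5)]

balanced = mix_resample(crashes, "severity", "fatal", 20, bootstrap_sampler)
n_fatal = sum(r["severity"] == "fatal" for r in balanced)
n_non_fatal = len(balanced) - n_fatal
```

With the toy data above, the rebalanced set holds 20 non-fatal and 20 fatal records, 15 of the latter produced by the placeholder sampler. Passing the generator as a callable keeps the under-sampling logic independent of the model that produces the synthetic minority rows.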