Abstract

When a household car ownership model is used to predict travel demand, training a machine learning model on imbalanced data to support transportation policy negatively affects the algorithm training process. The household car ownership data obtained from the expressway preparation study project in Khon Kaen Province (2015) form an imbalanced dataset: the minority class has fewer members than the other answer classes, which biases the classification. This research therefore proposes balancing the datasets with cost-sensitive learning applied to decision tree, k-nearest neighbors (kNN), and naive Bayes algorithms. Before the 3-class model was built, k-fold cross-validation was applied to the datasets to obtain the true positive rate (TPR) used to validate the model’s performance. The results indicate that the kNN algorithm predicted the minority class better than the other algorithms. For the rural and suburban area types, which have very different imbalance ratios, its TPR before balancing was 46.9% and 46.4%, respectively. After balancing the data (MCN1), the TPR values were 84.4% and 81.4%, respectively.
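The general idea of cost-sensitive classification mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple scheme in which each neighbor's vote in a kNN classifier is scaled by a cost inversely proportional to its class frequency. This is not the MCN1 scheme of the study, whose definition is not given here; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def inverse_frequency_costs(labels):
    """Assumed cost scheme: rarer classes get proportionally larger costs."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def weighted_knn_predict(train_x, train_y, query, k, costs):
    """Predict by letting the k nearest neighbors cast cost-weighted votes."""
    neighbours = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = defaultdict(float)
    for _, label in neighbours:
        votes[label] += costs[label]
    return max(votes, key=votes.get)


# Toy one-feature data, invented for illustration: class 0 is the minority.
train_x = [(0.0,), (0.2,), (1.0,), (1.1,), (1.2,), (1.3,), (1.4,), (1.5,)]
train_y = [0, 0, 1, 1, 1, 1, 1, 1]
query = (0.6,)

uniform_costs = {0: 1.0, 1: 1.0}
costs = inverse_frequency_costs(train_y)

plain_prediction = weighted_knn_predict(train_x, train_y, query, 5, uniform_costs)
cost_sensitive_prediction = weighted_knn_predict(train_x, train_y, query, 5, costs)
```

With uniform costs, the three majority-class neighbors outvote the two minority-class neighbors; with the assumed inverse-frequency costs, the minority class wins the vote for the same query.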

Highlights

  • Data classification is an analysis method used to define data patterns, classification models, and classification rules

  • The findings indicated that the k-nearest neighbors (kNN) algorithm provided a high true positive rate (TPR) with a higher accuracy rate in classifying the dataset in the minority class (Class 0) in every imbalanced ratio (Figure 3a)

  • In some cases the false negative rate (FNR) was close to 100%, for instance for the decision tree (DT) model in the suburban area (imbalance ratio, IR = 5.20), whereas the kNN algorithm gave the lowest FNR at every IR in each area type

Summary

Introduction

Data classification is an analysis method used to define data patterns, classification models, and classification rules. It is used to predict different types of data, either present or future, such as travel demand. A high-performing technique should be selected using parameters that indicate classification performance, e.g. accuracy, precision, recall, and F1-score. Still, these techniques do not work well on every dataset. Imbalanced data have classes with different numbers of instances. Classifying imbalanced data therefore becomes a challenging issue, because some of the minority classes include significant or outstanding data. For more effective data analysis, the model’s ability to classify the minority class needs to be improved before the algorithm is trained, using parameters suitable for the imbalanced data [5, 6].
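A short sketch can show why overall accuracy is a weak guide on imbalanced data and why per-class measures such as the true positive rate (recall) matter. The labels below are invented for illustration, the helper names are hypothetical, and the imbalance ratio is assumed to be the majority class count divided by the minority class count.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class TPR (recall), FNR, precision, F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    metrics = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tpr = tp / (tp + fn)  # every class in y_true has tp + fn > 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        metrics[c] = {"tpr": tpr, "fnr": 1.0 - tpr, "precision": precision, "f1": f1}
    return accuracy, metrics


def imbalance_ratio(labels):
    """Assumed definition: majority class count over minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy 3-class labels: class 0 is the minority.
y_true = [0] * 10 + [1] * 60 + [2] * 30
# A classifier that never predicts class 0 and labels those cases as class 1.
y_pred = [1] * 10 + [1] * 60 + [2] * 30

accuracy, metrics = per_class_metrics(y_true, y_pred)
ratio = imbalance_ratio(y_true)
```

In this toy case the accuracy is 0.9 even though the TPR for class 0 is 0 and its FNR is 1, which is the kind of bias that accuracy alone hides.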
