Abstract
When a household car ownership model is used to predict travel demand, training a machine learning model on imbalanced data negatively affects the algorithm training process and, in turn, the transportation policy the model is meant to support. The household car ownership data obtained from the study project for expressway preparation in Khon Kaen Province (2015) form an imbalanced dataset: the minority class has fewer members than the other classes, which biases the data classification. This research therefore proposes balancing the datasets with cost-sensitive learning applied to decision tree, k-nearest neighbors (kNN), and naive Bayes algorithms. Before building the 3-class model, k-fold cross-validation was used to classify the datasets and to obtain the true positive rate (TPR) used to validate model performance. The results indicate that the kNN algorithm performed best at predicting the minority class compared with the other algorithms. For the rural and suburban area types, which have very different imbalance ratios, the TPR before balancing was 46.9% and 46.4%, respectively. After balancing the data (MCN1), the TPR values were 84.4% and 81.4%, respectively.
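The combination of cost-sensitive voting and k-fold validation described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the class costs, the interleaved fold split, the choice of k = 3 with 3 folds, and the one-dimensional toy data are all assumptions made for the example, and the abstract does not specify how the MCN1 costs are defined.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3, class_cost=None):
    """Predict by (optionally cost-weighted) vote among the k nearest points."""
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter()
    for i in order[:k]:
        # Assumed cost scheme: each neighbour's vote is scaled by its class cost.
        weight = 1.0 if class_cost is None else class_cost[train_y[i]]
        votes[train_y[i]] += weight
    # Highest total vote wins; ties go to the smallest class label.
    return min(votes, key=lambda c: (-votes[c], c))


def k_fold_indices(n, n_folds):
    """Split indices 0..n-1 into n_folds interleaved folds."""
    return [list(range(f, n, n_folds)) for f in range(n_folds)]


def cross_validated_tpr(xs, ys, target, n_folds=3, k=3, class_cost=None):
    """Pool true positives and false negatives for `target` over all folds."""
    tp = fn = 0
    for test_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        for i in test_idx:
            if ys[i] != target:
                continue
            if knn_predict(train_x, train_y, xs[i], k, class_cost) == target:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn)


# Hypothetical toy data: three clusters, each with 2 minority (class 0)
# points surrounded by 4 majority (class 1) points.
xs, ys = [], []
for centre in (0.0, 10.0, 20.0):
    for offset, label in [(0.0, 0), (-0.5, 1), (0.3, 0), (0.8, 1), (-1.0, 1), (1.3, 1)]:
        xs.append((centre + offset,))
        ys.append(label)

costs = {0: 3.0, 1: 1.0}  # assumed: a minority vote counts three times as much
tpr_plain = cross_validated_tpr(xs, ys, target=0, n_folds=3, k=3)
tpr_weighted = cross_validated_tpr(xs, ys, target=0, n_folds=3, k=3, class_cost=costs)
print(f"minority TPR without costs: {tpr_plain:.2f}, with costs: {tpr_weighted:.2f}")
```

On this deliberately adversarial toy data, every minority point has one minority and two majority neighbours, so the unweighted vote recovers none of the minority points while the weighted vote recovers all of them. Real data would show a much less extreme shift.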
Highlights
Data classification is an analysis method used to define data patterns, classification models, and classification rules
The findings indicated that the k-nearest neighbors (kNN) algorithm provided a high true positive rate (TPR) and classified the minority class (Class 0) more accurately at every imbalance ratio (Figure 3a)
The false negative rate (FNR) was close to 100% for some models, for instance the decision tree (DT) model in the suburban area (imbalance ratio, IR = 5.20), whereas the kNN algorithm gave the lowest FNR at every IR, with values depending on the area type
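The quantities named in these highlights can be computed directly from label counts. The sketch below is illustrative: it assumes the imbalance ratio is the largest class count divided by the smallest (the source does not state its definition), and the labels are hypothetical, chosen only so that IR comes out at 5.2. The "model" is a degenerate one that always predicts the majority class, which shows how FNR for the minority class can reach 100%.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def tpr_fnr(y_true, y_pred, target):
    """Return (TPR, FNR) for one class; the two always sum to 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    tpr = tp / (tp + fn)
    return tpr, 1.0 - tpr


# Hypothetical 3-class labels: 5 samples of class 0, 26 of class 1, 10 of class 2.
y_true = [0] * 5 + [1] * 26 + [2] * 10
# A degenerate model that always predicts the majority class (class 1).
y_pred = [1] * len(y_true)

ir = imbalance_ratio(y_true)
tpr, fnr = tpr_fnr(y_true, y_pred, target=0)
print(f"IR = {ir:.2f}, minority TPR = {tpr:.0%}, minority FNR = {fnr:.0%}")
```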
Summary
Data classification is an analysis method used to define data patterns, classification models, and classification rules. It is used to predict different types of data, either present or future, such as travel demand. A high-performing technique should be selected based on parameters that indicate classification performance, e.g. accuracy, precision, recall, and F1-score. Still, these techniques do not work well on every dataset. Imbalanced data has classes with different numbers of samples. Imbalanced data classification is therefore a challenging issue, because some of the minority classes include significant or outstanding data. For more effective data analysis, the model's ability to classify the minority class needs to be improved before algorithm training, using parameters suitable for the imbalanced data [5, 6].
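To see why accuracy alone can mislead on imbalanced data, the short sketch below computes accuracy, precision, recall, and F1-score for a minority class. The labels and predictions are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def class_metrics(y_true, y_pred, target):
    """Overall accuracy plus precision, recall and F1-score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when the class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 10 minority (class 0) and 90 majority (class 1) samples.
y_true = [0] * 10 + [1] * 90
# The model finds only 2 of the 10 minority samples but gets every majority sample right.
y_pred = [0] * 2 + [1] * 8 + [1] * 90

metrics = class_metrics(y_true, y_pred, target=0)
print(metrics)
```

Here overall accuracy is 92% even though only 20% of the minority samples are recovered, which is why recall (TPR) and F1-score for the minority class are the more informative parameters in this setting.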