In supervised classification, decision trees are among the most popular learning algorithms and are employed in many practical applications because of their simplicity, adaptability, and other advantages. The construction of effective and efficient decision trees remains a major focus in machine learning, and the scientific literature provides various node-splitting measures that yield different decision trees, including Information Gain, Gain Ratio, Average Gain, and the Gini Index. This paper presents a new node-splitting measure based on preordonance theory. The primary benefit of the new split criterion is its ability to handle categorical or numerical attributes directly, without discretization. Building on this criterion, the “Preordonance-based decision tree” (P-Tree) approach, a technique that generates decision trees using the proposed node-splitting measure, is developed. The P-Tree strategy handles both multiclass classification problems and imbalanced data sets. Moreover, it addresses the over-partitioning problem by introducing a threshold ϵ as a stopping condition: if the proportion of instances in a node falls below the predetermined threshold, expansion of the tree is halted.

The performance of the P-Tree procedure is evaluated on fourteen benchmark data sets of different sizes and compared with that of five existing decision tree methods using a variety of evaluation metrics. The experimental results demonstrate that the P-Tree model performs well across all tested data sets and is comparable to the other five decision tree algorithms overall. In addition, an ensemble technique called “ensemble P-Tree” offers a reliable remedy for the instability frequently associated with tree-based algorithms. This ensemble method leverages the strengths of the P-Tree approach to enhance predictive performance through collective decision-making. The ensemble P-Tree strategy is comprehensively evaluated by comparing its performance to that of two top-performing ensemble decision tree methodologies, and the experimental findings highlight its strong performance and competitiveness.

Despite the excellent performance of the P-Tree approach, obstacles such as memory restrictions, time complexity, and data complexity still prevent it from handling larger data sets; parallel computing is effective in resolving this kind of problem. Hence, the MR-P-Tree decision tree technique, a parallel implementation of the P-Tree strategy in the MapReduce framework, is further designed. The MR-P-Tree methodology rests on three parallel procedures: MR-SA-S for choosing the optimal splitting attributes, MR-SP-S for choosing the optimal splitting points, and MR-S-DS for partitioning the training data set in parallel. Furthermore, experimental studies on ten additional data sets illustrate the viability of the MR-P-Tree technique and its strong parallel performance.
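
Since the abstract does not give the formula for the preordonance-based measure, the following minimal Python sketch illustrates only the ϵ-based stopping rule inside a generic greedy tree-growing loop. Gini impurity is used as a stand-in scorer (the real criterion also handles categorical attributes directly, which this stand-in does not), and all names (`grow_tree`, `split_gain`, `epsilon`) are illustrative, not the paper's API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity; a stand-in, since the abstract does not give
    the preordonance-based measure's formula."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(X, y, j, v):
    """Impurity reduction of splitting attribute j at point v
    (placeholder for the preordonance-based score)."""
    left = [yi for row, yi in zip(X, y) if row[j] <= v]
    right = [yi for row, yi in zip(X, y) if row[j] > v]
    if not left or not right:
        return -1.0
    n = len(y)
    return gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)

def grow_tree(X, y, total_n, epsilon=0.05):
    """Recursively grow a tree; expansion halts when a node holds fewer
    than epsilon * total_n instances (the abstract's stopping rule)."""
    if len(y) / total_n < epsilon or len(set(y)) == 1:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    candidates = [(j, v) for j in range(len(X[0])) for v in {r[j] for r in X}]
    j, v = max(candidates, key=lambda jv: split_gain(X, y, jv[0], jv[1]))
    if split_gain(X, y, j, v) <= 0:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    L = [i for i, r in enumerate(X) if r[j] <= v]
    R = [i for i, r in enumerate(X) if r[j] > v]
    return {"attr": j, "point": v,
            "left": grow_tree([X[i] for i in L], [y[i] for i in L], total_n, epsilon),
            "right": grow_tree([X[i] for i in R], [y[i] for i in R], total_n, epsilon)}

# Usage: tree = grow_tree([[2.0], [3.5], [1.0], [4.2]], ["a", "a", "b", "a"],
#                         total_n=4, epsilon=0.1)
```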
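The abstract likewise does not specify how the ensemble P-Tree combines its base trees; a common design, assumed here, is bootstrap resampling with majority voting. The sketch below reuses `grow_tree` from the previous block; `n_trees` and the bagging scheme are assumptions, not the paper's stated method.

```python
import random
from collections import Counter

def predict(node, x):
    """Route an instance down a grown tree to its leaf label."""
    while "leaf" not in node:
        node = node["left"] if x[node["attr"]] <= node["point"] else node["right"]
    return node["leaf"]

def ensemble_fit(X, y, n_trees=25, epsilon=0.05, seed=0):
    """Grow n_trees base trees on bootstrap resamples (assumed scheme)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                len(y), epsilon))
    return forest

def ensemble_predict(forest, x):
    """Collective decision-making via majority vote across the base trees."""
    return Counter(predict(t, x) for t in forest).most_common(1)[0][0]
```

Averaging over resampled trees is the standard remedy for the instability the abstract mentions: small perturbations of the training data can change a single tree's structure, but rarely flip a majority vote.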
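Finally, the internals of MR-SA-S, MR-SP-S, and MR-S-DS are not described in the abstract. The sketch below shows one plausible single-round MapReduce pattern for split selection under those constraints: each map task computes per-candidate class histograms on its data shard, the reduce step sums them, and the driver picks the best candidate from the pooled counts (again with Gini as a stand-in scorer). It runs sequentially here; a real deployment would distribute the map calls across workers.

```python
from collections import Counter
from functools import reduce

def gini_from_counts(counts):
    """Gini impurity from a class histogram (stand-in scorer)."""
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def map_stats(shard, candidates):
    """Map: for each broadcast candidate (attribute j, point v), compute
    the left/right class histograms on this worker's data shard."""
    X, y = shard
    stats = {}
    for j, v in candidates:
        left, right = Counter(), Counter()
        for row, yi in zip(X, y):
            (left if row[j] <= v else right)[yi] += 1
        stats[(j, v)] = (left, right)
    return stats

def reduce_stats(a, b):
    """Reduce: merge per-candidate histograms by summation."""
    return {k: (a[k][0] + b[k][0], a[k][1] + b[k][1]) for k in a}

def choose_split(shards, candidates):
    """Driver: one map and one reduce round, then pick the candidate whose
    pooled histograms give the largest impurity reduction."""
    pooled = reduce(reduce_stats, (map_stats(s, candidates) for s in shards))

    def gain(left, right):
        nl, nr = sum(left.values()), sum(right.values())
        n = nl + nr
        return (gini_from_counts(left + right)
                - nl / n * gini_from_counts(left)
                - nr / n * gini_from_counts(right))

    return max(candidates, key=lambda k: gain(*pooled[k]))
```

Because only fixed-size histograms travel between map and reduce, communication cost is independent of shard size, which is the usual reason this pattern scales to the larger data sets the abstract targets.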