This paper examines the performance and memory-access behavior of the C4.5 decision-tree induction program, a representative example of data mining applications, for both uniprocessor and parallel implementations. The goals of this paper are to characterize C4.5, in particular its memory hierarchy usage, and to decrease the run-time of C4.5 via algorithmic improvement and parallelization. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit instruction level parallelism to varying degrees. This paper makes the following four contributions. The first contribution is presenting a complete characterization of the C4.5 decision-tree induction program. The results show that with the exception of the input data set, the working set fits into an 8-Kbyte data cache; the instruction working set also fits into an 8-Kbyte instruction cache. For data sets larger than the L2 cache, performance is limited by accesses to main memory. The results further establish that four-way issue can provide up to a factor of two performance improvement over single-issue for larger L2 caches; for smaller L2 caches, out-of-order dispatch provides a large performance improvement over in-order dispatch. The second contribution made by this paper is examining the effect on the memory hierarchy of changing the layout of the input dataset in memory, showing again that the performance is limited by memory accesses. One proposed data layout decreases the dynamic instruction count by up to 24%, but usually results in lower performance due to worse cache behavior. Another proposed data layout does not improve the dynamic instruction count over the original layout, but has better cache behavior and decreases the run-time by up to a factor of two. Third, this paper presents the first decision-tree induction program parallelized for a ccNUMA architecture. A method for splitting the decision tree hash table is discussed that allows the hash table to be updated and accessed simultaneously without the use of locks. The performance of the parallel version is compared to the original version of C4.5 and a uniprocessor version of C4.5 using the best data layout found. Speedup curves from a six-processor Sun E4000 SMP system show a speedup on the induction step of 3.99, and simulation results show that the performance is mostly unaffected by increasing the remote memory access time until it is over a factor of ten greater than the local memory access time. Last, this paper characterizes the parallelized decision-tree induction program. Compared to the uniprocessor version, the parallel version exerts significantly less pressure on the memory hierarchy, with the exception of having a much larger first level data working set.
Read full abstract