In SPICE-like circuit simulators (SPICE: Simulation Program with Integrated Circuit Emphasis), one of the most crucial steps is solving the many sparse linear systems generated by frequency-domain or time-domain analysis. Sparse direct solvers based on lower-upper (LU) factorization are extremely time-consuming, so their performance has become a significant bottleneck. Although several parallel sparse direct solvers exist for circuit simulation problems, adapting them to deliver performance and scalability on rapidly evolving parallel computers with multi-NUMA-node hardware based on the ARM architecture remains challenging. In this paper, we introduce a parallel sparse direct solver named HLU, which re-examines the performance of the parallel algorithm from two viewpoints: parallelism in pipeline mode and the computing efficiency of each task. To maximize task-level parallelism and minimize thread waiting time, HLU devises a fine-grained scheduling method based on the elimination tree in pipeline mode, which uses a depth-first-search-like (DFS-like) traversal to iteratively search for parent tasks and then places dependent tasks in the same task queue. HLU also proposes two NUMA node affinity strategies: thread affinity optimization based on the NUMA node topology to guarantee computational load balancing, and data affinity optimization to enable effective memory placement when threads access data. The soundness and effectiveness of HLU are validated on matrices from the SuiteSparse Matrix Collection. Experimental results and analysis show that, on a Huawei Kunpeng 920 server, HLU attains speedups of up to 9.14× and 1.26× (geometric mean) over KLU and NICSLU, respectively.
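To make the scheduling idea concrete, the following is a minimal Python sketch, not HLU's actual implementation. It assumes the elimination tree is given as a hypothetical `parent` array (`parent[v]` is the parent of task `v`, `-1` for a root) and that each task depends only on its children in that tree. Starting from each leaf, it walks upward through parent tasks and keeps a parent in the same chain once all of its children have been scheduled, so that dependent tasks land in the same queue. The function name `build_task_queues` and the "shortest queue first" balancing rule are illustrative assumptions.

```python
def build_task_queues(parent, num_queues):
    """Group elimination-tree tasks into queues along child-to-parent chains.

    parent[v] is the parent of task v in the elimination tree (-1 for a root).
    Returns a list of queues; within each queue, every task appears after
    any of its children that share the queue.
    """
    n = len(parent)

    # pending[p] counts the children of p that have not been scheduled yet.
    pending = [0] * n
    for p in parent:
        if p != -1:
            pending[p] += 1

    # Leaves have no children, so they are ready immediately.
    leaves = [v for v in range(n) if pending[v] == 0]

    queues = [[] for _ in range(num_queues)]
    for leaf in leaves:
        chain = [leaf]
        p = parent[leaf]
        # Walk upward: a parent joins the chain only when this walk
        # schedules its last outstanding child.
        while p != -1:
            pending[p] -= 1
            if pending[p] > 0:
                break  # another subtree will pick this parent up later
            chain.append(p)
            p = parent[p]
        # Assumed balancing rule: append the chain to the shortest queue.
        min(queues, key=len).extend(chain)
    return queues


if __name__ == "__main__":
    # Tree: tasks 0 and 1 feed 2; tasks 3 and 4 feed 5; tasks 2 and 5 feed root 6.
    print(build_task_queues([2, 2, 6, 5, 5, 6, -1], num_queues=2))
    # prints [[0, 3, 4, 5, 6], [1, 2]]
```

In this sketch, a worker thread would process its queue in order and wait only when a task's child sits in another queue (here, task 6 waits for task 2). Because tasks are emitted in an order where children always precede parents, such waits cannot deadlock. A real solver would weight chains by estimated work rather than by task count.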