Load-balancing switches have recently attracted a great deal of interest because of their simple architecture and high forwarding bandwidth. However, the original load-balancing switch can deliver packets out of sequence, which degrades the performance of TCP applications. Several load-balancing switch designs have been proposed to address this mis-sequencing problem, but they solve it at the cost of either algorithmic complexity or special hardware requirements. In this paper, we address the mis-sequencing problem by introducing a three-stage load-balancing switch architecture enhanced with an output load-balancing mechanism. This three-stage switch achieves a high forwarding capacity while preserving packet order, without the need for costly online scheduling algorithms. Theoretical analysis and simulation results show that its transmission delay is upper-bounded by that of an output-queued switch plus a constant that depends only on the number of input/output ports, which indicates that it has the same forwarding capacity as an output-queued switch.