SPHINCS was recently proposed as a stateless, quantum-resilient hash-based signature scheme. However, one possible limitation of SPHINCS is its signing speed, namely, the best known implementation merely produces a few hundred of signatures per second, which is not good enough, e.g., for a social website with a huge amount of users. Aiming at improving the singing throughput, we present highly parallel and optimized implementations of SPHINCS, which can be deployed on various multi-core platforms. As a first step, we give an elementary implementation on x86/64 processors, which proves the effectiveness and correctness of our implementations. To obtain a significantly higher throughput, we implement SPHINCS on Graphics Processing Units (GPUs). Furthermore, we develop a few general and hardware-specific techniques to take full advantage of the computing power of targeted platforms. We instantiate the underlying hash functions with three primitives. Our comprehensive benchmark shows that our work outperforms all the state-of-the-art implementations of SPHINCS regarding throughput with reasonable latency, and has scalability on multiple cores and multiple GPU cards. For instance, for the key generation algorithm instantiated with ChaCha running on a GeForce GTX 1080, we obtain 5152 signatures per second which is $7.88\times$ 7 . 88 × speedup faster than a recent FPGA implementation. When upgrade to TITAN Xp, 6,651 signatures are generated in one second. With four TITAN Xp GPUs, the obtained throughput satisfies vast majority scenarios.