CRYSTALS-Dilithium is one of the digital signature algorithms in the final round of NIST’s ongoing post-quantum cryptography (PQC) standardization. Security and the computational efficiency of software and hardware implementations are the primary criteria for PQC standardization. Many studies have sought to apply Dilithium efficiently in various environments; however, they focus on conventional PCs and 32-bit Advanced RISC Machine (ARM) processors (Cortex-M4). ARMv8-based processors are more advanced embedded microcontrollers (MCUs) and are widely used in IoT devices, edge computing devices, and on-board units in autonomous vehicles. In this study, we present an efficient CRYSTALS-Dilithium implementation on an ARMv8-based MCU. To enhance Dilithium’s performance, we optimize number theoretic transform (NTT)-based polynomial multiplication, the core operation of Dilithium, by leveraging ARMv8’s architectural properties, such as its large register set and NEON engine. We apply task parallelism to NTT-based polynomial multiplication using the NEON engine. In addition, we reduce the number of memory accesses during NTT-based polynomial multiplication with the proposed merging and register-holding techniques. Finally, we present an interleaved NTT-based multiplication that executes simultaneously on the ARM processor and the NEON engine. This implementation further improves performance by hiding the ARM processor’s latency behind the NEON overhead. With the proposed optimization methods, for Dilithium 3, we achieve performance improvements of about 43.83% in key pair generation, 113.25% in signing, and 41.92% in verification compared with the reference implementation submitted to the final round of the NIST PQC competition.
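
For readers unfamiliar with the core operation named above, the following is a minimal, unoptimized sketch of NTT-based polynomial multiplication in the ring Z_q[x]/(x^256 + 1). It assumes the standard Dilithium parameters (q = 8380417, n = 256, and 1753 as a primitive 512th root of unity). The function names (`cyclic_ntt`, `ntt_mul`, `schoolbook_mul`) are hypothetical and chosen for illustration. It is plain Python with no NEON parallelism, merging, register holding, or interleaving, so it illustrates only the arithmetic that those techniques accelerate, not the implementation proposed in this work.

```python
# Minimal sketch of NTT-based negacyclic multiplication (illustrative only).
Q = 8380417   # Dilithium prime modulus
N = 256       # number of polynomial coefficients
ZETA = 1753   # primitive 512th root of unity modulo Q


def _bit_reverse_permute(a):
    """Reorder the list in place into bit-reversed index order."""
    n = len(a)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]


def cyclic_ntt(a, omega):
    """Iterative Cooley-Tukey NTT; omega is a primitive len(a)-th root of unity."""
    a = list(a)
    n = len(a)
    _bit_reverse_permute(a)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, Q)
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % Q
                a[start + k] = (u + v) % Q          # butterfly: upper output
                a[start + k + half] = (u - v) % Q   # butterfly: lower output
                w = w * w_len % Q
        length *= 2
    return a


def ntt_mul(a, b):
    """Multiply a and b modulo (x^N + 1, Q) via a twisted cyclic NTT."""
    omega = ZETA * ZETA % Q                          # primitive N-th root of unity
    twist = [pow(ZETA, i, Q) for i in range(N)]
    a_hat = cyclic_ntt([x * t % Q for x, t in zip(a, twist)], omega)
    b_hat = cyclic_ntt([x * t % Q for x, t in zip(b, twist)], omega)
    c_hat = [x * y % Q for x, y in zip(a_hat, b_hat)]  # pointwise product
    c = cyclic_ntt(c_hat, pow(omega, Q - 2, Q))        # inverse transform, unscaled
    n_inv = pow(N, Q - 2, Q)
    zeta_inv = pow(ZETA, Q - 2, Q)
    return [x * n_inv % Q * pow(zeta_inv, i, Q) % Q for i, x in enumerate(c)]


def schoolbook_mul(a, b):
    """O(N^2) reference multiplication, using x^N = -1 for the wrap-around."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c
```

The sketch pre-multiplies the coefficients by powers of the 512th root so that an ordinary cyclic transform yields the negacyclic product. Optimized implementations typically fold this twist into the butterfly constants and use Montgomery reduction instead of the `%` operator. Comparing `ntt_mul(a, b)` with `schoolbook_mul(a, b)` on the same inputs is a simple way to check the transform.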