We present ComCTQMC, a GPU-accelerated quantum impurity solver. It uses the continuous-time quantum Monte Carlo (CTQMC) algorithm wherein the partition function is expanded in terms of the hybridisation function (CT-HYB). ComCTQMC supports both partition and worm-space measurements, and it uses improved estimators and the reduced density matrix to improve observable measurements whenever possible. ComCTQMC efficiently measures all one- and two-particle Green's functions, all static observables which commute with the local Hamiltonian, and the occupation of each impurity orbital. ComCTQMC can solve complex-valued impurities with crystal fields that are hybridized to both fermionic and bosonic baths. Most importantly, ComCTQMC utilizes graphics processing units (GPUs), if available, to dramatically accelerate the CTQMC algorithm when the Hilbert space is sufficiently large. We demonstrate acceleration by a factor of over 600 (100) in a simulation of δ-Pu at 600 K with (without) crystal fields. In easier problems, the GPU offers less impressive acceleration or even decelerates the CTQMC. Here we describe the theory, algorithms, and structure used by ComCTQMC to achieve this set of features and level of acceleration.

Program summary
Program Title: ComCTQMC
CPC Library link to program files: https://doi.org/10.17632/x2gzgm8njh.1
Licensing provisions: GPLv3
Programming language: C++/CUDA
Nature of problem: In dynamical mean-field theory (DMFT), the computational bottleneck is the repeated solution of a quantum impurity problem [1]. The continuous-time quantum Monte Carlo (CTQMC) algorithm has emerged as one of the most efficient methods for solving multiorbital impurity problems at moderate-to-high temperatures [2]. However, the low-temperature regime remains inaccessible, particularly for f-shell systems, and the measurement of two-particle correlation functions on an impurity adds a substantial computational burden. The bottleneck of the CTQMC solver is itself the computation of the local trace, which includes the multiplication of many moderate-to-large matrices. The efficient solution of the impurity, measurement of the two-particle correlation functions, and acceleration of the trace computation are therefore critical.
Solution method: ComCTQMC uses the hybridisation expansion of the impurity action to explore partition space [3]. It uses the worm algorithm [4] to explore the union of the partition space with observable spaces, e.g., the two-particle correlation functions. It uses improved estimators to more accurately measure the one- and two-particle Green's functions [5]. Identical impurities are solved across all MPI ranks (for ideal weak scaling) and the trace computations of these impurities are distributed to and accelerated by GPUs (when available). The lazy-trace algorithm [6] is used to further reduce the burden of the local trace calculation.
Additional comments including restrictions and unusual features: ComCTQMC solves nearly arbitrary impurities, including those with complex-valued and time-dependent interactions. However, there are two restrictions: (1) The retarded part of the interaction is described by a set of bilinears (a paired creation and annihilation operator), and these bilinears must commute with the local Hamiltonian and have real quantum numbers; (2) If a local Green's function vanishes, then the corresponding hybridisation function also vanishes.
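
The local-trace bottleneck named under "Nature of problem", and the general idea of bounding a trace before computing it, can be illustrated with a minimal C++ sketch. This is not ComCTQMC's code: the `Matrix` type and the functions `trace_of_product`, `trace_upper_bound`, and `can_reject_early` are hypothetical names introduced here, the matrices are small, real, and dense, and the Frobenius norm is chosen only because it is easy to compute. The bound relies on two standard facts, |Tr(A)| <= sqrt(n) ||A||_F and ||AB||_F <= ||A||_F ||B||_F. How the lazy-trace algorithm [6] constructs and applies its bounds is not specified in this summary, so the early-rejection test below is an assumption for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense n x n real matrix in row-major order.
struct Matrix {
    std::size_t n;
    std::vector<double> a;
    explicit Matrix(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

// Plain O(n^3) matrix product; this is the costly step in the trace.
Matrix multiply(const Matrix& x, const Matrix& y) {
    assert(x.n == y.n);
    Matrix z(x.n);
    for (std::size_t i = 0; i < x.n; ++i)
        for (std::size_t k = 0; k < x.n; ++k)
            for (std::size_t j = 0; j < x.n; ++j)
                z(i, j) += x(i, k) * y(k, j);
    return z;
}

double frobenius_norm(const Matrix& m) {
    double s = 0.0;
    for (double v : m.a) s += v * v;
    return std::sqrt(s);
}

// Tr(M_1 M_2 ... M_k), computed by explicit multiplication.
double trace_of_product(const std::vector<Matrix>& ops, std::size_t n) {
    Matrix p(n);
    for (std::size_t i = 0; i < n; ++i) p(i, i) = 1.0;  // start from identity
    for (const Matrix& m : ops) p = multiply(p, m);
    double t = 0.0;
    for (std::size_t i = 0; i < n; ++i) t += p(i, i);
    return t;
}

// Cheap upper bound on |Tr(M_1 ... M_k)|: sqrt(n) * prod_i ||M_i||_F.
// For an empty product the trace is Tr(I) = n.
double trace_upper_bound(const std::vector<Matrix>& ops, std::size_t n) {
    if (ops.empty()) return static_cast<double>(n);
    double b = std::sqrt(static_cast<double>(n));
    for (const Matrix& m : ops) b *= frobenius_norm(m);
    return b;
}

// Assumed usage: if even the bound is below the threshold, the full
// product never has to be evaluated for this proposal.
bool can_reject_early(const std::vector<Matrix>& ops, std::size_t n,
                      double threshold) {
    return trace_upper_bound(ops, n) < threshold;
}
```

In this sketch the bound costs O(k n^2) operations for k matrices, while the full product costs O(k n^3), which is why a cheap bound can pay off when the matrices are large.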