Modern graphics processing units (GPUs) provide an unprecedented level of computing power. In this study, we present a high-performance, multi-GPU implementation of the analytical nuclear gradient for Kohn-Sham time-dependent density functional theory (TDDFT), employing the Tamm-Dancoff approximation (TDA) and Gaussian-type atomic orbitals as basis functions. We discuss GPU-efficient algorithms for the derivatives of electron repulsion integrals and exchange-correlation functionals within the range-separated scheme. As an illustrative example, we calculate the TDA-TDDFT gradient of the S1 state of a full-scale green fluorescent protein with explicit water solvent molecules, totaling 4353 atoms, at the ωB97X/def2-SVP level of theory. Our algorithm demonstrates favorable parallel efficiencies on a high-speed distributed system equipped with 256 Nvidia A100 GPUs, achieving >70% with up to 64 GPUs and 31% with 256 GPUs, effectively leveraging the capabilities of modern high-performance computing systems.