As an advanced technique in remote sensing, hyperspectral target detection (HTD) has attracted wide attention in civilian and military applications. However, limited prior information and heterogeneous backgrounds make HTD models sensitive to data corruption caused by environmental interference. In this article, a novel unified HTD framework based on the transformer, termed HTD based on transformer via spectral-spatial similarity (HTD-TS3), is proposed under weak supervision, which opens up more flexible ways to study HTD. For the first time, the transformer mechanism is introduced into the HTD task to extract spectral and spatial features in a unified optimization procedure. By modeling long-range dependence among spectra, the framework performs joint spectral-spatial inference based on long-range context, which addresses the insufficient use of spatial information. To provide samples for weakly supervised learning (WSL), efficient coarse sample selection and spectral sequence construction procedures are proposed, which make full use of the limited prior information. Finally, an exponentially constrained nonlinear function is adopted to obtain pixel-level predictions by combining discriminative spectral-spatial features with coarse spatial information. Experiments on real hyperspectral images (HSIs) captured by different sensors in various scenes verify the effectiveness and efficiency of HTD-TS3.
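As an illustration only, the idea of modeling long-range dependence among spectra can be sketched with plain scaled dot-product self-attention over a sequence of pixel spectra. The sequence layout (nine spectra, as from a 3 x 3 window), the random projection matrices, the dimensions, and the function names are all assumptions made for this sketch; it is not the HTD-TS3 architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spectral_self_attention(seq, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of spectra.

    seq: (n_tokens, n_bands) array, one spectrum per token.
    w_q, w_k, w_v: (n_bands, d_model) projection matrices (hypothetical).
    Returns the attended features and the attention weights.
    """
    q = seq @ w_q
    k = seq @ w_k
    v = seq @ w_v
    # Every spectrum attends to every other spectrum in the sequence,
    # so context is not limited to immediate neighbors.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Synthetic data: 9 spectra (e.g. a 3 x 3 window) with 32 bands each.
rng = np.random.default_rng(0)
n_tokens, n_bands, d_model = 9, 32, 16
seq = rng.random((n_tokens, n_bands))
w_q, w_k, w_v = (
    rng.standard_normal((n_bands, d_model)) * 0.1 for _ in range(3)
)

out, weights = spectral_self_attention(seq, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one and indicates how strongly a given spectrum draws on every other spectrum in the sequence, which is the sense in which attention captures long-range context.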